Allan’s Musings

Just another weblog

Archive for May, 2009

On Extracting Transforming and Loading

Posted by Allan Peda on May 19, 2009


Tricky stuff, this ETL. One might think it is as simple as copying data from one spot to another, but as usual things can get complicated very quickly.

Some questions that may come up:

  1. Who is upstream? Who is downstream? In the parlance of project management, we might want to confer closely with both of these stakeholders.
  2. How often do the data feeds come in? Are these feeds being pushed, pulled, or some combination thereof?
  3. Are these deltas to existing data, or are these simple additions to a data set?
  4. How large are the feeds?
  5. What if we miss a feed, as might happen during an extended network outage?
  6. Is one “feed” really multiple files which must be processed at one time? If so, is there a particular order they must be processed in?
  7. How often do the downstream users expect the data to materialize? How often is it generated?
  8. What type of processing is necessary to “digest” the data so that it is usable?
  9. Are there issues with auditing and data integrity? Do we need to always be able to trace data back to the source?

With these questions as a starting point we can identify possible issues and hopefully we can head off possible problems.
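As a sketch of how questions 3, 5, and 6 might be handled in code, here is a small Ruby routine that orders the parts of a multi-file feed and fails loudly when a part is missing. The `name_NNN.txt` naming convention is hypothetical, invented purely for illustration:

```ruby
# Sketch: order and gap-check a multi-part feed, assuming a
# hypothetical "name_NNN.txt" naming convention.
def ordered_feed_parts(filenames)
  parts = filenames.map do |f|
    seq = f[/_(\d+)\.txt\z/, 1]                 # pull the sequence number
    raise "unrecognized feed file: #{f}" unless seq
    [seq.to_i, f]
  end.sort
  seqs = parts.map(&:first)
  missing = (seqs.first..seqs.last).to_a - seqs # detect gaps in the run
  raise "missing feed parts: #{missing.join(', ')}" unless missing.empty?
  parts.map(&:last)                             # filenames in processing order
end
```

Failing early here is deliberate: it is usually better to refuse the whole feed than to load a partial one and discover the gap downstream.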

What are the tools we have at our disposal on a typical *nix system in a typical big firm?

  1. FTP: try as you might to avoid it, you will encounter old, insecure FTP.
  2. Oracle SQL*Loader
  3. Perl and its newfangled relative Ruby, in conjunction with some variant of DBI.
  4. Unix cron and the fancier asynchronous utilities such as fsniper
  5. Oracle’s internal scheduler, which has replaced the older DBMS_JOB scheduling mechanism.

The Nitty Gritty: Files

On the face of it, a flat tab-delimited file would seem to have little information in it beyond its contents and the filename, but if you reflect on it, every time that file is transferred across business unit boundaries, there is some (potential) data that someone may want to know. The following questions may need to be answered:

  1. When was the file first generated?
  2. Can we be sure it was not tampered with?
  3. How large is it?
  4. What time period does this report represent? Is the generation time to be assumed the same as the reporting time, or is there a delay? For example are reports for the previous year generated at the start of the next year?
  5. Again, what are the auditing requirements?

Often a file may be updated as it traverses business unit boundaries, and naturally the size of the file and the checksum will change. It may not be sufficient to simply save the most recent data. Similarly, data derived from the filesystem (such as mtime and ctime) could be inaccurate.
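To make the hand-off bookkeeping concrete, here is a Ruby sketch that fingerprints a feed file at each boundary crossing, so a checksum and size can be recorded per hand-off rather than only for the latest copy. The field names are illustrative, not any standard:

```ruby
# Sketch: capture provenance data for a feed file at a hand-off point.
require 'digest'

def feed_fingerprint(path)
  stat = File.stat(path)
  {
    name:   File.basename(path),
    bytes:  stat.size,
    mtime:  stat.mtime,   # filesystem timestamp; may be unreliable across copies
    sha256: Digest::SHA256.file(path).hexdigest  # detects tampering/alteration
  }
end
```

Storing one such record per business unit boundary answers the "when was it generated, how large is it, was it tampered with" questions without trusting the filesystem alone.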

Posted in PL/SQL, Programming, Ruby, Unix Utilities | Leave a Comment »

Oracle Habits

Posted by Allan Peda on May 17, 2009

I am a big believer in following best practices, at least most of the time. In the case of Oracle databases that means creating surrogate keys. Even if they are of little use initially, I have found these unique identifiers are often useful later on. This is easy to do via auto-numbering triggers.

Let's say we have a simple two-column table, described below:

SQL> create table mytab (id integer primary key, descr varchar2(20));

The creation of an associated sequence and trigger will allow anyone to insert data into this table without bothering with the details of generating unique ID numbers.

SQL> create sequence mytab_seq;
Sequence created.
SQL> create or replace
  2  trigger mytab_trg
  3  before insert or update on mytab
  4  for each row
  5  begin
  6    select nvl(:new.id, mytab_seq.nextval) into :new.id from dual;
  7  end;
  8  /
Trigger created.
SQL> insert into mytab (id, descr) values (1,'one');
1 row created.
SQL> insert into mytab (id, descr) values (null,'two');
1 row created.
SQL> commit;
Commit complete.
SQL> select * from mytab;

        ID DESCR
---------- --------------------
         1 one
         2 two

Notice that the nvl() function always increments the sequence, since mytab_seq.nextval appears to be evaluated on every insert. This is important, as it allows users to manually add ID values without inadvertently triggering subsequent collisions. The only caveat I would add is that I am not sure whether this is specified behavior; I have only observed it empirically.
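The collision argument can be sketched with a toy counter in plain Ruby (an illustration of the logic only, not Oracle's actual sequence machinery): a sequence bumped on every insert stays in step with manually supplied IDs, while a sequence bumped only when no ID is supplied can fall behind and collide, as in the `insert (1, 'one')` then `insert (null, 'two')` example above.

```ruby
# Toy sequence: nextval always increments, like an Oracle sequence.
class Seq
  def initialize; @n = 0; end
  def nextval;    @n += 1; end
end

# nvl(supplied, nextval): the sequence is bumped on *every* insert.
def insert_always_bump(seq, supplied)
  v = seq.nextval
  supplied.nil? ? v : supplied
end

# Short-circuit version: the sequence is bumped only when no id is given.
def insert_lazy_bump(seq, supplied)
  supplied.nil? ? seq.nextval : supplied
end
```

With always-bump semantics, manually inserting id 1 consumes sequence value 1, so the next automatic insert gets 2; with lazy-bump semantics the sequence stays at 0 and the next automatic insert collides at 1.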

Posted in Programming, SQL | Tagged: , | Leave a Comment »

Oracle and rownum

Posted by Allan Peda on May 15, 2009

The other day I was looking at pagination using the rownum pseudocolumn. Tom Kyte's Oracle column here has some useful hints and an excellent explanation of its dynamic nature. When I tried to recreate his query, it occurred to me that the following DQL would be more concise. I'm not sure whether it is just as effective performance-wise, but it is easier to read:

SQL> select T3.* from (select T2.*, rownum as outer_rownum
  2  from (select T1.report_date from mytab T1 order by T1.report_date) T2) T3
  3  where T3.outer_rownum between 3 and 5;

Which reminds me: much as I have a distaste for most things Microsoft®, it seems that Common Table Expressions (CTEs) are an overlooked way to make SQL more readable. It's almost a secret, but Oracle 10g does support CTE syntax. Consider the following way of identifying violations of a business rule disallowing the same business unit being assigned to multiple companies (such as a Toyota Jeep being submitted in the same feed as a Chrysler Jeep):

with distinct_tab1_v as (select distinct
                              compid as COMP_ID,
                              unitid as UNIT_ID
                         from tab1)
select 'unit id: '|| unit_id || '  comp id: '|| comp_id
from distinct_tab1_v
where unit_id in ( select unit_id
                     from distinct_tab1_v
                 group by unit_id
                   having count(*) > 1 ) order by comp_id, unit_id;

The meat and potatoes of this query is the isolation of a unique collection of units and businesses, which is not needed elsewhere. Rather than create a view that sits around forever, a CTE accomplishes the same effect without globally modifying the schema.

It might be easier to look at this as three queries chained together. The first orders the select, the second assigns the row numbering, then the third imposes the limits on the desired rows. I rewrote the query using a table expression so as to make it easier to alias each derived select.
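The three chained stages can be mimicked in plain Ruby on an array, which makes the order-number-limit pipeline easy to see (toy code to illustrate the shape of the SQL, not the SQL itself):

```ruby
# Three-stage pagination: order the rows, number them, keep a window.
def paginate(rows, first, last, &order_key)
  rows.sort_by(&order_key)                        # stage 1: order the select
      .each_with_index.map { |r, i| [r, i + 1] }  # stage 2: assign row numbers
      .select { |_, n| n.between?(first, last) }  # stage 3: limit to the window
      .map(&:first)
end
```

For example, `paginate(%w[blue green indigo azure], 2, 3) { |c| c }` sorts to azure, blue, green, indigo and keeps rows 2 through 3, returning `["blue", "green"]`.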

The raw data we are querying is as follows:

SQL> select * from mytab order by commentary;

        ID DESCR                COMMENTARY
---------- -------------------- ----------
        10 ten                  azure
         5 five                 blue
         4 four                 green
         6 six                  indigo
         9 nine                 ochre
         2 two                  orange
         1 one                  red
         8 eight                salmon
         7 seven                violet
         3 three                yellow

Selecting a subset from an ordered select is as simple (or as complex) as:

with ordered_sel as (
   select *
   from mytab
   order by commentary
)
select numbered_sel.* from (
   select ordered_sel.*,
          rownum as rnum
   from ordered_sel
) numbered_sel where numbered_sel.rnum between 2 and 5;

Which results in:

        ID DESCR                COMMENTARY       RNUM
---------- -------------------- ---------- ----------
         5 five                 blue                2
         4 four                 green               3
         6 six                  indigo              4
         9 nine                 ochre               5

Yeah, SQL is weird, but after a while the logic becomes clearer.

Posted in PL/SQL, Programming | Tagged: | Leave a Comment »

The Ulam Spiral

Posted by Allan Peda on May 11, 2009

A few years ago I wrote a Perl program to generate a map of prime numbers based on what is known as an Ulam Spiral. Sure enough, the patterns are readily recognizable. The origin is barely discernible in the center of the image below as a lone red pixel, though you'll need to zoom in to see it.

Prime numbers illustrated as Ulam spiral


A color PDF version of the source is available here, and a downloadable version should be here.

#!/usr/bin/env perl
# -*- cperl -*-

BEGIN { $^W = 1 }

use strict;
use diagnostics;
use warnings;
use Carp;
use English;
use Math::Trig;
use GD;
use constant PI => 4*atan(1);

# Allan Peda
# June 30, 2007
# Ulam prime number spiral
# we iterate over a grid, but we want to follow a spiral.
# Direction is cartesian based x being horizontal motion,
# and y being vertical. Up and right are postive motion.

die "Please provide a png file to save to." unless @ARGV;

my $pngfile = 'ulam_spiral.png';
if($ARGV[0] =~ m/\.png$/i){
    $pngfile = $ARGV[0];
} else {
    print "Supplied file not suffixed with .png, defaulting to \"$pngfile\".\n";
}
# how many turns to make in the spiral
my $num_corners = 1600;
my @turning_points = mkturns($num_corners);
# which directions do we follow
my $max = $turning_points[$#turning_points];
print "Generating map for first $max primes.\n";
my ($x_width, $y_width) = (0,0);
$x_width = $y_width = int(sqrt($max+0.5));

print "Dimensions are: $x_width x $y_width\n";
my %primes = %{genprimes($max)};

my $im           = new GD::Image($x_width, $y_width);
my $white        = $im->colorAllocate(255,255,255);
my $black        = $im->colorAllocate(0,0,0);
my $origin_color = $im->colorAllocate(255,0,0);
my $position=$turning_points[0];
my ($x, $y) = (0,0);
my ($delx,$dely) = (1,0); # start out moving to the right

sub gdx { return( shift()+($x_width/2) ) }
sub gdy { return( ($y_width/2)-shift()) }


# mark the origin in red so it can be located later
$im->setPixel(gdx(0),gdy(0),$origin_color);
# walk the spiral, marking primes as we go
foreach my $next_turn (@turning_points) {
    for (my $i=$position; $i<=$next_turn; $i++) {
        defined($primes{$i}) and $im->setPixel(gdx($x),gdy($y),$black);
        $x += $delx;
        $y += $dely;
    }
    # change direction at each corner of the spiral
    ($delx,$dely) = nextDirection($delx,$dely);
    $position = $next_turn+1;
}

open(PNGFH, ">".$pngfile)
  or die "Could not open image file \"$pngfile\".";
print PNGFH $im->png;
close PNGFH;

# an array of run lengths up to each turn
# equation was guessed at after inspection of coordinates
# from a hand drawn cartesian spiral
sub mkturns{
    my $ulim = shift;
    my $len = 0;
    my @rv;
    foreach my $i (1..$ulim) {
        my $delta = ($i%2?$i+1:$i)/2;
        $len = $len+$delta;
        push( @rv, $len);
    }
    return @rv;
}

# generates prime numbers as hash keys
sub genprimes {
    my $max = shift;
    my $sieve = '';
    my %primes = ( 2 => 1 );
  GUESS: for (my $guess = 2; $guess <= $max; $guess++) {
        next GUESS if vec($sieve,$guess,1);
        $primes{$guess} = 1;
        for (my $mults = $guess * $guess; $mults <= $max; $mults += $guess) {
            vec($sieve,$mults,1) = 1;
        }
    }
    return \%primes;
}

# uses memoized trig function
# function modded to go past 180 degrees
{
    my %memo = ();
    sub nextDirection {
        my ($x, $y) = (shift, shift);
        my $nextoffset = undef;
        if( defined($memo{$x}{$y})){
            $nextoffset = $memo{$x}{$y};
        } else {
            my $acos = acos($x);
            my $asin = asin($y);
            # extend the angle past 180 degrees when the sine is negative
            my $angle = ($asin < 0) ? (2*PI - $acos) : $acos;
            my $offset = int(2*$angle/PI + 0.5); # nearest quarter turn
            # hard coded count of four directions in spiral (right, up, left, down)
            $nextoffset = $memo{$x}{$y} = ($offset+1)%4;
        }
        return to_xy($nextoffset);
    }
}

# uses memoized trig function
{
    my %memo = ();
    sub to_xy {
        defined($memo{$_[0]}) and return @{$memo{$_[0]}};
        $memo{$_[0]} = [int(cos($_[0]*PI/2)),int(sin($_[0]*PI/2))];
        return @{$memo{$_[0]}};
    }
}

Posted in Mathematical Recreations | Tagged: , , | Leave a Comment »

The Power of Ruby

Posted by Allan Peda on May 7, 2009

The other day I came across Oracle error ORA-01555 during the backup of a database. Thankfully Oracle had some good advice on addressing the issue here, and I tossed together a Ruby script to make adjusting the system a bit more automated. Admittedly this could be done in one block of PL/SQL, but it was easier to generate the DDL on the fly and print it out prior to running the script. Ruby rocks in this way, being able to do everything Perl does, but (honestly I must admit) better, if only because it handles exceptions more neatly. To top all this off, I am not even talking about full application deployment with Rails, although Oracle covers that topic at great length.

# database stuff
require 'dbi'
require 'oci8'

# require 'rubygems'
# require 'optparse'

@verbose = false
def echo( arg )
  puts arg if @verbose
end

begin
   # db handle
   oph = DBI.connect('DBI:OCI8:DB',  'system',  'pass')
   # first fetch the snapshot id range from the monitoring DB
   # only need first row
   tuple_snapshot_undo = oph.select_one(
      %Q[ /* tune_undotbs.rb */
          SELECT
            MIN(snap_id) as min_snap_id,
            MAX(snap_id) as max_snap_id
          FROM dba_hist_snapshot ])
   min_snap_id = tuple_snapshot_undo[0].to_i
   max_snap_id = tuple_snapshot_undo[1].to_i
   # puts "#{min_snap_id}, #{max_snap_id}"
   tune_proc = %Q[
      DECLARE
        l_task_name  VARCHAR2(30) := '#{min_snap_id}_#{max_snap_id}_AWR_SNAPSHOT_UNDO';
        l_object_id  NUMBER;
      BEGIN
        -- Create an ADDM task.
        DBMS_ADVISOR.create_task (
          advisor_name      => 'Undo Advisor',
          task_name         => l_task_name,
          task_desc         => 'Undo Advisor Task');

        DBMS_ADVISOR.create_object (
          task_name   => l_task_name,
          object_type => 'UNDO_TBS',
          attr1       => NULL,
          attr2       => NULL,
          attr3       => NULL,
          attr4       => 'null',
          attr5       => NULL,
          object_id   => l_object_id);

        -- Set the target object.
        DBMS_ADVISOR.set_task_parameter (
          task_name => l_task_name,
          parameter => 'TARGET_OBJECTS',
          value     => l_object_id);

        -- Set the start and end snapshots.
        DBMS_ADVISOR.set_task_parameter (
          task_name => l_task_name,
          parameter => 'START_SNAPSHOT',
          value     => #{min_snap_id});

        DBMS_ADVISOR.set_task_parameter (
          task_name => l_task_name,
          parameter => 'END_SNAPSHOT',
          value     => #{max_snap_id});

        -- Execute the task.
        DBMS_ADVISOR.execute_task(task_name => l_task_name);
      END; ]
   echo tune_proc
   oph.do(tune_proc)   # run the generated PL/SQL block
rescue DBI::DatabaseError => e
   STDERR.puts "Error code:    #{e.err}"
   STDERR.puts "Error message: #{e.errstr}"
ensure
   oph.disconnect if oph
end

Posted in Programming, Ruby | Tagged: | Leave a Comment »