Allan’s Musings

Just another weblog

On Extracting, Transforming and Loading

Posted by Allan Peda on May 19, 2009


Tricky stuff, this ETL. One might think it is as simple as copying data from one spot to another, but as usual things can get complicated very fast.

Some questions that may come up:

  1. Who is upstream? Who is downstream? In the parlance of project management, we might want to confer closely with both of these stakeholders.
  2. How often are the data feeds coming in? Are these feeds being pushed, pulled, or some combination thereof?
  3. Are these deltas to existing data, or are these simple additions to a data set?
  4. How large are the feeds?
  5. What if we miss a feed, as might happen during an extended network outage?
  6. Is one “feed” really multiple files which must be processed at one time? If so, is there a particular order they must be processed in?
  7. How often do the downstream users expect the data to materialize? How often is it generated?
  8. What type of processing is necessary to “digest” the data so that it is usable?
  9. Are there issues with auditing and data integrity? Do we need to always be able to trace data back to the source?

With these questions as a starting point we can identify possible issues and hopefully we can head off possible problems.
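Question 5 (missed feeds) lends itself to a simple automated check. Below is a minimal sketch in Python; the directory layout and feed file names (e.g. `trades_%Y%m%d.tsv`) are hypothetical stand-ins for illustration, not from any real system.

```python
import datetime
from pathlib import Path

def missing_feeds(incoming_dir, expected_patterns, today=None):
    """Report expected feed files that have not arrived yet.

    expected_patterns is a list of strftime patterns such as
    'trades_%Y%m%d.tsv' (hypothetical names for illustration).
    Returns the sorted list of expected-but-absent file names.
    """
    today = today or datetime.date.today()
    incoming = Path(incoming_dir)
    # Render each pattern for today's date, then diff against what arrived.
    expected = {today.strftime(p) for p in expected_patterns}
    arrived = {f.name for f in incoming.iterdir() if f.is_file()}
    return sorted(expected - arrived)
```

Run from cron once the feeds are due, a non-empty result becomes an alert, which is far cheaper than discovering the gap downstream.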

What are the tools we have at our disposal on a typical *nix system in a typical big firm?

  1. FTP. Try as you might to avoid it, you will encounter old, insecure FTP.
  2. Oracle SQL*Loader
  3. Perl and its newfangled relative Ruby, in conjunction with some variant of DBI.
  4. Unix cron and the fancier asynchronous utilities such as fsniper
  5. Oracle’s internal scheduler (DBMS_SCHEDULER), which has replaced the older DBMS_JOB scheduling mechanism.

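The Perl/Ruby-plus-DBI approach from item 3 boils down to: read a delimited file, then insert the rows through a generic database API. Here is a minimal sketch of that pattern in Python, using the DB-API with sqlite3 as a stand-in database; the table and file names are illustrative assumptions, and a production loader would add typing, validation, and error handling.

```python
import csv
import sqlite3

def load_tsv(db_path, table, tsv_path):
    """Load a tab-delimited file into a table, one row per line.

    Assumes the first line of the file is a header naming the columns.
    All values are loaded as text; real feeds would need type handling.
    """
    with open(tsv_path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    header, data = rows[0], rows[1:]
    cols = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    conn = sqlite3.connect(db_path)
    with conn:  # commit on success, roll back on error
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
        conn.executemany(
            f"INSERT INTO {table} ({cols}) VALUES ({marks})", data
        )
    conn.close()
```

The same shape works with Perl's DBI or Ruby's DBI/Sequel against Oracle; for very large feeds, SQL*Loader's direct-path mode will beat row-by-row inserts.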
The Nitty Gritty: Files

On the face of it, a flat tab-delimited file would seem to hold little information beyond its contents and its filename, but if you reflect on it, every time that file is transferred across business unit boundaries there is (potential) metadata that someone may want to know. The following questions may need to be answered:

  1. When was the file first generated?
  2. Can we be sure it was not tampered with?
  3. How large is it?
  4. What time period does this report represent? Is the generation time to be assumed the same as the reporting time, or is there a delay? For example, are reports for the previous year generated at the start of the next year?
  5. Again, what are the auditing requirements?

Often a file is updated as it traverses business unit boundaries, and naturally its size and checksum will change. It may not be sufficient to save only the most recent data. Similarly, data derived from the filesystem (such as mtime and ctime) can be inaccurate, since a simple copy or re-save resets them.
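One way to handle this is to capture an audit snapshot of each file at every hop, rather than trusting the filesystem after the fact. A minimal sketch, assuming you keep one snapshot per hop in a log or audit table:

```python
import hashlib
import os
import time

def audit_record(path):
    """Capture an audit snapshot of a feed file: size, SHA-256 digest,
    and filesystem mtime. Recording one snapshot per hop lets you see
    exactly where the contents changed, instead of relying on mtime
    alone after the fact."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        # Hash in chunks so large feeds don't have to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    st = os.stat(path)
    return {
        "path": path,
        "size": st.st_size,
        "sha256": h.hexdigest(),
        "mtime": time.strftime(
            "%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime)
        ),
    }
```

Comparing the digests from consecutive hops answers the tampering question directly; comparing the recorded times answers the generation-versus-reporting-time question without guessing from ctime.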

