Tuesday, May 24, 2011

Normalizing Abnormal Data

We've all been there, walk into a project that involve some sort of data aggregation, reporting, analytics or some such thing and thought "easy, no problem" only to later realize the data you are to query against is basically an Excel Spreadsheet tossed into a single database table, and over the years has evolved into a table with hundreds of thousands, in not millions of rows that have just been put there for archiving, with never a thought of actually using it.

Yep, always a grand job.

If such data is only a few hundred, or few thousand rows, it's no big deal to have a couple sloppy queries written quickly for the reporting, but, if there is an archive of millions of rows, you are in for a much more challenging experience.
The first step is to check out the data, take note of what is repetitive within it. Most likely, this will be the same data that will eventually make up a "parent" table in a normalized schema. In the instance of something like point of sale data, for example, repeating data might be the store number, department code and maybe a date. Child data would be perhaps a name, item sold and amount.

A simple query to create the parent table from the ugly source table would be, for example:

INSERT INTO parentData (store_number, department_code, sale_date) SELECT DISTINCT store_number, department_code, sale_date FROM POS_History

That would create the parent table, with a single row for each store and department for each date in one large dump.

To create a child table would be something similar in the simplest way:

INSERT INTO childData (name, item, amount, store_number, department_code, sale_date) SELECT name, item, amount, store_number, department_code, sale_date FROM POS_History

Now, this creates a child table that can be related back to the parent via store number, department code and sale date.

However, this could make for a slow query time after time selecting on 3 columns of possibly numbers, strings and dates...so we could do something tricky.

The parent table could have one extra field that would contain a concatenation of the three fields, and do the same thing in the child table, plus add a parentDate_id field to act as a foreign key to a primary key in the parentData table named "id", while removing the three individual fields.

Then, after the two large dumps are done to create those two tables, do an update to match up data based on that concatenation field like so:

UPDATE childData SET parentData_id = parentData.id FROM childData INNER JOIN parentData ON childData.concatField = parentData.concatField WHERE childData_id IS NULL;

By going this route you can do two bulk inserts and an update to determine relationship later, rather than one bulk update then loops through that data set (generally involving a cursor or other horribly expensive sin) and get the relationship thru the loop. In the interest of saving database disk space, you can also null out those temporary key fields after processing, and then this can also work as a flag to say "yep, already been processed" for future processes.

I have, in a couple instances, cut the process of normalizing from 14 hours to 7 minutes...literally.

0 comments: