Big Data Test – SCD – Introduction

I’ve set myself an interesting test.  What are some of the big data solutions for creating a slowly changing dimension (SCD)?

Making sure everyone is up to speed, an SCD is a table in a data warehouse that contains data that slowly changes over time.  In a type 2 SCD table, the goal is to keep the current version of a given record and all of the previous versions of the same record within the same table.  To use an example, the table would hold the current record for each customer alongside all of the previous versions of that customer’s records.

Type 2 SCD keeps track of all of these versions by adding two columns to the table: a start date and an end date.  This makes retrieving the records in effect at any given instant in time a very easy filter.  All of the current records can be retrieved with a filter that looks like:

WHERE StartDate <= today AND today < EndDate

Or all of the records current as of 11/20/2016 would use a filter like:

WHERE StartDate <= '11/20/2016' AND '11/20/2016' < EndDate

Unfortunately, data rarely comes to business intelligence with start and end dates, but what we do get very often is a single EffectiveDate, otherwise known as sales date, as of date, change date, etc.  The EffectiveDate can be transformed into a StartDate and EndDate by setting StartDate to EffectiveDate and EndDate to the next EffectiveDate.  If there is no next EffectiveDate, then the EndDate is set to a maximum date like 12/31/9999.  An example of this transform takes:

CustomerID  EffectiveDate
121         10/13/2013
121         05/23/2014
121         06/07/2016
122         02/05/2015
122         02/06/2015
122         09/14/2016

and turns it into:

CustomerID  StartDate   EndDate
121         10/13/2013  05/23/2014
121         05/23/2014  06/07/2016
121         06/07/2016  12/31/9999
122         02/05/2015  02/06/2015
122         02/06/2015  09/14/2016
122         09/14/2016  12/31/9999
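
The transform above can be sketched in a few lines of Python.  The function and variable names here are my own, and duplicate (CustomerID, EffectiveDate) pairs are simply dropped for this sketch; dates are ISO strings so that string comparison matches date order:

```python
from itertools import groupby

MAX_DATE = "9999-12-31"  # stand-in for 12/31/9999

def to_scd2(rows):
    """rows: (customer_id, effective_date) pairs, dates as ISO strings.
    Returns (customer_id, start_date, end_date) triples."""
    out = []
    rows = sorted(set(rows))  # dedupe, then sort by customer, then date
    for cust, group in groupby(rows, key=lambda r: r[0]):
        dates = [d for _, d in group]
        # each StartDate is closed by the next EffectiveDate,
        # and the last record stays open until the maximum date
        for start, end in zip(dates, dates[1:] + [MAX_DATE]):
            out.append((cust, start, end))
    return out

rows = [
    (121, "2013-10-13"), (121, "2014-05-23"), (121, "2016-06-07"),
    (122, "2015-02-05"), (122, "2015-02-06"), (122, "2016-09-14"),
]
for r in to_scd2(rows):
    print(r)
```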

Traditionally, there are hundreds of variations of this problem and its solutions, but they can really be summarized into three methods.

  1. Set based update solution – Insert all of the data into the destination table setting CustomerID = CustomerID, StartDate = EffectiveDate, and EndDate = '12/31/9999'.  Then update the destination table, setting the EndDate of each record to the StartDate of another record according to the following conditions.
    • The updated record’s CustomerID equals the other record’s CustomerID
    • The updated record’s StartDate is less than the other record’s StartDate
    • If more than one record satisfies the previous two conditions, use the record with the oldest (minimum) StartDate.
  2. Iterative solution – Insert all of the data into the destination table setting CustomerID = CustomerID, StartDate = EffectiveDate, and EndDate = '12/31/9999'.  Then cursor through each record in order of CustomerID ascending and StartDate descending, performing the following steps.
    1. If this is the first record or if the CustomerID does not equal the previous CustomerID, then save the CustomerID and StartDate and skip to the next record.
    2. If the CustomerID is equal to the previous CustomerID, then set the EndDate equal to the previous StartDate.
    3. Save the CustomerID and StartDate.
  3. Set based table recreate solution – Insert all of the data into a staging table setting CustomerID = CustomerID, StartDate = EffectiveDate, and EndDate = '12/31/9999'.  Then, instead of inserting directly into the destination table, a number of simple inserts from the staging table and the destination table are applied to a new destination table.  Once all of the data is correctly in the new destination table, the original destination table is replaced with the new destination table.

The set based update solution, while quite feasible and often cheaper to develop, can use too much database horsepower, causing problems when the same process is applied to many tables at the same time.
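
For concreteness, here is a small sketch of the set based update solution using SQLite via Python’s standard library.  The table and column names are illustrative, dates are ISO strings so that string comparison matches date order, and a real warehouse would use its own SQL dialect:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dest (CustomerID INT, StartDate TEXT, EndDate TEXT)")

# Step 1: insert everything with EndDate defaulted to the maximum date.
rows = [(121, "2013-10-13"), (121, "2014-05-23"), (121, "2016-06-07")]
con.executemany("INSERT INTO dest VALUES (?, ?, '9999-12-31')", rows)

# Step 2: set each record's EndDate to the oldest (minimum) later
# StartDate of the same customer, where one exists; records with no
# later StartDate keep the maximum date.
con.execute("""
    UPDATE dest
    SET EndDate = (
        SELECT MIN(o.StartDate) FROM dest o
        WHERE o.CustomerID = dest.CustomerID
          AND o.StartDate > dest.StartDate)
    WHERE EXISTS (
        SELECT 1 FROM dest o
        WHERE o.CustomerID = dest.CustomerID
          AND o.StartDate > dest.StartDate)
""")

for row in con.execute("SELECT * FROM dest ORDER BY StartDate"):
    print(row)
```

The whole-table correlated UPDATE is exactly where the horsepower goes: every record scans its siblings for the next StartDate.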

The iterative solution uses less database horsepower although it is usually at the cost of development time.
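
The iterative cursor can be sketched in Python as follows; the record layout and names are my own.  The two stable sorts give CustomerID ascending with StartDate descending within each customer, and the loop then follows the three steps listed above:

```python
MAX_DATE = "9999-12-31"

def close_end_dates(records):
    """records: list of dicts with CustomerID, StartDate, EndDate.
    Mutates records in place, closing each older record with the
    next newer record's StartDate."""
    # StartDate descending, then (stable sort) CustomerID ascending
    records.sort(key=lambda r: r["StartDate"], reverse=True)
    records.sort(key=lambda r: r["CustomerID"])
    prev_cust = prev_start = None
    for r in records:
        # Step 1: first record, or a new customer - save and skip.
        if r["CustomerID"] != prev_cust:
            prev_cust, prev_start = r["CustomerID"], r["StartDate"]
            continue
        # Step 2: same customer - close this record with the
        # previously saved (newer) StartDate.
        r["EndDate"] = prev_start
        # Step 3: save for the next, older record.
        prev_cust, prev_start = r["CustomerID"], r["StartDate"]

records = [
    {"CustomerID": 121, "StartDate": d, "EndDate": MAX_DATE}
    for d in ("2013-10-13", "2014-05-23", "2016-06-07")
]
close_end_dates(records)
for r in records:
    print(r)
```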

The set based table recreate solution is often reserved for systems where inserts are very cheap, such as MPP databases.  I am quite sure that this method will be the best in big data systems like Hive and Spark.
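
A sketch of the table recreate pattern, again in SQLite for illustration: the new destination table is built from the union of the destination and staging tables, the EndDates are computed on the way in, and the tables are swapped at the end.  Names are mine, and an MPP or big data system would express the swap differently (e.g. partition exchange or view switching):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dest (CustomerID INT, StartDate TEXT, EndDate TEXT);
CREATE TABLE staging (CustomerID INT, StartDate TEXT, EndDate TEXT);
""")
con.executemany("INSERT INTO dest VALUES (?, ?, ?)",
                [(121, "2013-10-13", "9999-12-31")])
con.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                [(121, "2014-05-23", "9999-12-31")])

# Build the new destination table from the union of old and new
# StartDates, recomputing every EndDate, then swap the tables.
con.executescript("""
CREATE TABLE dest_new AS
WITH merged AS (
    SELECT CustomerID, StartDate FROM dest
    UNION
    SELECT CustomerID, StartDate FROM staging
)
SELECT m.CustomerID, m.StartDate,
       COALESCE((SELECT MIN(o.StartDate) FROM merged o
                 WHERE o.CustomerID = m.CustomerID
                   AND o.StartDate > m.StartDate),
                '9999-12-31') AS EndDate
FROM merged m;
DROP TABLE dest;
ALTER TABLE dest_new RENAME TO dest;
""")

for row in con.execute("SELECT * FROM dest ORDER BY StartDate"):
    print(row)
```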

That is the background, now to the test. Here are the requirements.

  1. The destination table will have 200 million records.  The new data to be added will be 200 thousand records.
  2. All of the data will be provided with an EffectiveDate, so the load of the 200 million records into a blank destination table will be performed by the same process that loads the 200 thousand records into the destination table already holding 200 million records.
  3. Data will be generated by a python script and will result in a number of files that will be moved to an S3 bucket.  CustomerID will not be restricted to a single file.
  4. Data with the exact same EffectiveDate may exist.
  5. The EffectiveDate range for the 200 million will be 1/1/2015 to 11/30/2015.
  6. The EffectiveDate range for the 200 thousand will be 11/1/2015 to 12/31/2015.
  7. The CustomerID should range from 1 to 1 million.
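
A hedged sketch of the kind of generator script requirement #3 describes.  The file name, CSV layout, and function names are assumptions of mine, not fixed by the requirements; only the CustomerID and EffectiveDate ranges come from the list above:

```python
import csv
import random
from datetime import date, timedelta

def generate(path, n_rows, start, end, n_customers=1_000_000, seed=0):
    """Write n_rows of (CustomerID, EffectiveDate) to a CSV file,
    with dates drawn uniformly from [start, end] inclusive."""
    rng = random.Random(seed)
    span = (end - start).days
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for _ in range(n_rows):
            cust = rng.randint(1, n_customers)        # requirement 7
            eff = start + timedelta(days=rng.randint(0, span))
            w.writerow([cust, eff.isoformat()])

# e.g. a small sample shaped like the 200-thousand-row delta
# (requirement 6's date range):
generate("delta_sample.csv", 1000, date(2015, 11, 1), date(2015, 12, 31))
```

Note that nothing here groups rows by CustomerID, which is exactly why design issue #1 below arises once the files land in S3.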

This is a particularly difficult version of the SCD problem.  The requirements contain the following design issues.

  1. According to requirement #3, I can’t count on the files being partitioned by CustomerID.
  2. According to requirement #4, I have to decide whether records with the same CustomerID and EffectiveDate are duplicates that should be ignored, or whether they must be kept, in which case the StartDate and EndDate of all but one of the duplicates will be set to the same value (the EffectiveDate).
  3. Requirements #5 and #6 have overlapping date ranges, so changes can be backdated and the calculation of EndDate needs to take that into account.
  4. Volumes of data in Requirements #1 and #2 are going to push the solution into the realms of big data.

My goal with this test is not to prove which big data system is better.  There are plenty of places to look for comparisons.  For me, this test is a chance to experience the same kind of difficult problem I encounter all the time, but in tools I am less familiar with.