Big Data Test – SCD – Design

Given the requirements and the set based table recreate solution discussed in the introduction, we need more detail on the steps required to solve this SCD problem.

To start with we have the data generated by the python scripts.  This data exists in two folders of an AWS S3 bucket:

  • Initial – our initial load of 200 million records
  • Batch – our first batch of data with 200 thousand records

The solution is going to be the same for loading the initial data as it is for loading the first batch.  In the steps below the term “new” data will first apply to the initial load and when run a second time it will apply to the first batch.

High Level Design

We have records we have already processed and new records to process.  The new records may contain back dated records.  Where those exists we may have to change the EndDate for existing records according to StartDate in the back dated records.  We may also have to set the EndDate for back dated records  to StartDate values that are in the already processed records.  And lastly, we have EndDates in new records that need to be set according to StartDate values in new records.

That seems complex, so let’s think of the problem in terms of impacted (EndDate needs to be recalculated) and non-impacted (EndDate is already set correctly) records.  All of the new records are impacted.  The already processed records are mostly non-impacted.  This is simpler as we can separate the records into impacted and non-impacted records.  So the high level design steps start with:

  1. Copy the unimpacted records from Customer to a tables called CustomerNew.
  2. Copy the impacted records from Customer to CustomerWork1.
  3. Copy all of the records from CustomerStg to CustomerWorking1 setting EndDate to 12/31/9999.

Next we process the impacted records in CustomerWork1 and copy them over to CustomerNew.  But we have to beware of impacted records with the same CustomerID and StartDate.  These “duplicate” records can cause a problem when using set based logic, so they need to be processed separately.

  1. Copy all but one “duplicate” record from CustomerWork1 to CustomerNew setting the EndDate to the StartDate.
  2. Copy the non-“duplicate” records and the single duplicate record from CustomerWork1 to CustomerNew setting the EndDate to either the next higher StartDate for the same Customer where a next higher StartDate exists or to 12/31/9999.
  3. If it exists, drop the CustomerOld table
  4. Rename the Customer table to CustomerOld
  5. Rename the CustomerNew table to Customer.

Detailed Design

Now for the detailed design steps:

  1. If there is not a Customer table, create it.
    Field NameData Type
    CustomerIDInteger
    TxnIDInteger
    FileNoInteger
    LineNoInteger
    StartDateDatetime
    EndDateDatetime
  2. Load the new records into CustomerStg table.
    Field NameData Type
    TxnIDInteger
    FileNoInteger
    LineNoInteger
    CustomerIDInteger
    EffectiveDateDatetime
  3. Create the CustomerStgMinEffDate table.
    Field NameData Type
    CustomerIDInteger
    MinEffectiveDateDatetime
  4. Insert into CustomerStgMinEffDate
    select CustomerID, min(EffectiveDate)  as MinEffectiveDate
    from CustomerStg
    group by CustomerID

    This table is going to be used to distinguish the impacted Customer records into impacted history (essentially non-impacted) records and impacted current (essentially impacted) records.  While it is possible for records in CustomerStg to be back dated, it is likely that there will be many non-impacted Customer records with StartDate and EndDate before the earliest back dated record.  For the initial load that will again be no records, but normally impacted history records would be most of the effected Customer records.

  5. Create the CustomerNew table.
    Field NameData Type
    CustomerIDInteger
    TxnIDInteger
    FileNoInteger
    LineNoInteger
    StartDateDatetime
    EndDateDatetime
  6. Insert into CustomerNew
    select Customer.*
    from Customer left outer join CustomerStgMinEffDate
    on Customer.CustomerID = CustomerStgMinEffDate.CustomerID
    where Customer.CustomerID not in (select CustomerID from CustomerStg)
    or (CustomerStgMinEffDate.CustomerID is not null
    and Customer.EndDate < CustomerStgMinEffDate.MinEffectiveDate)

    This step takes care of the non-impacted records.  For the initial load that will be no records, but normally non-impacted records would be most of the records from the Customer table as each customer does not change their data on a regular basis.

  7. Create the CustomerWork1 table.
    Field NameData Type
    CustomerIDInteger
    TxnIDInteger
    FileNoInteger
    LineNoInteger
    StartDateDatetime
    EndDateDatetime
  8. Insert into CustomerWork1
    select CustomerID, TxnID, FileNo, LineNo, EffectiveDate as StartDate,
    12/31/9999 as EndDate
    from CustomerStg
  9. Insert into CustomerWork1
    select Customer.*
    from Customer left outer join CustomerStgMinEffDate
    on Customer.CustomerID = CustomerStgMinEffDate.CustomerID
    where Customer.EndDate >= CustomerStgMinEffDate.MinEffectiveDate
  10. Create the CustomerWorkDups table.
    Field NameData Type
    CustomerIDInteger
    StartDateDatetime
    MaxTxnIDInteger
  11. Insert into CustomerWorkDups
    select CustomerID, StartDate, max(TxnID) as MaxTxnID
    from CustomerWork1
    group by CustomerID, StartDate
  12. Insert into CustomerNew
    select CustomerWork1.CustomerID, CustomerWork1.TxnID, CustomerWork1.FileNo,
    CustomerWork1.LineNo, CustomerWork1.StartDate,
    CustomerWork1.StartDate as EndDate
    from CustomerWork1 join CustomerWorkDups
    on CustomerWork1.CustomerID = CustomerWorkDups.CustomerID
    and CustomerWork1.StartDate = CustomerWorkDups.StartDate
    where CustomerWork1.TxnID != CustomerWorkDups.MaxTxnID
  13. Delete CustomerWork1
    from CustomerWork1 join CustomerWorkDups
    on CustomerWork1.CustomerID = CustomerWorkDups.CustomerID
    and CustomerWork1.StartDate = CustomerWorkDups.StartDate
    where CustomerWork1.TxnID != CustomerWorkDups.MaxTxnID

    This should be a small number of records, so even if deletes are expensive it is probably cheaper than moving all of the non-“duplicate” records to CustomerWork2 and using that table for the rest of the steps.

  14. Create the CustomerWorkEndDate table. 
    Field NameData Type
    CustomerIDInteger
    StartDateDatetime
    MinStartDateDatetime
  15. Insert into CustomerWorkEndDate
    select CustomerWork1.CustomerID, CustomerWork1.StartDate,
    min(CW1.StartDate) as MinStartDate
    from CustomerWork1 join CustomerWork1 CW1
    on CustomerWork1.CustomerID = CW1.CustomerID
    and CustomerWork1.StartDate < CW1.StartDate
    group by CustomerWork1.CustomerID, CustomerWork1.StartDate

    This is the CustomerWork1 table joined to itself to find the right EndDate values.  For the initial load this is a very large join and should be partitioned by CustomerID to be efficient.

  16. Insert into CustomerNew
    select CustomerWork1.CustomerID, CustomerWork1.TxnID, CustomerWork1.FileNo,
    CustomerWork1.LineNo, CustomerWork1.StartDate,
    coalesce(CustomerWorkEndDate.MinStartDate, 12/31/9999) as EndDate
    from CustomerWork1 left outer join CustomerWorkEndDate
    on CustomerWork1.CustomerID = CustomerWorkEndDate.CustomerID
    and CustomerWork1.StartDate = CustomerWorkEndDate.StartDate

    This will leave the leave the CustomerNew table fully loaded.  Again for the initial load this is a very large join and should be partitioned by CustomerID to be efficient.

  17. If CustomerOld exists, drop table CustomerOld
  18. Rename Customer as CustomerOld
  19. RenameCustomerNew as Customer

The above steps contain fake SQL statements.  There will need to be translations into the big data system that is used.

A quick test of these steps are to confirm that the count of records in Customer = the count of records in CustomerOld + the count of records in CustomerStg.