Big Data Test – SCD – Design

Posted On November 26, 2017

Given the requirements and the set based table recreate solution discussed in the introduction, we need more detail on the steps required to solve this SCD problem.

To start with we have the data generated by the python scripts. This data exists in two folders of an AWS S3 bucket:

Initial – our initial load of 200 million records
Batch – our first batch of data with 200 thousand records

The solution is going to be the same for loading the initial data as it is for loading the first batch. In the steps below the term “new” data will first apply to the initial load and when run a second time it will apply to the first batch.

High Level Design

We have records we have already processed and new records to process. The new records may contain back dated records. Where those exists we may have to change the EndDate for existing records according to StartDate in the back dated records. We may also have to set the EndDate for back dated records to StartDate values that are in the already processed records. And lastly, we have EndDates in new records that need to be set according to StartDate values in new records.

That seems complex, so let’s think of the problem in terms of impacted (EndDate needs to be recalculated) and non-impacted (EndDate is already set correctly) records. All of the new records are impacted. The already processed records are mostly non-impacted. This is simpler as we can separate the records into impacted and non-impacted records. So the high level design steps start with:

Copy the unimpacted records from Customer to a tables called CustomerNew.
Copy the impacted records from Customer to CustomerWork1.
Copy all of the records from CustomerStg to CustomerWorking1 setting EndDate to 12/31/9999.

Next we process the impacted records in CustomerWork1 and copy them over to CustomerNew. But we have to beware of impacted records with the same CustomerID and StartDate. These “duplicate” records can cause a problem when using set based logic, so they need to be processed separately.

Copy all but one “duplicate” record from CustomerWork1 to CustomerNew setting the EndDate to the StartDate.
Copy the non-“duplicate” records and the single duplicate record from CustomerWork1 to CustomerNew setting the EndDate to either the next higher StartDate for the same Customer where a next higher StartDate exists or to 12/31/9999.
If it exists, drop the CustomerOld table
Rename the Customer table to CustomerOld
Rename the CustomerNew table to Customer.

Detailed Design

Now for the detailed design steps:

If there is not a Customer table, create it.

Field Name Data Type

CustomerID Integer

TxnID Integer

FileNo Integer

LineNo Integer

StartDate Datetime

EndDate Datetime
Load the new records into CustomerStg table.

Field Name Data Type

TxnID Integer

FileNo Integer

LineNo Integer

CustomerID Integer

EffectiveDate Datetime
Create the CustomerStgMinEffDate table.

Field Name Data Type

CustomerID Integer

MinEffectiveDate Datetime
Insert into CustomerStgMinEffDate
select CustomerID, min(EffectiveDate) as MinEffectiveDate
from CustomerStg
group by CustomerID

This table is going to be used to distinguish the impacted Customer records into impacted history (essentially non-impacted) records and impacted current (essentially impacted) records. While it is possible for records in CustomerStg to be back dated, it is likely that there will be many non-impacted Customer records with StartDate and EndDate before the earliest back dated record. For the initial load that will again be no records, but normally impacted history records would be most of the effected Customer records.
Create the CustomerNew table.

Field Name Data Type

CustomerID Integer

TxnID Integer

FileNo Integer

LineNo Integer

StartDate Datetime

EndDate Datetime
Insert into CustomerNew
select Customer.*
from Customer left outer join CustomerStgMinEffDate
on Customer.CustomerID = CustomerStgMinEffDate.CustomerID
where Customer.CustomerID not in (select CustomerID from CustomerStg)
or (CustomerStgMinEffDate.CustomerID is not null
and Customer.EndDate < CustomerStgMinEffDate.MinEffectiveDate)

This step takes care of the non-impacted records. For the initial load that will be no records, but normally non-impacted records would be most of the records from the Customer table as each customer does not change their data on a regular basis.
Create the CustomerWork1 table.

Field Name Data Type

CustomerID Integer

TxnID Integer

FileNo Integer

LineNo Integer

StartDate Datetime

EndDate Datetime
Insert into CustomerWork1
select CustomerID, TxnID, FileNo, LineNo, EffectiveDate as StartDate,
12/31/9999 as EndDate
from CustomerStg
Insert into CustomerWork1
select Customer.*
from Customer left outer join CustomerStgMinEffDate
on Customer.CustomerID = CustomerStgMinEffDate.CustomerID
where Customer.EndDate >= CustomerStgMinEffDate.MinEffectiveDate
Create the CustomerWorkDups table.

Field Name Data Type

CustomerID Integer

StartDate Datetime

MaxTxnID Integer
Insert into CustomerWorkDups
select CustomerID, StartDate, max(TxnID) as MaxTxnID
from CustomerWork1
group by CustomerID, StartDate
Insert into CustomerNew
select CustomerWork1.CustomerID, CustomerWork1.TxnID, CustomerWork1.FileNo,
CustomerWork1.LineNo, CustomerWork1.StartDate,
CustomerWork1.StartDate as EndDate
from CustomerWork1 join CustomerWorkDups
on CustomerWork1.CustomerID = CustomerWorkDups.CustomerID
and CustomerWork1.StartDate = CustomerWorkDups.StartDate
where CustomerWork1.TxnID != CustomerWorkDups.MaxTxnID
Delete CustomerWork1
from CustomerWork1 join CustomerWorkDups
on CustomerWork1.CustomerID = CustomerWorkDups.CustomerID
and CustomerWork1.StartDate = CustomerWorkDups.StartDate
where CustomerWork1.TxnID != CustomerWorkDups.MaxTxnID

This should be a small number of records, so even if deletes are expensive it is probably cheaper than moving all of the non-“duplicate” records to CustomerWork2 and using that table for the rest of the steps.
Create the CustomerWorkEndDate table.

Field Name Data Type

CustomerID Integer

StartDate Datetime

MinStartDate Datetime
Insert into CustomerWorkEndDate
select CustomerWork1.CustomerID, CustomerWork1.StartDate,
min(CW1.StartDate) as MinStartDate
from CustomerWork1 join CustomerWork1 CW1
on CustomerWork1.CustomerID = CW1.CustomerID
and CustomerWork1.StartDate < CW1.StartDate
group by CustomerWork1.CustomerID, CustomerWork1.StartDate

This is the CustomerWork1 table joined to itself to find the right EndDate values. For the initial load this is a very large join and should be partitioned by CustomerID to be efficient.
Insert into CustomerNew
select CustomerWork1.CustomerID, CustomerWork1.TxnID, CustomerWork1.FileNo,
CustomerWork1.LineNo, CustomerWork1.StartDate,
coalesce(CustomerWorkEndDate.MinStartDate, 12/31/9999) as EndDate
from CustomerWork1 left outer join CustomerWorkEndDate
on CustomerWork1.CustomerID = CustomerWorkEndDate.CustomerID
and CustomerWork1.StartDate = CustomerWorkEndDate.StartDate

This will leave the leave the CustomerNew table fully loaded. Again for the initial load this is a very large join and should be partitioned by CustomerID to be efficient.
If CustomerOld exists, drop table CustomerOld
Rename Customer as CustomerOld
RenameCustomerNew as Customer

Field Name	Data Type
CustomerID	Integer
TxnID	Integer
FileNo	Integer
LineNo	Integer
StartDate	Datetime
EndDate	Datetime

Field Name	Data Type
TxnID	Integer
FileNo	Integer
LineNo	Integer
CustomerID	Integer
EffectiveDate	Datetime

Field Name	Data Type
CustomerID	Integer
MinEffectiveDate	Datetime

Field Name	Data Type
CustomerID	Integer
TxnID	Integer
FileNo	Integer
LineNo	Integer
StartDate	Datetime
EndDate	Datetime

Field Name	Data Type
CustomerID	Integer
TxnID	Integer
FileNo	Integer
LineNo	Integer
StartDate	Datetime
EndDate	Datetime

Field Name	Data Type
CustomerID	Integer
StartDate	Datetime
MaxTxnID	Integer

Field Name	Data Type
CustomerID	Integer
StartDate	Datetime
MinStartDate	Datetime

The above steps contain fake SQL statements. There will need to be translations into the big data system that is used.

A quick test of these steps are to confirm that the count of records in Customer = the count of records in CustomerOld + the count of records in CustomerStg.