Big Data Test – SCD – Design
Given the requirements and the set based table recreate solution discussed in the introduction, we need more detail on the steps required to solve this SCD problem.
To start with we have the data generated by the python scripts. This data exists in two folders of an AWS S3 bucket:
- Initial – our initial load of 200 million records
- Batch – our first batch of data with 200 thousand records
The solution is going to be the same for loading the initial data as it is for loading the first batch. In the steps below the term “new” data will first apply to the initial load and when run a second time it will apply to the first batch.
High Level Design
We have records we have already processed and new records to process. The new records may contain back dated records. Where those exists we may have to change the EndDate for existing records according to StartDate in the back dated records. We may also have to set the EndDate for back dated records to StartDate values that are in the already processed records. And lastly, we have EndDates in new records that need to be set according to StartDate values in new records.
That seems complex, so let’s think of the problem in terms of impacted (EndDate needs to be recalculated) and non-impacted (EndDate is already set correctly) records. All of the new records are impacted. The already processed records are mostly non-impacted. This is simpler as we can separate the records into impacted and non-impacted records. So the high level design steps start with:
- Copy the unimpacted records from Customer to a tables called CustomerNew.
- Copy the impacted records from Customer to CustomerWork1.
- Copy all of the records from CustomerStg to CustomerWorking1 setting EndDate to 12/31/9999.
Next we process the impacted records in CustomerWork1 and copy them over to CustomerNew. But we have to beware of impacted records with the same CustomerID and StartDate. These “duplicate” records can cause a problem when using set based logic, so they need to be processed separately.
- Copy all but one “duplicate” record from CustomerWork1 to CustomerNew setting the EndDate to the StartDate.
- Copy the non-“duplicate” records and the single duplicate record from CustomerWork1 to CustomerNew setting the EndDate to either the next higher StartDate for the same Customer where a next higher StartDate exists or to 12/31/9999.
- If it exists, drop the CustomerOld table
- Rename the Customer table to CustomerOld
- Rename the CustomerNew table to Customer.
Detailed Design
Now for the detailed design steps:
- If there is not a Customer table, create it.
Field Name Data Type CustomerID Integer TxnID Integer FileNo Integer LineNo Integer StartDate Datetime EndDate Datetime - Load the new records into CustomerStg table.
Field Name Data Type TxnID Integer FileNo Integer LineNo Integer CustomerID Integer EffectiveDate Datetime - Create the CustomerStgMinEffDate table.
Field Name Data Type CustomerID Integer MinEffectiveDate Datetime - Insert into CustomerStgMinEffDate
select CustomerID, min(EffectiveDate) as MinEffectiveDate
from CustomerStg
group by CustomerIDThis table is going to be used to distinguish the impacted Customer records into impacted history (essentially non-impacted) records and impacted current (essentially impacted) records. While it is possible for records in CustomerStg to be back dated, it is likely that there will be many non-impacted Customer records with StartDate and EndDate before the earliest back dated record. For the initial load that will again be no records, but normally impacted history records would be most of the effected Customer records.
- Create the CustomerNew table.
Field Name Data Type CustomerID Integer TxnID Integer FileNo Integer LineNo Integer StartDate Datetime EndDate Datetime - Insert into CustomerNew
select Customer.*
from Customer left outer join CustomerStgMinEffDate
on Customer.CustomerID = CustomerStgMinEffDate.CustomerID
where Customer.CustomerID not in (select CustomerID from CustomerStg)
or (CustomerStgMinEffDate.CustomerID is not null
and Customer.EndDate < CustomerStgMinEffDate.MinEffectiveDate)This step takes care of the non-impacted records. For the initial load that will be no records, but normally non-impacted records would be most of the records from the Customer table as each customer does not change their data on a regular basis.
- Create the CustomerWork1 table.
Field Name Data Type CustomerID Integer TxnID Integer FileNo Integer LineNo Integer StartDate Datetime EndDate Datetime - Insert into CustomerWork1
select CustomerID, TxnID, FileNo, LineNo, EffectiveDate as StartDate,
12/31/9999 as EndDate
from CustomerStg - Insert into CustomerWork1
select Customer.*
from Customer left outer join CustomerStgMinEffDate
on Customer.CustomerID = CustomerStgMinEffDate.CustomerID
where Customer.EndDate >= CustomerStgMinEffDate.MinEffectiveDate - Create the CustomerWorkDups table.
Field Name Data Type CustomerID Integer StartDate Datetime MaxTxnID Integer - Insert into CustomerWorkDups
select CustomerID, StartDate, max(TxnID) as MaxTxnID
from CustomerWork1
group by CustomerID, StartDate - Insert into CustomerNew
select CustomerWork1.CustomerID, CustomerWork1.TxnID, CustomerWork1.FileNo,
CustomerWork1.LineNo, CustomerWork1.StartDate,
CustomerWork1.StartDate as EndDate
from CustomerWork1 join CustomerWorkDups
on CustomerWork1.CustomerID = CustomerWorkDups.CustomerID
and CustomerWork1.StartDate = CustomerWorkDups.StartDate
where CustomerWork1.TxnID != CustomerWorkDups.MaxTxnID - Delete CustomerWork1
from CustomerWork1 join CustomerWorkDups
on CustomerWork1.CustomerID = CustomerWorkDups.CustomerID
and CustomerWork1.StartDate = CustomerWorkDups.StartDate
where CustomerWork1.TxnID != CustomerWorkDups.MaxTxnIDThis should be a small number of records, so even if deletes are expensive it is probably cheaper than moving all of the non-“duplicate” records to CustomerWork2 and using that table for the rest of the steps.
- Create the CustomerWorkEndDate table.
Field Name Data Type CustomerID Integer StartDate Datetime MinStartDate Datetime - Insert into CustomerWorkEndDate
select CustomerWork1.CustomerID, CustomerWork1.StartDate,
min(CW1.StartDate) as MinStartDate
from CustomerWork1 join CustomerWork1 CW1
on CustomerWork1.CustomerID = CW1.CustomerID
and CustomerWork1.StartDate < CW1.StartDate
group by CustomerWork1.CustomerID, CustomerWork1.StartDateThis is the CustomerWork1 table joined to itself to find the right EndDate values. For the initial load this is a very large join and should be partitioned by CustomerID to be efficient.
- Insert into CustomerNew
select CustomerWork1.CustomerID, CustomerWork1.TxnID, CustomerWork1.FileNo,
CustomerWork1.LineNo, CustomerWork1.StartDate,
coalesce(CustomerWorkEndDate.MinStartDate, 12/31/9999) as EndDate
from CustomerWork1 left outer join CustomerWorkEndDate
on CustomerWork1.CustomerID = CustomerWorkEndDate.CustomerID
and CustomerWork1.StartDate = CustomerWorkEndDate.StartDateThis will leave the leave the CustomerNew table fully loaded. Again for the initial load this is a very large join and should be partitioned by CustomerID to be efficient.
- If CustomerOld exists, drop table CustomerOld
- Rename Customer as CustomerOld
- RenameCustomerNew as Customer
The above steps contain fake SQL statements. There will need to be translations into the big data system that is used.
A quick test of these steps are to confirm that the count of records in Customer = the count of records in CustomerOld + the count of records in CustomerStg.