Big Data Test – SCD – Redshift

Redshift is the first AWS technology I am using for the big data test of slowly changing dimension processing.  The data we need was generated by a Python script and exists in two folders of an AWS S3 bucket:

  • Initial – our initial load of 200 million records
  • Batch – our first batch of data with 200 thousand records

The solution is going to be the same for loading the initial data as it is for loading the first batch.  The Redshift script can be found at the end of this post.

The results are impressive.  The process averaged about 28 minutes on the initial load and just under 3 minutes on the batch load.  And that is Redshift running on a single dc2.large node (2 vCPU, 15.25 GB RAM, 2 slices).  When run on 4 dc2.large nodes, the duration dropped to 9 minutes for the initial load and 1 minute for the batch.

But this test was never about performance, it was about learning the technology.  Here are some of the things I learned.

  • To create robust, restartable scripts, other SQL databases commonly rely on IF statements to test for the existence of objects before creating them.  Redshift SQL does not have an IF statement, but it has cleverly added optional clauses such as “if exists” and “if not exists” to drop and create statements that perform the same test.  Time will tell, but I think I like this better because I don’t need to know the system tables in order to build a robust script.
  • The Redshift way to determine columnar compression seems to be (sketched just after this list):
    1. Create a table with no compression.
    2. Load sample data (100,000+ records) using COPY.
    3. Let Redshift pick the compression.
    4. Alter the create script with the Redshift suggestions.
  • Columnar compression is a bit of a pain.  Typical projects create tables well before any data is ready to load, which just means the project will have to add a table optimization step into its later stages.
  • Is it better to copy into a table with no sort key and then deep copy into a table with a sort key?  I expected to find a difference, but the quick tests I ran did not show any significant performance advantage either way (both load paths are sketched after this list).
  • Redshift has an interesting method of moving records from table to table using “alter table … append …” (see the sketch after this list).  It is only good for moving exact copies of data, but that is a good tool to have in the toolbox.
  • I did use a delete in the script.  While Redshift can handle updates and deletes, you want to keep those to a minimum because the database stores data in 1 MB immutable blocks: if data is deleted or changed, the entire block is recreated.  Granted, the delete I used removed hundreds of records from a table with hundreds of millions of records, but those deletes are likely spread evenly across the data set, so the percentage of impacted blocks may be much higher than the percentage of records.  A vacuum could clean this up (see the sketch after this list), or a design change that adds another step table would also work.  I did not go down this path, but would for a production system.
  • I was curious about how views work in Redshift given that table drops and recreates may happen frequently.  By default, views are bound to the system names of the tables and therefore stop working when a table is dropped and recreated.  Also, it is my understanding that views interfere with query optimization.  The create view statement has an optional clause, “with no schema binding”, that allows a view to keep working after a drop and recreate (see the sketch after this list), but it does not help with query optimization, so in general I would likely advise against putting a layer of views over the tables.
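
A minimal sketch of the compression workflow, using a hypothetical customer_sample table whose columns and S3 placeholders mirror the staging load in the script below; here the analyze compression command is one way to have Redshift suggest the encodings:

-- 1. Create the sample table with no compression encodings.
create table if not exists customer_sample (
 txnid integer not null,
 fileno integer,
 lineno integer,
 customerid integer,
 effectivedate timestamp);

-- 2. Load a representative sample (100,000+ records); compupdate off keeps COPY
--    from applying its own automatic compression to the empty table.
copy customer_sample
from 's3://<put your bucket here>/scd/initial/txn'
iam_role 'arn:aws:iam::<awsacctno>:role/<name of role allowing redshift to read the s3 data>'
bzip2
ignoreheader 1
csv
timeformat 'YYYYMMDDHHMISS'
compupdate off;

-- 3. Ask Redshift for suggested encodings, then
-- 4. copy those suggestions into the real create table script.
analyze compression customer_sample;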
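
A rough sketch of the two load paths behind the sort key question, assuming customerstg starts out empty (customerstg_nosort is a hypothetical staging table with no sort key):

-- Option A: copy straight into customerstg, which is defined with a sort key.
copy customerstg
from 's3://<put your bucket here>/scd/initial/txn'
iam_role 'arn:aws:iam::<awsacctno>:role/<name of role allowing redshift to read the s3 data>'
bzip2 ignoreheader 1 csv timeformat 'YYYYMMDDHHMISS';

-- Option B: copy into a table with no sort key, then deep copy into the sorted table.
create table if not exists customerstg_nosort (
 txnid integer not null,
 fileno integer,
 lineno integer,
 customerid integer,
 effectivedate timestamp)
distkey (customerid);

copy customerstg_nosort
from 's3://<put your bucket here>/scd/initial/txn'
iam_role 'arn:aws:iam::<awsacctno>:role/<name of role allowing redshift to read the s3 data>'
bzip2 ignoreheader 1 csv timeformat 'YYYYMMDDHHMISS';

insert into customerstg
select * from customerstg_nosort;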
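
The “alter table … append” statement, sketched against two tables defined later in the script (customernew and customerwork1 have identical column definitions and the same distkey, which the statement requires):

-- Moves every row from customerwork1 into customernew and leaves customerwork1 empty;
-- the data is moved rather than duplicated, and the statement cannot run inside a
-- transaction block.
alter table customernew append from customerwork1;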
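
Had I cleaned up after the delete in Step 13, it would have looked something like this sketch; a “delete only” vacuum reclaims the space from deleted rows without re-sorting the table:

-- Reclaim the blocks freed by the Step 13 delete.
vacuum delete only customerwork1;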
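
And the late-binding view clause, sketched as a hypothetical view over the customer table (note that “with no schema binding” requires schema-qualified table names):

-- A late-binding view: it survives the drop/recreate of public.customer because the
-- table reference is resolved at query time rather than at create time.
create view customer_current as
select customerid, startdate, enddate
from public.customer
where enddate = to_date('12/31/9999', 'MM/DD/YYYY')
with no schema binding;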

All in all, I am impressed with Redshift.  But I’ve always been a fan of MPP databases.

Note: to use the following script, the copy statement in Step 2 has 4 pieces of information within <> that you will need to provide.


-- Step 1 - If there is not a Customer table, create it.

create table if not exists customer (
 txnid integer not null encode lzo,
 fileno integer encode delta,
 lineno integer encode delta32k,
 customerid integer encode delta,
 startdate timestamp encode lzo,
 enddate timestamp encode lzo,
 primary key (txnid))
distkey (customerid)
sortkey (customerid, startdate);


-- Step 2 - Load the new records into CustomerStg table.

create table if not exists customerstg (
 txnid integer not null encode lzo,
 fileno integer encode delta,
 lineno integer encode delta32k,
 customerid integer encode delta,
 effectivedate timestamp encode lzo,
 primary key (txnid))
distkey (customerid)
sortkey (customerid, effectivedate);

truncate table customerstg;

copy customerstg
from 's3://<put your bucket here>/scd/<initial for the first run, batch for the second run>/txn'
iam_role 'arn:aws:iam::<awsacctno>:role/<name of role allowing redshift to read the s3 data>'
bzip2
ignoreheader 1
csv
timeformat 'YYYYMMDDHHMISS';


-- Step 3 - Create the CustomerStgMinEffDate table.

create table if not exists customerstgmineffdate (
 customerid integer encode raw,
 mineffectivedate timestamp encode lzo,
 primary key (customerid))
distkey (customerid)
sortkey (customerid);

truncate table customerstgmineffdate;

-- Step 4 - Insert into CustomerStgMinEffDate the earliest new effective date for each customer in the staging data.

insert into customerstgmineffdate
select customerid, min(effectivedate) as mineffectivedate
from customerstg
group by customerid;


-- Step 5 - Create the CustomerNew table.

create table if not exists customernew (
 txnid integer not null encode lzo,
 fileno integer encode delta,
 lineno integer encode delta32k,
 customerid integer encode delta,
 startdate timestamp encode lzo,
 enddate timestamp encode lzo,
 primary key (txnid))
distkey (customerid)
sortkey (customerid, startdate);

truncate table customernew;

-- Step 6 - Insert into CustomerNew the existing history the new batch does not touch:
--          customers with no staging records, plus rows that ended before the customer's earliest new effective date.

insert into customernew
select c.txnid, c.fileno, c.lineno, c.customerid, c.startdate, c.enddate
from customer c 
left outer join customerstgmineffdate csmed
on c.customerid = csmed.customerid
where csmed.customerid is null
or (csmed.customerid is not null and c.enddate < csmed.mineffectivedate);


-- Step 7 - Create the CustomerWork1 table.

create table if not exists customerwork1 (
 txnid integer not null encode lzo,
 fileno integer encode delta,
 lineno integer encode delta32k,
 customerid integer encode delta,
 startdate timestamp encode lzo,
 enddate timestamp encode lzo,
 primary key (txnid))
distkey (customerid)
sortkey (customerid, startdate);

truncate table customerwork1;


-- Step 8 - Insert into CustomerWork1 the staging records as open-ended rows (enddate 12/31/9999).

insert into customerwork1
select txnid, fileno, lineno, customerid, effectivedate as startdate, to_date('12/31/9999', 'MM/DD/YYYY') as enddate
from customerstg;

-- Step 9 - Insert into CustomerWork1 the existing rows that overlap the new data and may need new end dates.

insert into customerwork1
select c.txnid, c.fileno, c.lineno, c.customerid, c.startdate, c.enddate
from customer c
left outer join customerstgmineffdate csmed
on c.customerid = csmed.customerid
where c.enddate >= csmed.mineffectivedate;

-- Step 10 - Create the CustomerWorkDups table.

create table if not exists customerworkdups (
 customerid integer encode raw,
 startdate timestamp encode lzo,
 maxtxnid integer encode lzo,
 primary key (customerid))
distkey (customerid)
sortkey (customerid);

truncate table customerworkdups;

-- Step 11 - Insert into CustomerWorkDups each customerid/startdate that appears more than once, keeping the highest txnid as the survivor.

insert into customerworkdups
select customerid, startdate, max(txnid) as maxtxnid
from customerwork1
group by customerid, startdate
having count(*) > 1;

-- Step 12 - Insert into CustomerNew the losing duplicates as zero-duration rows (enddate = startdate).

insert into customernew
select cw1.txnid, cw1.fileno, cw1.lineno, cw1.customerid, cw1.startdate, cw1.startdate as enddate
from customerwork1 cw1
join customerworkdups cwd
on cw1.customerid = cwd.customerid
and cw1.startdate = cwd.startdate
where cw1.txnid != cwd.maxtxnid;

-- Step 13 - Delete the losing duplicates from CustomerWork1.

delete from customerwork1
using customerworkdups
where customerwork1.customerid = customerworkdups.customerid
and customerwork1.startdate = customerworkdups.startdate
and customerwork1.txnid != customerworkdups.maxtxnid;

-- Step 14 - Create the CustomerWorkEndDate table.

create table if not exists customerworkenddate (
 customerid integer encode raw,
 startdate timestamp encode lzo,
 minstartdate timestamp encode lzo,
 primary key (customerid))
distkey (customerid)
sortkey (customerid);

truncate table customerworkenddate;

-- Step 15 - Insert into CustomerWorkEndDate the next startdate for each customerid/startdate pair.

insert into customerworkenddate
select cw1a.customerid, cw1a.startdate, min(cw1b.startdate) as minstartdate
from customerwork1 cw1a
join customerwork1 cw1b
on cw1a.customerid = cw1b.customerid
and cw1a.startdate < cw1b.startdate
group by cw1a.customerid, cw1a.startdate;


-- Step 16 - Insert into CustomerNew the remaining work rows, end-dated with the next startdate (or 12/31/9999 for the latest row).

insert into customernew
select cw1.txnid, cw1.fileno, cw1.lineno, cw1.customerid, cw1.startdate,
 coalesce(cwed.minstartdate, to_date('12/31/9999', 'MM/DD/YYYY')) as enddate
from customerwork1 cw1
left outer join customerworkenddate cwed
on cw1.customerid = cwed.customerid
and cw1.startdate = cwed.startdate;

vacuum full customernew;

analyze customernew;


-- Step 17 - If CustomerOld exists, drop table CustomerOld

drop table if exists customerold;


-- Step 18 - Rename Customer as CustomerOld

alter table customer rename to customerold;


-- Step 19 - Rename CustomerNew as Customer

alter table customernew rename to customer;