Big Data Test – SCD – Redshift
Redshift is the first AWS technology that I am using for the big data test of slowly changing dimension processing. The data we need was generated by a Python script and exists in two folders of an AWS S3 bucket:
- Initial – our initial load of 200 million records
- Batch – our first batch of data with 200 thousand records
The solution is going to be the same for loading the initial data as it is for loading the first batch. The Redshift script can be found at the end of this post.
The results are impressive. The process averaged about 28 minutes on the initial load and just under 3 minutes on the batch load, and that is on a Redshift cluster with a single dc2.large node (2 vCPU, 15.25 GB RAM, 2 slices). When run on 4 dc2.large nodes, the duration dropped to 9 minutes for the initial load and 1 minute for the batch.
But this test was never about performance, it was about learning the technology. Here are some of the things I learned.
- In order to create robust, restartable scripts, other SQL databases commonly use IF statements to test for the existence of objects before you try to create them. Redshift SQL does not have an IF statement, but it has cleverly added optional clauses like “if exists” and “if not exists” to drop and create statements that perform the same test (a small sketch follows this list). Time will tell, but I think I like this better, as I don’t need to know the system tables in order to build a robust script.
- The Redshift way to determine columnar compression seems to be (sketched after this list):
- Create a table with no compression.
- Load sample data (100,000+ records) using COPY
- Let Redshift pick the compression
- Alter the create script with the Redshift suggestions
- Columnar compression is a bit of a pain. Typical projects create tables well before any data is ready to load, which just means the project is going to have to add a table optimization step into its later stages.
- Is it better to copy into a table with no sort key and then deep copy into a table with a sort key (the pattern sketched after this list)? I expected to find a difference, but I did not find any significant performance advantage in the quick tests I ran.
- Redshift has an interesting method of moving records from table to table using “alter table … append …” (syntax shown after this list). It is only good for moving exact copies of data, but that is a good tool to have in the toolbox.
- I did use a delete in the script. While Redshift can handle updates and deletes, you want to keep those to a minimum, because the database stores data in 1 MB immutable blocks: if data is deleted or changed, the entire block is recreated. Granted, the delete I used removed only hundreds of records from a table with hundreds of millions of records, but those deletes are likely spread evenly across the data set, so the percentage of impacted blocks may be higher than the percentage of deleted records. A vacuum could clean this up (see the sketch after this list), or a design change to another step table would also work. I did not go down this path, but I would for a production system.
- I was curious about how views would work in Redshift given that table drops and recreates may happen frequently. By default, views are bound to the internal system names of the tables and therefore stop working when a table is dropped and recreated. It is also my understanding that views interfere with query optimization. The create view statement has an optional clause, “with no schema binding” (shown after this list), that allows a view to keep working after a drop and recreate, but it does not help with query optimization, so in general I would advise against using a layer of views over the tables.
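To contrast the IF-statement approach with Redshift’s clauses, here is the small sketch promised in the first bullet; the table name is only illustrative.

-- No check against system tables needed; each statement is a no-op
-- if the object is already in the desired state.
drop table if exists customer_scratch;
create table if not exists customer_scratch (
  customerid integer,
  startdate timestamp);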
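Here is the compression workflow from the list as a minimal sketch. The customer_sample table is a throwaway name, and the COPY placeholders are the same ones used in the script at the end of this post.

-- 1. Create a sample table with no compression (encode raw everywhere).
create table if not exists customer_sample (
  txnid integer not null encode raw,
  fileno integer encode raw,
  lineno integer encode raw,
  customerid integer encode raw,
  effectivedate timestamp encode raw);

-- 2. Load 100,000+ sample records; compupdate off keeps COPY from
--    applying its own automatic compression during the load.
copy customer_sample
from 's3://<put your bucket here>/scd/initial/txn'
iam_role 'arn:aws:iam::<awsacctno>:role/<name of role allowing redshift to read the s3 data>'
bzip2 ignoreheader 1 csv timeformat 'YYYYMMDDHHMISS'
compupdate off;

-- 3. Ask Redshift what encodings it would pick.
analyze compression customer_sample;

-- 4. Copy the suggested encodings into the permanent create table script.

ANALYZE COMPRESSION returns a suggested encoding for each column; letting COPY’s automatic compression (COMPUPDATE) run against an empty table and then inspecting what it applied is another way to get the same answer.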
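The sort key question boils down to comparing a direct COPY against a pattern like the following; customer_nosort and customer_sorted are hypothetical names, not tables from the script.

-- Fast initial load into a table with no sort key...
create table if not exists customer_nosort (
  txnid integer not null,
  customerid integer,
  startdate timestamp,
  enddate timestamp);

-- ...then a deep copy into a table that declares the sort key.
create table if not exists customer_sorted (
  txnid integer not null,
  customerid integer,
  startdate timestamp,
  enddate timestamp)
distkey (customerid)
sortkey (customerid, startdate);

insert into customer_sorted
select * from customer_nosort;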
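The “alter table … append” syntax mentioned above looks like this; the source table name is hypothetical. It moves rows rather than copying them, so the source table is empty afterwards, the two tables must have matching column definitions, and the statement cannot run inside a transaction block.

-- Moves every row of customerstg_arrivals into customerstg,
-- leaving customerstg_arrivals empty.
alter table customerstg append from customerstg_arrivals;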
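If I did want to reclaim the space freed by that delete, the cleanup I had in mind is simply:

-- Reclaim the space held by deleted rows without a full re-sort,
-- then refresh the planner statistics.
vacuum delete only customerwork1;
analyze customerwork1;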
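And this is the late-binding view form referred to above; v_customer is a hypothetical view name, and I am assuming the table lives in the default public schema (late-binding views require schema-qualified references).

-- Resolved by name at query time, so it keeps working through the
-- drop/recreate and rename pattern used in the script below.
create view v_customer as
select customerid, startdate, enddate
from public.customer
with no schema binding;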
All in all, I am impressed with Redshift. But I’ve always been a fan of MPP databases.
Note: to use the following script, the COPY statement in Step 2 contains four pieces of information inside <> that you will need to provide.
-- Step 1 - If there is not a Customer table, create it.
create table if not exists customer (
  txnid integer not null encode lzo,
  fileno integer encode delta,
  lineno integer encode delta32k,
  customerid integer encode delta,
  startdate timestamp encode lzo,
  enddate timestamp encode lzo,
  primary key (txnid))
distkey (customerid)
sortkey (customerid, startdate);

-- Step 2 - Load the new records into CustomerStg table.
create table if not exists customerstg (
  txnid integer not null encode lzo,
  fileno integer encode delta,
  lineno integer encode delta32k,
  customerid integer encode delta,
  effectivedate timestamp encode lzo,
  primary key (txnid))
distkey (customerid)
sortkey (customerid, effectivedate);

truncate table customerstg;

copy customerstg
from 's3://<put your bucket here>/scd/<initial for the first run, batch for the second run>/txn'
iam_role 'arn:aws:iam::<awsacctno>:role/<name of role allowing redshift to read the s3 data>'
bzip2 ignoreheader 1 csv timeformat 'YYYYMMDDHHMISS';

-- Step 3 - Create the CustomerStgMinEffDate table.
create table if not exists customerstgmineffdate (
  customerid integer encode raw,
  mineffectivedate timestamp encode lzo,
  primary key (customerid))
distkey (customerid)
sortkey (customerid);

truncate table customerstgmineffdate;

-- Step 4 - Insert into CustomerStgMinEffDate
insert into customerstgmineffdate
select customerid, min(effectivedate) as mineffectivedate
from customerstg
group by customerid;

-- Step 5 - Create the CustomerNew table.
create table if not exists customernew (
  txnid integer not null encode lzo,
  fileno integer encode delta,
  lineno integer encode delta32k,
  customerid integer encode delta,
  startdate timestamp encode lzo,
  enddate timestamp encode lzo,
  primary key (txnid))
distkey (customerid)
sortkey (customerid, startdate);

truncate table customernew;

-- Step 6 - Insert into CustomerNew
insert into customernew
select c.txnid, c.fileno, c.lineno, c.customerid, c.startdate, c.enddate
from customer c
left outer join customerstgmineffdate csmed
  on c.customerid = csmed.customerid
where csmed.customerid is null
   or (csmed.customerid is not null and c.enddate < csmed.mineffectivedate);

-- Step 7 - Create the CustomerWork1 table.
create table if not exists customerwork1 (
  txnid integer not null encode lzo,
  fileno integer encode delta,
  lineno integer encode delta32k,
  customerid integer encode delta,
  startdate timestamp encode lzo,
  enddate timestamp encode lzo,
  primary key (txnid))
distkey (customerid)
sortkey (customerid, startdate);

truncate table customerwork1;

-- Step 8 - Insert into CustomerWork1
insert into customerwork1
select txnid, fileno, lineno, customerid,
       effectivedate as startdate,
       to_date('12/31/9999', 'MM/DD/YYYY') as enddate
from customerstg;

-- Step 9 - Insert into CustomerWork1
insert into customerwork1
select c.txnid, c.fileno, c.lineno, c.customerid, c.startdate, c.enddate
from customer c
left outer join customerstgmineffdate csmed
  on c.customerid = csmed.customerid
where c.enddate >= csmed.mineffectivedate;

-- Step 10 - Create the CustomerWorkDups table.
create table if not exists customerworkdups (
  customerid integer encode raw,
  startdate timestamp encode lzo,
  maxtxnid integer encode lzo,
  primary key (customerid))
distkey (customerid)
sortkey (customerid);

truncate table customerworkdups;

-- Step 11 - Insert into CustomerWorkDups
insert into customerworkdups
select customerid, startdate, max(txnid) as maxtxnid
from customerwork1
group by customerid, startdate
having count(*) > 1;

-- Step 12 - Insert into CustomerNew
insert into customernew
select cw1.txnid, cw1.fileno, cw1.lineno, cw1.customerid,
       cw1.startdate, cw1.startdate as enddate
from customerwork1 cw1
join customerworkdups cwd
  on cw1.customerid = cwd.customerid
 and cw1.startdate = cwd.startdate
where cw1.txnid != cwd.maxtxnid;

-- Step 13 - Delete CustomerWork1
delete from customerwork1
using customerworkdups
where customerwork1.customerid = customerworkdups.customerid
  and customerwork1.startdate = customerworkdups.startdate
  and customerwork1.txnid != customerworkdups.maxtxnid;

-- Step 14 - Create the CustomerWorkEndDate table.
create table if not exists customerworkenddate (
  customerid integer encode raw,
  startdate timestamp encode lzo,
  minstartdate timestamp encode lzo,
  primary key (customerid))
distkey (customerid)
sortkey (customerid);

truncate table customerworkenddate;

-- Step 15 - Insert into CustomerWorkEndDate
insert into customerworkenddate
select cw1a.customerid, cw1a.startdate, min(cw1b.startdate) as minstartdate
from customerwork1 cw1a
join customerwork1 cw1b
  on cw1a.customerid = cw1b.customerid
 and cw1a.startdate < cw1b.startdate
group by cw1a.customerid, cw1a.startdate;

-- Step 16 - Insert into CustomerNew
insert into customernew
select cw1.txnid, cw1.fileno, cw1.lineno, cw1.customerid, cw1.startdate,
       coalesce(cwed.minstartdate, to_date('12/31/9999', 'MM/DD/YYYY')) as enddate
from customerwork1 cw1
left outer join customerworkenddate cwed
  on cw1.customerid = cwed.customerid
 and cw1.startdate = cwed.startdate;

vacuum full customernew;
analyze customernew;

-- Step 17 - If CustomerOld exists, drop table CustomerOld
drop table if exists customerold;

-- Step 18 - Rename Customer as CustomerOld
alter table customer rename to customerold;

-- Step 19 - Rename CustomerNew as Customer
alter table customernew rename to customer;