Big Data Test – SCD – Generating Data

According to the introduction, this test needs some generated test data.  The requirements for the test data are:

  1. The destination table will have 200 million records.  The new data to be added will be 200 thousand records.
  2. All of the data will be provided with an EffectiveDate, so the initial load of 200 million records into a blank destination table can be performed by the same process that loads the 200 thousand records into the table that already holds 200 million.
  3. Data will be generated by a Python script, resulting in a number of files that will be moved to an S3 bucket.  A given CustomerID will not be restricted to a single file.
  4. Data with the exact same EffectiveDate may exist.
  5. The EffectiveDate range for the 200 million records will be 1/1/2015 to 11/30/2015.
  6. The EffectiveDate range for the 200 thousand records will be 11/1/2015 to 12/31/2015.
  7. The CustomerID should range from 1 to 1 million.
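
Put concretely, each generated record carries a transaction ID, the file and line it was written to, a customer ID, and an event time that serves as the EffectiveDate.  A couple of rows from the initial load might look like this (values illustrative):

txnid,fileno,lineno,custid,eventtime
1,1,1,483920,20150612083015
2,1,2,17,20151103221947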

I created a script generate.zip to fulfill the requirements.  I’ll let the script speak for itself.  You can download it or take a look at it below.

I need a place to run this Python script.  My first thought is maybe I can do this in Lambda.  But this is going to be gigabytes of data, which would push Lambda's limits (CPU, runtime, and temporary storage) quite a bit, so I'll spin up an EC2 spot instance.
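
For a rough sense of scale: the widest row ("200000000,100,2000000,1000000,20151130235959" plus a newline) is about 45 bytes, so 200 million rows come to roughly 8 GB, and since the script below writes every row twice (once to its per-batch file, once to bigfile.csv) the working space needed is closer to 16 GB before compression.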

I used the following instance properties:

  • Amazon Linux AMI, 64-bit
  • i3.large (the CPU isn't the draw here; the instance store volume gives fast local scratch space for the data)
  • A security group that allows you to SSH in
  • A role that grants S3 permissions
  • A key pair for SSH

I was able to obtain an i3.large spot instance in us-east-1 for $0.02 per hour.  Probably not going to need it for more than an hour.  Cheap.
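
If you'd rather make the spot request from the CLI than the console, something like the following should do it (the AMI ID, key pair name, security group, and instance profile name are placeholders; substitute your own):

aws ec2 request-spot-instances \
    --spot-price "0.05" \
    --instance-count 1 \
    --launch-specification '{
        "ImageId": "ami-xxxxxxxx",
        "InstanceType": "i3.large",
        "KeyName": "my-key-pair",
        "SecurityGroupIds": ["sg-xxxxxxxx"],
        "IamInstanceProfile": {"Name": "my-s3-role"}
    }'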

Steps to generate the data:

  1. sudo su (yeah, bad idea, but this is all temporary)
  2. yum update -y (always)
  3. yum install python35 -y
  4. lsblk (find the instance store, mine was nvme0n1)
  5. mkfs -t ext4 /dev/nvme0n1
  6. mkdir /generate
  7. mount /dev/nvme0n1 /generate
  8. cd /generate
  9. wget https://jimburnham.cloud/generate.zip
  10. unzip generate.zip
  11. chmod +x generate.py
  12. ./generate.py -r 200000000 -l 2000000 -c 1000000 -s 1/1/2015 -e 11/30/2015
  13. bzip2 data/*
  14. aws s3 cp data s3://<put your bucket here>/initial/ --recursive
  15. rm -rf data
  16. ./generate.py -t 200000000 -r 200000 -l 20000 -c 1000000 -s 11/1/2015 -e 12/31/2015
  17. bzip2 data/*
  18. aws s3 cp data s3://<put your bucket here>/batch/ --recursive
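
Before terminating the instance, it's worth a quick sanity check that everything landed in S3; the --summarize flag on aws s3 ls prints an object count and total size:

aws s3 ls s3://<put your bucket here>/initial/ --recursive --summarize
aws s3 ls s3://<put your bucket here>/batch/ --recursive --summarize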

You can now cancel and terminate the spot instance.

Here is the code:


#!/usr/bin/python3

from random import random
from datetime import datetime, timedelta
from argparse import ArgumentParser
from os.path import isdir
from os import mkdir

parser = ArgumentParser(description='Data generator for SCD big data test')
parser.add_argument('-r', '--records', help='Total number of records to generate', required=True)
parser.add_argument('-l', '--lines', help='Number of records per file', required=True)
parser.add_argument('-c', '--customers', help='Maximum number of customers', required=True)
parser.add_argument('-s', '--startdate', help='Start date in MM/DD/YYYY format', required=True)
parser.add_argument('-e', '--enddate', help='End date in MM/DD/YYYY format', required=True)
# -t lets a later run continue transaction IDs where an earlier run left off
# (step 16 above passes -t 200000000 so the batch IDs follow the initial load)
parser.add_argument('-t', '--txnstart', help='Transaction ID to continue from', default=0)
args = parser.parse_args()

records = int(args.records)
lines = int(args.lines)
customers = int(args.customers)
txnstart = int(args.txnstart)  # 0 for the initial load; 200000000 for the batch run
month, day, year = args.startdate.split('/')
start = datetime(int(year), int(month), int(day), 0, 0, 0)
month, day, year = args.enddate.split('/')
end = datetime(int(year), int(month), int(day), 23, 59, 59)
startend = end - start
seconds = startend.total_seconds()  # size of the window event times are drawn from

if not isdir('data'):
    mkdir('data')

# bigfile.csv collects every generated row in a single file
bf = open('data/bigfile.csv', 'w')
bf.write('txnid,fileno,lineno,custid,eventtime\n')

filebase = 'data/txn'
fileno = 1
lineno = 0

# per-batch files are named txn001.csv, txn002.csv, ...
filename = filebase + str(fileno).zfill(3) + '.csv'
f = open(filename, 'w')

for txnid in range(txnstart, txnstart + records):
    lineno = lineno + 1
    if lineno > lines:
        # current file is full; roll over to the next one
        f.close()
        lineno = 1
        fileno = fileno + 1
        filename = filebase + str(fileno).zfill(3) + '.csv'
        f = open(filename, 'w')
    if lineno == 1:
        f.write('txnid,fileno,lineno,custid,eventtime\n')

    # random customer in [1, customers] and random event time within the date range
    custid = int(random() * customers) + 1
    td = timedelta(seconds=int(random() * seconds))
    et = start + td
    eventtime = et.strftime('%Y%m%d%H%M%S')

    val = str(txnid + 1) + ',' + str(fileno) + ',' + str(lineno) + ',' + str(custid) + ',' + eventtime + '\n'

    bf.write(val)
    f.write(val)

bf.close()
f.close()
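
For reference: the step 12 run produces txn001.csv through txn100.csv (2 million data rows each, plus a header line), along with bigfile.csv holding all 200 million rows in one file.  The step 16 run produces ten files of 20 thousand rows each, with transaction IDs continuing from 200,000,001 thanks to the -t offset.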