Big Data Test – SCD – Generating Data
According to the introduction, this test needs to generate some test data. The requirements for the test data are:
- The destination table will have 200 million records. The new data to be added will be 200 thousand records.
- All of the data will be provided with an EffectiveDate, so the initial load of the 200 million records into a blank destination table will be performed by the same process that loads the 200 thousand records into the destination table that already holds 200 million records.
- Data will be generated by a Python script, resulting in a number of files that will be moved to an S3 bucket. A given CustomerID will not be restricted to a single file.
- Data with the exact same EffectiveDate may exist.
- The EffectiveDate range for the 200 million will be 1/1/2015 to 11/30/2015
- The EffectiveDate range for the 200 thousand will be 11/1/2015 to 12/31/2015
- The CustomerID should range from 1 to 1 million
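Concretely, each record the script below emits is one CSV line with the schema txnid,fileno,lineno,custid,eventtime. For example (the custid and eventtime values here are made-up illustrations):

txnid,fileno,lineno,custid,eventtime
1,1,1,417582,20150612083015
2,1,2,88310,20150103194522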
I created a script, packaged as generate.zip, to fulfill the requirements. I’ll let the script speak for itself; you can download it or take a look at it below.
I need a place to run this Python script. My first thought was that maybe I could do this in Lambda, but this is going to be gigabytes of data and it really would be pushing Lambda’s limits quite a bit, so I’ll spin up an EC2 spot instance.
I used the following instance properties:
- Amazon Linux AMI 64 bit
- i3.large (don’t need much CPU, but the NVMe instance store is nice for staging the data volume)
- A security group that allows inbound SSH
- An IAM role that grants S3 permissions
- A key pair for SSH
I was able to obtain an i3.large spot instance in us-east-1 for $0.02 per hour. Probably not going to need it for more than an hour. Cheap.
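If you’d rather script the request than click through the console, here’s a minimal boto3 sketch of the same launch. The AMI ID, key pair name, security group ID, and instance profile name are placeholders you’d swap for your own:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Request one one-time i3.large spot instance matching the properties above.
# Every identifier below is a placeholder.
response = ec2.request_spot_instances(
    SpotPrice='0.05',  # bid ceiling; the going rate was about $0.02/hr
    InstanceCount=1,
    Type='one-time',
    LaunchSpecification={
        'ImageId': 'ami-xxxxxxxx',             # Amazon Linux AMI (64-bit)
        'InstanceType': 'i3.large',
        'KeyName': 'my-key-pair',              # for SSH access
        'SecurityGroupIds': ['sg-xxxxxxxx'],   # allows inbound SSH
        'IamInstanceProfile': {'Name': 'my-s3-role'},  # grants S3 access
    },
)
print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])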
Steps to generate the data:
- sudo su (yeah, bad idea, but this is all temporary)
- yum update -y (always)
- yum install python35 -y
- lsblk (find the instance store, mine was nvme0n1)
- mkfs -t ext4 /dev/nvme0n1
- mkdir /generate
- mount /dev/nvme0n1 /generate
- cd /generate
- wget https://jimburnham.cloud/generate.zip
- unzip generate.zip
- chmod +x generate.py
- ./generate.py -r 200000000 -l 2000000 -c 1000000 -s 1/1/2015 -e 11/30/2015
- bzip2 data/*
- aws s3 cp data s3://<put your bucket here>/initial/ --recursive
- rm -rf data
- ./generate.py -t 200000000 -r 200000 -l 20000 -c 1000000 -s 11/1/2015 -e 12/31/2015
- bzip2 data/*
- aws s3 cp data s3://<put your bucket here>/batch/ --recursive
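Before pulling the plug, it’s worth confirming the uploads landed. Here’s a quick boto3 sketch (the bucket name is a placeholder); with the parameters above you should see 101 objects under initial/ (100 txn files plus bigfile.csv) and 11 under batch/:

import boto3

s3 = boto3.client('s3')
# Count the uploaded objects under each prefix.
for prefix in ('initial/', 'batch/'):
    count = 0
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-bucket', Prefix=prefix):
        count += page.get('KeyCount', 0)
    print(prefix, count)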
You can now cancel the spot request and terminate the instance.
Here is the code:
#!/usr/bin/python3
# Data generator for the SCD big data test. Writes numbered CSV transaction
# files plus a single combined bigfile.csv containing every record.
from random import random
from datetime import datetime, timedelta
from argparse import ArgumentParser
from os.path import isdir
from os import mkdir

parser = ArgumentParser(description='Data generator for SCD big data test')
parser.add_argument('-r', '--records', help='Total number of records to generate', required=True)
parser.add_argument('-l', '--lines', help='Number of records per file', required=True)
parser.add_argument('-c', '--customers', help='Maximum number of customers', required=True)
parser.add_argument('-s', '--startdate', help='Start date in MM/DD/YYYY format', required=True)
parser.add_argument('-e', '--enddate', help='End date in MM/DD/YYYY format', required=True)
# -t lets a later batch continue txnids where the initial load left off,
# e.g. -t 200000000 makes the batch start at txnid 200000001
parser.add_argument('-t', '--txnoffset', help='Offset added to every txnid', default='0')
args = parser.parse_args()

records = int(args.records)
lines = int(args.lines)
customers = int(args.customers)
offset = int(args.txnoffset)

# Parse the date range; event times are drawn uniformly from start to end
month, day, year = args.startdate.split('/')
start = datetime(int(year), int(month), int(day), 0, 0, 0)
month, day, year = args.enddate.split('/')
end = datetime(int(year), int(month), int(day), 23, 59, 59)
seconds = (end - start).total_seconds()

if not isdir('data'):
    mkdir('data')

# bigfile.csv collects every record regardless of which txn file it lands in
bf = open('data/bigfile.csv', 'w')
bf.write('txnid,fileno,lineno,custid,eventtime\n')

filebase = 'data/txn'
fileno = 1
lineno = 0
filename = filebase + str(fileno).zfill(3) + '.csv'
f = open(filename, 'w')
for txnid in range(records):
    lineno = lineno + 1
    if lineno > lines:
        # Current file is full; roll over to the next numbered file
        f.close()
        lineno = 1
        fileno = fileno + 1
        filename = filebase + str(fileno).zfill(3) + '.csv'
        f = open(filename, 'w')
    if lineno == 1:
        f.write('txnid,fileno,lineno,custid,eventtime\n')
    custid = int(random() * customers) + 1           # uniform over 1..customers
    td = timedelta(seconds=int(random() * seconds))  # random offset into the range
    et = start + td
    eventtime = et.strftime('%Y%m%d%H%M%S')
    val = str(txnid + offset + 1) + ',' + str(fileno) + ',' + str(lineno) + ',' + str(custid) + ',' + eventtime + '\n'
    bf.write(val)
    f.write(val)
bf.close()
f.close()
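For the batch run above (-t 200000000), the first few lines of data/txn001.csv would come out something like this. The txnid, fileno, and lineno columns are deterministic; custid and eventtime are random draws, so the values shown are illustrative:

txnid,fileno,lineno,custid,eventtime
200000001,1,1,73502,20151204110657
200000002,1,2,918244,20151107221439
200000003,1,3,41199,20151225063320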