Big Data Test – SCD – Generating Data
As mentioned in the introduction, this test needs some generated test data. The requirements for the test data are:
- The destination table will have 200 million records. The new data to be added will be 200 thousand records.
- All of the data will be provided with an EffectiveDate, so the load of the 200 million records into a blank destination table will be performed by the same process that loads the 200 thousand records into the destination table that already holds 200 million records.
- Data will be generated by a Python script, producing a number of files that will be moved to an S3 bucket. Records for a given CustomerID will not be restricted to a single file.
- Records with the exact same EffectiveDate may exist.
- The EffectiveDate range for the 200 million will be 1/1/2015 to 11/30/2015
- The EffectiveDate range for the 200 thousand will be 11/1/2015 to 12/31/2015
- The CustomerID should range from 1 to 1 million
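For reference, every generated file is a CSV with a header row. Based on the script below, rows look like this (the values here are illustrative, not actual output):

txnid,fileno,lineno,custid,eventtime
1,1,1,482917,20150317142233
2,1,2,90412,20150105081904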
I created a script, packaged as generate.zip, to fulfill the requirements. I'll let the script speak for itself; you can download it or take a look at it below.
I need a place to run this Python script. My first thought was Lambda, but this is going to be gigabytes of data and would be pushing Lambda's limits quite a bit, so I'll spin up an EC2 spot instance instead.
I used the following instance properties:
- Amazon Linux AMI 64 bit
- i3.large (don't need much CPU, but the instance store is nice for the data volume)
- A security group that allows SSH in
- An IAM role that grants S3 permissions
- A key pair for SSH
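If you'd rather request the instance from the CLI than the console, something like the following should work. This is a sketch: the AMI ID, key pair name, security group ID, and instance profile name are all placeholders to substitute with your own.

# Placeholders: a current Amazon Linux AMI for your region, your key pair,
# an SSH-friendly security group, and an instance profile with S3 access.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type i3.large \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx \
    --iam-instance-profile Name=my-s3-role \
    --instance-market-options 'MarketType=spot'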
I was able to obtain an i3.large spot instance in us-east-1 for $0.02 per hour. I probably won't need it for more than an hour. Cheap.
Steps to generate the data:
- sudo su (yeah, bad idea, but this is all temporary)
- yum update -y (always)
- yum install python35 -y
- lsblk (find the instance store, mine was nvme0n1)
- mkfs -t ext4 /dev/nvme0n1
- mkdir /generate
- mount /dev/nvme0n1 /generate
- cd /generate
- wget https://jimburnham.cloud/generate.zip
- unzip generate.zip
- chmod +x generate.py
- ./generate.py -r 200000000 -l 2000000 -c 1000000 -s 1/1/2015 -e 11/30/2015
- bzip2 data/*
- aws s3 cp data s3://<put your bucket here>/initial/ --recursive
- rm -rf data
- ./generate.py -t 200000000 -r 200000 -l 20000 -c 1000000 -s 11/1/2015 -e 12/31/2015 (the -t offset starts transaction IDs after the initial 200 million so the two data sets don't collide)
- bzip2 data/*
- aws s3 cp data s3://<put your bucket here>/batch/ --recursive
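Before moving on, it's worth confirming the uploads landed. A quick check (same bucket placeholder as above): the initial/ prefix should hold 101 objects (100 txn files plus bigfile.csv, all bzip2-compressed) and batch/ should hold 11.

aws s3 ls s3://<put your bucket here>/initial/ | wc -l
aws s3 ls s3://<put your bucket here>/batch/ | wc -l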
You can now cancel and terminate the spot instance.
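If you're doing that from the CLI as well, it's two calls (the request and instance IDs are placeholders):

aws ec2 cancel-spot-instance-requests --spot-instance-request-ids sir-xxxxxxxx
aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx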
Here is the code:
#!/usr/bin/python3
from random import random
from datetime import datetime, timedelta
from argparse import ArgumentParser
from os.path import isdir
from os import mkdir

parser = ArgumentParser(description='Data generator for SCD big data test')
parser.add_argument('-r', '--records', help='Total number of records to generate', required=True)
parser.add_argument('-l', '--lines', help='Number of records per file', required=True)
parser.add_argument('-c', '--customers', help='Maximum number of customers', required=True)
parser.add_argument('-s', '--startdate', help='Start date in MM/DD/YYYY format', required=True)
parser.add_argument('-e', '--enddate', help='End date in MM/DD/YYYY format', required=True)
parser.add_argument('-t', '--txnstart', help='Transaction ID offset; use for the incremental batch so IDs continue after the initial load', default=0)
args = parser.parse_args()

records = int(args.records)
lines = int(args.lines)
customers = int(args.customers)
txnstart = int(args.txnstart)

# Parse the date range; event times are spread uniformly across it.
month, day, year = args.startdate.split('/')
start = datetime(int(year), int(month), int(day), 0, 0, 0)
month, day, year = args.enddate.split('/')
end = datetime(int(year), int(month), int(day), 23, 59, 59)
seconds = (end - start).total_seconds()

if not isdir('data'):
    mkdir('data')

# bigfile.csv gets a copy of every record; the txnNNN.csv files get the same
# records split into chunks of --lines records each.
bf = open('data/bigfile.csv', 'w')
bf.write('txnid,fileno,lineno,custid,eventtime\n')

filebase = 'data/txn'
fileno = 1
lineno = 0
filename = filebase + str(fileno).zfill(3) + '.csv'
f = open(filename, 'w')

for txnid in range(txnstart, txnstart + records):
    lineno = lineno + 1
    if lineno > lines:
        # Current chunk is full; roll over to the next file.
        f.close()
        lineno = 1
        fileno = fileno + 1
        filename = filebase + str(fileno).zfill(3) + '.csv'
        f = open(filename, 'w')
    if lineno == 1:
        f.write('txnid,fileno,lineno,custid,eventtime\n')
    # Random customer in [1, customers] and a random event time within the range.
    custid = int(random() * customers) + 1
    td = timedelta(seconds=int(random() * seconds))
    eventtime = (start + td).strftime('%Y%m%d%H%M%S')
    val = str(txnid + 1) + ',' + str(fileno) + ',' + str(lineno) + ',' + str(custid) + ',' + eventtime + '\n'
    bf.write(val)
    f.write(val)

bf.close()
f.close()
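One thing to note: in addition to the numbered txn files, the script writes every record to data/bigfile.csv, so plan for roughly double the raw data size on the instance store. At around 40 bytes per row, the 200 million record run works out to something like 8 GB per copy before compression, which the i3.large instance store handles with plenty of room to spare.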