Pipeline Consists of various modules:
Data is captured in real time from the goodreads API using the Goodreads Python wrapper (View usage - Fetch Data Module). The data collected from the goodreads API is stored on local disk and is timely moved to the Landing Bucket on AWS S3. ETL jobs are written in spark and scheduled in airflow to run every 10 minutes.
EMR - I used a 3 node cluster with below Instance Types:
m5.xlarge 4 vCore, 16 GiB memory, EBS only storage EBS Storage:64 GiB
Redshift: For Redshift I used 2 Node cluster with Instance Types
I have written detailed instruction on how to setup Airflow using AWS CloudFormation script. Check out - Airflow using AWS CloudFormation
NOTE: This setup uses EC2 instance and a Postgres RDS instance. Make sure to check out charges before running the CloudFromation Stack.
sshtunnel to submit spark jobs using a ssh connection from the EC2 instance. This setup does not automatically install
sshtunnel for apache airflow. You can install by running below command:
pip install apache-airflow[sshtunnel]
Finally, copy the dag and plugin folder to EC2 inside airflow home directory. Also, checkout Airflow Connection for setting up connection to EMR and Redshift from Airflow.
Spinning up EMR cluster is pretty straight forward. You can use AWS Guide available here.
ETL jobs in the project uses psycopg2 to connect to Redshift cluster to run staging and warehouse queries. To install psycopg2 on EMR:
sudo pip-3.6 install psycopg2
postgresql-libs, and sometimes pscopg2 installation may fail if these dependencies are not available. To install run commands:
sudo yum install postgresql-libs sudo yum install postgresql-devel
ETL jobs also use boto3 move files between s3 buckets. To install boto3 run:
pip-3.6 install boto3 --user
Finally, pyspark uses python2 as default setup on EMR. To change to python3, setup environment variables:
export PYSPARK_DRIVER_PYTHON=python3 export PYSPARK_PYTHON=python3
Copy the ETL scripts to EMR and we have our EMR ready to run jobs.
Make sure Airflow webserver and scheduler is running.
Open the Airflow UI
http://< ec2-instance-ip >:< configured-port >
goodreadsfaker module in this project generates Fake data which is used to test the ETL pipeline on heavy load.
To test the pipeline I used
goodreadsfaker to generate 11.4 GB of data which is to be processed every 10 minutes (including ETL jobs + populating data into warehouse + running analytical queries) by the pipeline which equates to around 68 GB/hour and about 1.6 TB/day.
Data increase by 100x. read > write. write > read
Pipelines would be run on 7am daily. how to update dashboard? would it still work?
Make it available to 100+ people