Project Name | Stars | Downloads | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language | Description
---|---|---|---|---|---|---|---|---|---
Elastic Mapreduce Ruby | 86 | | 9 years ago | | | 8 | apache-2.0 | Ruby | Amazon's Elastic MapReduce Ruby client. Ruby 1.9.x compatible.
Lemur | 85 | | 6 years ago | | | 8 | apache-2.0 | Clojure | A tool to launch Hadoop jobs locally or on EMR, based on a configuration file (a "jobdef") that describes your EMR cluster, local environment, pre- and post-actions, and zero or more "steps".
Rail | 70 | | 3 years ago | | | 26 | other | Python | Scalable RNA-seq analysis.
Social Graph Analysis | 56 | | 12 years ago | | | | other | Python | Social graph analysis using Elastic MapReduce and PyPy.
Elasticrawl | 50 | 1 | 7 years ago | 10 | February 15, 2017 | 1 | mit | Ruby | Launch AWS Elastic MapReduce jobs that process Common Crawl data.
Terraform Aws Emr Cluster | 35 | | 4 years ago | | | 3 | apache-2.0 | HCL | A Terraform module to create an Amazon Web Services (AWS) Elastic MapReduce (EMR) cluster.
Cc Helloworld | 33 | | 9 years ago | | | 1 | | Java | Common Crawl "Hello World" example.
Emrio | 30 | | 9 years ago | | | | | Python | Elastic MapReduce instance optimizer.
Ceteri Mapred | 19 | | 12 years ago | | | | | Python | MapReduce examples.
Spark Emr | 17 | | 10 years ago | | | 9 | | Scala | Spark Elastic MapReduce bootstrap and runnable examples.
This project demonstrates some simple social graph analysis based on the largest publicly available crawl of the Twitter social graph, collected by Kwak, Haewoon; Lee, Changhyun; Park, Hosung; and Moon, Sue in 2009.
This 5 gigabyte compressed (26 gigabyte uncompressed) dataset makes for a good excuse to use MapReduce and MrJob for processing. MrJob makes it easy to test MapReduce jobs locally as well as run them on a local Hadoop cluster or on Amazon's Elastic MapReduce.
This project contains two MapReduce jobs:

- `jobs/follower_count.py`
- `jobs/follower_histogram.py`
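To give a sense of what these jobs look like, here is a minimal MrJob-style sketch of a follower counter. This is a hypothetical illustration, not the repo's actual `follower_count.py`, and it assumes each input line is a whitespace-separated `user_id follower_id` pair, as in the Kwak et al. crawl:

```python
# Hypothetical sketch -- not the repo's actual follower_count.py.
# Assumes each input line is "user_id<TAB>follower_id".
from mrjob.job import MRJob


class MRFollowerCount(MRJob):

    def mapper(self, _, line):
        # One edge per line: the first field is the user being followed.
        user_id, _follower_id = line.split()
        yield user_id, 1

    def reducer(self, user_id, counts):
        # Sum the per-edge counts to get each user's follower total.
        yield user_id, sum(counts)


if __name__ == '__main__':
    MRFollowerCount.run()
```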
The following assumes you have a modern Python and have already installed MrJob (`pip install MrJob`, `easy_install MrJob`, or install it from source).
To run the sample data locally:
```
$ python jobs/follower_count.py data/twitter_synthetic.txt
```
This should print out a summary of how many followers each user (represented by id) has:

```
5	2
6	1
7	3
8	2
9	1
```
You can also run a larger sample (the first 10 million rows of the full dataset mentioned above) locally, though it will likely take several minutes to process:

```
$ python jobs/follower_count.py data/twitter_sample.txt.gz
```
After editing `conf/mrjob-emr.conf`, you can also run the sample on Elastic MapReduce:

```
$ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr -o s3://your-bucket/your-output-location --no-output data/twitter_sample.txt.gz
```
You can also upload data to an S3 bucket and reference it that way:
```
$ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr -o s3://your-bucket/your-output-location --no-output s3://your-bucket/twitter_sample.txt.gz
```
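For reference, an EMR runner configuration for MrJob generally takes a shape like the sketch below. This is a hypothetical minimal example, not the contents of this repo's `conf/mrjob-emr.conf`; the option names follow mrjob's classic EMR runner options, and all values are placeholders:

```yaml
# Hypothetical sketch of an mrjob EMR config -- not this repo's actual conf file.
runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY      # placeholder
    aws_secret_access_key: YOUR_SECRET_KEY  # placeholder
    aws_region: us-east-1
    ec2_instance_type: m1.small
    num_ec2_instances: 4
```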
You may also download the full dataset and run either the follower count or the histogram job against it. The following general steps are required:

1. Download the full dataset.
2. Split it into smaller input files (with `split -l 10000000`, for example).
3. Upload the split files to an S3 bucket.
4. Run the job against that bucket:

```
$ python jobs/follower_histogram.py -c conf/mrjob-emr.conf -r emr -o s3://your-bucket/your-output-location --no-output s3://your-split-input-bucket/
```
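For a sense of how the histogram job is shaped, here is a hypothetical two-step MrJob sketch (count each user's followers, then bucket users by that count). It is not the repo's actual `follower_histogram.py`, and it assumes the same whitespace-separated `user_id follower_id` input format as above:

```python
# Hypothetical sketch -- not the repo's actual follower_histogram.py.
# Assumes each input line is "user_id<TAB>follower_id".
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRFollowerHistogram(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_followers,
                   reducer=self.reducer_sum_followers),
            MRStep(mapper=self.mapper_bucket_counts,
                   reducer=self.reducer_sum_buckets),
        ]

    def mapper_count_followers(self, _, line):
        # Emit one count for the followed user on each edge.
        user_id, _follower_id = line.split()
        yield user_id, 1

    def reducer_sum_followers(self, user_id, counts):
        # Total followers for this user.
        yield user_id, sum(counts)

    def mapper_bucket_counts(self, user_id, follower_count):
        # Key the second pass by follower count to build the histogram.
        yield follower_count, 1

    def reducer_sum_buckets(self, follower_count, counts):
        # Number of users having this many followers.
        yield follower_count, sum(counts)


if __name__ == '__main__':
    MRFollowerHistogram.run()
```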
While there are lots of other things to explore in the data, I also wanted to be able to run PyPy on Elastic MapReduce. Through the use of bootstrap actions, we can prepare our environment to use PyPy and tell MrJob to execute jobs with PyPy instead of system Python. The following settings need to be added to your configuration file (and vary between 32-bit and 64-bit):

```
# Use PyPy instead of system Python
bootstrap_scripts:
- bootstrap-pypy-64bit.sh
python_bin: /home/hadoop/bin/pypy
```
This configuration change (available in `conf/mrjob-emr-pypy-32bit.conf` and `conf/mrjob-emr-pypy-64bit.conf`) also makes use of a custom bootstrap script (found in `conf/bootstrap-pypy-32bit.sh` and `conf/bootstrap-pypy-64bit.sh`).
A single run of `follower_histogram.py` with 8 `c1.xlarge` instances took approximately 66 minutes using Elastic MapReduce's system Python. A single run with the same configuration using PyPy took approximately 44 minutes. While not a scientific comparison, that's a pretty impressive speedup for such a simple task. PyPy should speed things up even more for more complex tasks.