Ceteri Mapred

Getting Started on Hadoop

Paco Nathan [email protected]

Silicon Valley Cloud Computing Meetup


Mountain View, 2010-07-19

GitHub src repo: ceteri/ceteri-mapred

Presentation slides are available in this repo in Keynote format at doc/enron.key, or online at SlideShare: http://www.slideshare.net/pacoid/getting-started-on-hadoop

See the "WordCount" example at: bin/run_wc.sh
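The "WordCount" example follows the standard Hadoop Streaming pattern: a mapper and a reducer read stdin and write tab-separated key/value lines, with the framework's sort phase in between. A minimal self-contained sketch of that pattern (illustrative only, not the repo's exact scripts):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per token, as Streaming expects."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(pairs):
    """Sum counts per word; input must arrive sorted by key (the shuffle phase)."""
    for word, group in groupby(pairs, key=lambda kv: kv.split("\t")[0]):
        yield "%s\t%d" % (word, sum(int(kv.split("\t")[1]) for kv in group))

if __name__ == "__main__":
    # Simulate map -> shuffle (sort) -> reduce on a tiny sample:
    sample = ["five fish", "one fish"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

On a cluster, the same two functions would be split into separate scripts and wired together with Hadoop Streaming's -mapper and -reducer options, as the job flows below show.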

See the "Enron Email Dataset" demo at: bin/run_enron.sh

R statistics demo: thresh.R, thresh.tsv

Gephi graph demo: graph.gephi

To run your own code on Elastic MapReduce:

1. create a bucket in S3
2. copy the Python scripts into a "src" folder there
3. determine some subset of the email message input:
   cat msgs.tsv | head -n 1000 > input
4. copy "input" to your S3 "src" folder
5. follow the examples in the slide deck, based on the params below

Hadoop job flow 1 on Elastic MapReduce

-input s3n://ceteri-mapred/enron/src/input
-output s3n://ceteri-mapred/enron/src/output
-mapper '"python map_parse.py http://ceteri-mapred.s3.amazonaws.com/ stopwords"'
-reducer '"python red_idf.py 2500"'
-cacheFile s3n://ceteri-mapred/enron/src/map_parse.py#map_parse.py
-cacheFile s3n://ceteri-mapred/enron/src/red_idf.py#red_idf.py
-cacheFile s3n://ceteri-mapred/enron/src/stopwords#stopwords

Hadoop job flow 2 on Elastic MapReduce

-input s3n://ceteri-mapred/enron/src/output
-output s3n://ceteri-mapred/enron/src/filter
-mapper '"python map_filter.py"'
-reducer '"python red_filter.py 0.0633"'
-cacheFile s3n://ceteri-mapred/enron/src/map_filter.py#map_filter.py
-cacheFile s3n://ceteri-mapred/enron/src/red_filter.py#red_filter.py

After downloading the partition file named "filter" from S3, run the following command to build a lexicon:

cat filter/part-* | sort -k1 -k4 -nr > lexicon
