Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Data Science Ipython Notebooks | 25,242 | | | | 3 months ago | | | 34 | other | Python |
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines. ||||||||||
Bigdata Notes | 14,410 | | | | 7 days ago | | | 37 | | Java |
A beginner's guide to big data :star: ||||||||||
Cookbook | 11,769 | | | | 5 months ago | | | 110 | apache-2.0 | |
The Data Engineering Cookbook ||||||||||
Hive | 5,079 | | | | 13 hours ago | | | 107 | apache-2.0 | Java |
Apache Hive ||||||||||
Scalding | 3,433 | | 37 | 40 | 4 months ago | 43 | September 14, 2016 | 319 | apache-2.0 | Scala |
A Scala API for Cascading ||||||||||
Mrjob | 2,584 | | 112 | 2 | a year ago | 62 | September 17, 2020 | 211 | other | Python |
Run MapReduce jobs on Hadoop or Amazon Web Services ||||||||||
Poseidon | 1,543 | | | | 6 years ago | | | 9 | bsd-3-clause | Go |
A search engine which can hold 100 trillion lines of log data. ||||||||||
Mongo Hadoop | 1,511 | | 78 | 10 | 2 years ago | 14 | January 27, 2017 | 16 | | Java |
MongoDB Connector for Hadoop ||||||||||
Bigdata Interview | 1,397 | | | | 2 years ago | | | | | |
:dart: :star2: [Big data interview questions] A collection of big-data interview questions gathered from around the web, with the author's own answer summaries; currently covers the Hadoop, Hive, Spark, Flink, HBase, Kafka, and Zookeeper frameworks. ||||||||||
Bigdata Growth | 1,045 | | | | 15 hours ago | | | 1 | mit | Shell |
A big-data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more. ||||||||||
GitHub src repo: ceteri/ceteri-mapred
Presentation slides are available in Keynote format at doc/enron.key, or online at SlideShare: http://www.slideshare.net/pacoid/getting-started-on-hadoop
See the "WordCount" example at: bin/run_wc.sh
See the "Enron Email Dataset" demo at: bin/run_enron.sh
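The WordCount demo drives Python scripts through Hadoop Streaming, where the mapper and reducer read lines on stdin and emit tab-separated key/value lines on stdout. The repo's actual scripts may differ; as a minimal sketch of that pattern (all names here are illustrative, not the repo's code):

```python
# Sketch of a Hadoop Streaming wordcount, both stages as plain functions.
# In a real job, mapper and reducer run as separate scripts passed via
# -mapper / -reducer, with Hadoop's shuffle sorting between them.
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' pair per token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(pairs):
    """Sum the counts for each word; input must arrive sorted by key,
    which Hadoop's shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv.split("\t")[0]):
        total = sum(int(kv.split("\t")[1]) for kv in group)
        yield "%s\t%d" % (word, total)

sample = ["the quick fox", "the lazy dog"]
mapped = sorted(mapper(sample))        # sorted() stands in for the shuffle
counts = dict(kv.split("\t") for kv in reducer(mapped))
# counts == {"dog": "1", "fox": "1", "lazy": "1", "quick": "1", "the": "2"}
```

Locally the same pipeline can be simulated with a shell sort between the two scripts; on a cluster, Hadoop Streaming performs that sort itself.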
R statistics demo: thresh.R, thresh.tsv
Gephi graph demo: graph.gephi
1. create a bucket in S3
2. copy the Python scripts into a "src" folder there
3. determine some subset of the email message input
head -n 1000 msgs.tsv > input
4. copy "input" to your S3 "src" folder
5. follow examples in slide deck, based on params below
-input s3n://ceteri-mapred/enron/src/input -output s3n://ceteri-mapred/enron/src/output -mapper '"python map_parse.py http://ceteri-mapred.s3.amazonaws.com/ stopwords"' -reducer '"python red_idf.py 2500"' -cacheFile s3n://ceteri-mapred/enron/src/map_parse.py#map_parse.py -cacheFile s3n://ceteri-mapred/enron/src/red_idf.py#red_idf.py -cacheFile s3n://ceteri-mapred/enron/src/stopwords#stopwords
-input s3n://ceteri-mapred/enron/src/output -output s3n://ceteri-mapred/enron/src/filter -mapper '"python map_filter.py"' -reducer '"python red_filter.py 0.0633"' -cacheFile s3n://ceteri-mapred/enron/src/map_filter.py#map_filter.py -cacheFile s3n://ceteri-mapred/enron/src/red_filter.py#red_filter.py
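The two parameter sets above describe a two-pass streaming job: the first maps parsed email text (with a stopword list) and reduces to IDF-weighted term scores, and the second filters those scores against a threshold (0.0633). For background only, here is a minimal sketch of the standard TF-IDF scoring that a reducer like red_idf.py typically computes — the function, variable names, and sample documents are all hypothetical, not taken from the repo:

```python
# Sketch of TF-IDF scoring: tf * log(N / df), where N is the number of
# documents and df is the number of documents containing the term.
import math
from collections import Counter

def tf_idf(docs):
    """Score each (doc_index, term) pair; `docs` is a list of token lists."""
    n_docs = len(docs)
    # document frequency: how many docs contain each term at least once
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    scores = {}
    for i, tokens in enumerate(docs):
        tf = Counter(tokens)
        for term, count in tf.items():
            scores[(i, term)] = (count / len(tokens)) * math.log(n_docs / df[term])
    return scores

docs = [["power", "trading", "desk"],
        ["power", "outage", "report"],
        ["meeting", "notes", "desk"]]
scores = tf_idf(docs)
# "trading" appears in only one of the three docs, so it outscores
# "power", which appears in two.
```

A fixed score threshold like the 0.0633 used in the second pass then keeps only the more distinctive terms when building the lexicon.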
cat filter/part-* | sort -k1 -k4 -nr > lexicon