This repository contains source code for the assignments of Udacity's course, Introduction to Hadoop and MapReduce, which was unveiled on 15th November, 2013.
This is a short course by Cloudera in association with Udacity. The instructors are Sarah Sproehnle and Ian Wrigley, both from Cloudera, and Gundega Dekena, a Course Developer at Udacity.
The course does not mandate any programming language for writing Hadoop MapReduce jobs, but the instructors mainly used and taught Python [i.e. the Hadoop Streaming approach for running jobs] during the course.
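For readers new to Hadoop Streaming: the mapper and reducer are plain programs that read lines from standard input and write tab-separated `key<TAB>value` lines to standard output, with Hadoop sorting the mapper output by key in between. The sketch below only imitates that contract in-process so a job can be tried without a cluster. The names `run_streaming_locally`, `count_mapper` and `count_reducer` are hypothetical and are not part of this repo or of Hadoop.

```python
from itertools import groupby


def run_streaming_locally(mapper, reducer, lines):
    """Imitate `cat input | mapper | sort | reducer` inside one process."""
    # sorted() stands in for Hadoop's shuffle-and-sort phase (assumption:
    # a single reducer receives every key).
    mapped = sorted(mapper(lines))
    return list(reducer(mapped))


def count_mapper(lines):
    # Illustrative mapper: emit "key\t1" for every non-empty input line.
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def count_reducer(sorted_lines):
    # After the sort, records sharing a key are adjacent, so groupby works.
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(n) for _, n in group)}"
```

In a real streaming script the same generator functions would be fed `sys.stdin` and their results printed line by line.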
I have developed Hadoop MapReduce code for the 2 problem statements [3 questions each] in 2 programming languages;
Python as well as
Please refer to the instructions document provided by the course instructors for details on the Hadoop Virtual Machine [VM henceforth] setup required for running these examples.
As mentioned in that document, a VM image with Hadoop installed and preconfigured can be downloaded from the Udacity CDN.
Please be forewarned: the compressed VM archive is 1.7 GB, and it does not uncompress with either 7-Zip or the default Windows Zip utility. If you are on Windows, you might have to use WinRAR, WinZip or even Cygwin unzip to uncompress it. On other operating systems, the unzip command will probably work just fine. The uncompressed size of this VM is 4.2 GB.
Credentials to log in to this Virtual Machine are:
training. You will not need
root access for any of the assignments of this course. But just in case you need it, the password for
Please ensure that you configure the VM with at least 1.5 GB of RAM in VMware Player, though it might run much better with 2 GB. I used VMware Player v5.0.2; the latest version as of this writing [i.e. 28th November, 2013] is v6.0.1.
Update at 11/27/2013 10:00:26 PM IST: I had to remove these input files from the repo because the GitHub Windows client was not able to sync the repo with these compressed archives [or rather, it got badly stuck on invalid characters].
These compressed input archives can also be downloaded from Udacity servers. Please check here for the input file for Problem Statement 1 and here for Problem Statement 2.
These links are also mentioned in the instructions document provided by Udacity Course Instructors.
Output for ProblemStatement#1 and ProblemStatement#2 has also been uploaded to this GitHub repo for quick reference and validation.
This is the Hadoop MapReduce job output obtained for each specific question.
Instead of breaking the sales down by store, give us a sales breakdown by product category across all of our stores.
Please check pur_p1q1.tsv for the output of this problem statement.
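A minimal sketch of this question in the Hadoop Streaming style, not necessarily the code in this repo. It assumes each purchase record has six tab-separated fields in the order date, time, store, category, cost, payment; that layout and the function names are assumptions for illustration.

```python
def mapper(lines):
    # Emit "category\tcost" for every well-formed purchase record.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip malformed lines instead of failing the job
        _date, _time, _store, category, cost, _payment = fields
        yield f"{category}\t{cost}"


def reducer(sorted_lines):
    # Input is sorted by key, so a change of key closes the running total.
    current, total = None, 0.0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total:.2f}"
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        yield f"{current}\t{total:.2f}"
```

Keeping only one running total means the reducer uses constant memory no matter how many categories there are.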
Find the monetary value for the highest individual sale for each separate store.
Please check pur_p1q2.tsv for the output of this problem statement.
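One way this could be approached, again as a hedged sketch rather than the repo's actual code: key each sale by store and let the reducer keep the maximum. The six-field record layout (date, time, store, category, cost, payment) is an assumption.

```python
from itertools import groupby


def mapper(lines):
    # Emit "store\tcost"; store is assumed to be field 3 and cost field 5.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6:
            yield f"{fields[2]}\t{fields[4]}"


def reducer(sorted_lines):
    # Sorted input keeps each store's sales adjacent, so groupby can
    # collect them and max() picks the highest individual sale.
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for store, group in groupby(pairs, key=lambda kv: kv[0]):
        highest = max(float(cost) for _, cost in group)
        yield f"{store}\t{highest:.2f}"
```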
Find the total sales value across all the stores, and the total number of sales. Assume there is only one reducer.
Please check pur_p1q3.tsv for the output of this problem statement.
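Because the question allows a single reducer, no grouping key is needed: the mapper can emit just the cost and the reducer can count and sum everything it sees. The sketch below assumes the same six-field purchase layout as an illustration and is not necessarily how the repo solves it.

```python
def mapper(lines):
    # Emit only the cost (assumed to be field 5) of each valid record.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6:
            yield fields[4]


def reducer(lines):
    # With one reducer, every mapper output line arrives here, so a single
    # counter and running total cover the whole data set.
    count, total = 0, 0.0
    for line in lines:
        count += 1
        total += float(line.strip())
    yield f"{count}\t{total:.2f}"
```

With more than one reducer each would report only a partial count and total, which is why the single-reducer assumption matters here.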
Write a MapReduce program which will display the number of hits for each different file on the Web site.
Please check acc_p2q1.tsv for the output of this problem statement.
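A possible streaming sketch for counting hits per file. It assumes the access log is in Common Log Format, where the request appears in double quotes as `"METHOD path PROTOCOL"`; that format and the function names are assumptions, not something this README confirms.

```python
from itertools import groupby


def mapper(lines):
    # The quoted request is the second piece when splitting on '"'.
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # no quoted request on this line
        request = parts[1].split()
        if len(request) >= 2:
            yield f"{request[1]}\t1"  # request[1] is the requested path


def reducer(sorted_lines):
    # Sum the 1s emitted for each path; sorted input keeps paths adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for path, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{path}\t{sum(int(n) for _, n in group)}"
```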
Write a MapReduce program which determines the number of hits to the site made by each different IP Address.
Please check acc_p2q2.tsv for the output of this problem statement.
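For hits per IP address, the sketch below swaps in a variation sometimes called in-mapper combining: the mapper pre-aggregates counts in a `Counter` and emits one partial count per IP, and the reducer adds the partial counts. It assumes the client IP is the first whitespace-separated field of each log line. This is an illustration only and holds every distinct IP seen by one mapper in memory.

```python
from collections import Counter
from itertools import groupby


def mapper(lines):
    # Count locally first so far fewer records reach the shuffle phase.
    hits = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            hits[fields[0]] += 1  # fields[0] is assumed to be the client IP
    for ip, n in hits.items():
        yield f"{ip}\t{n}"


def reducer(sorted_lines):
    # Add up the partial counts that different mappers emitted per IP.
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for ip, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{ip}\t{sum(int(n) for _, n in group)}"
```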
Find the most popular file on the Web site. In other words, the file which had the most hits. Your Reducer should just write out the name of the file and number of hits into HDFS.
Please check acc_p2q3.tsv for the output of this problem statement.
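A hedged sketch for the most popular file: the mapper emits one record per requested path, and the reducer remembers only the best path seen so far and writes a single line at the end. It assumes Common Log Format input and a single reducer. Ties go to the path that sorts first. None of this is confirmed to match the repo's own code.

```python
def mapper(lines):
    # Emit "path\t1" using the quoted request, e.g. "GET /x HTTP/1.1".
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue
        request = parts[1].split()
        if len(request) >= 2:
            yield f"{request[1]}\t1"


def reducer(sorted_lines):
    best_path, best_hits = None, 0
    current, hits = None, 0
    for line in sorted_lines:
        path, n = line.rstrip("\n").split("\t")
        if path != current:
            # The previous path is complete; keep it if it is the new leader.
            if hits > best_hits:
                best_path, best_hits = current, hits
            current, hits = path, 0
        hits += int(n)
    if hits > best_hits:  # account for the final path in the input
        best_path, best_hits = current, hits
    if best_path is not None:
        yield f"{best_path}\t{best_hits}"
```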
Copyright © 2013 Prashanth Babu.
Licensed under the Apache License, Version 2.0.