Learning Hadoop And Spark

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

This is the companion repo to my LinkedIn Learning Courses on Apache Hadoop and Apache Spark.

🐘 1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running workloads on Hadoop and associated libraries (e.g. Hive, Pig, Spark...)

🌩️ 2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP Dataproc, AWS EMR --or--
- Databricks on AWS

⛈️ 3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads

Development Environment Setup Information

You have a number of options. Although it is possible to set up a local Hadoop/Spark cluster, I do NOT recommend this approach, as it is needlessly complex for initial study. Instead, I recommend that you use a partially or fully managed cluster. For learning, I most often use a fully managed (free tier) cluster.

1. SaaS - Databricks --> MANAGED

Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure, or GCP (GCP support was announced in 2021 - link). In this course, I use Databricks running on AWS, as the Community Edition is simple and fast to set up for learning purposes.

  • Use Databricks Community Edition (managed, hosted Apache Spark), run on AWS. Example notebook shown in screenshot above.
    • uses Databricks (Jupyter-style) notebooks to connect to one or more custom-sized, managed Spark clusters
    • creates and manages your data files, stored in cloud buckets, as part of the Databricks service
    • uses a distributed file system (DFS) in cluster data operations
    • use Databricks AWS Community Edition (simplest setup - free tier on AWS) - link --OR--
    • use Databricks Azure trial edition - Azure may require a pay-as-you-go account to get needed CPU/GPU resources
    • try Databricks on GCP beta - announced recently - link

2. PaaS Cloud on GCP (or AWS) --> PARTIALLY-MANAGED

  • Set up a managed Hadoop/Spark cloud cluster via GCP Dataproc or AWS EMR
    • see setup-hadoop folder in this Repo for instructions/scripts
      • create a GCS (or AWS) bucket for input/output job data
      • see example_datasets folder in this Repo for sample data files
    • for GCP, use Dataproc, which includes a Jupyter notebook interface --OR--
    • for AWS, use EMR with EMR Studio (which includes managed Jupyter instances) - link; example screenshot shown above
    • for Azure, it is possible to use the HDInsight service. I prefer Databricks on Azure because I find it to be more feature-complete and performant.
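
As a rough illustration of the GCP route, the sketch below only assembles the gcloud command lines for creating a bucket and a single-node Dataproc cluster with the Jupyter component; it does not run them. The bucket name, cluster name, and region are hypothetical placeholders, and the scripts in the setup-hadoop folder remain the reference for the actual course setup.

```python
import shlex


def bucket_create_command(bucket, location):
    """Build the gcloud command that creates a GCS bucket for job input/output."""
    return [
        "gcloud", "storage", "buckets", "create",
        f"gs://{bucket}",
        f"--location={location}",
    ]


def dataproc_create_command(cluster, region):
    """Build the gcloud command for a small Dataproc cluster with Jupyter enabled."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        "--single-node",                  # one VM is enough for study workloads
        "--optional-components=JUPYTER",  # installs the Jupyter notebook component
        "--enable-component-gateway",     # exposes the notebook UI via the console
    ]


if __name__ == "__main__":
    # Hypothetical names; replace with your own project's values.
    commands = (
        bucket_create_command("my-learning-bucket", "us-central1"),
        dataproc_create_command("my-learning-cluster", "us-central1"),
    )
    for command in commands:
        print(shlex.join(command))
```

Printing the commands rather than executing them makes it easy to review cost-relevant options before anything is created in your cloud account.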

3. IaaS local or cloud --> MANUAL

  • Set up Hadoop/Spark locally or on a 'raw' cloud VM, such as AWS EC2
    • NOT RECOMMENDED for learning - too complex to set up
    • Cloudera Learning VM - also NOT recommended; it changes too often and its documentation is not aligned

Example Jobs or Scripts

EXAMPLES from org.apache.hadoop_or_spark.examples - link for Spark examples

  • Run a Hadoop WordCount Job with Java (jar file)
  • Run a Hadoop and/or Spark CalculatePi (digits) Script with PySpark or other libraries
  • Run using Cloudera shared demo env
    • at https://demo.gethue.com/
    • login is user:demo, pwd:demo
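
To make the two example jobs above concrete without a cluster, here is a minimal pure-Python sketch of the ideas behind them: word count expressed as map, shuffle, and reduce phases, and a Monte Carlo estimate of pi in the spirit of Spark's SparkPi example. This is not the code from the course or from the Apache examples packages; the function names are hypothetical, and the assumption that the pi job uses Monte Carlo sampling is mine.

```python
import random
import re
from collections import defaultdict


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def estimate_pi(samples, seed=0):
    """Estimate pi from the share of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
    print(estimate_pi(100_000))
```

On a real cluster the map and reduce steps run in parallel across workers and the shuffle moves data over the network; here everything runs in one process, which is only useful for understanding the data flow.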

Other LinkedIn Learning Courses on Hadoop or Spark

There are about 10 courses on Hadoop/Spark topics on LinkedIn Learning. See the graphic below.
Learning Paths

  • Hadoop for Data Science Tips and Tricks - link
    • Set up Cloudera Environment
    • Working with Files in HDFS
    • Connecting to Hadoop Hive
    • Complex Data Structures in Hive
  • Spark courses - link
    • Various Topics - see screenshot below

