Learning Hadoop And Spark

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

Learning Hadoop and Spark

Contents

This is the companion repo to my LinkedIn Learning courses on Apache Hadoop and Apache Spark.

🐘 1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running workloads on Hadoop and associated libraries (e.g. Hive, Pig, Spark)

🌩️ 2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP Dataproc, AWS EMR --or--
- Databricks on AWS

⛈️ 3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads


Development Environment Setup Information

You have a number of options. Although it is possible to set up a local Hadoop/Spark cluster, I do NOT recommend this approach, as it is needlessly complex for initial study. Instead, I recommend using a partially or fully managed cluster. For learning, I most often use a fully managed (free tier) cluster.

1. SaaS - Databricks --> MANAGED

Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure, or GCP (GCP support was announced in 2021 - link). In this course, I use Databricks running on AWS, as the Community Edition is simple and fast to set up for learning purposes.

  • Use Databricks Community Edition (managed, hosted Apache Spark), running on AWS. Example notebook shown in screenshot above.
    • uses Databricks (Jupyter-style) notebooks to connect to one or more custom-sized, managed Spark clusters
    • creates and manages your data files stored in cloud buckets as part of Databricks service
    • uses a distributed file system (DBFS on Databricks) for cluster data operations
    • use Databricks AWS community edition (simplest set up - free tier on AWS) - link --OR--
    • use Databricks Azure trial edition - Azure may require a pay-as-you-go account to get needed CPU/GPU resources
    • try Databricks on GCP beta - announced recently - link

2. PaaS Cloud on GCP (or AWS) --> PARTIALLY-MANAGED

  • Set up a managed Hadoop/Spark cloud cluster via GCP Dataproc or AWS EMR
    • see the setup-hadoop folder in this repo for instructions/scripts
      • create a GCS (or AWS S3) bucket for input/output job data
      • see the example_datasets folder in this repo for sample data files
    • for GCP, use Dataproc, which includes a Jupyter notebook interface --OR--
    • for AWS, use EMR with EMR Studio (which includes managed Jupyter instances) - link (example screenshot shown above)
    • for Azure, it is possible to use the HDInsight service. I prefer Databricks on Azure because I find it to be more feature-complete and performant.

3. IaaS local or cloud --> MANUAL

  • Set up Hadoop/Spark locally or on a 'raw' cloud VM, such as AWS EC2
    • NOT RECOMMENDED for learning - too complex to set up
    • Cloudera Learning VM - also NOT recommended: it changes too often and its documentation is not kept aligned

Example Jobs or Scripts

EXAMPLES from org.apache.hadoop_or_spark.examples - link for Spark examples

  • Run a Hadoop WordCount Job with Java (jar file)
  • Run a Hadoop and/or Spark CalculatePi (digits) Script with PySpark or other libraries
  • Run using Cloudera shared demo env
    • at https://demo.gethue.com/
    • login is user:demo, pwd:demo
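
If you do not have a cluster at hand yet, the logic behind the WordCount and Pi examples listed above can be sketched in plain Python. This is only an illustration under stated assumptions: it is not the Hadoop jar or the PySpark script from the examples package, the function names (`map_words`, `shuffle`, `word_count`, `estimate_pi`) are hypothetical, and the Pi function uses a Monte Carlo estimate rather than computing digits. It runs on a single machine with the standard library only.

```python
import random
import re
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def shuffle(pairs):
    """Shuffle phase: group the emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def word_count(lines):
    """Reduce phase: sum the grouped counts for each word."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return {word: sum(counts) for word, counts in shuffle(pairs).items()}


def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi.

    Draws random points in the unit square and counts how many fall
    inside the quarter circle of radius 1; that fraction approximates pi/4.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(word_count(["to be or not to be", "To be is to do"]))
    print(estimate_pi(100_000))
```

On a real cluster the map and reduce steps run in parallel across workers and the framework performs the shuffle; here all three are ordinary function calls so the data flow is easy to follow.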

Other LinkedIn Learning Courses on Hadoop or Spark

There are about 10 courses on Hadoop/Spark topics on LinkedIn Learning. See the graphic below.
Learning Paths

  • Hadoop for Data Science Tips and Tricks - link
    • Set up Cloudera Environment
    • Working with Files in HDFS
    • Connecting to Hadoop Hive
    • Complex Data Structures in Hive
  • Spark courses - link
    • Various Topics - see screenshot below

LinkedInLearningSpark
