Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Spark | 35,923 | 2,394 | 882 | a day ago | 46 | May 09, 2021 | 274 | apache-2.0 | Scala | |
Apache Spark - A unified analytics engine for large-scale data processing | ||||||||||
Data Science Ipython Notebooks | 25,025 | a month ago | 33 | other | Python | |||||
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines. | ||||||||||
Bigdata Notes | 13,291 | 4 months ago | 33 | Java | ||||||
A beginner's guide to big data :star: | ||||||||||
Deeplearning4j | 12,965 | 38 | 21 | a day ago | 15 | January 27, 2017 | 620 | apache-2.0 | Java | |
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation. | ||||||||||
Cookbook | 11,769 | 2 months ago | 110 | apache-2.0 | ||||||
The Data Engineering Cookbook | ||||||||||
It_book | 8,543 | 2 years ago | 7 | |||||||
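This project collects more than a thousand good, commonly used books that I have read or heard about over the years; the book you are looking for may well be here. It covers most books for the internet industry, plus interview experience and questions. It includes an artificial intelligence series (the common deep learning frameworks TensorFlow, pytorch, keras; NLP, machine learning, deep learning, etc.), a big data series (Spark, Hadoop, Scala, kafka, etc.), and a programmer essentials series (C, C++, java, data structures, linux, design patterns, databases, etc.) | ||||||||||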
Doris | 8,414 | a day ago | 1,717 | apache-2.0 | Java | |||||
Apache Doris is an easy-to-use, high performance and unified analytics database. | ||||||||||
God Of Bigdata | 7,992 | 2 months ago | 2 | |||||||
Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive... | ||||||||||
H2o 3 | 6,299 | 18 | 30 | a day ago | 232 | September 19, 2022 | 2,688 | apache-2.0 | Jupyter Notebook | |
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc. | ||||||||||
Alluxio | 6,258 | 31 | 45 | a day ago | 54 | August 05, 2022 | 857 | apache-2.0 | Java | |
Alluxio, data orchestration for analytics and machine learning in the cloud |
This is the companion repo to my LinkedIn Learning Courses on Apache Hadoop and Apache Spark.
🐘 1. Learning Hadoop - link
- uses mostly GCP Dataproc
- for running Hadoop & associated libraries (e.g. Hive, Pig, Spark...) workloads
🌩️ 2. Cloud Hadoop: Scaling Apache Spark - link
- uses GCP Dataproc, AWS EMR --or--
- Databricks on AWS
⛈️ 3. Azure Databricks Spark Essential Training - link
- uses Azure with Databricks
- for scaling Apache Spark workloads
You have a number of options. Although it is possible to set up a local Hadoop/Spark cluster, I do NOT recommend this approach, as it's needlessly complex for initial study. Instead, I recommend that you use a partially or fully managed cluster. For learning, I most often use a fully managed (free tier) cluster.
Databricks offers managed Apache Spark clusters. Databricks can run on AWS, Azure, or GCP (GCP support was announced in 2021 - link). In this course, I use Databricks running on AWS, as the Community Edition is simple and fast to set up for learning purposes.
- setup-hadoop folder in this Repo for instructions/scripts
- example_datasets folder in this Repo for sample data files
EXAMPLES from org.apache.hadoop_or_spark.examples
- link for Spark examples
https://demo.gethue.com/ - user:demo, pwd:demo
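Word count is the classic first job among the examples bundled with both Hadoop and Spark. If you want to see the idea before you have a cluster running, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases. It is not the code from those example packages and uses no Hadoop or Spark API; the function names and the sample input are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted counts by word, mimicking the step between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read files from the cluster.
    sample = ["spark and hadoop", "hadoop and hive"]
    print(word_count(sample))
```

On a real cluster the framework distributes these phases across machines; here everything runs in one process, so this only illustrates the data flow.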
There are ~10 courses on Hadoop/Spark topics on LinkedIn Learning. See graphic below.