Project Name	Stars	Repos Using This	Packages Using This	Most Recent Commit	Total Releases	Latest Release	Open Issues	License	Language
Data Science Ipython Notebooks	25,668			7 months ago			34	other	Python
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Awesome Bigdata	12,759			2 months ago			38	mit
A curated list of awesome big data frameworks, resources and other awesomeness.
Trino	9,118		29	3 months ago	83	November 30, 2023	2,496	apache-2.0	Java
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Vaex	8,161	2	29	2 months ago	69	July 21, 2023	508	mit	Python
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Catboost	7,564		12	3 months ago	20	September 19, 2023	539	apache-2.0	Python
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
H2o 3	6,618	62	33	3 months ago	49	August 09, 2023	2,746	apache-2.0	Jupyter Notebook
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Pachyderm	6,035		1	3 months ago	613	December 04, 2023	897	apache-2.0	Go
Data-Centric Pipelines and Data Versioning
Feast	5,053		28	3 months ago	116	September 07, 2023	149	apache-2.0	Python
Feature Store for Machine Learning
Synapseml	4,967		6	4 days ago	12	November 27, 2023	335	mit	Scala
Simple and Distributed Machine Learning
Koalas	3,291	1	16	7 months ago	47	October 19, 2021	112	apache-2.0	Python
Koalas: pandas API on Apache Spark

Alternatives To Awesome Bigdata

Data Science Ipython Notebooks ⭐ 25,668

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

most recent commit 7 months ago

Awesome Bigdata ⭐ 12,759

A curated list of awesome big data frameworks, resources and other awesomeness.

most recent commit 2 months ago

Trino ⭐ 9,118

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

dependent packages 29; total releases 83; most recent commit 3 months ago

Vaex ⭐ 8,161

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

dependent packages 29; total releases 69; most recent commit 2 months ago

Catboost ⭐ 7,564

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

dependent packages 12; total releases 20; most recent commit 3 months ago

H2o 3 ⭐ 6,618

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

dependent packages 33; total releases 49; most recent commit 3 months ago

Pachyderm ⭐ 6,035

Data-Centric Pipelines and Data Versioning

dependent packages 1; total releases 613; most recent commit 3 months ago

Feast ⭐ 5,053

Feature Store for Machine Learning

dependent packages 28; total releases 116; most recent commit 3 months ago

Synapseml ⭐ 4,967

Simple and Distributed Machine Learning

dependent packages 6; total releases 12; most recent commit 4 days ago

Koalas ⭐ 3,291

Koalas: pandas API on Apache Spark

dependent packages 16; total releases 47; most recent commit 7 months ago

Popular Data Science Projects

Ml For Beginners ⭐ 63,698

12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all

most recent commit 4 months ago

Keras ⭐ 60,854

Deep Learning for humans

dependent packages 697; total releases 87; latest release December 06, 2023; most recent commit 14 days ago

Superset ⭐ 58,778

Apache Superset is a Data Visualization and Data Exploration Platform

dependent packages 21; total releases 6; latest release April 18, 2023; most recent commit 2 days ago

Scikit Learn ⭐ 57,160

scikit-learn: machine learning in Python

dependent packages 11,480; total releases 73; latest release October 23, 2023; most recent commit 3 months ago

Pandas ⭐ 41,935

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

dependent packages 36,464; total releases 122; latest release December 08, 2023; most recent commit 3 days ago

Popular Big Data Projects

Awesome Scalability ⭐ 50,409

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

most recent commit 4 months ago

Spark ⭐ 37,661

Apache Spark - A unified analytics engine for large-scale data processing

dependent packages 939; total releases 46; latest release May 09, 2021; most recent commit 3 months ago

Clickhouse ⭐ 34,124

ClickHouse® is a free analytics DBMS for big data

total releases 699; latest release December 16, 2021; most recent commit 3 days ago

Flink ⭐ 22,747

Apache Flink

dependent packages 413; total releases 119; latest release November 10, 2023; most recent commit 3 months ago

Tdengine ⭐ 22,519

TDengine is an open source, high-performance, cloud native time-series database optimized for Internet of Things (IoT), Connected Cars, Industrial IoT and DevOps.

dependent packages 2; total releases 12; latest release April 14, 2022; most recent commit 3 months ago

Popular Data Processing Categories