Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Data Science Ipython Notebooks | 25,242 | 3 months ago | 34 | other | Python | |||||
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines. | ||||||||||
Awesome Bigdata | 12,350 | 2 months ago | 36 | mit | ||||||
A curated list of awesome big data frameworks, resources and other awesomeness. | ||||||||||
Trino | 8,575 | 18 | a day ago | 67 | July 14, 2023 | 2,461 | apache-2.0 | Java | ||
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io) | ||||||||||
Vaex | 7,985 | 2 | 26 | a month ago | 69 | July 21, 2023 | 504 | mit | Python | |
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀 | ||||||||||
Catboost | 7,375 | 6 | a day ago | 60 | September 26, 2022 | 519 | apache-2.0 | Python | ||
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU. | ||||||||||
H2o 3 | 6,493 | 18 | 32 | a day ago | 241 | July 25, 2023 | 2,708 | apache-2.0 | Jupyter Notebook | |
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc. | ||||||||||
Pachyderm | 5,981 | 1 | a day ago | 504 | August 04, 2023 | 880 | apache-2.0 | Go | ||
Data-Centric Pipelines and Data Versioning | ||||||||||
Feast | 4,804 | 23 | 2 days ago | 113 | April 24, 2023 | 98 | apache-2.0 | Python | ||
Feature Store for Machine Learning | ||||||||||
Synapseml | 4,571 | 3 | 2 days ago | 9 | November 22, 2022 | 322 | mit | Scala | ||
Simple and Distributed Machine Learning | ||||||||||
Koalas | 3,291 | 1 | 13 | 14 days ago | 47 | October 19, 2021 | 112 | apache-2.0 | Python | |
Koalas: pandas API on Apache Spark |
A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.
Your contributions are always welcome!
Note: The industry uses the term "Columnar Databases" for two different things, which causes some confusion. Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").
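As a rough illustration of the key-map data model described above, here is a minimal in-memory sketch in Python. The class name `KeyMapStore`, its methods, and the sample keys are hypothetical and do not correspond to the API of any particular database; real systems add distribution, persistence, and versioning on top of this shape.

```python
class KeyMapStore:
    """Toy model: row key -> column family -> {column: value}."""

    def __init__(self):
        # Outer dict is keyed by the (possibly composite) row key.
        self._rows = {}

    def put(self, row_key, family, column, value):
        # Create the row and the column family on first write.
        families = self._rows.setdefault(row_key, {})
        families.setdefault(family, {})[column] = value

    def get(self, row_key, family):
        # Return a copy of the value map for one column family,
        # or an empty map if the key or family is unknown.
        return dict(self._rows.get(row_key, {}).get(family, {}))


store = KeyMapStore()
# A composite key is just a tuple here.
store.put(("user", 42), "profile", "name", "Ada")
store.put(("user", 42), "profile", "city", "London")
store.put(("user", 42), "stats", "logins", 7)

profile = store.get(("user", 42), "profile")
```

Each row key can carry several independent value maps ("column families"), and the set of columns in a map may differ from one key to the next.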
Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory. Rather than storing data the traditional way, "row by row", where all column values for a given key are stored next to each other, these systems store all values of a given column next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
The former group is referred to as "key map data model" here. The line between these and the Key-value Data Model stores is fairly blurry.
The latter, being more about the storage format than about the data model, is listed under Columnar Databases.
You can read more about this distinction on Prof. Daniel Abadi's blog: Distinguishing two major types of Column Stores.
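The storage-format distinction can be sketched with plain Python lists. This is only a toy under stated assumptions: the sample records and the helper names `to_columnar` and `fetch_row` are made up for illustration, and every record is assumed to have the same fields.

```python
# Row-oriented layout: all column values for one key sit together.
rows = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 2, "name": "Linus", "age": 28},
    {"id": 3, "name": "Grace", "age": 45},
]


def to_columnar(rows):
    """Regroup records so that each column's values sit together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def fetch_row(columns, index):
    """Rebuild one record by touching every column (the costlier path)."""
    return {name: values[index] for name, values in columns.items()}


columns = to_columnar(rows)

# Reading one column is a scan over a single contiguous list.
total_age = sum(columns["age"])

# Reading one full record has to visit each column list in turn.
second = fetch_row(columns, 1)
```

In the columnar layout an aggregate over `age` reads one list, while reassembling a full record has to visit every column, which mirrors the trade-off described in the note.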
Note: please read the note in the Key-Map Data Model section.