Awesome Open Source
Search results for big data
1,346 search results found
Awesome Scalability
⭐
50,409
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Spark
⭐
37,661
Apache Spark - A unified analytics engine for large-scale data processing
Clickhouse
⭐
34,124
ClickHouse® is a free analytics DBMS for big data
Data Science Ipython Notebooks
⭐
25,668
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Flink
⭐
22,747
Apache Flink
Tdengine
⭐
22,519
TDengine is an open source, high-performance, cloud native time-series database optimized for Internet of Things (IoT), Connected Cars, Industrial IoT and DevOps.
Shardingsphere
⭐
19,381
Distributed SQL transaction & query engine for data sharding, scaling, encryption, and more - on any database.
Gun
⭐
17,626
An open source cybersecurity protocol for syncing decentralized graph data.
Bigdata Notes
⭐
14,872
A beginner's guide to big data ⭐
Questdb
⭐
13,178
An open source time-series database for fast ingest and SQL queries
Awesome Bigdata
⭐
12,800
A curated list of awesome big data frameworks, resources and other awesomeness.
Cookbook
⭐
12,557
The Data Engineering Cookbook
Predictionio
⭐
12,548
PredictionIO, a machine learning server for developers and ML engineers.
Cmak
⭐
11,670
CMAK is a tool for managing Apache Kafka clusters
Nebula
⭐
9,841
A distributed, fast open-source graph database featuring horizontal scalability and high availability
Juicefs
⭐
9,252
JuiceFS is a distributed POSIX file system built on top of Redis and S3.
Trino
⭐
9,118
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Cython
⭐
8,667
The most widely used Python to C compiler
God Of Bigdata
⭐
8,483
Focused on big data learning and interview preparation; start your path to big data mastery. Flink/Spark/Hadoop/HBase/Hive.
Vaex
⭐
8,161
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Kafka Ui
⭐
7,779
Open-Source Web UI for Apache Kafka Management
Catboost
⭐
7,564
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Beam
⭐
7,355
Apache Beam is a unified programming model for Batch and Streaming data processing.
Starrocks
⭐
7,191
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.
Databend
⭐
7,183
Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
Delta
⭐
6,656
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
H2o 3
⭐
6,618
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Zeppelin
⭐
6,259
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Arkime
⭐
6,088
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
Pachyderm
⭐
6,035
Data-Centric Pipelines and Data Versioning
Couchdb
⭐
5,922
Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
Risingwave
⭐
5,799
The distributed streaming database. Engineered to offer the simplest and most cost-efficient way for stream processing and management.
Hazelcast
⭐
5,738
Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
Data Engineer Handbook
⭐
5,650
This is a repo with links to everything you'd ever want to learn about data engineering
Quickwit
⭐
5,615
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
Hive
⭐
5,222
Apache Hive
Vespa
⭐
5,115
AI + Data, online. https://vespa.ai
Hudi
⭐
5,064
Upserts, Deletes And Incremental Processing on Big Data.
Feast
⭐
5,053
Feature Store for Machine Learning
Synapseml
⭐
4,967
Simple and Distributed Machine Learning
Stream Framework
⭐
4,714
Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
Ignite
⭐
4,626
Apache Ignite
Arrow Datafusion
⭐
4,514
Apache Arrow DataFusion SQL Query Engine
Calcite
⭐
4,216
Apache Calcite
Iotdb
⭐
4,157
Apache IoTDB
Vue Virtual Scroll List
⭐
4,049
⚡️ A Vue component that supports large data lists with high rendering performance and efficiency.
Chunjun
⭐
3,893
A data integration framework
Crate
⭐
3,864
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
Volcano
⭐
3,577
A Cloud Native Batch System (Project under CNCF)
Sql Generator
⭐
3,346
🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; a simple project (heavy on logic, light on pages) that is well suited for practice.
Koalas
⭐
3,291
Koalas: pandas API on Apache Spark
Fastjson2
⭐
3,251
🚄 FASTJSON2 is a Java JSON library with excellent performance.
Graphscope
⭐
3,033
🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba. Graph analytics, graph queries, graph machine learning.
Img2dataset
⭐
2,986
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
Cboard
⭐
2,909
An easy to use, self-service open BI reporting and BI dashboard platform.
Avro
⭐
2,691
Apache Avro is a data serialization system.
Dpark
⭐
2,637
A Python clone of Spark: a MapReduce-like framework in Python
Incubator Hugegraph
⭐
2,549
A graph database that supports more than 100 billion data items, with high performance and scalability (includes OLTP engine, REST API, and backends)
Featurebase
⭐
2,504
A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
Flume
⭐
2,475
Mirror of Apache Flume
Nakedtensor
⭐
2,471
Bare bone examples of machine learning in TensorFlow
Data Science Roadmap
⭐
2,445
Data Science Roadmap from A to Z
Bigdataguide
⭐
2,355
Big data learning: learn big data from scratch, including study videos for every stage of learning and interview materials
Root
⭐
2,329
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
Griddb
⭐
2,310
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
Parquet Mr
⭐
2,296
Apache Parquet
Lakesoul
⭐
2,248
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
H2o 2
⭐
2,242
Please visit https://github.com/h2oai/h2o-3 for latest H2O
Alldata
⭐
2,130
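🔥🔥 AllData is a definable big data middle-platform product, with the data platform as its foundation, the data middle platform as its bridge, and the machine learning platform as its middle layer…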
Ambari
⭐
2,030
Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.
Flinkstreamsql
⭐
1,972
Extends the real-time SQL of open-source Flink; mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax
Spark
⭐
1,963
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Poli
⭐
1,920
An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.
Drill
⭐
1,856
Apache Drill is a distributed MPP query layer for self describing data
Bookkeeper
⭐
1,828
Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
Byzer Lang
⭐
1,813
Byzer (former MLSQL): A low-code open-source programming language for data pipeline, analytics and AI.
Kudu
⭐
1,776
Mirror of Apache Kudu
Gaffer
⭐
1,724
A large-scale entity and relation database supporting aggregation of properties
Ytsaurus
⭐
1,694
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Genie
⭐
1,659
Distributed Big Data Orchestration Service
Incubator Paimon
⭐
1,647
Apache Paimon(incubating) is a streaming data lake platform that supports high-speed data ingestion, change data tracking and efficient real-time analytics.
Parquet Format
⭐
1,559
Apache Parquet
Poseidon
⭐
1,543
A search engine which can hold 100 trillion lines of log data.
Spark Py Notebooks
⭐
1,515
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Bitsail
⭐
1,514
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Moosefs
⭐
1,509
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Just Dashboard
⭐
1,489
📊 📋 Dashboards using YAML or JSON files
Fluid
⭐
1,488
Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
Vaquarkhan
⭐
1,464
Autocrawler
⭐
1,454
Google, Naver multiprocess image web crawler (Selenium)
Optimus
⭐
1,446
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Mysql_perf_analyzer
⭐
1,420
MySQL performance monitoring and analysis.
Carbondata
⭐
1,401
High performance data store solution
Bigdata Interview
⭐
1,397
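🎯 🌟 [Big data interview questions] Sharing big data interview questions I collected online, along with my own summarized answers. Currently covers Hadoop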
Bigdataview
⭐
1,309
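100+ eye-catching HTML5 large-screen templates for big data visualization; industries covered include community, property management, government affairs, transportation, finance and banking, and more; the newest on the web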
Dremio Oss
⭐
1,260
Dremio - the missing link in modern data
Matano
⭐
1,259
Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
Bigdata Growth
⭐
1,256
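A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.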
Tensorbase
⭐
1,217
TensorBase is a new big data warehouse built with a modern approach.
Avsc
⭐
1,209
Avro for JavaScript ⚡
1-100 of 1,346 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.