Awesome Open Source

Search results for big data

1,346 search results found

Awesome Scalability ⭐ 50,409

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

Spark ⭐ 37,661

Apache Spark - A unified analytics engine for large-scale data processing

Clickhouse ⭐ 34,124

ClickHouse® is a free analytics DBMS for big data

Data Science Ipython Notebooks ⭐ 25,668

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Flink ⭐ 22,747

Tdengine ⭐ 22,519

TDengine is an open source, high-performance, cloud native time-series database optimized for Internet of Things (IoT), Connected Cars, Industrial IoT and DevOps.

Shardingsphere ⭐ 19,381

Distributed SQL transaction & query engine for data sharding, scaling, encryption, and more - on any database.

An open source cybersecurity protocol for syncing decentralized graph data.

Bigdata Notes ⭐ 14,872

A beginner's guide to big data ⭐

Questdb ⭐ 13,178

An open source time-series database for fast ingest and SQL queries

Awesome Bigdata ⭐ 12,800

A curated list of awesome big data frameworks, resources and other awesomeness.

Cookbook ⭐ 12,557

The Data Engineering Cookbook

Predictionio ⭐ 12,548

PredictionIO, a machine learning server for developers and ML engineers.

Cmak ⭐ 11,670

CMAK is a tool for managing Apache Kafka clusters

Nebula ⭐ 9,841

A distributed, fast open-source graph database featuring horizontal scalability and high availability

Juicefs ⭐ 9,252

JuiceFS is a distributed POSIX file system built on top of Redis and S3.

Trino ⭐ 9,118

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Cython ⭐ 8,667

The most widely used Python to C compiler

God Of Bigdata ⭐ 8,483

Focused on big data learning and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive.

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

Kafka Ui ⭐ 7,779

Open-Source Web UI for Apache Kafka Management

Catboost ⭐ 7,564

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Apache Beam is a unified programming model for Batch and Streaming data processing.

Starrocks ⭐ 7,191

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.

Databend ⭐ 7,183

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com

Delta ⭐ 6,656

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

H2o 3 ⭐ 6,618

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Zeppelin ⭐ 6,259

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Arkime ⭐ 6,088

Arkime is an open source, large scale, full packet capturing, indexing, and database system.

Pachyderm ⭐ 6,035

Data-Centric Pipelines and Data Versioning

Couchdb ⭐ 5,922

Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability

Risingwave ⭐ 5,799

The distributed streaming database. Engineered to offer the simplest and most cost-efficient way for stream processing and management.

Hazelcast ⭐ 5,738

Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

Data Engineer Handbook ⭐ 5,650

This is a repo with links to everything you'd ever want to learn about data engineering

Quickwit ⭐ 5,615

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

Vespa ⭐ 5,115

AI + Data, online. https://vespa.ai

Upserts, Deletes And Incremental Processing on Big Data.

Feast ⭐ 5,053

Feature Store for Machine Learning

Synapseml ⭐ 4,967

Simple and Distributed Machine Learning

Stream Framework ⭐ 4,714

Stream Framework is a Python library that lets you build news feeds, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:

Ignite ⭐ 4,626

Arrow Datafusion ⭐ 4,514

Apache Arrow DataFusion SQL Query Engine

Calcite ⭐ 4,216

Iotdb ⭐ 4,157

Vue Virtual Scroll List ⭐ 4,049

⚡️ A Vue component that supports large data lists with high rendering performance and efficiency.

Chunjun ⭐ 3,893

A data integration framework

Crate ⭐ 3,864

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

Volcano ⭐ 3,577

A Cloud Native Batch System (Project under CNCF)

Sql Generator ⭐ 3,346

🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; a simple project (logic-heavy, light on UI) that is well suited for practice.

Koalas ⭐ 3,291

Koalas: pandas API on Apache Spark

Fastjson2 ⭐ 3,251

🚄 FASTJSON2 is a Java JSON library with excellent performance.

Graphscope ⭐ 3,033

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba for graph analytics, graph queries, and graph machine learning

Img2dataset ⭐ 2,986

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

Cboard ⭐ 2,909

An easy to use, self-service open BI reporting and BI dashboard platform.

Apache Avro is a data serialization system.

Dpark ⭐ 2,637

Python clone of Spark, a MapReduce-like framework in Python

Incubator Hugegraph ⭐ 2,549

A graph database that supports over 100 billion data items, with high performance and scalability (includes OLTP Engine, REST-API, and Backends)

Featurebase ⭐ 2,504

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase

Flume ⭐ 2,475

Mirror of Apache Flume

Nakedtensor ⭐ 2,471

Bare bone examples of machine learning in TensorFlow

Data Science Roadmap ⭐ 2,445

Data Science Roadmap from A to Z

Bigdataguide ⭐ 2,355

Big data learning: learn big data from scratch, including learning videos for each stage of study and interview materials

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

Griddb ⭐ 2,310

GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.

Parquet Mr ⭐ 2,296

Lakesoul ⭐ 2,248

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

H2o 2 ⭐ 2,242

Please visit https://github.com/h2oai/h2o-3 for latest H2O

Alldata ⭐ 2,130

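🔥🔥 AllData is a definable data middle platform for big data, with the data platform as its foundation, the data middle platform as its bridge, and the machine learning platform as its middle-layer frame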

Ambari ⭐ 2,030

Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.

Flinkstreamsql ⭐ 1,972

Extends real-time SQL on top of open-source Flink; mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax

Spark ⭐ 1,963

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.

Drill ⭐ 1,856

Apache Drill is a distributed MPP query layer for self describing data

Bookkeeper ⭐ 1,828

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads

Byzer Lang ⭐ 1,813

Byzer (formerly MLSQL): A low-code open-source programming language for data pipelines, analytics and AI.

Mirror of Apache Kudu

Gaffer ⭐ 1,724

A large-scale entity and relation database supporting aggregation of properties

Ytsaurus ⭐ 1,694

YTsaurus is a scalable and fault-tolerant open-source big data platform.

Genie ⭐ 1,659

Distributed Big Data Orchestration Service

Incubator Paimon ⭐ 1,647

Apache Paimon(incubating) is a streaming data lake platform that supports high-speed data ingestion, change data tracking and efficient real-time analytics.

Parquet Format ⭐ 1,559

Poseidon ⭐ 1,543

A search engine which can hold 100 trillion lines of log data.

Spark Py Notebooks ⭐ 1,515

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Bitsail ⭐ 1,514

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

Moosefs ⭐ 1,509

MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)

Just Dashboard ⭐ 1,489

📊 📋 Dashboards using YAML or JSON files

Fluid ⭐ 1,488

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

Vaquarkhan ⭐ 1,464

Autocrawler ⭐ 1,454

Google, Naver multiprocess image web crawler (Selenium)

Optimus ⭐ 1,446

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Mysql_perf_analyzer ⭐ 1,420

MySQL performance monitoring and analysis.

Carbondata ⭐ 1,401

High performance data store solution

Bigdata Interview ⭐ 1,397

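🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from the web, along with my own answer summaries. Currently covers Hadoop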

Bigdataview ⭐ 1,309

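100+ eye-catching HTML5 big-screen templates for big data visualization; industries covered include community, property management, government affairs, transportation, finance and banking, and more; the newest on the web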

Dremio Oss ⭐ 1,260

Dremio - the missing link in modern data

Matano ⭐ 1,259

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS

Bigdata Growth ⭐ 1,256

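A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.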

Tensorbase ⭐ 1,217

TensorBase is a new big data warehouse built with modern efforts.

Avro for JavaScript ⚡

1-100 of 1,346 search results

Copyright 2018-2024 Awesome Open Source. All rights reserved.