Awesome Open Source

Search results for big data

1,346 search results found

Spark Doc Zh ⭐ 1,186

Chinese translation of the official Apache Spark documentation

Egads ⭐ 1,136

A Java package to automatically detect anomalies in large scale time-series data

Scikit Learn Intelex ⭐ 1,116

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

Arrow Ballista ⭐ 1,111

Apache Arrow Ballista Distributed Query Engine

Datumbox Framework ⭐ 1,089

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

Arcticdb ⭐ 1,071

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

Hazelcast Jet ⭐ 1,065

Distributed Stream and Batch Processing

Kube Batch ⭐ 1,065

A batch scheduler for Kubernetes, built for high-performance workloads, e.g. AI/ML, BigData, HPC

Odd Platform ⭐ 1,047

First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

Utils4s ⭐ 1,033

A collection of test cases and related reference material gathered while working with Scala and Spark

Distributed DataFrame for Python designed for the cloud, powered by Rust

Phoenix ⭐ 1,006

Mirror of Apache Phoenix

Accumulo ⭐ 1,003

Apache Accumulo

Automated Deep Learning without ANY human intervention. 1st-place solution for the AutoDL challenge@NeurIPS.

Traildb ⭐ 987

TrailDB is an efficient tool for storing and querying series of events

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Sparkling Water ⭐ 957

Sparkling Water provides H2O functionality inside a Spark cluster

C# and F# language binding and extensions to Apache Spark

Data syncing in golang for ClickHouse.

Coding Now ⭐ 925

Study notes, along with eBooks and video resources I have gone through and blogs I have bookmarked and found worthwhile,

Titanoboa ⭐ 905

Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.

Tispark ⭐ 872

TiSpark is built for running Apache Spark on top of TiDB/TiKV

Dataflowjavasdk ⭐ 853

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

Incubator Livy ⭐ 840

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

Mirror of Apache Sqoop

Rakam Api ⭐ 798

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

Kafka Streams ⭐ 797

equivalent to kafka-streams 🐙 for nodejs ✨🐢🚀✨

Mirror of Apache Samza

Onlinestats.jl ⭐ 786

⚡ Single-pass algorithms for statistics

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

Gearpump ⭐ 758

Lightweight real-time big data streaming engine over Akka

Spark Movie Lens ⭐ 757

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

Scalable, redundant, and distributed object store for Apache Hadoop

Visualpython ⭐ 748

GUI-based Python code generator for data science, extension to Jupyter Lab, Jupyter Notebook and Google Colab.

Sciblog_support ⭐ 742

Support content for my blog

Incubator Celeborn ⭐ 725

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.

Flink Boot ⭐ 725

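The Lazy Squirrel (懒松鼠) Flink-Boot scaffold brings Flink fully into the Spring ecosystem, so developers can build distributed stream-processing programs using the Java web development model; Lazy Squirrel makes crossing between the two worlds simpler. It aims to let developers get started at a lower cost [...] ORM framework, the Hibernate Validator validation framework, the Spring Retry retry framework, and more; see the scaffold features below for details.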

Graphchi Cpp ⭐ 710

GraphChi's C++ version. Big Data - small machine.

Workflows and interfaces for neuroimaging packages

Pgm Index ⭐ 693

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes

Mirror of Apache Oozie

Data Science Career ⭐ 661

Career Resources for Data Science, Machine Learning, Big Data and Business Analytics Career Repository

Flink Kubernetes Operator ⭐ 657

Apache Flink Kubernetes Operator

Datav Vue ⭐ 654

A Powerful Data Visualization Tool. Uses TypeScript And Vue3. Scenario-specific templates. User-friendly interfaces. A tool for building data visualization applications.

Delta Sharing ⭐ 654

An open protocol for secure data sharing

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler

Dataengineeringproject ⭐ 644

Example end to end data engineering project.

Oio Sds ⭐ 634

High Performance Software-Defined Object Storage for Big Data and AI, that supports Amazon S3 and Openstack Swift

CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.

Wedatasphere ⭐ 624

WeDataSphere is a financial grade, one-stop big data platform suite.

Opendata.cern.ch ⭐ 620

Source code for the CERN Open Data portal

Spark Rapids ⭐ 619

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

Amoro is a Lakehouse management system built on open data lake formats.

Listenbrainz Server ⭐ 613

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

Scanner ⭐ 602

Efficient video analysis at scale

Courses ⭐ 590

Answers for Quizzes & Assignments that I have taken

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

oneAPI Data Analytics Library (oneDAL)

Mirror of Apache Giraph

Parquetviewer ⭐ 574

Simple Windows desktop application for viewing & querying Apache Parquet files

Nussknacker ⭐ 564

Low-code tool for automating actions on real time data | Stream processing for the users.

Tugraph Analytics ⭐ 557

TuGraph Analytics is the fastest OLAP graph database.

Redislite ⭐ 555

Redis in a Python module.

Data Lineage Tracking And Visualization Solution

Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components.

Bigartm ⭐ 537

Fast topic modeling platform

Metorikku ⭐ 536

A simplified, lightweight ETL Framework based on Apache Spark

Bigdata Ecosystem ⭐ 536

BigData Ecosystem Dataset

Running Elasticsearch Fun Profit ⭐ 534

A book about running Elasticsearch

Bigslice ⭐ 525

A serverless cluster computing system for the Go programming language

Datawave ⭐ 512

DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.

Mockneat ⭐ 511

MockNeat - the modern faker lib.

Clickbench ⭐ 510

ClickBench: a Benchmark For Analytical Databases

Hudi Resources ⭐ 509

A collection of Apache Hudi resources

Magellan ⭐ 509

Geo Spatial Data Analytics on Spark

Sidekick ⭐ 503

High Performance HTTP Sidecar Load Balancer

Fit Sne ⭐ 499

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++

Decentralized Internet ⭐ 486

An SDK/library for decentralized web and distributed computing projects

Vue Bigdata Table ⭐ 476

A Vue.js-based table component for million-row datasets, supporting editing, filtering, pasting, drag-to-resize column widths, and more

Jigsaw七巧板 provides a set of web components based on Angular5/8/9+. Its main purpose is to help application developers build complex, highly interactive, user-friendly web pages. Jigsaw supports the development of all applications in ZTE's Big Data product.

Kafka Connect Hdfs ⭐ 473

Kafka Connect HDFS connector

A fast, log structured key-value store.

Kusto Query Language ⭐ 464

Kusto Query Language is a simple and productive language for querying Big Data.

Conjure Up ⭐ 456

Deploying complex solutions, magically.

Circosjs ⭐ 454

d3 library to build circular graphs

Sparklearning ⭐ 451

A comprehensive Spark guide collated from multiple sources that can be referred to learn more about Spark or as an interview refresher.

Cogcomp Nlp ⭐ 448

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

Big Data Demo ⭐ 448

A data visualization showcase project based on Vue, three.js, and echarts, including 3D model import and interaction and 3D model annotation

Awesome Data Catalogs ⭐ 441

📙 Awesome Data Catalogs and Observability Platforms.

Mirror of Apache Helix

Oie Resources ⭐ 435

A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️

Kotlin Spark Api ⭐ 425

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x

Mlcraft ⭐ 418

Synmetrix – open source semantic layer / Boost your LLM precision

Stroom is a highly scalable data storage, processing and analysis platform.

Docker Spark Cluster ⭐ 413

A simple Spark standalone cluster for your testing environment purposes

101-200 of 1,346 search results
