Awesome Open Source

Search results for apache spark

583 search results found

Spark ⭐ 37,661

Apache Spark - A unified analytics engine for large-scale data processing

Mlflow ⭐ 16,343

Open source platform for the machine learning lifecycle

Data Engineer Handbook ⭐ 5,650

This is a repo with links to everything you'd ever want to learn about data engineering

Upserts, Deletes And Incremental Processing on Big Data.

Synapseml ⭐ 4,989

Simple and Distributed Machine Learning

Bigdl ⭐ 4,728

Accelerate LLM with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using bigdl-llm

Sparkinternals ⭐ 4,665

Notes talking about the design and implementation of Apache Spark

Lakefs ⭐ 3,900

lakeFS - Data version control for your data lake | Git for data

Spark Nlp ⭐ 3,578

State of the Art Natural Language Processing

Coolplayspark ⭐ 3,447

Cool Play Spark: Spark source code analysis, Spark libraries, and more

Koalas ⭐ 3,291

Koalas: pandas API on Apache Spark

Spark Notebook ⭐ 3,148

Interactive and Reactive Data Science using Scala and Spark.

Deequ ⭐ 3,044

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Analytics Zoo ⭐ 2,592

Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

Spark On K8s Operator ⭐ 2,526

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Ballista ⭐ 2,244

Distributed compute platform implemented in Rust, and powered by Apache Arrow.

Transmogrifai ⭐ 2,099

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

Spark ⭐ 1,963

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

A new arguably faster implementation of Apache Spark from scratch in Rust

Feathr ⭐ 1,886

Feathr – A scalable, unified data and AI engineering platform for enterprise

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Docker Spark ⭐ 1,783

Apache Spark Docker image

Awesome Spark ⭐ 1,461

A curated list of awesome Apache Spark packages and resources.

Dr Elephant ⭐ 1,301

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark

Spark Doc Zh ⭐ 1,186

Chinese edition of the official Apache Spark documentation

Sparkit Learn ⭐ 1,054

PySpark + Scikit-learn = Sparkit-learn

Spark Sklearn ⭐ 1,039

(Deprecated) Scikit-learn integration package for Apache Spark

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

C# and F# language binding and extensions to Apache Spark

Sparklyr ⭐ 922

R interface for Apache Spark

Livy is an open source REST interface for interacting with Apache Spark from anywhere

Tispark ⭐ 872

TiSpark is built for running Apache Spark on top of TiDB/TiKV

Incubator Livy ⭐ 840

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

Extraction Framework ⭐ 802

The software used to extract structured data from Wikipedia

Kafka Storm Starter ⭐ 729

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

Streaming Readings ⭐ 640

Papers and reading material related to streaming systems

Flintrock ⭐ 627

A command-line tool for launching Apache Spark clusters.

Docker Spark ⭐ 626

Docker build for Apache Spark

Spark Rapids ⭐ 619

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

Dist Keras ⭐ 611

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

Sparkmeasure ⭐ 603

This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/

Goodreads_etl_pipeline ⭐ 593

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Sparklearning ⭐ 573

Learning Apache Spark, including code and data. Most parts can run locally.

pyspark methods to enhance developer productivity 📣 👯 🎉

Learningsparkv2 ⭐ 570

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

Openscoring ⭐ 565

REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models

Data Lineage Tracking And Visualization Solution

Awesome Kafka ⭐ 549

A list about Apache Kafka

Parquet Dotnet ⭐ 457

Fully managed Apache Parquet implementation

Sparkle ⭐ 442

Haskell on Apache Spark.

Agile_data_code_2 ⭐ 435

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Sparkling ⭐ 423

A Clojure library for Apache Spark: fast, fully featured, and developer-friendly

Spark Corenlp ⭐ 409

Stanford CoreNLP wrapper for Apache Spark

Machinelearning ⭐ 406

Machine Learning

Spark Perf ⭐ 346

Performance tests for Apache Spark

Eclairjs Node ⭐ 340

Node.js API for Apache Spark with Remote Client

Wirbelsturm ⭐ 333

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

Morpheus ⭐ 330

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

Parquet Dotnet ⭐ 319

🏐 Apache Parquet for modern .NET

Incubator Hivemall ⭐ 308

Mirror of Apache Hivemall (incubating)

Serverless proxy for Spark cluster

Sparkflow ⭐ 301

Easy-to-use library to bring TensorFlow to Apache Spark

Delight ⭐ 299

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

Sparktorch ⭐ 297

Train and run PyTorch models on Apache Spark.

Data Accelerator ⭐ 295

Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience for creating, editing, and managing Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.

Akka Analytics ⭐ 281

Large-scale event processing with Akka Persistence and Apache Spark

Spark Gotchas ⭐ 276

Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

Spark Programming In Python ⭐ 269

Apache Spark 3 - Spark Programming in Python for Beginners

Cuelake ⭐ 266

Use SQL to build ELT pipelines on a data lakehouse.

Spark Jupyter Aws ⭐ 255

A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support

Pysparkling ⭐ 253

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

Spark Indexedrdd ⭐ 247

An efficient updatable key-value store for Apache Spark

Succinct ⭐ 239

Enabling queries on compressed data.

Spark Workshop ⭐ 231

Apache Spark™ and Scala Workshops

Azure Event Hubs Spark ⭐ 225

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs

Ruby Spark ⭐ 215

Ruby wrapper for Apache Spark

Spark_dbscan ⭐ 215

DBSCAN clustering algorithm on top of Apache Spark

Databricks ⭐ 212

Repository of sample Databricks notebooks

Sql Data Analysis And Visualization Projects ⭐ 200

SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.

Spark Snowflake ⭐ 196

Snowflake Data Source for Apache Spark.

Azure Cosmosdb Spark ⭐ 194

Apache Spark Connector for Azure Cosmos DB

⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Sparkrdma ⭐ 191

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

Vn.vitk ⭐ 189

A Vietnamese Text Processing Toolkit

Spark.jl ⭐ 180

Julia binding for Apache Spark

Whylogs Java ⭐ 179

Profile and monitor your ML data pipeline end-to-end

Awesome Ai Infrastructures ⭐ 171

Infrastructures™ for Machine Learning Training/Inference in Production.

Learning Hadoop And Spark ⭐ 160

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

Spark Authorizer ⭐ 158

A Spark SQL extension which provides SQL Standard Authorization for Apache Spark

Spark Operator ⭐ 155

Operator for managing the Spark clusters on Kubernetes and OpenShift.

Bigdata Playground ⭐ 154

A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL

Spark Ext ⭐ 147

Spark Extension: ML transformers, SQL aggregations, etc. that are missing in Apache Spark

Dbscan On Spark ⭐ 146

An implementation of DBSCAN running on top of Apache Spark

Spark On Lambda ⭐ 144

Apache Spark on AWS Lambda

A recommender system for discovering GitHub repos, built with Apache Spark

Sparknotebook ⭐ 142

An example of running Apache Spark using Scala in an IPython notebook

Pyspark Cheatsheet ⭐ 140

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

Sansa Stack ⭐ 139

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

Hydrograph ⭐ 138

A visual ETL development and debugging tool for big data

1-100 of 583 search results
