Ballista

Experimental Distributed Compute Platform based on Kubernetes and Apache Arrow

Overview

Ballista is an experimental distributed compute platform based on Kubernetes and Apache Arrow that I am developing in my spare time as a way to learn more about distributed data processing. It is largely inspired by Apache Spark.

Ballista aims to be language-agnostic with an architecture that is capable of supporting any language supported by Apache Arrow, which currently includes C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

Ballista Goals

  • Define a logical query plan in protobuf format. See ballista.proto
  • Provide DataFrame style interfaces for JVM (Java, Kotlin, Scala), Rust, and Python
  • Provide a JDBC Driver, allowing Ballista to be used from existing BI and SQL tools
  • Use Apache Arrow Flight for sending query plans between nodes, and for streaming results between nodes
  • Allow clusters to be created, consisting of executors implemented in any language that supports Flight
  • Distributed compute jobs should be capable of invoking code in more than one language (with some performance trade-offs for IPC overhead)
  • Provide integrations with Apache Spark (e.g. Spark V2 Data Source allowing Spark to interact with Ballista)
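To make the first two goals more concrete, here is a minimal Rust sketch of how a DataFrame-style interface can build a logical query plan as a tree of operators. All of the names (`Expr`, `LogicalPlan`, `DataFrame`, `explain`) and the `trips.csv` path are assumptions made for illustration; they are not taken from ballista.proto or from any Ballista API, and a real plan would be serialized to protobuf rather than printed.

```rust
// Hypothetical sketch: these types are illustrative, not Ballista's API.

/// A tiny expression language for predicates.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    LiteralInt(i64),
    Gt(Box<Expr>, Box<Expr>),
}

/// A logical plan is a tree: each operator wraps the plan that feeds it.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalPlan {
    Scan { path: String },
    Filter { input: Box<LogicalPlan>, predicate: Expr },
    Projection { input: Box<LogicalPlan>, columns: Vec<String> },
}

/// DataFrame-style wrapper: every method returns a new DataFrame
/// whose plan has one more operator on top.
#[derive(Debug, Clone)]
pub struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    pub fn scan(path: &str) -> Self {
        DataFrame { plan: LogicalPlan::Scan { path: path.to_string() } }
    }

    pub fn filter(self, predicate: Expr) -> Self {
        DataFrame {
            plan: LogicalPlan::Filter { input: Box::new(self.plan), predicate },
        }
    }

    pub fn select(self, columns: &[&str]) -> Self {
        DataFrame {
            plan: LogicalPlan::Projection {
                input: Box::new(self.plan),
                columns: columns.iter().map(|c| c.to_string()).collect(),
            },
        }
    }

    pub fn plan(&self) -> &LogicalPlan {
        &self.plan
    }
}

/// Render the plan as an indented tree, outermost operator first.
pub fn explain(plan: &LogicalPlan, depth: usize) -> String {
    let pad = "  ".repeat(depth);
    match plan {
        LogicalPlan::Scan { path } => format!("{}Scan: {}\n", pad, path),
        LogicalPlan::Filter { input, predicate } => {
            format!("{}Filter: {:?}\n{}", pad, predicate, explain(input, depth + 1))
        }
        LogicalPlan::Projection { input, columns } => {
            format!("{}Projection: {}\n{}", pad, columns.join(", "), explain(input, depth + 1))
        }
    }
}
```

Calling `DataFrame::scan("trips.csv").filter(...).select(&["fare"])` executes nothing; it only builds a plan value. That separation is what would allow a client in one language to hand the plan to an executor written in another.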

Ballista Anti Goals

  • Ballista is not intended to replace Apache Spark but to augment it

Status

I learned a lot from the initial PoC (see below for a demo and more info), but I have decided to restart the project because of the changes in scope mentioned above. As a result, the project is in a state of flux and nothing works right now. I am in the process of building a second PoC.

Here is a rough plan for delivering PoC #2:

  • [ ] Implement a Rust server implementing the Flight protocol that can receive, validate, and execute a logical plan (in progress)
  • [ ] Implement a Kotlin DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • [ ] Implement a Rust DataFrame client that can build a plan and execute it against the Rust server (in progress)
  • [ ] Implement a JDBC driver that can execute a SQL statement against the Rust server (in progress)
  • [ ] Implement a Scala server implementing the Flight protocol that can receive a logical plan, translate it to Spark, and execute it
  • [ ] Build a benchmark client in Kotlin that can run against the Rust and Scala servers
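As a rough idea of what "validate" could mean for the server in this plan, the sketch below walks a plan tree and checks that every referenced table and column exists before anything is executed. The `Plan` enum, the `Catalog` type, and the `validate` function are hypothetical names invented for this example; the checks PoC #2 will actually perform are not specified here.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a minimal plan type, not the one from ballista.proto.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String },
    Projection { input: Box<Plan>, columns: Vec<String> },
}

/// Assumed catalog shape: table name mapped to its column names.
pub type Catalog = HashMap<String, Vec<String>>;

/// Validate a plan bottom-up.
/// Returns the plan's output columns, or a message describing the first problem found.
pub fn validate(plan: &Plan, catalog: &Catalog) -> Result<Vec<String>, String> {
    match plan {
        Plan::Scan { table } => catalog
            .get(table)
            .cloned()
            .ok_or_else(|| format!("unknown table: {}", table)),
        Plan::Projection { input, columns } => {
            // The input must be valid before its columns can be checked.
            let available = validate(input, catalog)?;
            for column in columns {
                if !available.contains(column) {
                    return Err(format!("unknown column: {}", column));
                }
            }
            Ok(columns.clone())
        }
    }
}
```

Rejecting a bad plan at this stage keeps errors close to the client that built it, instead of surfacing them part-way through a distributed job.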

PoC #1

This demo shows a Ballista cluster being created in Minikube and the nyctaxi example being run against it. The example triggers a distributed query: each executor pod performs an aggregate query on one partition of the data, and the driver then merges the results and runs a secondary aggregate query to produce the final result.

[asciicast recording of the demo]
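The two-phase pattern the demo relies on (a partial aggregate per partition, then a merge on the driver) can be sketched in plain Rust. This is an illustration only: threads stand in for executor pods, the aggregate is assumed to be a per-group average, and the names `Partial`, `aggregate_partition`, `merge_partials`, and `run_distributed` are made up for this example rather than taken from the PoC.

```rust
use std::collections::HashMap;
use std::thread;

/// Partial aggregate state for one group: running sum and row count.
/// An average cannot be merged directly, but (sum, count) pairs can.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Partial {
    pub sum: f64,
    pub count: u64,
}

/// "Executor" step: aggregate one partition of (group key, value) rows.
pub fn aggregate_partition(rows: &[(u32, f64)]) -> HashMap<u32, Partial> {
    let mut out = HashMap::new();
    for &(key, value) in rows {
        let entry = out.entry(key).or_insert(Partial { sum: 0.0, count: 0 });
        entry.sum += value;
        entry.count += 1;
    }
    out
}

/// "Driver" step: merge the partial states, then run the secondary
/// aggregate that turns each (sum, count) into a final average.
pub fn merge_partials(partials: Vec<HashMap<u32, Partial>>) -> HashMap<u32, f64> {
    let mut merged: HashMap<u32, Partial> = HashMap::new();
    for partial in partials {
        for (key, state) in partial {
            let entry = merged.entry(key).or_insert(Partial { sum: 0.0, count: 0 });
            entry.sum += state.sum;
            entry.count += state.count;
        }
    }
    merged
        .into_iter()
        .map(|(key, state)| (key, state.sum / state.count as f64))
        .collect()
}

/// Run each partition on its own thread (a stand-in for an executor pod),
/// then merge the results on the calling thread (the driver).
pub fn run_distributed(partitions: Vec<Vec<(u32, f64)>>) -> HashMap<u32, f64> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|rows| thread::spawn(move || aggregate_partition(&rows)))
        .collect();
    let partials = handles
        .into_iter()
        .map(|handle| handle.join().expect("executor thread panicked"))
        .collect();
    merge_partials(partials)
}
```

The driver only ever sees one small `Partial` per group per partition, which is why this shape scales: the bulk of the rows never leave the executor that read them.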

Here are the commands being run, with some explanation:

# create a cluster with 12 executors
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

# check status
kubectl get pods

# run the nyctaxi example application, which executes queries using the executors
cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

# check status again to find the name of the application pod
kubectl get pods

# watch progress of the application
kubectl logs -f ballista-nyctaxi-app-n5kxl

Note that PoC #1 is now archived here.

Contributing

See CONTRIBUTING.md for information on contributing to this project.
