Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Arrow | 11,825 | 493 | 1,169 | 7 hours ago | 38 | May 06, 2022 | 3,546 | apache-2.0 | C++ | |
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing | ||||||||||
Cudf | 5,550 | 7 hours ago | 24 | August 18, 2022 | 848 | apache-2.0 | C++ | |||
cuDF - GPU DataFrame Library | ||||||||||
Feather | 2,418 | 201 | 31 | 2 years ago | 8 | April 27, 2020 | 4 | apache-2.0 | JavaScript | |
Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow | ||||||||||
Ballista | 2,244 | 13 | 2 years ago | 4 | May 10, 2020 | apache-2.0 | ||||
Distributed compute platform implemented in Rust, and powered by Apache Arrow. | ||||||||||
Influxdb_iox | 1,659 | 9 hours ago | 471 | apache-2.0 | Rust | |||||
Pronounced (influxdb eye-ox), short for iron oxide. This is the new core of InfluxDB written in Rust on top of Apache Arrow. | ||||||||||
Transform | 958 | 5 days ago | 46 | apache-2.0 | Python | |||||
Input pipeline framework | ||||||||||
Ballista | 411 | 3 years ago | 32 | apache-2.0 | Rust | |||||
Experimental Distributed Compute Platform based on Kubernetes and Apache Arrow | ||||||||||
Parquet Cpp | 312 | 5 years ago | apache-2.0 | C++ | ||||||
Apache Parquet | ||||||||||
Rust Dataframe | 250 | 3 years ago | 12 | apache-2.0 | Rust | |||||
A Rust DataFrame implementation, built on Apache Arrow | ||||||||||
Fletcher | 227 | 1 | 3 months ago | 16 | January 17, 2021 | mit | Python | |||
Pandas ExtensionDType/Array backed by Apache Arrow |
Ballista is an experimental distributed compute platform based on Kubernetes and Apache Arrow that I am developing in my spare time as a way to learn more about distributed data processing. It is largely inspired by Apache Spark.
Ballista aims to be language-agnostic with an architecture that is capable of supporting any language supported by Apache Arrow, which currently includes C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
I learned a lot from the initial PoC (see below for a demo and more info) but have decided to start the project again due to the changes in scope mentioned above. As a result, the project is currently in a state of flux and nothing works right now, but I am in the process of building a second PoC.
Here is a rough plan for delivering PoC #2:
This demo shows a Ballista cluster being created in Minikube, followed by the nyctaxi example being executed. The example runs a distributed query in the cluster: each executor pod performs an aggregate query on one partition of the data, and the driver then merges the results and runs a secondary aggregate query to get the final result.
Here are the commands being run, with some explanation:
```bash
# create a cluster with 12 executors
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

# check status
kubectl get pods

# run the nyctaxi example application, which executes queries using the executors
cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

# check status again to find the name of the application pod
kubectl get pods

# watch progress of the application (the pod name suffix will differ on each run)
kubectl logs -f ballista-nyctaxi-app-n5kxl
```
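The two-stage aggregation described for the demo (a partial aggregate per partition, then a merge and secondary aggregate on the driver) can be illustrated with a small in-process sketch. This is not Ballista's code: the `Partial` struct, the `aggregate_partition` and `merge_partials` functions, and the sample rows are all hypothetical, and plain function calls stand in for the executor pods.

```rust
use std::collections::BTreeMap;

/// Hypothetical partial aggregate state for one group.
/// It keeps enough information (sum and count) to merge exactly later.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partial {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

impl Partial {
    fn new(v: f64) -> Self {
        Partial { min: v, max: v, sum: v, count: 1 }
    }

    /// Fold one more value into this state.
    fn update(&mut self, v: f64) {
        self.min = self.min.min(v);
        self.max = self.max.max(v);
        self.sum += v;
        self.count += 1;
    }

    /// Combine with the state computed on another partition.
    fn merge(&mut self, other: &Partial) {
        self.min = self.min.min(other.min);
        self.max = self.max.max(other.max);
        self.sum += other.sum;
        self.count += other.count;
    }

    fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// "Executor" side: aggregate one partition of (group_key, value) rows.
fn aggregate_partition(rows: &[(u32, f64)]) -> BTreeMap<u32, Partial> {
    let mut out: BTreeMap<u32, Partial> = BTreeMap::new();
    for &(key, value) in rows {
        out.entry(key)
            .and_modify(|p| p.update(value))
            .or_insert_with(|| Partial::new(value));
    }
    out
}

/// "Driver" side: merge the per-partition results into the final aggregate.
fn merge_partials(partials: &[BTreeMap<u32, Partial>]) -> BTreeMap<u32, Partial> {
    let mut out: BTreeMap<u32, Partial> = BTreeMap::new();
    for partial in partials {
        for (&key, p) in partial {
            out.entry(key)
                .and_modify(|acc| acc.merge(p))
                .or_insert(*p);
        }
    }
    out
}

fn main() {
    // Two made-up partitions of (group_key, value) rows.
    let partitions: Vec<Vec<(u32, f64)>> = vec![
        vec![(1, 10.0), (2, 7.5), (1, 4.0)],
        vec![(1, 22.0), (2, 3.5)],
    ];

    // Stage 1: each partition is aggregated independently.
    let partials: Vec<BTreeMap<u32, Partial>> =
        partitions.iter().map(|p| aggregate_partition(p)).collect();

    // Stage 2: the driver merges the partial results.
    let merged = merge_partials(&partials);
    for (key, p) in &merged {
        println!("group {}: min={} max={} mean={}", key, p.min, p.max, p.mean());
    }
}
```

Carrying `sum` and `count` rather than a per-partition mean is what makes the second stage exact: averaging the per-partition means would weight partitions of different sizes equally.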
Note that PoC #1 is now archived here.
See CONTRIBUTING.md for information on contributing to this project.