Ballista is an experimental distributed compute platform based on Kubernetes and Apache Arrow that I am developing in my spare time as a way to learn more about distributed data processing. It is largely inspired by Apache Spark.
I learned a lot from the initial PoC (see below for a demo and more info), but I have decided to start the project again because its scope has changed. As a result, the project is currently in a state of flux and nothing works right now. I am in the process of building a second PoC.
Here is a rough plan for delivering PoC #2:
This demo shows a Ballista cluster being created in Minikube, followed by the nyctaxi example being executed. The example runs a distributed query in the cluster: each executor pod performs an aggregate query on one partition of the data, then the driver merges the results and runs a secondary aggregate query to get the final result.
Here are the commands being run, with some explanation:

    # create a cluster with 12 executors
    cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

    # check status
    kubectl get pods

    # run the nyctaxi example application, that executes queries using the executors
    cargo run --bin ballista -- run --name nyctaxi --template examples/nyctaxi/templates/application.yaml

    # check status again to find the name of the application pod
    kubectl get pods

    # watch progress of the application
    kubectl logs -f ballista-nyctaxi-app-n5kxl

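The two-phase aggregation described for the demo (a per-partition aggregate on each executor, then a merge and secondary aggregate on the driver) can be sketched in plain Rust. This is only an illustration under stated assumptions, not Ballista's actual code: the `Row` layout, the `Partial` struct, and the `aggregate_partition` and `merge_partials` functions are hypothetical names, and the choice of count, sum, and max as the aggregates is an assumption made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical row layout for illustration: (passenger_count, fare_amount).
pub type Row = (u32, f64);

/// Partial aggregate state for one group. It keeps count and sum rather than
/// a mean, because means from different partitions cannot be merged directly.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Partial {
    pub count: u64,
    pub sum: f64,
    pub max: f64,
}

impl Partial {
    fn empty() -> Self {
        Partial { count: 0, sum: 0.0, max: f64::NEG_INFINITY }
    }

    /// Final mean, computed only after all partials have been merged.
    pub fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Phase 1 (assumed to run on each executor): aggregate a single partition,
/// grouping rows by their key.
pub fn aggregate_partition(rows: &[Row]) -> BTreeMap<u32, Partial> {
    let mut groups: BTreeMap<u32, Partial> = BTreeMap::new();
    for &(key, fare) in rows {
        let entry = groups.entry(key).or_insert_with(Partial::empty);
        entry.count += 1;
        entry.sum += fare;
        entry.max = entry.max.max(fare);
    }
    groups
}

/// Phase 2 (assumed to run on the driver): merge the per-partition results
/// by applying a secondary aggregate over the partial states.
pub fn merge_partials(partials: &[BTreeMap<u32, Partial>]) -> BTreeMap<u32, Partial> {
    let mut merged: BTreeMap<u32, Partial> = BTreeMap::new();
    for partial in partials {
        for (&key, p) in partial {
            let entry = merged.entry(key).or_insert_with(Partial::empty);
            entry.count += p.count;
            entry.sum += p.sum;
            entry.max = entry.max.max(p.max);
        }
    }
    merged
}
```

To try it, call `aggregate_partition` once per slice of rows, collect the resulting maps, and pass them to `merge_partials`; the merged map then holds one `Partial` per group, from which `mean()` gives the overall average. Carrying count and sum through the first phase is what makes the second phase a simple, order-independent merge.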
Note that PoC #1 is now archived here.
See CONTRIBUTING.md for information on contributing to this project.