Data-Centric Pipelines and Data Versioning
Alternatives To Pachyderm
Project Name | Stars | Downloads / Repos Using This / Packages Using This / Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language | Description
Airflow | 33,623 | 32013 hours ago | 169 | November 27, 2023 | 881 | apache-2.0 | Python | Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Prefect | 14,110 | 115216 hours ago | 249 | December 08, 2023 | 653 | apache-2.0 | Python | Prefect is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines
Dagster | 9,690 | 213314 hours ago | 585 | December 07, 2023 | 2,412 | apache-2.0 | Python | An orchestration platform for the development, production, and observation of data assets.
Tpot | 9,422 | 4022 a day ago | 62 | August 15, 2023 | 285 | lgpl-3.0 | Python | A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
Great_expectations | 9,276 | 53 a day ago | 256 | December 08, 2023 | 194 | apache-2.0 | Python | Always know what to expect from your data.
Mage Ai | 6,587 | a day ago | 314 | December 06, 2023 | 207 | apache-2.0 | Python | 🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.
Pachyderm | 6,053 | 1 a day ago | 613 | December 04, 2023 | 899 | apache-2.0 | Go | Data-Centric Pipelines and Data Versioning
 | | a day ago | 26 | October 27, 2023 | 177 | apache-2.0 | Python | Turns Data and AI algorithms into production-ready web applications in no time.
 | | 9 months ago | 19 | December 13, 2022 | 125 | apache-2.0 | TypeScript | Build data pipelines, the easy way 🛠️
 | | 6 months ago | 20 | | | | | Open Source Data Science Resources.


Pachyderm – Automate data transformations with data versioning and lineage

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations across any type of data. Our unique approach provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking. Pachyderm delivers the ultimate CI/CD engine for data.


  • Data-driven pipelines automatically trigger based on detecting data changes.
  • Immutable data lineage with data versioning of any data type.
  • Autoscaling and parallel processing built on Kubernetes for resource orchestration.
  • Uses standard object stores for data storage with automatic deduplication.
  • Runs across all major cloud providers and on-premises installations.
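
As a rough illustration of the first bullet, a data-driven pipeline is declared by pointing it at an input repository; new commits to that repository are what trigger it. The sketch below builds such a declaration in Python and serializes it to JSON, assuming the JSON pipeline spec layout described in the Pachyderm documentation (pipeline.name, input.pfs.repo, input.pfs.glob, transform.image, transform.cmd). The helper build_pipeline_spec, the repository name "images", the image "example/edges:1.0", and the script path are hypothetical placeholders, not part of Pachyderm.

```python
import json


def build_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Return a pipeline spec as a plain dict (hypothetical helper).

    Field names are assumed to follow Pachyderm's documented JSON
    pipeline spec; check them against the official documentation.
    """
    return {
        "pipeline": {"name": name},
        "description": f"Runs whenever new data is committed to '{input_repo}'.",
        # The input repo is what makes the pipeline data-driven:
        # the glob pattern decides how files are split for parallel work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The transform is language-agnostic: any container image and command.
        "transform": {"image": image, "cmd": cmd},
    }


spec = build_pipeline_spec(
    name="edges",
    input_repo="images",            # hypothetical input repository
    image="example/edges:1.0",      # hypothetical container image
    cmd=["python3", "/edges.py"],   # hypothetical entry point
)
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The printed JSON could be saved to a file and submitted with the Pachyderm CLI; the exact command and any additional fields should be taken from the official documentation.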

Getting Started

To start deploying end-to-end, version-controlled data pipelines, run Pachyderm locally or deploy it on AWS, GCE, or Azure in about 5 minutes.

Our complete documentation covers tutorials, example projects, and the advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:


Official Documentation


Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack channel to get help from the Pachyderm team and other users.


To get started contributing, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! Our GitHub issues labeled "help-wanted" are a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people use Pachyderm and how to make it better. To disable them, set the environment variable METRICS to false in the pachd container.
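
One way to apply that setting on Kubernetes is a patch that adds the variable to the pachd container. The sketch below only builds such a patch body in Python and prints it as JSON. It assumes pachd runs as a Kubernetes Deployment whose container is named "pachd"; both are assumptions to verify against your own installation, and metrics_opt_out_patch is a hypothetical helper.

```python
import json


def metrics_opt_out_patch(container="pachd"):
    """Build a strategic-merge-style patch body that sets METRICS=false.

    The container name is an assumption; confirm it in your deployment.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            # Environment values are strings in Kubernetes.
                            "env": [{"name": "METRICS", "value": "false"}],
                        }
                    ]
                }
            }
        }
    }


patch = metrics_opt_out_patch()
patch_json = json.dumps(patch, indent=2)
print(patch_json)
```

The resulting JSON could then be applied with kubectl patch against the pachd deployment, or the same variable could be set through whatever deployment tooling you already use.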
