Awesome Open Source
Search results for data pipeline
212 search results found
Airflow
⭐
34,468
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Airbyte
⭐
12,918
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Dolphinscheduler
⭐
11,613
Apache DolphinScheduler is a modern data orchestration platform for quickly building high-performance workflows with low code.
Dagster
⭐
9,467
An orchestration platform for the development, production, and observation of data assets.
Snowplow
⭐
6,677
The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
Mage Ai
⭐
6,324
🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.
Kestra
⭐
5,257
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
Unstructured
⭐
4,404
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Orchest
⭐
3,876
Build data pipelines, the easy way 🛠️
Rudder Server
⭐
3,841
Privacy and Security focused Segment-alternative, in Golang and React
Memphis
⭐
3,078
Memphis.dev is a highly scalable and effortless data streaming platform
Data Engineering Howto
⭐
2,949
A list of useful resources to learn Data Engineering from scratch
Whylogs
⭐
2,533
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
Fluvio
⭐
2,373
Lean and mean distributed stream processing system written in Rust and WebAssembly.
Elementary
⭐
1,721
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
Go Streams
⭐
1,656
A lightweight stream processing library for Go
Doit
⭐
1,590
task management & automation tool
Bitsail
⭐
1,514
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Mleap
⭐
1,479
MLeap: Deploy ML Pipelines to Production
Meltano
⭐
1,460
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Data Science On Gcp
⭐
1,249
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Odd Platform
⭐
1,047
The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
Data Engineering Wiki
⭐
934
The best place to learn data engineering. Built and maintained by the data engineering community.
Klio
⭐
822
Smarter data pipelines for audio.
Dataform
⭐
757
Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift
Optimus
⭐
707
Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.
Dataengineeringproject
⭐
644
Example end to end data engineering project.
Covalent
⭐
608
Pythonic tool for running machine-learning/high performance/quantum-computing workflows in heterogeneous environments.
Awesome Kafka
⭐
549
A list about Apache Kafka
Transfer
⭐
495
Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.
Piperider
⭐
443
Code review for data in dbt
Tributary
⭐
424
Streaming reactive and dataflow graphs in Python
Versatile Data Kit
⭐
389
One framework to develop, deploy and operate data workflows with Python and SQL.
Zdh_web
⭐
379
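Big data collection and extraction platform. zdh_web is the visual management platform for the zdh series of services, covering data collection, scheduling, permissions, and approval.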
Seatunnel Web
⭐
365
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Conduit
⭐
321
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
Nonechucks
⭐
315
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Dbt Data Reliability
⭐
304
Data anomalies monitoring as dbt tests and dbt artifacts uploader.
Recap
⭐
292
Work with your web service, database, and streaming schemas in a single format.
Cuelake
⭐
266
Use SQL to build ELT pipelines on a data lakehouse.
Augraphy
⭐
258
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Gusty
⭐
202
Making DAG construction easier
Feldera
⭐
199
Feldera Continuous Analytics Platform
Practical Data Engineering
⭐
191
Real estate dagster pipeline
Flupy
⭐
182
Fluent data pipelines for python and your shell
Scicloj.ml
⭐
176
A Clojure machine learning library
Mobydq
⭐
175
🐳 Tool to automate data quality checks on data pipelines
Pureml
⭐
174
Developer platform for production ML.
Dataplane
⭐
171
Dataplane is an Airflow inspired unified data platform with additional data mesh and RPA capability to automate, schedule and design data pipelines and workflows. Dataplane is written in Golang with a React front end.
Awesome Kubeflow
⭐
169
A curated list of awesome projects and resources related to Kubeflow (a CNCF incubating project)
Datajoint Python
⭐
158
Relational data pipelines for the science lab
Dud
⭐
158
A lightweight CLI tool for versioning data alongside source code and building data pipelines.
Scalable Data Science Platform
⭐
153
Content for architecting a data science platform for products using Luigi, Spark & Flask.
Aws Pdf Textract Pipeline
⭐
148
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript
Core
⭐
138
An open source PHP reporting framework that helps you write perfect data reports or construct awesome dashboards in PHP. Works with all PHP versions from 5.6 to 8.0. Fully compatible with MVC frameworks such as Laravel, CodeIgniter, and Symfony.
Atom
⭐
137
Automated Tool for Optimized Modelling
Public Datasets Pipelines
⭐
131
Cloud-native, data onboarding architecture for Google Cloud Datasets
Watchmen Matryoshka Doll
⭐
124
Watchmen Platform is a low-code data platform for data pipelines, metadata management, analysis, and quality management
Patterns Devkit
⭐
101
Data pipelines from re-usable components
Datajob
⭐
99
Build and deploy a serverless data pipeline on AWS with no effort.
Thedataengineeringbook
⭐
96
The Data Engineering Book - a data engineering book by Thai people, for Thai people
Premier League
⭐
88
A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.
Ob_bulkstash
⭐
87
Bulk Stash is a Docker rclone service to sync, or copy, files between different storage services. For example, you can copy files to or from a remote storage service, such as from Amazon S3 to Google Cloud Storage, or locally from your laptop to remote storage.
Smart Data Lake
⭐
87
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Tensorpipe
⭐
86
High-performance TensorFlow data pipeline with state-of-the-art augmentations and low-level optimizations.
Udacity Data Eng Proj 1
⭐
81
Developed a data pipeline to automate data warehouse ETL by building custom Airflow operators that handle the extraction, transformation, validation, and loading of data from S3 -> Redshift -> S3
Datacater
⭐
80
The developer-friendly ETL platform for transforming data in real-time. Based on Apache Kafka® and Kubernetes®.
Hookah
⭐
78
A cross-platform tool for data pipelines.
Pansori
⭐
74
Tools for ASR Corpus Generation from Online Video
Jayvee
⭐
68
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Hoptimator
⭐
68
Multi-hop declarative data pipelines
Delta Architecture
⭐
66
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
Spark
⭐
65
Open Source D-APM (Data-Application Performance Monitoring) for Apache Spark
Beneath
⭐
64
Beneath is a serverless real-time data platform ⚡️
Udacity Data Engineer Nanodegree
⭐
64
Classwork projects and homework completed for the Udacity Data Engineering Nanodegree
Sqlpipe
⭐
52
SQLpipe makes it easy to move the result of one query from one database to another.
Dc Sdk Js
⭐
50
A data collection SDK for the browser environment
Serverless Data Pipeline Sam
⭐
50
Serverless Data Pipeline powered by Kinesis Firehose, API Gateway, Lambda, S3, and Athena
Dbt Snowplow Web
⭐
47
A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables of varying levels of aggregation.
Datapipelines Essentials Python
⭐
45
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Streams Explorer
⭐
43
Explore Apache Kafka data pipelines in Kubernetes.
Trembita
⭐
43
Model complex data transformation pipelines easily
Sqrl
⭐
41
Compiler for streaming data pipelines and data microservices with configurable engines.
Terraform Aws Efs Backup
⭐
41
Terraform module designed to easily back up EFS filesystems to S3 using DataPipeline
Pipeline
⭐
40
OONI data processing pipeline
Typestream
⭐
39
⚡️ Next-generation data transformation framework for TypeScript that puts developer experience first
Ml In Production
⭐
39
The practical use-cases of how to make your Machine Learning Pipelines robust and reliable using Apache Airflow.
Mycelial
⭐
39
Move your Edge data with ease.
Conductor Python
⭐
38
Conductor OSS SDK for Python programming language
Datatap Python
⭐
37
Focus on Algorithm Design, Not on Data Wrangling
Spark Transformers
⭐
37
Spark-Transformers: Library for exporting Apache Spark MLlib models to use them in any Java application with no other dependencies.
Stairs
⭐
35
Framework which helps you to make parallel/distributed calculations using data pipelines
Didact Engine
⭐
34
The REST API and execution engine for the Didact Platform.
Pandas To Postgres
⭐
33
Copy Pandas DataFrames and HDF5 files to PostgreSQL database
Feagen
⭐
33
(deprecated) A fast and memory-efficient Python data engineering framework for machine learning.
Blast
⭐
31
Blast is a data orchestration tool that can run SQL and Python against Google BigQuery and Snowflake. It supports templating with Jinja, data quality tests, query validation, environment management and more.
Mldotnet Real Time Data Streaming Workshop
⭐
31
A Machine Learning and Real-Time Data Analytics Workshop
Tf2 Tutorial
⭐
30
TensorFlow 2 tutorials (use TensorFlow and Keras in a better way!)
Awesome Public Dbt Projects
⭐
30
A curated list of awesome public dbt projects
Debussy_concert
⭐
29
Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.
1-100 of 212 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.