= Awesome Open-Source Data Engineering
This https://github.com/topics/awesome-list[Awesome List] aims to provide an overview of https://opensource.org/licenses[open-source] projects related to data engineering.
This is a community effort: please https://github.com/gunnarmorling/awesome-opensource-data-engineering/blob/master/CONTRIBUTING.md[contribute] and send pull requests to help grow this list!
For a list including non-OSS tools, see this amazing https://github.com/igorbarinov/awesome-data-engineering[Awesome List].
https://spark.apache.org/[Apache Spark] - A unified analytics engine for large-scale data processing. Includes APIs in Scala, Java, Python (known as PySpark), and R (SparkR).
https://beam.apache.org/[Apache Beam] - An open-source implementation of the Google Dataflow programming model. Provides capabilities for batch and streaming data processing jobs that run on any supported execution engine, including Spark, Flink, or its own DirectRunner. Supports APIs in Java, Python, and Go.
https://flink.apache.org/[Apache Flink] - Stateful computations over data streams.
https://trino.io/[Trino (formerly known as PrestoSQL)] - Distributed SQL Query Engine for Big Data.
== Business Intelligence
https://superset.incubator.apache.org/[Apache Superset] - A modern, enterprise-ready business intelligence web application.
https://gethue.com/[HUE] - The Hadoop User Interface. Similar to Superset, but interfaces with RDBMS, Hive, Impala, HBase, Spark, HDFS & S3, Oozie, Pig, YARN Job Explorer, and more. Offers an extensible Django environment for custom app integration.
https://www.metabase.com/[Metabase] - An easy way for everyone in your company to ask questions and learn from data.
https://redash.io/[Redash] - All the tools to unlock your data.
== Change Data Capture
== Data Governance and Registries
== Data Virtualization
== Data Orchestration
https://github.com/Alluxio/alluxio[Alluxio] - Scalable, multi-tiered distributed caching for HDFS, S3, Ceph, NFS, and related filestores. Provides integrations for SQL queries into a Catalog from Spark, Hive, and Presto.
https://avro.apache.org/[Apache Avro] - A data serialization system.
https://parquet.apache.org/[Apache Parquet] - A columnar storage format.
https://orc.apache.org/[Apache ORC] - Another columnar storage format.
https://thrift.apache.org/[Apache Thrift] - Data type and service interface definitions and code generator.
https://arrow.apache.org/[Apache Arrow] - A cross-language development platform for in-memory data. It specifies a standardized, language-independent, columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy IPC and streaming messaging.
https://capnproto.org/[Cap’n Proto] - A data interchange format and capability-based RPC system.
https://msgpack.org/index.html[MessagePack] - An efficient binary serialization format. Like JSON, it lets you exchange data among multiple languages.
https://developers.google.com/protocol-buffers[Protocol Buffers] - Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.
== Messaging Infrastructure
== Specifications and Standards
== Stream Processing
== Workflow Management
== Related Resources
This section lists overview content only, not specific tools.
=== Slide Decks, Recordings and Podcasts
=== Blog Posts and Articles
The contents of this repository are licensed under the "Creative Commons Attribution-ShareAlike 4.0 International License".