Awesome Open Source

Kamu


Welcome to kamu - a new-generation data management and transformation tool!

About

kamu is a reference implementation of Open Data Fabric - a Web 3.0 technology that powers a distributed structured data supply chain for providing timely, high-quality, and verifiable data for data science, smart contracts, web and applications.

Open Data Fabric

Using kamu you can become a member of the world's first peer-to-peer data pipeline that:

  • Connects publishers and consumers of data worldwide.
  • Enables effective collaboration of people around data transformation and cleaning.
  • Ensures data propagates with minimal latency.
  • Provides the most complete, secure, and fully accurate lineage and provenance information on where every piece of data came from and how it was produced.
  • Guarantees reproducibility of all data workflows.

Documentation

Our documentation is still evolving, so many topics (those without links) have not been covered yet. Answers to most questions around theory, however, can be found in the ODF specification.

Learning Materials

Kamu 101 - First Steps

Features

  • For Data Publishers

    • Create and share your own dataset with the world
    • Ingest any existing data set from the web
    • Easily keep track of any updates to the data source in the future
    • Close the feedback loop and see who uses your data and how
  • For Data Professionals

    • Collaborate on cleaning and improving data of existing datasets
    • Create derivative datasets by transforming, enriching, and summarizing data others have published
    • Write a query once and run it forever with one of our state-of-the-art stream processing engines
    • Always stay up-to-date by pulling the latest updates from the data sources with just one command
    • Built-in support for GIS data
  • For Data Consumers

    • Download a dataset from a shared repository
    • Easily verify that all data comes from trusted sources
    • Audit the chain of transformations this data went through
    • Validate that downloaded data was in fact produced by the declared transformations
  • For Data Exploration

    • Explore data and run ad-hoc SQL queries (backed by the power of Apache Spark)
    • Launch a Jupyter notebook with one command
    • Join, filter, and shape your data using SQL
    • Visualize the result using your favorite library
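
The "For Data Consumers" items above mention validating that downloaded data was in fact produced by the declared transformations. The sketch below illustrates only the general idea: re-run a deterministic transformation on the source data and compare content hashes with what was downloaded. It is not kamu's implementation or the ODF algorithm. The function names, the sample records, and the canonical-JSON hashing scheme are all assumptions made for illustration.

```python
import hashlib
import json


def content_hash(records):
    """Hash records in a canonical JSON form (assumed scheme, not ODF's)."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def declared_transform(records):
    """Hypothetical deterministic transformation: total amount per city."""
    totals = {}
    for row in records:
        totals[row["city"]] = totals.get(row["city"], 0) + row["amount"]
    # Sort by city so the output order does not depend on input order.
    return [{"city": city, "total": total} for city, total in sorted(totals.items())]


def verify_derivative(source_records, downloaded_records, transform):
    """Re-run the declared transformation and compare hashes with the download."""
    expected = content_hash(transform(source_records))
    return expected == content_hash(downloaded_records)


# Made-up sample data standing in for a published source dataset.
source = [
    {"city": "Vancouver", "amount": 10},
    {"city": "Seattle", "amount": 5},
    {"city": "Vancouver", "amount": 7},
]

honest = declared_transform(source)

# Simulate a derivative dataset that was altered after it was produced.
tampered = [dict(row) for row in honest]
tampered[0]["total"] += 1

print(verify_derivative(source, honest, declared_transform))
print(verify_derivative(source, tampered, declared_transform))
```

The check relies on the transformation being deterministic: if the same inputs could yield different outputs, a hash mismatch would not tell you whether the data had been altered.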

Project Status Disclaimer

kamu is alpha-quality software. Our main goal currently is to demonstrate the potential of the Open Data Fabric protocol and its transformative properties to the community and the industry, and to validate our ideas.

Naturally, we don't recommend using kamu for any critical tasks - it's definitely not prod-ready. We are, however, absolutely delighted to use kamu for our personal data analytics needs and small projects, and we hope you will enjoy it too.

If you do, make sure to maintain your source data separately and don't rely on kamu for data storage. This way, any time a new version breaks compatibility, you can simply delete your kamu workspace and re-create it from scratch in a matter of seconds.

Also, please be patient with current performance and resource usage. We fully realize that waiting 15s to process a few KiB of CSV isn't great. Stream processing is a relatively new area, and the data processing engines kamu uses (e.g. Apache Spark and Flink) are tailored to run in large clusters, not on a laptop. They take a lot of resources just to boot up, so the start-stop-continue nature of kamu's transformations is at odds with their design. We hope the industry will recognize our use case, and we expect to see better support for it in the future. We are committed to improving performance significantly in the near term.

