kamu - a new-generation data management and transformation tool!
kamu is a reference implementation of Open Data Fabric - a Web 3.0 technology that powers a distributed structured data supply chain for providing timely, high-quality, and verifiable data for data science, smart contracts, the web, and applications.
With kamu you can become a member of the world's first peer-to-peer data pipeline that:
- Connects publishers and consumers of data worldwide.
- Enables effective collaboration of people around data transformation and cleaning.
- Ensures data propagates with minimal latency.
- Provides the most complete, secure, and fully accurate lineage and provenance information on where every piece of data came from and how it was produced.
- Guarantees reproducibility of all data workflows.
Our documentation is still evolving, so many topics (those without links) are not covered yet. Answers to most questions about the theory, however, can be found in the ODF specification.
- First Steps
- Exporting Data
- Transformation model
- Supported Engines
- Streaming Aggregations
- Temporal Table Joins
- Stream-to-Stream Joins
- Geo-Spatial Data
- Schema Evolution
- Adding / deprecating columns
- Upstream schema changes
- Backwards incompatible changes
- Root Dataset Evolution
- Handling source URL changes
- Handling upstream format changes
- Derivative Dataset Evolution
- Handling upstream changes
- Evolving transformations
- Handling Bad Data
- Corrections and compensations
- Bad data upon ingestion
- Bad data in upstream datasets
- PII and sensitive data
- Exploring Data
For Data Publishers
- Create and share your own dataset with the world
- Ingest any existing data set from the web
- Easily keep track of any updates to the data source in the future
- Close the feedback loop and see who uses your data and how
For Data Professionals
- Collaborate on cleaning and improving data of existing datasets
- Create derivative datasets by transforming, enriching, and summarizing data others have published
- Write a query once and run it forever with one of our state-of-the-art stream processing engines
- Always stay up-to-date by pulling the latest updates from the data sources with just one command
- Built-in support for GIS data
For Data Consumers
- Download a dataset from a shared repository
- Easily verify that all data comes from trusted sources
- Audit the chain of transformations this data went through
- Validate that downloaded data was in fact produced by the declared transformations
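
The last two points can be illustrated with a minimal Python sketch. It is not kamu's actual verification mechanism or API; the function names (`data_hash`, `to_celsius`, `verify`) and the sample records are hypothetical, chosen only to show the general idea: check the downloaded data against a recorded hash, then replay the declared transformation on the source data and compare the result.

```python
import hashlib
import json


def data_hash(records):
    """Return a stable SHA-256 digest of a list of JSON-serializable records."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def to_celsius(records):
    """Hypothetical declared transformation: Fahrenheit readings to Celsius."""
    return [
        {"city": r["city"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in records
    ]


def verify(source, claimed_output, recorded_hash, transform):
    """Check the download against its recorded hash, then replay the transformation."""
    if data_hash(claimed_output) != recorded_hash:
        return False  # the data differs from what was originally recorded
    # Replaying the declared transformation must reproduce the same output.
    return transform(source) == claimed_output


# Hypothetical sample data standing in for a root dataset.
source = [{"city": "Vancouver", "temp_f": 50.0}, {"city": "Kyiv", "temp_f": 68.0}]
published = to_celsius(source)
recorded = data_hash(published)

ok = verify(source, published, recorded, to_celsius)

# A tampered copy no longer matches the recorded hash.
tampered = [dict(published[0], temp_c=99.0), published[1]]
tampered_ok = verify(source, tampered, recorded, to_celsius)

print(ok, tampered_ok)
```

The sketch assumes the transformation is deterministic; without that property, replaying it could not be used to validate the published result.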
For Data Exploration
- Explore data and run ad-hoc SQL queries (backed by the power of Apache Spark)
- Launch a Jupyter notebook with one command
- Join, filter, and shape your data using SQL
- Visualize the result using your favorite library
Project Status Disclaimer
kamu is alpha-quality software. Our main goal currently is to demonstrate the potential of the Open Data Fabric protocol and its transformative properties to the community and the industry, and to validate our ideas.
Naturally, we don't recommend using kamu for any critical tasks - it's definitely not prod-ready. We are, however, absolutely delighted to use kamu for our personal data analytics needs and small projects, and hope you will enjoy it too.
If you do, simply make sure to maintain your source data separately and don't rely on kamu for data storage. This way, any time a new version breaks compatibility, you can simply delete your kamu workspace and re-create it from scratch in a matter of seconds.
Also, please be patient with current performance and resource usage. We fully realize that waiting 15s to process a few KiB of CSV isn't great. Stream processing is a relatively new area, and the data processing engines kamu uses (e.g. Apache Spark and Flink) are tailored to run in large clusters, not on a laptop. They take a lot of resources just to boot up, so the start-stop-continue nature of kamu's transformations is at odds with their design. We hope that the industry will recognize our use case, and we expect to see better support for it in the future. We are committed to improving performance significantly in the near future.