If you just want to get started and quickly start the demo in a few minutes, go to the quick start to setup the infrastructure (on GCP) and run the demo.
You can also check out the 20min video recording with a live demo: Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFLow.
There is also a blog post with more details about use cases for event streaming and streaming analytics in the automotive industry.
You want to see an IoT example at huge scale? Not just 100 or 1000 devices producing data, but a really scalable demo with millions of messages per second from tens of thousands of devices?
This is the right demo for you! The demo shows how you can integrate with tens or hundreds of thousands IoT devices and process the data in real time. The demo use case is predictive maintenance (i.e. anomaly detection) in a connected car infrastructure to predict motor engine failures.
If you need more background about the challenges of building a scalable IoT infrastructure, the differences and relation between MQTT and Apache Kafka, and best practices for realizing a cloud-native IoT infrastructure based on Kubernetes, check out the slide deck "Best Practices for Streaming IoT Data with MQTT and Apache Kafka".
In addition to the predictive maintenance scenario with machine learning, we also implemented an example of a Digital Twin. More thoughts on this here: "Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT)".
In our example, we use Kafka as ingestion layer and MongoDB for storage and analytics. However, this is just one of various IoT architectures for building a digital twin with Apache Kafka.
This project implements a scenario where you can integrate with tens of thousands (simulated) cars using a scalable MQTT platform and an event streaming platform. The demo trains new analytic models from streaming data - without the need for an additional data store - to do predictive maintenance on real time sensor data from cars:
We built two different analytic models using different approaches:
Unsupervised learning (used by default in our project): An Autoencoder neural network is trained to detect anomaly detection without any labeled data.
Supervised learning: A LSTM neural network is trained on labeled data to be able to do predictions about future sensor events.
The digital twin implementation with Kafka and MongoDB is discussed on its own page, including implementation details and configuration examples.
Another example demonstrates how you can store all sensor data in a data lake - in this case GCP Google Cloud Storage (GCS) - for further analytics.
We use HiveMQ as open source MQTT broker to ingest data from IoT devices, ingest the data in real time into an Apache Kafka cluster for preprocessing (using Kafka Streams / KSQL), and model training + inference (using TensorFlow 2.0 and its TensorFlow I/O Kafka plugin).
We leverage additional enterprise components from HiveMQ and Confluent to allow easy operations, scalability and monitoring.
Here is the architecture of the MVP we created for setting up the MQTT and Kafka infrastructure on Kubernetes:
And this is the architecture of the final demo where we included TensorFlow and TF I/O's Kafka plugin for model training and model inference:
We used Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE) for different reasons. What you get here is not a fully-managed solution, but a mix of a cloud provider, a fully managed Kubernetes service, and self-managed components for MQTT, Kafka, TensorFlow and client applications. Here is why we chose this setup:
Get started quickly with two or three Terraform and Shell commands to setup the whole infrastructure; but being able to configure and change the details for playing around and starting your own POC for your specific problem and use case.
Show how to build a cloud-native infrastructure which is flexible, scalable and elastic.
Demonstrate an architecture which is applicable on different cloud providers like GCP, AWS, Azure, Alibaba, and on premises.
Go deep dive into configuration to show different options and allow flexible S, L, XL or XLL setups for connectivity, throughput, data processing, model training and model inference.
Understand the complexity of self-managed distributed systems (even with Kubernetes, Terraform and a cloud provider like GCP, you have to do a lot to do such a deployment in a reliable, scalable and elastic way).
Get motivated to cloud out fully-managed services for parts of your infrastructure if it makes sense, e.g. Google's Cloud Machine Learning Engine for model training and model inference using TensorFlow ecosystem or Confluent Cloud as fully managed Kafka as a Service offering with consumption-based pricing and SLAs.
We generate streaming test data at scale using a Car Data Simulator. The test data uses Apache Avro file format to leverage features like compression, schema versioning and Confluent features like Schema Registry or KSQL's schema inference.
You can either use some test data stored in the CSV file car-sensor-data.csv or better generate continuous streaming data using the included script (as described in the quick start). Check out the Avro file format here: cardata-v1.avsc.
Here is the schema and one row of the test data:
Typically, analytic models are trained in batch mode where you first ingest all historical data in a data store like HDFS, AWS S3 or GCS. Then you train the model using a framework like Spark MLlib, TensorFlow or Google ML.
TensorFlow I/O is a component of the TensorFlow framework which allows native integration with various technologies.
One of these integrations is tensorflow_io.kafka which allows streaming ingestion into TensorFlow from Kafka WITHOUT the need for an additional data store! This significantly simplifies the architecture and reduces development, testing and operations costs.
Yong Tang, member of the SIG TensorFlow I/O team, did a great presentation about this at Kafka Summit 2019 in New York (video and slide deck available for free).
You can pick and choose the right components from the Apache Kafka and TensorFlow ecosystems to build your own machine learning infrastructure for data integration, data processing, model training and model deployment:
This demo will do the following steps:
Optional steps (nice to have)
Again, you don't need another data store anymore! Just ingest the data directly from the distributed commit log of Kafka:
This totally simplfies your architecture as you don't need another data store in the middle. In conjunction with Tiered Storage - a Confluent add-on to Apache Kafka, you can built a cost-efficient long-term storage (aka data lake) with Kafka. No need for HDFS or S3 as additional storage. See the blog post "Streaming Machine Learning with Tiered Storage and Without a Data Lake" for more details.
Streaming ingestion for model training is fantastic. You don't need a data store anymore. This simplifies the architecture and reduces operations and developemt costs.
However, one common misunderstanding has to be clarified - as this question comes up every time you talk about TensorFlow I/O and Apache Kafka: As long as machine learning / deep learning frameworks and algorythms expect data in batches, you cannot achieve real online training (i.e. re-training / optimizing the model with each new input event).
Only a few algoryhtms and implementations are available today, like Online Clustering.
Thus, even with TensorFlow I/O and streaming ingestion via Apache Kafka, you still do batch training. Though, you can configure and optimize these batches to fit your use case. Additionally, only Kafka allows ingestion at large scale for use cases like connected cars or assembly lines in factories. You cannot build a scalable, reliable, mission-critical ML infrastructure just with Python.
The combination of TensorFlow I/O and Apache Kafka is a great step closer to real time training of analytic models at scale!
I posted many articles and videos about this discussion. Get started with How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and check out my other resources if you want to learn more.
We have prepared a terraform script to deploy the complete environment in Google Kubernetes Engine (GKE). This includes:
The setup is pretty straightforward. No previous experience required for getting the demo running. You just need to install some CLIs on your laptop (gcloud, kubectl, helm, terraform) and then run two or three script commands as described in the quick start guide.
With default configuration, the demo starts at small scale. This is sufficient to show an impressive demo. It also reduces cost and to enables free usage without the need for commercial licenses. You can also try it out at extreme scale (100000+ IoT connections). This option is also described in the quick start.
Afterwards, you execute one single command to set up the infrastructure and one command to generate test data. Of course, you can configure everything to your needs (like the cluster size, test data, etc).
Follow the instructions in the quick start to setup the cluster. Please let us know if you have any problems setting up the demo so that we can fix it!
If you are just interested in the "Streaming ML" part for model training and model inference using Python, Kafka and TensorFlow I/O, check out:
Python Application leveraging Apache Kafka and TensorFlow for Streaming Model Training and Inference. python-scripts/LSTM-TensorFlow-IO-Kafka/README.md
You can also checkout two simple examples which use Kafka Python clients to produce data to Kafka topics and then consume the streaming data directly with TensorFlow I/O for streaming ML without an additional data store:
Autoencoder for anomaly detection of sensor data into TensorFlow via Kafka. Producer (Python Client) and Consumer (TensorFlow I/O Kafka Plugin) + Model Training.