Spark Structured Streaming Examples

Spark Structured Streaming / Kafka / Cassandra / Elastic
Kafka / Cassandra / Elastic with Spark Structured Streaming


Stream the number of times Drake is broadcast on each radio station, and see how easy Spark Structured Streaming is to use with Spark SQL's Dataframe API.

Run the Project

Step 1 - Start containers

Start the ZooKeeper, Kafka, Cassandra containers in detached mode (-d)


Run these two commands:

docker-compose up -d
# create Cassandra schema
docker-compose exec cassandra cqlsh -f /schema.cql;

# confirm schema
docker-compose exec cassandra cqlsh -e "DESCRIBE SCHEMA;"

Step 2 - Start Spark Structured Streaming

sbt run

Run the project another time

Because checkpointing lets us process our data exactly once, we need to delete the checkpoint folders to re-run the examples.

rm -rf checkpoint/
sbt run
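The checkpoint folder deleted above is created by each query's checkpointLocation option. A minimal sketch of how a sink declares it (here `df` is a hypothetical streaming Dataframe, not a name from this project):

```scala
// Each streaming query persists its progress (offsets, state) under its
// checkpointLocation; deleting that folder resets the query to a fresh start.
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .option("checkpointLocation", "checkpoint/kafka") // the folder removed by rm -rf checkpoint/
  .start()
```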


Consume the topic to verify what the job wrote to Kafka:

docker-compose exec kafka  \
 kafka-console-consumer --bootstrap-server localhost:9092 --topic test --from-beginning


You should see messages such as:

{"radio":"nova","artist":"Drake","title":"From Time","count":18}
{"radio":"nova","artist":"Drake","title":"4pm In Calabasas","count":1}



Requirements: docker-compose. Download the binary for your platform (substitute the release URL from Docker's install docs):

curl -L <docker-compose-release-url>-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

Or install it with Homebrew:

brew install docker-compose

Input data

Coming from radio stations and stored in a parquet file, the stream is emulated with the .option("maxFilesPerTrigger", 1) option.

The stream is then read and sunk into Kafka, and from Kafka into Cassandra.
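A minimal sketch of how such a file-based stream can be set up, assuming the parquet files live under a hypothetical data/radio/ folder (a streaming parquet source needs an explicit schema, here inferred from a one-off batch read):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("radio-stream")
  .master("local[*]")
  .getOrCreate()

// maxFilesPerTrigger = 1 makes Spark pick up one file per micro-batch,
// emulating a live stream from static parquet files.
val radioDf = spark.readStream
  .format("parquet")
  .schema(spark.read.parquet("data/radio/").schema) // schema from a batch read
  .option("maxFilesPerTrigger", 1)
  .load("data/radio/")
```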

Output data

Stored in Kafka and Cassandra, for example purposes only. The Cassandra sinks use both the ForeachWriter and the StreamSinkProvider, so the two approaches can be compared.

One uses Datastax's saveToCassandra method. The other is messier (untyped) and executes CQL statements from a custom ForeachWriter.
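A minimal sketch of the untyped CQL approach, assuming the Datastax Java driver and the structuredstreaming.radioOtherSink table from schema.cql (class and column handling here are illustrative, not this project's exact code):

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import com.datastax.driver.core.{Cluster, Session}

class CassandraCqlWriter(host: String) extends ForeachWriter[Row] {
  private var cluster: Cluster = _
  private var session: Session = _

  override def open(partitionId: Long, version: Long): Boolean = {
    cluster = Cluster.builder.addContactPoint(host).build()
    session = cluster.connect()
    true
  }

  override def process(row: Row): Unit =
    // Untyped: columns are looked up by name at runtime, so a schema
    // mismatch only surfaces when the query actually runs.
    session.execute(
      s"""INSERT INTO structuredstreaming.radioOtherSink (radio, title, artist, count)
         |VALUES ('${row.getAs[String]("radio")}', '${row.getAs[String]("title")}',
         |        '${row.getAs[String]("artist")}', ${row.getAs[Long]("count")})""".stripMargin)

  override def close(errorOrNull: Throwable): Unit = cluster.close()
}

// Attached with: df.writeStream.foreach(new CassandraCqlWriter("localhost")).start()
```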

From Spark's doc about batch duration:

Trigger interval: Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will attempt to trigger at the next trigger point, not immediately after the processing has completed.
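The behavior quoted above can be made explicit with a ProcessingTime trigger; a sketch, where `kafkaSink` stands for a hypothetical DataStreamWriter:

```scala
import org.apache.spark.sql.streaming.Trigger

// Without .trigger(...), Spark starts a new micro-batch as soon as the
// previous one finishes; with a ProcessingTime trigger it aims for a
// fixed interval, skipping to the next tick if a batch runs long.
kafkaSink
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```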

Kafka topic

A single topic, test, with only one partition

List all topics

docker-compose exec kafka  \
  kafka-topics --list --zookeeper zookeeper:32181

Send a message to be processed

docker-compose exec kafka  \
 kafka-console-producer --broker-list localhost:9092 --topic test

> {"radio":"skyrock","artist":"Drake","title":"Hold On WeRe Going Home","count":38}

Cassandra Table

There are three tables: two are used as sinks, and one saves Kafka metadata. Have a look at schema.cql for all the details.

 docker-compose exec cassandra cqlsh -e "SELECT * FROM structuredstreaming.radioOtherSink;"

 radio   | title                    | artist | count
---------+--------------------------+--------+-------
 skyrock |                Controlla |  Drake |     1
 skyrock |                Fake Love |  Drake |     9
 skyrock | Hold On WeRe Going Home |  Drake |    35
 skyrock |            Hotline Bling |  Drake |  1052
 skyrock |  Started From The Bottom |  Drake |    39
    nova |         4pm In Calabasas |  Drake |     1
    nova |             Feel No Ways |  Drake |     2
    nova |                From Time |  Drake |    34
    nova |                     Hype |  Drake |     2

Kafka Metadata

@TODO Verify the information below. Cf. this SO comment

When doing an application upgrade, we cannot reuse checkpoints, so we need to store our offsets in an external datastore; here, Cassandra is chosen. When starting our Kafka source, we then set the "startingOffsets" option to a JSON string like

""" {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """

Learn more in Spark's official Kafka integration doc.

If there is no Kafka metadata stored inside Cassandra, "earliest" is used.
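The fallback can be sketched like this, assuming the previously saved offsets arrive as an optional JSON string (the variable names are illustrative):

```scala
// Offsets previously persisted in Cassandra, if any;
// e.g. Some("""{"test":{"0":171}}""") for partition 0 at offset 171.
val savedOffsets: Option[String] = None

val kafkaSource = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("startingOffsets", savedOffsets.getOrElse("earliest")) // fallback
  .load()
```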

docker-compose exec cassandra cqlsh -e "SELECT * FROM structuredstreaming.kafkametadata;"

 partition | offset
-----------+--------
         0 |    171

Useful links


Inspired by
