Spark Kafka

Low level integration of Spark and Kafka
Alternatives To Spark Kafka
Project NameStarsDownloadsRepos Using ThisPackages Using ThisMost Recent CommitTotal ReleasesLatest ReleaseOpen IssuesLicenseLanguage
Bigdata Notes13,291
2 months ago33Java
大数据入门指南 :star:
Flink Learning13,198
a month agoapache-2.0Java
flink learning blog. http://www.54tianzhisheng.cn/ 含 Flink 入门、概念、原理、实战、性能调优、源码解析等内容。涉及 Flink Connector、Metrics、Library、DataStream API、Table API & SQL 等内容的学习案例,还有 Flink 落地应用的大型项目案例(PVUV、日志存储、百亿数据实时去重、监控告警)分享。欢迎大家支持我的专栏《大数据实时计算引擎 Flink 实战与性能优化》
Technology Talk13,004
a month ago10
汇总java生态圈常用技术框架、开源中间件,系统架构、数据库、大公司架构案例、常用三方类库、项目管理、线上问题排查、个人成长、思考等知识
Data Engineering Zoomcamp12,969
6 days ago45Jupyter Notebook
Free Data Engineering course!
Cookbook11,362
3 months ago108apache-2.0
The Data Engineering Cookbook
God Of Bigdata7,992
a day ago2
专注大数据学习面试,大数据成神之路开启。Flink/Spark/Hadoop/Hbase/Hive...
Pipeline4,140
5 months ago85July 18, 20171apache-2.0Jsonnet
PipelineAI Kubeflow Distribution
Bigdataguide1,714
5 months ago7Java
大数据学习,从零开始学习大数据,包含大数据学习各阶段学习视频、面试资料
Szt Bigdata1,702
4 months ago15otherScala
深圳地铁大数据客流分析系统🚇🚄🌟
Jvm Profiler1,661
5 months ago17otherJava
JVM Profiler Sending Metrics to Kafka, Console Output or Custom Reporter
Alternatives To Spark Kafka
Select To Compare


Alternative Project Comparisons
Readme

NOTE: this project is no longer updated or maintained

Build Status

spark-kafka

Spark-kafka is a library that facilitates batch loading data from Kafka into Spark, and from Spark into Kafka.

This library does not provide a Kafka Input DStream for Spark Streaming. For that please take a look at the spark-streaming-kafka library that is part of Spark itself.

SimpleConsumerConfig

This is the configuration that KafkaRDD needs to consume data from Kafka. It includes metadata.broker.list (a comma-separated list of Kafka brokers for bootstrapping) and some SimpleConsumer related settings such as timeouts and buffer sizes. Only metadata.broker.list is required.

KafkaRDD

KafkaRDD is an RDD that extracts data from Kafka. To use it you need to provide a Spark Context, a Kafka topic, offset ranges per Kafka partition (start offset is inclusive, stop offset exclusive) and a SimpleConsumerConfig. Instead of offsets you can also provide times (start and/or stop time) which will be used to calculate offsets at construction time. It is fairly common to use exact start offsets and a stop time of OffsetRequest.LatestTime, which basically means to read everything from a known starting position (where you left off) up to the latest message in Kafka (at the time the KafkaRDD was created; newer messages will be ignored). KafkaRDD will create a Spark partition/split per Kafka partition.

KafkaRDD is an RDD[PartitionOffsetMessage], where PartitionOffsetMessage is a case class that contains the Kafka partition, offset and the message (which contains key and value/payload). The Kafka partition and offset being part of the RDD makes it easy to calculate the last offset read for each Kafka partition, which can then be used to derive the offsets to start reading for the next batch load.

Kafka is a dynamic system that deletes old messages and appends new messages. KafkaRDD on the other hand has a fixed offset range per Kafka partition set at construction time. This means that messages added to Kafka after creation of a KafkaRDD will not be visible. It also means that messages deleted from Kafka that are within the offset range can lead to errors within KafkaRDD. For example if you define a KafkaRDD with a start time of OffsetRequest.EarliestTime and you access the RDD many hours later you might see an OffsetOutOfRangeException as Kafka has cleaned up the data you are trying to access.

The KafkaRDD companion object contains methods writeWithKeysToKafka and writeToKafka, which can be used to write an RDD to Kafka. For this you will need to provide (besides the RDD itself) the Kafka topic and a ProducerConfig. The ProducerConfig will need to have metadata.broker.list and probably also serializer.class.

writeToKafka can also be used in Spark Streaming to save the underlying RDDs of the DStream to Kafka (using foreachRDD method on the DStream). However keep in mind that for every invocation of writeToKafka a new Kafka Producer is created for every partition.

The unit test infrastructure was copied from/inspired by spark-streaming-kafka.

Have fun! Team @ Tresata

Popular Kafka Projects
Popular Spark Projects
Popular Data Processing Categories

Get A Weekly Email With Trending Projects For These Categories
No Spam. Unsubscribe easily at any time.
Scala
Time
Spark
Kafka
Streaming
Spark Streaming