|Project Name||Stars||Downloads||Repos Using This||Packages Using This||Most Recent Commit||Total Releases||Latest Release||Open Issues||License||Language|
|Bigdata Notes||13,291||2 months ago||33||Java|
|Flink Learning||13,198||a month ago||apache-2.0||Java|
|flink learning blog. http://www.54tianzhisheng.cn/ 含 Flink 入门、概念、原理、实战、性能调优、源码解析等内容。涉及 Flink Connector、Metrics、Library、DataStream API、Table API & SQL 等内容的学习案例，还有 Flink 落地应用的大型项目案例（PVUV、日志存储、百亿数据实时去重、监控告警）分享。欢迎大家支持我的专栏《大数据实时计算引擎 Flink 实战与性能优化》|
|Technology Talk||13,004||a month ago||10|
|Data Engineering Zoomcamp||12,969||6 days ago||45||Jupyter Notebook|
|Free Data Engineering course!|
|Cookbook||11,362||3 months ago||108||apache-2.0|
|The Data Engineering Cookbook|
|God Of Bigdata||7,992||a day ago||2|
|Pipeline||4,140||5 months ago||85||July 18, 2017||1||apache-2.0||Jsonnet|
|PipelineAI Kubeflow Distribution|
|Bigdataguide||1,714||5 months ago||7||Java|
|Szt Bigdata||1,702||4 months ago||15||other||Scala|
|Jvm Profiler||1,661||5 months ago||17||other||Java|
|JVM Profiler Sending Metrics to Kafka, Console Output or Custom Reporter|
Spark-kafka is a library that facilitates batch loading data from Kafka into Spark, and from Spark into Kafka.
This library does not provide a Kafka Input DStream for Spark Streaming. For that please take a look at the spark-streaming-kafka library that is part of Spark itself.
This is the configuration that KafkaRDD needs to consume data from Kafka. It includes metadata.broker.list (a comma-separated list of Kafka brokers for bootstrapping) and some SimpleConsumer related settings such as timeouts and buffer sizes. Only metadata.broker.list is required.
KafkaRDD is an RDD that extracts data from Kafka. To use it you need to provide a Spark Context, a Kafka topic, offset ranges per Kafka partition (start offset is inclusive, stop offset exclusive) and a SimpleConsumerConfig. Instead of offsets you can also provide times (start and/or stop time) which will be used to calculate offsets at construction time. It is fairly common to use exact start offsets and a stop time of OffsetRequest.LatestTime, which basically means to read everything from a known starting position (where you left off) up to the latest message in Kafka (at the time the KafkaRDD was created; newer messages will be ignored). KafkaRDD will create a Spark partition/split per Kafka partition.
KafkaRDD is an RDD[PartitionOffsetMessage], where PartitionOffsetMessage is a case class that contains the Kafka partition, offset and the message (which contains key and value/payload). The Kafka partition and offset being part of the RDD makes it easy to calculate the last offset read for each Kafka partition, which can then be used to derive the offsets to start reading for the next batch load.
Kafka is a dynamic system that deletes old messages and appends new messages. KafkaRDD on the other hand has a fixed offset range per Kafka partition set at construction time. This means that messages added to Kafka after creation of a KafkaRDD will not be visible. It also means that messages deleted from Kafka that are within the offset range can lead to errors within KafkaRDD. For example if you define a KafkaRDD with a start time of OffsetRequest.EarliestTime and you access the RDD many hours later you might see an OffsetOutOfRangeException as Kafka has cleaned up the data you are trying to access.
The KafkaRDD companion object contains methods writeWithKeysToKafka and writeToKafka, which can be used to write an RDD to Kafka. For this you will need to provide (besides the RDD itself) the Kafka topic and a ProducerConfig. The ProducerConfig will need to have metadata.broker.list and probably also serializer.class.
writeToKafka can also be used in Spark Streaming to save the underlying RDDs of the DStream to Kafka (using foreachRDD method on the DStream). However keep in mind that for every invocation of writeToKafka a new Kafka Producer is created for every partition.
The unit test infrastructure was copied from/inspired by spark-streaming-kafka.
Have fun! Team @ Tresata