Camus is LinkedIn's Kafka->HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka. It includes the following features:
It is used at LinkedIn where it processes tens of billions of messages per day. You can get a basic overview from this paper: Building LinkedIn’s Real-time Activity Data Pipeline.
There is a Google Groups mailing list that you can email or search if you have any questions.
For a more detailed documentation on the main Camus components, please see Camus InputFormat and OutputFormat Behavior
All work is done within a single Hadoop job divided into three stages:
Setup stage fetches available topics and partitions from Zookeeper and the latest offsets from the Kafka Nodes.
Hadoop job stage allocates topic pulls among a set number of tasks. Each task does the following:
Cleanup stage reads counts from all tasks, aggregates the values, and submits the results to Kafka for consumption by Kafka Audit.
Setup stage fetches from Zookeeper Kafka broker urls and topics (in /brokers/id and /brokers/topics). This data is transient and will be gone once Kafka server is down.
Topic offsets stored in HDFS. Camus maintains its own status by storing offset for each topic in HDFS. This data is persistent.
Setup stage allocates all topics and partitions among a fixed number of tasks.
Each hadoop task uses a list of topic partitions with offsets generated by setup stage as input. It uses them to initialize Kafka requests and fetch events from Kafka brokers. Each task generates four types of outputs (by using a custom MultipleOutputFormat): Avro data files, Count statistics files, Updated offset files, and Error files.
Once a task has successfully completed, all topics pulled are committed to their final output directories. If a task doesn't complete successfully, then none of the output is committed. This allows the hadoop job to use speculative execution. Speculative execution happens when a task appears to be running slowly. In that case the job tracker then schedules the task on a different node and runs both the main task and the speculative task in parallel. Once one of the tasks completes, the other task is killed. This prevents a single overloaded hadoop node from slowing down the entire ETL.
Successful tasks also write audit counts to HDFS.
Final offsets are written to HDFS and consumed by the subsequent job.
Once the hadoop job has completed, the main client reads all the written audit counts and aggregates them. The aggregated results are then submitted to Kafka.
You can build Camus with:
mvn clean package
Note that there are two jars that are not currently in a public Maven repo. These jars (kafka-0.7.2 and avro-schema-repository-1.74-SNAPSHOT) are supplied in the lib directory, and maven will automatically install them into your local Maven cache (usually ~/.m2).
We hope to eventually create a more out of the box solution, but until we get there you will need to create a custom decoder for handling Kafka messages. You can do this by implementing the abstract class com.linkedin.batch.etl.kafka.coders.KafkaMessageDecoder. Internally, we use a schema registry that enables obtaining an Avro schema using an identifier included in the Kafka byte payload. For more information on other options, you can email [email protected]. Once you have created a decoder, you will need to specify that decoder in the properties as described below. You can also start by taking a look at the existing Decoders to see if they will work for you, or as examples if you need to implement a new one.
Camus can also process JSON messages. Set "camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder" in camus.properties. Additionally, there are two more options "camus.message.timestamp.format" (default value: "[dd/MMM/yyyy:HH:mm:ss Z]") and "camus.message.timestamp.field" (default value: "timestamp").
By default Camus writes Avro data. But you can also write to different formats by implementing and RecordWriterProvider. For examples see https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/AvroRecordWriterProvider.java and https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/StringRecordWriterProvider.java. You can specify which writer to use with "etl.record.writer.provider.class".
Camus can be run from the command line as Java App. You will need to set some properties either by specifying a properties file on the classpath using -p (filename), or an external properties file using -P (filepath to local file system, or to
hdfs:), or from the command line itself using -D property=value. If the same property is set using more than one of the previously mentioned methods, the order of precedence is command-line, external file, classpath file.
Here is an abbreviated list of commonly used parameters. An example properties file is also located https://github.com/linkedin/camus/blob/master/camus-example/src/main/resources/camus.properties.
Camus can be run from the command line using hadoop jar. Here is the usage:
usage: hadoop jar camus-example-<version>-SNAPSHOT.jar com.linkedin.camus.etl.kafka.CamusJob <br/> -D <property=value> use value for given property<br/> -P <arg> external properties filename<br/> -p <arg> properties filename from the classpath<br/>