Apache Kafka

Getting Started with Apache Kafka, by Ryan Plant (Pluralsight course)

.NET clients

List maintained by Apache

find-package kafka | Select-Object ID, Version, Description, DownloadCount | Sort-Object -Descending -Property DownloadCount | fl

ExactTargetDev/kafka-net A .NET client for Kafka 0.8, written in C#. It covers most basic functionality, including a simple Producer and Consumer.

Microsoft/CSharpClient-for-Kafka .NET implementation of the Apache Kafka protocol that provides basic functionality through Producer/Consumer classes. The project also offers a balanced consumer implementation. It is a fork of ExactTarget’s Kafka-net client.

Jroland/kafka-net Pure C# client with full protocol support. Includes consumer, producer, lower level components and gzip support (no snappy)

From https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-.NET

https://github.com/ah-/rdkafka-dotnet is a C# client for Apache Kafka based on librdkafka.

https://github.com/Microsoft/Kafkanet .NET implementation of the Apache Kafka protocol that provides basic functionality through Producer/Consumer classes. The project also offers a balanced consumer implementation. It is a fork of ExactTarget’s Kafka-net client.

https://github.com/ntent-ad/kafka4net C# client, asynchronous, all 3 compressions supported (read and write), tracks leader partition changes transparently, long time in production.

https://github.com/criteo/kafka-sharp kafka-sharp - High Performance .NET Kafka Driver

Background

  • Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. What does all that mean?
  • Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
  • Kafka is often used in place of traditional message brokers, such as those based on JMS and AMQP, because of its higher throughput, reliability and replication.
  • Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
  • Kafka can message geospatial data from a fleet of long-haul trucks or sensor data from heating and cooling equipment in office buildings.
  • Whatever the industry or use case, Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop.

From http://hortonworks.com/apache/kafka/

  • Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala.
  • The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. It is, in its essence, a massively scalable pub/sub message queue architected as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data.
  • The design is heavily influenced by transaction logs.

From https://en.wikipedia.org/wiki/Apache_Kafka


First let’s review some basic messaging terminology:

  • Kafka maintains feeds of messages in categories called topics.
  • We'll call processes that publish messages to a Kafka topic producers.
  • We'll call processes that subscribe to topics and process the feed of published messages consumers.
  • Kafka is run as a cluster of one or more servers, each of which is called a broker.

So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
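The terminology above can be pictured with a toy, in-memory model. This is only a sketch to make the roles concrete: `ToyBroker`, `publish` and `fetch` are hypothetical names invented for this note, not part of any Kafka client API, and a real cluster talks over the network and persists to disk.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a single broker (illustration only, not Kafka's API)."""

    def __init__(self):
        # topic name -> list of messages, in the order they were published
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append a message to the named topic."""
        self._topics[topic].append(message)

    def fetch(self, topic, start=0):
        """Consumer side: return a copy of the topic's messages from position `start` on."""
        return list(self._topics[topic][start:])


broker = ToyBroker()

# Producers publish to topics (the topic names here are made up).
broker.publish("page-views", "/home")
broker.publish("page-views", "/pricing")
broker.publish("metrics", "cpu=0.42")

# Two independent consumers of the same topic each receive the full feed.
dashboard_feed = broker.fetch("page-views")
audit_feed = broker.fetch("page-views")
```

Both `dashboard_feed` and `audit_feed` hold the same two page views, because reading a topic does not remove anything from it.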

Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol.

We provide a Java client for Kafka, but clients are available in many languages.

From http://kafka.apache.org/documentation.html#introduction

See our web site for details on the project. You need to have Gradle and Java installed. Kafka requires Gradle 2.0 or higher. Java 7 should be used for building in order to support both Java 7 and Java 8 at runtime.

From https://github.com/apache/kafka

The Log: What every software engineer should know about real-time data's unifying abstraction

From https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

What Kafka Does

Apache Kafka supports a wide range of use cases as a general-purpose messaging system for scenarios where high throughput, reliable delivery, and horizontal scalability are important. Apache Storm and Apache HBase both work very well in combination with Kafka. Common use cases include:

  • Stream Processing
  • Website Activity Tracking
  • Metrics Collection and Monitoring
  • Log Aggregation

Some of the important characteristics that make Kafka such an attractive option for these use cases include the following:

| Feature | Description |
| --- | --- |
| Scalability | Distributed system scales easily with no downtime |
| Durability | Persists messages on disk, and provides intra-cluster replication |
| Reliability | Replicates data, supports multiple subscribers, and automatically balances consumers in case of failure |
| Performance | High throughput for both publishing and subscribing, with disk structures that provide constant performance even with many terabytes of stored messages |

From http://hortonworks.com/apache/kafka/#section_1
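To give a feel for what "automatically balances consumers in case of failure" could look like, here is a rough sketch that spreads partitions over the consumers that are still alive. The round-robin rule and the name `assign_partitions` are assumptions made for illustration; Kafka's actual assignment strategies and coordination protocol are more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin (hypothetical helper).

    Returns a dict mapping each consumer name to the list of partitions it owns.
    """
    members = sorted(consumers)
    if not members:
        raise ValueError("at least one live consumer is required")
    assignment = {name: [] for name in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two healthy consumers share the four partitions.
before_failure = assign_partitions(partitions, {"c1", "c2"})

# "c2" fails: rerunning the assignment hands its partitions to the survivor.
after_failure = assign_partitions(partitions, {"c1"})
```

The point of the sketch is only that every partition keeps exactly one owner after a consumer disappears.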

How Kafka Works

Kafka's system design can be thought of as that of a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Kafka:

  • Topics
  • Producers
  • Consumers
  • Brokers

[Figure: partition diagram]


From http://hortonworks.com/apache/kafka/#section_2

One of the keys to Kafka's high performance is the simplicity of the brokers' responsibilities. In Kafka, topics consist of one or more Partitions that are ordered, immutable sequences of messages. Since writes to a partition are sequential, this design greatly reduces the number of hard disk seeks (with their resulting latency).
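A partition as an ordered, append-only sequence can be sketched as follows. This is a toy model, not Kafka's storage code: the `Partition` and `Topic` classes are hypothetical, and the round-robin choice of partition in `send` is an assumption made to keep the example small.

```python
class Partition:
    """Append-only sequence of messages; an offset is a position in that sequence."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Write at the end of the log and return the offset assigned to the message."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        """Return the message at `offset`; there is no method to change or delete entries."""
        if not 0 <= offset < len(self._log):
            raise IndexError("offset out of range")
        return self._log[offset]

    def __len__(self):
        return len(self._log)


class Topic:
    """A topic made of a fixed number of partitions."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]
        self._next = 0  # round-robin cursor (an assumption of this sketch)

    def send(self, message):
        """Append to the next partition in turn; return (partition index, offset)."""
        index = self._next
        self._next = (self._next + 1) % len(self.partitions)
        return index, self.partitions[index].append(message)


topic = Topic(num_partitions=2)
placements = [topic.send(m) for m in ["a", "b", "c", "d", "e"]]
```

Each entry in `placements` is a (partition, offset) pair. Offsets restart at 0 in every partition, so an offset identifies a message only together with its partition.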

Another factor contributing to Kafka's performance and scalability is the fact that Kafka brokers are not responsible for keeping track of what messages have been consumed – that responsibility falls on the consumer. In traditional messaging systems such as JMS, the broker bore this responsibility, severely limiting the system's ability to scale as the number of consumers increased.

[Figure: broker diagram]

For Kafka consumers, keeping track of which messages have been consumed (processed) is simply a matter of keeping track of an Offset, which is a sequential id number that uniquely identifies a message within a partition. Because Kafka retains all messages on disk (for a configurable amount of time), consumers can rewind or skip to any point in a partition simply by supplying an offset value. Finally, this design eliminates the potential for back-pressure when consumers process messages at different rates.
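The consumer-side bookkeeping described above can be sketched like this. It is a minimal model under stated assumptions: `OffsetConsumer`, `poll` and `seek` are names chosen for this note (real clients expose their own APIs), and a plain Python list stands in for one retained partition.

```python
class OffsetConsumer:
    """Reads from a shared log and remembers its own position (hypothetical class)."""

    def __init__(self, log):
        self._log = log   # shared view of one partition; the consumer only reads it
        self.offset = 0   # next offset to read; owned by the consumer, not the broker

    def poll(self, max_messages=1):
        """Return up to `max_messages` messages and advance the offset past them."""
        batch = self._log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Rewind or skip ahead by setting the position directly."""
        if not 0 <= offset <= len(self._log):
            raise ValueError("offset outside the retained log")
        self.offset = offset


partition_log = ["m0", "m1", "m2", "m3", "m4"]  # stands in for one partition

fast = OffsetConsumer(partition_log)
slow = OffsetConsumer(partition_log)

fast_batch = fast.poll(max_messages=5)  # the fast reader drains the log
slow_batch = slow.poll(max_messages=2)  # the slow reader has only reached offset 2

fast.seek(1)                            # rewind and re-read from offset 1
replayed = fast.poll(max_messages=2)
```

The log itself never records who has read what, so the slow reader does not hold the fast one back, and rewinding is just a matter of changing one integer.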

to-do

Apache Kafka for Beginners