Understanding Apache Kafka: A Comprehensive Overview
Introduction to Streaming Concepts
As we delve into microservices, we encounter a variety of concepts, patterns, protocols, and tools such as Messaging, AMQP, RabbitMQ, Event-sourcing, gRPC, CQRS, and more. One term that piqued my interest is Apache Kafka.
A common misconception is that Kafka is merely another messaging system. While it can be utilized as such, limiting its use to that function may lead to inefficient resource utilization.
The intention of this article is to clarify how Kafka operates and the reasons you might consider implementing it in your projects. I will introduce key concepts and illustrate why it transcends a simple messaging system, aiming to assist newcomers like myself.
What is Apache Kafka?
If you visit the Kafka website, you will find its definition prominently displayed:
> A distributed streaming platform.
But what does "a distributed streaming platform" mean? To grasp this, we first need to define what a stream is. My understanding of streams is that they represent endless data—data that continuously flows and can be processed in real time.
The term "distributed" indicates that Kafka functions within a cluster, where each unit in this cluster is referred to as a Broker. These brokers are essentially servers running instances of Apache Kafka.
Thus, we can summarize that Kafka is a collection of machines collaborating to manage and process boundless real-time data.
Its distributed design contributes significantly to Kafka's popularity: the brokers provide resilience, scalability, and fault tolerance, making Kafka both high-performing and reliable.
So, why is there confusion about Kafka being just another messaging system? To answer that, we must first understand the mechanics of messaging.
Understanding Messaging
Messaging is simply the process of transmitting messages from one location to another. It involves three key components:
- Producer: The entity that creates and sends messages to one or more queues.
- Queue: A data structure that acts as a buffer, receiving messages from producers and delivering them to consumers in a FIFO (First-In First-Out) manner. Once a message is delivered, it is permanently removed from the queue, with no possibility of retrieval.
- Consumer: The entity subscribed to one or more queues, receiving messages as they are published.
This summarizes the messaging process (very briefly, as there’s much more detail to it). Notice that there’s no mention of streams, real-time processing, or clusters (though some tools allow for clustering, it is not intrinsic like in Kafka).
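Before moving on, here is a tiny, purely illustrative Java sketch of that queue model using an in-memory BlockingQueue. No broker, network, or messaging tool is involved; it simply demonstrates the FIFO, consume-once behavior just described.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // The queue acts as a buffer between producer and consumer.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer side: puts messages onto the queue.
        queue.put("message-1");
        queue.put("message-2");

        // Consumer side: takes messages in FIFO order; once taken,
        // a message is gone from the queue for good.
        System.out.println(queue.take()); // message-1
        System.out.println(queue.take()); // message-2
    }
}
```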
Kafka Architecture
Now that we understand messaging, let's explore Kafka's architecture. Kafka also has Producers and Consumers, which play much the same roles as in traditional messaging systems: producers create messages and consumers read them.
So far this resembles the messaging model, but instead of a Queue, Kafka has Topics.
A Topic represents a specific data stream and functions similarly to a queue, receiving and delivering messages. However, there are several key points to understand about topics:
- A topic is segmented into partitions; each topic can have multiple partitions, and this number must be specified when creating the topic (the sketch after this list shows a topic being created). You can visualize a topic as a folder in an operating system, with each partition as a subfolder.
- If a message is produced without a key, producers distribute messages across partitions in a round-robin manner. Because ordering is only guaranteed within a single partition, this means we cannot guarantee overall delivery order; to keep related messages in order, we must assign them a key so they always land on the same partition.
- Each message is stored on the broker's disk and assigned a unique offset (identifier). Offsets are unique at the partition level, with each partition maintaining its own sequence. This is another feature that sets Kafka apart: it retains messages on disk for a configurable retention period (akin to a database) so they can be re-read, unlike typical messaging systems where messages are deleted once consumed.
- Consumers use offsets to track their position, processing messages from oldest to newest. If a consumer fails, it resumes reading from the last committed offset.
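As a concrete illustration of the first point, here is a minimal sketch using Kafka's official Java AdminClient. The broker address localhost:9092 and the topic name "orders" are placeholder assumptions for a local development setup.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker is reachable at localhost:9092.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Example topic "orders": 3 partitions, replication factor 1
            // (a single copy of each partition; fine for local experiments).
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

The partition count fixed here at creation time is what producers will later spread messages across, either round-robin or by key hashing.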
Brokers
As previously mentioned, Kafka operates in a distributed manner. A Kafka cluster can consist of multiple brokers as needed.
Each broker in a cluster is assigned an ID and holds zero or more partitions of a topic. The Replication Factor, defined when creating a topic, determines how many copies of each partition the cluster keeps. For instance, if we have three brokers and create a topic with three partitions and a replication factor of one, each broker will hold exactly one partition of the topic.
In that arrangement, Topic_1 has three partitions, with each broker overseeing one. The partitions spread this evenly only because their number matches the number of brokers; Kafka itself imposes no such requirement.
To bolster cluster reliability, Kafka introduces the concept of a Partition Leader. Each partition has exactly one leader broker; the leader receives all messages for that partition, while replicas on other brokers synchronize the data from it. Thanks to these replicas, data remains intact even if a broker fails.
Should a leader become unavailable, Zookeeper helps the cluster automatically elect a new leader from the in-sync replicas.
In the example above, Broker 1 leads Partition 1 of Topic 1 with a replica in Broker 2. If Broker 1 were to fail, Zookeeper would designate Broker 2 as the new leader for Partition 1, showcasing the power of Kafka's distributed architecture.
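You can inspect this layout yourself. The following sketch, again with the Java AdminClient (same assumed broker address and example topic as above), prints each partition's current leader, its replicas, and the in-sync replica set (ISR) from which a new leader would be elected.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed local broker, as in the earlier sketch.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                .describeTopics(Collections.singletonList("orders"))
                .all().get()
                .get("orders");

            // For each partition: the leader handles reads/writes, the
            // replicas mirror it, and the ISR lists replicas that are
            // fully caught up (and thus eligible to become leader).
            desc.partitions().forEach(p ->
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                    p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```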
Producers
Similar to traditional messaging systems, Producers in Kafka generate and send messages to topics.
When no key is provided, messages are dispatched in a round-robin manner: Message 01 may go to partition 0 of Topic_1 while Message 02 goes to partition 1 of the same topic. To ensure that related messages consistently land on the same partition, you must define a key, which Kafka hashes to determine the target partition.
The hash is taken modulo the topic's partition count, which is why adding partitions after creation remaps existing keys to different partitions and breaks key-based ordering (Kafka allows increasing the count, never decreasing it).
When it comes to delivery guarantees, we encounter the Acknowledgment (ack), which signals that a message was successfully received. In Kafka, we can tune the acknowledgment level when producing messages (the producer sketch after this list shows the setting in context):
- acks = 0: We don't wait for any ack from Kafka. This is the fastest option, but if a broker fails, the message is lost.
- acks = 1: Requests an ack from the partition leader only (the longtime client default). Data is lost only if the leader fails before its replicas have copied the message.
- acks = all: The safest option. The leader waits until all in-sync replicas (ISR) have confirmed the write before acknowledging; replicas that fall out of sync are dropped from the ISR rather than waited on, and min.insync.replicas controls how many replicas must confirm for the write to succeed.
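Here is a hedged sketch of a producer putting the two ideas above together: a key ("customer-42", an invented example) to pin related messages to one partition, and acks=all for the strongest delivery guarantee. The broker address and topic name are the same assumptions as before.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: wait for the leader and all in-sync replicas.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key is hashed to pick the partition, so every message
            // with "customer-42" lands on the same partition, in order.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("acked: partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                    }
                });
        } // closing the producer flushes any pending sends
    }
}
```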
Consumers and Consumer Groups
Consumers are applications that subscribe to one or more topics and read messages from them. They can access one or multiple partitions.
When a consumer reads from a single partition, reading order is guaranteed, since offsets within a partition are strictly ordered. However, if a consumer reads from multiple partitions, it consumes them in parallel, so overall order is not assured: a later message may be read before an earlier one.
This necessitates careful consideration when determining the number of partitions and during message production.
Another significant concept in Kafka is Consumer Groups, which are essential for scaling message reading.
When a single consumer must read from numerous partitions, it can become a bottleneck. This is where consumer groups come into play.
The topic's partitions are divided among the consumers in a group, so each consumer handles its share of the data.
Ideally, the number of consumers in a group matches the number of partitions in a topic, so that each consumer reads from exactly one. Adding more consumers than partitions leaves the extras idle, processing nothing.
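To close the loop, here is a minimal consumer sketch. The group id "orders-processors" is an invented example; starting several copies of this program with the same group id makes Kafka split the topic's partitions among them, which is exactly the scaling mechanism described above.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Every consumer started with this group.id shares the topic's
        // partitions; Kafka rebalances them as members join or leave.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processors");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```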
Conclusion
Kafka stands out as one of the most robust and efficient tools available today. Through studying its core concepts, I have recognized that it is much more than just a messaging system. Its distributed architecture and capability for real-time processing make it adaptable for various scenarios and use cases.
In this article, I sought to delve deeply into Kafka, presenting its key concepts to facilitate understanding and highlight its distinctions from traditional messaging systems.
Due to time constraints, I omitted some important topics like idempotent producers, compression, and transactions, but I hope this overview encourages further exploration of the Kafka ecosystem.