Learn about Apache Kafka


In this blog article, I am going to introduce Apache Kafka. It covers the need for a messaging system, what Kafka is, Kafka's features, Kafka's components, and how to install Kafka.

The need for a messaging system:


Data pipelines: In real-time scenarios, different systems need to communicate with each other, and this communication is handled by data pipelines. For example, consider a chat server that needs to communicate with a database server in order to store messages.

Data Pipeline


Organizations have many servers: database servers, email servers, FTP servers, and so on. Depending on the company's needs, many nodes or applications access the database. Suppose that a front-end application, Hadoop, a database slave, and a chat server all need to access the database server.


Simple pipeline

Here there are multiple pipelines to the database server, one for each node that communicates with it. As the number of nodes increases, the number of pipelines increases as well.



Organizations also have many servers in the back-end environment. Suppose an additional security service server is introduced; then every application needs to communicate with the security service server too. The infrastructure now becomes considerably more complex than before.


Complex pipeline


Complex data pipelines:


Complex pipeline (image 02)

Over time, the server-side architecture becomes more and more complex. Adding a new server to such a system is hard, the efficiency and reliability of the pipelines decline, and adding or dropping a pipeline is equally difficult.



Solution: The solution to this problem is to introduce a messaging system between the applications and the servers. A messaging system manages the complexity of the pipelines: the communication mechanism becomes simple and manageable, remote connections are easy to establish, and data can be sent easily across the network. Different systems use different languages and platforms, but a messaging system provides a common architecture and platform for messaging without any dependency on a particular language or platform. It also establishes asynchronous communication, so a sender does not need to wait until the receiver has processed a message. Altogether, this provides a platform for more reliable communication.


How Kafka solves the problem:


Usage of Kafka

Using Kafka decouples the data pipelines.

All applications consume data from Kafka, and the servers produce data to Kafka. This makes it very easy to add or remove any application or server. For example, if a new application is added to the system, it only needs to subscribe to the categories (topics) of the servers it is interested in.


What is Apache Kafka?
  •           Apache Kafka is a distributed publish-subscribe messaging system.
  •           It was originally developed at LinkedIn and later became part of the Apache project.
  •           Kafka is fast, scalable, durable, fault-tolerant, and distributed by design.




Kafka with LinkedIn


Kafka in LinkedIn

Kafka's Growth is Exploding
  •           More than 1/3 of all Fortune 500 companies use Kafka.
  •           LinkedIn, Microsoft, and Netflix process over a trillion messages a day with Kafka (1,000,000,000,000).
  •           Kafka is used for real-time streams of data and to collect big data for real-time analysis.


Kafka Terminology

Topic:  A topic is a category or feed name to which records are published.

Producers: A producer can be any application that publishes messages to a topic.
Broker: A Kafka cluster is a set of servers, each of which is called a broker.
Consumer: A consumer can be any application that subscribes to a topic and consumes published messages.
Partition: Topics are broken up into ordered commit logs called partitions.
Zookeeper: ZooKeeper is used for managing and coordinating Kafka brokers.

Kafka Cluster


Apache Kafka Cluster

Multiple producers produce data to Kafka brokers, and these brokers together form a Kafka cluster. Multiple consumers consume data from the cluster. The Kafka cluster is managed by ZooKeeper.
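
As a minimal sketch of this relationship (assuming Kafka is installed as described later in this post, with ZooKeeper running on localhost:2181), you can list the broker ids that have registered themselves in ZooKeeper:

       $ ~/kafka/bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids

On a single-broker setup this should print something like [0], with one id per broker in the cluster.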


Kafka Architecture

Kafka Architecture

Kafka Features
  •          High throughput: supports hundreds of thousands of messages per second with modest hardware.
  •          Scalability: a highly scalable distributed system with no downtime.
  •          No data loss: Kafka ensures no data loss once configured properly.
  •          Stream processing: Kafka can be used together with real-time streaming applications such as Spark and Storm.
  •          Durability: supports persisting messages on disk.
  •          Replication: messages can be replicated across clusters, which supports multiple subscribers.



Kafka Components – Topics & Partitions
  •          A topic is a category or feed name to which records are published.
  •          Topics are broken up into ordered commit logs called partitions.
  •          Each message in a partition is assigned a sequential id called an offset.
  •          Data in a topic is retained for a configurable period of time.
  •          Writes to a partition are generally sequential, thereby reducing the number of hard disk seeks.
  •          Messages can be read from the beginning, and a consumer can also rewind or skip to any point in a partition by giving an offset value (see the example below).
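
As a minimal sketch (assuming Kafka is installed as described later in this post, with a broker on localhost:9092 and ZooKeeper on localhost:2181; the topic name logs is hypothetical), you can create a topic with three partitions and then read one partition from a chosen offset:

       # Create a topic named "logs" with three partitions
       $ ~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic logs

       # Read partition 0 starting from offset 5 instead of from the beginning
       $ ~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic logs --partition 0 --offset 5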

Kafka Components – Producer
  •          A producer publishes new messages to a specific topic.
  •          By default, the producer does not care which partition a specific message is written to and will balance messages over every partition of a topic evenly.
  •          Messages can be directed to a specific partition using the message key and a partitioner, which generates a hash of the key and maps it to a partition.
  •          Every message a producer publishes is in the form of a key-value pair (see the example below).
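
As a minimal sketch of keyed messages (assuming a running broker on localhost:9092 and the hypothetical topic logs from above), the console producer can parse a key and a value out of each input line:

       # Publish key "user42" with value "clicked-home"; messages with the same key map to the same partition
       $ echo "user42:clicked-home" | ~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic logs --property "parse.key=true" --property "key.separator=:"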


Kafka Components – Consumer
  •          Consumers read messages.
  •          A consumer subscribes to one or more topics and reads the messages sequentially.
  •          A consumer keeps track of the messages it has consumed by keeping track of their offsets.
  •          The offset is a bit of metadata that Kafka adds to each message as it is produced.
  •          Each message in a partition has a unique offset, which is stored.
  •          By storing the offset of the last consumed message, a consumer can stop and restart without losing its current state (see the example below).
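
As a minimal sketch of offset tracking (assuming a running broker on localhost:9092 and a hypothetical consumer group named my-group that has been consuming), you can inspect the stored offsets per partition:

       $ ~/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

The output lists, for each partition, the group's current offset, the log-end offset, and the lag between the two.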

Kafka Components – Zookeeper
  •          ZooKeeper is used for managing and coordinating Kafka brokers.
  •          The ZooKeeper service is mainly used for coordination between the brokers in the Kafka cluster.
  •          The Kafka cluster connects to ZooKeeper to get information about any failed nodes.


Kafka Cluster
  •          Kafka brokers are designed to operate as part of a cluster.
  •          One broker also functions as the cluster controller.
  •          The controller is responsible for administrative operations, such as assigning partitions to brokers and monitoring for broker failures in the cluster.
  •          A particular partition is owned by a single broker, and that broker is called the leader of the partition.
  •          All consumers and producers operating on that partition must connect to the leader (see the example below).
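
As a minimal sketch (assuming Kafka is installed as described below and the hypothetical topic logs exists), you can check which broker is the leader of each partition:

       $ ~/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic logs

For each partition, the output shows the Leader (the id of the broker that owns it), the Replicas, and the in-sync replicas (Isr).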
Types of Kafka Cluster
  • Single Node - single broker cluster
  • Single Node - Multiple broker cluster
  • Multiple Node - Multiple broker cluster

Install Apache Kafka on Ubuntu 18.04

Prerequisites

  • One Ubuntu server and a non-root user with sudo privileges.
  • At least 4GB of RAM on the server.
  • OpenJDK 8 installed on your server.
Step 01: Creating a User for Kafka


Since Kafka can handle requests over a network, you should create a dedicated user for it. This minimizes damage to your Ubuntu machine should the Kafka server be compromised. We will create a dedicated kafka user in this step, but you should use a different non-root user to perform other tasks on this server once you have finished setting up Kafka.

Create a user called kafka


          $ sudo useradd kafka -m

Set the password for kafka

         $ sudo passwd kafka

Add the kafka user to the sudo group so that it has the privileges required to install Kafka's dependencies:

        $ sudo adduser kafka sudo

The kafka user is now ready. Log in to the kafka account by typing:

        $ su -l kafka

Step 02: Downloading and Extracting the Kafka Binaries


Create a directory in /home/kafka called Downloads to store your downloads

       $ mkdir ~/Downloads

Download the Kafka binaries:

       $ curl "http://www-eu.apache.org/dist/kafka/1.1.0/kafka_2.12-1.1.0.tgz" -o ~/Downloads/kafka.tgz

Create a directory called kafka and change to this directory. This will be the base directory of the Kafka installation:

       $ mkdir ~/kafka && cd ~/kafka

Extract the archive you downloaded using the tar command:

       $ tar -xvzf ~/Downloads/kafka.tgz --strip 1

Step 03: Configuring the Kafka Server


Kafka's configuration options are specified in server.properties. Open this file with nano or your favorite editor:

       $ nano ~/kafka/config/server.properties
    
Let's add a setting that will allow us to delete Kafka topics. Add the following to the bottom of the file.

       delete.topic.enable = true

Save the file and exit nano.
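
With this setting in place (and once the server is running, per the next step), topics can be deleted. As a sketch, a hypothetical topic named logs could be removed like this:

       $ ~/kafka/bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic logs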

Step 04: Creating Systemd Unit Files and Starting the Kafka Server

In this step, we will create systemd unit files for the ZooKeeper and Kafka services. This will help us perform common service actions such as starting, stopping, and restarting Kafka in a manner consistent with other Linux services.

Create the unit file for zookeeper:

       $ sudo nano /etc/systemd/system/zookeeper.service

Enter the following unit definition into the file:

        [Unit]
        Requires=network.target remote-fs.target
        After=network.target remote-fs.target

        [Service]
        Type=simple
        User=kafka
        ExecStart=/home/kafka/kafka/bin/zookeeper-server-start.sh /home/kafka/kafka/config/zookeeper.properties
        ExecStop=/home/kafka/kafka/bin/zookeeper-server-stop.sh
        Restart=on-abnormal

        [Install]
        WantedBy=multi-user.target

Save and exit the file. Next, create the systemd service file for kafka:

      $ sudo nano /etc/systemd/system/kafka.service
   
Enter the following unit definition into the file:
 
      [Unit]
      Requires=zookeeper.service
      After=zookeeper.service

      [Service]
      Type=simple
      User=kafka
      ExecStart=/bin/sh -c '/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/server.properties > /home/kafka/kafka/kafka.log 2>&1'
      ExecStop=/home/kafka/kafka/bin/kafka-server-stop.sh
      Restart=on-abnormal

      [Install]
      WantedBy=multi-user.target

Now that the units have been defined, start Kafka with the following command (since the kafka unit requires zookeeper.service, this also starts ZooKeeper automatically):

        $ sudo systemctl start kafka

To ensure that the server has started successfully, check the journal logs for the kafka unit:

        $ sudo journalctl -u kafka

You should see output similar to the following:

       Jul 17 18:38:59 kafka-ubuntu systemd[1]: Started kafka.service.

You now have a Kafka server listening on port 9092.
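
With the units as written, Kafka will not start automatically after a reboot. To enable the kafka service on server boot, run:

        $ sudo systemctl enable kafka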


Step 05: Testing the Installation

Let's publish and consume a "Hello World" message to make sure the Kafka server is behaving correctly. Publishing messages in Kafka requires:

  • A producer, which enables the publication of records and data to topics.
  • A consumer, which reads messages and data from topics.

First, create a topic named TutorialTopic by typing:

         $ ~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TutorialTopic

You can create a producer from the command line using the kafka-console-producer.sh script. It expects the Kafka server's hostname, port, and a topic name as arguments.


Publish the string "Hello, World" to the TutorialTopic topic by typing:

          $ echo "Hello, World" | ~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TutorialTopic > /dev/null

You can create a Kafka consumer using the kafka-console-consumer.sh script. It expects the Kafka server's hostname and port, along with a topic name, as arguments. The following command consumes messages from TutorialTopic. Note the use of the --from-beginning flag, which allows the consumption of messages that were published before the consumer was started:

           $ ~/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TutorialTopic --from-beginning

If there are no configuration issues, you should see Hello, World in your terminal. The consumer will keep running, waiting for new messages to be published; press CTRL+C to stop it when you are done.
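
Before stopping the consumer, you can publish a further message to the topic from a second terminal and watch it appear in the consumer's output immediately (the message text here is just an example):

           $ echo "Hello again" | ~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TutorialTopic > /dev/null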

Step 06: Setting Up a Multi-Node Cluster (Optional)

If you want to create a multi-broker cluster using more Ubuntu 18.04 machines, you should repeat Step 1, Step 4, and Step 5 on each of the new machines. Additionally, you should make the following changes in the server.properties file for each.

  • The value of the broker.id property should be changed such that it is unique throughout the cluster. This property uniquely identifies each server in the cluster; Kafka expects an integer value, for example 1, 2, etc.
  • The value of the zookeeper.connect property should be changed such that all nodes point to the same ZooKeeper instance. This property specifies the ZooKeeper instance's address and follows the <HOSTNAME/IP_ADDRESS>:<PORT> format, for example "203.0.113.0:2181", "203.0.113.1:2181", etc.
If you want to have multiple ZooKeeper instances for your cluster, the value of the zookeeper.connect property on each node should be an identical, comma-separated string listing the IP addresses and port numbers of all the ZooKeeper instances.
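
As a sketch, the relevant lines of ~/kafka/config/server.properties on a second node might look like this (the broker id 1 and the ZooKeeper address are just the example values from above):

        broker.id=1
        zookeeper.connect=203.0.113.0:2181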




