Usage of Kafka at WodenSoft
We use Kafka at WodenSoft in IT Integration Solutions for several benefits: real-time messaging (low latency), removing DB links, reliability, scalability, high performance at low cost, mirroring, distributed messaging, and handling a huge number of transactions.
In order to use big data efficiently, two important factors come up:
- Collecting big data.
- Analyzing big data.
We need a messaging system (queue) to collect big data blocks quickly and transfer them to other systems. At this point Apache Kafka allows you to place the streamed data into a queue and transfer it to other systems such as Hadoop, Spark, or Elasticsearch.
Here is the question you might have: why would you want to integrate a component such as Kafka when you could write the data directly into a client and integrate it with other systems yourself?
In fact, when we collect a lot of data from external systems, the following situations and needs may arise:
- The amount of data is large. We need a scalable messaging system that can ingest the data faster than the other systems, so the messaging system we use should be able to work in parallel on multiple machines.
- We may want to access the data immediately.
- During data transfer, messages should not be lost when one of the systems goes down (replication factor).
- We may want to keep the data for a certain period of time, so that the system receiving the data can still retrieve it later, even if it has been down for a few days (persistence).
- Setup and usage of the system should be easy.
At this point, Apache Kafka offers us a high-performance distributed messaging system that, in our experience, is more capable than the other messaging systems we evaluated.
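The replication and retention needs listed above can be illustrated with a toy model. This is a deliberately simplified in-memory sketch of the two *concepts* (replication factor, time-based retention), not how Kafka actually implements them; the `MiniLog` class and its defaults are invented for the example.

```python
import time

class MiniLog:
    """Toy replicated log with time-based retention (illustration only)."""

    def __init__(self, replication_factor=3, retention_seconds=7 * 24 * 3600):
        self.retention_seconds = retention_seconds
        # One list per replica; Kafka keeps these on separate brokers.
        self.replicas = [[] for _ in range(replication_factor)]

    def append(self, message, now=None):
        now = time.time() if now is None else now
        # A write is copied to every replica, so losing one
        # replica does not lose the message.
        for replica in self.replicas:
            replica.append((now, message))

    def prune(self, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        # Messages older than the retention window are discarded.
        for i, replica in enumerate(self.replicas):
            self.replicas[i] = [(t, m) for t, m in replica if t >= cutoff]

    def read(self, replica_index=0):
        return [m for _, m in self.replicas[replica_index]]
```

A consumer that was down for less than the retention window can still read everything it missed from any surviving replica, which is the point of the "persistence" requirement above.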
What is Kafka?
Apache Kafka™ is a distributed streaming platform; at WodenSoft we are using this implementation to improve our customers' system performance.
- Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another.
- Kafka is suitable for both offline and online message consumption.
- Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss.
- Kafka is built on top of the ZooKeeper synchronization service.
- It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Capabilities of Kafka
We think of a streaming platform as having three key capabilities:
- It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
- It lets you store streams of records in a fault-tolerant way.
- It lets you process streams of records as they occur.
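The first capability, publish-subscribe over a stream of records, can be sketched with an in-memory topic where each subscriber reads the full stream at its own pace via a per-subscriber offset. This is only an illustration of the idea; Kafka's real consumer groups and offset management are far more involved, and the subscriber names used below are invented.

```python
class Topic:
    """Toy pub-sub topic: an append-only stream plus per-subscriber offsets."""

    def __init__(self):
        self.records = []   # the append-only stream of records
        self.offsets = {}   # subscriber name -> next index to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, subscriber):
        # Return every record this subscriber has not seen yet,
        # then advance its offset past the end of the stream.
        start = self.offsets.get(subscriber, 0)
        new = self.records[start:]
        self.offsets[subscriber] = len(self.records)
        return new
```

Because offsets are tracked per subscriber, a late subscriber still receives the whole stream, unlike a classic queue where a consumed message is gone.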
It gets used for two broad classes of application:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
- Kafka is run as a cluster on one or more servers.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
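The record structure above, and the way a record's key decides which partition of a topic it lands in, can be sketched as follows. Note this is an assumption-laden illustration: Kafka's default partitioner hashes the key bytes with murmur2, while plain `hash()` is used here only to show that equal keys map to the same partition.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Record:
    """A record, as in Kafka: a key, a value, and a timestamp."""
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)

def partition_for(record: Record, num_partitions: int) -> int:
    # All records with the same key map to the same partition,
    # which is what preserves per-key ordering.
    return hash(record.key) % num_partitions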
Benefits of Kafka
- Reliability − It is distributed, partitioned, replicated, and fault tolerant.
- Scalability − Its messaging system scales easily without downtime. As the volume of data to be processed and sent grows, Kafka's performance remains stable.
- Durability − It uses a distributed commit log, which means messages are persisted on disk as fast as possible, hence it is durable.
- Performance − It has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored. It is very fast and is designed for zero downtime and zero data loss.
- Single producer thread, no replication: 821,557 records/sec (78.3 MB/sec)
- Single producer thread, 3x synchronous replication: 421,823 records/sec (40.2 MB/sec)
- Single producer thread, 3x asynchronous replication: 786,980 records/sec (75.1 MB/sec)
- Three producers, 3x asynchronous replication: 2,024,032 records/sec (193.0 MB/sec)
Test machine specifications:
- Intel Xeon 2.5 GHz processor with six cores
- Six 7200 RPM SATA drives
- 32 GB of RAM
- 1 Gb Ethernet
Use Cases of Kafka
It can be used in many use cases. Some of them are listed below:
- Metrics − It is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Log Aggregation Solution − It can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
- Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka's strong durability is also very useful in the context of stream processing.
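The stream-processing pattern described above, reading from one topic, transforming, and writing to a new topic, reduces to a consume-transform-produce loop. In this sketch plain Python lists stand in for topics, and the names `raw_events` and `upper_events` are invented for the example; real frameworks add windowing, state, and fault tolerance on top of this basic shape.

```python
def process(input_topic, output_topic, transform):
    # Read each record from the input stream, apply the
    # transformation, and write the result to the output stream.
    for record in input_topic:
        output_topic.append(transform(record))

# Invented sample data: an "input topic" of raw event names.
raw_events = ["page_view", "click", "purchase"]
upper_events = []
process(raw_events, upper_events, str.upper)
```

Because the output lands in a new topic, any number of downstream consumers can subscribe to the processed stream without touching the raw one.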
Architecture of Kafka
It has four core APIs:
1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams into output streams.
4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.