Apache Kafka simply explained
Learning Apache Kafka doesn’t have to be difficult. Read on to get a friendly explanation of the Apache Kafka fundamentals.
Apache Kafka® simply explained with an e-commerce project example
Apache Kafka® is widely used in the industry, but the learning curve can be steep and understanding the building blocks of this technology can be challenging. That’s why the goal for this article is to look at the fundamentals of Apache Kafka in simple terms.
What is Apache Kafka?
Apache Kafka is an event streaming platform that is distributed, scalable, high-throughput, low-latency, and has a very large ecosystem.
Or, simply put, it is a platform that handles the transport of messages across your systems, microservices, or any other working modules. These can be frontend and backend applications, a set of IoT devices, or other modules.
The Apache Kafka platform is distributed, meaning that it relies on multiple servers, with data replicated across multiple locations, so that if some servers fail, we’re still fine.
It is scalable, and you can have as many servers as you need. You can start small and add more servers as your system grows. These servers can handle trillions of messages per day, ending up in petabytes of data persistently stored on disk.
Another great thing about Apache Kafka is its community and the wide ecosystem surrounding the technology. This includes client libraries for different programming languages and a set of data connectors to integrate Kafka with your existing external systems. You don’t need to reinvent the wheel to start using Apache Kafka; instead, you can rely on the work of amazing developers who have already solved similar issues.
Where Apache Kafka is used
To understand where the need for Apache Kafka is coming from, we’ll look at an example of a product.
Imagine that we decided to build an e-commerce project. When starting to work on the project, maybe during its MVP (minimum viable product) stage, we chose to keep all subsystems next to each other as a single monolith. That’s why, from the beginning, we kept our frontend and backend services, as well as the data store, closely interconnected.
This might not be ideal, but at the start this approach can be effective, and it will work as long as we have a small number of users and a limited amount of functionality.
However, once we start scaling and adding more and more modules (for example, a recommendation engine or a notification service), the current architecture and the information flow will very quickly become complete chaos that is difficult to support and expand. And with the development team growing, no single person will be able to keep up with the data flow of this product.
That’s why eventually we’ll need to have a tough conversation on how to split our monolith into a set of independent microservices with clear, agreed and documented communication interfaces.
What’s even more crucial, our new architecture must allow the product to rely on real-time events, where users don’t have to wait till tomorrow to get meaningful recommendations based on their latest purchases.
And this is a lot to ask. Processing events in this way is an immensely high-volume operation, and it needs to be resilient to failures.
Lucky for us, these are exactly the challenges that Apache Kafka can help with. Apache Kafka is great at untangling data flows, simplifying the way we handle real-time data, and decoupling subsystems.
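The decoupling described above can be illustrated with a toy model. The sketch below is not the Apache Kafka API: `ToyTopic`, its methods, and the service names are hypothetical, and everything lives in memory in a single process. It only shows the idea that producers append messages to a shared log and each consuming subsystem reads them at its own pace, without knowing about the others.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a message log (hypothetical, not a Kafka client)."""

    def __init__(self):
        self._log = []                    # append-only list of messages
        self._offsets = defaultdict(int)  # next position to read, per consumer name

    def publish(self, message):
        """Append a message to the end of the log."""
        self._log.append(message)

    def poll(self, consumer):
        """Return the messages this consumer has not seen yet and advance its position."""
        start = self._offsets[consumer]
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


orders = ToyTopic()
orders.publish({"order_id": 1, "item": "lamp"})
orders.publish({"order_id": 2, "item": "desk"})

# Two independent subsystems read the same messages, each at its own pace.
for_recommendations = orders.poll("recommendation-engine")
orders.publish({"order_id": 3, "item": "chair"})
for_notifications = orders.poll("notification-service")
more_for_recommendations = orders.poll("recommendation-engine")
```

In this toy model, adding a third subsystem means calling `poll` with a new name; the code that publishes orders does not change. That is the property the monolith lacks.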
Apache Kafka’s way of thinking
To understand how Apache Kafka works, and how we can work with it effectively, we need to talk about Apache Kafka’s way of thinking about data.
The approach which Apache Kafka takes is simple, but clever. Instead of working with data in the form of static objects, or final facts that are aggregated and stored in a database, Apache Kafka describes entities by continuously arriving events.
For example, in our e-commerce product we have a list of goods that we sell. Their availability and other characteristics can be represented in a database as numbers.
This gives us some valuable information, some final aggregated results. However, we need to plan very carefully what information we store, so that it is sufficient to cover calculations of future insights. Since we don’t know what the future holds, it is very tough to predict what data should be kept long term and what is safe to throw away.
Apache Kafka suggests that instead of storing aggregated object characteristics, we view this data as a flow of events.
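As a minimal sketch of this idea in plain Python (no Kafka involved), the block below keeps a list of events and derives aggregates from it on demand. The event fields (`type`, `item`, `quantity`), the sample data, and the function names are assumptions made up for illustration, not part of the original article.

```python
from collections import Counter

# Hypothetical events for illustration: each one records something that happened.
events = [
    {"type": "restocked", "item": "lamp", "quantity": 10},
    {"type": "purchased", "item": "lamp", "quantity": 2},
    {"type": "restocked", "item": "desk", "quantity": 5},
    {"type": "purchased", "item": "lamp", "quantity": 1},
    {"type": "purchased", "item": "desk", "quantity": 1},
]


def current_stock(events):
    """Replay the events to compute how many units of each item are available."""
    stock = Counter()
    for event in events:
        if event["type"] == "restocked":
            stock[event["item"]] += event["quantity"]
        elif event["type"] == "purchased":
            stock[event["item"]] -= event["quantity"]
    return dict(stock)


def units_sold(events):
    """Answer a different question later from the same events: units sold per item."""
    sold = Counter()
    for event in events:
        if event["type"] == "purchased":
            sold[event["item"]] += event["quantity"]
    return dict(sold)


print(current_stock(events))  # {'lamp': 7, 'desk': 4}
print(units_sold(events))     # {'lamp': 3, 'desk': 1}
```

A database holding only the stock numbers could answer the first question but not the second. Keeping the events lets us compute new aggregates later without having predicted them in advance.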
This article originally appeared on aiven.io, where the full article can be read.