Monitoring Kafka: Real-Time Ops Challenge

Let’s say you’re a long-time expert on message-oriented middleware, perhaps an MQSeries (now IBM MQ) pro from back in the day, and someone asks you to create an entirely new middleware product following modern, cloud-native architectural principles, able to process trillions of messages per day. What would it look like?

The answer: it would probably look a lot like Apache Kafka. Kafka is an open-source stream processing platform that originated at LinkedIn, where the engineering team built it to handle real-time data feeds for real-time analytics.

Kafka is cloud-native, which means that its creators built it around the architectural best practices of the cloud: horizontal scalability, rapid elasticity, high performance, and no single point of failure.

Kafka can, of course, run in the cloud as well – but as middleware, it can connect to both cloud and on-premises endpoints, or run fully on-premises if so desired. Regardless of such deployment details, Kafka brings the architectural advantages of cloud-native software to the middleware world.

The Real-Time Context for Kafka

Kafka’s ability to handle streaming data feeds in real time enables executives and other line-of-business personnel to obtain up-to-the-moment insights into information relevant to their business.

To achieve this level of performance, Kafka operates with low latency in a fully distributed manner. It appends data to sequential logs on disk and serves most reads from the operating system’s page cache, while scaling writes and reads with partitioned, distributed commit logs.
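To make the commit log idea concrete, here is a minimal sketch in Python. It is a toy, in-memory model rather than Kafka's implementation: the `CommitLog` class and the event names are hypothetical. The point it illustrates is that records are only ever appended, each one receives a sequential offset, and every reader tracks its own position independently.

```python
class CommitLog:
    """Toy append-only log: records get sequential offsets and are never modified."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Each reader keeps its own offset, so a slow reader never blocks a fast one.
fast_reader_offset = 0
batch = log.read(fast_reader_offset)
fast_reader_offset += len(batch)

slow_batch = log.read(0, max_records=1)

print(batch, fast_reader_offset, slow_batch)
```

Because reading does not remove anything from the log, any number of readers can consume the same records at their own pace.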

In addition, Kafka offers built-in load balancing, coupled with its rapid partitioning of data (also called sharding). In other words, the Kafka team designed the platform to be more like a distributed database transaction log than a traditional middleware system.
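The link between partitioning and load balancing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kafka's code: CRC32 stands in for the hash function (Kafka's own default partitioner uses a different one, murmur2), and the function name, the keys, and the partition count are all hypothetical.

```python
import zlib
from collections import Counter


def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index by hashing the key.

    Assumption: CRC32 is a stand-in hash chosen for this sketch only.
    """
    return zlib.crc32(key) % num_partitions


# Hypothetical keys and partition count, for illustration only.
NUM_PARTITIONS = 6
keys = [f"customer-{i}".encode() for i in range(1000)]

# Count how many keys land on each partition.
load = Counter(assign_partition(k, NUM_PARTITIONS) for k in keys)
print(sorted(load.items()))
```

Hashing the key means that every record with the same key lands on the same partition, which preserves per-key ordering, while different keys spread across the available partitions.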

Following its cloud-native heritage, the Kafka team also built replication into its architecture – once again, in a horizontally scalable, elastic way. Kafka accomplishes this task with a complicated set of records, topics, consumers, producers, brokers, logs, partitions, and clusters – often with one-to-many relationships, with numerous message traffic flows among them.

The end result is both scalable and fault tolerant, as no node acts as a single point of failure. Traditional middleware like IBM MQ, in contrast, depends upon technologies like queues to guarantee message delivery – but with queues, failures delay the arrival of messages. As a real-time platform, the Kafka architecture routes around failures.
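A simplified model shows how replication lets a cluster route around a failed node. The sketch below is a deliberately reduced picture, assuming round-robin replica placement and a "first live replica becomes leader" rule. Real Kafka elects leaders through a controller and tracks which replicas are in sync, none of which is modeled here. The function names and broker names are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Spread each partition's replicas over distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds broker count")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def elect_leaders(assignment, live_brokers):
    """Pick the first live replica of each partition as leader (None if all are down)."""
    return {
        p: next((b for b in replicas if b in live_brokers), None)
        for p, replicas in assignment.items()
    }


brokers = ["broker-1", "broker-2", "broker-3"]
assignment = assign_replicas(num_partitions=4, brokers=brokers, replication_factor=2)

# Leaders with every broker healthy, then again after broker-1 fails.
before = elect_leaders(assignment, live_brokers=set(brokers))
after = elect_leaders(assignment, live_brokers={"broker-2", "broker-3"})

print(before)
print(after)
```

In this toy run, every partition that broker-1 was leading moves to a surviving replica, so all four partitions still have a leader after the failure.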

Monitoring Kafka

The real-time, fault-tolerant performance of Kafka, as with any piece of software, depends upon its proper configuration and the proper operation of the infrastructure that supports it. It would be foolish to assume that the platform’s fault-tolerant architecture doesn’t require monitoring and management.

Monitoring the performance of Kafka, however, means monitoring its various components, as well as the message flows among them, either individually or as end-to-end transactions. Furthermore, such monitoring must take place in real time in order to keep up with the real-time data flowing through the Kafka implementation.
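One widely used measure of how well a message flow is keeping up is consumer lag: the gap between the latest offset written to a partition and the last offset a consumer group has committed. The sketch below computes it from two plain dictionaries. The topic names, offsets, and alert threshold are hypothetical, and treating a missing committed offset as zero is an assumption made for this example. In practice the offsets would come from the cluster itself.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: latest offset written minus last offset committed.

    Assumption: a partition with no committed offset is treated as offset 0.
    """
    return {
        tp: max(end - committed_offsets.get(tp, 0), 0)
        for tp, end in log_end_offsets.items()
    }


# Hypothetical snapshot, keyed by (topic, partition).
LAG_ALERT_THRESHOLD = 100
log_end_offsets = {("orders", 0): 1500, ("orders", 1): 900, ("payments", 0): 300}
committed_offsets = {("orders", 0): 1480, ("orders", 1): 900}

lag = consumer_lag(log_end_offsets, committed_offsets)
lagging = sorted(tp for tp, n in lag.items() if n > LAG_ALERT_THRESHOLD)

print(lag)
print("total lag:", sum(lag.values()))
print("needs attention:", lagging)
```

In this snapshot the two orders partitions are nearly caught up, while the payments partition is 300 records behind and crosses the alert threshold. A steadily growing lag is a sign that readers are falling behind the real-time feed.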

The diagram below from Nastel AutoPilot for Kafka illustrates this challenge. On the left is a Kafka sender, which interacts with three readers, and in turn, with five topics. Note, however, that Kafka is an elastic environment, so it may spawn additional components as necessary.

Monitoring Kafka (Source: Nastel)

AutoPilot monitors each of the components and the message flows connecting them – a task that would overwhelm a traditional monitoring tool.

In fact, AutoPilot provides both operational and transactional monitoring. It also offers forensics to diagnose Kafka problems, and monitors Kafka performance and availability via end-to-end stream monitoring and metrics tracking from the various Kafka components as well as ZooKeeper, Kafka’s configuration service.

The Intellyx Take

As streaming data become increasingly common, organizations are coming to depend upon the real-time visibility at scale that such technologies provide. The real-time streaming capabilities that Kafka delivers add new value to the business and can help evolve traditional data-centric offerings, from data warehouses to business intelligence.

This broad-based trend toward real-time insights raises the bar on monitoring and management, both for operations and for the transaction management essential for business visibility.

Customers aren’t going to wait for real-time performance, so the business cannot afford to wait either. Every aspect of the technology infrastructure, from the network to the middleware to the applications, must now perform in a cloud-native, real-time context – including ops management and monitoring.

Copyright © Intellyx LLC. Nastel is an Intellyx client. At the time of writing, none of the other organizations mentioned in this article are Intellyx clients. Intellyx retains full editorial control over the content of this paper.