
Monitoring Confluent Platform with Prometheus & Grafana

A Best-Practice Guide to JMX Exporter, Node Exporter, and Kafka Lag Exporter

Modern event-driven platforms built on Confluent Platform and Apache Kafka require strong observability to operate reliably at scale. With multiple distributed components—Kafka Brokers, KRaft or ZooKeeper Controllers, Schema Registry, Kafka Connect, ksqlDB, and Control Center—effective monitoring is essential for:

  1. Performance tuning
  2. Capacity planning
  3. Early anomaly detection
  4. Meeting strict SLAs and SLOs

This guide explains how a Prometheus + Grafana monitoring stack, combined with purpose-built Kafka exporters, provides deep visibility across the application, JVM, and infrastructure layers of the Confluent ecosystem.

The Observability Stack

Prometheus – The Metrics Engine

Prometheus is a high-performance, pull-based monitoring system designed for dynamic, distributed environments. It:

  1. Scrapes metrics over HTTP from exporter endpoints
  2. Stores metrics as time-series data
  3. Uses PromQL for flexible querying and aggregation
  4. Evaluates alerting rules and integrates with Alertmanager for notifications

Prometheus serves as the single source of truth for operational metrics in most Kafka monitoring architectures.
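To make the alerting role concrete, here is a minimal sketch of a Prometheus rule file. The metric name is an assumption — it depends on how your JMX Exporter rules translate Kafka's `UnderReplicatedPartitions` MBean — so treat this as illustrative, not drop-in:

```yaml
# alert-rules.yml -- illustrative sketch; the metric name assumes a JMX
# Exporter rule that maps Kafka's UnderReplicatedPartitions MBean to this name.
groups:
  - name: kafka-broker-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Broker {{ $labels.instance }} has under-replicated partitions"
```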

Grafana – The Visualization Layer

Grafana provides the visualization and user interface layer for observability. When connected to Prometheus, it enables:

  1. Real-time and historical dashboards
  2. Correlation across application, JVM, and host metrics
  3. Alert visualization and management (typically backed by Prometheus Alertmanager)

Grafana acts as the “single pane of glass” for Kafka operators and SRE teams.

High-Level Architecture Overview

The strength of this monitoring stack lies in its ability to aggregate metrics from multiple layers:

  1. Kafka & Confluent services → JVM and application metrics
  2. Operating system → CPU, memory, disk, and network metrics
  3. Consumers → lag and throughput indicators

End-to-End Metrics Flow

1. JMX Exporter – JVM and Kafka Metrics

All Kafka and Confluent Platform services are Java-based and expose internal performance metrics via JMX (Java Management Extensions). Since Prometheus cannot scrape JMX directly, the JMX Exporter is used as a bridge.

Role

  1. Attaches to the JVM and reads JMX MBeans
  2. Converts JMX metrics into Prometheus-compatible format
  3. Exposes them over an HTTP /metrics endpoint

Deployment Best Practice

  1. Run the JMX Exporter as a Java Agent attached to each Kafka or Confluent service process
  2. Use carefully curated JMX rules to avoid excessive metric cardinality

Commonly Monitored Metrics

  1. Broker throughput (Bytes In / Bytes Out)
  2. Under-Replicated Partitions (URP)
  3. ISR shrink/expand events
  4. Request latency metrics
  5. JVM heap usage and garbage collection pauses

Example Java Agent Configuration

-javaagent:/opt/jmx_prometheus_javaagent.jar=<PORT>:/opt/kafka-jmx.yml

Best practice: Treat JMX Exporter metrics as signal-rich but expensive—collect what you need, not everything.
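As a sketch of what a curated rules file might contain, the fragment below whitelists two of the metrics listed above rather than exporting every MBean. The exact patterns and output names are assumptions — adjust them to your Kafka version and naming conventions:

```yaml
# kafka-jmx.yml -- a minimal, curated sketch (patterns and names are
# illustrative; verify against the MBeans your broker version exposes).
lowercaseOutputName: true
rules:
  # Broker throughput: BytesInPerSec / BytesOutPerSec one-minute rates
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>OneMinuteRate
    name: kafka_server_brokertopicmetrics_$1
    type: GAUGE
  # Under-Replicated Partitions (URP)
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
```

Whitelisting like this keeps cardinality under control; a catch-all rule on a busy broker can easily emit tens of thousands of series per instance.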

2. Node Exporter – Host-Level Metrics

Kafka is both storage-intensive and network-intensive. Application-level metrics alone are insufficient without understanding the behavior of the underlying hosts.
The Node Exporter provides visibility into operating system and hardware metrics.

Role

  1. Runs as a lightweight daemon on each Kafka host
  2. Exposes OS-level metrics to Prometheus

Why It’s Critical

Node Exporter enables correlation between Kafka performance issues and infrastructure constraints such as:

  1. Disk I/O saturation
  2. Network bandwidth limits
  3. CPU contention or steal time (especially in virtualized or cloud environments)

Key Metrics

  1. node_cpu_seconds_total
  2. node_filesystem_avail_bytes
  3. node_disk_io_time_seconds_total
  4. node_network_receive_bytes_total

Best practice: Always analyze Kafka broker metrics alongside disk and network metrics—most Kafka bottlenecks originate at the OS layer.
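One way to make that correlation cheap at query time is to precompute host rates as Prometheus recording rules. The rule names below are hypothetical; the PromQL expressions use the Node Exporter counters listed above:

```yaml
# recording-rules.yml -- illustrative recording rules over Node Exporter
# counters (rule names are an assumed convention, not a standard).
groups:
  - name: kafka-host-health
    rules:
      # Approximate disk utilization: fraction of each second the device was busy.
      - record: instance_device:node_disk_io_time_seconds:rate5m
        expr: rate(node_disk_io_time_seconds_total[5m])
      # Network receive throughput in bytes/sec per interface.
      - record: instance_device:node_network_receive_bytes:rate5m
        expr: rate(node_network_receive_bytes_total[5m])
```

These precomputed series can then sit on the same Grafana panel as broker throughput, making OS-layer bottlenecks visible at a glance.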

3. Kafka Lag Exporter – Consumer Health and SLAs

Consumer lag is one of the most important indicators of Kafka system health. It represents the gap between:

  1. The Log End Offset (LEO) of a partition
  2. The committed offset of a consumer group

The Kafka Lag Exporter calculates and exposes these metrics by connecting to Kafka as a client.

Role

  1. Reads consumer group offsets from Kafka
  2. Computes partition-level and group-level lag
  3. Exposes lag metrics via /metrics

Key Metrics

  1. kafka_consumergroup_lag
  2. kafka_consumergroup_max_lag_seconds

Important Note on Time-Based Lag

Time-based lag is an estimate, derived from recent throughput rates. It assumes:

  1. Stable production rates
  2. Consistent consumer processing
  3. No frequent rebalances

Best practice: Use time-based lag as a directional indicator, not an exact SLA measurement.
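Following that guidance, a lag alert might key on the offset-based metric and use time-based lag only for context. The threshold and metric names below are illustrative (exact names vary between Kafka Lag Exporter versions):

```yaml
# lag-alerts.yml -- sketch only; tune the threshold to your workload and
# verify the metric names your exporter version actually emits.
groups:
  - name: consumer-lag-alerts
    rules:
      - alert: ConsumerGroupLagHigh
        expr: kafka_consumergroup_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.group }} lag exceeds 10k messages"
```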

Prometheus Scrape Configuration (Example)

Prometheus is configured to scrape exporter endpoints at regular intervals.

scrape_configs:
  - job_name: 'kafka-jmx'
    static_configs:
      - targets: ['<broker-ip>:<port>']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['<host-ip>:<port>']

  - job_name: 'kafka-lag'
    static_configs:
      - targets: ['<lag-exporter-ip>:<port>']

Best practice: In Kubernetes or cloud environments, use service discovery (Kubernetes SD, EC2 SD, Consul) instead of static targets.
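For the Kubernetes case, a discovery-based job might look like the sketch below. The annotation-driven relabeling shown is a common convention rather than a requirement, so adapt it to however your pods advertise their metrics ports:

```yaml
# Sketch of a Kubernetes service-discovery scrape job (annotation names
# follow the widely used prometheus.io convention; yours may differ).
scrape_configs:
  - job_name: 'kafka-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```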

Resulting Operational Insights

By combining JMX Exporter, Node Exporter, and Kafka Lag Exporter, Grafana dashboards can provide a holistic operational view:

  1. Broker Health: Throughput, request latency, partition state
  2. JVM Health: Heap usage, GC pauses, thread counts
  3. Host Health: CPU utilization, disk I/O, network saturation
  4. Consumer Health: Real-time lag, backlog trends, SLA risk

This layered observability approach allows teams to move from reactive firefighting to proactive optimization, ensuring that the Confluent Platform remains scalable, resilient, and performant under production workloads.

Ready to Elevate Your Kafka Observability?

If you’re looking to implement a production-grade monitoring stack or need to fine-tune your Confluent Platform performance, Alephys can help you architect a robust observability framework tailored to your infrastructure.

Whether you’re troubleshooting persistent consumer lag, optimizing JMX metric cardinality, or designing high-availability Prometheus architectures, our team of data experts handles the technical complexities. We help you gain deep operational insights and maintain 99.9% uptime while you focus on delivering value to your customers.

Author: Siva Munaga, Solution Architect at Alephys. I specialize in building scalable data infrastructure and observability solutions that keep modern event-driven applications running smoothly. Let’s connect on LinkedIn to discuss your Kafka monitoring challenges and infrastructure goals!

 
