
Apache Kafka

Apache Kafka is the dominant open-source event streaming platform. It was created at LinkedIn in 2010-2011 by Jay Kreps, Neha Narkhede, and Jun Rao, and donated to the Apache Software Foundation in 2011.

Apache Kafka is the open-source event streaming platform that became the de facto standard for moving data in real time. It is, in plain English, a giant append-only log file that many producers can write to and many consumers can read from independently — distributed across a cluster of machines, replicated for durability, and fast enough to handle millions of events per second.

This page is about the open-source project governed by the Apache Software Foundation. The commercial company founded by Kafka's creators — the one that sells managed Kafka and built most of the surrounding ecosystem — is Confluent, and it has its own page. The two are tightly linked but they are not the same entity, and conflating them is one of the most common mistakes people make when reasoning about this category.

The LinkedIn Origin Story

Kafka was born at LinkedIn in 2010-2011 to solve a problem that was eating LinkedIn's data team alive: point-to-point integrations between systems were unmanageable. LinkedIn had dozens of services — profiles, connections, messages, search, recommendations, the news feed — and each one needed to send and receive data from many others. The result was a tangled mess of custom pipelines, each with its own format, failure modes, and latency characteristics. The team wanted a single, unified pipe that could carry every event LinkedIn produced and let any consumer subscribe.

Three engineers led the effort: Jay Kreps, Neha Narkhede, and Jun Rao. They were inspired by the design of distributed commit logs in databases — the internal append-only log that databases use to record changes before applying them to tables. Their insight was to invert the relationship: make the log the canonical thing, and treat downstream systems as consumers of the log. Kreps named the project after Franz Kafka because, as he later said, "it is a system optimized for writing."

LinkedIn open-sourced Kafka in January 2011. The project entered the Apache Incubator later that year and was promoted to an Apache top-level project in October 2012. Since then it has been governed under the standard Apache Software Foundation model: a project management committee of elected committers, public mailing lists, KIPs (Kafka Improvement Proposals) for major design changes, and Apache 2.0 licensing.

How the ASF Governance Actually Matters

It is fashionable to be cynical about "open core" projects where one company dominates the contributor base. The honest read on Kafka is somewhere in the middle. Confluent contributes the majority of the code, and the project's roadmap is heavily shaped by Confluent's priorities. But because Kafka is an ASF project, the trademark, the codebase, and the release process belong to the foundation, not to Confluent. Anyone can fork it, anyone can run it, and anyone can build a competing product on top of it — which is exactly what AWS, Redpanda, WarpStream, and Aiven have all done.

The result: Kafka the project is genuinely open in a way that some other "open source" infrastructure projects are not. The ecosystem around it is a different question, and we cover that on the Confluent page.

The Core Abstractions

Strip away every piece of marketing and Kafka boils down to a small set of primitives. Once you understand these, every other Kafka concept is a derivation.

Topics. A topic is a named stream of events, like user-clicks or order-events. Topics are the unit of subscription — producers write to a topic, consumers read from it.

Partitions. Each topic is split into one or more partitions, which are the unit of parallelism and ordering. Within a partition, events are strictly ordered. Across partitions, there is no global order. The partition count caps how many consumers in a single group can read a topic in parallel.

Offsets. Within a partition, every event gets a monotonically increasing offset — a 64-bit integer that identifies its position in the log. Offsets are how consumers track their progress; they are how Kafka delivers replayability; they are how you "rewind" a stream to reprocess events from a point in time.

Brokers. A Kafka cluster is a set of broker processes. Each partition is replicated across multiple brokers (the replication factor controls how many). One broker is the leader for each partition; the others are followers that stay in sync. If the leader fails, a follower takes over.

Producers and consumer groups. A producer writes events to topics. A consumer reads them. Consumers are organized into consumer groups — if five consumer instances belong to the same group, Kafka assigns each partition to exactly one instance, so each event is processed by a single member of the group. Different groups (e.g., warehouse-loader vs fraud-detector) read the same events independently and track their own offsets.
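The group mechanics are easy to see in miniature. The sketch below uses a hypothetical round-robin assignment for illustration — Kafka's real assignors (range, round-robin, cooperative-sticky) are more involved, but the invariant is the same: within a group, each partition has exactly one owner.

```python
# Sketch of consumer-group partition assignment. Hypothetical
# round-robin logic for illustration, not Kafka's actual assignors.

def assign_partitions(partitions, members):
    """Distribute each partition to exactly one group member."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        member = members[i % len(members)]
        assignment[member].append(p)
    return assignment

# Six partitions of one topic, two consumer groups reading independently.
partitions = list(range(6))
warehouse = assign_partitions(partitions, ["wh-0", "wh-1", "wh-2"])
fraud = assign_partitions(partitions, ["fd-0"])

# Within a group, every partition has exactly one owner...
assert sorted(p for ps in warehouse.values() for p in ps) == partitions
# ...and a single-member group simply owns everything.
assert fraud["fd-0"] == partitions
```

Note that two groups never contend: each gets the full partition set, which is why warehouse-loader and fraud-detector can both consume every event.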

Retention. Unlike a queue, Kafka does not delete events when they are consumed. They sit in the log until a configurable retention policy (time-based or size-based) removes them. This is the magic that makes streams replayable: a new consumer that arrives next month can read events from the beginning, and a buggy consumer can rewind and reprocess.

That is essentially the entire model. Topics, partitions, offsets, brokers, consumer groups, retention. Everything else — Connect, Streams, Schema Registry, KRaft, tiered storage, MirrorMaker — is built on top of these six concepts.
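The whole model fits in a few dozen lines. This is a toy in-memory sketch for intuition — not real broker code — but it shows how offsets, independent consumer progress, and retention interact:

```python
# Toy in-memory model of Kafka's core primitives.
# One partition = one append-only list plus a base offset.

class TopicPartition:
    def __init__(self):
        self.log = []          # append-only event log
        self.base_offset = 0   # first offset still retained

    def append(self, event):
        """Producer path: events get monotonically increasing offsets."""
        offset = self.base_offset + len(self.log)
        self.log.append(event)
        return offset

    def read(self, offset):
        """Consumer path: read the event at a given offset."""
        return self.log[offset - self.base_offset]

    def truncate_before(self, offset):
        """Retention: drop old events; surviving offsets never change."""
        drop = offset - self.base_offset
        self.log = self.log[drop:]
        self.base_offset = offset

p = TopicPartition()
for e in ["signup", "click", "purchase"]:
    p.append(e)

# Two consumer groups track their own offsets independently.
offsets = {"warehouse-loader": 0, "fraud-detector": 0}
assert p.read(offsets["fraud-detector"]) == "signup"
offsets["fraud-detector"] = 3            # fraud group is caught up
assert offsets["warehouse-loader"] == 0  # warehouse can still replay from 0

# Retention removes old data but preserves offset semantics.
p.truncate_before(1)
assert p.read(2) == "purchase"
```

The key design choice mirrored here is that consumption never mutates the log: the broker keeps data, consumers keep cursors.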

Why Kafka Won

Kafka beat a half-dozen alternatives (RabbitMQ, ActiveMQ, NATS, ZeroMQ, JMS-flavored brokers, even early versions of Kinesis) for a few specific reasons that are worth naming:

1. The log-based architecture was the right primitive. Traditional message queues treated messages as ephemeral — once delivered, gone. Kafka treats the log as canonical and consumption as a side effect. That single inversion made everything downstream possible: replay, multiple independent consumers, change data capture, event sourcing. In hindsight it seems obvious. In 2011 it was not.

2. Horizontal scalability was real, not aspirational. Add brokers to add capacity. Add partitions to add parallelism. A well-tuned Kafka cluster can sustain millions of messages per second with sub-100ms end-to-end latency. The largest deployments in the world (LinkedIn, Netflix, Uber) handle trillions of messages per day on Kafka.

3. Durability was a first-class concern. Replication, ISR (in-sync replicas), acks=all, and (later) idempotent producers made Kafka safe enough for financial-grade workloads. Most pre-Kafka message brokers were faster than databases but lossier; Kafka closed that gap.

4. The ecosystem compounded. Once Kafka had critical mass, every adjacent tool integrated with Kafka first and everything else second. Debezium for CDC. Flink, Spark, ksqlDB for processing. Snowflake, ClickHouse, Elasticsearch for downstream sinks. The network effect is now overwhelming.
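Point 2 hinges on key-based partitioning: events with the same key always land in the same partition, so per-key order survives as the topic scales out. Kafka's default partitioner hashes the key bytes with murmur2; this sketch substitutes CRC32 so it runs on the standard library alone.

```python
# Sketch of key-based partitioning. CRC32 is a stand-in for Kafka's
# default murmur2 key hash; the routing invariant is the same.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

events = [("user-42", "login"), ("user-7", "login"),
          ("user-42", "click"), ("user-42", "logout")]

NUM_PARTITIONS = 6
placed = {}
for key, event in events:
    placed.setdefault(partition_for(key, NUM_PARTITIONS), []).append((key, event))

# All user-42 events share one partition, so their relative order survives.
p42 = partition_for("user-42", NUM_PARTITIONS)
assert [e for k, e in placed[p42] if k == "user-42"] == ["login", "click", "logout"]
```

This is also why increasing the partition count on a live topic is disruptive: the key-to-partition mapping changes, and per-key ordering across the resize is lost.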

The KRaft Milestone: Goodbye ZooKeeper

For most of its life, Kafka depended on Apache ZooKeeper for cluster metadata: which broker is leader for which partition, which topics exist, which consumers are in which group. ZooKeeper worked, but it was a separate system to operate, debug, and scale, and it was the source of a disproportionate share of Kafka's operational pain.

Starting in 2019, the Kafka community began work on KRaft (Kafka Raft) — a built-in consensus protocol that lets Kafka manage its own metadata using a Raft-based quorum of brokers, no ZooKeeper required. KRaft was production-ready in Kafka 3.3 (October 2022), became the default for new clusters in Kafka 3.5, and ZooKeeper mode was fully removed in Kafka 4.0 (released in 2025). This is the most significant architectural change in Kafka's history. It eliminates an entire class of failure modes and makes Kafka materially easier to operate.
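Concretely, a KRaft node declares its roles and the controller quorum in its server.properties instead of pointing at a ZooKeeper ensemble. A minimal single-node sketch (hostnames, ports, and paths are illustrative; static quorum configuration shown, which later releases supplement with dynamic quorums):

```properties
# Minimal combined-mode KRaft configuration sketch (server.properties).
# One process acts as both broker and controller -- fine for dev,
# whereas production separates the roles across machines.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/data
```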

Connect, Streams, and MirrorMaker

Three companion projects under the Apache umbrella round out the Kafka story:

Kafka Connect. A framework for moving data between Kafka and other systems via reusable connectors — Postgres, MySQL, Snowflake, S3, Elasticsearch, and hundreds more. Connect is the unsexy plumbing that makes Kafka actually useful in enterprise environments, since most events do not start or end in Kafka.
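A connector is just a JSON document posted to the Connect REST API. A sketch of a change-data-capture source using Debezium's Postgres connector — the connector class is Debezium's; the hostname, credentials, and property set here are illustrative placeholders, not a complete production config:

```json
{
  "name": "orders-postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "********",
    "database.dbname": "orders",
    "topic.prefix": "orders-db"
  }
}
```

POST that to a Connect worker and row-level changes from the orders database start flowing into Kafka topics, no custom pipeline code required.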

Kafka Streams. A Java library (not a separate cluster) for stateful stream processing on top of Kafka. You embed Streams in your application, and it handles the partitioning, state stores (via RocksDB), and failure recovery. Lighter weight than Flink, but limited to the JVM and to topologies you can express in the Streams DSL.

MirrorMaker 2. Replicates topics between Kafka clusters for disaster recovery, geo-replication, or migration scenarios.

ksqlDB, Schema Registry, and Confluent Hub are not ASF projects — they are Confluent products with various licenses. That distinction matters when you are evaluating what is genuinely community-owned versus what locks you into a vendor.

Who Runs Kafka in Production

Essentially every name-brand internet company. LinkedIn (the original home, still one of the largest deployments). Netflix (uses Kafka as the backbone of its data pipeline, processing trillions of events per day). Uber (Kafka as the core of its real-time data platform). Airbnb. Pinterest. Stripe. Goldman Sachs. Most large banks. Most large retailers. The pattern is consistent: any organization that has to move events between systems at scale eventually ends up on Kafka.

Self-Host or Use a Managed Service?

This is the most important practical question for anyone adopting Kafka in 2026. The honest answer:

Self-host if you have a strong platform team, your workload is large enough that managed pricing hurts, you have specific compliance or network constraints, or you want full control over Kafka versions and tuning. Self-hosting Kafka is genuinely possible — the ASF release is what most managed services run under the hood — but it is not a weekend project. You will spend real engineering time on broker tuning, partition rebalancing, monitoring, upgrade choreography, and capacity planning.

Use a managed service if you want Kafka without the operational tax. The main options are Confluent Cloud (the most full-featured, run by the people who maintain Kafka), AWS MSK (cheapest if you live in AWS), Aiven (multi-cloud, friendlier pricing), and the Kafka-compatible alternatives Redpanda and WarpStream (which speak the Kafka protocol but reimplement the broker for different cost or performance tradeoffs).

The default recommendation for most teams in 2026 is: start on a managed service, only self-host if you have a clear reason to. Operating Kafka well at scale is a specialized skill, and renting that skill from a vendor is usually cheaper than building it in-house.

Where Kafka Sits in the Stack

Kafka is the central nervous system of a real-time data architecture. Producers write events from operational databases (via CDC tools like Debezium), application code, IoT devices, or SaaS APIs. Consumers read events into stream processors (Flink, ksqlDB), data warehouses, real-time OLAP databases, search indexes, and downstream microservices. Kafka itself is a dumb pipe — it stores and delivers events, but does not transform them.

How TextQL Works with Apache Kafka

TextQL does not connect to Kafka directly — Kafka is a transport, not a query engine. Instead, TextQL Ana queries the systems downstream of Kafka: the warehouses, lakehouses, and real-time OLAP databases where Kafka events eventually land. In a typical Kafka-powered architecture, the freshness of data in TextQL is determined by how quickly your stream-to-warehouse pipeline runs. Most modern pipelines (Kafka into Flink into Iceberg, or Kafka into Snowpipe Streaming) put data in front of TextQL within seconds to minutes of the original event.

See TextQL in action


Apache Kafka
Created: 2010-2011 at LinkedIn
Open-sourced: January 2011
Apache TLP: October 2012
Original creators: Jay Kreps, Neha Narkhede, Jun Rao
Governance: Apache Software Foundation
License: Apache 2.0
Written in: Java, Scala
Notable users: LinkedIn, Netflix, Uber, Airbnb, Pinterest, Goldman Sachs
Commercial steward: Confluent
Category: Event Streaming
Monthly mindshare: ~500K; ~28K GitHub stars; the streaming standard; massive Stack Overflow tag activity