Event Streaming
Event streaming platforms move data as it happens, treating every change in your business as a message on a durable, replayable log. Apache Kafka is the dominant open-source standard, with Confluent as its commercial steward, AWS MSK as Amazon's managed version, Redpanda as a Kafka-compatible rewrite, and Apache Pulsar plus AWS Kinesis as the main alternatives.
An event streaming platform is a system that moves data as it happens, in the order it happens, and stores it long enough that anyone who cares can read it — including readers who arrived an hour, a day, or a week later. The simplest mental model is this: it is a giant, append-only log file that many producers can write to and many consumers can read from independently. Every event has a timestamp and a position; nothing is ever overwritten; and consumers track their own place in the log.
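As a toy illustration of that mental model, here is an append-only log in a few lines of Python. This is purely illustrative (real brokers persist to disk and replicate), but it shows the two defining properties: nothing is overwritten, and every consumer tracks its own position.

```python
# A minimal sketch of the append-only-log mental model (not how Kafka is built).
class Log:
    def __init__(self):
        self.events = []  # only ever appended to, never overwritten

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # the event's offset: its position in the log

    def read(self, offset):
        return self.events[offset:]  # any reader can start from any position

log = Log()
log.append("user_signed_up")
log.append("order_placed")

# Two consumers track their own offsets independently of each other.
consumer_a_offset = 0   # subscribed from the beginning
consumer_b_offset = 1   # arrived later, only wants newer events
assert log.read(consumer_a_offset) == ["user_signed_up", "order_placed"]
assert log.read(consumer_b_offset) == ["order_placed"]
```

Note that reading never mutates the log, which is exactly why a consumer that arrives a week late can still catch up.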
That sounds mundane, but it is the architectural foundation for almost everything modern in data infrastructure. When you order from Uber and the driver's location updates on your screen, that is event streaming. When a fraud-detection system blocks your credit card mid-transaction, that is event streaming. When your company syncs Salesforce changes into a data warehouse within seconds, that is event streaming. The pattern is the same: something happened, an event was emitted, a log captured it, and one or more downstream systems reacted.
Before going further, here is the distinction that this wiki insists on and most other resources blur: the open-source project and the commercial company that sells it are not the same thing. Conflating them leads to muddled architectural decisions and worse vendor evaluations.
In streaming, the most important version of this distinction is:

- Apache Kafka is the open-source project, governed by the Apache Software Foundation.
- Confluent is the commercial company founded by Kafka's creators — one of several vendors (alongside AWS, Aiven, and Instaclustr) selling managed versions of the same project.

The same pattern holds for Pulsar:

- Apache Pulsar is the open-source project.
- StreamNative and DataStax are the commercial vendors selling managed versions of it.
If you are trying to understand who is winning, who is at risk, and what you should actually deploy, you have to think about the project and the vendor separately. The project's adoption is one question. The commercial viability of any single vendor selling that project is a different question.
| OSS project / standard | Commercial vendor(s) | Notes |
|---|---|---|
| Apache Kafka | Confluent, AWS MSK, Aiven, Instaclustr | The dominant standard. Multiple vendors all selling managed versions of the same Apache project. |
| Apache Kafka (wire protocol only) | Redpanda, WarpStream (now Confluent) | Kafka-compatible rewrites. Speak the protocol, reimplement the broker. |
| Apache Pulsar | StreamNative, DataStax (Astra Streaming) | Two commercial vendors splitting a smaller pie. |
| AWS Kinesis (proprietary) | AWS only | Amazon's homegrown service. Not Kafka-based. |
| Google Pub/Sub (proprietary) | Google only | Google's homegrown service. Not Kafka-based. |
| Azure Event Hubs | Microsoft only | Speaks a Kafka-compatible API but is Microsoft's proprietary backend. |
This is a common point of confusion that this wiki will spell out plainly: AWS Kinesis is not Kafka, and it is not Kafka-compatible. It is Amazon's own proprietary streaming service, with its own API, its own client libraries, and its own conceptual vocabulary.
Kinesis terminology even differs from Kafka's in ways that catch people off guard:

- A Kafka topic corresponds to a Kinesis stream.
- A Kafka partition corresponds to a Kinesis shard.
- A Kafka offset corresponds to a Kinesis sequence number.
If you switch from Kafka to Kinesis (or the reverse), you are rewriting your producer and consumer code, not just changing connection strings. The two are conceptually similar — both are durable, partitioned, replayable logs — but they are not interchangeable.
The other AWS streaming product, AWS MSK (Managed Streaming for Apache Kafka), is genuine Kafka. MSK runs the actual Apache Kafka project on AWS-managed infrastructure. If you want Kafka on AWS without managing brokers yourself, MSK is the AWS answer; Kinesis is a different product for a different purpose.
The most important conceptual distinction in this category is streams vs. queues, and most people get it wrong.
A traditional queue (RabbitMQ, ActiveMQ, Amazon SQS) is like a to-do list. A producer writes a message; a consumer reads it; the message is deleted. If two consumers want the same message, you have to set up a fan-out exchange. If a consumer crashes after reading but before processing, you need acknowledgments and dead-letter queues. The queue is a transport, not a storage system — once the message is delivered, it is gone.
A stream (Kafka, Kinesis, Pulsar) is like a tape recording. The producer appends events to the tape; the tape never erases. Any number of consumers can play the tape from any starting point, at their own speed, independently of each other. If a new analytics team shows up next year and wants the last 30 days of data, they just rewind. The stream is both the transport and the durable record.
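The difference is easy to make concrete. In this illustrative Python sketch (not a real client library), the queue read is destructive while the stream read is not, and each stream consumer picks its own starting point:

```python
from collections import deque

# Queue: delivery removes the message; exactly one consumer gets it.
queue = deque(["job1", "job2"])
first = queue.popleft()            # "job1" is now gone from the queue
assert first == "job1"
assert list(queue) == ["job2"]

# Stream: reading never removes anything; each consumer replays
# from its own offset at its own pace.
stream = ["evt1", "evt2", "evt3"]
fraud_offset = 0        # fraud team reads everything from the start
analytics_offset = 2    # new team arrives late and rewinds only one event
assert stream[fraud_offset:] == ["evt1", "evt2", "evt3"]
assert stream[analytics_offset:] == ["evt3"]
assert stream == ["evt1", "evt2", "evt3"]  # untouched after both reads
```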
The streaming model won because modern data systems are fundamentally about decoupling producers from consumers across time. In 2010, if your e-commerce backend wanted to feed five downstream systems (warehouse, search index, recommendation engine, fraud detector, email service), you wrote five custom integrations and prayed nothing broke. After Kafka, you publish one event ("OrderPlaced") to one topic, and each consumer reads it on their own schedule. The integration count dropped from N times M to N plus M.
Topics, partitions, and offsets. A topic is a named stream (like user-clicks or order-events). A topic is split into partitions, which are the unit of parallelism — each partition is a separate ordered log, and consumers can read partitions in parallel. Within a partition, every event has an offset: a monotonically increasing number that identifies its position. Consumers track which offsets they have processed.
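These three concepts fit in a short sketch. One caveat: Kafka's default partitioner hashes keys with murmur2, so the MD5 hash below is just a stand-in to show the core idea that the same key always maps to the same partition, preserving per-key order:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Hash the key to pick a partition. Every event with the same key
    # lands on the same partition, so per-key ordering is preserved.
    # (Kafka's default partitioner uses murmur2; MD5 is a stand-in here.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All of user-42's events go to one partition, in order.
assert partition_for("user-42") == partition_for("user-42")

# Each partition is its own ordered log; an event's offset is its index.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for event in ["click-a", "click-b"]:
    partitions[partition_for("user-42")].append(event)
assert partitions[partition_for("user-42")] == ["click-a", "click-b"]
```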
Producers and consumers. A producer writes events to topics. A consumer reads them. Consumers are organized into consumer groups: if five consumer instances are in the same group, Kafka balances the partitions across them so that each partition (and therefore each event) is read by exactly one member of the group. Different groups (e.g., "warehouse-loader" vs. "fraud-detector") read the same events independently.
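A simplified round-robin assignment makes the group mechanics concrete. Real Kafka assignors (range, round-robin, cooperative sticky) are more involved and rebalance dynamically as members join and leave; this sketch only shows the invariant that each partition belongs to exactly one member per group:

```python
def assign_partitions(partitions, members):
    # Round-robin assignment, similar in spirit to Kafka's RoundRobinAssignor:
    # within a group, each partition goes to exactly one member.
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

partitions = list(range(6))  # six partitions of one topic

# Two groups read the same topic independently; within each group,
# every partition is owned by exactly one member.
warehouse = assign_partitions(partitions, ["wh-1", "wh-2", "wh-3"])
fraud = assign_partitions(partitions, ["fd-1", "fd-2"])
assert warehouse == {"wh-1": [0, 3], "wh-2": [1, 4], "wh-3": [2, 5]}
assert fraud == {"fd-1": [0, 2, 4], "fd-2": [1, 3, 5]}
```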
Retention. Unlike a queue, a stream keeps events for a configurable retention period — hours, days, or forever. This is the magic that makes streams replayable. A bug in your downstream service? Reset the offset and reprocess. A new analytics use case? Read from the beginning of the topic.
Schema registry. Producers and consumers need to agree on the shape of events. A schema registry stores Avro/Protobuf/JSON Schema definitions for each topic, enforces compatibility rules, and prevents producers from breaking consumers with backward-incompatible changes. Confluent's was the first widely used one and remains the de facto standard.
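A toy version of one registry rule shows the flavor of the enforcement. Under Avro's BACKWARD compatibility mode, a consumer on the new schema must still be able to read data written with the old schema, which means any field the new schema adds needs a default. Real registries perform full Avro/Protobuf schema resolution; this sketch checks only that one rule, using plain dicts as stand-ins for schemas:

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    # BACKWARD rule (simplified): readers on the new schema must decode
    # old data, so any field added in the new schema needs a default.
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True

old = {"user_id": {"type": "string"}, "amount": {"type": "double"}}
ok = {**old, "currency": {"type": "string", "default": "USD"}}  # safe addition
bad = {**old, "currency": {"type": "string"}}  # added without a default

assert is_backward_compatible(old, ok)
assert not is_backward_compatible(old, bad)
```

A registry rejects the `bad` schema at registration time, which is how it stops a producer from breaking consumers it has never heard of.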
Why "log" is the magic word. In a database, the write-ahead log is an internal implementation detail — the canonical thing is the table. Jay Kreps's insight in the early 2010s was: what if we make the log the canonical thing, and treat tables as derived views of the log? That single inversion is the foundation of Kafka, change data capture, event sourcing, and the entire "streaming-first" architecture. If you read one essay on this category, read Kreps's "The Log: What every software engineer should know about real-time data's unifying abstraction."
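The inversion is easy to demonstrate: a table is just a fold over the event log, and replaying any prefix of the log reproduces the table as of that point in time. An illustrative sketch:

```python
# The "log as the canonical thing" inversion: the table is a derived
# view, rebuildable at any time by replaying the log from the start.
log = [
    {"op": "set", "key": "alice", "value": "gold"},
    {"op": "set", "key": "bob", "value": "silver"},
    {"op": "set", "key": "alice", "value": "platinum"},  # later write wins
    {"op": "delete", "key": "bob"},
]

def materialize(events):
    table = {}
    for e in events:
        if e["op"] == "set":
            table[e["key"]] = e["value"]
        else:
            table.pop(e["key"], None)
    return table

assert materialize(log) == {"alice": "platinum"}
# Replaying a prefix gives the table as of that moment in history.
assert materialize(log[:2]) == {"alice": "gold", "bob": "silver"}
```

Change data capture runs this idea in reverse: it turns a database's tables back into a stream of events so other systems can derive their own views.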
Here is the honest take: Apache Kafka won this category, and the only real question is how you run it. Every other open-source streaming platform has been relegated to niche or geographic-specific roles. The competition is not "Kafka vs. Pulsar" — it is "self-hosted Kafka vs. Confluent Cloud vs. AWS MSK vs. Redpanda vs. WarpStream."
Apache Kafka. The OSS project. Created at LinkedIn in 2010-2011 by Jay Kreps, Neha Narkhede, and Jun Rao to solve LinkedIn's data integration problem, donated to Apache in 2011, and now governed by the ASF. The de facto standard.
Confluent. The commercial company founded in 2014 by Kafka's three original creators. Sells Confluent Cloud (managed Kafka), Confluent Platform (enterprise on-prem), Schema Registry, Connect, and a managed Flink offering. Went public in 2021. The most full-featured Kafka vendor and the one whose roadmap most heavily shapes the upstream project.
AWS MSK and AWS Kinesis. Two distinct AWS products that are often confused. MSK is Amazon's managed version of upstream Apache Kafka — same code, same wire protocol, same clients. Kinesis is Amazon's homegrown proprietary streaming service with its own API and concepts. If you want Kafka, use MSK. If you want a tightly-AWS-integrated alternative and do not need the Kafka ecosystem, Kinesis is fine.
Apache Pulsar and StreamNative. Created at Yahoo in 2012-2013 to handle multi-tenant messaging across Yahoo properties; open-sourced in 2016, ASF top-level in 2018. StreamNative is the commercial company behind it. Pulsar's architectural pitch — separating compute (brokers) from storage (BookKeeper), built-in geo-replication, native multi-tenancy — is technically elegant but has lost the mindshare war to Kafka outside of a handful of large adopters (Tencent, ByteDance, Yahoo, telcos).
Redpanda. A Kafka-compatible C++ rewrite founded in 2019 by Alexander Gallego. Speaks the Kafka wire protocol but ships without a JVM, without ZooKeeper, with a Raft-based architecture. Pitched as "10x faster than Kafka" with simpler operations. Licensed under BSL, not Apache 2.0.
WarpStream (now part of Confluent). A Kafka-compatible streaming platform that pushes storage entirely onto S3 to eliminate broker disks and inter-AZ traffic costs. Acquired by Confluent in September 2024 and now part of Confluent Cloud's cost-optimized lineup.
Upsolver and the "streaming ETL" niche. A different category of vendor that sits on top of Kafka or Kinesis to provide SQL-based stream-to-lake ingestion. Useful, but not a streaming platform itself.
| Scenario | Use streaming? | Why |
|---|---|---|
| Sync changes from operational DB to warehouse in real time | Yes (CDC into Kafka) | The standard pattern for modern data integration |
| Power a real-time dashboard or fraud detector | Yes | Latency from event to query in seconds, not hours |
| Decouple microservices that emit/consume events | Yes | Kafka is the dominant inter-service backbone |
| Nightly batch ETL of CSV files from a vendor | No | Use a workflow orchestrator and object storage |
| Send a single email when a user signs up | No (overkill) | Use a queue (SQS, RabbitMQ) or just call the function |
| Replayable audit log of all system events | Yes | Streams are inherently append-only and replayable |
The general rule: if you need replayability, multiple independent consumers, or sub-minute latency from event to action, you want a stream. If you just need a producer to hand a job to a single consumer, a queue is simpler.
Open-source projects:

- Apache Kafka
- Apache Pulsar

Commercial vendors:

- Confluent (including WarpStream)
- AWS (MSK, Kinesis)
- Redpanda
- StreamNative
- DataStax (Astra Streaming)
- Aiven
- Instaclustr
Event streaming platforms are typically upstream of the systems TextQL queries directly. Events flowing through Kafka are usually landed into a data warehouse, lakehouse, or real-time analytics database — and that is where TextQL Ana connects. The streaming layer matters to TextQL users indirectly: it determines how fresh the data is when a business user asks a question. If your stream-to-warehouse pipeline lands events in 30 seconds, TextQL can answer questions about events that happened 30 seconds ago.