NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →

Event Streaming

Event streaming platforms move data as it happens, treating every change in your business as a message on a durable, replayable log. Apache Kafka is the dominant open-source standard, with Confluent as its commercial steward, AWS MSK as Amazon's managed version, Redpanda as a Kafka-compatible rewrite, and Apache Pulsar plus AWS Kinesis as the main alternatives.

An event streaming platform is a system that moves data as it happens, in the order it happens, and stores it long enough that anyone who cares can read it — including readers who arrived an hour, a day, or a week later. The simplest mental model is this: it is a giant, append-only log file that many producers can write to and many consumers can read from independently. Every event has a timestamp and a position; nothing is ever overwritten; and consumers track their own place in the log.
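The append-only-log mental model can be sketched in a few lines. This is a toy in-memory model, not any real client API; the names (`Log`, `read_from`) are illustrative:

```python
import time

class Log:
    """A giant append-only log: producers append, consumers read independently."""

    def __init__(self):
        self.events = []  # never overwritten, only appended

    def append(self, payload):
        offset = len(self.events)  # position in the log
        self.events.append({"offset": offset, "ts": time.time(), "payload": payload})
        return offset

    def read_from(self, offset):
        # The log does not track readers; each consumer brings its own offset.
        return self.events[offset:]

log = Log()
for p in ["a", "b", "c"]:
    log.append(p)

# Two consumers track their own positions independently.
early = log.read_from(0)  # sees all three events
late = log.read_from(2)   # a reader arriving later starts mid-stream

assert [e["payload"] for e in early] == ["a", "b", "c"]
assert [e["payload"] for e in late] == ["c"]
```

Note that `read_from` mutates nothing: reading never consumes, which is exactly what separates this model from a queue.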

That sounds mundane, but it is the architectural foundation for almost everything modern in data infrastructure. When you order from Uber and the driver's location updates on your screen, that is event streaming. When a fraud-detection system blocks your credit card mid-transaction, that is event streaming. When your company syncs Salesforce changes into a data warehouse within seconds, that is event streaming. The pattern is the same: something happened, an event was emitted, a log captured it, and one or more downstream systems reacted.

The Most Important Distinction: Open-Source Project vs Commercial Vendor

Before going further, here is the distinction that this wiki insists on and most other resources blur: the open-source project and the commercial company that sells it are not the same thing. Conflating them leads to muddled architectural decisions and worse vendor evaluations.

In streaming, the most important version of this distinction is:

  • Apache Kafka is the open-source project, governed by the Apache Software Foundation, licensed Apache 2.0, and downloadable for free.
  • Confluent is the company founded in 2014 by Kafka's original creators to commercialize Kafka. It sells a managed cloud service (Confluent Cloud), an enterprise distribution (Confluent Platform), and a constellation of value-add tooling (Schema Registry, Connect, Flink integration).
  • AWS MSK is Amazon's managed version of the same Apache Kafka project — different vendor, same underlying technology. AWS Kinesis is something else entirely (more on this in a moment).
  • Redpanda is a Kafka-compatible rewrite — it speaks the same wire protocol as Kafka but is built from scratch in C++ with a completely different implementation. From a client's perspective it looks like Kafka; under the hood it is not.

The same pattern holds for Pulsar:

  • Apache Pulsar is the open-source project, originally created at Yahoo and now governed by the ASF.
  • StreamNative is the company founded by Pulsar's original creators to commercialize it — the Confluent of Pulsar, with all the structural challenges that implies.

If you are trying to understand who is winning, who is at risk, and what you should actually deploy, you have to think about the project and the vendor separately. The project's adoption is one question. The commercial viability of any single vendor selling that project is a different question.

OSS Projects to Commercial Vendors — The Streaming Map

| OSS project / standard | Commercial vendor(s) | Notes |
| --- | --- | --- |
| Apache Kafka | Confluent, AWS MSK, Aiven, Instaclustr | The dominant standard. Multiple vendors all selling managed versions of the same Apache project. |
| Apache Kafka (wire protocol only) | Redpanda, WarpStream (now Confluent) | Kafka-compatible rewrites. Speak the protocol, reimplement the broker. |
| Apache Pulsar | StreamNative, DataStax (Astra Streaming) | Two commercial vendors splitting a smaller pie. |
| AWS Kinesis (proprietary) | AWS only | Amazon's homegrown service. Not Kafka-based. |
| Google Pub/Sub (proprietary) | Google only | Google's homegrown service. Not Kafka-based. |
| Azure Event Hubs | Microsoft only | Speaks a Kafka-compatible API but is Microsoft's proprietary backend. |

A Note on AWS Kinesis: It Is Not Kafka

This is a common point of confusion that this wiki will spell out plainly: AWS Kinesis is not Kafka, and it is not Kafka-compatible. It is Amazon's own proprietary streaming service, with its own API, its own client libraries, and its own conceptual vocabulary.

Kinesis terminology even differs from Kafka in ways that catch people off guard:

  • Kafka has topics; Kinesis has streams.
  • Kafka has partitions; Kinesis has shards.
  • Kafka has consumer groups; Kinesis has enhanced fan-out consumers with a different model.
  • Kafka offsets are 64-bit integers per partition; Kinesis has sequence numbers with different semantics.

If you switch from Kafka to Kinesis (or the reverse), you are rewriting your producer and consumer code, not just changing connection strings. The two are conceptually similar — both are durable, partitioned, replayable logs — but they are not interchangeable.

The other AWS streaming product, AWS MSK (Managed Streaming for Apache Kafka), is genuine Kafka. MSK runs the actual Apache Kafka project on AWS-managed infrastructure. If you want Kafka on AWS without managing brokers yourself, MSK is the AWS answer; Kinesis is a different product for a different purpose.

Streams vs Queues: Why Kafka Beat RabbitMQ

The most important conceptual distinction in this category is streams vs. queues, and most people get it wrong.

A traditional queue (RabbitMQ, ActiveMQ, Amazon SQS) is like a to-do list. A producer writes a message; a consumer reads it; the message is deleted. If two consumers want the same message, you have to set up a fan-out exchange. If a consumer crashes after reading but before processing, you need acknowledgments and dead-letter queues. The queue is a transport, not a storage system — once the message is delivered, it is gone.

A stream (Kafka, Kinesis, Pulsar) is like a tape recording. The producer appends events to the tape; the tape never erases. Any number of consumers can play the tape from any starting point, at their own speed, independently of each other. If a new analytics team shows up next year and wants the last 30 days of data, they just rewind. The stream is both the transport and the durable record.
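The to-do-list vs. tape-recording distinction is easy to demonstrate. A minimal sketch (toy data structures, not any broker's API):

```python
from collections import deque

# Queue semantics: delivery deletes the message.
queue = deque(["msg1", "msg2"])
first = queue.popleft()  # consumed and gone; no second consumer can see it
assert first == "msg1"
assert list(queue) == ["msg2"]

# Stream semantics: the "tape" persists; each consumer keeps its own cursor.
stream = ["msg1", "msg2", "msg3"]
cursors = {"fraud-detector": 0, "warehouse-loader": 0}

def poll(consumer):
    """Return everything past this consumer's cursor, then advance it."""
    pos = cursors[consumer]
    batch = stream[pos:]
    cursors[consumer] = len(stream)
    return batch

assert poll("fraud-detector") == ["msg1", "msg2", "msg3"]
assert poll("warehouse-loader") == ["msg1", "msg2", "msg3"]  # same events, independently

# A consumer that shows up later just registers a cursor at 0 and replays.
cursors["new-analytics-team"] = 0
assert poll("new-analytics-team") == ["msg1", "msg2", "msg3"]
```

The key asymmetry: the queue's `popleft` is destructive, while the stream's `poll` only moves a per-consumer cursor over immutable data.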

The streaming model won because modern data systems are fundamentally about decoupling producers from consumers across time. In 2010, if your e-commerce backend wanted to feed five downstream systems (warehouse, search index, recommendation engine, fraud detector, email service), you wrote five custom integrations and prayed nothing broke. After Kafka, you publish one event ("OrderPlaced") to one topic, and each consumer reads it on their own schedule. The integration count dropped from N times M to N plus M.

Key Concepts, Explained Simply

Topics, partitions, and offsets. A topic is a named stream (like user-clicks or order-events). A topic is split into partitions, which are the unit of parallelism — each partition is a separate ordered log, and consumers can read partitions in parallel. Within a partition, every event has an offset: a monotonically increasing number that identifies its position. Consumers track which offsets they have processed.
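How events land in partitions can be sketched with a key-hash partitioner. Kafka's real default uses murmur2 on the key; `crc32` here is a stand-in to show the idea that the same key always maps to the same partition, preserving per-key ordering:

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    # Same key -> same hash -> same partition, so all of one key's
    # events land in one ordered log.
    return zlib.crc32(key) % NUM_PARTITIONS

# Each partition is a separate ordered log with its own offsets.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

for key, event in [(b"user-1", "click"), (b"user-2", "click"), (b"user-1", "purchase")]:
    log = partitions[partition_for(key)]
    log.append({"offset": len(log), "key": key, "event": event})

# All of user-1's events share a partition, in the order they were produced.
p = partitions[partition_for(b"user-1")]
user1_events = [e["event"] for e in p if e["key"] == b"user-1"]
assert user1_events == ["click", "purchase"]
```

Ordering is only guaranteed within a partition, which is why choosing the partition key (user ID, order ID) is a real design decision, not a detail.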

Producers and consumers. A producer writes events to topics. A consumer reads them. Consumers are organized into consumer groups — if five consumer instances are in the same group, Kafka balances the topic's partitions across them, so each partition is read by exactly one member of the group at a time. Different groups (e.g., "warehouse-loader" vs. "fraud-detector") read the same events independently.
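The balancing works by assigning partitions to group members. A sketch in the spirit of Kafka's round-robin assignor (the real assignment protocol is negotiated by the group coordinator; this just shows the shape of the outcome):

```python
def assign(partitions, consumers):
    """Spread partitions across a group so each partition has one owner."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# One group of three consumers splits six partitions.
group = assign(partitions=range(6), consumers=["c1", "c2", "c3"])
assert group == {"c1": [0, 3], "c2": [1, 4], "c3": [2, 5]}

# A different group gets its own independent assignment over the same
# partitions — this is how "warehouse-loader" and "fraud-detector"
# both read everything.
other = assign(partitions=range(6), consumers=["w1"])
assert other == {"w1": [0, 1, 2, 3, 4, 5]}
```

This also makes the scaling ceiling visible: with six partitions, a seventh consumer in the same group would sit idle, since no partition is ever split between two group members.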

Retention. Unlike a queue, a stream keeps events for a configurable retention period — hours, days, or forever. This is the magic that makes streams replayable. A bug in your downstream service? Reset the offset and reprocess. A new analytics use case? Read from the beginning of the topic.

Schema registry. Producers and consumers need to agree on the shape of events. A schema registry stores Avro/Protobuf/JSON Schema definitions for each topic, enforces compatibility rules, and prevents producers from breaking consumers with backward-incompatible changes. Confluent's was the first widely used one and remains the de facto standard.
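A compatibility rule can be made concrete with a toy check. In the backward-compatibility mode (data written with the old schema must be readable with the new one), an Avro-style rule is that any field added by the new schema needs a default. This sketch models a schema as `{field_name: has_default}` and is far simpler than a real registry's checks:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy backward-compatibility check: fields added by the new schema
    must carry defaults, or readers using it can't decode old events."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f] for f in added)

old = {"order_id": False, "amount": False}

# Adding "currency" WITH a default: old events still decode.
ok_new = {"order_id": False, "amount": False, "currency": True}
# Adding "currency" WITHOUT a default: old events break the new reader.
bad_new = {"order_id": False, "amount": False, "currency": False}

assert backward_compatible(old, ok_new) is True
assert backward_compatible(old, bad_new) is False
```

A registry runs this kind of check at produce/registration time, so the breaking change is rejected before it ever reaches consumers.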

Why "log" is the magic word. In a database, the write-ahead log is an internal implementation detail — the canonical thing is the table. Jay Kreps's insight in the early 2010s was: what if we make the log the canonical thing, and treat tables as derived views of the log? That single inversion is the foundation of Kafka, change data capture, event sourcing, and the entire "streaming-first" architecture. If you read one essay on this category, read Kreps's "The Log: What every software engineer should know about real-time data's unifying abstraction."

The Vendor Landscape: Kafka Won, Everyone Else Is a Footnote

Here is the honest take: Apache Kafka won this category, and the only real question is how you run it. Every other open-source streaming platform has been relegated to niche or geographic-specific roles. The competition is not "Kafka vs. Pulsar" — it is "self-hosted Kafka vs. Confluent Cloud vs. AWS MSK vs. Redpanda vs. WarpStream."

Apache Kafka. The OSS project. Created at LinkedIn in 2010-2011 by Jay Kreps, Neha Narkhede, and Jun Rao to solve LinkedIn's data integration problem, donated to Apache in 2011, and now governed by the ASF. The de facto standard.

Confluent. The commercial company founded in 2014 by Kafka's three original creators. Sells Confluent Cloud (managed Kafka), Confluent Platform (enterprise on-prem), Schema Registry, Connect, and a managed Flink offering. Went public in 2021. The most full-featured Kafka vendor and the one whose roadmap most heavily shapes the upstream project.

AWS MSK and AWS Kinesis. Two distinct AWS products that are often confused. MSK is Amazon's managed version of upstream Apache Kafka — same code, same wire protocol, same clients. Kinesis is Amazon's homegrown proprietary streaming service with its own API and concepts. If you want Kafka, use MSK. If you want a tightly-AWS-integrated alternative and do not need the Kafka ecosystem, Kinesis is fine.

Apache Pulsar and StreamNative. Created at Yahoo in 2012-2013 to handle multi-tenant messaging across Yahoo properties; open-sourced in 2016, ASF top-level in 2018. StreamNative is the commercial company behind it. Pulsar's architectural pitch — separating compute (brokers) from storage (BookKeeper), built-in geo-replication, native multi-tenancy — is technically elegant but has lost the mindshare war to Kafka outside of a handful of large adopters (Tencent, ByteDance, Yahoo, telcos).

Redpanda. A Kafka-compatible C++ rewrite founded in 2019 by Alexander Gallego. It speaks the Kafka wire protocol but ships without a JVM or ZooKeeper, using a Raft-based architecture instead. Pitched as "10x faster than Kafka" with simpler operations. Licensed under BSL, not Apache 2.0.

WarpStream (now part of Confluent). A Kafka-compatible streaming platform that pushes storage entirely onto S3 to eliminate broker disks and inter-AZ traffic costs. Acquired by Confluent in September 2024 and now part of Confluent Cloud's cost-optimized lineup.

Upsolver and the "streaming ETL" niche. A different category of vendor that sits on top of Kafka or Kinesis to provide SQL-based stream-to-lake ingestion. Useful, but not a streaming platform itself.

When You Actually Need Event Streaming

| Scenario | Use streaming? | Why |
| --- | --- | --- |
| Sync changes from operational DB to warehouse in real time | Yes (CDC into Kafka) | The standard pattern for modern data integration |
| Power a real-time dashboard or fraud detector | Yes | Latency from event to query in seconds, not hours |
| Decouple microservices that emit/consume events | Yes | Kafka is the dominant inter-service backbone |
| Nightly batch ETL of CSV files from a vendor | No | Use a workflow orchestrator and object storage |
| Send a single email when a user signs up | No (overkill) | Use a queue (SQS, RabbitMQ) or just call the function |
| Replayable audit log of all system events | Yes | Streams are inherently append-only and replayable |

The general rule: if you need replayability, multiple independent consumers, or sub-minute latency from event to action, you want a stream. If you just need a producer to hand a job to a single consumer, a queue is simpler.

Tools in This Category

Open-source projects:

  • Apache Kafka — The dominant open-source streaming platform.
  • Apache Pulsar — Yahoo-born alternative with multi-tenancy and tiered storage as core features.

Commercial vendors:

  • Confluent — The canonical commercial steward of Apache Kafka.
  • Redpanda — Kafka-compatible C++ rewrite, sold by Redpanda Data.
  • StreamNative — The commercial company behind Apache Pulsar.
  • AWS Kinesis — Amazon's proprietary streaming service (not Kafka).
  • Upsolver — SQL-based streaming ETL on top of Kafka/Kinesis.

How TextQL Works with Event Streaming

Event streaming platforms are typically upstream of the systems TextQL queries directly. Events flowing through Kafka are usually landed into a data warehouse, lakehouse, or real-time analytics database — and that is where TextQL Ana connects. The streaming layer matters to TextQL users indirectly: it determines how fresh the data is when a business user asks a question. If your stream-to-warehouse pipeline lands events in 30 seconds, TextQL can answer questions about events that happened 30 seconds ago.

See TextQL in action

Event Streaming
Category: Real-time data movement
Also called: Message brokers, log-based messaging, pub/sub, event buses
Not to be confused with: Stream processing (compute on streams); traditional queues (RabbitMQ, SQS)
Dominant OSS project: Apache Kafka (2011)
Key commercial vendors: Confluent, Redpanda, AWS (Kinesis & MSK), StreamNative
Typical users: Platform engineers, data engineers, backend teams
Monthly mindshare: ~600K · broad concept; everyone running event-driven systems