
Apache Flink

Apache Flink is the dominant open-source stream processing engine. It originated from the Stratosphere research project at TU Berlin in 2010, became an Apache top-level project in 2014, and is the engine of choice for serious real-time computation at companies like Netflix, Uber, Stripe, and Alibaba.

If you have a real-time data pipeline at any serious scale — joining streams, maintaining state, computing windowed aggregations, doing ML feature engineering on live events — the odds are high that you are running Flink, evaluating it, or using a managed service that runs Flink under the hood.

The shortest version of the Flink story is: it is the only open-source streaming engine that took the hard problems of stream processing seriously from day one and got them right. Event time, watermarks, exactly-once semantics, stateful processing with checkpointing, true low-latency operation — Flink shipped credible answers to all of these years before its competitors. That technical lead has compounded into category dominance.

The Stratosphere Origin Story

Flink began life around 2010 as Stratosphere, an academic research project at Technical University of Berlin led by Volker Markl, with Stephan Ewen and Kostas Tzoumas as key contributors. Stratosphere was funded by the German Research Foundation and aimed to build a next-generation big data platform that would go beyond MapReduce — supporting iterative algorithms, in-memory computation, and a more general dataflow model than Hadoop offered.

In 2014, the Stratosphere team renamed the project Flink (German for "nimble" or "quick," with a kingfisher as the logo) and donated it to the Apache Software Foundation. It graduated to a top-level project in December 2014. The same team founded data Artisans (later renamed Ververica) as the commercial sponsor, in the same way Confluent stewards Kafka or Databricks stewards Spark. Ververica was acquired by Alibaba in early 2019, after Alibaba had become one of the largest Flink users in the world.

The German academic origin matters culturally. Flink was built by people deeply immersed in database research and stream processing theory — people who had read all the papers and were determined to do streaming "correctly" rather than retrofit it onto a batch engine. The contrast with Spark Streaming, which started as a micro-batch hack on top of Spark and only later evolved into Structured Streaming, is sharp. Flink was streaming-native.

At its core, Flink is a distributed dataflow engine. You define a pipeline as a graph of operators (sources, transformations, sinks), and Flink deploys that graph across a cluster of workers (TaskManagers), with a coordinator (JobManager) handling scheduling, checkpointing, and failure recovery. The engine handles the hard parts:

Stateful processing. Operators can maintain keyed state (per-key counters, sets, lists) and operator state (per-task buffers). State is stored in a configurable state backend — typically RocksDB on local disk for large state, or the JVM heap for small state.
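
The core idea of keyed state is that each parallel task owns a disjoint slice of the key space and keeps local state only for its own keys. The sketch below illustrates that partitioning in plain Python; the `KeyedCounter` class and hash-based routing are illustrative, not Flink's actual API:

```python
# Conceptual sketch of keyed state: each parallel task owns a disjoint
# set of keys (assigned by hash) and keeps local per-key state for them.
# This is NOT Flink's API -- just an illustration of the partitioning idea.

class KeyedCounter:
    """One parallel task: counts events per key, for the keys it owns."""
    def __init__(self, task_index, parallelism):
        self.task_index = task_index
        self.parallelism = parallelism
        self.state = {}  # keyed state: key -> running count

    def owns(self, key):
        # Flink routes each key to exactly one task (via key-group hashing).
        return hash(key) % self.parallelism == self.task_index

    def process(self, key):
        if self.owns(key):
            self.state[key] = self.state.get(key, 0) + 1

# Two parallel tasks; every event lands on exactly one of them.
tasks = [KeyedCounter(i, 2) for i in range(2)]
for user in ["alice", "bob", "alice", "carol", "alice"]:
    for task in tasks:
        task.process(user)

# Merging all task-local states yields the global per-key counts.
merged = {}
for task in tasks:
    merged.update(task.state)
print(merged["alice"])  # 3
```

Because each key lives on exactly one task, state access never needs cross-task coordination — which is what lets Flink scale keyed state horizontally.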

Checkpointing and exactly-once. Flink periodically takes a consistent snapshot of all operator state across the cluster using asynchronous barrier snapshotting, an adaptation of the Chandy-Lamport algorithm for dataflow graphs. On failure, the cluster restarts from the last checkpoint and replays events from the source. Combined with transactional or idempotent sinks, this delivers true end-to-end exactly-once semantics.
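
The recovery contract can be shown with a toy simulation (illustrative only, not Flink internals): snapshot the state together with the source offset, crash, restore both, and replay from that offset. Because replay resumes from the checkpointed offset with the checkpointed state, the result matches an uninterrupted run:

```python
# Toy illustration of checkpoint-and-replay recovery (not Flink internals).
# State is snapshotted together with the source offset; on failure we
# restore both and re-read the replayable source from that offset.

events = [3, 1, 4, 1, 5, 9, 2, 6]  # replayable source (e.g. a Kafka topic)

def run(events, fail_at=None):
    state = {"sum": 0}
    offset = 0
    checkpoint = ({"sum": 0}, 0)  # (state snapshot, source offset)
    while offset < len(events):
        if offset % 3 == 0:  # periodic checkpoint barrier
            checkpoint = (dict(state), offset)
        if fail_at is not None and offset == fail_at:
            # Crash: restore state and offset from the last checkpoint,
            # then replay. Each event is applied exactly once to the
            # restored state, even though some were read twice.
            state, offset = dict(checkpoint[0]), checkpoint[1]
            fail_at = None
            continue
        state["sum"] += events[offset]
        offset += 1
    return state["sum"]

print(run(events))             # 31: clean run
print(run(events, fail_at=5))  # 31: identical result despite the crash
```

This is why exactly-once in Flink really means "exactly-once state updates": events between the checkpoint and the crash are reprocessed, but against state that never saw them.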

Event time and watermarks. Flink was the first open-source engine to take event time seriously as the primary time dimension for stream processing. Watermarks flow through the dataflow graph alongside events, telling each operator when it is safe to close a window.
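
The most common watermark strategy, bounded out-of-orderness, is simple: the watermark trails the maximum event timestamp seen so far by a fixed allowed delay. A conceptual sketch (not Flink's `WatermarkGenerator` interface; the 5-second bound is an arbitrary choice):

```python
# Conceptual bounded-out-of-orderness watermark (not Flink's actual API).
# A watermark at time t asserts: "no event with timestamp <= t will
# arrive anymore" -- here, max seen timestamp minus a fixed slack.

class BoundedOutOfOrderness:
    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms):
        self.max_ts = max(self.max_ts, event_ts_ms)

    def current_watermark(self):
        return self.max_ts - self.max_delay_ms

wm = BoundedOutOfOrderness(max_delay_ms=5_000)
for ts in [1_000, 4_000, 3_000, 9_000, 7_000]:  # out-of-order event times
    wm.on_event(ts)

# Watermark = 9_000 - 5_000: windows ending at or before t = 4_000
# can now safely close; later events in that range count as late data.
print(wm.current_watermark())  # 4000
```

The trade-off is visible in the arithmetic: a larger delay tolerates more disorder but holds windows open longer, adding latency.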

Windowing. Tumbling, sliding, session, and global windows, all with configurable triggers and allowed lateness. The semantics are precise enough that you can express almost any windowed aggregation you would express in batch SQL, plus things batch SQL cannot express.
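
Window assignment for tumbling and sliding windows is plain arithmetic over the event timestamp, sketched below (conceptual; Flink's own assigners use the same arithmetic but return window objects):

```python
# Conceptual window assignment (mirrors the arithmetic behind tumbling
# and sliding window assigners; not the actual Flink API).

def tumbling_window(ts, size):
    """Each timestamp belongs to exactly one [start, start + size) window."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Each timestamp belongs to every overlapping window that covers it."""
    last_start = (ts // slide) * slide
    starts = range(last_start - size + slide, last_start + slide, slide)
    return [(s, s + size) for s in starts if s <= ts < s + size]

print(tumbling_window(7_500, size=5_000))  # (5000, 10000)
print(sliding_windows(7_500, size=10_000, slide=5_000))
# [(0, 10000), (5000, 15000)]
```

Session and global windows do not reduce to this arithmetic — session windows are data-driven (a gap timeout merges windows as events arrive), which is why they need the trigger machinery rather than a fixed assignment formula.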

Multiple APIs. Flink supports the low-level DataStream API (Java/Scala), the higher-level Table API, and Flink SQL, which compiles SQL queries into the same dataflow runtime. Flink SQL has become the most strategically important interface in recent years — it is what Confluent and AWS expose to users.

By 2017-2019 it was clear Flink had won the open-source streaming war. The reasons:

  • It got the semantics right. Exactly-once, event time, late data handling — Flink solved these before anyone else, and the alternatives spent years catching up.
  • It scaled to the largest workloads in the world. Alibaba runs Flink at a scale that no other open-source engine has matched — processing trillions of events per day during Singles' Day shopping events. That stress test made Flink robust.
  • It became the default for managed services. AWS Kinesis Data Analytics (now Amazon Managed Service for Apache Flink), Confluent's streaming compute offering, and Alibaba Cloud Realtime Compute all run Flink under the hood. When the major cloud vendors all converge on the same engine, the category is settled.
  • Flink SQL closed the accessibility gap. Early Flink required writing Java or Scala, which limited adoption to engineering-heavy teams. Flink SQL made it possible to express the same workloads in SQL, opening the door to data engineers and analysts.

Flink is not always the right answer:

  • Operational complexity. Running Flink yourself is real work. You need to configure state backends, tune checkpointing, set up high availability for the JobManager, manage resource allocation, and debug things like backpressure. This is why managed services (Confluent, AWS, Ververica, Aiven) exist and are popular.
  • Resource overhead at small scale. A Flink cluster has a non-trivial baseline. For small workloads — "I want to maintain a few materialized views over Kafka topics" — the operational overhead of Flink can outweigh the benefit. Materialize or ksqlDB may be a better fit.
  • Not a database. Flink is a compute engine, not a query layer. If you want users to interactively query streaming results with low latency, you typically pair Flink with a serving database — ClickHouse, Druid, Pinot, or a key-value store.
  • JVM tax. Flink runs on the JVM, and JVM tuning is part of operating Flink at scale. Garbage collection pauses, off-heap memory management, and JVM-specific failure modes are part of the deal.

The most important strategic shift in stream processing in 2024-2025 was Confluent's adoption of Flink as its preferred stream processing engine, replacing its earlier focus on ksqlDB. Confluent acquired Immerok (a Flink company founded by ex-Ververica engineers) in early 2023 and made Flink SQL the centerpiece of Confluent Cloud's stream processing offering. The signal: even Kafka's commercial steward has accepted that Flink is the right answer for serious streaming compute.

Flink reads from sources (Kafka, Kinesis, Pulsar, Iceberg, files, CDC streams from Debezium), processes events through a dataflow graph, and writes results to sinks (Kafka topics, warehouses, real-time OLAP databases, Iceberg tables, key-value stores). It is the compute layer between event transport and analytical serving.

TextQL does not query Flink directly — Flink is a compute engine, not a queryable database. Instead, TextQL Ana queries the destinations Flink writes into: warehouses, lakehouses, and real-time analytics databases. The role Flink plays in a TextQL stack is to make sure the data those destinations contain is fresh, joined, enriched, and shaped the way analysts and AI systems need it. Flink is the engine that turns raw event streams into the well-modeled tables that TextQL's natural-language interface can reason about.

Apache Flink
Origin: Stratosphere research project at TU Berlin (2010)
Apache incubation: 2014
Apache top-level: December 2014
Original creators: Volker Markl, Stephan Ewen, Kostas Tzoumas (TU Berlin)
Commercial sponsors: Ververica (acquired by Alibaba, 2019); Confluent (Flink SQL)
License: Apache 2.0
Written in: Java, Scala
Notable users: Netflix, Uber, Stripe, Alibaba, ByteDance, Pinterest, Lyft
Category: Stream Processing
Monthly mindshare: ~120K · ~24K GitHub stars; the OSS stream processing leader; behind Kafka in mindshare