
Apache Airflow

Apache Airflow is the dominant open-source workflow orchestrator. Created at Airbnb in 2014 by Maxime Beauchemin, it defined the modern DAG-based approach to scheduling data pipelines.

Apache Airflow is the most widely deployed workflow orchestrator in the world and the de facto standard for scheduling data pipelines. If you have ever worked on a data team at a company larger than fifty people, there is a 70% chance there is an Airflow instance somewhere in your stack, and a 100% chance that someone has complained about it. Airflow is the SQL of orchestration: nobody loves it, everyone uses it, and any attempt to displace it has to reckon with the fact that the entire data engineering job market is built around it.

Origin Story: Airbnb, 2014

Airflow was created in October 2014 by Maxime Beauchemin, then a data engineer at Airbnb. Beauchemin had previously worked on data infrastructure at Facebook and Yahoo, and at Airbnb he ran into the problem every fast-growing data team hits: cron jobs and bash scripts could not express the dependencies between Airbnb's growing pipelines. Existing tools — LinkedIn's Azkaban, Spotify's Luigi, Apache Oozie — each had limitations. Luigi was the closest in spirit but lacked a built-in scheduler for triggering runs and a usable UI. Oozie was painfully XML-driven and tied to Hadoop.

Beauchemin's insight was that pipelines should be defined in code — specifically, in regular Python files — rather than in YAML, XML, or a UI. A DAG in Airflow is just a .py file that imports DAG and Operator classes and instantiates them. This "configuration as code" approach felt natural to engineers, played nicely with version control, and allowed dynamic DAG generation with normal Python loops. It is the single design choice that made Airflow win.

Airbnb open-sourced Airflow in June 2015. It entered the Apache Incubator in March 2016 and graduated to a Top-Level Project in January 2019. Beauchemin went on to found Preset, the commercial company behind Apache Superset, which he had also created at Airbnb.

By 2020, Airflow had become the unambiguous default. Every major cloud built a managed offering: Amazon MWAA (2020), Google Cloud Composer (2018), and the entire commercial business of Astronomer. Airflow 2.0 shipped in December 2020 with a major rewrite of the scheduler (10x faster), a stable REST API, and the TaskFlow API for cleaner Python ergonomics. Airflow 3.0 followed in 2025 with a long-overdue separation of the DAG author experience from the runtime.

What Airflow Actually Is

Strip away the marketing and Airflow is four things glued together:

  1. A scheduler. A long-running process that reads your DAG files, decides what should run based on schedule_interval and dependencies, and dispatches tasks to executors.
  2. An executor. The thing that actually runs tasks. Options include the LocalExecutor (single machine), CeleryExecutor (distributed via a message queue), and KubernetesExecutor (one pod per task).
  3. A metadata database. Postgres or MySQL, where Airflow stores DAG runs, task instances, connections, variables, and the entire history of what ran when.
  4. A web UI. The famous "graph view," "tree view," and "Gantt chart" that data engineers stare at all day to figure out why something is red.
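Two of the four components above are chosen in `airflow.cfg`. This fragment uses the Airflow 2.3+ section names; the connection string is an illustrative default, not a recommendation:

```ini
[core]
# Which executor runs tasks: SequentialExecutor, LocalExecutor,
# CeleryExecutor, or KubernetesExecutor.
executor = LocalExecutor

[database]
# The metadata database; Postgres is the usual production choice.
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
```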

The user-facing object is the DAG — a Python file that defines tasks and their dependencies. Tasks are instances of Operators, of which Airflow ships hundreds: BashOperator, PythonOperator, SnowflakeOperator, KubernetesPodOperator, and so on. A typical DAG might look like: extract from S3 → load into Snowflake → run dbt → trigger downstream notification.

What Airflow Got Right

Configuration as Python code. This was the breakthrough. Before Airflow, defining workflows meant writing XML or filling out web forms. After Airflow, your DAG was a file in your repo that you could test, diff, code-review, and refactor like any other code. Every modern orchestrator copied this idea.

The provider ecosystem. Airflow has hundreds of officially maintained "providers" — packaged integrations for AWS, GCP, Snowflake, Databricks, dbt, Slack, etc. If something exists in the data world, there's an Airflow operator for it. This network effect is the single biggest moat against challengers.

The community. Thousands of contributors, dozens of meetups, and an annual conference (Airflow Summit). Every data engineer has Airflow on their resume because every employer has Airflow in production.

Battle-tested at scale. Airbnb, Lyft, Stripe, Robinhood, Walmart, and a long list of others run Airflow with thousands of DAGs. Whatever weird edge case you hit, somebody has hit it before.

What Airflow Got Wrong

The honest list, from someone who likes Airflow:

It is task-centric, not asset-centric. Airflow knows that "task load_orders ran successfully on 2026-04-05 at 06:00 UTC." It does not know that "the orders table is now fresh." This sounds pedantic but it's the source of most pain in big Airflow deployments. When a downstream dbt model needs to run because the upstream table changed, you have to wire that dependency manually. Dagster made the opposite bet — assets first — and that bet has aged well.

The DAG parser couples authoring time and runtime. For years, the scheduler had to import every DAG file periodically just to figure out the schedule. Slow imports = slow scheduler. This was partially fixed in Airflow 2 and properly addressed in Airflow 3, but it's a wart that shaped years of operational pain.

Local development is awkward. Running a DAG locally requires a full Airflow installation with a metadata DB, a scheduler, and a webserver. Compared to running a Python script, this is a lot of ceremony. Astronomer's astro CLI papered over this with a Docker-based local environment.

Parse time and run time are easy to confuse. A DAG file is parsed by the scheduler and executed by the worker. Code that runs at the top level of a DAG file (outside of operator definitions and task callables) runs every time the scheduler parses it — which is constantly. New users learn this the hard way by accidentally hammering an API on every parse.

Versioning DAGs is unsolved. If you change a DAG file, the new version replaces the old version everywhere. Backfilling against "the version of the DAG that ran last Tuesday" is not a first-class concept. Dagster and Prefect both handle this better.

Airflow vs. the Challengers

Dagster is the more architecturally pure choice for new platforms — assets, types, and a coherent dev experience. Prefect is the more elegant Python API. But neither has cracked Airflow's network effect. The job listings still say "experience with Airflow." The ecosystem still ships providers for Airflow first. The cloud managed offerings still center on Airflow. Beauchemin himself has acknowledged the design debts and partly endorsed the asset-oriented future, but Airflow's installed base means it will be the dominant orchestrator for the rest of the decade regardless.

How TextQL Works with Apache Airflow

TextQL Ana does not replace Airflow — it reads the data that Airflow lands in your warehouse. Where Airflow metadata is exposed (via the REST API or by syncing run history into the warehouse), TextQL can use it to answer freshness questions: "when did the orders pipeline last succeed?", "which dashboards depend on a stale upstream?", "did last night's nightly job complete?" The orchestrator runs the pipelines; TextQL answers natural-language questions about their state and outputs.
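A freshness question like "when did the orders pipeline last succeed?" maps onto one call to Airflow 2's stable REST API (`GET /api/v1/dags/{dag_id}/dagRuns`). This is a stdlib-only sketch, not TextQL's implementation; the base URL, dag id, and basic-auth credentials are placeholders, and a real deployment would use whatever auth its Airflow exposes:

```python
# Query the most recent run of a DAG via Airflow's stable REST API.
import base64
import json
import urllib.request
from typing import Optional


def latest_run_url(base_url: str, dag_id: str) -> str:
    """Endpoint for a DAG's most recent run, newest first."""
    return (f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
            f"?order_by=-execution_date&limit=1")


def last_success(base_url: str, dag_id: str,
                 user: str, password: str) -> Optional[str]:
    """Return the end time of the latest run if it succeeded, else None."""
    req = urllib.request.Request(latest_run_url(base_url, dag_id))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        runs = json.load(resp)["dag_runs"]
    if runs and runs[0]["state"] == "success":
        return runs[0]["end_date"]
    return None


if __name__ == "__main__":
    # Assumes a local Airflow webserver with basic auth enabled.
    print(last_success("http://localhost:8080", "orders_pipeline",
                       "admin", "admin"))
```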

See TextQL in action


Apache Airflow
Created 2014 (open-sourced June 2015)
Origin Airbnb
Creator Maxime Beauchemin
Apache TLP since 2019 (incubated 2016)
License Apache 2.0
Language Python
Category Orchestration
Commercial sponsor Astronomer
Monthly mindshare ~400K · ~36K GitHub stars; the de facto orchestration standard; massive Stack Overflow tag