NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →


Delta Lake (Databricks)

Delta Lake is the open-source table format Databricks built to give cloud object storage the ACID guarantees of a database: the technical foundation of the lakehouse.

Delta Lake is what turns a folder full of Parquet files into something that behaves like a database table. It adds a transaction log on top of Parquet so that multiple readers and writers can work on the same data safely, schemas can be enforced and evolved, deletes and updates work, time travel is possible, and the lakehouse can offer the warehouse-like guarantees that bare object storage cannot.

The simple metaphor: Parquet is the brick; Delta Lake is the building inspector and the property records office. Bricks alone don't make a building — you need someone tracking which bricks are where, who added them, who removed them, and whether the structure is consistent. The Delta transaction log is that record-keeping layer.

This page is about Delta Lake as Databricks ships it — the version embedded in the Databricks Runtime, integrated with Unity Catalog, accelerated by Photon, and the default storage format for Databricks SQL. For the format itself in the abstract, see Delta Lake under table formats.

Origin Story

Delta Lake was built inside Databricks starting around 2017 to solve a problem that customers kept hitting: doing transactional updates on data in a data lake was almost impossible. The lake had cheap storage and infinite scale, but it lacked the most basic database property — atomicity. If two jobs wrote to the same table at the same time, you got partial files and corrupted state. If a job failed halfway through, you had to manually clean up. There was no UPDATE, no DELETE, no MERGE; everything was append-only or full-overwrite. Working with lake data felt like working with a database from the 1980s.

Delta was the answer. The core idea was to maintain a transaction log (a series of JSON files in a _delta_log directory) that records every commit to the table. Each commit is atomic; the log is the source of truth for which Parquet files belong to the table at any given version. This single architectural decision unlocked a cascade of warehouse-like features: ACID transactions, schema enforcement, schema evolution, MERGE INTO (upserts), DELETE, UPDATE, time travel ("show me this table as of last Tuesday"), and concurrent reads and writes without corruption.
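
The log-replay idea above can be pictured with a toy sketch in pure Python. This is not the real Delta implementation (the actual protocol has checkpoints, stats, and many more action types); it only shows how replaying add/remove actions from ordered JSON commits yields the set of live Parquet files at any version, which is exactly what makes time travel possible:

```python
import json

# Toy model of a _delta_log: each commit is a JSON-lines file of actions.
# Replaying commits 0..N in order reconstructs the table at version N.
commits = [
    # version 0: initial write adds two files
    ['{"add": {"path": "part-000.parquet"}}',
     '{"add": {"path": "part-001.parquet"}}'],
    # version 1: a MERGE rewrites part-001 into part-002
    ['{"remove": {"path": "part-001.parquet"}}',
     '{"add": {"path": "part-002.parquet"}}'],
]

def files_as_of(version):
    """Return the live Parquet files at a table version by replaying the log."""
    live = set()
    for commit in commits[: version + 1]:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)

print(files_as_of(0))  # ['part-000.parquet', 'part-001.parquet']
print(files_as_of(1))  # ['part-000.parquet', 'part-002.parquet']
```

Because a commit either appears in the log or it doesn't, readers never see a half-written table: that is the atomicity guarantee, achieved with nothing more than ordered files on object storage.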

Databricks open-sourced Delta Lake in April 2019, donated it to the Linux Foundation in 2019, and made it the default table format for the Databricks platform. The launch was a direct counter-move to the Hudi project (created at Uber in 2016, open-sourced 2017) and the Iceberg project (created at Netflix in 2017, open-sourced 2018), which were solving the same problem from different starting points. The three formats — Hudi, Iceberg, Delta — are technically siblings; the differences are real but relatively narrow, and the choice of which one a company uses is increasingly a function of which platform they're already on.

The reason Delta Lake exists is the foundational lakehouse argument: if you can give object storage the guarantees of a database, you no longer need a separate warehouse. Without Delta (or Iceberg, or Hudi), the lakehouse is a slogan. With it, the lakehouse is an architecture. Delta is the engineering substrate that makes Databricks' entire strategic narrative work.

Delta on Databricks vs. Open Delta

Delta Lake the open-source project is genuinely open and runs on Spark, Flink, Trino, Presto, DuckDB, and many other engines. But Delta Lake on Databricks gets a number of extras that the open-source version does not:

  • Photon-accelerated reads and writes, which are significantly faster than vanilla Spark on Delta.
  • Liquid Clustering, an alternative to traditional partitioning that adapts cluster keys over time without requiring table rewrites.
  • Predictive optimization, which automatically runs OPTIMIZE and VACUUM operations on tables based on usage patterns.
  • Deletion vectors, which speed up deletes and updates by marking rows as deleted in a side file rather than rewriting Parquet files.
  • Tight Unity Catalog integration, including column-level lineage, governance, and Delta Sharing.
  • Auto Loader and Delta Live Tables (now Lakeflow Declarative Pipelines) for streaming ingestion and declarative pipeline management on top of Delta tables.
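
Of these, deletion vectors are the easiest to picture with a toy sketch. The class and field names below are invented for illustration and this is not Databricks' on-disk format; it only shows the core trade: instead of rewriting an immutable data file to delete a row, record the deleted row positions in a small side structure and filter at read time:

```python
# Conceptual sketch of deletion vectors (illustrative names, not Delta's API).
class FileWithDeletionVector:
    def __init__(self, rows):
        self.rows = rows      # immutable data-file contents (never rewritten)
        self.deleted = set()  # "deletion vector": row positions to skip

    def delete_where(self, predicate):
        # Mark matching row positions deleted -- no rewrite of self.rows.
        for i, row in enumerate(self.rows):
            if predicate(row):
                self.deleted.add(i)

    def scan(self):
        # Readers merge the data file with its deletion vector on the fly.
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]

f = FileWithDeletionVector([{"id": 1}, {"id": 2}, {"id": 3}])
f.delete_where(lambda r: r["id"] == 2)
print(f.scan())  # [{'id': 1}, {'id': 3}]
```

The deleted rows are physically reclaimed later, when OPTIMIZE eventually rewrites the file; until then, deletes and updates stay cheap.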

Some of these features eventually flow back to the open-source project; others remain proprietary. The relationship is similar to Postgres vs. Aurora Postgres — the open core is real and useful, but the managed version has performance and operational features that give the vendor a meaningful edge.

What Delta Is Good At

  • ACID transactions on lake storage. The original and still the headline feature. Multiple jobs can write to the same table without stepping on each other.
  • MERGE INTO for CDC pipelines. This is the killer use case. Most production data pipelines that ingest from Kafka, Fivetran, Debezium, or any CDC source land in Delta tables via MERGE INTO. Delta makes upserts on lake data tractable.
  • Time travel. Querying a table as of a previous version is incredibly useful for debugging, audits, and reproducible ML training.
  • Schema enforcement and evolution. Delta refuses writes that don't match the table schema, preventing the silent corruption that plagued raw Parquet workflows. You can also evolve schemas safely with ALTER TABLE semantics.
  • Streaming sources and sinks. Structured Streaming reads from and writes to Delta natively, and Delta tables can act as both source and sink in the same pipeline — the foundation of medallion (bronze/silver/gold) architectures.
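
The MERGE INTO semantics in the list above boil down to "WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT." A minimal in-memory sketch, with illustrative names and plain dicts standing in for table rows (not Delta's API), makes the upsert behavior concrete:

```python
# Toy illustration of MERGE INTO (upsert) semantics, keyed on "id".
def merge_into(target, source, key="id"):
    """WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT."""
    merged = {row[key]: row for row in target}
    for row in source:
        # Matched rows are overwritten field-by-field; new keys are inserted.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
source = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
print(merge_into(target, source))
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'B'}, {'id': 3, 'name': 'c'}]
```

On Delta, the same operation runs transactionally at file granularity: the commit rewrites only the Parquet files containing matched rows, which is what makes CDC upserts on lake storage tractable.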

What Delta Is Not Great At

  • Cross-engine maturity. Delta runs on many engines, but the integration depth varies. Iceberg is the format the broadest set of engines support most uniformly today.
  • Many tiny files. Like all lake formats, Delta is happiest with files in the tens or hundreds of MB. Without OPTIMIZE runs, streaming workloads can leave you with thousands of small files and slow scans.
  • Single-row low-latency reads. Delta is a columnar analytical format. If you need millisecond point-lookups by primary key, you want a key-value store or an OLTP database, not Delta.

The Opinionated Take

Delta Lake is the most consequential piece of data infrastructure Databricks has ever shipped, and it's the technology that turned the lakehouse from a slide into an architecture that ships. The competition with Iceberg is real and ongoing, and the honest read in 2026 is that Iceberg has won the open-format mindshare while Delta has won the Databricks-installed-base depth. Snowflake, BigQuery, AWS, and most non-Databricks vendors have standardized on Iceberg for their open-table-format support. Databricks customers, meanwhile, are mostly on Delta because it's the path of least resistance and because Photon and Unity Catalog were built around it.

Databricks has hedged this very smartly. In 2024 they introduced UniForm, which lets a Delta table also expose itself as if it were an Iceberg table by writing Iceberg metadata alongside the Delta log. In effect, Databricks is saying "you can keep using Delta and still be readable by every Iceberg-compatible engine." This is the right defensive move: it neutralizes the lock-in argument against Delta while preserving Databricks' performance and feature edge on its native format. The acquisition of Tabular in mid-2024 (the company founded by the Iceberg creators) cemented the strategy — Databricks now employs the people who built Iceberg and is positioning to be a leader in both formats simultaneously.

The convergence story is starkest here. Delta and Iceberg are increasingly two dialects of the same idea, with bridge formats blurring even that distinction. The right way to think about Delta in 2026 is not "Databricks' proprietary format" but "the open table format optimized for the Databricks runtime, with an Iceberg compatibility layer." That's a more nuanced position than the format wars of 2021-2023 implied, and it reflects how rapidly the entire industry is converging on open storage as the foundation.

How TextQL Fits

Delta Lake is mostly invisible to TextQL — by the time queries reach Delta tables through Databricks SQL, TextQL is just generating SQL, and the storage format underneath is an implementation detail. But Delta features matter indirectly. Time travel makes it possible to ask Ana "what did this dashboard look like on Monday," reproducible against a specific table version. Column-level lineage from Delta operations flows into Unity Catalog, which TextQL uses to ground its query generation. And because Delta is the default storage for most Databricks customers, every TextQL Databricks deployment is, in effect, a TextQL-on-Delta deployment.

See TextQL in action


Delta Lake (Databricks)

  • Released: April 2019 (open-sourced); created internally 2017
  • Vendor: Databricks (creator and primary maintainer)
  • Type: Open table format with ACID transactions
  • License: Apache 2.0; governed by the Linux Foundation
  • Category: Data Warehouse
  • Monthly mindshare: ~80K (Databricks-flavor Delta; a subset of Delta users overall are on Databricks specifically)