Table Formats

Table formats like Apache Iceberg, Delta Lake, and Apache Hudi add database semantics on top of Parquet files in object storage. They are the critical layer that turns a data lake into a data lakehouse.

A table format is a specification that tells a query engine how to treat a pile of files in object storage as a single, coherent database table. If that sounds modest, it isn't. Table formats are the layer that turns a swamp of Parquet files in S3 into something you can INSERT, UPDATE, DELETE, and MERGE into safely — without losing your mind or your data.

The best way to think about it: a table format is a filing system on top of files. Parquet alone is a folder of documents. A table format is the cabinet, the labels, the index cards, and the librarian who remembers what you changed yesterday and can put it back if you change your mind. Without a table format, a "table" on a data lake is just a directory convention and a prayer.

Why Table Formats Exist

For a decade, the data lake promise was seductive: dump Parquet files in S3, point a query engine at them, and you have infinite analytics for cheap. The reality was messier. If two jobs wrote to the same "table" simultaneously, you got corrupted reads. If a job failed halfway through, you were left with orphaned partial files. Schema changes required rewriting everything. There was no notion of a transaction, no rollback, no time travel, and no way to run an UPDATE without rebuilding entire partitions by hand.

Warehouses like Snowflake and Redshift had none of these problems — because they stored data in proprietary, closed formats the engine fully controlled. The price for that reliability was lock-in: your data lived inside the vendor, and extracting it was painful.

Table formats are the compromise. They give you warehouse-grade semantics (ACID, schema evolution, time travel) on top of open file formats (Parquet) in your own object storage bucket. You keep the economics and openness of a lake. You get the correctness and ergonomics of a warehouse. This is the entire technical basis of the lakehouse architecture.

The Four Properties of a Table Format

Every serious table format promises the same four things. Together they define the category: offer all four and you have a real table format; miss any one and you have a partial solution.

1. ACID transactions. Writes are atomic: they either fully happen or fully don't. Concurrent readers and writers see consistent snapshots. No more half-written files. No more "I thought that job finished." Under the hood, every table format solves this the same way: a metadata layer tracks which files belong to which version of the table, and commits are atomic swaps of a pointer.
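The pointer-swap mechanics can be sketched in a few lines of plain Python. This is a toy model, not any format's real implementation: in practice the compare-and-swap happens in a catalog or via an atomic storage operation, not a local lock, but the invariant is the same — a failed write never moves the pointer.

```python
import threading

class TableMetadata:
    """One immutable version of the table: just a list of data files."""
    def __init__(self, files):
        self.files = tuple(files)

class Table:
    """Toy table: 'current' is the single pointer that commits atomically swap."""
    def __init__(self):
        self.current = TableMetadata([])
        self._lock = threading.Lock()  # stands in for the catalog's compare-and-swap

    def read(self):
        # Readers grab the pointer once and see a consistent snapshot,
        # no matter what writers do afterwards.
        return self.current.files

    def commit(self, write_fn):
        with self._lock:
            base = self.current
            try:
                new_files = write_fn(list(base.files))  # may fail halfway
            except Exception:
                return False  # pointer never moved: the write "fully didn't happen"
            self.current = TableMetadata(new_files)     # atomic swap
            return True

t = Table()
t.commit(lambda files: files + ["part-0001.parquet"])

def failing_write(files):
    files.append("part-0002.parquet")
    raise IOError("job died halfway")

snapshot = t.read()
t.commit(failing_write)  # fails: no partial state becomes visible
assert t.read() == snapshot == ("part-0001.parquet",)
```

Readers who grabbed the old pointer keep reading the old file list; readers who arrive after a successful commit see the new one. That is snapshot isolation in miniature.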

2. Schema evolution. You can add, drop, rename, or reorder columns without rewriting the underlying data. The metadata layer records the schema history so older files still read correctly under the new schema. This sounds trivial; it is not. A plain Parquet directory cannot do this reliably — renaming a column in one file and not the others will silently break queries.
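The trick that makes renames safe is tracking columns by a stable field id rather than by name, which is loosely how Iceberg does it. A toy sketch, with the schemas, field ids, and file contents invented for illustration:

```python
# Each column has a stable field id; names are just labels on top of it.
SCHEMAS = {
    1: {1: "user_id", 2: "email"},
    2: {1: "user_id", 2: "email_address", 3: "signup_date"},  # rename + add
}

# Data files store values keyed by field id and remember the schema
# that was current when they were written.
old_file = {"schema_id": 1, "rows": [{1: 42, 2: "a@example.com"}]}

def read(file, current_schema_id=2):
    """Project a file written under any historical schema onto the current one."""
    current = SCHEMAS[current_schema_id]
    out = []
    for row in file["rows"]:
        # Values resolve by field id, so renamed columns still match;
        # columns added after the file was written come back as None.
        out.append({name: row.get(fid) for fid, name in current.items()})
    return out

print(read(old_file))
# → [{'user_id': 42, 'email_address': 'a@example.com', 'signup_date': None}]
```

No data file was rewritten: only the schema map in metadata changed, and old files read correctly under the new names.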

3. Time travel. Because the metadata layer tracks every version of the table, you can query the table as it existed last Tuesday at 3pm. This is priceless for debugging bad pipelines, reproducing reports, auditing regulatory data, and "undoing" accidental deletes. Under the hood, it's snapshot isolation — old file manifests aren't deleted immediately; they're retained for a configurable window.
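The snapshot log behind time travel can be modeled as a sorted list plus a binary search. This is a toy sketch; the method names `as_of` and `expire` are invented here, though they mirror the time-travel queries and expire-snapshots maintenance that real formats expose.

```python
import bisect

class SnapshotLog:
    """Toy snapshot log: every commit appends (timestamp, file-list). Old
    entries are kept for a retention window instead of being deleted."""
    def __init__(self):
        self.snapshots = []  # kept sorted by timestamp

    def commit(self, ts, files):
        self.snapshots.append((ts, tuple(files)))

    def as_of(self, ts):
        # Last snapshot at or before ts: "the table as it existed then".
        i = bisect.bisect_right([t for t, _ in self.snapshots], ts)
        if i == 0:
            raise LookupError("no snapshot at or before that time")
        return self.snapshots[i - 1][1]

    def expire(self, older_than, keep_last=1):
        # Retention: drop snapshots outside the window, never the newest ones.
        self.snapshots = [
            s for idx, s in enumerate(self.snapshots)
            if s[0] >= older_than or idx >= len(self.snapshots) - keep_last
        ]

log = SnapshotLog()
log.commit(100, ["f1.parquet"])
log.commit(200, ["f1.parquet", "f2.parquet"])
log.commit(300, ["f3.parquet"])  # e.g. a compaction rewrote everything

assert log.as_of(250) == ("f1.parquet", "f2.parquet")  # the table "last Tuesday"
assert log.as_of(999) == ("f3.parquet",)               # the current table
```

Once `expire` drops a snapshot, travel to that point is gone — which is exactly why the retention window is configurable.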

4. Partition evolution (and hidden partitioning). Partitioning strategy can change over time without rewriting old data, and the query engine figures out which files to read automatically. If you started partitioning by day and want to switch to hour, you don't have to rebuild the table. This is the property where Iceberg most clearly beats its rivals.
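A toy sketch of how per-file partition specs make that possible (file names and spec shapes invented; real Iceberg stores versioned specs with transforms like day() and hour() in table metadata):

```python
from datetime import datetime

# Partition specs are versioned transforms over a source column. "Hidden"
# partitioning means the engine applies the transform itself; users filter
# on ts, never on a derived partition column.
SPECS = {
    1: lambda ts: ts.strftime("%Y-%m-%d"),     # spec 1: day(ts)
    2: lambda ts: ts.strftime("%Y-%m-%d-%H"),  # spec 2: hour(ts), added later
}

files = [
    {"path": "old-1.parquet", "spec": 1, "partition": "2024-03-01"},
    {"path": "new-1.parquet", "spec": 2, "partition": "2024-06-01-14"},
    {"path": "new-2.parquet", "spec": 2, "partition": "2024-06-01-15"},
]

def plan(query_ts):
    """Prune files for `WHERE ts = query_ts`: the query value is evaluated
    under each file's OWN spec, so old day-partitioned files and new
    hour-partitioned files coexist in one table without any rewrite."""
    return [
        f["path"] for f in files
        if f["partition"] == SPECS[f["spec"]](query_ts)
    ]

print(plan(datetime(2024, 6, 1, 14, 30)))  # → ['new-1.parquet']
print(plan(datetime(2024, 3, 1, 9, 0)))    # → ['old-1.parquet']
```

Switching from day to hour partitioning only changes which spec new files are written under; the planner keeps pruning old files correctly under the old spec.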

The layer cake, bottom to top: Object storage (S3) holds bytes. File formats (Parquet, ORC, Avro) organize those bytes into columns and rows within a single file. Table formats (Iceberg, Delta, Hudi) organize many files into a logical table with transactions and versioning. A catalog (Glue, Unity, Polaris, Nessie) tracks which tables exist and where their metadata lives. Query engines (Spark, Trino, Snowflake, Databricks) read the whole stack and run SQL.
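Read bottom to top for storage, top to bottom for a query: an engine walks the stack in reverse before touching a single data byte. A toy resolution, with every path and table name invented:

```python
# Each layer only knows about the one directly below it.
object_store = {  # object storage: just keys → bytes
    "s3://bkt/tbl/metadata/v3.json": {"files": ["s3://bkt/tbl/data/p1.parquet"]},
    "s3://bkt/tbl/data/p1.parquet": "<parquet bytes>",
}
catalog = {  # catalog: table name → current metadata pointer
    "analytics.events": "s3://bkt/tbl/metadata/v3.json",
}

def scan(table_name):
    """What a query engine does before reading any data."""
    meta_ptr = catalog[table_name]     # 1. ask the catalog where the table lives
    metadata = object_store[meta_ptr]  # 2. fetch the table-format metadata
    return [object_store[p] for p in metadata["files"]]  # 3. read the data files

print(scan("analytics.events"))  # → ['<parquet bytes>']
```

This is also why the catalog matters commercially: it is the one component every engine must ask first.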

The Format War: Iceberg vs Delta vs Hudi

From about 2019 through 2024, the data world fought a religious war over which table format would become the default. The three contenders:

  • Apache Iceberg — born at Netflix in 2017, donated to the Apache Software Foundation, designed by engineers who had been burned by Hive's metadata model. Cleanest design, broadest engine support, no single vendor in charge.
  • Delta Lake — born at Databricks in 2019, technically open source (Linux Foundation) but historically optimized first and best inside Databricks. The incumbent inside the world's largest Spark shop.
  • Apache Hudi — born at Uber in 2016 (arguably the first of the three), designed for streaming upserts and CDC, never quite caught on commercially.

Iceberg won. The market has called it. The pivotal moment was June 2024, when Databricks acquired Tabular — the commercial company founded by Iceberg's original creators Ryan Blue and Daniel Weeks — for a reported $1–2 billion. Databricks didn't need Tabular's revenue. Databricks needed to make sure that if Iceberg became the universal standard, Databricks wasn't left on the outside looking in. It was a defensive acquisition at a staggering price, and it confirmed what the market already suspected: Iceberg was the neutral standard everyone else was going to adopt.

Snowflake made the same call from a different angle. Starting in 2022 and fully committed by 2024, Snowflake backed Iceberg as its open table format, launched Polaris Catalog (an Iceberg REST catalog, later donated to Apache), and made Iceberg a first-class citizen inside the platform. When Snowflake and Databricks — the two companies who agree on nothing — both pick the same format, the war is over.

Delta Lake is increasingly seen as a Databricks-native thing. It is technically open. There is technically a Delta Standalone reader and the Delta Kernel project. In practice, the best Delta experience is inside Databricks, the newest Delta features land there first, and customers who want multi-engine access keep bumping into friction. Databricks' own response — UniForm, which writes Delta tables with Iceberg-compatible metadata — is a tacit admission that customers want to read their data with Iceberg-speaking engines. When you ship a compatibility layer for the rival format, you have acknowledged who is winning.

Hudi is fading. It still has a loyal base at Uber and in certain streaming-heavy shops, and its merge-on-read design is genuinely clever for CDC workloads. But its community and engine support have not kept pace, and most new lakehouse deployments in 2025 do not seriously consider it.

Why This Matters

The table format you pick determines what query engines can read your data, how much you'll pay for lock-in, and whether you can change your mind later. It is arguably the single most consequential architectural decision in a modern lakehouse. Picking the wrong one means either migrating petabytes (expensive, risky) or paying a vendor tax forever.

It also matters because table formats are where the control plane of data platforms is moving. Catalogs (Apache Polaris, Unity Catalog, Nessie, Gravitino) are all fighting to be the REST endpoint that knows where every table lives and who can read it. Whoever owns the catalog owns the governance layer. Whoever owns the governance layer owns the customer.

When You Need a Table Format (and When You Don't)

  • You have petabytes of Parquet on S3 and want SQL with updates? Yes. Iceberg or Delta.
  • You want multi-engine access (Spark + Trino + Snowflake + DuckDB)? Yes. Iceberg.
  • You're a pure Databricks shop with no plans to leave? Delta works fine. Iceberg via UniForm is safer long-term.
  • You only do append-only analytics on immutable data? Plain Parquet + a Hive catalog still works.
  • You have 100 GB of data and one analyst? Skip it. Use DuckDB or a warehouse.
  • You need high-throughput CDC upserts into a lake? Hudi's merge-on-read is still competitive here.

Tools in This Category

Table formats (the "filing system"):

  • Apache Iceberg — The winner. Netflix origin, neutral governance, broadest support.
  • Delta Lake — Databricks' format. Open in theory, Databricks-best in practice.
  • Apache Hudi — Uber's streaming-first underdog. Fading.

File formats (the bytes underneath):

  • Parquet — The columnar file format that made modern analytics possible. Universal.
  • ORC — The Hive-era columnar format. Technically excellent, commercially legacy.
  • Avro — Row-based, schema-first, the lingua franca of Kafka.

How TextQL Works with Table Formats

TextQL Ana queries data wherever it lives, and that increasingly means Iceberg and Delta tables read through Snowflake, Databricks, Trino, or a dedicated lakehouse query engine. Because table formats enforce schemas and expose consistent metadata, they give LLM-generated SQL the same reliability guarantees a warehouse does — with the open-format economics of a lake underneath.

See TextQL in action

Table Formats
Category: Open storage abstraction for data lakes
Also called: Open table formats, lakehouse table formats
Not to be confused with: File formats (Parquet, ORC, Avro), which sit underneath
Main players: Apache Iceberg, Delta Lake, Apache Hudi
File format layer: Parquet, ORC, Avro
Built on top of: Object storage (S3, ADLS, GCS)
Typical users: Data platform engineers, lakehouse architects
Monthly mindshare: ~100K · data engineers thinking about lakehouses; concept emerged 2017+