Apache Parquet
Apache Parquet is the columnar file format that powers essentially every modern analytical data system. Born at Twitter and Cloudera in 2013, it is the invisible foundation of the data lakehouse.
Apache Parquet is the columnar file format that quietly runs the modern data ecosystem. Every serious data lake, every lakehouse, every modern analytical engine, and most cloud data warehouses use Parquet somewhere. It is so universal that most data engineers have stopped thinking about it as a choice — it's simply the default. When Iceberg, Delta Lake, or Hudi says "the table is stored as a bunch of files," those files are almost certainly Parquet.
The metaphor: if a row-based file format is a stack of index cards (each card has all of one person's info), Parquet is a set of filing cabinets, one per field — one cabinet for all the names, one for all the zip codes, one for all the revenues. When you ask a question that only cares about revenue, Parquet opens exactly one cabinet. Everything else is untouched. That single design choice — store by column, not by row — is the foundation of modern analytical performance.
Parquet's story begins with a Google paper. In 2010, Google published Dremel, a paper describing a columnar storage format and query engine that could scan petabytes of data in seconds for interactive analytics. Dremel later evolved into the engine behind Google BigQuery. The paper became one of the most influential in the analytical data world.
In 2012, engineers at Twitter (led by Julien Le Dem) and Cloudera (led by Nong Li) decided to build an open-source columnar format inspired by Dremel's ideas. Twitter needed fast analytics on its Hadoop cluster; Cloudera needed a columnar format for Impala, its interactive SQL engine. They joined forces. The project was called Parquet, named after the hardwood flooring pattern, and was open-sourced in March 2013 at a Hadoop Summit.
Parquet entered the Apache incubator later in 2013 and became a top-level Apache project in April 2015. By then it had already become the default columnar format for Hadoop, Spark, Hive, Impala, Drill, and Presto.
Every explanation of Parquet has to start here. Traditional databases and file formats store data row by row: all the fields of row 1, then all the fields of row 2, and so on. This is great for OLTP (fetch all the data about user 42). It is terrible for analytics (sum the revenue column across a billion rows) because you end up reading every other field too.
Columnar formats flip this on its head: store all of column A together, then all of column B, and so on. For analytical queries this is transformative:
- Queries read only the columns they actually touch; every other column is never fetched from disk.
- Values within a column are similar to each other, so encoding and compression work dramatically better.
- Engines can process whole columns at a time with vectorized (SIMD) execution instead of row-at-a-time loops.
These aren't marginal improvements. For typical analytical queries, a Parquet-based system is easily 10–100x faster, and substantially cheaper to store, than an equivalent row-based format.
A Parquet file is organized hierarchically:
- File: the overall container, ending in a footer that holds all the file's metadata.
- Row groups: horizontal slices of the table, each holding a batch of rows.
- Column chunks: within each row group, the contiguous data for one column.
- Pages: the smallest unit inside a column chunk, where encoding and compression are applied.
That footer metadata — particularly the per-column-chunk statistics (min, max, null count, sometimes bloom filters) — is what enables predicate pushdown. When you run WHERE price > 100, the engine reads each column chunk's min/max and skips entire chunks whose max is below 100. This alone can make queries orders of magnitude faster.
Parquet supports nested schemas (arrays, maps, structs) using the repetition/definition level encoding from the Dremel paper. This is how Parquet handles JSON-like data while still being fully columnar underneath.
A common and damaging confusion: Parquet is a file format, not a table format. It describes how bytes are organized within a single file. It does not answer any of the questions a table format answers: which files belong to the table, what version of the schema they were written under, how to atomically add or remove files, how to time travel, how to do ACID transactions. A directory of Parquet files on S3 is not a table — it's a directory of files. Iceberg, Delta, and Hudi exist precisely to add the missing layer.
This is why Parquet and, say, Iceberg are not alternatives — they are complementary layers. Iceberg tables are made of Parquet files. Delta tables are made of Parquet files. Hudi base files are Parquet files. Parquet is the substrate; table formats are the coordination layer.
Parquet supports multiple compression codecs: Snappy (the default; fast, moderate ratio), Gzip (slower, better ratio), Zstd (the modern favorite; great ratio and speed), LZ4, and Brotli. It also supports several encodings applied before compression: dictionary encoding, plain encoding, delta encoding, bit-packing, and byte stream split for floats. The combination of a good encoding and a good compressor is what lets Parquet reach compression ratios that row-based formats cannot touch.
In 2025–2026, the trend is toward Zstd: it gives Gzip-like compression ratios at Snappy-like speeds, and many Parquet writer configurations now ship with it as the default codec.
Parquet is supported natively by essentially every modern data tool: Apache Spark, Apache Flink, Trino / Presto, Hive, Impala, Drill, DuckDB, ClickHouse, Snowflake (as an external format), BigQuery (both read and as Iceberg storage), Amazon Redshift Spectrum, Amazon Athena, Dremio, Pandas, Polars, Arrow, and every modern ML framework that touches tabular data. If a tool deals with tabular data at scale and does not speak Parquet, it is almost certainly on its way out.
The close kinship with Apache Arrow — the in-memory columnar format — is worth noting. Arrow and Parquet were designed by overlapping communities and are intended to be efficiently convertible: Arrow is the in-memory representation, Parquet is the on-disk representation. Zero-copy or near-zero-copy conversion between the two is a major reason modern analytical engines can be so fast.
Parquet won and nothing is threatening it. For the foreseeable future, columnar analytical data at rest is stored as Parquet, and that's simply the default. The only real debates around Parquet are in the encoding and compression details (Zstd vs Snappy, whether to enable bloom filters, page size tuning). These are operational, not strategic. If you're building a new data system and you're not sure what file format to use, use Parquet. You will not regret it.
Almost every table TextQL Ana queries — via Snowflake, Databricks, BigQuery, Redshift, Trino, or an Iceberg/Delta lakehouse — is ultimately Parquet on disk. Parquet's per-column statistics and predicate pushdown are the reason LLM-generated SQL can answer analytical questions in seconds instead of minutes: the engine only reads the columns and row groups the question actually needs.