Apache Parquet
Apache Parquet is the columnar file format that powers essentially every modern analytical data system. Born at Twitter and Cloudera in 2013, it is the invisible foundation of the data lakehouse.
Apache Parquet is the columnar file format that quietly runs the modern data ecosystem. Every serious data lake, every lakehouse, every modern analytical engine, and most cloud data warehouses use Parquet somewhere. It is so universal that most data engineers have stopped thinking about it as a choice — it's simply the default. When Iceberg, Delta Lake, or Hudi says "the table is stored as a bunch of files," those files are almost certainly Parquet.
The metaphor: if a row-based file format is a stack of index cards (each card has all of one person's info), Parquet is a set of filing cabinets, one per field — one cabinet for all the names, one for all the zip codes, one for all the revenues. When you ask a question that only cares about revenue, Parquet opens exactly one cabinet. Everything else is untouched. That single design choice — store by column, not by row — is the foundation of modern analytical performance.
Parquet's story begins with a Google paper. In 2010, Google published Dremel, a paper describing a columnar storage format and query engine that could scan petabytes of data in seconds for interactive analytics. Dremel later evolved into the engine behind Google BigQuery. The paper became one of the most influential in the analytical data world.
In 2012, engineers at Twitter (led by Julien Le Dem) and Cloudera (led by Nong Li) decided to build an open-source columnar format inspired by Dremel's ideas. Twitter needed fast analytics on its Hadoop cluster; Cloudera needed a columnar format for Impala, its interactive SQL engine. They joined forces. The project was called Parquet, named after the hardwood flooring pattern, and was open-sourced in March 2013 at a Hadoop Summit.
Parquet entered the Apache incubator later in 2013 and became a top-level Apache project in April 2015. By then it had already become the default columnar format for Hadoop, Spark, Hive, Impala, Drill, and Presto.
Every explanation of Parquet has to start here. Traditional databases and file formats store data row by row: all the fields of row 1, then all the fields of row 2, and so on. This is great for OLTP (fetch all the data about user 42). It is terrible for analytics (sum the revenue column across a billion rows) because you end up reading every other field too.
Columnar formats flip this on its head: store all of column A together, then all of column B, and so on. For analytical queries this is transformative:
- Queries read only the columns they actually touch; every other column is never fetched from disk.
- Values within a column are similar to each other, so encoding and compression work dramatically better.
- Engines can process whole columns at a time with vectorized (SIMD) execution instead of row-at-a-time loops.
These aren't marginal improvements. For typical analytical queries, a Parquet-based system is easily 10–100x faster, and substantially cheaper to store, than an equivalent row-based format.
A Parquet file is organized hierarchically:
- File: the overall container, ending in a footer that holds all the file's metadata.
- Row groups: horizontal slices of the table, each holding a batch of rows.
- Column chunks: within each row group, the contiguous data for one column.
- Pages: the smallest unit inside a column chunk, where encoding and compression are applied.
That footer metadata — particularly the per-column-chunk statistics (min, max, null count, sometimes bloom filters) — is what enables predicate pushdown. When you run WHERE price > 100, the engine reads each column chunk's min/max and skips entire chunks whose max is below 100. This alone can make queries orders of magnitude faster.
Parquet supports nested schemas (arrays, maps, structs) using the repetition/definition level encoding from the Dremel paper. This is how Parquet handles JSON-like data while still being fully columnar underneath.
A common and damaging confusion: Parquet is a file format, not a table format. It describes how bytes are organized within a single file. It does not answer any of the questions a table format answers: which files belong to the table, what version of the schema they were written under, how to atomically add or remove files, how to time travel, how to do ACID transactions. A directory of Parquet files on S3 is not a table — it's a directory of files. Iceberg, Delta, and Hudi exist precisely to add the missing layer.
This is why Parquet and, say, Iceberg are not alternatives — they are complementary layers. Iceberg tables are made of Parquet files. Delta tables are made of Parquet files. Hudi base files are Parquet files. Parquet is the substrate; table formats are the coordination layer.
Parquet supports multiple compression codecs: Snappy (the default; fast, moderate ratio), Gzip (slower, better ratio), Zstd (the modern favorite; great ratio and speed), LZ4, and Brotli. It also supports several encodings applied before compression: dictionary encoding, plain encoding, delta encoding, bit-packing, and byte stream split for floats. The combination of a good encoding and a good compressor is what lets Parquet reach compression ratios that row-based formats cannot touch.
In 2025–2026, the trend is toward Zstd: it gives Gzip-like compression ratios at Snappy-like speeds, and many Parquet writer configurations now ship with it as the default codec.
Parquet is supported natively by essentially every modern data tool: Apache Spark, Apache Flink, Trino / Presto, Hive, Impala, Drill, DuckDB, ClickHouse, Snowflake (as an external format), BigQuery (both read and as Iceberg storage), Amazon Redshift Spectrum, Amazon Athena, Dremio, Pandas, Polars, Arrow, and every modern ML framework that touches tabular data. If a tool deals with tabular data at scale and does not speak Parquet, it is almost certainly on its way out.
The close kinship with Apache Arrow — the in-memory columnar format — is worth noting. Arrow and Parquet were designed by overlapping communities and are intended to be efficiently convertible: Arrow is the in-memory representation, Parquet is the on-disk representation. Zero-copy or near-zero-copy conversion between the two is a major reason modern analytical engines can be so fast.
Parquet won and nothing is threatening it. For the foreseeable future, columnar analytical data at rest is stored as Parquet, and that's simply the default. The only real debates around Parquet are in the encoding and compression details (Zstd vs Snappy, whether to enable bloom filters, page size tuning). These are operational, not strategic. If you're building a new data system and you're not sure what file format to use, use Parquet. You will not regret it.
Almost every table TextQL Ana queries — via Snowflake, Databricks, BigQuery, Redshift, Trino, or an Iceberg/Delta lakehouse — is ultimately Parquet on disk. Parquet's per-column statistics and predicate pushdown are the reason LLM-generated SQL can answer analytical questions in seconds instead of minutes: the engine only reads the columns and row groups the question actually needs.