Apache ORC
Apache ORC is the columnar file format born in the Hive ecosystem. Technically on par with Parquet and in some respects ahead of it, it has nonetheless become the legacy option outside Hive/Hortonworks shops.
Apache ORC (Optimized Row Columnar) is a columnar file format, in many ways technically comparable to Parquet, that was born inside the Hive ecosystem in 2013. For several years it was the best columnar format for Hive workloads, and for some specific use cases it was measurably faster than Parquet. Today, ORC is still a real format with real users, but outside the Hive and Hortonworks lineage it has become the legacy option. New deployments almost universally pick Parquet. The reasons are commercial, not technical.
The metaphor: if Parquet is the winning design that everyone agreed to standardize around, ORC is the technically excellent sibling that happened to be raised inside the wrong family — specifically, inside a Hadoop vendor (Hortonworks) that was eventually absorbed by another (Cloudera) that preferred the other format.
The ORC story starts with Hive's perennial problem: how do you make SQL queries over Hadoop files fast? Hive's original file formats — text and SequenceFile — were row-based and slow. In 2011, Hive added RCFile (Record Columnar File), the first attempt at a columnar format for Hive. RCFile was a real improvement but had design limitations: weak type information, limited compression options, and no support for complex predicate pushdown.
In 2013, engineers at Hortonworks (a Hadoop distribution company born out of Yahoo) and Facebook built ORC as the successor to RCFile. The goals were specific to Hive: better compression, strong type support, per-column statistics for predicate pushdown, lightweight indexes, and native handling of ACID-style updates (Hive had just added ACID tables, and ORC was the format backing them).
ORC was contributed to Apache and became a top-level project in 2015. For much of the Hadoop era, Hortonworks pushed ORC hard as the right format for Hive workloads, while Cloudera pushed Parquet as the right format for Impala and Spark workloads. This was one of several proxy wars between the two dominant Hadoop vendors of the era.
On pure technical merits, ORC is genuinely excellent. A few things it does notably well:
Stripe-based layout. ORC files are divided into large stripes (default 64 MB). Each stripe contains all columns for that horizontal slice of rows. This is analogous to Parquet's row groups, but ORC stripes are typically larger and are laid out with indexes at the start of each stripe for fast predicate evaluation.
Lightweight indexes. Every stripe carries min/max statistics and, crucially, optional bloom filters on configured columns, stored per index group of 10,000 rows by default. This enables highly selective predicate pushdown: a query like WHERE user_id = 12345 can skip almost every stripe in a large table if bloom filters are enabled on user_id (the sketch after this list shows how to turn them on). Parquet added bloom filters later.
Column encodings. ORC supports aggressive run-length encoding, dictionary encoding, bit-packing, and delta encoding per data type. For many workloads, ORC compresses somewhat better than Parquet out of the box.
ACID support in Hive. ORC was designed hand-in-hand with Hive's ACID transaction support. Hive's base+delta file model for row-level updates lives natively in ORC. For years, if you wanted update/delete on a Hive table, ORC was mandatory.
Rich type system. ORC has first-class support for complex types (structs, lists, maps, unions), decimal, timestamp with local time zone, and native date handling.
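To make the stripe, bloom-filter, and compression options concrete, here is a minimal sketch using pyarrow.orc, assuming PyArrow is installed with ORC support. The file name and columns are invented for illustration, and the options shown are PyArrow's write-time knobs; a production Hive cluster would set the equivalents through table properties instead.

```python
import pyarrow as pa
import pyarrow.orc as orc

# A tiny illustrative table; stripes and bloom filters only pay off
# at millions of rows, but the API is the same.
table = pa.table({
    "user_id": pa.array([12345, 67890, 12345], type=pa.int64()),
    "event": ["click", "view", "click"],
})

orc.write_table(
    table,
    "events.orc",
    stripe_size=64 * 1024 * 1024,      # stripe size in bytes (the 64 MB default)
    compression="zstd",                # also: zlib, snappy, lz4, uncompressed
    bloom_filter_columns=["user_id"],  # build bloom filters on these columns
    bloom_filter_fpp=0.05,             # target false-positive probability
)

# Inspect what was written: stripe count, row count, and the schema
# (surfaced as an Arrow schema) that engines rely on for pushdown.
f = orc.ORCFile("events.orc")
print(f.nstripes, f.nrows)
print(f.schema)
```

At read time an engine like Hive or Trino consults each stripe's min/max statistics and bloom filters to decide which stripes to skip entirely; PyArrow exposes the same granularity through ORCFile.read_stripe.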
On several benchmarks, particularly Hive-style scan workloads with heavy predicate pushdown, ORC measurably outperformed Parquet in the mid-2010s. Even today, there are specific workloads where ORC is the faster option.
If ORC is technically so strong, why is it losing? Four reasons, none of them technical:
1. The ecosystem chose Parquet. Spark, Impala, Presto, Trino, DuckDB, ClickHouse, BigQuery, Snowflake, and Redshift all prioritized Parquet support. ORC support exists in most of these (sometimes very well), but Parquet is the default and gets the tuning attention. When you pick Parquet, everything just works. When you pick ORC, you occasionally run into the "supported but second-class" experience.
2. The Hortonworks-Cloudera merger. In 2019, Cloudera acquired Hortonworks and unified their distributions. Cloudera had always been the Parquet camp. After the merger, the strategic push for ORC effectively ended inside the combined company. Without a major commercial vendor advocating for it, ORC stopped gaining new mindshare.
3. Parquet + Arrow + the Python ecosystem. The rise of Apache Arrow and Python-native analytical tools (Pandas, Polars, DuckDB, Dask) cemented Parquet as the default on-disk format. Arrow's design was closely aligned with Parquet, and the Python data science community standardized on Parquet. ORC never had a comparable story in the Python/data-science world. For a generation of new data engineers, Parquet is simply the format they learned first.
4. Table formats built on Parquet, not ORC. Iceberg, Delta Lake, and Hudi all primarily use Parquet. Iceberg technically supports ORC (the sketch below shows how to opt in), but in practice the overwhelming majority of real Iceberg deployments use Parquet. When the three dominant lakehouse table formats all standardized on Parquet, ORC's relevance outside of legacy Hive environments dropped sharply.
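As a concrete illustration of just how opt-in ORC is in the table-format world, here is a hedged PySpark sketch, assuming a Spark session already wired to an Iceberg catalog; the catalog, namespace, and table names are invented. The write.format.default table property is Iceberg's documented switch between parquet (the default), orc, and avro.

```python
from pyspark.sql import SparkSession

# Assumes Spark is already configured with an Iceberg catalog named `catalog`.
spark = SparkSession.builder.getOrCreate()

# Iceberg writes Parquet unless you explicitly ask for ORC via the
# `write.format.default` table property.
spark.sql("""
    CREATE TABLE catalog.db.events (
        user_id BIGINT,
        event   STRING,
        ts      TIMESTAMP
    )
    USING iceberg
    TBLPROPERTIES ('write.format.default' = 'orc')
""")
```

That the default is parquet, and that almost no deployment flips this property, is the point: the modern table formats treat ORC as an option, not a peer.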
ORC is absolutely still in production at scale, chiefly in legacy Hive deployments and the Hadoop-lineage platforms that standardized on it years ago. It is telling that no modern greenfield lakehouse architecture picks ORC. The conversation in 2025–2026 is entirely Iceberg-on-Parquet or Delta-on-Parquet. ORC is preservation-mode technology.
The honest technical comparison is close to a wash: ORC still wins on some Hive-style scan workloads, Parquet on others. The gap is not technical. The gap is momentum. And momentum, in open source ecosystems, is everything.
If you are operating existing ORC data at scale, keep doing so — migrating for migration's sake is not worth it. If you are building anything new, pick Parquet. There is no architectural reason to choose ORC in 2026 unless you are specifically committed to the Hive ecosystem and Hive ACID tables. The format is not bad; the world has simply moved on.
TextQL Ana queries ORC-backed tables through whichever engine exposes them — typically Hive, Trino, Spark, or a warehouse with external ORC table support. Because ORC carries per-column statistics and strong type information, Ana gets the same structured grounding for LLM-generated SQL that it gets from Parquet-backed tables. The format underneath rarely matters to Ana; what matters is the schema and statistics exposed by the catalog above it.