NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →

HDFS

HDFS (Hadoop Distributed File System) was the distributed file system at the heart of the Hadoop era. Released in 2006 and modeled on Google's GFS paper, it powered the first generation of big data analytics and is now largely obsolete, replaced by cloud object storage.

HDFS is the distributed file system at the heart of Apache Hadoop, and it was the storage layer of the first big data revolution. Between roughly 2008 and 2016, HDFS was the place petabytes of analytical data lived. Every company that wanted to run "big data" had a Hadoop cluster, and every Hadoop cluster had HDFS underneath. Cloudera, Hortonworks, and MapR sold distributions of Hadoop that bundled HDFS as the foundation. The pre-Snowflake, pre-cloud data warehouse era was, to a first approximation, the HDFS era.

In 2026, HDFS is functionally dead. It still exists. It still runs in some on-premise enterprise environments (especially in regulated industries and certain government deployments). The Apache project is still maintained. But for any new data architecture being built today, the answer is not HDFS — it is object storage on S3, GCS, or ADLS Gen2. The reasons for HDFS's decline are worth understanding, because they explain a lot about how the modern data stack came to look the way it does.

What HDFS Actually Was

HDFS is a distributed file system designed to store extremely large files across a cluster of commodity machines, with built-in fault tolerance through replication. The architecture had two main components:

  • A NameNode — a single (later HA-paired) master server that held the directory tree and the mapping from file paths to block locations.
  • Many DataNodes — worker machines that held the actual file blocks (typically 128 MB or 256 MB chunks). Each block was replicated three times across different DataNodes for fault tolerance.

When you wrote a 1 GB file to HDFS, it got chopped into eight 128 MB blocks, each block was written to three DataNodes, and the NameNode recorded where everything went. When you read the file, the NameNode told you which DataNodes had which blocks, and you read them in parallel. The whole design was built around one principle: bring the compute to the data, not the data to the compute. MapReduce jobs scheduled tasks on the same machines that held the relevant blocks, so data rarely had to cross the network.
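The write path above can be sketched as a toy block planner. This is an illustration only, not the real NameNode logic (which lives in Java inside Apache Hadoop); the 128 MB block size and 3x replication factor are the classic defaults, and the node names are made up:

```python
import math
import random

BLOCK_SIZE = 128 * 1024 * 1024   # classic HDFS default block size
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size, datanodes):
    """Toy sketch: split a file into fixed-size blocks and assign
    each block to REPLICATION distinct DataNodes."""
    num_blocks = math.ceil(file_size / BLOCK_SIZE)
    plan = []
    for block_id in range(num_blocks):
        # Real HDFS placement is rack-aware; random choice is a stand-in.
        replicas = random.sample(datanodes, REPLICATION)
        plan.append((block_id, replicas))
    return plan

# A 1 GB file across a hypothetical 6-node cluster:
nodes = [f"datanode-{i}" for i in range(6)]
plan = plan_blocks(1024 * 1024 * 1024, nodes)
print(len(plan))  # 8 blocks of 128 MB each
```

On a read, the client would consult this block-to-replica map and pull blocks from the nearest replicas in parallel, which is exactly the metadata role the NameNode played.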

In 2006, when network bandwidth inside data centers was the bottleneck, this design was brilliant. By the late 2010s, when 25 Gbps inside a cloud data center became cheap and S3 could serve many GB/s of aggregate throughput, the design's whole reason for existing had evaporated.

The Origin Story

HDFS was created in 2006 by Doug Cutting and Mike Cafarella as part of Apache Hadoop, which was itself an open-source reimplementation of the architecture described in two Google papers:

  • The Google File System (2003), which described GFS, the distributed file system that stored Google's web crawl.
  • MapReduce: Simplified Data Processing on Large Clusters (2004), which described Google's distributed compute framework.

Cutting was working on the Nutch web crawler at the time and needed a way to store and process the entire web. Reading the GFS and MapReduce papers, he realized Google had already solved the problem. He and Cafarella set out to build an open-source version. The result — Hadoop (named after Cutting's son's stuffed elephant) — became one of the most successful open-source projects in history.

Yahoo picked up Hadoop early and ran it at massive scale. Facebook built Hive on top of it. Cloudera was founded in 2008 to commercialize Hadoop. Hortonworks followed in 2011. By 2014, Hadoop was the default infrastructure for "big data," and Cloudera and Hortonworks were two of the most-watched private companies in tech.

What Killed HDFS

Three things killed HDFS, and they all happened at roughly the same time:

1. S3 got good and cheap. By the mid-2010s, S3 was durable, fast, and cheap enough that storing your data in S3 made more sense than running your own HDFS cluster. You didn't have to operate NameNodes. You didn't have to balance disks. You didn't have to worry about three-way replication eating your storage budget. You paid Amazon and the storage just worked. The network bandwidth that HDFS was originally designed to avoid was now abundant inside the cloud, so the "compute close to data" advantage stopped mattering.

2. The compute/storage decoupling pattern won. HDFS's architecture welded compute and storage together: a Hadoop cluster was both. This meant scaling storage required adding compute, and scaling compute required adding storage. Cloud warehouses like Snowflake and BigQuery proved that decoupling compute from storage was the right architecture: store everything cheaply in object storage, spin up compute on demand, scale them independently. HDFS could not do this without a fundamental redesign.

3. Spark replaced MapReduce, and Spark didn't care. Apache Spark, which reached 1.0 in 2014, replaced MapReduce as the dominant distributed processing engine. Spark was happy to run against any storage backend — HDFS, S3, GCS, whatever — and most users picked S3 because it was easier to operate. The tight coupling between Hadoop and HDFS that had defined the early ecosystem dissolved.
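Spark's indifference to the storage backend shows up directly in its read API: the backend is just a URI scheme on the path. A hypothetical fragment (the cluster addresses and bucket name are made up, and reading `s3a://` paths requires the hadoop-aws connector on the classpath):

```python
# With PySpark, swapping storage backends is a one-character-class change:
df_hdfs = spark.read.parquet("hdfs://namenode:8020/warehouse/events/")
df_s3   = spark.read.parquet("s3a://example-bucket/warehouse/events/")
```

Nothing else in the job changes, which is why users drifted to S3 as soon as it was the easier backend to operate.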

By 2018, the writing was on the wall. Cloudera and Hortonworks merged in 2019, an explicit acknowledgment that the standalone Hadoop market was shrinking. The merged company pivoted hard toward cloud and "data platform" positioning. By 2024, almost no new analytical infrastructure was being built on HDFS.

The Opinionated Take: HDFS Is Dead, Long Live Its Influence

HDFS as a product is dead. HDFS as an idea is everywhere. Almost every concept in modern data infrastructure — distributed storage, replication for durability, schema-on-read, scale-out compute, the very phrase "data lake" — was either invented or popularized in the Hadoop era. The lakehouse architecture is, in some sense, "what we wanted Hadoop to be, finally built right on top of cloud object storage." Delta Lake, Iceberg, and Hudi are all conscious responses to the limitations of writing Parquet files directly to a file system without ACID guarantees — a problem that was first felt acutely on HDFS.

The honest assessment of why HDFS lost: it was a brilliant solution to a 2006 problem and a clumsy solution to a 2020 problem. It made the right tradeoffs for cheap commodity hardware in a single rack with 1 Gbps networking. It made the wrong tradeoffs for elastic cloud infrastructure with abundant network bandwidth. The technology was a victim of its own assumptions.

The other lesson, and this is one of the more important lessons in infrastructure history: the storage layer should outlive the compute layer. HDFS's mistake was being a storage system inseparable from a specific compute stack (MapReduce, and later the YARN-scheduled Hadoop ecosystem). S3's genius was being a storage system that didn't care what compute you ran on top of it. Twenty years later, S3 is everywhere and HDFS is in maintenance mode.

Where HDFS Still Matters

If you encounter HDFS in 2026, it's almost always one of these situations:

  • On-premise legacy systems. Banks, telcos, government agencies, and other regulated enterprises that built Hadoop clusters in 2014-2018 and have not yet migrated.
  • Cloudera Data Platform deployments that still use HDFS as the underlying file system, often as part of a hybrid cloud strategy.
  • Specialized HPC and research environments where extreme parallel I/O requirements still favor a tightly coupled file system.
  • Historical reading, when trying to understand why the modern data stack looks the way it does.

How TextQL Works with HDFS

TextQL Ana connects to HDFS the same way it connects to any other storage backend — through the query engine that sits above it. If a customer is running Hive, Impala, or Spark SQL against HDFS-resident data, Ana can issue queries through that engine and return natural-language answers. In practice, the vast majority of TextQL deployments are against cloud warehouses or lakehouses; HDFS deployments are rare and almost always part of a legacy modernization conversation.

See TextQL in action


HDFS
Full name: Hadoop Distributed File System
First release: 2006 (as part of Apache Hadoop)
Created by: Doug Cutting and Mike Cafarella
Inspired by: the Google File System (GFS) paper, 2003
License: Apache 2.0
Category: Storage / Data Lake (legacy)
Status: Legacy; largely replaced by cloud object storage
Monthly mindshare: ~200K · Hadoop legacy; mostly large enterprises and on-prem; declining