# Data Lakehouse
The lakehouse architecture combines the flexibility of data lakes with the performance of data warehouses — or at least, that's the pitch. Here's what it actually means.
A data lakehouse is an architecture that attempts to combine the cheap, scalable, schema-agnostic storage of a data lake with the structured querying, ACID transactions, and governance features of a data warehouse. The term was coined by Databricks in 2020 to describe their platform and position it as the successor to both traditional data lakes and cloud warehouses.
In plain English: imagine you had a storage unit where you could throw in anything — files, images, logs, CSVs, Parquet files, whatever. That's a data lake. Now imagine you added shelving, labels, a checkout system, and a rule that says nobody can move two things at once without signing a ledger. That's basically what a lakehouse is — your storage unit got organized without losing the ability to throw anything in there.
Whether "lakehouse" describes a genuinely new architecture or is primarily a marketing term depends heavily on who you ask — and what they're selling.
The history matters because "data lakehouse" is arguably the most vendor-loaded term in the entire data ecosystem.
Databricks coined it. In January 2020, Databricks co-founders Ali Ghodsi and Matei Zaharia published a paper and began using "lakehouse" to describe an architecture where you store everything in open formats on cheap object storage (like S3) but layer on warehouse-grade features — transactions, indexing, SQL access. The implicit argument: you don't need a separate warehouse like Snowflake. The lake becomes your warehouse.
This was a strategic move. Databricks had built its business on Apache Spark and data lakes. Snowflake had built its business on cloud warehouses. By coining "lakehouse," Databricks reframed the conversation: warehouses were legacy; the lakehouse was the evolution.
If you ask Databricks, a lakehouse is the natural evolution beyond warehouses — why pay to copy data into a proprietary warehouse when your lake can do the same job?
If you ask Snowflake, they'd argue (and do) that their warehouse already does everything a lakehouse claims to do; they just don't use the buzzword. External tables, Iceberg support, and their storage layer, the argument goes, already provide lake-like flexibility without sacrificing warehouse performance.
If you ask Google, they'll show you that BigQuery has quietly added lakehouse features — BigLake, native Iceberg support, open storage connectors — without ever centering their marketing around the term.
The honest framing: "lakehouse" is roughly 50% genuine architectural pattern and 50% competitive positioning. That doesn't make it meaningless — the architecture is real — but you should understand the term's origin before evaluating vendor claims.
You've almost certainly seen the Databricks three-column comparison: Data Warehouse vs. Data Lake vs. Data Lakehouse, typically showing how warehouses have structure but are expensive and siloed, lakes are cheap but chaotic, and lakehouses magically get the best of both.
What that diagram gets right:
What it oversimplifies:
The diagram is useful as a mental model for understanding the motivation behind lakehouse architecture. It's less useful as an objective comparison of what modern platforms actually do.
Technically, a lakehouse adds four categories of capability on top of a raw data lake. These are the features that turn a pile of files into something you can query like a warehouse:
### 1. ACID Transactions
Raw data lakes have no transactional guarantees. If two processes write to the same dataset simultaneously, you can end up with corrupted or partial data. Lakehouse architectures add ACID (Atomicity, Consistency, Isolation, Durability) transactions through open table formats like Delta Lake, Apache Iceberg, or Apache Hudi. These formats maintain a transaction log alongside the data files, ensuring that reads and writes are consistent.
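To make the transaction-log idea concrete, here is a deliberately tiny sketch in plain Python. It is not Delta Lake's actual protocol, just an illustration of the core trick: data files are written first, and a commit only becomes visible once a numbered log entry is created. Using exclusive-create (`open(..., "x")`) means two concurrent writers cannot claim the same version, which is a toy form of optimistic concurrency control.

```python
import json
import os
import tempfile

class ToyTransactionLog:
    """Toy sketch of a lakehouse-style transaction log (illustrative only)."""

    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def commit(self, added_files):
        """Atomically record that `added_files` now belong to the table."""
        while True:
            versions = self._versions()
            version = versions[-1] + 1 if versions else 0
            entry = os.path.join(self.log_dir, f"{version:08d}.json")
            try:
                with open(entry, "x") as f:   # "x" fails if this version is taken
                    json.dump({"add": added_files}, f)
                return version
            except FileExistsError:
                continue                      # lost the race; retry with next version

    def snapshot(self):
        """Replay the log in order to get the current set of live data files."""
        files = []
        for v in self._versions():
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as f:
                files.extend(json.load(f)["add"])
        return files

# Demo: two commits, then read a consistent snapshot.
table_dir = tempfile.mkdtemp()
log = ToyTransactionLog(table_dir)
v0 = log.commit(["part-00000.parquet"])
v1 = log.commit(["part-00001.parquet"])
live_files = log.snapshot()
```

A reader that replays the log up to version N sees a consistent snapshot no matter what writers are doing, which is the property raw file listings on S3 cannot give you.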
### 2. Schema Enforcement and Evolution
A data lake will happily accept a CSV with 12 columns today and 15 columns tomorrow with no warning. Lakehouse table formats enforce schemas — they validate that incoming data matches the expected structure — while also supporting controlled schema evolution (adding a column, changing a type) without breaking existing queries.
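A minimal sketch of that behavior, again in plain Python rather than any real table format's API: unknown columns are rejected at write time, missing columns are treated as nullable (so old data keeps working), and evolution is an explicit, controlled operation.

```python
class ToyTableSchema:
    """Toy sketch of schema enforcement plus controlled evolution."""

    def __init__(self, fields):
        self.fields = dict(fields)  # column name -> expected Python type

    def validate(self, row):
        """Reject unknown columns; type-check the ones that are present."""
        unknown = set(row) - set(self.fields)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        for name, value in row.items():
            if value is not None and not isinstance(value, self.fields[name]):
                raise TypeError(f"{name}: expected {self.fields[name].__name__}")

    def add_column(self, name, col_type):
        """Explicit, controlled schema evolution: new columns are nullable."""
        if name in self.fields:
            raise ValueError(f"{name} already exists")
        self.fields[name] = col_type

# Demo: the surprise 13th column is rejected until the schema evolves.
schema = ToyTableSchema({"id": int, "amount": float})
schema.validate({"id": 1, "amount": 9.99})          # conforming row passes
try:
    schema.validate({"id": 2, "amount": 1.0, "country": "DE"})
    rejected = False
except ValueError:
    rejected = True                                  # extra column rejected
schema.add_column("country", str)                    # evolve deliberately
schema.validate({"id": 3, "amount": 2.5, "country": "DE"})
accepted_after_evolution = True
```

Real formats store the schema (and its version history) in table metadata, so every engine reading the table enforces the same rules.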
### 3. Indexing and Query Performance
Raw Parquet files on S3 are scannable, but slowly. Lakehouse architectures add data skipping, Z-ordering, compaction, and various indexing strategies so that a SQL query doesn't have to read every file in a dataset. This is what closes the performance gap with traditional warehouses.
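Data skipping is the easiest of these to illustrate. The sketch below assumes each file carries min/max statistics for a column, as Parquet footers and lakehouse manifests do; a query with a range predicate consults the stats and skips any file whose range cannot contain matching rows.

```python
# Toy file-level statistics, as a table format's metadata might record them.
data_files = [
    {"path": "part-0.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "part-1.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "part-2.parquet", "min_ts": 300, "max_ts": 399},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate [lo, hi]."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# A query like: WHERE ts BETWEEN 250 AND 320
scanned = files_to_scan(data_files, 250, 320)  # part-0 is skipped entirely
```

On a table with thousands of files, pruning at the metadata level before any bytes are read is where much of the warehouse-like performance comes from; Z-ordering and compaction exist largely to make these min/max ranges tighter and more skippable.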
### 4. Governance and Access Control
Data lakes historically had file-level permissions at best — you could control who accesses a bucket, but not who can see column X in table Y. Lakehouse platforms add fine-grained access control, data lineage, audit logging, and catalog integration (e.g., Unity Catalog for Databricks, or open catalogs like Apache Polaris for Iceberg).
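The shift from file-level to column-level permissions can be sketched in a few lines. This is a toy model, not how Unity Catalog or Polaris actually represent grants, but it shows the granularity difference: the policy names tables and columns, not buckets.

```python
# Toy catalog grants: (principal, table) -> set of allowed columns.
grants = {
    ("analyst", "sales"): {"order_id", "amount"},  # deliberately excludes "email"
    ("admin", "sales"): {"*"},
}

def authorize(principal, table, columns):
    """Allow a query only if every requested column is granted."""
    allowed = grants.get((principal, table), set())
    if "*" in allowed:
        return True
    return set(columns) <= allowed

ok = authorize("analyst", "sales", ["order_id", "amount"])  # permitted
denied = authorize("analyst", "sales", ["email"])           # blocked
```

A bucket policy can only answer "can this user read these files?"; a catalog sits between the query engine and the storage so it can answer per-column questions like the one above, and log every answer for audit.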
If a platform provides all four on top of open-format storage, it fits the lakehouse definition regardless of what the vendor calls it.
Here's the part that most vendor marketing won't tell you: the lakehouse "won" the narrative war, but architecturally, every major platform has converged to roughly the same place. The distinction between warehouse and lakehouse is increasingly academic.
| Vendor | Originally | What They've Added | Lakehouse? |
|---|---|---|---|
| Databricks | Spark-based data lake platform | SQL warehouses, Unity Catalog, serverless SQL, BI integrations | Yes (they coined the term) |
| Snowflake | Cloud data warehouse | Iceberg tables, external tables, Snowpark for Python/ML, unstructured data support | Functionally yes, though they avoid the term |
| Google BigQuery | Cloud data warehouse | BigLake, native Iceberg/Delta support, object table queries, open storage | Functionally yes, marketed as "data cloud" |
| Amazon | Redshift (warehouse) + S3 + Glue (lake) | Redshift Spectrum, Lake Formation, zero-ETL integrations, Iceberg support | Converging, sold as multiple services |
| Microsoft | Azure Synapse + ADLS | Microsoft Fabric (OneLake), Delta Lake native, unified analytics | Yes, via Fabric |
The pattern is clear: warehouse vendors added lake capabilities (open formats, unstructured data, external tables). Lake vendors added warehouse capabilities (SQL engines, transactions, governance). They met in the middle.
When does the distinction actually matter in practice? Mostly when your workloads go beyond SQL on structured data: machine learning over unstructured files, multiple engines reading the same tables, or a hard requirement that your data stay in open formats you control rather than a proprietary warehouse format.
When is it just branding? If you're running SQL dashboards on structured data in Snowflake and someone tells you that you need to "migrate to a lakehouse," you probably don't. The workload hasn't changed — only the buzzword.
Strip away the marketing and a lakehouse is really two things:

1. An open table format (Delta Lake, Apache Iceberg, or Apache Hudi) that adds a transaction log, schema, and file-level statistics on top of cheap object storage.
2. One or more query engines (Spark, a warehouse's SQL engine, or anything else that can read the format's metadata) that plan and serve queries against those tables.
That's it. Everything else — the branding, the diagrams, the three-column comparisons — is packaging around these two components. If you understand table formats and query engines, you understand lakehouses.
The industry's current trajectory suggests that Apache Iceberg is becoming the dominant open table format (with Snowflake, AWS, Google, and even Databricks adding Iceberg support alongside Delta Lake), which further erodes any meaningful distinction between "lakehouse" and "modern warehouse."
TextQL sits above the lakehouse layer entirely. Whether your data lives in a Databricks lakehouse, Snowflake warehouse, BigQuery, or a combination of all three, TextQL Ana connects to each and lets teams query across them as a single unified asset. The lakehouse-vs-warehouse distinction is invisible to end users — which, in a way, proves the point that the distinction is increasingly just plumbing.
See TextQL in action