NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →

Databricks

Databricks — the lakehouse company founded by the creators of Apache Spark. Started as a notebook platform for data science, now one of the two dominant enterprise data platforms alongside Snowflake.

Databricks is what you get when the people who invented Apache Spark start a company and then spend ten years convincing the enterprise that the data warehouse is not the right shape for modern data. The short pitch: Databricks is a lakehouse — a platform that gives you warehouse-style SQL performance and governance on top of open file formats sitting in your own cloud object store. The longer pitch is that Databricks is trying to be the single platform for every kind of data workload a large enterprise has: SQL analytics, machine learning, data science, streaming, and AI model training.

If Snowflake's metaphor is "a database you don't manage," Databricks's metaphor is "a giant computer you can use for anything data-related" — batch ETL, Python notebooks, production ML pipelines, SQL dashboards, Spark jobs, fine-tuning an LLM. All of it, on all your data, in one place.

This flexibility is Databricks's greatest strength and its oldest PR problem. Everything is possible, but nothing is as simple as it is in Snowflake — at least not until recently.

Origin Story: Spark, Berkeley, and the Original Lakehouse Bet

Databricks was founded in 2013 by seven people from UC Berkeley's AMPLab, the same research group that had already produced Apache Mesos (the cluster manager that prefigured Kubernetes-era datacenter orchestration). The founding team included:

  • Matei Zaharia — creator of Apache Spark during his PhD at Berkeley. Spark started as a research project in 2009, was open-sourced in 2010, and became a top-level Apache project in 2014. Zaharia is now Databricks's CTO.
  • Ali Ghodsi — originally head of engineering, became CEO in 2016. Swedish-Iranian, academic background, one of the most respected operators in data infrastructure.
  • Ion Stoica — Berkeley professor, also co-founder of Anyscale (Ray). Databricks's first CEO.
  • Reynold Xin, Patrick Wendell, Andy Konwinski, Arsalan Tavakoli-Shiraji — early Spark committers and the original engineering and go-to-market core.

The founding thesis was that Hadoop was too hard. In 2013, doing "big data" meant running a Hadoop cluster with HDFS, Hive, Pig, and half a dozen other Apache projects duct-taped together. Spark was faster (10–100x for iterative workloads, thanks to in-memory execution) and had much nicer developer ergonomics (DataFrames, Python and Scala APIs). Databricks started life as "managed Spark in a notebook" — a cleaner, hosted way to run Spark on your data in S3 without the Hadoop operational tax.

For the first four or five years, Databricks was primarily a data science and ML tool. Data scientists loved it; data warehouse buyers mostly ignored it. The pivot that made Databricks an enterprise data platform happened in 2018–2020 with two moves: Delta Lake and the lakehouse thesis.

The Lakehouse Thesis

In 2019, Databricks open-sourced Delta Lake, a table format that adds ACID transactions, schema enforcement, and time travel to Parquet files sitting in cloud object storage. In 2020, Zaharia and co-authors published a paper titled "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics," making the argument that the warehouse/lake split was an accident of history, not an architectural necessity.

The thesis, paraphrased: Warehouses are expensive, closed, and bad at ML/unstructured data. Lakes are cheap and open but bad at reliability and SQL. A lakehouse fuses the two — open file formats in object storage, plus a transactional layer (Delta Lake, later also Iceberg), plus a fast SQL engine (Photon), plus governance (Unity Catalog). You get warehouse-grade SQL and lake-grade flexibility in the same system, on the same copy of data, with no proprietary storage format.
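The transactional layer is the load-bearing idea here, and it is mechanically simple: Delta Lake records every change to a table as an ordered commit in a log directory next to the data files, so the table's state at any version is just a replay of the log up to that point. The toy sketch below illustrates that idea only; it is not the real Delta protocol (real commits are JSON actions like add/remove-file in `_delta_log/`, with Parquet data files, checkpoints, and optimistic concurrency control), and the `ToyDeltaTable` class is invented for illustration:

```python
# Illustrative sketch of a Delta-style transaction log (NOT the real protocol).
# Idea: each commit is an ordered JSON file; table state at version N is the
# replay of commits 0..N, which is exactly what makes "time travel" cheap.
import json
import os
import tempfile

class ToyDeltaTable:
    def __init__(self, path):
        # Mirrors the _delta_log/ directory Delta keeps beside the data files.
        self.log_dir = os.path.join(path, "_delta_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def commit(self, rows_added):
        # Zero-padded filenames so lexicographic sort equals version order.
        version = len(os.listdir(self.log_dir))
        commit_file = os.path.join(self.log_dir, f"{version:020d}.json")
        with open(commit_file, "w") as f:
            json.dump({"add": rows_added}, f)
        return version

    def read(self, version=None):
        # Replay the log; truncating the replay at an older version
        # reconstructs the table as of that version (time travel).
        commits = sorted(os.listdir(self.log_dir))
        if version is not None:
            commits = commits[: version + 1]
        rows = []
        for c in commits:
            with open(os.path.join(self.log_dir, c)) as f:
                rows.extend(json.load(f)["add"])
        return rows

t = ToyDeltaTable(tempfile.mkdtemp())
t.commit([{"id": 1, "status": "new"}])
t.commit([{"id": 2, "status": "shipped"}])
print(len(t.read()))           # current state: 2 rows
print(len(t.read(version=0)))  # time travel to version 0: 1 row
```

Because the log is the source of truth and the data files are immutable, readers get consistent snapshots without locking writers, and old versions stay queryable until they are vacuumed.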

This was a direct shot at Snowflake, and it worked. "Lakehouse" went from a Databricks marketing term in 2020 to an industry-standard category by 2023. Every major vendor now claims to be lakehouse-compatible, even the ones who spent years calling it a made-up word.

Architecture: What Databricks Actually Is

A Databricks workspace has four layers worth understanding.

1. Open storage (your cloud object store). Data lives as Parquet files in S3, ADLS, or GCS, organized as Delta Lake tables (or increasingly Iceberg tables). The storage is in your own cloud account — Databricks does not host your data. This is a huge philosophical and commercial difference from Snowflake.

2. Compute clusters. You run workloads on ephemeral Spark clusters. Historically these were general-purpose Spark clusters; today they come in several flavors:

  • All-purpose clusters — interactive notebooks, data science.
  • Job clusters — one-shot compute for scheduled ETL jobs.
  • SQL Warehouses — serverless, auto-scaling SQL-only compute backed by Photon (Databricks's C++ vectorized engine). This is Databricks's direct answer to Snowflake virtual warehouses.
  • Serverless compute — launched 2024, cluster startup in seconds instead of minutes, a major usability improvement.

3. Unity Catalog. A unified governance layer over all tables, files, models, and notebooks. Provides access control, lineage, auditing, and cross-workspace sharing. Unity Catalog was Databricks's answer to the critique that lakes had no real governance; launched in 2022, it's now the center of gravity of the platform.

4. Workloads on top. MLflow (the de facto open-source ML experiment tracker, originated at Databricks), Databricks SQL, Databricks Workflows (orchestration), Model Serving, Mosaic AI (post-MosaicML acquisition, 2023, $1.3B), Genie and AI/BI Dashboards, Databricks Apps, Delta Live Tables for declarative pipelines, and Databricks Connect.

The mental model: one platform where a data engineer can write a dbt job, a data scientist can train an XGBoost model on the same table, and a BI analyst can hit it from a dashboard, all without copying the data anywhere.
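What makes that mental model work is Unity Catalog's three-level namespace (catalog.schema.table) with privilege inheritance: a grant on a parent object flows down to its children, so one policy can cover every workload touching the same data. The sketch below is a simplified illustration of that inheritance idea only, not the real system (in practice you use SQL GRANT statements or the Unity Catalog API, and SELECT on a table additionally requires USE CATALOG and USE SCHEMA on the parents); the `grants` table and helper are invented:

```python
# Simplified sketch of Unity Catalog-style privilege inheritance
# (illustrative only; real UC uses SQL GRANTs and also requires
# USE CATALOG / USE SCHEMA on parent objects).
grants = {
    # (principal, securable object) -> set of privileges
    ("analysts", "sales"): {"USE CATALOG"},
    ("analysts", "sales.q4"): {"USE SCHEMA", "SELECT"},
}

def can_select(principal, table_fqn):
    """SELECT is allowed if granted on the table itself or inherited
    from its schema or catalog in the three-level namespace."""
    catalog, schema, _table = table_fqn.split(".")
    for scope in (catalog, f"{catalog}.{schema}", table_fqn):
        if "SELECT" in grants.get((principal, scope), set()):
            return True
    return False

print(can_select("analysts", "sales.q4.orders"))  # True, via the schema grant
print(can_select("analysts", "sales.q1.orders"))  # False, no grant in scope
```

The practical consequence: the data engineer's pipeline, the data scientist's training job, and the BI analyst's dashboard all resolve against the same grant, which is why Unity Catalog is the platform's center of gravity rather than a bolt-on.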

Vendor Positioning: Databricks's Worldview

Databricks's official pitch is that the world is moving from warehouses to lakehouses, and every other vendor is either catching up to that (Snowflake adding Iceberg) or irrelevant (legacy warehouses). They're explicit about their rivals:

  • vs Snowflake. This is the defining rivalry of the data infrastructure world in 2020–2026. Databricks frames Snowflake as a closed, proprietary warehouse with a "taxi meter" pricing model that penalizes you for using your own data. Snowflake frames Databricks as a messy notebook tool bolting on SQL as an afterthought. Both are partially right and partially self-serving. The honest comparison: Databricks is ahead on ML, unstructured data, and openness. Snowflake is ahead on pure SQL usability, workload isolation, and the "it just works" feel. Large enterprises increasingly run both.
  • vs BigQuery. Less direct competition. Databricks acknowledges BigQuery is excellent within GCP but argues its proprietary storage and GCP lock-in limit it.
  • vs Cloudera / Hadoop. Effectively a won war. Databricks replaced Cloudera in most enterprises.
  • vs OpenAI / Anthropic for AI. Databricks has increasingly positioned itself as the enterprise AI platform — the place where your proprietary data meets LLM training and serving. Mosaic AI and the 2024 acquisition of Tabular (led by Iceberg co-creator Ryan Blue) both support this.

The Tabular acquisition (June 2024, ~$1–2B) was particularly notable: it brought the creators of Apache Iceberg in-house, signaling that Databricks is committing to Iceberg as a first-class format alongside Delta Lake, and trying to become the dominant lakehouse across both table formats. The open-source governance of Iceberg and Delta is converging under Databricks's influence.

Where Databricks's pitch is self-serving: the operational complexity is real. Running Databricks well requires genuine engineering skill. The "one platform for everything" pitch assumes you have the team to use it. Small companies often find Snowflake dramatically simpler to adopt.

What Databricks Is Good At (and Not)

Good at:

  • ML and data science at scale. Nothing else comes close. MLflow, Feature Store, Model Serving, notebook ergonomics, GPU cluster support, Mosaic AI for LLMs — Databricks is the clear leader for production ML and increasingly for LLM/AI workloads.
  • Unstructured and semi-structured data. Images, PDFs, audio, nested JSON — all comfortable in Databricks, awkward in a traditional warehouse.
  • Openness and portability. Your data lives in your cloud account in open Parquet + Delta (or Iceberg) format. If you ever leave Databricks, your data is still usable. This is a genuine, verifiable difference from Snowflake's native format.
  • Streaming. Structured Streaming and Delta Live Tables are mature, battle-tested streaming systems.
  • Cost at high utilization. Photon + well-tuned clusters can be materially cheaper than Snowflake for large, steady workloads.
  • Multi-cloud. First-class on AWS, Azure, and GCP. Azure Databricks in particular is a first-party Microsoft service, which matters enormously for large Microsoft-aligned enterprises.

Bad at (or honest weaknesses):

  • Simplicity for pure SQL/BI teams. If all you want is a SQL warehouse for BI, Snowflake is simpler to learn, adopt, and operate. Databricks SQL has closed much of the gap, but the platform's breadth still shows as complexity.
  • Cluster management legacy. Even with Serverless, Databricks's historical mental model ("you run clusters") still colors parts of the product. Cold-start latency on classic clusters is minutes, not seconds.
  • Governance maturity. Unity Catalog is very good but younger than Snowflake's governance layer. For enterprises, migrating from legacy workspace-scoped permissions to Unity Catalog is still a meaningful project.
  • Cost opacity. Pricing in DBUs (Databricks Units) plus cloud compute, storage, and network egress makes cost attribution genuinely hard. Many customers use third-party FinOps tools just to understand their Databricks bill.
  • Small team usability. For a 3-person startup data team, Databricks is overkill. BigQuery or Snowflake is usually a better starter warehouse.

Where the Puck Is Going

Databricks's strategy for the next phase is legible and ambitious:

  1. Be the enterprise AI platform. Databricks is betting that the future of AI in the enterprise is "run LLMs on top of your own governed data, in your own cloud account." Mosaic AI, Model Serving, Vector Search, AI Functions, and Genie are all bets on this.
  2. Own the open table format layer. With Delta Lake and (post-Tabular) Iceberg, Databricks is positioning itself as the vendor that champions open formats — and benefits most from customers adopting them.
  3. Win the IPO narrative. Databricks has repeatedly delayed an IPO but is expected to eventually go public as one of the most anticipated software listings of the decade. Revenue growth and AI positioning are the headline numbers.
  4. Unified governance. Unity Catalog is being extended to cover models, files, dashboards, and third-party data, becoming the enterprise control plane.

The honest view: Databricks and Snowflake are converging on the same product — a unified data + AI platform with open-format storage, SQL and Python workloads, governance, and AI primitives. The interesting question of 2026–2028 is whether enterprises pick one, run both, or whether a smaller open-source-native alternative disrupts them both from below.

TextQL and Databricks

TextQL Ana connects to Databricks via the SQL Warehouse endpoint and Unity Catalog, respecting table, row, and column-level access policies defined in Unity Catalog. Because Databricks holds so many different data shapes in one place — warehouse tables, Delta tables, streaming tables, and ML features — TextQL can reason across a broader surface than on a pure SQL warehouse. For customers running Databricks as their central platform, TextQL inherits the full Unity Catalog lineage and permission model, meaning business users get natural-language analytics with the same governance that covers data engineers and data scientists.

See TextQL in action

See TextQL in action

Databricks
Founded 2013
HQ San Francisco, CA
Founders Ali Ghodsi, Matei Zaharia, Reynold Xin, Ion Stoica, Patrick Wendell, Andy Konwinski, Arsalan Tavakoli-Shiraji
Status Private (last valuation ~$62B, 2024)
Category Lakehouse / Data + AI Platform
Runs on AWS, Azure, GCP
Key tech Apache Spark, Delta Lake, MLflow, Unity Catalog
Scale ~12K customers, ~500K monthly users (~40 per customer); ~$3B ARR; dominant for ML/data engineering