NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →

Apache Spark (Databricks)

Apache Spark is the distributed processing engine that Databricks was founded to commercialize: the original foundation of the lakehouse and still the engine behind most non-SQL Databricks workloads.

Apache Spark is a distributed compute engine for processing large datasets in parallel across a cluster of machines. You give it data (in object storage, in a database, in Kafka), you give it a transformation (a SQL query, a DataFrame pipeline, a Python function, a streaming job), and it splits the work across many nodes, runs it, and gives you the result. It is the engine that does the actual computing inside almost every Databricks product.

The simple metaphor: Spark is a factory floor for data. Raw materials come in (your tables, files, streams), workstations (executors on cluster nodes) each handle a piece of the job in parallel, and finished goods come out (a transformed table, a model, a stream of events). Databricks built the factory; Spark is the assembly line.

Origin Story

Spark was created in 2009–2010 by Matei Zaharia as a PhD project at the UC Berkeley AMPLab (the same lab that gave us Mesos, Tachyon/Alluxio, and a startling fraction of the modern open-source data stack). The original motivation was practical and pointed: Hadoop MapReduce was too slow for iterative algorithms. A typical machine learning training run involved dozens of passes over the same dataset, and MapReduce wrote intermediate results to disk between every pass. For an iterative job, that was catastrophic.

Spark's key insight was to keep intermediate data in memory across stages, structured as an immutable distributed collection called a Resilient Distributed Dataset (RDD). The result was 10–100x speedups over MapReduce on iterative workloads. Spark went open source in 2010, became an Apache Top-Level Project in 2014, and almost immediately started displacing MapReduce as the default Hadoop-era compute engine.
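The difference is easy to feel in miniature. This is a toy pure-Python analogy, not Spark code: an iterative job that re-reads and re-parses its input from disk on every pass (the MapReduce pattern) versus one that parses once and iterates over an in-memory collection (the RDD pattern). All names here are illustrative.

```python
import os
import tempfile

def make_dataset(path, n=100_000):
    """Write a simple one-number-per-line dataset to disk."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")

def iterate_from_disk(path, passes=5):
    """MapReduce-style: every pass pays the full I/O and parse cost again."""
    total = 0
    for _ in range(passes):
        with open(path) as f:
            total += sum(int(line) for line in f)
    return total

def iterate_cached(path, passes=5):
    """RDD-style: parse once, keep the collection in memory across passes."""
    with open(path) as f:
        cached = [int(line) for line in f]
    return sum(sum(cached) for _ in range(passes))

path = os.path.join(tempfile.mkdtemp(), "data.txt")
make_dataset(path)
assert iterate_from_disk(path) == iterate_cached(path)
```

Same answer either way; the cached version simply stops paying the per-pass disk tax, which is where the 10–100x on iterative workloads came from.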

Databricks was founded in 2013 by Zaharia and his AMPLab collaborators (Ali Ghodsi, Ion Stoica, Reynold Xin, Patrick Wendell, Andy Konwinski, Arsalan Tavakoli) explicitly to commercialize Spark. The company was unusual in that the founders kept Spark itself open source under Apache governance while building a managed Spark service as the commercial product. That bet — "open core engine, proprietary platform around it" — is the reason Databricks exists as a $40B+ company today.

How Spark Works

Spark has a few layers worth knowing:

Cluster model. A Spark application has one driver (the brain that builds the query plan) and many executors (workers running on cluster nodes). The driver splits work into tasks, sends them to executors, and collects results. Modern Spark on Databricks adds a layer above this — the Photon engine — which replaces parts of the JVM-based execution path with vectorized C++ code for massive speedups on SQL and DataFrame workloads.
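The driver/executor split can be sketched in plain Python — this is a toy model of the control flow, not Spark code, using a thread pool to stand in for executors. The "driver" partitions the data into tasks, the "executors" run them in parallel, and the driver merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_tasks):
    """Driver-side: split the dataset into roughly equal task inputs."""
    chunk = max(1, len(data) // num_tasks)
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def run_task(chunk):
    """Executor-side: process one partition independently of the others."""
    return sum(x * x for x in chunk)

data = list(range(1_000))
tasks = partition(data, num_tasks=4)

# The pool plays the role of the executors; map() plays task scheduling.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(run_task, tasks))

result = sum(partial_results)  # driver collects and combines partial results
```

Real Spark adds the hard parts this sketch skips — shuffles between stages, fault tolerance via lineage, and data locality — but the driver-plans / executors-compute / driver-collects shape is the same.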

APIs. Spark exposes itself in several flavors that target different audiences:

  • Spark SQL — the SQL interface, used by analysts and BI tools and the entire Databricks SQL Warehouse product. Spark SQL is also what dbt-on-Databricks compiles to.
  • DataFrame API (PySpark, Spark Scala, Spark R) — the Python/Scala/R DataFrame interface, which is functionally a typed wrapper over Spark SQL and is the most common way data engineers use Spark.
  • Structured Streaming — a streaming API that treats a stream as an unbounded table you query incrementally. It is the canonical way to do streaming on Databricks today (the older "DStreams" API is legacy and no longer recommended).
  • MLlib — a built-in distributed machine learning library, increasingly displaced in practice by other ML frameworks running on Spark clusters.
  • GraphX — a graph processing library, niche and largely superseded by GraphFrames and external graph systems.

Catalyst and Tungsten. Two long-standing internal projects worth naming. Catalyst is Spark's query optimizer; it takes your DataFrame or SQL code and rewrites it into an efficient execution plan. Tungsten was the project that overhauled Spark's memory and code-generation paths starting in Spark 1.5, dramatically improving JVM performance. Photon is the spiritual successor: a C++ rewrite of the execution layer that ships only inside Databricks (it is not open source) and is what powers Databricks SQL Warehouses and the modern Databricks runtime.

What Spark Is Good At

  • Large-scale ETL. Multi-terabyte joins, aggregations, and transformations that don't fit on a single machine.
  • Streaming pipelines. Structured Streaming + Delta Lake is one of the most battle-tested combinations for production streaming ETL.
  • Distributed feature engineering and ML training data prep. Spark is the workhorse for building training datasets at petabyte scale.
  • Notebook-based exploration of huge datasets. PySpark in a Databricks notebook is the canonical workflow for data scientists working with data that doesn't fit in pandas.
  • Polyglot teams. Python, Scala, SQL, and R users can all work against the same engine and the same data.

What Spark Is Not Good At

  • Small-data interactivity. Spark has overhead — a JVM driver, task scheduling, executor spin-up. For datasets that fit in a single machine, DuckDB, Polars, or pandas will smoke it.
  • Sub-second BI queries on cold clusters. This is exactly the gap Photon and serverless SQL Warehouses were built to close, and they've largely closed it — but vanilla Spark on a cold cluster is not a snappy BI experience.
  • Operational simplicity. Spark is powerful and configurable, which means it has hundreds of tuning knobs (executor count, executor memory, shuffle partitions, broadcast thresholds). On Databricks, most of these are auto-tuned, but Spark expertise is still a real and rare skill.
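A few of those knobs, in `spark-defaults.conf` form — the keys are real Spark configuration properties, but the values are illustrative placeholders, not recommendations:

```properties
# Cluster sizing (auto-managed on Databricks)
spark.executor.instances                8
spark.executor.memory                   16g

# Shuffle parallelism: the classic first thing to tune
spark.sql.shuffle.partitions            200

# Joins below this size (bytes) are broadcast instead of shuffled
spark.sql.autoBroadcastJoinThreshold    10485760
```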

The Opinionated Take

Spark is the most successful open-source data project of the last 15 years, and it created the company that is now reshaping the entire data warehouse market. It would be easy to view Spark in 2026 as legacy — newer engines (DuckDB, Polars, Velox, DataFusion) are faster on small to medium data, and SQL-first lakehouses are winning new workloads — but that read misses how deeply Spark is woven into production. Almost every large enterprise data pipeline running on Databricks is, underneath, a Spark job. Photon makes those jobs fast; Delta Lake makes them transactional; Unity Catalog makes them governed; but Spark is still doing the work.

The strategic tension worth noting: Databricks has spent the last five years quietly making Spark optional from the user's perspective. A modern Databricks customer running a SQL Warehouse may never write a line of Spark code. They write SQL, the SQL is optimized by Catalyst, and it executes on Photon. From their seat, "Databricks" is a SQL warehouse, not a Spark platform. This is deliberate — it's how Databricks competes for the BI buyer who would never touch a notebook — but it's also a long-term hedge: if a faster engine ever supersedes Spark for the workloads Spark currently dominates, Databricks can swap it out without changing the user experience much. The brand is "Databricks lakehouse," not "Spark."

For now, though, Spark is still the engine of the lakehouse. Every Databricks competitor that wants to challenge it (Snowflake with Snowpark, pure-SQL lakehouse vendors like Starburst and Dremio, and a long tail of open-source alternatives) is implicitly arguing that Spark is more than you need. They might be right for some workloads. They are not right for the median Databricks customer, whose Spark jobs are load-bearing and have been running for years.

How TextQL Fits

TextQL Ana operates at the SQL layer, so it interacts with Spark through Databricks SQL Warehouses rather than by writing Spark code directly. The Spark jobs that matter for TextQL are upstream: they're the ETL and feature pipelines that build the curated tables Ana queries. A well-organized Spark-built lakehouse, with clean Delta tables registered in Unity Catalog and good column-level documentation, is one of the easiest environments to deploy an AI analyst against.

See TextQL in action

Apache Spark (Databricks)
Released Open-sourced 2010 (UC Berkeley AMPLab); Apache TLP 2014
Created by Matei Zaharia and the AMPLab at UC Berkeley
Vendor Databricks (commercial steward)
Type Distributed data processing engine
Languages Scala, Python, Java, R, SQL
Category Data Warehouse
Monthly mindshare ~600K · Spark predates Databricks; one of the most popular OSS data tools; massive Stack Overflow footprint