Apache Spark (Databricks)
Apache Spark is the distributed processing engine that Databricks was founded to commercialize — the original foundation of the lakehouse and still the engine behind most non-SQL Databricks workloads.
Apache Spark is a distributed compute engine for processing large datasets in parallel across a cluster of machines. You give it data (in object storage, in a database, in Kafka), you give it a transformation (a SQL query, a DataFrame pipeline, a Python function, a streaming job), and it splits the work across many nodes, runs it, and gives you the result. It is the engine that does the actual computing inside almost every Databricks product.
The simple metaphor: Spark is a factory floor for data. Raw materials come in (your tables, files, streams), workstations (executors on cluster nodes) each handle a piece of the job in parallel, and finished goods come out (a transformed table, a model, a stream of events). Databricks built the factory; Spark is the assembly line.
Spark was created in 2009–2010 by Matei Zaharia as a PhD project at the UC Berkeley AMPLab (the same lab that gave us Mesos, Tachyon/Alluxio, and a startling fraction of the modern open-source data stack). The original motivation was practical and pointed: Hadoop MapReduce was too slow for iterative algorithms. A typical machine learning training run involved dozens of passes over the same dataset, and MapReduce wrote intermediate results to disk between every pass. For an iterative job, that was catastrophic.
Spark's key insight was to keep intermediate data in memory across stages, structured as an immutable distributed collection called a Resilient Distributed Dataset (RDD). The result was 10–100x speedups over MapReduce on iterative workloads. Spark went open source in 2010, became an Apache Top-Level Project in 2014, and almost immediately started displacing MapReduce as the default Hadoop-era compute engine.
Databricks was founded in 2013 by Zaharia and his AMPLab collaborators (Ali Ghodsi, Ion Stoica, Reynold Xin, Patrick Wendell, Andy Konwinski, Arsalan Tavakoli) explicitly to commercialize Spark. The company was unusual in that the founders kept Spark itself open source under Apache governance while building a managed Spark service as the commercial product. That bet — "open core engine, proprietary platform around it" — is the reason Databricks exists as a $40B+ company today.
Spark has a few layers worth knowing:
Cluster model. A Spark application has one driver (the brain that builds the query plan) and many executors (workers running on cluster nodes). The driver splits work into tasks, sends them to executors, and collects results. Modern Spark on Databricks adds a layer above this — the Photon engine — which replaces parts of the JVM-based execution path with vectorized C++ code for massive speedups on SQL and DataFrame workloads.
APIs. Spark exposes itself in several flavors that target different audiences: SQL (the Spark SQL interface, for analysts and BI tools), DataFrames (the structured API in Python, Scala, Java, and R, and the default for most new code), the low-level RDD API (the original abstraction, now mostly legacy), Structured Streaming (the same DataFrame code run continuously over streams), and libraries such as MLlib for distributed machine learning.
Catalyst and Tungsten. Two long-standing internal projects worth naming. Catalyst is Spark's query optimizer; it takes your DataFrame or SQL code and rewrites it into an efficient execution plan. Tungsten was the project that overhauled Spark's memory and code-generation paths starting in Spark 1.5, dramatically improving JVM performance. Photon is the spiritual successor: a C++ rewrite of the execution layer that ships only inside Databricks (it is not open source) and is what powers Databricks SQL Warehouses and the modern Databricks runtime.
Spark is the most successful open-source data project of the last 15 years, and it created the company that is now reshaping the entire data warehouse market. It would be easy to view Spark in 2026 as legacy — newer engines (DuckDB, Polars, Velox, DataFusion) are faster on small to medium data, and SQL-first lakehouses are winning new workloads — but that read misses how deeply Spark is woven into production. Almost every large enterprise data pipeline running on Databricks is, underneath, a Spark job. Photon makes those jobs fast; Delta Lake makes them transactional; Unity Catalog makes them governed; but Spark is still doing the work.
The strategic tension worth noting: Databricks has spent the last five years quietly making Spark optional from the user's perspective. A modern Databricks customer running a SQL Warehouse may never write a line of Spark code. They write SQL, the SQL is optimized by Catalyst, and it executes on Photon. From their seat, "Databricks" is a SQL warehouse, not a Spark platform. This is deliberate — it's how Databricks competes for the BI buyer who would never touch a notebook — but it's also a long-term hedge: if a faster engine ever supersedes Spark for the workloads Spark currently dominates, Databricks can swap it out without changing the user experience much. The brand is "Databricks lakehouse," not "Spark."
For now, though, Spark is still the engine of the lakehouse. Every Databricks competitor that wants to challenge it (Snowflake with Snowpark, pure-SQL lakehouse vendors like Starburst and Dremio, and a long tail of open-source alternatives) is implicitly arguing that Spark is more than you need. They might be right for some workloads. They are not right for the median Databricks customer, whose Spark jobs are load-bearing and have been running for years.
TextQL Ana operates at the SQL layer, so it interacts with Spark through Databricks SQL Warehouses rather than by writing Spark code directly. The Spark jobs that matter for TextQL are upstream: they're the ETL and feature pipelines that build the curated tables Ana queries. A well-organized Spark-built lakehouse, with clean Delta tables registered in Unity Catalog and good column-level documentation, is one of the easiest environments to deploy an AI analyst against.