NEW: Scale AI Case Study — ~1,900 data requests per week across 4 business units Read now →

Query Engines & Virtualization

Query engines are the SQL brains that read data from somewhere else. They separate compute from storage, federate across sources, and make data lakes feel like databases. The engines covered here are Trino, Presto, Dremio, DuckDB, Apache Hive, Denodo, and Databricks Photon.

A query engine is the part of a database that takes a SQL query, figures out how to execute it, reads the data, and gives you the answer. In a traditional database, the query engine is welded to the storage layer — they ship together, they're tuned together, and you can't really use one without the other. A standalone query engine rips that apart. It's a SQL brain that reads data from somewhere else — usually files in a data lake, sometimes other databases, sometimes a dozen sources at once.

Think of it this way: a query engine is a chef who doesn't own a kitchen. You give the chef ingredients (your data, wherever it lives), they cook with their own techniques, and they serve you the result. The chef can work in any kitchen, with any pantry. A traditional warehouse like Snowflake is the opposite — it's a chef who only cooks with food from their own farm, served in their own dining room. Both produce great meals. The standalone chef gives you flexibility; the all-in-one restaurant gives you consistency.

This category is one of the most confusing in the modern data stack because the products inside it are doing very different things under one umbrella. A federated SQL engine like Trino, an embedded analytical engine like DuckDB, a data virtualization platform like Denodo, and a vectorized execution engine like Databricks Photon all get called "query engines," but they solve very different problems.

What a Query Engine Actually Does

Every query engine does roughly four things, in this order:

1. Parse the SQL. Turn your text query into a tree the computer understands. This is the easy part.

2. Plan the execution. Decide how to actually run the query. Which tables to read first, which joins to do in which order, whether to push computation down into the storage layer, how to distribute work across machines. The query planner is where most of the magic (and most of the engineering effort) lives. A good plan and a bad plan for the same query can differ by 100x in runtime.

3. Read the data. Pull bytes from storage. This might mean reading Parquet files from S3, fetching pages from Postgres, or scanning a Kafka topic. Modern engines push as much filtering as possible down into this layer (called "predicate pushdown") so they don't haul data they don't need.

4. Execute and return. Run the joins, aggregations, and window functions, then hand back rows. Modern engines do this with vectorized execution — processing batches of rows at a time using SIMD CPU instructions instead of one row at a time. This is where engines like Photon and DuckDB get their speed.
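The four stages are easiest to see in a real engine. This toy sketch uses Python's built-in sqlite3 as a stand-in query engine (the table and data are made up for illustration): `EXPLAIN QUERY PLAN` exposes what the planner decided before the engine reads and executes.

```python
import sqlite3

# A tiny in-memory table standing in for "storage".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "eu", 10.0), (2, "us", 25.0), (3, "eu", 7.5)])

sql = "SELECT region, SUM(amount) FROM orders WHERE amount > 5 GROUP BY region"

# Steps 1-2: the engine parses the SQL and builds an execution plan.
# EXPLAIN QUERY PLAN lets us inspect the planner's decisions.
for row in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(row)

# Steps 3-4: read the rows, run the filter and aggregation, return results.
print(sorted(con.execute(sql).fetchall()))
# → [('eu', 17.5), ('us', 25.0)]
```

The same anatomy scales up: in Trino or DuckDB the plan is distributed or vectorized, but parse, plan, read, execute is still the shape of every query.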

The reason query engines exist as a separate category is that steps 3 and 4 don't need to be tied to a particular storage system. If your data is sitting in Parquet files on S3 in a well-known format (Apache Iceberg, Delta Lake, Apache Hudi), any engine that understands the format can read it. That's the whole game.

The Separation of Compute from Storage

The single most important architectural shift in data infrastructure over the last 15 years was separating compute from storage. In the old world, your database was one box that did both — if you needed more query power, you bought a bigger box, and your data was stuck on it. In the new world, your data sits in cheap object storage (S3, GCS, ADLS) and your query engines are stateless workers you can spin up and down on demand.

Snowflake and BigQuery were the first to make this commercially obvious — they hid the separation behind a closed product, but underneath, the storage and compute were independent. Standalone query engines take this idea further: you, the customer, get to pick the storage layer and the compute layer separately, and swap either one without losing the other. That's what people mean when they talk about "open data architectures" or "the lakehouse."

The promise: no more vendor lock-in on storage, multiple engines can read the same data, and you pay for compute only when you query. The reality: it's harder than it looks, governance and metadata catalogs become the new battleground (see Unity Catalog, Polaris, Apache Gravitino), and query performance on raw object storage is still slower than on a tightly-integrated warehouse for many workloads.
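The separation can be sketched in a few lines of Python: ordinary files play the role of object storage, and a stateless function plays the role of a compute worker (all names and data here are illustrative, not any engine's API).

```python
import csv
import os
import tempfile
from collections import defaultdict

# "Object storage": plain files that no compute node owns.
storage = tempfile.mkdtemp()
path = os.path.join(storage, "sales.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["region", "amount"], ["eu", "10"], ["us", "25"], ["eu", "7"]])

def query_worker(data_path):
    """A stateless 'compute node': spun up, reads shared storage, answers, dies."""
    totals = defaultdict(float)
    with open(data_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return dict(totals)

# Any number of independent workers can read the same data concurrently;
# scaling compute never requires moving or copying the storage.
print(query_worker(path))
# → {'eu': 17.0, 'us': 25.0}
```

Swap the CSV for Parquet on S3 and the function for a Trino cluster and you have the lakehouse pattern: the workers hold no state, so you can run zero of them at night and fifty at month-end close.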

The Federation Story

The other reason query engines exist is federation — querying multiple data sources as if they were one. This was the original Presto pitch at Facebook: an analyst should be able to join data from Hive, MySQL, and Cassandra in a single SQL statement, without copying anything.

The federation dream is seductive: leave your data where it is, query it where it lives, no ETL pipelines, no staleness. The honest reality is that federation works best when you have a few large sources that all live nearby (same cloud, same region, same security perimeter) and worst when you're trying to join a 10-billion-row table in S3 against a Salesforce export over the public internet. Network is the enemy. Predicate pushdown helps but only goes so far.

Despite the caveats, federation is still the most compelling reason to introduce a query engine into a stack that already has a warehouse. It lets you reach data the warehouse can't easily ingest, and it lets you keep operational data in operational systems instead of dragging it through a nightly pipeline.
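A minimal sketch of what a federated engine does, with two toy "sources" (an in-memory SQLite database and a list of dicts standing in for a SaaS API; all names are hypothetical). Note where the filter runs: pushing the predicate into the source is exactly the pushdown trick that keeps data off the network.

```python
import sqlite3

# Source 1: an operational database (here, in-memory SQLite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(1, 10.0), (2, 25.0), (1, 7.5)])

# Source 2: a SaaS API response (here, a plain list of dicts).
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]

def federated_join(min_amount):
    """The 'federated engine': push filters into each source, join the rest itself."""
    # Predicate pushdown: the WHERE clause executes inside source 1,
    # so only qualifying aggregates cross the "network".
    rows = db.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "WHERE amount > ? GROUP BY customer_id", (min_amount,)
    ).fetchall()
    # The cross-source join happens in the engine, not in either source.
    names = {c["id"]: c["name"] for c in customers}
    return {names[cid]: total for cid, total in rows}

print(federated_join(5.0))
# → {'Acme': 17.5, 'Globex': 25.0}
```

The caveat from above is visible here too: if `orders` had ten billion rows and the filter couldn't be pushed down, the engine would have to haul all of them across the network before joining.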

Where Each Engine Sits

The seven engines in this section are doing meaningfully different things. A rough taxonomy:

| Engine | Primary job | Where it shines |
| --- | --- | --- |
| Trino / Starburst | Federated, distributed SQL across many sources | Lake querying + cross-source federation at scale |
| Presto | Same as Trino, the original | Mostly maintenance mode at this point |
| Dremio | Iceberg-native lakehouse SQL with reflections | Self-service BI on the lake |
| DuckDB | Embedded single-node analytical SQL | Local analytics, notebooks, small-to-medium data |
| Apache Hive | The original SQL-on-Hadoop translator | Legacy Hadoop estates only |
| Denodo | Enterprise data virtualization with semantic modeling | Big-bank, big-pharma, big-everything legacy environments |
| Databricks Photon | C++ vectorized execution inside Databricks | Speedup for SQL workloads inside the Databricks platform |

The Honest Market Take

Two things are true at once. First, Trino effectively won the open-source federated SQL war — the original Presto creators forked the project in 2019 after a corporate dispute with Facebook, renamed it Trino in 2020, and the community followed them almost completely. Presto-the-Linux-Foundation-project still exists and is maintained largely by Meta, IBM, and Uber, but Starburst (the Trino commercial sponsor) is the company most enterprises end up talking to.

Second, DuckDB is having a moment that nobody expected five years ago. It's a single-node embedded engine — the SQLite of analytics — and it turns out a huge amount of "big data" isn't actually big. A laptop with 64GB of RAM and DuckDB will out-execute a small Spark cluster on most jobs you'd care to run. MotherDuck is the commercial play, extending DuckDB into a hybrid local-and-cloud experience.

Meanwhile, Hive is legacy. It was the right answer in 2010 and the wrong answer in 2026 — if you're starting fresh, you should not be reaching for Hive. Most "Hive" you encounter today is really just the Hive Metastore (the metadata catalog) being used by other engines that long ago replaced Hive's execution layer.

And Photon is locked into Databricks. It's a real engineering achievement — a from-scratch C++ vectorized engine that gives Databricks SQL workloads a 2-3x speedup — but it doesn't exist outside the Databricks runtime, and that's by design. If you bet on Photon, you bet on Databricks.

When to Pick a Query Engine vs. a Warehouse

| Scenario | Best fit | Why |
| --- | --- | --- |
| All your data already in one warehouse | Stay in the warehouse | Adding a query engine is overhead you don't need |
| Data spread across cloud storage + databases + SaaS | Trino / Starburst | Federation is the original use case |
| Single-node analysis on Parquet files | DuckDB | Simpler than spinning up a cluster, often faster |
| Self-service BI on an Iceberg lakehouse | Dremio | Reflections + Iceberg-native semantics |
| Avoiding vendor lock-in on storage | Trino or Dremio on Iceberg | Open formats, swappable engines |
| Speed inside an existing Databricks footprint | Photon | Already paid for, drop-in upgrade |

How TextQL Works with Query Engines

Query engines are the natural connection point for TextQL when an organization has data spread across multiple sources. TextQL Ana connects to Trino, Starburst, Dremio, DuckDB, and Databricks SQL (Photon) directly, treating the query engine as the single SQL endpoint that fans out to the underlying systems. This is especially powerful for federated environments — the query engine handles the cross-source plumbing while TextQL handles turning a business question into the right SQL.

See TextQL in action

Query Engines & Virtualization
Category: SQL compute, federation, virtualization
Also called: SQL engines, federated query, distributed SQL
Not to be confused with: data warehouses (which bundle storage), data lakes (which only store)
Key open-source projects: Trino, Presto, DuckDB, Hive
Key commercial vendors: Starburst, Dremio, MotherDuck, Denodo
Query language: SQL (mostly ANSI SQL with extensions)
Monthly mindshare: ~200K people working with federated/distributed SQL