Semantic Layers Are a Patch. Ontologies Are the Fix.

Benjamin Shi

May 21, 2026

TL;DR: AI for structured data has been stuck for years. The industry solved accuracy. What it didn't solve is breadth, cost, and time. Ontologies are the first approach that addresses all three.

For years, the data industry has been selling you a prerequisite. Before AI can touch your data, you need a lakehouse. Before the lakehouse, a medallion architecture. Your data needs to be "AI-ready": clean schemas, modeled marts, and a data dictionary to pair with it. The promise of conversational data analytics is always one step away.

It's 2026, and the AI revolution in structured data still hasn't arrived. Not really. Text-to-SQL benchmarks inch upward as models get more powerful, semantic layers multiply, and vendors continue to repackage the same fundamental bet: if we just describe our data well enough, machines will understand it. They don't. Not on the questions that matter, at least.

What I Got Wrong About AI for Structured Data

I'm Ben. I joined TextQL after five years embedded in data and AI infrastructure, at Blackstone and across the modern data stack. Like a lot of people in this space, I've spent most of that time chasing the same goal: conversational analytics that actually works.

I started where almost everyone starts: text-to-SQL. The pitch is irresistible. Type a question, get SQL out, no modeling required. In a demo with five clean tables, it works. The problem, as I've realized, is that real business users don't ask questions the way demos do. Users ask ambiguous questions, half-formed questions, questions that assume context the system doesn't have. The system confidently returns an answer that is only half-right. After enough of those, conversations stop being about what the product can do and start being about how you can guarantee it won't be wrong.

BIRD-SQL is the industry's standard benchmark for measuring exactly this. In 2026, the best systems plateau around 82% under controlled conditions. Move to unmodeled, enterprise-realistic data and that drops to 42%. That gap is where the industry's obsession with accuracy benchmarks comes from. It isn't misplaced. It just isn't enough.

[Figure: bar chart of execution accuracy (y-axis, 0% to 100%) against a human expert baseline of 93%.]

BIRD-Bench (top 5 leaderboard systems, with hand-written column hints):
82%: AskData + GPT-4o (Dec 16, 2025). AT&T's text-to-SQL pipeline built on top of GPT-4o.
82%: Agentar-Scale-SQL (Sep 25, 2025). Ant Group's multi-stage framework that runs SQL through several reasoning and refinement passes.
78%: LongData-SQL (Jul 14, 2025). LongShine AI Research's framework that loads large database context to improve accuracy.
77%: SiriusAI-Text2SQL-Agent (Apr 28, 2026). Tencent's agentic text-to-SQL system from the Data & Computation Platform Department.
77%: Zhiwen-Lingsi-Agent (Jan 02, 2026). China Telecom and TeleAI's agentic SQL system tuned for enterprise data.

LiveSQLBench (unmodeled data):
42%: Gemini 3.1 Pro (Sep 10, 2025). Frontier model tested on fresh databases with no hand-written hints. The score that's left when the curation goes away.

Sources: BIRD-Bench official leaderboard (top 5 systems ranked by test-set execution accuracy, evaluated with hand-written column hints). LiveSQLBench-Base-Full V1 (Gemini 3.1 Pro, normal difficulty, contamination-free databases). Human expert baseline: 92.96% execution accuracy on BIRD dev set (Guo et al., NeurIPS 2023).

FIG. 1 — Text-to-SQL accuracy on trained benchmarks vs. unseen databases

The industry's answer to that gap was semantic layers. And they deserved the reputation. Map your metrics, dimensions, and business logic into a structured layer and accuracy becomes real. Answers get reliable.

dbt's 2026 benchmarks tell the same story BIRD-SQL does, just from the other direction. Raw text-to-SQL against a normalized, lightly modeled enterprise schema lands around 70%. Back it with a well-modeled semantic layer and that climbs toward 100%. The accuracy problem, it turns out, is solvable.

[Figure: bar chart of accuracy (y-axis, 0% to 100%), split into questions within and outside the semantic layer's scope.]

Within semantic layer scope:
26.9%: Text-to-SQL, GPT-4 (2023). Baseline accuracy on questions that fit within MetricFlow's join scope.
62.5%: Text-to-SQL, Sonnet 4.6 (2026). Modern frontier model on the same questions. Big jump, but still not reliable.
83.1%: Semantic Layer, GPT-4 (2023). Same model, but routed through dbt's semantic layer. Accuracy jumps significantly.
100%: Semantic Layer, Sonnet 4.6 (2026). Perfect accuracy within scope. Deterministic SQL generation eliminates wrong answers.

Outside semantic layer scope (the schema doesn't support multi-hop joins; the semantic layer returns "cannot answer" rather than guess):
48.3%: Text-to-SQL, GPT-4 (2023). On questions requiring too many entity hops for MetricFlow.
70%: Text-to-SQL, Sonnet 4.6 (2026). Frontier model still tries, sometimes successfully. But no determinism.
0%: Semantic Layer, GPT-4 (2023). Cannot answer. MetricFlow's schema doesn't support multi-hop joins beyond a limit.
0%: Semantic Layer, Sonnet 4.6 (2026). Still cannot answer. Better models don't help when the architecture doesn't support the question.

Source: dbt Developer Blog, "Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update" (April 7, 2026). ACME Insurance benchmark (data.world), 11 questions × 20 runs. "Within scope" uses Too Many Hops = False; "Outside scope" uses Too Many Hops = True.

FIG. 2 — dbt 2026: semantic layer hits 100% within scope, 0% outside it

But you can see the ceiling this imposes. The questions outside the semantic layer's scope show 0% accuracy in dbt's own benchmark. They aren't answered at all, by design. That points to the fundamental constraint of the approach.
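
The in-scope versus out-of-scope behavior described above can be sketched in a few lines of Python. This is a toy illustration, not MetricFlow's implementation: the join graph, the entity names, and the `MAX_HOPS` value are all assumptions made up for the example. The point it shows is that a layer bounded by modeled joins can answer deterministically inside its limit and must refuse outside it.

```python
from collections import deque

# Hypothetical join graph: which modeled entities can be joined directly.
JOINS = {
    "claims": ["policies"],
    "policies": ["claims", "customers"],
    "customers": ["policies", "agents"],
    "agents": ["customers", "regions"],
    "regions": ["agents"],
}

MAX_HOPS = 2  # assumed limit, chosen only for illustration


def hops_between(start, goal):
    """Breadth-first search: joins on the shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in JOINS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def answer(metric_entity, dimension_entity):
    """Return a deterministic join plan within scope, or refuse outside it."""
    hops = hops_between(metric_entity, dimension_entity)
    if hops is None or hops > MAX_HOPS:
        return "cannot answer"
    return f"join path of {hops} hop(s) from {metric_entity} to {dimension_entity}"


in_scope = answer("claims", "customers")   # two joins away, within the limit
out_of_scope = answer("claims", "regions")  # four joins away, beyond the limit
```

In this sketch the refusal is a feature of the design, which is why a stronger model cannot raise the out-of-scope score: the question never reaches the model.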

The permutations of questions any organization needs to ask always outpace what the data team has gotten around to modeling or building a semantic layer around. A new revenue breakdown. A different cut of customer behavior. A question that crosses two domains nobody has formally connected yet. The data exists, but until an engineer sits down to model it, AI can't reliably answer questions about it. The only solution is to prioritize which datasets get built out and leave the rest dark.

[Figure: line chart. x-axis: data team headcount; y-axis: datasets available to AI. Annotation: more models = more modeled data.]

FIG. 3 — Datasets available to AI scale linearly with data team headcount

Most organizations accept this tradeoff because the demos finally work. The analytics they showcase are accurate. But the tradeoff is still there, and the rest of the industry hasn't made it any easier to escape.

The Landscape, and Why It's Stuck

Spend enough time inside text-to-SQL and semantic layers and you start seeing them everywhere. Every major BI tool, every cloud warehouse AI feature, every coding agent pointed at a database is running one of two playbooks. Nothing launched in the last year has broken the pattern.

This is broadly how I see the conversational data analytics market today:

The Conversational Data Analytics Landscape

Text-to-SQL

Wraps LLMs with context injection and RAG to convert natural language directly into SQL queries.

Examples: Seek AI, Vanna AI, AI2SQL
Strengths: No modeling required; fast to deploy; broad data access
Limitations: Unreliable on messy schemas; confident wrong answers; no business context; ~40% on raw enterprise data

Semantic Layers

Maps metrics, dimensions, and business logic into a structured layer on top of your data models.

Examples: dbt MetricFlow, Cube, AtScale, Zenlytic
Strengths: Near 100% on modeled data; deterministic results; business logic encoded once
Limitations: Headcount scales linearly; 0% outside modeled scope; engineering sprint per domain; single platform only

Generative BI

Natural language layered on existing BI models. Compiles SQL into charts and dashboards.

Examples: Tableau Pulse, PowerBI Copilot, Ask Sigma, Looker Explorer
Strengths: Familiar BI interface; visual output out of the box; fast for pre-built metrics
Limitations: Blind to upstream context; bounded by BI team models; no cross-system reach; visualization-first, not insight-first

Warehouse Copilots

Semantic layer inside a warehouse's security perimeter. AI-first architecture, platform-native.

Examples: Snowflake Cortex, Databricks Genie, BigQuery Analytics Hub
Strengths: Native warehouse integration; security and governance built-in; semantic accuracy within walls
Limitations: Single warehouse only; cross-system answers impossible; inherits semantic layer ceiling; vendor lock-in by design

Coding Agents

Agentic reasoning around text-to-SQL. Points at a database via MCP, introspects schema, generates SQL.

Examples: Claude Code, Cursor, GitHub Copilot, Cortex Code, Genie Code
Strengths: Flexible, code-first workflow; handles complex multi-step logic; no pre-modeling required
Limitations: Expensive reasoning loops; ~90% accuracy ceiling; no persistent understanding; not scalable beyond power users
Pattern: Text-to-SQL
Pattern: Semantic Layer
Pattern: Text-to-SQL

FIG. 4 — Five approaches to conversational data analytics, two underlying bets

These are different jackets, but the same two bets. And anything built on top of them inherits the same fundamental ceiling. The industry has spent the last two years measuring success on the one dimension these approaches were always going to win: accuracy. But accuracy alone was never the point.

Accuracy Is Not The North Star

What the industry isn't measuring matters more than what it is. The vectors that actually determine whether AI on your data works in production are cost and time (we'll get to why they belong together) and breadth. Both have been overlooked while the industry races to win on accuracy. That's why we're stuck.

01. Accuracy (solved): The right answer for a given question. Industry progress: mature.
02. Cost & Time (unsolved): Whether the system scales past a handful of power users. Industry progress: early.
03. Breadth (unsolved): Whether the system can answer questions that span your actual data estate. Industry progress: early.

Cost and Time

Cost and time determine whether the system can scale past a handful of power users. To get accurate answers, text-to-SQL agents lean on long reasoning loops that consume a lot of tokens and take a while to return. That's reasonable for one analyst running one query. It becomes harder to justify when you're trying to roll AI-driven analytics out to a whole organization, where every question multiplies the cost and the wait.

Breadth

Breadth determines whether the system can answer questions that span your actual data estate. Semantic layers do their job well, but they're built for a specific platform or a specific team. Most real business questions cross at least one boundary, and a semantic layer designed around one tool can only take you so far when the answer lives somewhere else.

Ontologies as the Answer

A third pattern is emerging in the industry, one that resolves the tradeoff the first two treat as fixed. It's called an ontology: a living map of your data estate that builds itself and gets smarter every time it's used.

Here's the kicker.

First query: the ontology has never seen your data before. It searches exhaustively, traversing table relationships, mapping attributes, locating where business metrics actually live across your systems. What a data engineer would spend weeks modeling one domain at a time, the ontology does automatically, across your entire data estate at once.

FIG. 5 — On first query, the ontology maps the full data estate exhaustively

This isn't scoped to one warehouse or one tool either. An ontology can traverse your entire data estate, across systems, across schemas, across the boundaries that semantic layers were never designed to cross.

Second query: the ontology already knows the terrain. It finds the shortest path to the answer directly: no redundant scanning, no reasoning loops, no waiting for someone to model a new domain. And every query after that compounds. The more questions your organization asks, the more the ontology learns, and the faster and cheaper every subsequent answer becomes.

FIG. 6 — On subsequent queries, the ontology routes directly, compounding efficiency
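
The first-query and second-query behavior described above can be sketched as a graph search with memory. This is a minimal toy model, not TextQL's implementation: the `OntologySketch` class, the table names, and the use of breadth-first search as the exploration strategy are all assumptions chosen for illustration. It shows only the core idea, which is that exploration work is paid once and the learned route is reused afterward.

```python
from collections import deque


class OntologySketch:
    """Toy model: explore a table graph once, then reuse learned paths."""

    def __init__(self, relationships):
        self.relationships = relationships  # table -> directly related tables
        self.known_paths = {}               # (start, goal) -> list of tables
        self.tables_scanned = 0             # total exploration work so far

    def _explore(self, start, goal):
        # Breadth-first search finds a shortest path, scanning tables as it goes.
        parents = {start: None}
        queue = deque([start])
        while queue:
            table = queue.popleft()
            self.tables_scanned += 1
            if table == goal:
                path = []
                while table is not None:
                    path.append(table)
                    table = parents[table]
                return path[::-1]
            for neighbor in self.relationships.get(table, []):
                if neighbor not in parents:
                    parents[neighbor] = table
                    queue.append(neighbor)
        return None

    def route(self, start, goal):
        # Only explore when this route has never been learned before.
        key = (start, goal)
        if key not in self.known_paths:
            self.known_paths[key] = self._explore(start, goal)
        return self.known_paths[key]


# Hypothetical data estate used only for this example.
estate = {
    "orders": ["customers", "line_items"],
    "customers": ["orders", "support_tickets"],
    "line_items": ["orders", "products"],
    "products": ["line_items"],
    "support_tickets": ["customers"],
}

ontology = OntologySketch(estate)
first = ontology.route("orders", "products")   # explores the graph
scanned_after_first = ontology.tables_scanned
second = ontology.route("orders", "products")  # reuses the learned path
scanned_after_second = ontology.tables_scanned
```

In this toy version the second call does no scanning at all, so `tables_scanned` stays flat. A real system would also need to handle schema changes and stale paths, which this sketch deliberately ignores.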

For the first time, the only barrier to asking a question is whether the data exists.

That's the infrastructure bet at the core of TextQL, and why we built on ontologies from the start.

The Cost of Getting There

And even with that breakthrough, opening access to everyone changes the economics entirely. That's where cost becomes the next step function.

A few years ago, the pitch was simple: LLM costs are falling fast, and eventually querying your data with AI will cost less than asking a human analyst. That happened. A query today costs a fraction of a percent of what a senior analyst costs to answer the same question. And yet most organizations are no closer to rolling agentic analytics out to everyone.

[Figure: line chart of cost per query (y-axis) over time (x-axis). Series: Human Analyst, LLM cost. Annotation: "we are here".]

FIG. 7 — LLM cost per query is falling below the cost of human analyst labor

So why is agentic analytics still expensive?

Because, as we've established, the cost has never really been about the LLM. It's about how these systems reason. Accuracy aside, flying blind on every query is expensive, and right now that cost is absorbed by a small group of specialists who can justify it. The rest of the organization files requests and waits. We are holding the floodgates shut ourselves. Nobody is handing the whole company access to Claude Code or Cursor.

When that changes, query volume doesn't increase incrementally. It explodes. And if your system reasons inefficiently, falling costs don't save you. They just enable more waste at higher volume. The cost bottleneck doesn't disappear. It shifts.

Semantically identical queries, answered from scratch every session with no shared context:

Maya (Tuesday): "most popular routes last month?" Answer: Routes 14 and 30, 1.2M and 980K boardings.
Devon (Thursday): "top OD pairs in the last 4 weeks?" Answer: Route 14 and Route 30, 1.2M / 980K trips.
Priya (Today): "busiest stations / routes?" Answer: Routes 14 and 30, ~1.2M / ~980K boardings.

Without an ontology: 148 tool calls across 3 sessions, with the same answer reached 3 times from scratch.

Both Snowflake and LLMs bill in proportion to work done. Snowflake charges for compute time. LLMs bill for every token the model processes. Every extra reasoning step, every redundant table scan, every time an agent figures out something an ontology already knows: that's money spent twice, on your warehouse bill and your inference bill at the same time.
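
To make the double billing concrete, here is a rough cost model in Python. Every number in it is a made-up assumption for illustration: the hourly warehouse rate, the per-token price, and the workload figures for the two hypothetical agents are not real vendor prices or measured results. The sketch only shows that warehouse time and token count add up into a single per-query cost, so cutting the work cuts both bills together.

```python
def query_cost(warehouse_seconds, tokens, warehouse_rate_per_hour, price_per_million_tokens):
    """Per-query cost as warehouse compute plus LLM inference."""
    warehouse = warehouse_seconds / 3600 * warehouse_rate_per_hour
    inference = tokens / 1_000_000 * price_per_million_tokens
    return warehouse + inference


# Hypothetical rates, not real vendor pricing.
WAREHOUSE_RATE_PER_HOUR = 3.00
PRICE_PER_MILLION_TOKENS = 5.00

# Hypothetical workloads: an agent rediscovering the schema on every query
# versus one that routes directly along a known path.
exploring = query_cost(120, 400_000, WAREHOUSE_RATE_PER_HOUR, PRICE_PER_MILLION_TOKENS)
routed = query_cost(20, 50_000, WAREHOUSE_RATE_PER_HOUR, PRICE_PER_MILLION_TOKENS)

savings_ratio = exploring / routed
print(f"exploring: ${exploring:.2f}, routed: ${routed:.2f}, ratio: {savings_ratio:.1f}x")
```

Under these assumed inputs, both terms of the sum shrink when the agent does less work, which is the sense in which the fastest path is also the cheapest one.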

As a result, the platform that gives you the fastest path to an answer is, almost by definition, the cheapest one at scale, especially over the long horizon. Speed and cost turn out to be the same thing.

[Figure: "Time to verified answer improves with query volume." Line chart of time to answer (y-axis) against queries answered by org (x-axis). Series: Claude Code + claude.md, Ontology. Annotation: 7x faster.]

FIG. 8 — Query volume compounds time-to-answer under an ontology-driven approach

This is why ontology-driven queries running 10x faster isn't just a performance story. At the volume that AI on data will operate at, it's a unit economics story.

The New Metric

This isn't theoretical. There's a pattern we keep seeing at TextQL. Every time we cut query costs by 1.5x, demand surges by 3x. Not incrementally. It jumps. Jevons' paradox, applied to enterprise analytics: the cheaper and faster access becomes, the more of it people want. Consumption expands.
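
The arithmetic behind that pattern is worth spelling out, because it means total spend goes up even as each query gets cheaper. The short Python sketch below uses only the two figures quoted above (a 1.5x cost cut and a 3x demand surge); the function name is a hypothetical helper introduced for the example.

```python
def total_spend_multiplier(cost_reduction_factor, demand_multiplier):
    """Change in total spend when unit cost falls and query volume rises."""
    new_unit_cost = 1 / cost_reduction_factor  # relative to the old unit cost
    return new_unit_cost * demand_multiplier


# Figures from the text: costs cut by 1.5x, demand up 3x.
multiplier = total_spend_multiplier(1.5, 3.0)
print(f"total spend changes by {multiplier:.1f}x")
```

With those inputs, each query costs two thirds of what it did, but three times as many are asked, so total spend doubles. That is why per-query efficiency has to keep improving as adoption grows.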

[Figure: weekly ACUs in millions (left axis, 0 to 80M) and weekly messages sent (right axis, 0 to 30K), Oct 2025 to Apr 2026.]

Series:
Actual Weekly ACUs Measured: ACU consumption per week, excluding TextQL employees.
Projected Without Changes: ~26K messages/user/week × peak users; what consumption would have been without optimization.
Weekly Messages Sent: total user queries across the platform (right axis).

Annotations:
Sandbox Hours Reduced: sandbox efficiency improvements reduced ACU consumption by up to 24x, materially lowering overall sandbox hours.
Input Token / Caching Optimization: new token optimization capabilities have substantially lowered input token usage through more efficient management and caching strategies.

FIG. 9 — Weekly ACU consumption vs. messages sent: efficiency gains compound as demand grows

This is the right outcome. It means AI on data is finally working the way it was supposed to, not as a tool for specialists, but as infrastructure for everyone. The real question is whether your stack is built around the right framework to absorb that demand without the cost curve running away. We've shown that it can.

That changes how we should measure success. The new eval for this era is time to answer: it drives cost per query, which determines whether you can scale across your org. Accuracy is the floor. The platforms that win get there fastest, across the most data, at a cost that holds over time.

That's what we're building at TextQL.