Semantic Layers Are a Patch. Ontologies Are the Fix.

TLDR: AI for structured data has been stuck for years. The industry mostly solved accuracy. What it didn't solve for is breadth, cost, and time. Our Ontology is the first approach to break the ceiling.

For years, the data industry has been selling you a prerequisite. Before AI can touch your data, you need a lakehouse. Before the lakehouse, a medallion architecture. Your data needs to be "AI-ready": clean schemas, modeled marts, and a data dictionary to pair with it. The promise of conversational data analytics is always one step away.

It's 2026, and the AI revolution in structured data still hasn't arrived. Not really. Text-to-SQL benchmarks inch upward as models get more powerful, semantic layers multiply, and vendors continue to repackage the same fundamental bet: if we just describe our data well enough, machines will understand it. They don't. Not on the questions that matter, at least.

What I Got Wrong About AI for Structured Data

I'm Ben. I joined TextQL after five years embedded in data and AI infrastructure, at Blackstone and across the modern data stack. Like a lot of people in this space, I've spent most of that time chasing the same goal: conversational data analytics that actually works.

The TextQL team — Thumbs up if your data is a mess.

I started where almost everyone starts: text-to-SQL. The pitch is irresistible. Type a question, get SQL out, no modeling required. In a demo with five clean tables, it works. The problem, as I've realized, is that real business users don't ask questions the way demos do. Users ask ambiguous questions, half-formed questions, questions that assume context the system doesn't have. The system confidently returns an answer that is only half-right. After enough of those, conversations stop being about what the product can do and start being about how you can guarantee it won't be wrong.

BIRD-SQL is the industry's standard benchmark for measuring exactly this. In 2026, the best systems plateau around 82% under controlled conditions. Move to unmodeled, enterprise-realistic data and that drops to 42%. That gap is where the industry's obsession with accuracy benchmarks comes from. It isn't misplaced. It just isn't enough.

Execution accuracy

100%80%60%40%20%0%

HUMAN EXPERT BASELINE — 93%

82%

AskData + GPT-4o Dec 16, 2025 AT&T's text-to-SQL pipeline built on top of GPT-4o.

82%

Agentar-Scale-SQL Sep 25, 2025 Ant Group's multi-stage framework that runs SQL through several reasoning and refinement passes.

78%

LongData-SQL Jul 14, 2025 LongShine AI Research's framework that loads large database context to improve accuracy.

77%

SiriusAI-Text2SQL-Agent Apr 28, 2026 Tencent's agentic text-to-SQL system from the Data & Computation Platform Department.

77%

Zhiwen-Lingsi-Agent Jan 02, 2026 China Telecom and TeleAI's agentic SQL system tuned for enterprise data.

42%

Gemini 3.1 Pro Sep 10, 2025 Frontier model tested on fresh databases with no hand-written hints. The score that's left when the curation goes away.

AskData
+ GPT-4oAgentar-
Scale-SQLLongData-
SQLSiriusAI-
Text2SQLZhiwen-
Lingsi-AgentGemini
3.1 Pro

BIRD-Bench

Top 5 leaderboard systems with hand-written column hints

LiveSQLBench

Unmodeled data

Sources: BIRD-Bench official leaderboard (top 5 systems ranked by test-set execution accuracy, evaluated with hand-written column hints). LiveSQLBench-Base-Full V1 (Gemini 3.1 Pro, normal difficulty, contamination-free databases). Human expert baseline: 92.96% execution accuracy on BIRD dev set (Guo et al., NeurIPS 2023).

FIG. 1 — Text-to-SQL accuracy on trained benchmarks vs. unseen databases

The industry's answer to that gap was semantic layers. And they deserved the reputation. Map your metrics, dimensions, and business logic into a structured layer and accuracy becomes real. Answers get reliable.

dbt's 2026 benchmarks show that precise pattern. Raw text-to-SQL against a normalized, lightly modeled enterprise schema lands around 70%. Back it with a well-modeled semantic layer and that climbs toward 100%. The accuracy problem, it turns out, is solvable.

Within Semantic Layer Scope

Outside Semantic Layer Scope

Accuracy

100%80%60%40%20%0%

SCHEMA DOESN'T SUPPORT
MULTI-HOP JOINS

The Semantic Layer returns "cannot answer" rather than guess.

26.9%

Text-to-SQL · GPT-4 2023 Baseline accuracy on questions that fit within MetricFlow's join scope.

62.5%

Text-to-SQL · Sonnet 4.6 2026 Modern frontier model on the same questions. Big jump, but still not reliable.

83.1%

Semantic Layer · GPT-4 2023 Same model, but routed through dbt's semantic layer. Accuracy jumps significantly.

100%

Semantic Layer · Sonnet 4.6 2026 Perfect accuracy within scope. Deterministic SQL generation eliminates wrong answers.

48.3%

Text-to-SQL · GPT-4 2023 On questions requiring too many entity hops for MetricFlow.

70%

Text-to-SQL · Sonnet 4.6 2026 Frontier model still tries, sometimes successfully. But no determinism.

Semantic Layer · GPT-4 2023 Cannot answer. MetricFlow's schema doesn't support multi-hop joins beyond a limit.

Semantic Layer · Sonnet 4.6 2026 Still cannot answer. Better models don't help when the architecture doesn't support the question.

Text-to-SQL2023 (GPT-4)Text-to-SQL2026 (Sonnet 4.6)Semantic Layer2023 (GPT-4)Semantic Layer2026 (Sonnet 4.6)Text-to-SQL2023 (GPT-4)Text-to-SQL2026 (Sonnet 4.6)Semantic Layer2023 (GPT-4)Semantic Layer2026 (Sonnet 4.6)

Source: dbt Developer Blog, "Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update" (April 7, 2026). ACME Insurance benchmark (data.world), 11 questions × 20 runs. "Within scope" uses Too Many Hops = False; "Outside scope" uses Too Many Hops = True.

FIG. 2 — dbt 2026: semantic layer hits 100% within scope, 0% outside it

But you can see the ceiling this imposes. The questions outside the semantic layer's scope show 0% accuracy in dbt's own benchmark. They aren't answered at all, by design. That points to the fundamental constraint of the approach.

The permutations of questions any organization needs to ask always outpace what the data team has gotten around to modeling or building a semantic layer around. A new revenue breakdown. A different cut of customer behavior. A question that crosses two domains nobody has formally connected yet. The data exists, but until an engineer sits down to model it, AI can't reliably answer questions about it. The only solution is to prioritize which datasets get built out and leave the rest dark.

FIG. 3 — Datasets available to AI scale linearly with data team headcount

Most organizations accept this tradeoff because the demos finally work. The analytics they showcase are accurate. But the tradeoff is still there, and the rest of the industry hasn't made it any easier to escape.

The Landscape, and Why It's Stuck

Spend enough time inside text-to-SQL and semantic layers and you start seeing them everywhere. Every major BI tool, every cloud warehouse AI feature, every coding agent pointed at a database is running one of two playbooks. Nothing launched in the last year has broken the pattern.

And so, these are all different jackets on the same two bets. Anything built on top of them inherits the same ceiling:

The Conversational Data Analytics Landscape

Text-to-SQL

Wraps LLMs with context injection and RAG to convert natural language directly into SQL queries.

Examples

Seek AI

Vanna AI

AI2SQL

Strengths

No modeling required

Fast to deploy

Broad data access

Limitations

Unreliable on messy schemas

Confident wrong answers

No business context

~40% on raw enterprise data

Semantic Layers

Maps metrics, dimensions, and business logic into a structured layer on top of your data models.

Examples

dbt MetricFlow

Cube

AtScale

Zenlytic

Strengths

Near 100% on modeled data

Deterministic results

Business logic encoded once

Limitations

Headcount scales linearly

0% outside modeled scope

Engineering sprint per domain

Single platform only

Generative BI

Natural language layered on existing BI models. Compiles SQL into charts and dashboards.

Examples

Tableau Pulse

PowerBI Copilot

Ask Sigma

Looker Explorer

Omni Analytics

Strengths

Familiar BI interface

Visual output out of the box

Fast for pre-built metrics

Limitations

Blind to upstream context

Bounded by BI team models

No cross-system reach

Visualization-first, not insight-first

Warehouse Copilots

Semantic layer inside a warehouse's security perimeter. AI-first architecture, platform-native.

Examples

Snowflake Cortex

Databricks Genie

BigQuery Analytics Hub

Strengths

Native warehouse integration

Security and governance built-in

Semantic accuracy within walls

Limitations

Single warehouse only

Cross-system answers impossible

Inherits semantic layer ceiling

Vendor lock-in by design

Coding Agents

Agentic reasoning around text-to-SQL. Points at a database via MCP, introspects schema, generates SQL.

Examples

Claude Code

Cursor

GitHub Copilot

Cortex Code

Genie Code

Strengths

Flexible, code-first workflow

Handles complex multi-step logic

No pre-modeling required

Limitations

Expensive reasoning loops

~90% accuracy ceiling

No persistent understanding

Not scalable beyond power users

Pattern:Text-to-SQL

Pattern:Semantic Layer

Pattern:Text-to-SQL

FIG. 4 — Five approaches to conversational data analytics, two underlying bets

A creative way to close the gap has been to stack these approaches: give the coding agent a semantic layer to anchor its queries, or use the coding agent to build the semantic layer itself.

You can see it in Snowflake's Cortex Code sitting over semantic views, in Databricks Genie Code sitting over Unity Catalog metric views, or in Cursor pointed at a database via MCP. Different vendors, but the same architectural foundation. The agents provide breadth across the database. The semantic layers provide direction. Accuracy gets closer to solved, and so does coverage of the warehouse itself.

And yet it hasn't reached most of the organization.

Today, these tools are put in the hands of a small group of specialists who can justify it. The rest of the organization submits requests and waits for the data still.

Governance explains some of it. Setup is another barrier, since wiring an agent to a semantic layer requires technical fluency most business users don't have.

But let's assume both are addressed, and a deeper problem emerges: the economics break at organizational scale. The pattern that works for ten power users does not survive a thousand employees asking real business questions.

That's where cost and time become the next step function.

The Cost of Getting There

A few years ago, the pitch was simple: LLM costs are falling fast, and eventually querying your data with AI will cost less than asking a human analyst. That happened. A query today costs a fraction of a percent of what a senior analyst costs to answer the same question.

FIG. 5 — LLM cost per query is falling below the cost of human analyst labor

And yet nobody is handing the whole company access to Claude Code or Cursor on top of their data stacks. The floodgates are still being held shut.

The cost has never really been about the LLM. It is about how these systems reason. Semantic layers anchor a portion of the data estate, but business users do not stay inside it. They ask questions that span the whole warehouse, and often beyond it, into the business applications, documents, and institutional knowledge that actually explain what the data means.

Outside the modeled slice, most queries start from zero. An agent rediscovers your schema, infers what columns mean, and verifies which joins are valid before it can answer anything. Flying blind is expensive.

Semantically identical queries — answered from scratch every session, no shared context

M Maya Tuesday

Three Constraints, Not One

This points to three constraints that matter for designing an agentic analytics system, not just one. Accuracy has dominated our attention. But cost and time, and breadth have not. That is why the industry is stuck.

Solved

Accuracy

The right answer for a given question.

Industry progress Mature

Unsolved

Breadth

Whether the system can answer questions that span your actual data estate.

Industry progress Partial

Unsolved

Cost & Time

Whether the system scales past a handful of power users.

Industry progress Early

Note that I am not framing these as benchmarks or evals. I believe that both are imperfect measures. A mentor once told me that evals for AI are "like putting lipstick on a pig." He was right. Treating each constraint as a score, not a design property, is what keeps us boxed into the current architecture patterns.

On Accuracy

Accuracy is still the principal measure of a system worth trusting. The right answer for a given question is what every system promises, and within modeled scope, the industry has broadly delivered.

But accuracy benchmarks miss most of what matters in production. They do not capture ambiguity, where an agent returns a technically correct answer to the wrong interpretation. They do not distinguish partially right answers from technically accurate but useless ones. They do not measure whether the system pulled from the right sources, or built what the user actually wanted.

Accuracy also has a control problem. Semantic modeling is the dominant path to accurate answers, but if every business user builds their own semantic layer, you get conflicting definitions of the same metric. Accuracy at the query level becomes incoherence at the organization level.

On Breadth

The best answers are only possible if the right data is in the room. Breadth extends larger than a single warehouse. It includes the lakes, the business applications, and the documentation that explains what any of it means. Most "AI-ready" benchmarks measure only a slice. The real question is what coverage of your data systems AI can actually reach, and whether it can pull reliable insights across them. Breadth determines whether useful questions get answered at all, because most real questions cross at least one boundary. A semantic layer designed around one tool can only take you so far when the answer lives somewhere else.

On Cost and Time

Accuracy and reach only matter at scale, though, when the economics of getting there work. Cost and time should be solved by systems that get more efficient as they accumulate knowledge. The property to look for is memory and reuse: each query should make the next one cheaper. Do the costs flatten over time? Does the time per query come down as the agent rediscovers less?

Marginal cost needs to trend downward as volume grows. A system whose cost-per-query stays flat as volume scales is a system that does not work in production, or one where access is deliberately constrained.

Most of what is worth measuring here is not an eval at all. It is observability and secondary measures of quality: user sentiment, response length, number of tool calls, failed executions, cost per query, LLM-judged missing context. Ease of modification in response to new learnings matters more than any one-shot benchmark score.

Closing The Gap: An Ontology Repo

These design considerations led us to the infrastructure bet at the core of TextQL: that the right shape for AI on structured data combines what semantic models and agents do well into something new. We call it an Ontology.

Ontologies have become a buzzword in the last year, and every vendor defines them differently. Ours is specific: a git-controlled, role-aware set of files containing rich documentation and a small set of structured definitions in a format we call .tql. Every part of it was designed for AI agents to navigate, which is why it rewards locality, tolerates redundancy, and stays readable to humans.

The benefits of this design are best illustrated across two queries:

First query: The ontology has not seen your data before. It searches exhaustively across systems, schemas, and files, mapping attributes and locating where business metrics actually live. What a data engineer would spend weeks modeling one domain at a time, the ontology does automatically, across your entire data estate at once. This is not scoped to a single warehouse or a single tool. An ontology can traverse the boundaries that semantic layers were never designed to cross.

FIG. 6 — On first query, the ontology maps the full data estate exhaustively

Second query: The ontology knows the terrain. It finds the shortest path to the answer directly. This is memory, working as a design property: no redundant scanning, no reasoning loops, no waiting for someone to model a new domain. And every query after that routes directly.

FIG. 7 — On subsequent queries, the ontology routes directly, driving efficiency

For the first time, the only barrier to asking a question is whether the data exists.

This is why ontology-driven queries running 7x faster isn't just a performance story. At the volume AI on data will operate at, it is a unit economics story. The real question is whether your stack is built to absorb that demand without the cost curve running away. We have shown that it can.

FIG. 8 — Query volume accelerates time-to-answer under an ontology-driven approach

AI Analytics Has Arrived

This isn't theoretical. There is a pattern we keep seeing inside customer deployments at TextQL. Every time we cut query costs by 1.5x, demand surges by 3x. Not incrementally. It jumps.

FIG. 9 — Weekly ACU consumption vs. messages sent: efficiency gains drive demand

This is Jevons paradox applied to enterprise analytics: the cheaper and faster access becomes, the more of it people want. Consumption expands to fill the new capacity. Questions nobody bothered requesting become routine. Workflows that needed a quarterly review become weekly.

This is the right outcome. It means AI on data is finally working as infrastructure, not just a specialist tool.

But it also changes how we should measure success. The era of conversational data analytics had a clean definition: hit accuracy on a benchmark and ship. The agentic analytics era needs a different one.

The signal that matters is virality. Not virality in the consumer sense, but in the enterprise sense: whether one team's adoption pulls the next team in. Virality is the only signal that proves all three constraints are working at once. If accuracy fails, trust collapses. If breadth is too narrow, each new team hits the wall. If cost and time don't scale, the organization caps usage before it can spread.

But virality isn't only proof that the system works. It is what makes the system work better. Each new query pays for the discovery that powers the next. Spread drives reuse, reuse drives cost down, cost down drives spread.

This is what AI on data was always supposed to do. The platforms that win this era will be the ones that get to the answer fastest, across the most data, at a cost that holds over time. That is what we are building at TextQL.