TLDR: AI for structured data has been stuck for years. The industry mostly solved accuracy. What it didn't solve for is breadth, cost, and time. Our Ontology is the first approach to break the ceiling.
For years, the data industry has been selling you a prerequisite. Before AI can touch your data, you need a lakehouse. Before the lakehouse, a medallion architecture. Your data needs to be "AI-ready": clean schemas, modeled marts, and a data dictionary to pair with it. The promise of conversational data analytics is always one step away.
It's 2026, and the AI revolution in structured data still hasn't arrived. Not really. Text-to-SQL benchmarks inch upward as models get more powerful, semantic layers multiply, and vendors continue to repackage the same fundamental bet: if we just describe our data well enough, machines will understand it. They don't. Not on the questions that matter, at least.
What I Got Wrong About AI for Structured Data
I'm Ben. I joined TextQL after five years embedded in data and AI infrastructure, at Blackstone and across the modern data stack. Like a lot of people in this space, I've spent most of that time chasing the same goal: conversational data analytics that actually works.
I started where almost everyone starts: text-to-SQL. The pitch is irresistible. Type a question, get SQL out, no modeling required. In a demo with five clean tables, it works. The problem, as I've realized, is that real business users don't ask questions the way demos do. Users ask ambiguous questions, half-formed questions, questions that assume context the system doesn't have. The system confidently returns an answer that is only half-right. After enough of those, conversations stop being about what the product can do and start being about how you can guarantee it won't be wrong.
BIRD-SQL is the industry's standard benchmark for measuring exactly this. In 2026, the best systems plateau around 82% under controlled conditions. Move to unmodeled, enterprise-realistic data and that drops to 42%. That gap is where the industry's obsession with accuracy benchmarks comes from. It isn't misplaced. It just isn't enough.
+ GPT-4oAgentar-
Scale-SQLLongData-
SQLSiriusAI-
Text2SQLZhiwen-
Lingsi-AgentGemini
3.1 Pro
Sources: BIRD-Bench official leaderboard (top 5 systems ranked by test-set execution accuracy, evaluated with hand-written column hints). LiveSQLBench-Base-Full V1 (Gemini 3.1 Pro, normal difficulty, contamination-free databases). Human expert baseline: 92.96% execution accuracy on BIRD dev set (Guo et al., NeurIPS 2023).
FIG. 1 — Text-to-SQL accuracy on trained benchmarks vs. unseen databases
The industry's answer to that gap was semantic layers. And they deserved the reputation. Map your metrics, dimensions, and business logic into a structured layer and accuracy becomes real. Answers get reliable.
dbt's 2026 benchmarks show that precise pattern. Raw text-to-SQL against a normalized, lightly modeled enterprise schema lands around 70%. Back it with a well-modeled semantic layer and that climbs toward 100%. The accuracy problem, it turns out, is solvable.
MULTI-HOP JOINS
Source: dbt Developer Blog, "Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update" (April 7, 2026). ACME Insurance benchmark (data.world), 11 questions × 20 runs. "Within scope" uses Too Many Hops = False; "Outside scope" uses Too Many Hops = True.
FIG. 2 — dbt 2026: semantic layer hits 100% within scope, 0% outside it
But you can see the ceiling this imposes. The questions outside the semantic layer's scope show 0% accuracy in dbt's own benchmark. They aren't answered at all, by design. That points to the fundamental constraint of the approach.
The permutations of questions any organization needs to ask always outpace what the data team has gotten around to modeling or building a semantic layer around. A new revenue breakdown. A different cut of customer behavior. A question that crosses two domains nobody has formally connected yet. The data exists, but until an engineer sits down to model it, AI can't reliably answer questions about it. The only solution is to prioritize which datasets get built out and leave the rest dark.
FIG. 3 — Datasets available to AI scale linearly with data team headcount
Most organizations accept this tradeoff because the demos finally work. The analytics they showcase are accurate. But the tradeoff is still there, and the rest of the industry hasn't made it any easier to escape.
The Landscape, and Why It's Stuck
Spend enough time inside text-to-SQL and semantic layers and you start seeing them everywhere. Every major BI tool, every cloud warehouse AI feature, every coding agent pointed at a database is running one of two playbooks. Nothing launched in the last year has broken the pattern.
And so, these are all different jackets on the same two bets. Anything built on top of them inherits the same ceiling:
The Conversational Data Analytics Landscape
Text-to-SQL
Wraps LLMs with context injection and RAG to convert natural language directly into SQL queries.
Seek AI
Vanna AI
AI2SQLSemantic Layers
Maps metrics, dimensions, and business logic into a structured layer on top of your data models.
Cube
AtScale
ZenlyticGenerative BI
Natural language layered on existing BI models. Compiles SQL into charts and dashboards.
Warehouse Copilots
Semantic layer inside a warehouse's security perimeter. AI-first architecture, platform-native.
Coding Agents
Agentic reasoning around text-to-SQL. Points at a database via MCP, introspects schema, generates SQL.
Claude Code
Cursor
GitHub CopilotFIG. 4 — Five approaches to conversational data analytics, two underlying bets
A creative way to close the gap has been to stack these approaches: give the coding agent a semantic layer to anchor its queries, or use the coding agent to build the semantic layer itself.
You can see it in Snowflake's Cortex Code sitting over semantic views, in Databricks Genie Code sitting over Unity Catalog metric views, or in Cursor pointed at a database via MCP. Different vendors, but the same architectural foundation. The agents provide breadth across the database. The semantic layers provide direction. Accuracy gets closer to solved, and so does coverage of the warehouse itself.
And yet it hasn't reached most of the organization.
Today, these tools are put in the hands of a small group of specialists who can justify it. The rest of the organization submits requests and waits for the data still.
Governance explains some of it. Setup is another barrier, since wiring an agent to a semantic layer requires technical fluency most business users don't have.
But let's assume both are addressed, and a deeper problem emerges: the economics break at organizational scale. The pattern that works for ten power users does not survive a thousand employees asking real business questions.
That's where cost and time become the next step function.
The Cost of Getting There
A few years ago, the pitch was simple: LLM costs are falling fast, and eventually querying your data with AI will cost less than asking a human analyst. That happened. A query today costs a fraction of a percent of what a senior analyst costs to answer the same question.
FIG. 5 — LLM cost per query is falling below the cost of human analyst labor
And yet nobody is handing the whole company access to Claude Code or Cursor on top of their data stacks. The floodgates are still being held shut.
The cost has never really been about the LLM. It is about how these systems reason. Semantic layers anchor a portion of the data estate, but business users do not stay inside it. They ask questions that span the whole warehouse, and often beyond it, into the business applications, documents, and institutional knowledge that actually explain what the data means.
Outside the modeled slice, most queries start from zero. An agent rediscovers your schema, infers what columns mean, and verifies which joins are valid before it can answer anything. Flying blind is expensive.
Consequently, when the floodgates open, query volume does not increase incrementally. It explodes. And if your system reasons inefficiently, falling LLM prices do not save you. They enable more waste at higher volume. The cost bottleneck does not disappear. It moves from the LLM bill to the warehouse bill and the wasted tokens piled on top.
Snowflake and Databricks charge for every second their compute is running. LLMs charge for every token the model processes. Every extra reasoning step, every redundant table scan, every time an agent figures out something it has already seen before is money spent twice, on your warehouse bill and your inference bill at the same time.
As a result, the platform that wastes the least work at scale wins. Cost and time start to become two views of the same variable.
Three Constraints, Not One
This points to three constraints that matter for designing an agentic analytics system, not just one. Accuracy has dominated our attention. But cost and time, and breadth have not. That is why the industry is stuck.
Note that I am not framing these as benchmarks or evals. I believe that both are imperfect measures. A mentor once told me that evals for AI are "like putting lipstick on a pig." He was right. Treating each constraint as a score, not a design property, is what keeps us boxed into the current architecture patterns.
On Accuracy
Accuracy is still the principal measure of a system worth trusting. The right answer for a given question is what every system promises, and within modeled scope, the industry has broadly delivered.
But accuracy benchmarks miss most of what matters in production. They do not capture ambiguity, where an agent returns a technically correct answer to the wrong interpretation. They do not distinguish partially right answers from technically accurate but useless ones. They do not measure whether the system pulled from the right sources, or built what the user actually wanted.
Accuracy also has a control problem. Semantic modeling is the dominant path to accurate answers, but if every business user builds their own semantic layer, you get conflicting definitions of the same metric. Accuracy at the query level becomes incoherence at the organization level.
On Breadth
The best answers are only possible if the right data is in the room. Breadth extends larger than a single warehouse. It includes the lakes, the business applications, and the documentation that explains what any of it means. Most "AI-ready" benchmarks measure only a slice. The real question is what coverage of your data systems AI can actually reach, and whether it can pull reliable insights across them. Breadth determines whether useful questions get answered at all, because most real questions cross at least one boundary. A semantic layer designed around one tool can only take you so far when the answer lives somewhere else.
On Cost and Time
Accuracy and reach only matter at scale, though, when the economics of getting there work. Cost and time should be solved by systems that get more efficient as they accumulate knowledge. The property to look for is memory and reuse: each query should make the next one cheaper. Do the costs flatten over time? Does the time per query come down as the agent rediscovers less?
Marginal cost needs to trend downward as volume grows. A system whose cost-per-query stays flat as volume scales is a system that does not work in production, or one where access is deliberately constrained.
Most of what is worth measuring here is not an eval at all. It is observability and secondary measures of quality: user sentiment, response length, number of tool calls, failed executions, cost per query, LLM-judged missing context. Ease of modification in response to new learnings matters more than any one-shot benchmark score.
Closing The Gap: An Ontology Repo
These design considerations led us to the infrastructure bet at the core of TextQL: that the right shape for AI on structured data combines what semantic models and agents do well into something new. We call it an Ontology.
Ontologies have become a buzzword in the last year, and every vendor defines them differently. Ours is specific: a git-controlled, role-aware set of files containing rich documentation and a small set of structured definitions in a format we call .tql. Every part of it was designed for AI agents to navigate, which is why it rewards locality, tolerates redundancy, and stays readable to humans.
The benefits of this design are best illustrated across two queries:
First query: The ontology has not seen your data before. It searches exhaustively across systems, schemas, and files, mapping attributes and locating where business metrics actually live. What a data engineer would spend weeks modeling one domain at a time, the ontology does automatically, across your entire data estate at once. This is not scoped to a single warehouse or a single tool. An ontology can traverse the boundaries that semantic layers were never designed to cross.
FIG. 6 — On first query, the ontology maps the full data estate exhaustively
Second query: The ontology knows the terrain. It finds the shortest path to the answer directly. This is memory, working as a design property: no redundant scanning, no reasoning loops, no waiting for someone to model a new domain. And every query after that routes directly.
FIG. 7 — On subsequent queries, the ontology routes directly, driving efficiency
For the first time, the only barrier to asking a question is whether the data exists.
This is why ontology-driven queries running 7x faster isn't just a performance story. At the volume AI on data will operate at, it is a unit economics story. The real question is whether your stack is built to absorb that demand without the cost curve running away. We have shown that it can.
FIG. 8 — Query volume accelerates time-to-answer under an ontology-driven approach
AI Analytics Has Arrived
This isn't theoretical. There is a pattern we keep seeing inside customer deployments at TextQL. Every time we cut query costs by 1.5x, demand surges by 3x. Not incrementally. It jumps.
FIG. 9 — Weekly ACU consumption vs. messages sent: efficiency gains drive demand
This is Jevons paradox applied to enterprise analytics: the cheaper and faster access becomes, the more of it people want. Consumption expands to fill the new capacity. Questions nobody bothered requesting become routine. Workflows that needed a quarterly review become weekly.
This is the right outcome. It means AI on data is finally working as infrastructure, not just a specialist tool.
But it also changes how we should measure success. The era of conversational data analytics had a clean definition: hit accuracy on a benchmark and ship. The agentic analytics era needs a different one.
The signal that matters is virality. Not virality in the consumer sense, but in the enterprise sense: whether one team's adoption pulls the next team in. Virality is the only signal that proves all three constraints are working at once. If accuracy fails, trust collapses. If breadth is too narrow, each new team hits the wall. If cost and time don't scale, the organization caps usage before it can spread.
But virality isn't only proof that the system works. It is what makes the system work better. Each new query pays for the discovery that powers the next. Spread drives reuse, reuse drives cost down, cost down drives spread.
This is what AI on data was always supposed to do. The platforms that win this era will be the ones that get to the answer fastest, across the most data, at a cost that holds over time. That is what we are building at TextQL.