DataHub | Data Ecosystem Wiki

DataHub

DataHub is the open-source data catalog originally built at LinkedIn and now commercialized by Acryl Data. It is the leading open-source challenger to Atlan and the preferred catalog at companies with strong in-house platform teams.

It is also the strongest open-source entry in the modern catalog wave and the only catalog in this list that you can realistically run yourself for free in production.

The project's ancestry explains everything about its design. In 2014 LinkedIn released WhereHows, an internal metadata system built because LinkedIn's data platform team was drowning in Hadoop datasets nobody could find. WhereHows worked but did not scale. In 2019 LinkedIn open-sourced a complete rewrite called DataHub, built by Shirshanka Das and Mars Lan on top of a push-based, event-sourced metadata architecture. In 2020 Shirshanka and colleague Swaroop Jagadish left LinkedIn to found Acryl Data as the commercial steward of the project, raising funding from 8VC, LinkedIn itself, and Insight Partners.

Unlike Amundsen (see below), DataHub did not stall after leaving its birthplace. Acryl has been an active, serious maintainer, and the open-source project has real momentum: roughly 10,000 GitHub stars, a large Slack community, and production deployments at Netflix, Pinterest, Expedia, and LinkedIn itself.

The Technical Bet: Push, Not Pull

Most legacy catalogs (Collibra, Alation) work by pulling metadata: running scheduled crawlers against your sources to ingest schemas, query logs, and lineage. DataHub's defining architectural bet is to support a push model in addition to pull. Producers of metadata — the warehouse, dbt, Airflow, Spark, Kafka, BI tools — emit metadata change events into DataHub via a Kafka-based event stream. The catalog is eventually consistent with the state of the world rather than a stale snapshot from last night's crawl.

The payoff is real: more accurate lineage, faster propagation of changes, and a cleaner story for modern streaming architectures. The cost is also real: push architecture is more complex to operate and pushes more responsibility onto the producers.
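The push-versus-pull difference can be made concrete with a toy model. The sketch below is a minimal illustration using only the Python standard library; it is not DataHub's API. The `ChangeEvent` shape, the `ToyCatalog` class, and the in-memory deque standing in for the Kafka stream are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical event shape: one aspect of one entity changed."""
    entity: str    # e.g. a table name
    aspect: str    # e.g. "schema" or "owner"
    value: object  # the new value of that aspect


class ToyCatalog:
    """In-memory stand-in for a catalog that consumes a change stream."""

    def __init__(self):
        self.stream = deque()  # stands in for the Kafka topic
        self.state = {}        # (entity, aspect) -> latest known value

    def emit(self, event):
        # Push path: the producer publishes as soon as something changes.
        self.stream.append(event)

    def consume(self):
        # Catalog side: apply events in arrival order; last write wins.
        applied = 0
        while self.stream:
            event = self.stream.popleft()
            self.state[(event.entity, event.aspect)] = event.value
            applied += 1
        return applied

    def crawl(self, source):
        # Pull path: snapshot the source; later changes stay invisible
        # until the next scheduled crawl.
        for key, value in source.items():
            self.state[key] = value


warehouse = {("orders", "schema"): ["id", "amount"]}
catalog = ToyCatalog()
catalog.crawl(warehouse)  # the nightly pull

# The schema changes after the crawl, so the catalog is now stale.
warehouse[("orders", "schema")] = ["id", "amount", "currency"]
stale = list(catalog.state[("orders", "schema")])

# The producer pushes the change and the catalog catches up.
catalog.emit(ChangeEvent("orders", "schema", warehouse[("orders", "schema")]))
catalog.consume()
fresh = catalog.state[("orders", "schema")]
```

Between the crawl and the pushed event the catalog still holds the old two-column schema. That window is the "stale snapshot" a pull-only catalog lives with until its next run. The operational cost shows up here too: the producer has to call `emit` reliably.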

Under the hood, DataHub uses Kafka for the change stream, Elasticsearch for search, MySQL/Postgres for primary storage, and Neo4j (or an equivalent graph backend) for lineage traversal. The metadata model is an extensible schema defined in PDL (Pegasus Data Language, a LinkedIn-origin IDL), which lets teams add custom aspects to any entity. This extensibility is one of the main reasons large platform teams pick DataHub: they can model their own quirky data landscape without waiting for a vendor to ship a feature.
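The idea of attaching custom aspects to any entity can be sketched in a few lines. This is a simplified illustration in plain Python, not PDL and not DataHub's SDK. The `Entity` and `AspectRegistry` classes, the `retentionPolicy` aspect, and the choice to keep every version of an aspect are assumptions made for the example. The URN string follows the dataset URN convention DataHub uses, with made-up platform and table names.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity is an identifier plus a bag of named aspects."""
    urn: str
    aspects: dict = field(default_factory=dict)  # aspect name -> list of versions

    def upsert(self, name, payload):
        # Append rather than overwrite so earlier versions stay inspectable.
        self.aspects.setdefault(name, []).append(payload)

    def latest(self, name):
        versions = self.aspects.get(name)
        return versions[-1] if versions else None


class AspectRegistry:
    """Hypothetical registry: which aspects exist and which fields they need."""

    def __init__(self):
        self._required = {}

    def register(self, name, required_fields):
        self._required[name] = set(required_fields)

    def validate(self, name, payload):
        if name not in self._required:
            raise KeyError(f"unknown aspect: {name}")
        missing = self._required[name] - set(payload)
        if missing:
            raise ValueError(f"{name} is missing fields: {sorted(missing)}")


registry = AspectRegistry()
registry.register("ownership", ["owner"])
registry.register("retentionPolicy", ["days"])  # a team-specific custom aspect

table = Entity("urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)")
updates = [
    ("ownership", {"owner": "data-platform"}),
    ("retentionPolicy", {"days": 90}),
    ("retentionPolicy", {"days": 30}),
]
for name, payload in updates:
    registry.validate(name, payload)
    table.upsert(name, payload)
```

Registering `retentionPolicy` requires no change to `Entity`. That is the property the extensible schema is meant to provide: new metadata shapes are added alongside the built-in ones instead of waiting on a vendor.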

What DataHub Actually Does

Feature-for-feature, DataHub is competitive with every major catalog:

Search over tables, columns, dashboards, pipelines, ML features, and models, with relevance tuned by tags, ownership, and usage signals.
Column-level lineage across SQL warehouses (Snowflake, BigQuery, Redshift, Databricks), dbt, Airflow, Spark, and Looker, parsed from query logs and transformation code. The depth is widely considered on par with Atlan's.
dbt integration that is a first-class ingestion source, not an afterthought.
Glossary and tags, including business glossary terms with approval states.
Domain and data product concepts, aligned with data-mesh thinking and championed hard by the Acryl team.
Policies and access control at the catalog layer (who can see which assets, who can propose changes).
ML and feature catalog — a distinguishing strength, since DataHub grew up at LinkedIn alongside their feature store, so ML models, features, and training datasets are first-class entities in the metadata model.
DataHub Cloud (Acryl) — the managed commercial offering that adds SSO/RBAC, advanced observability, automated metadata tests ("assertions"), and enterprise support.
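Of the features above, column-level lineage is the easiest to illustrate in code. Once edges between columns have been extracted, impact analysis is a graph traversal. The sketch below uses a breadth-first walk over a hand-written edge map. The column names and the `downstream` helper are hypothetical, and the sketch says nothing about how DataHub stores or queries lineage internally.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> downstream columns.
EDGES = {
    "raw.orders.amount": ["stg.orders.amount_usd"],
    "raw.orders.currency": ["stg.orders.amount_usd"],
    "stg.orders.amount_usd": ["mart.revenue.total", "mart.revenue.avg_order"],
    "mart.revenue.total": ["dashboard.revenue.total_tile"],
}


def downstream(column, edges):
    """Return every column reachable from `column`, in breadth-first order."""
    reached = []
    visited = {column}
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in visited:  # guards against cycles and duplicates
                visited.add(child)
                reached.append(child)
                queue.append(child)
    return reached


# Which columns are affected if raw.orders.currency changes?
impacted = downstream("raw.orders.currency", EDGES)
```

Here `impacted` contains the staging column, both mart columns, and the dashboard tile, in that order. This is the "what breaks if I change this column" question that column-level lineage exists to answer.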

Open Source vs Acryl Cloud

DataHub comes in two flavors that matter to buyers:

DataHub open source — fully featured, Apache 2.0, runs on Kubernetes. You can genuinely operate it in production, and large platform teams (LinkedIn, Netflix, Pinterest) do. The cost is operational: running Kafka, Elasticsearch, Neo4j, MySQL, and a microservices backend is real work, and if you are not already a Kubernetes-comfortable shop, it will hurt.

Acryl DataHub Cloud — Acryl's managed SaaS built on the open-source core, with enterprise features layered on top: fine-grained access control, no-code ingestion, advanced data observability, automated metadata propagation, and a polished UI. This is what Acryl actually sells. The pitch is: you get the architectural benefits of DataHub without having to run Kafka yourself.

This two-tier structure, the same open-core model used by GitLab, Grafana, Confluent, and a dozen other commercial-open-source companies, is what makes DataHub credible as both a free community project and a viable enterprise purchase.

The Opinionated Take

DataHub is the open-source challenger that actually matters. In the modern catalog race, there are really only four serious answers: Atlan, DataHub, Collibra, and Alation. Of those, DataHub is the only one an engineering team can adopt without talking to a salesperson, which is a structural advantage in bottom-up data platform teams.

DataHub wins where the buyer is an engineering-led platform team that (a) has strong opinions about running its own infrastructure, (b) values open standards and avoiding vendor lock-in, (c) has custom metadata needs that don't fit off-the-shelf SaaS, or (d) already runs Kafka and Kubernetes and is not scared of more. Many of its biggest users — Netflix, Pinterest, LinkedIn, Expedia — fit all four criteria. For these teams, DataHub is strictly better than Atlan because it bends to their model instead of the other way around.

DataHub loses when the buyer is a Head of Data at a 500-person company who wants a catalog running by Thursday and does not care about push-vs-pull architectural purity. For them, Atlan's multi-tenant SaaS is an easier sell and will deliver value faster. Acryl Cloud narrows this gap meaningfully but has historically not had the same brand momentum as Atlan in that buyer segment.

The longer-term bet. Open-source catalogs have a structural advantage in an emerging world where metadata standards matter more than specific products. If OpenLineage, Iceberg, and emerging catalog protocols standardize how tools exchange metadata, the open project with the largest community is best placed to win. DataHub is currently that project, and its open-source license makes it a natural metadata interchange hub for a multi-vendor stack.

TextQL Fit

TextQL integrates with DataHub via its GraphQL API and Kafka event stream to pull table descriptions, column metadata, glossary terms, ownership, and column-level lineage. The alignment is particularly strong for customers with data-mesh-style domain structures, where DataHub's domain and data-product model can be mirrored directly into TextQL's scoping. For engineering-led teams already running DataHub in production, TextQL becomes a natural-language interface on top of the metadata they already curate.
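As a rough illustration of what reading metadata over a GraphQL API involves, the sketch below builds (but does not send) a dataset search request and parses a response of the expected shape. It is not TextQL's integration code. The endpoint URL, the token, and the exact query fields are assumptions that should be checked against the GraphQL schema of the DataHub deployment in question.

```python
import json
import urllib.request

# Assumption: placeholder URL; the real host and path depend on the deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Assumption: field names follow a typical search query and may need adjusting
# to match the deployed GraphQL schema.
SEARCH_QUERY = """
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, token, count=10, url=GRAPHQL_URL):
    """Build an HTTP POST carrying a GraphQL dataset search; nothing is sent."""
    body = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": 0, "count": count}
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_urns(response_json):
    """Pull entity URNs out of a decoded response of the shape queried above."""
    results = response_json["data"]["search"]["searchResults"]
    return [result["entity"]["urn"] for result in results]


request = build_search_request("orders", token="<personal-access-token>")
# To execute: urllib.request.urlopen(request), then json.load() the response.
```

Keeping request construction separate from execution makes the payload easy to inspect and test without a running DataHub instance.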

DataHub

Open-sourced: 2019 (LinkedIn)
Commercial entity: Acryl Data, founded 2020
Founders (Acryl): Shirshanka Das, Swaroop Jagadish (both ex-LinkedIn)
HQ: Sunnyvale, California
Origin: LinkedIn; evolved from WhereHows, their internal catalog
Category: Data Catalog
License: Apache 2.0
Notable users: LinkedIn, Netflix, Expedia, Pinterest, Hurdle (formerly Peloton Interactive's data team), Stash
Monthly mindshare: ~25K · ~10K GitHub stars; LinkedIn OSS; smaller commercial footprint