DS/ML Platforms

ML platforms are end-to-end environments for training, deploying, and operating machine learning models. The dominant players are Databricks and SageMaker, but the LLM era is rewriting the entire category.

A data science / machine learning platform is the environment where models get built, trained, deployed, and operated. The job description has expanded over the last decade: in 2015 an ML platform was mostly a Jupyter notebook with a GPU attached. In 2020 it was a full pipeline — feature store, training cluster, model registry, deployment service, monitoring. In 2026 it is all of that plus LLM fine-tuning, vector databases, RAG pipelines, agent orchestration, and a confusing new vocabulary that nobody quite agrees on.

Think of an ML platform as the IDE for machine learning — the place where every step from raw data to production model lives. Without one, ML teams stitch together notebooks, GitHub repos, EC2 instances, S3 buckets, and Slack threads, and nothing is reproducible. With one, the entire ML lifecycle has a home.

What an ML Platform Actually Does

A modern ML platform typically covers some or all of the following:

1. Notebooks and IDEs. A hosted environment where data scientists write code, usually Jupyter or a Jupyter clone, with GPU attachment, library management, and shared state.

2. Training infrastructure. Compute clusters (CPU and GPU) that you can spin up to train a model on a real dataset, then spin down. Distributed training (across many GPUs) lives here.

3. Experiment tracking. Every model run logs its hyperparameters, metrics, and artifacts so you can compare versions and reproduce results. MLflow is the dominant open-source standard for this.
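The core of an experiment tracker is small: record each run's parameters, metrics, and artifacts, then query across runs. The sketch below is a hypothetical, hand-rolled illustration of that pattern (MLflow provides the production version via `mlflow.log_param` / `mlflow.log_metric`); the `ExperimentTracker` class and its fields are assumptions for illustration only.

```python
import time
import uuid

# Minimal experiment-tracking sketch (illustrative, not the MLflow API):
# every run logs its hyperparameters, metrics, and artifact paths so
# versions can be compared and results reproduced.
class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, artifacts=None):
        run = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,        # e.g. {"lr": 0.01, "epochs": 10}
            "metrics": metrics,      # e.g. {"rmse": 0.23}
            "artifacts": artifacts or [],
        }
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric, minimize=True):
        # Compare all logged runs on a single metric.
        return sorted(self.runs,
                      key=lambda r: r["metrics"][metric],
                      reverse=not minimize)[0]

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1}, {"rmse": 0.31})
tracker.log_run({"lr": 0.01}, {"rmse": 0.23})
print(tracker.best_run("rmse")["params"])  # {'lr': 0.01}
```

Everything else a real tracker adds — a UI, remote storage, autologging — is layered on top of this log-then-compare loop.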

4. Feature store. A centralized place to define, compute, and serve the input features models need. The same customer_lifetime_value feature gets used during training and inference, eliminating training-serving skew. Tecton, Feast, and the warehouses' native feature stores compete here.
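The skew-elimination claim is easiest to see in code: a feature is defined once, and both training-set construction and online inference read from the same materialized values. The `FeatureStore` class below is a hypothetical sketch of that contract, not any vendor's API.

```python
# Toy feature-store sketch (hypothetical): features are defined once,
# computed from raw data, and the same stored values are read at both
# training time and inference time -- which is what eliminates
# training-serving skew.
class FeatureStore:
    def __init__(self):
        self._definitions = {}   # feature name -> fn(raw_row) -> value
        self._online = {}        # entity_id -> {feature name: value}

    def define(self, name, fn):
        self._definitions[name] = fn

    def materialize(self, entity_id, raw_row):
        # Compute every registered feature from raw data and cache it.
        self._online[entity_id] = {
            name: fn(raw_row) for name, fn in self._definitions.items()
        }

    def get_features(self, entity_id, names):
        # Training pipelines and serving endpoints call the same method.
        row = self._online[entity_id]
        return [row[n] for n in names]

store = FeatureStore()
store.define("customer_lifetime_value",
             lambda row: sum(row["purchases"]))
store.materialize("cust_42", {"purchases": [120.0, 80.0]})
print(store.get_features("cust_42", ["customer_lifetime_value"]))  # [200.0]
```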

5. Model registry. A versioned catalog of trained models, with metadata, lineage, and stage tags (dev, staging, prod).
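A registry is essentially a versioned dictionary with promotion rules, so serving code can ask for "the current prod model" rather than a hard-coded artifact path. A minimal sketch, with hypothetical names and example `s3://` URIs invented for illustration:

```python
# Toy model-registry sketch (hypothetical): each registered model gets an
# auto-incremented version, metadata for lineage, and a stage tag that is
# promoted through dev -> staging -> prod.
class ModelRegistry:
    STAGES = ("dev", "staging", "prod")

    def __init__(self):
        self._models = {}  # model name -> list of version records

    def register(self, name, artifact_uri, metadata=None):
        versions = self._models.setdefault(name, [])
        versions.append({
            "version": len(versions) + 1,
            "artifact_uri": artifact_uri,
            "metadata": metadata or {},
            "stage": "dev",          # every new version starts in dev
        })
        return versions[-1]["version"]

    def promote(self, name, version, stage):
        assert stage in self.STAGES
        self._models[name][version - 1]["stage"] = stage

    def latest(self, name, stage):
        # Serving code asks for "the current prod model", not a path.
        candidates = [v for v in self._models[name] if v["stage"] == stage]
        return max(candidates, key=lambda v: v["version"])

registry = ModelRegistry()
registry.register("churn", "s3://models/churn/v1")
v2 = registry.register("churn", "s3://models/churn/v2")
registry.promote("churn", v2, "prod")
print(registry.latest("churn", "prod")["artifact_uri"])  # s3://models/churn/v2
```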

6. Deployment / serving. Wrapping a trained model in an API endpoint that production applications can call. Real-time, batch, or streaming.
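Real-time serving boils down to wrapping a trained model in a request handler that parses input, runs inference, and serializes a response. The sketch below uses a stand-in model and a hypothetical `ModelEndpoint` class; a real platform puts this behind a web framework, autoscaling, and authentication.

```python
import json

# Real-time serving sketch (hypothetical): a trained model is wrapped
# behind a handler so production applications can call it over HTTP.
class ModelEndpoint:
    def __init__(self, model_fn):
        self.model_fn = model_fn  # any callable: features -> prediction

    def handle(self, request_body: str) -> str:
        # Parse the request, run inference, serialize the response.
        payload = json.loads(request_body)
        prediction = self.model_fn(payload["features"])
        return json.dumps({"prediction": prediction})

# Stand-in for a trained churn classifier (invented threshold).
endpoint = ModelEndpoint(lambda feats: 1 if feats["clv"] < 100 else 0)
print(endpoint.handle('{"features": {"clv": 50}}'))  # {"prediction": 1}
```

Batch and streaming deployment reuse the same `model_fn`; only the transport around it changes.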

7. Monitoring. Watching deployed models for drift, latency, accuracy degradation, and bias. The "data observability" of the ML world.
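Drift detection, at its simplest, compares a live window of a feature against its training-time baseline. Production monitors use statistical tests such as PSI or Kolmogorov-Smirnov; the sketch below (hypothetical, stdlib-only) uses the simplest version of the same idea, a z-score on the window mean.

```python
import statistics

# Drift-monitoring sketch (hypothetical): flag a feature whose live
# window mean has shifted significantly from the training baseline.
def mean_shift_zscore(baseline, window):
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    # Standard error of the window mean under the baseline distribution.
    se = sigma / len(window) ** 0.5
    return abs(statistics.mean(window) - mu) / se

baseline     = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]  # training data
live_ok      = [10, 11, 10, 9, 11]                      # looks the same
live_drifted = [15, 16, 14, 15, 16]                     # shifted upward

print(mean_shift_zscore(baseline, live_ok) > 3)       # False
print(mean_shift_zscore(baseline, live_drifted) > 3)  # True
```

The same pattern generalizes to latency, accuracy, and bias metrics: pick a baseline, pick a window, alert on the gap.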

8. Pipeline orchestration. Stitching all of the above into reproducible workflows, usually via Kubeflow, Airflow, Metaflow, or platform-specific DSLs.
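The common abstraction underneath Kubeflow, Airflow, and Metaflow is a DAG of tasks executed in dependency order. A minimal sketch of that model, with hypothetical step names:

```python
# Pipeline-orchestration sketch (hypothetical): steps declare their
# upstream dependencies and the runner executes them in dependency
# order, which is what makes the workflow reproducible end to end.
def run_pipeline(steps, deps):
    """steps: name -> callable(log); deps: name -> upstream names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)          # prerequisites always run first
        steps[name](order)
        done.add(name)

    for name in steps:
        run(name)
    return order

steps = {
    "deploy":   lambda log: log.append("deploy"),
    "ingest":   lambda log: log.append("ingest"),
    "train":    lambda log: log.append("train"),
    "features": lambda log: log.append("features"),
}
deps = {"features": ["ingest"], "train": ["features"], "deploy": ["train"]}
print(run_pipeline(steps, deps))
# ['ingest', 'features', 'train', 'deploy']
```

Note that declaration order does not matter: even though `deploy` is listed first, the dependency edges force ingest → features → train → deploy.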

The platforms in the "ML platform" category bundle some or all of these into a single product. The ones that bundle them all are large and expensive. The ones that bundle a few are cheap and modular.

A Brief History

ML platforms as a commercial category started around 2014-2016 with companies like Domino Data Lab, Dataiku, and DataRobot, all of which sold the idea that data scientists needed a managed environment instead of bare EC2 boxes. The early pitch was mostly about productivity: a notebook with reproducibility, collaboration, and one-click deployment.

In 2017, AWS launched SageMaker, which immediately validated the category and gave AWS-native enterprises a default choice. Google Cloud followed with Vertex AI (originally AI Platform). Azure followed with Azure ML. The big cloud vendors all wanted their cut.

Around the same time, Databricks — which had started as the commercial company behind Apache Spark — built MLflow as an open-source project for experiment tracking and model registry. MLflow quickly became the default open standard for the experiment-tracking layer and gave Databricks a wedge into the ML platform market. By 2020, "Databricks ML" was a complete platform: notebooks, AutoML, MLflow, model serving, and feature store, all on top of the Databricks lakehouse.

The 2020-2022 generation of MLOps tooling (Weights & Biases for experiment tracking, Tecton for feature stores, Determined AI for training, Run:ai for GPU scheduling, Modal and Replicate for model hosting) was a Cambrian explosion of point solutions. Most have either been acquired (W&B by CoreWeave, Determined by HPE) or have become part of larger platforms.

Then 2023 happened. The release of ChatGPT, the explosion of open-source LLMs, and the flood of enterprise generative AI projects rewrote the category in eighteen months. Suddenly, "ML platform" had to also mean "LLM platform" — fine-tuning, RAG, vector databases, agents, prompt management, evaluation. The classical ML lifecycle did not go away, but it stopped being the most exciting part of the platform.

The Opinionated Take

Two players dominate the classical ML platform category in 2026: Databricks and Amazon SageMaker. Databricks is the data-team-led choice (the data scientist is already in the lakehouse, so adding ML is one click away). SageMaker is the AWS-engineering-led choice (the team is already deep in AWS, so SageMaker is the path of least resistance). Vertex AI and Azure ML round out the big-cloud trio. Independent vendors like DataRobot, Dataiku, and Domino exist but have steadily lost share to the cloud-native and lakehouse-native alternatives.

The most interesting thing about the category is that the LLM era is changing everything underneath the labels. The classical MLOps stack — feature stores, training clusters, model registries — is still useful for the regression and classification models that power churn prediction, fraud detection, and recommendation. But the new generative AI workloads have a completely different shape. They use foundation models trained by someone else, not custom models. They need prompt management, not feature stores. They need evaluation frameworks, not accuracy dashboards. They need agent orchestration, not training clusters.

The big platforms are scrambling to add these. Databricks acquired MosaicML in 2023 for $1.3B specifically to be the LLM training platform. SageMaker offers a JumpStart catalog of foundation models and fine-tuning services, and AWS pairs it with Bedrock for managed foundation-model inference. Vertex AI ships Gemini models natively. Snowflake has Cortex. None of these offerings is yet as polished as the classical ML side of the same platforms, which is what creates room for new entrants like Modal, Together, Anyscale, Replicate, OctoAI, and the LLMOps startups.

The honest prediction: the classical ML platform category will continue to consolidate around Databricks, SageMaker, and Vertex. The generative AI platform category is wide open and will have its own winners over the next few years. Most enterprises will end up with both — a classical ML platform for their old workloads and a separate (or layered) GenAI platform for the new ones.

How TextQL Works with ML Platforms

TextQL Ana is, in part, a generative AI application — it uses LLMs to translate natural language into SQL against a customer's data. Customers running ML platforms care about Ana for two reasons. First, Ana exposes the outputs of their classical ML models (churn scores, customer segments, fraud flags) as queryable features in natural language, so business users can ask "show me customers with the highest churn risk" without writing SQL. Second, Ana itself is an example of the kind of LLM-powered application the new generation of ML platforms is built to host. TextQL is not a competitor to Databricks or SageMaker; it is a downstream consumer of the data those platforms produce.

See TextQL in action

DS/ML Platforms
Category: Model training, serving, MLOps
Also called: MLOps platforms, ML lifecycle platforms
Dominant vendors: Databricks ML, Amazon SageMaker, Vertex AI, DataRobot
Open source: MLflow, Kubeflow, Metaflow, Ray
Era shift: Classical ML → Deep learning → Generative AI / LLMs
Typical users: Data scientists, ML engineers, MLOps engineers
Monthly mindshare: ~500K · ML engineers and data scientists; broad category