Amazon Web Services (AWS)
AWS is the original public cloud and the dominant hyperscaler. Its data portfolio (Redshift, S3, Kinesis, QuickSight, SageMaker, Glue, Athena, EMR) is the broadest in the industry. Each individual product is usually beaten by a best-of-breed alternative, but the bundle, plus account lock-in, is the moat.
Amazon Web Services is the original public cloud, the largest by revenue, and the company that essentially invented the modern data stack as a side effect of inventing modern cloud computing. If you work in data, you almost certainly touch an AWS product every day, even if you think you don't — because the warehouse you actually use (Snowflake, Databricks) is, in most accounts, running on top of AWS.
Plain-English version: AWS sells the building blocks of computers over the internet, by the hour, by the gigabyte, and by the API call. Inside that menu of building blocks is a complete data stack: somewhere to store data (S3), somewhere to query it (Redshift, Athena), somewhere to move it (Glue, Kinesis), somewhere to visualize it (QuickSight), and somewhere to train models on it (SageMaker). Almost none of these is the best-in-class option in its category. All of them are already on your AWS bill.
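The division of labor above can be made concrete through the conventions these services share. A minimal sketch (table name, partition scheme, and file names are illustrative, not from any real deployment): data files land in S3 under Hive-style `key=value` partition prefixes, Glue records the schema, and Athena then queries the prefix with ordinary SQL, scanning only the partitions the `WHERE` clause names.

```python
# Sketch of the S3 + Athena convention (all names hypothetical).
# Athena and Glue understand Hive-style partitioned layouts like:
#   s3://some-bucket/events/dt=2024-06-01/part-0000.parquet

def s3_object_key(table: str, dt: str, part: int) -> str:
    """Hive-style partitioned key that Glue crawlers and Athena recognize."""
    return f"{table}/dt={dt}/part-{part:04d}.parquet"

def athena_query(table: str, dt: str) -> str:
    """SQL Athena would run directly against the S3-resident data."""
    return (
        f"SELECT count(*) AS n FROM {table} "
        f"WHERE dt = '{dt}'"  # partition pruning: only this prefix is scanned
    )

print(s3_object_key("events", "2024-06-01", 0))
print(athena_query("events", "2024-06-01"))
```

The design point is that the "warehouse" here is just a naming convention over objects: because the partition key is encoded in the S3 path, Athena can skip whole prefixes without reading a byte, which is what makes serverless SQL on a lake economically sane.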
AWS launched publicly in March 2006 with S3 (Simple Storage Service), followed a few months later by EC2 (Elastic Compute Cloud). The standard origin myth — that AWS came out of Amazon's internal infrastructure being so good that they decided to sell it — is a useful simplification but not quite right. The actual idea was articulated by Andy Jassy and a small team in 2003-2004: Amazon wanted to be the "operating system for the internet," renting out the primitives that every developer needed. The internal Amazon infrastructure was a starting point and a proof that the model could work, not the reason it shipped.
For its first six years, AWS was about general-purpose compute and storage. Then in November 2012, at the very first AWS re:Invent conference, Amazon announced Redshift — a fully managed columnar data warehouse based on a license from ParAccel. This was the moment the cloud data warehouse era began. Redshift was 10x cheaper than Teradata and Oracle Exadata, you could spin one up in minutes, and it ran SQL well enough to replace a real warehouse for most companies. By 2015, Redshift was AWS's fastest-growing service ever to that point.
Everything else followed from there. Kinesis (streaming) launched in 2013. Athena (serverless SQL on S3) launched in 2016 and was an early sign that AWS was willing to build "lake-style" alternatives to its own warehouse. Glue (managed ETL) shipped in 2017. SageMaker launched at re:Invent 2017. Lake Formation (unified governance) followed in 2018. The pattern was always the same: see a category get hot in the broader market, ship an AWS-branded version of it within 12-24 months, integrate it with IAM and S3, and let the existing AWS customer base do the rest.
The core family: Redshift (warehouse), S3 (storage), Kinesis (streaming), Athena (serverless SQL on S3), Glue (ETL and catalog), EMR (managed Spark and Hadoop), QuickSight (BI), SageMaker (ML), Lake Formation (governance), and MSK (managed Kafka).
If you map this against the categories in the rest of the wiki, AWS has at least one product in every box: warehouse, storage, streaming, stream processing, BI, ML, ETL, orchestration, governance. That is the entire point. The only major box AWS does not seriously play in is the BI/visualization premium tier (Tableau, Power BI), and even there QuickSight exists.
There is a phrase that captures AWS's data strategy more honestly than anything Amazon will ever publish itself: always second best, always integrated.
If you do a head-to-head bake-off in any single data category, AWS usually loses. Snowflake outperforms Redshift on most analytical workloads. Databricks outperforms EMR on Spark. Confluent outperforms MSK on Kafka. Tableau and Power BI both outsell QuickSight by an order of magnitude. Even SageMaker is regularly out-architected by Databricks and Vertex AI.
But here is the trick: the bake-off is rarely the actual decision. The actual decision is whether to add another vendor to a stack that already has S3, IAM, VPC, and an enterprise AWS commitment. Picking Redshift means zero new contracts, zero new SSO integration, zero new bills, zero new procurement cycles. Picking Snowflake means all four. The friction asymmetry is enormous, and AWS engineers it deliberately.
This is also why AWS keeps shipping more and more "lake-style" products that quietly compete with its own warehouse. Athena cannibalizes Redshift on S3-resident data; Glue cannibalizes EMR; Lake Formation cannibalizes the Redshift permissions model; SageMaker Lakehouse cannibalizes most of it. AWS is fine with internal cannibalization because the customer always stays in AWS. The unit of competition is "the account," not "the product."
The AWS data portfolio is the broadest in the industry and the most operationally reliable. That's the good part. The bad part is that almost every individual product is a generation behind the best-of-breed alternative, and AWS only catches up under serious pressure (Iceberg support in Athena and Glue arrived about three years after the rest of the industry; Redshift's serverless story took years to mature).
Where AWS has actually won on technical merit: S3, IAM, and the AWS networking stack. These are genuinely the best-in-class building blocks, and they're so good that most modern data tools are built on them. Snowflake's storage layer is S3 under the hood. Databricks runs on EC2. Trino on Iceberg on AWS is a perfectly serious architecture. So in a deep sense, AWS won data infrastructure even as it lost most of the individual data product categories: the cloud is AWS, and the data lives on AWS, even if the brand on the query engine isn't Amazon.
The risk for AWS is that the storage primitives commoditize — which they kind of already have, since S3, GCS, and ADLS are roughly interchangeable — and the value migrates upward to whoever owns the query engine and the metadata. AWS has been working hard since 2023 to rebuild its position higher in the stack (Amazon Q, Bedrock, the SageMaker Lakehouse rebrand, the December 2024 Iceberg-focused S3 Tables launch). It's plausible that this works. It's also plausible that AWS becomes the "low-margin substrate" while the high-margin layer goes to Snowflake, Databricks, and the AI labs.
TextQL Ana connects natively to Redshift, Athena, and Glue Catalog, and TextQL deployments commonly run inside customer AWS accounts via PrivateLink for data residency and security. For AWS-native shops, the typical pattern is: data lands in S3 via Kinesis Firehose or Glue, gets cataloged in Glue, queried by Redshift or Athena, and asked questions of by Ana. Because almost every modern data stack has at least an S3 dependency, AWS is the most common substrate underneath TextQL deployments, even when the brand-name warehouse on top is Snowflake or Databricks.
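The ingestion half of that pattern follows a fixed convention worth spelling out. By default, Kinesis Data Firehose batches records and delivers them to S3 under UTC time-based prefixes, which a Glue crawler can then catalog so Athena or Redshift can query them. A minimal sketch of the prefix logic (the bucket and stream are hypothetical; the `YYYY/MM/dd/HH/` format is Firehose's documented default):

```python
# Sketch: where Firehose-delivered objects land in S3 (names hypothetical).
# Default Firehose S3 prefix is time-based and UTC, e.g.
#   s3://my-bucket/2024/06/01/13/my-stream-1-...
from datetime import datetime, timezone

def firehose_style_prefix(ts: datetime) -> str:
    """Default Firehose S3 delivery prefix: YYYY/MM/dd/HH/ in UTC."""
    return ts.strftime("%Y/%m/%d/%H/")

print(firehose_style_prefix(datetime(2024, 6, 1, 13, tzinfo=timezone.utc)))
```

Once objects accumulate under these prefixes, a Glue crawler pointed at the bucket infers the schema and registers a table in the Glue Catalog; from there the same data is visible to Athena, Redshift Spectrum, and anything else that reads the catalog, which is the "lands in S3, gets cataloged in Glue, queried by Redshift or Athena" path described above.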