Amazon Web Services (AWS)
AWS is the original public cloud and the dominant hyperscaler. Its data portfolio (Redshift, S3, Kinesis, QuickSight, SageMaker, Glue, Athena, EMR) is the broadest in the industry. Each individual product is usually beaten by a best-of-breed alternative, but the bundle, plus account lock-in, is the moat.
Amazon Web Services is the original public cloud, the largest by revenue, and the company that essentially invented the modern data stack as a side effect of inventing modern cloud computing. If you work in data, you almost certainly touch an AWS product every day, even if you think you don't — because the warehouse you actually use (Snowflake, Databricks) is, in most accounts, running on top of AWS.
Plain-English version: AWS sells the building blocks of computers over the internet, by the hour, by the gigabyte, and by the API call. Inside that menu of building blocks is a complete data stack: somewhere to store data (S3), somewhere to query it (Redshift, Athena), somewhere to move it (Glue, Kinesis), somewhere to visualize it (QuickSight), and somewhere to train models on it (SageMaker). Almost none of these is the best-in-class option in its category. All of them are already on your AWS bill.
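The division of labor above can be made concrete through the conventions these services share. A minimal sketch (table name, partition scheme, and file names are illustrative, not from any real deployment): data files land in S3 under Hive-style `key=value` partition prefixes, Glue records the schema, and Athena then queries the prefix with ordinary SQL, scanning only the partitions the `WHERE` clause names.

```python
# Sketch of the S3 + Athena convention (all names hypothetical).
# Athena and Glue understand Hive-style partitioned layouts like:
#   s3://some-bucket/events/dt=2024-06-01/part-0000.parquet

def s3_object_key(table: str, dt: str, part: int) -> str:
    """Hive-style partitioned key that Glue crawlers and Athena recognize."""
    return f"{table}/dt={dt}/part-{part:04d}.parquet"

def athena_query(table: str, dt: str) -> str:
    """SQL Athena would run directly against the S3-resident data."""
    return (
        f"SELECT count(*) AS n FROM {table} "
        f"WHERE dt = '{dt}'"  # partition pruning: only this prefix is scanned
    )

print(s3_object_key("events", "2024-06-01", 0))
print(athena_query("events", "2024-06-01"))
```

The design point is that the "warehouse" here is just a naming convention over objects: because the partition key is encoded in the S3 path, Athena can skip whole prefixes without reading a byte, which is what makes serverless SQL on a lake economically sane.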
AWS launched publicly in March 2006 with S3 (Simple Storage Service), followed a few months later by EC2 (Elastic Compute Cloud). The standard origin myth — that AWS came out of Amazon's internal infrastructure being so good that they decided to sell it — is a useful simplification but not quite right. The actual idea was articulated by Andy Jassy and a small team in 2003-2004: Amazon wanted to be the "operating system for the internet," renting out the primitives that every developer needed. The internal Amazon infrastructure was a starting point and a proof that the model could work, not the reason it shipped.
For its first six years, AWS was about general-purpose compute and storage. Then in November 2012, at the very first AWS re:Invent conference, Amazon announced Redshift — a fully managed columnar data warehouse based on a license from ParAccel. This was the moment the cloud data warehouse era began. Redshift was 10x cheaper than Teradata and Oracle Exadata, you could spin one up in minutes, and it ran SQL well enough to replace a real warehouse for most companies. By 2015, Redshift was AWS's fastest-growing service ever to that point.
Everything else followed from there. Kinesis (streaming) launched in 2013. Athena (serverless SQL on S3) launched in 2016 and was an early sign that AWS was willing to build "lake-style" alternatives to its own warehouse. Glue (managed ETL) shipped in 2017. SageMaker launched at re:Invent 2017. Lake Formation (unified governance) followed in 2018. The pattern was always the same: see a category get hot in the broader market, ship an AWS-branded version of it within 12-24 months, integrate it with IAM and S3, and let the existing AWS customer base do the rest.
The core family: Redshift (warehouse), S3 (storage), Kinesis (streaming), Athena (serverless SQL on S3), Glue (ETL and catalog), EMR (managed Spark and Hadoop), QuickSight (BI), SageMaker (ML), Lake Formation (governance), and MSK (managed Kafka).
If you map this against the categories in the rest of the wiki, AWS has at least one product in every box: warehouse, storage, streaming, stream processing, BI, ML, ETL, orchestration, governance. That is the entire point. The only major box AWS does not seriously play in is the BI/visualization premium tier (Tableau, Power BI), and even there QuickSight exists.
There is a phrase that captures AWS's data strategy more honestly than anything Amazon will ever publish itself: always second best, always integrated.
If you do a head-to-head bake-off in any single data category, AWS usually loses. Snowflake outperforms Redshift on most analytical workloads. Databricks outperforms EMR on Spark. Confluent outperforms MSK on Kafka. Tableau and Power BI both outsell QuickSight by an order of magnitude. Even SageMaker is regularly out-architected by Databricks and Vertex AI.
But here is the trick: the bake-off is rarely the actual decision. The actual decision is whether to add another vendor to a stack that already has S3, IAM, VPC, and an enterprise AWS commitment. Picking Redshift means zero new contracts, zero new SSO integration, zero new bills, zero new procurement cycles. Picking Snowflake means all four. The friction asymmetry is enormous, and AWS engineers it deliberately.
This is also why AWS keeps shipping more and more "lake-style" products that quietly compete with its own warehouse. Athena cannibalizes Redshift on S3-resident data; Glue cannibalizes EMR; Lake Formation cannibalizes the Redshift permissions model; SageMaker Lakehouse cannibalizes most of it. AWS is fine with internal cannibalization because the customer always stays in AWS. The unit of competition is "the account," not "the product."
The AWS data portfolio is the broadest in the industry and the most operationally reliable. That's the good part. The bad part is that almost every individual product is a generation behind the best-of-breed alternative, and AWS only catches up under serious pressure (Iceberg support in Athena and Glue arrived about three years after the rest of the industry; Redshift's serverless story took years to mature).
Where AWS has actually won on technical merit: S3, IAM, and the AWS networking stack. These are genuinely the best-in-class building blocks, and they're so good that most modern data tools are built on them. Snowflake's storage layer is S3 under the hood. Databricks runs on EC2. Trino on Iceberg on AWS is a perfectly serious architecture. So in a deep sense, AWS won data infrastructure even as it lost most of the individual data product categories: the cloud is AWS, and the data lives on AWS, even if the brand on the query engine isn't Amazon.
The risk for AWS is that the storage primitives commoditize — which they kind of already have, since S3, GCS, and ADLS are roughly interchangeable — and the value migrates upward to whoever owns the query engine and the metadata. AWS has been working hard since 2023 to rebuild its position higher in the stack (Amazon Q, Bedrock, the SageMaker Lakehouse rebrand, the December 2024 Iceberg-focused S3 Tables launch). It's plausible that this works. It's also plausible that AWS becomes the "low-margin substrate" while the high-margin layer goes to Snowflake, Databricks, and the AI labs.
TextQL Ana connects natively to Redshift, Athena, and Glue Catalog, and TextQL deployments commonly run inside customer AWS accounts via PrivateLink for data residency and security. For AWS-native shops, the typical pattern is: data lands in S3 via Kinesis Firehose or Glue, gets cataloged in Glue, queried by Redshift or Athena, and asked questions of by Ana. Because almost every modern data stack has at least an S3 dependency, AWS is the most common substrate underneath TextQL deployments, even when the brand-name warehouse on top is Snowflake or Databricks.
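The ingestion half of that pattern follows a fixed convention worth spelling out. By default, Kinesis Data Firehose batches records and delivers them to S3 under UTC time-based prefixes, which a Glue crawler can then catalog so Athena or Redshift can query them. A minimal sketch of the prefix logic (the bucket and stream are hypothetical; the `YYYY/MM/dd/HH/` format is Firehose's documented default):

```python
# Sketch: where Firehose-delivered objects land in S3 (names hypothetical).
# Default Firehose S3 prefix is time-based and UTC, e.g.
#   s3://my-bucket/2024/06/01/13/my-stream-1-...
from datetime import datetime, timezone

def firehose_style_prefix(ts: datetime) -> str:
    """Default Firehose S3 delivery prefix: YYYY/MM/dd/HH/ in UTC."""
    return ts.strftime("%Y/%m/%d/%H/")

print(firehose_style_prefix(datetime(2024, 6, 1, 13, tzinfo=timezone.utc)))
```

Once objects accumulate under these prefixes, a Glue crawler pointed at the bucket infers the schema and registers a table in the Glue Catalog; from there the same data is visible to Athena, Redshift Spectrum, and anything else that reads the catalog, which is the "lands in S3, gets cataloged in Glue, queried by Redshift or Athena" path described above.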