What is Amazon SageMaker Lakehouse?
Amazon SageMaker Lakehouse is an open, unified lakehouse architecture that brings together Amazon S3 data lakes (including S3 Tables) and Amazon Redshift warehouses on a single copy of data. Analytics and AI/ML workloads access the same data without moving or duplicating it.
The lakehouse addresses the classic problem of separate data silos: data lakes and data warehouses often evolve in parallel, leading to redundant copies, ETL pipelines that exist only to keep those copies in sync, and inconsistent permissions. SageMaker Lakehouse builds on the open Apache Iceberg standard and exposes Iceberg-compatible APIs, so any Iceberg-compatible engine can query the data in place.
Core Features
- Unified data foundation: Brings together S3 data lakes (including S3 Tables) and Redshift warehouses so that analytics and AI/ML run on a single copy of data with no data movement.
- Open Apache Iceberg standard: Iceberg-compatible APIs allow queries with Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Apache Spark, and compatible third-party tools directly in place.
- Fine-grained access control: Centralized permissions at the table, column, row, and cell level via tag-based, attribute-based, or role-based policies, applied consistently across all engines through AWS Lake Formation and the AWS Glue Data Catalog.
- Zero-ETL and federation: Additional data arrives via zero-ETL integrations from operational databases and applications, query federation to external sources, and catalog federation for remote Apache Iceberg tables.
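To make the "Iceberg-compatible APIs" point concrete, here is a minimal sketch of how an Apache Spark job might be pointed at the lakehouse through an Iceberg REST catalog. The property keys are standard Apache Iceberg Spark catalog settings. The helper name `iceberg_rest_catalog_conf`, the catalog name `lakehouse`, the endpoint URL pattern, and the use of the AWS account ID as the warehouse value are assumptions for illustration, so check them against the current AWS documentation before relying on them.

```python
def iceberg_rest_catalog_conf(catalog_name: str, region: str, account_id: str) -> dict[str, str]:
    """Build Spark properties for an Iceberg REST catalog signed with SigV4.

    Hypothetical helper: the endpoint pattern and warehouse value below are
    assumptions, not confirmed by the text above.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (e.g. MERGE INTO, CALL procedures).
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # Assumed endpoint pattern for the Glue Iceberg REST API.
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Assumed: the account ID identifies the default catalog.
        f"{prefix}.warehouse": account_id,
        # Sign REST requests with AWS credentials.
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }


# Example values are placeholders.
conf = iceberg_rest_catalog_conf("lakehouse", "eu-central-1", "123456789012")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed flags can be passed to `spark-submit`. The Spark job additionally needs the Iceberg Spark runtime and AWS bundle on its classpath, and permissions in Lake Formation still decide which tables and columns the job can read.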
Typical Use Cases
Unifying data lake and warehouse: Bring existing S3 data lakes and Redshift warehouses together without copying data or mirroring it through ETL pipelines. Teams work on a consistent copy of data.
Cross-engine analytics: Query the same data with different engines such as EMR, Glue, Athena, Redshift, or Apache Spark in place, depending on workload and team, without maintaining separate copies.
Governance across engines: Define fine-grained permissions once in Lake Formation and enforce them consistently across all accessing engines, down to the column, row, and cell level.
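As a rough illustration of "define once, enforce everywhere", the sketch below builds the request for a column-level `SELECT` grant in AWS Lake Formation. The request shape follows the boto3 `grant_permissions` call of the `lakeformation` client. The helper name `column_grant_request`, the role ARN, and the database, table, and column names are hypothetical placeholders.

```python
def column_grant_request(principal_arn: str, database: str, table: str, columns: list[str]) -> dict:
    """Build keyword arguments for a column-level SELECT grant.

    Hypothetical helper: it only assembles the request and does not call AWS.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become readable for the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Placeholder identifiers for illustration only.
request = column_grant_request(
    "arn:aws:iam::123456789012:role/AnalystRole",
    "sales",
    "orders",
    ["order_id", "order_date", "total"],
)
print(request)

# To apply the grant (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
```

Because the grant lives in Lake Formation rather than in an individual engine, the same column restriction is meant to apply whether the table is read from Athena, Redshift, EMR, or Glue.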
Benefits
- A single copy of data for analytics and AI/ML instead of redundant copies and ETL pipelines
- Open Apache Iceberg standard avoids lock-in and enables free choice of engine
- Consistent, fine-grained access control across all engines
- Usage-based billing with no upfront cost
Integration with innFactory
As an AWS Reseller, innFactory supports you with the adoption and operation of this service.
Frequently Asked Questions
What is Amazon SageMaker Lakehouse?
Amazon SageMaker Lakehouse is an open, unified lakehouse architecture built on Apache Iceberg. It brings together Amazon S3 data lakes (including S3 Tables) and Amazon Redshift warehouses so that analytics and AI/ML run on a single copy of data, with no data movement or duplication required.
When should I use Amazon SageMaker Lakehouse?
Use it when data currently sits in separate S3 data lakes and Redshift warehouses and you want to unify those silos without ETL copies. It also fits when multiple engines such as EMR, Glue, Athena, Redshift, or Apache Spark need to query the same data in place with consistent, fine-grained access control.
How much does Amazon SageMaker Lakehouse cost?
Billing is pay-as-you-go with no upfront cost and is charged through the underlying components: AWS Glue Data Catalog metadata storage and API requests (with a free tier), storage and compute for data held in Amazon S3 or Redshift Managed Storage, and automated statistics and Iceberg table maintenance. Actual costs depend on usage.
Which query engines does SageMaker Lakehouse support?
Because SageMaker Lakehouse is built on the open Apache Iceberg standard and exposes Iceberg-compatible APIs, any Iceberg-compatible engine can query the data in place. This includes Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Apache Spark, and compatible third-party tools.