Data Lakehouse Architecture
A comprehensive guide to data lakehouse architecture, the modern analytical platform that combines open file formats on object storage with table format governance and SQL query engines to deliver warehouse performance with data lake flexibility.
Converging the Best of Two Worlds
For two decades, enterprise data infrastructure split into two separate ecosystems: the data warehouse (fast, governed, expensive, proprietary) and the data lake (flexible, cheap, open, ungoverned). Organizations operated both, incurring the costs and complexity of maintaining two systems, writing pipelines to move data between them, and managing the inevitable inconsistencies when data in the warehouse diverged from data in the lake.
The data lakehouse architecture resolves this split by applying the governance, performance, and reliability capabilities of the data warehouse directly on top of the cheap, open, flexible storage layer of the data lake. The result is a single analytical platform that stores data in open formats on commodity object storage while providing ACID transactions, schema evolution, time travel, metadata management, and SQL query performance comparable to purpose-built analytical databases.
The Four Layers of Lakehouse Architecture
Storage Layer: Object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) provides the foundation. Object storage is cheap (roughly $0.02-0.05/GB-month, versus roughly $0.30-0.50/GB-month for proprietary warehouse storage), scales to effectively unlimited capacity, is accessible by any authorized compute engine, and is highly durable (Amazon S3, for example, is designed for 11 nines of durability). Data is typically stored as Parquet files, a columnar format that enables efficient analytical reads.
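As a rough worked example of the pricing gap described above, the sketch below computes a monthly storage bill from the per-GB price ranges quoted in this section. The 50 TB dataset size and the function name are assumptions made for illustration; real prices vary by provider, region, and storage tier.

```python
# Illustrative arithmetic only. Prices are the per-GB-month ranges quoted in
# the text; the 50 TB dataset size is a hypothetical assumption.
GB_PER_TB = 1000  # decimal terabytes

OBJECT_STORAGE_PRICE = (0.02, 0.05)     # USD per GB-month (low, high)
WAREHOUSE_STORAGE_PRICE = (0.30, 0.50)  # USD per GB-month (low, high)


def monthly_cost_range(size_tb, price_range):
    """Return the (low, high) monthly storage cost in USD for size_tb terabytes."""
    size_gb = size_tb * GB_PER_TB
    low, high = price_range
    return (round(size_gb * low, 2), round(size_gb * high, 2))


if __name__ == "__main__":
    size_tb = 50  # hypothetical dataset size
    print("object storage:   ", monthly_cost_range(size_tb, OBJECT_STORAGE_PRICE))
    print("warehouse storage:", monthly_cost_range(size_tb, WAREHOUSE_STORAGE_PRICE))
```

Under these assumptions, 50 TB costs about $1,000-2,500 per month on object storage versus $15,000-25,000 per month at the warehouse storage rates quoted, before any compute charges.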
Table Format Layer: Apache Iceberg sits above raw object storage, adding the metadata and transaction layer that transforms raw Parquet files into ACID-compliant tables. Iceberg’s metadata hierarchy (catalog -> metadata file -> manifest list -> manifests -> data files) tracks which Parquet files belong to which table at each snapshot, enabling ACID commits, time travel queries, schema evolution, and query optimization through file-level statistics and partition pruning.
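To make the metadata hierarchy more concrete, here is a minimal in-memory sketch of the idea, not Iceberg's actual file layout or API. Every class name, the order_id column, and the file paths are hypothetical. It shows how immutable snapshots that reference manifests of data files can support time travel, and how file-level min/max statistics let a scan skip files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    min_order_id: int  # file-level statistics for a hypothetical order_id column
    max_order_id: int


@dataclass(frozen=True)
class Manifest:
    data_files: Tuple[DataFile, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: Tuple[Manifest, ...]  # stands in for the manifest list


@dataclass
class TableMetadata:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit_append(self, new_files: List[DataFile]) -> int:
        """Create a new snapshot: all previous manifests plus one new manifest."""
        previous: Tuple[Manifest, ...] = ()
        if self.current_snapshot_id is not None:
            previous = self.snapshots[self.current_snapshot_id].manifests
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, previous + (Manifest(tuple(new_files)),))
        self.current_snapshot_id = new_id  # older snapshots remain readable
        return new_id

    def plan_scan(self, order_id: int, snapshot_id: Optional[int] = None) -> List[str]:
        """Return paths of files that may contain order_id as of a snapshot."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        if sid is None:
            return []
        return [
            f.path
            for manifest in self.snapshots[sid].manifests
            for f in manifest.data_files
            if f.min_order_id <= order_id <= f.max_order_id  # skip files by stats
        ]


if __name__ == "__main__":
    table = TableMetadata()
    s1 = table.commit_append([DataFile("data/a.parquet", 1, 100)])
    s2 = table.commit_append([DataFile("data/b.parquet", 101, 200)])
    print(table.plan_scan(150))                  # ['data/b.parquet']
    print(table.plan_scan(150, snapshot_id=s1))  # [] (time travel to s1)
```

Because each commit only adds a new snapshot and never rewrites old ones, a reader pinned to an earlier snapshot keeps seeing a consistent table while writers continue to commit.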
Catalog Layer: The Iceberg catalog (Apache Polaris REST Catalog, AWS Glue, Project Nessie) manages table registrations, provides the namespace hierarchy (catalog -> database -> table), enforces access control, and serves as the authoritative source for each table’s current metadata pointer. Multiple compute engines connect to the same catalog, ensuring they all see the same consistent view of each table’s current state.
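The catalog's role as the authoritative holder of each table's metadata pointer can be sketched as a compare-and-swap: a writer's commit succeeds only if the pointer still matches the version it started from. The toy class below is an assumption-laden illustration of that general pattern, not the API of Polaris, Glue, or Nessie. The class, method names, and S3 paths are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the metadata pointer first."""


class ToyCatalog:
    """In-memory stand-in for a catalog: (namespace, table) -> metadata location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def register_table(self, namespace, table, metadata_location):
        with self._lock:
            key = (namespace, table)
            if key in self._pointers:
                raise ValueError(f"table already exists: {namespace}.{table}")
            self._pointers[key] = metadata_location

    def load_table(self, namespace, table):
        """Return the current metadata pointer that every engine should read."""
        with self._lock:
            return self._pointers[(namespace, table)]

    def commit(self, namespace, table, expected, new):
        """Atomically swap the pointer only if it still equals `expected`."""
        with self._lock:
            key = (namespace, table)
            if self._pointers[key] != expected:
                raise CommitConflict(f"{namespace}.{table} changed since {expected}")
            self._pointers[key] = new


if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.register_table("sales", "orders", "s3://example-bucket/orders/metadata/v1.json")

    # Two engines read the same starting version.
    base_a = catalog.load_table("sales", "orders")
    base_b = catalog.load_table("sales", "orders")

    catalog.commit("sales", "orders", base_a, "s3://example-bucket/orders/metadata/v2.json")
    try:
        catalog.commit("sales", "orders", base_b, "s3://example-bucket/orders/metadata/v3.json")
    except CommitConflict as exc:
        print("second writer must reload and retry:", exc)
```

The losing writer reloads the current pointer, reapplies its changes on top of the new version, and retries, which is how multiple engines can share one table without corrupting each other's commits.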
Compute Layer: Separate compute engines connect to the catalog and storage for different workload types. Apache Spark and Apache Flink handle batch and streaming data ingestion and transformation. Dremio provides governed SQL analytics for BI tools and AI agents through its Semantic Layer, Data Reflections, and Arrow Flight interface. DuckDB and Polars serve local analytics and data science exploration. The ability to use best-of-breed engines for each workload type, all operating against the same Iceberg tables in shared storage, is the lakehouse’s defining architectural advantage.

The Lakehouse vs. Traditional Data Warehouse
The data lakehouse outperforms the traditional warehouse model on several dimensions: cost (object storage pricing vs. proprietary warehouse storage), openness (Parquet/Iceberg can be read by any compatible engine without export), ML integration (Spark and Python tools read Iceberg directly for training, no export required), and multi-engine flexibility (each workload uses the most appropriate engine).
The data warehouse retains advantages in operational simplicity (fully managed, minimal expertise required), point query performance (row-level lookups optimized for transactional patterns), and the fully integrated governance, query, and visualization stack.
Dremio occupies the intersection. It provides the governed, managed query experience of a data warehouse (Semantic Layer, access control, Data Reflections for sub-second queries) while operating natively on open Iceberg tables in customer-controlled object storage, combining warehouse-style performance with lakehouse openness.