Overview

1 The world of the Apache Iceberg Lakehouse

The chapter introduces the data lakehouse as a response to decades of trade-offs in data architecture, aiming to balance cost, performance, flexibility, and governance. It traces the evolution from OLTP systems to enterprise data warehouses, which delivered analytics but at high cost and rigidity, then to cloud warehouses that improved elasticity yet reinforced data movement, premium pricing, and lock-in. Hadoop-era data lakes lowered storage costs and embraced all data types but often devolved into “data swamps” due to weak consistency, slow metadata operations, and difficult governance. The lakehouse emerges to combine the warehouse’s reliability and speed with the lake’s openness and affordability, enabling analytics on a single, canonical copy of data.

Apache Iceberg is presented as the key enabler of the lakehouse: an open, vendor-agnostic table format that makes file-based datasets behave like robust database tables across multiple engines. It introduces a layered metadata design (table metadata, manifest lists, and manifests) that accelerates planning and enables fine-grained pruning, dramatically reducing unnecessary scans. Iceberg brings warehouse-grade guarantees to the lake—ACID transactions, schema and partition evolution, hidden partitioning to prevent accidental full scans, and time travel for versioned analytics and recovery. Broad ecosystem support across engines and platforms allows teams to collaborate on the same datasets without replication, minimizing ETL sprawl while preserving openness and portability.
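
To make the metadata layering concrete, here is a minimal sketch (illustrative only: it assumes a REST catalog at http://localhost:8181, an existing analytics.events table with an event_date column, and the pyiceberg library) of how an engine can plan a filtered scan and prune data files using only metadata, before reading any data.

```python
# Illustrative sketch: assumes a REST catalog at http://localhost:8181,
# an existing table "analytics.events", and the pyiceberg package installed.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake", uri="http://localhost:8181")
table = catalog.load_table("analytics.events")

# The table metadata points at the current snapshot; its manifest list and
# manifests carry the partition and column statistics used for pruning.
print(table.current_snapshot())

# Plan a filtered scan: only data files whose statistics could match the
# predicate are returned, without opening the files themselves.
scan = table.scan(row_filter="event_date == '2024-06-01'")
for task in scan.plan_files():
    print(task.file.file_path)
```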

The chapter also frames when and why to implement an Iceberg-based lakehouse: to ensure cross-team consistency from a single source of truth, run high-performance analytics directly on lake storage, and reduce duplication and costs. It outlines a modular architecture with five interoperable components—storage, ingestion, catalog, federation, and consumption—so organizations can scale each independently and avoid vendor lock-in. The catalog serves as the authoritative entry point for tables and governance; the federation layer provides semantic modeling, unification across sources, and acceleration; and the consumption layer supports BI, AI/ML, operational apps, and data products. Together, these choices yield a scalable, cost-efficient, and AI-ready platform that preserves flexibility while delivering warehouse-like reliability.

  • The evolution of data platforms from on-prem warehouses to data lakehouses.
  • The role of the table format in data lakehouses.
  • The anatomy of a lakehouse table: metadata files and data files.
  • The structure and flow of an Apache Iceberg table read and write operation.
  • How engines use metadata statistics to skip data files, speeding up queries.
  • How engines scan older snapshots, each with its own list of files, to read earlier versions of the data.
  • The components of a complete data lakehouse implementation.

Summary

  • Data lakehouse architecture combines the scalability and cost-efficiency of data lakes with the performance, ease of use, and structure of data warehouses, solving key challenges in governance, query performance, and cost management.
  • Apache Iceberg is a modern table format that enables high-performance analytics, schema evolution, ACID transactions, and metadata scalability. It transforms data lakes into structured, mutable, governed storage platforms.
  • Iceberg eliminates significant pain points of OLTP databases, enterprise data warehouses, and Hadoop-based data lakes, including high costs, rigid schemas, slow queries, and inconsistent data governance.
  • With features like time travel, partition evolution, and hidden partitioning, Iceberg reduces storage costs, simplifies ETL, and optimizes compute resources, making data analytics more efficient.
  • Iceberg integrates with query engines (Trino, Dremio, Snowflake), processing frameworks (Spark, Flink), and open lakehouse catalogs (Nessie, Polaris, Gravitino), enabling modular, vendor-agnostic architectures.
  • The Apache Iceberg Lakehouse has five key components: storage, ingestion, catalog, federation, and consumption.

FAQ

What is a data lakehouse and how does it differ from data lakes and data warehouses?

A data lakehouse combines the low-cost, flexible storage of data lakes with the performance, governance, and ease of use of data warehouses by layering an open table format (like Apache Iceberg) over files in object storage. Unlike traditional warehouses (high performance but costly and proprietary) and raw lakes (cheap but inconsistent and hard to govern), a lakehouse delivers warehouse-like reliability and speed on open, interoperable storage.

Why did the lakehouse paradigm emerge?

Earlier architectures had trade-offs: OLTP databases weren’t designed for large analytical scans; enterprise data warehouses were rigid and expensive; cloud warehouses improved elasticity but still suffered from premium costs, data movement, and vendor lock-in; Hadoop-era lakes were flexible and cheap but slow, inconsistent, and hard to govern. The lakehouse addresses these by unifying performance, consistency, openness, and cost efficiency.

What is Apache Iceberg and why is it central to a lakehouse?

Apache Iceberg is an open, vendor-agnostic table format that makes files in distributed storage behave like fully managed tables with ACID guarantees, schema evolution, and efficient scans. It enables multiple engines (e.g., Dremio, Spark, Flink, Trino, Snowflake) to share one canonical dataset, reducing ETL, replication, and lock-in while delivering warehouse-like performance on a data lake.
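
As a rough illustration of what "files behaving like tables" looks like in practice, here is a minimal PySpark sketch. The catalog name, warehouse path, and table schema are assumptions, and it presumes the matching iceberg-spark-runtime package is on the Spark classpath.

```python
# Minimal sketch with illustrative names and paths. Assumes the matching
# iceberg-spark-runtime jar is on the Spark classpath and a local filesystem
# warehouse; swap in S3/ADLS/GCS paths and a production catalog as needed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Iceberg's SQL extensions enable MERGE, ALTER TABLE ... PARTITION FIELD, etc.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "lake" backed by a Hadoop-style warehouse path
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# days(event_ts) is a hidden-partitioning transform: readers and writers never
# reference a separate partition column, yet filters on event_ts prune partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        id BIGINT,
        user_id BIGINT,
        event_ts TIMESTAMP,
        action STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

spark.sql("""
    INSERT INTO lake.db.events VALUES
        (1, 100, TIMESTAMP '2024-06-01 10:00:00', 'login'),
        (2, 101, TIMESTAMP '2024-06-02 11:30:00', 'purchase')
""")

# An ordinary predicate on event_ts is translated into partition pruning automatically.
spark.sql("""
    SELECT * FROM lake.db.events
    WHERE event_ts >= TIMESTAMP '2024-06-02 00:00:00'
""").show()
```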

What problems with Hive-style tables did Apache Iceberg set out to solve?
  • Slow metadata operations at scale (millions of partitions)
  • Cumbersome or limited schema evolution
  • Weak or format-dependent ACID guarantees
  • Inefficient queries that scan entire directories instead of relevant files
How does Iceberg’s metadata model accelerate queries?

Iceberg uses a multi-layer metadata structure discovered via a catalog:

  • metadata.json (table-level): schemas, partitioning, snapshots
  • Manifest lists (snapshot-level): groups of manifests with summary stats
  • Manifests (file-level): file paths and column/partition stats

This enables aggressive pruning so engines read only relevant files, boosting performance and lowering compute costs.
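
One way to see these layers is through Iceberg's metadata tables in Spark. The sketch below reuses the illustrative spark session and lake.db.events table from the earlier example; the snapshots, manifests, and files tables roughly correspond to the metadata.json, manifest-list, and manifest layers.

```python
# Illustrative sketch, reusing the `spark` session and lake.db.events table
# from the earlier example. Iceberg exposes its metadata layers as queryable
# metadata tables.

# Snapshot history recorded in the table metadata (metadata.json layer)
spark.sql("SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots").show()

# Manifests referenced by the current snapshot's manifest list
spark.sql("SELECT path, added_data_files_count FROM lake.db.events.manifests").show()

# Per-file entries with the partition values and column statistics used for pruning
spark.sql("SELECT file_path, record_count, partition FROM lake.db.events.files").show()
```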

What ACID guarantees does Iceberg provide and why do they matter?

Iceberg provides atomicity, consistency, isolation, and durability for all table operations. This prevents partial or conflicting writes, ensures stable reads during concurrent updates via snapshots, and makes multi-writer lakes reliable without external locking or complex workarounds.
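
As an illustration (again reusing the assumed spark session and lake.db.events table), a MERGE either commits as a single new snapshot or fails without leaving partial results, and concurrent readers continue to see the prior snapshot until the commit lands.

```python
# Illustrative sketch, reusing `spark` and lake.db.events from above.
# The MERGE commits atomically as one new snapshot; readers see either the
# old snapshot or the new one, never a partially applied change.
spark.sql("""
    MERGE INTO lake.db.events AS t
    USING (SELECT 2 AS id, 'refund' AS action) AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.action = u.action
    WHEN NOT MATCHED THEN INSERT (id, user_id, event_ts, action)
        VALUES (u.id, NULL, current_timestamp(), u.action)
""")
```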

How do schema and partition evolution (and hidden partitioning) work in Iceberg?
  • Schema evolution: add, rename, or delete columns without rewriting tables or breaking queries
  • Partition evolution: change partition strategies over time without costly rewrites
  • Hidden partitioning: automatically applies partition filters so users don’t need to specify partition columns, reducing accidental full-table scans
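
A hedged sketch of what these operations look like as DDL, reusing the assumed spark session and lake.db.events table; all are metadata changes that avoid rewriting existing data files.

```python
# Illustrative sketch, reusing `spark` and lake.db.events from above.
# Schema evolution: metadata-only changes, no data files are rewritten.
spark.sql("ALTER TABLE lake.db.events ADD COLUMNS (country STRING)")
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN action TO event_type")

# Partition evolution (requires the Iceberg SQL extensions): new data is
# written with the new spec, while existing files keep their old layout.
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, user_id)")
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(event_ts)")
```
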
What is time travel in Iceberg and when would I use it?

Time travel lets you query previous table snapshots exactly as they existed at a point in time. Use it for auditing, debugging, reproducible analytics, and rolling back unintended changes—all without duplicating datasets or manual backups.
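
For example (reusing the assumed spark session and lake.db.events table, and a recent Spark version), you can list the table's snapshots and then query it as of a snapshot ID or a timestamp.

```python
# Illustrative sketch, reusing `spark` and lake.db.events from above.
# List snapshots, then query the table as it existed at an earlier point.
spark.sql("SELECT snapshot_id, committed_at FROM lake.db.events.snapshots").show()

# Replace 1234567890 with a real snapshot_id from the query above.
spark.sql("SELECT * FROM lake.db.events VERSION AS OF 1234567890").show()

# Or query by wall-clock time
spark.sql(
    "SELECT * FROM lake.db.events TIMESTAMP AS OF '2024-06-02 00:00:00'"
).show()
```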

How does an Iceberg lakehouse reduce cost and data duplication?

By querying a single canonical copy in the data lake across multiple tools, Iceberg minimizes copies into separate warehouses and marts. Its metadata pruning reduces scanned data and compute spend, and the same table supports both streaming and batch ingestion—simplifying pipelines and lowering storage and processing costs.

What are the core components of an Apache Iceberg lakehouse?
  • Storage layer: Object stores/filesystems (e.g., S3, Azure Blob, GCS, HDFS) holding Parquet/ORC/Avro and Iceberg metadata
  • Ingestion layer: Batch and streaming into Iceberg (e.g., Spark, Flink, Kafka Connect, Fivetran, Estuary, Qlik Talend Cloud)
  • Catalog layer: Tracks tables and metadata (e.g., AWS Glue, Project Nessie, Apache Polaris, Gravitino, Lakekeeper, Iceberg REST Catalog)
  • Federation layer: Modeling, semantic layer, and acceleration (e.g., Dremio, Trino, dbt)
  • Consumption layer: BI, AI/ML, apps and APIs (e.g., Tableau, Power BI, Looker, Preset; Databricks, Jupyter, Hugging Face, LangChain; REST/GraphQL)
