Architecture
Embucket delivers a Snowflake-compatible lakehouse through a zero-disk architecture, an approach in which all persistent state resides in object storage rather than on local compute nodes. This document explains how Embucket’s components work together to provide a complete analytics platform in a single binary.
You’ll learn about the system’s architecture, core components, data flow, and operational characteristics. This overview targets data engineers, platform architects, and anyone deploying Embucket in production.
System overview
Embucket implements a zero-disk lakehouse architecture, a system design where compute nodes maintain no persistent state. Object storage serves as the single source of truth for both data and metadata. This design eliminates operational complexity while delivering horizontal scalability and fault tolerance.
Key architectural principles:
- Stateless compute: Nodes store no data locally, so any node can be replaced immediately
- Object storage persistence: All data and metadata live in your S3 bucket
- Single binary deployment: No external dependencies or complex installations
- Query-per-node processing: Each node handles complete queries independently
Core components
Embucket integrates four key components into a single binary that delivers complete lakehouse capabilities.
Query engine
Apache DataFusion powers Embucket’s SQL execution. DataFusion provides:
- Vectorized processing: Columnar data processing using Apache Arrow
- SQL compliance: ANSI SQL support with Snowflake-specific extensions
- Extensibility: Custom functions and data source integrations
Each Embucket node runs a complete query engine, enabling horizontal scalability and fault tolerance.
Data storage
Apache Iceberg manages table metadata and provides ACID guarantees:
- Schema evolution: Add, drop, and change columns without rewriting data
- Time travel: Query historical table snapshots
- ACID transactions: Atomic commits with snapshot isolation
- Partition pruning: Skips partitions that cannot match a query, which keeps scans efficient on large datasets
All table data stays in your object storage using Parquet files organized by Iceberg’s metadata structure.
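To make the snapshot model concrete, here is a minimal Python sketch of how a table that only ever adds immutable snapshots can serve time travel reads. The `Table` and `Snapshot` classes and the file names are hypothetical illustrations, not Iceberg's or Embucket's API. Real Iceberg metadata tracks manifests and file statistics rather than plain tuples of paths.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the data files visible at commit time."""
    snapshot_id: int
    data_files: tuple[str, ...]


@dataclass
class Table:
    """Toy table whose history is an append-only list of snapshots."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit_append(self, new_files: list[str]) -> Snapshot:
        # A commit never mutates earlier snapshots. It adds a new snapshot
        # that references the previous files plus the new files.
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + tuple(new_files))
        self.snapshots.append(snapshot)
        return snapshot

    def files_as_of(self, snapshot_id: int | None = None) -> tuple[str, ...]:
        # Reading "as of" an older snapshot ID is the time travel read.
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.data_files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = Table()
table.commit_append(["data/part-0001.parquet"])
table.commit_append(["data/part-0002.parquet"])
```

Calling `table.files_as_of(1)` returns only the first file, while `table.files_as_of()` returns both. Because older snapshots are never rewritten, a historical read needs no copy of the data.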
Metadata management
SlateDB stores catalog metadata directly in object storage:
- Embedded log-structured merge (LSM) tree: No external database dependencies
- Object storage native: Writes sorted string table (SST) files directly to S3
- Consistency guarantees: Strong consistency for metadata operations
- Crash recovery: Automatic recovery from node failures
SlateDB removes the need for an external database while keeping metadata durable.
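The following Python sketch illustrates the general log-structured merge idea behind this design: writes collect in an in-memory table and are flushed as immutable, key-sorted files to object storage. It is a toy model under stated assumptions, not SlateDB's implementation or API. The `ToyLsmStore` class, the flush threshold, and the object key naming are hypothetical, a plain dictionary stands in for the S3 bucket, and the sketch leaves out the durability and compaction machinery a production LSM store needs.

```python
from __future__ import annotations


class ToyLsmStore:
    """Toy LSM-style key-value store that flushes sorted runs to an object store."""

    def __init__(self, object_store: dict[str, list[tuple[str, str]]],
                 flush_threshold: int = 2) -> None:
        self.object_store = object_store      # stands in for an S3 bucket
        self.flush_threshold = flush_threshold
        self.memtable: dict[str, str] = {}
        self.sst_keys: list[str] = []         # oldest first

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        if not self.memtable:
            return
        # Hypothetical naming scheme for the flushed file.
        sst_key = f"metadata/{len(self.sst_keys):05d}.sst"
        # An SST is an immutable, key-sorted run of entries.
        self.object_store[sst_key] = sorted(self.memtable.items())
        self.sst_keys.append(sst_key)
        self.memtable = {}

    def get(self, key: str) -> str | None:
        if key in self.memtable:
            return self.memtable[key]
        # Search the newest SST first so later writes shadow earlier ones.
        for sst_key in reversed(self.sst_keys):
            for stored_key, value in self.object_store[sst_key]:
                if stored_key == key:
                    return value
        return None


bucket: dict[str, list[tuple[str, str]]] = {}
store = ToyLsmStore(bucket)
store.put("db1.table1", "v1")
store.put("db1.table2", "v1")   # reaches the threshold and flushes one SST
store.put("db1.table1", "v2")   # the newer value stays in the memtable
```

After these calls the bucket holds a single sorted file, and `store.get("db1.table1")` returns the newer value from memory. Flushed files are never modified in place, which is why this style of storage maps well onto object stores.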
API compatibility
Embucket implements Snowflake’s SQL dialect and REST API:
- SQL compatibility: Runs existing queries written in Snowflake-flavored SQL
- REST API: v1 Snowflake REST API for driver compatibility
- Tool integration: Works with dbt, Apache Superset, and BI tools
- Protocol support: Works with the native drivers, all of which are built on top of the REST API
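As a rough illustration of what driver compatibility could look like in practice, the sketch below builds connection parameters for the Snowflake Python connector and points them at an Embucket endpoint. Every value here is an assumption for illustration: the helper name `embucket_connection_params`, the host, the port, the `http` protocol, and the credentials are placeholders rather than documented Embucket defaults. Check your deployment's configuration for the real values.

```python
def embucket_connection_params(host: str, port: int) -> dict[str, object]:
    """Build keyword arguments for a Snowflake driver aimed at an Embucket endpoint.

    All values are placeholders for illustration, not documented defaults.
    """
    return {
        "host": host,
        "port": port,
        "protocol": "http",          # assumption: plain HTTP on a local test endpoint
        "account": "placeholder",    # assumption: the driver expects some account value
        "user": "example_user",      # hypothetical credentials
        "password": "example_password",
    }


params = embucket_connection_params("localhost", 3000)

# With snowflake-connector-python installed, the parameters would be used as:
#   import snowflake.connector
#   conn = snowflake.connector.connect(**params)
#   conn.cursor().execute("SELECT 1")
```

Keeping the parameters in one helper makes it easy to switch an existing script between a Snowflake account and an Embucket endpoint without touching the query code.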
Storage architecture
Embucket treats object storage as the database, storing both data and metadata in your S3 bucket using open formats.
Data organization
Your data follows this structure in object storage:
s3://your-bucket/
├── volume/                  # Embucket volume location
│   ├── database1/
│   │   ├── table1/
│   │   │   ├── metadata/    # Embucket table metadata files
│   │   │   └── data/        # Embucket data files
│   │   └── table2/
│   └── database2/
└── metadata/                # Embucket metadata
    ├── manifest-*.sst       # Embucket metadata manifest files
    └── *.sst                # Embucket metadata SST files
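If you need to locate a table's files programmatically, the layout shown above can be expressed as a small helper. This is a sketch that assumes the example names from the tree (`your-bucket`, `volume`, `database1`, `table1`). The `TableLocation` class is hypothetical and not part of Embucket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableLocation:
    """Object-storage prefixes for one table, following the example layout."""
    bucket: str
    database: str
    table: str
    volume: str = "volume"  # hypothetical volume prefix taken from the example tree

    @property
    def root(self) -> str:
        return f"s3://{self.bucket}/{self.volume}/{self.database}/{self.table}"

    @property
    def metadata_prefix(self) -> str:
        # Table metadata files live under this prefix.
        return f"{self.root}/metadata/"

    @property
    def data_prefix(self) -> str:
        # Data files live under this prefix.
        return f"{self.root}/data/"


location = TableLocation(bucket="your-bucket", database="database1", table="table1")
```

Here `location.data_prefix` evaluates to `s3://your-bucket/volume/database1/table1/data/`, which you could pass to any S3 listing tool to inspect a table's data files.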
ACID guarantees
Embucket provides ACID properties through Iceberg’s snapshot isolation:
- Atomicity: All changes within a transaction succeed or fail together
- Consistency: Schema validation ensures data integrity
- Isolation: Concurrent queries see consistent snapshots
- Durability: Committed changes persist in object storage
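The atomicity and isolation properties above rest on a simple idea: a commit replaces the table's current snapshot pointer only if no other writer has changed it in the meantime. The Python sketch below models that optimistic commit. The `ToyCatalog` class and `CommitConflict` exception are hypothetical, and a thread lock stands in for whatever atomic compare-and-swap the real catalog uses.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Holds one pointer per table to its current snapshot ID."""

    def __init__(self) -> None:
        self._current: dict[str, int] = {}
        self._lock = threading.Lock()  # stands in for an atomic compare-and-swap

    def current_snapshot(self, table: str) -> int:
        return self._current.get(table, 0)

    def commit(self, table: str, expected: int, new: int) -> None:
        # The swap succeeds only if nobody committed since `expected` was read.
        with self._lock:
            found = self._current.get(table, 0)
            if found != expected:
                raise CommitConflict(f"{table}: expected {expected}, found {found}")
            self._current[table] = new


catalog = ToyCatalog()
base = catalog.current_snapshot("table1")        # both writers read snapshot 0

catalog.commit("table1", expected=base, new=1)   # the first writer wins

try:
    catalog.commit("table1", expected=base, new=2)  # the second writer is stale
    conflict = False
except CommitConflict:
    conflict = True  # the losing writer would re-read the table and retry
```

Readers that started on snapshot 0 keep seeing snapshot 0, and the failed commit leaves no partial state behind, which is how a single pointer swap yields both isolation and atomicity in this model.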
Deployment patterns
Embucket’s stateless architecture supports three deployment patterns depending on your requirements.
Single node deployment
Deploy one Embucket instance for development or small workloads:
- Simple setup with minimal resource requirements
- Suitable for development and testing environments
- Single point of failure during maintenance
Multi-node deployment
Deploy two or more Embucket instances behind a load balancer:
- High availability and fault tolerance
- Horizontal scaling for increased query throughput
- Load balancer distributes queries across healthy nodes
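Because each node handles complete queries on its own and holds no local state, routing can stay simple. The sketch below shows one possible policy, round-robin over the nodes currently marked healthy. The `RoundRobinBalancer` class and the node names are hypothetical. In a real deployment this role belongs to your load balancer and its health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy balancer that rotates queries across nodes currently marked healthy."""

    def __init__(self, nodes: list[str]) -> None:
        self.nodes = nodes
        self.healthy = set(nodes)
        self._rotation = cycle(nodes)

    def mark_down(self, node: str) -> None:
        self.healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self) -> str:
        # Nodes are stateless, so any healthy node can serve any query.
        for _ in range(len(self.nodes)):
            node = next(self._rotation)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
balancer.mark_down("node-b")
picks = [balancer.pick() for _ in range(4)]
```

With `node-b` marked down, the four picks alternate between `node-a` and `node-c`. No session affinity is needed, since a failed node takes no data with it.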