Architecture

Embucket delivers a Snowflake-compatible lakehouse through a zero-disk architecture, in which all persistent state resides in object storage rather than on local compute nodes. This document explains how Embucket’s components work together to provide a complete analytics platform in a single binary.

You’ll learn about the system’s architecture, core components, data flow, and operational characteristics. This overview targets data engineers, platform architects, and anyone deploying Embucket in production.

System overview

Embucket implements a zero-disk lakehouse architecture: compute nodes maintain no persistent state, and object storage serves as the single source of truth for both data and metadata. This design reduces operational complexity while providing horizontal scalability and fault tolerance.

Key architectural principles:

Stateless compute: Nodes store no data locally, so any node can be replaced instantly
Object storage persistence: All data and metadata live in your S3 bucket
Single binary deployment: No external dependencies or complex installations
Query-per-node processing: Each node handles complete queries independently
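
The stateless-compute principle can be illustrated with a minimal sketch. The `ObjectStore` and `Node` classes below are hypothetical stand-ins written for this example, not Embucket APIs; the point is only that a node holding nothing but a reference to shared storage can be thrown away without losing data.

```python
class ObjectStore:
    """Hypothetical stand-in for an S3 bucket: the only place state lives."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


class Node:
    """Hypothetical compute node that keeps no state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # a reference to shared storage, not a local copy

    def write_rows(self, table, rows):
        self.store.put(table, list(rows))

    def count_rows(self, table):
        # Each node answers the whole query by itself (query-per-node).
        return len(self.store.get(table))


store = ObjectStore()
node_a = Node("a", store)
node_a.write_rows("events", [1, 2, 3])

del node_a                       # simulate losing the node
node_b = Node("b", store)        # a replacement starts with no warm-up
replacement_count = node_b.count_rows("events")
```

Because the replacement node reads from the same store, it sees everything the lost node wrote.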

Core components

Embucket integrates four key components into a single binary that delivers complete lakehouse capabilities.

Query engine

Apache DataFusion powers Embucket’s SQL execution. DataFusion provides:

Vectorized processing: Columnar data processing using Apache Arrow
SQL compliance: ANSI SQL support with Snowflake-specific extensions
Extensibility: Custom functions and data source integrations

Each Embucket node runs a complete query engine, enabling horizontal scalability and fault tolerance.
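
To make "columnar, vectorized processing" concrete, here is a small standard-library sketch. It does not use the DataFusion or Arrow APIs; the `batch` layout and the `filter_sum` function are illustrative assumptions that only show the idea of storing one typed array per column and operating on a whole column at a time.

```python
from array import array

# A column-oriented batch: one typed, contiguous array per column,
# instead of a list of row objects.
batch = {
    "user_id": array("q", [1, 2, 3, 4]),
    "amount": array("d", [10.0, 25.5, 7.25, 40.0]),
}


def filter_sum(batch, min_amount):
    """Sum the 'amount' column for rows where amount >= min_amount."""
    amounts = batch["amount"]
    # Step 1: compute a selection mask over the whole column.
    mask = [a >= min_amount for a in amounts]
    # Step 2: aggregate only the selected values; other columns are never read.
    return sum(a for a, keep in zip(amounts, mask) if keep)


total = filter_sum(batch, 10.0)
```

The query touches only the `amount` column, which is the main benefit of a columnar layout for analytical scans.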

Data storage

Apache Iceberg manages table metadata and provides ACID guarantees:

Schema evolution: Add, drop, and change columns without rewriting data
Time travel: Query historical table snapshots
ACID transactions: Atomic commits with snapshot isolation
Partition pruning: Skips partitions that cannot match a query’s filters, which speeds up queries on large datasets

All table data stays in your object storage using Parquet files organized by Iceberg’s metadata structure.
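
The relationship between snapshots and time travel can be sketched as follows. This is a simplified model, not Iceberg's actual metadata format: the `Snapshot` and `Table` classes and the file paths are assumptions made for illustration. Each commit produces a new immutable snapshot listing the data files that make up the table at that point, so older snapshots stay queryable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable list of Parquet file paths


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def append(self, new_files):
        """Commit new files by adding a snapshot; existing files are untouched."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, current + tuple(new_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files(self, snapshot_id=None):
        """Return the files visible at a snapshot (latest when omitted)."""
        if snapshot_id is None:
            return self.snapshots[-1].data_files if self.snapshots else ()
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


table = Table()
first = table.append(["data/a.parquet"])
table.append(["data/b.parquet"])

as_of_first = table.files(first)  # time travel: the table as of the first commit
latest = table.files()            # current state of the table
```

Reading at `first` ignores the file added by the second commit, which is the behavior a time-travel query relies on.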

Metadata management

SlateDB stores catalog metadata directly in object storage:

Embedded Log-Structured Merge tree: No external database dependencies
Object storage native: Writes Sorted String Table files directly to S3
Consistency guarantees: Strong consistency for metadata operations
Crash recovery: Automatic recovery from node failures

SlateDB eliminates the need for external database dependencies while maintaining metadata durability.
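
The following is a toy log-structured merge tree, included only to show the general idea behind the bullets above. It is not SlateDB's implementation or API: `TinyLSM`, the object names, and the flush threshold are all hypothetical, and a plain `dict` stands in for the bucket. Writes collect in an in-memory table, get flushed as sorted, immutable files, and reads check the newest data first.

```python
import bisect


class TinyLSM:
    def __init__(self, store, memtable_limit=2):
        self.store = store            # dict standing in for an object storage bucket
        self.memtable = {}            # recent writes, held in memory
        self.memtable_limit = memtable_limit
        # Recovery: continue numbering after the files already in the store.
        self.sst_count = sum(1 for k in store if k.endswith(".sst"))

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable as one sorted, immutable file."""
        if not self.memtable:
            return
        name = f"metadata/{self.sst_count:06d}.sst"  # hypothetical naming scheme
        self.store[name] = sorted(self.memtable.items())
        self.sst_count += 1
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        ssts = sorted((k for k in self.store if k.endswith(".sst")), reverse=True)
        for name in ssts:             # newest file first, so later writes win
            entries = self.store[name]
            keys = [k for k, _ in entries]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return entries[i][1]
        return None


bucket = {}
db = TinyLSM(bucket)
db.put("db1.table1", "v1")
db.put("db1.table2", "v1")   # reaches the limit and triggers a flush
db.put("db1.table1", "v2")
db.flush()

recovered = TinyLSM(bucket)  # a fresh node sees everything that was flushed
```

In this sketch anything still in the memtable is lost if the process dies before a flush; a real engine needs a durable write path to close that gap.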

API compatibility

Embucket implements Snowflake’s SQL dialect and REST API:

SQL compatibility: Snowflake-flavored SQL that works with existing queries
REST API: v1 Snowflake REST API for driver compatibility
Tool integration: Works with dbt, Apache Superset, and BI tools
Protocol support: Works with native drivers, which are built on top of the REST API

Storage architecture

Embucket treats object storage as the database, storing both data and metadata in your S3 bucket using open formats.

Data organization

Your data follows this structure in object storage:

s3://your-bucket/
├── volume/                 # Embucket volume location
│   ├── database1/
│   │   ├── table1/
│   │   │   ├── metadata/   # Embucket table metadata files
│   │   │   └── data/       # Embucket data files
│   │   └── table2/
│   └── database2/
└── metadata/               # Embucket metadata
    ├── manifest-*.sst      # Embucket metadata manifest files
    └── *.sst               # Embucket metadata SST files
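
As a small illustration of the layout above, the helper below builds the object-storage prefixes for one table. The function name `table_prefixes` and its parameters are hypothetical; the directory names come straight from the tree shown here, and real deployments may use a different volume location.

```python
from pathlib import PurePosixPath


def table_prefixes(bucket, database, table, volume="volume"):
    """Return the metadata and data prefixes for a table in the layout above."""
    base = PurePosixPath(volume) / database / table
    return {
        "metadata": f"s3://{bucket}/{base}/metadata/",
        "data": f"s3://{bucket}/{base}/data/",
    }


prefixes = table_prefixes("your-bucket", "database1", "table1")
```

Here `prefixes["data"]` resolves to `s3://your-bucket/volume/database1/table1/data/`.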

ACID guarantees

Embucket provides ACID properties through Iceberg’s snapshot isolation:

Atomicity: All changes within a transaction succeed or fail together
Consistency: Schema validation ensures data integrity
Isolation: Concurrent queries see consistent snapshots
Durability: Committed changes persist in object storage
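
Atomicity and isolation under snapshot-based commits can be sketched with an optimistic compare-and-swap on a "current snapshot" pointer. This is a conceptual model, not Embucket or Iceberg code: the `Catalog` class, the `commit` helper, and the snapshot names are assumptions for illustration.

```python
import threading


class Catalog:
    """Holds one pointer per table to its current snapshot."""

    def __init__(self):
        self._current = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._current.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._current.get(table) != expected:
                return False
            self._current[table] = new
            return True


def commit(catalog, table, base, new_snapshot):
    """Publish new_snapshot only if the table still points at base."""
    return catalog.compare_and_swap(table, base, new_snapshot)


catalog = Catalog()
base = catalog.current("orders")                            # both writers start here
first_ok = commit(catalog, "orders", base, "snap-1")        # first writer wins
second_ok = commit(catalog, "orders", base, "snap-2")       # stale base, rejected
retry_ok = commit(catalog, "orders", catalog.current("orders"), "snap-2")
```

A reader always sees either the old pointer or the new one, never a partial commit, and the losing writer retries against the latest snapshot.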

Deployment patterns

Embucket’s stateless architecture supports three deployment patterns depending on your requirements.

Single node deployment

Deploy one Embucket instance for development or small workloads:

Simple setup with minimal resource requirements
Suitable for development and testing environments
Single point of failure during maintenance

Multi-node deployment

Deploy two or more Embucket instances behind a load balancer:

High availability and fault tolerance
Horizontal scaling for increased query throughput
Load balancer distributes queries across healthy nodes
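
The last point can be sketched as a round-robin balancer that skips unhealthy nodes. This is an illustrative model rather than a recommendation of a specific load balancer; the `RoundRobinBalancer` class and the node names are hypothetical. Because every node can run a complete query against the same object storage, any healthy node is a valid target.

```python
from itertools import count


class RoundRobinBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._counter = count()

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node in rotation."""
        candidates = [n for n in self.nodes if n in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy nodes available")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["node-1", "node-2", "node-3"])
before_failure = [lb.pick() for _ in range(3)]

lb.mark_down("node-2")                          # e.g. a failed health check
after_failure = [lb.pick() for _ in range(4)]   # node-2 receives no queries
```

No session or data has to move when a node drops out, since the nodes hold no state.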