
Architecture

Embucket delivers a Snowflake-compatible lakehouse through a zero-disk architecture, an approach in which all persistent state resides in object storage rather than on local compute nodes. This document explains how Embucket’s components work together to provide a complete analytics platform in a single binary.

You’ll learn about the system’s architecture, core components, data flow, and operational characteristics. This overview targets data engineers, platform architects, and anyone deploying Embucket in production.

In this design, compute nodes maintain no persistent state, and object storage serves as the single source of truth for both data and metadata. This reduces operational complexity while supporting horizontal scalability and fault tolerance.

System Overview

Key architectural principles:

  • Stateless compute: Nodes store no data locally, so any node can be replaced instantly
  • Object storage persistence: All data and metadata live in your S3 bucket
  • Single binary deployment: No external dependencies or complex installations
  • Query-per-node processing: Each node handles complete queries independently

Embucket integrates four key components into a single binary that delivers complete lakehouse capabilities.

Apache DataFusion powers Embucket’s SQL execution. DataFusion provides:

  • Vectorized processing: Columnar data processing using Apache Arrow
  • SQL compliance: ANSI SQL support with Snowflake-specific extensions
  • Extensibility: Custom functions and data source integrations

Each Embucket node runs a complete query engine, enabling horizontal scalability and fault tolerance.

Apache Iceberg manages table metadata and provides ACID guarantees:

  • Schema evolution: Add, drop, and change columns without rewriting data
  • Time travel: Query historical table snapshots
  • ACID transactions: Atomic commits with snapshot isolation
  • Partition pruning: Efficient query performance on large datasets

All table data stays in your object storage using Parquet files organized by Iceberg’s metadata structure.
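
To make the snapshot idea behind time travel concrete, here is a toy Python model. It is a sketch under simplifying assumptions, not Iceberg's metadata format or API: the `Snapshot` and `Table` classes and their methods are hypothetical names, and file paths stand in for real Parquet files. Each commit produces a new immutable snapshot that lists the files visible at that point, so older snapshots remain readable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # paths of the data files visible in this snapshot


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, added_files):
        """Append a snapshot that sees all earlier files plus the new ones."""
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + tuple(added_files))
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def files_at(self, snapshot_id=None):
        """Return the files visible at a snapshot (the latest if none is given)."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.data_files
        raise KeyError(f"unknown snapshot {snapshot_id}")
```

Reading with an older `snapshot_id` is the toy equivalent of a time travel query: the data files are never rewritten, only the list of visible files changes.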

SlateDB stores catalog metadata directly in object storage:

  • Embedded log-structured merge (LSM) tree: No external database dependencies
  • Object storage native: Writes sorted string table (SST) files directly to S3
  • Consistency guarantees: Strong consistency for metadata operations
  • Crash recovery: Automatic recovery from node failures

SlateDB eliminates the need for external database dependencies while maintaining metadata durability.
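
The following toy sketch illustrates the general log-structured merge pattern described above: writes land in an in-memory table, which is flushed as an immutable, sorted file to object storage. It is not SlateDB's implementation or API. The `ToyLsm` class is hypothetical, a plain dictionary stands in for the S3 bucket, and the file naming is an assumption made for illustration.

```python
import json


class ToyLsm:
    def __init__(self, object_store, flush_threshold=2):
        self.object_store = object_store  # dict standing in for an S3 bucket
        self.flush_threshold = flush_threshold
        self.memtable = {}   # recent writes, held in memory
        self.sst_keys = []   # flushed files, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memtable as one immutable, key-sorted file."""
        if not self.memtable:
            return
        name = f"metadata/{len(self.sst_keys):06d}.sst"  # hypothetical naming
        self.object_store[name] = json.dumps(sorted(self.memtable.items()))
        self.sst_keys.append(name)
        self.memtable = {}

    def get(self, key):
        """Check the memtable first, then flushed files from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for name in reversed(self.sst_keys):
            table = dict(json.loads(self.object_store[name]))
            if key in table:
                return table[key]
        return None
```

Because flushed files are never modified, a newer file simply shadows older values for the same key. A production engine also needs pieces this sketch leaves out, such as durability for unflushed writes and compaction of old files.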

Embucket implements Snowflake’s SQL dialect and REST API:

  • SQL compatibility: Accepts Snowflake-flavored SQL, so existing queries can be reused
  • REST API: Implements the v1 Snowflake REST API for driver compatibility
  • Tool integration: Works with dbt, Apache Superset, and BI tools
  • Protocol support: Native Snowflake drivers work because they are built on top of the REST API
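
Because the drivers talk to the REST API, a standard Snowflake client can in principle be pointed at an Embucket endpoint. The sketch below uses the `snowflake-connector-python` package and its `snowflake.connector.connect()` function. Every connection value (host, port, protocol, account, user, password) is a placeholder assumption rather than a documented default, so substitute the values from your own deployment. The helper names `embucket_connection_params` and `run_query` are hypothetical.

```python
def embucket_connection_params(host="localhost", port=3000):
    """Build keyword arguments for snowflake.connector.connect().

    All values are placeholder assumptions, not documented defaults.
    """
    return {
        "host": host,
        "port": port,
        "protocol": "http",      # assumes a local endpoint without TLS
        "account": "acc",        # placeholder
        "user": "embucket",      # placeholder
        "password": "embucket",  # placeholder
    }


def run_query(sql, **overrides):
    """Run one statement and return all rows."""
    # Imported lazily so the helper above works without the driver installed.
    import snowflake.connector  # pip install snowflake-connector-python

    params = {**embucket_connection_params(), **overrides}
    conn = snowflake.connector.connect(**params)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()
```

With a running instance, a call such as `run_query("SELECT 1")` would exercise the same path that dbt or a BI tool uses.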

Embucket treats object storage as the database, storing both data and metadata in your S3 bucket using open formats.

Your data follows this structure in object storage:

s3://your-bucket/
├── volume/                  # Embucket volume location
│   ├── database1/
│   │   ├── table1/
│   │   │   ├── metadata/    # Embucket table metadata files
│   │   │   └── data/        # Embucket data files
│   │   └── table2/
│   └── database2/
└── metadata/                # Embucket metadata
    ├── manifest-*.sst       # Embucket metadata manifest files
    └── *.sst                # Embucket metadata SST files
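
As a small illustration of this layout, the helper below builds the object storage prefixes for one table. It is a sketch that assumes the directory names shown in the tree (`volume`, `metadata`, `data`); the function name `table_prefixes` is hypothetical, and a real deployment may use a different volume location.

```python
from pathlib import PurePosixPath


def table_prefixes(bucket, database, table, volume="volume"):
    """Return the metadata and data prefixes for one table."""
    base = PurePosixPath(volume) / database / table
    return {
        "metadata": f"s3://{bucket}/{base / 'metadata'}/",
        "data": f"s3://{bucket}/{base / 'data'}/",
    }


prefixes = table_prefixes("your-bucket", "database1", "table1")
print(prefixes["data"])  # s3://your-bucket/volume/database1/table1/data/
```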

Embucket provides ACID properties through Iceberg’s snapshot isolation:

  • Atomicity: All changes within a transaction succeed or fail together
  • Consistency: Schema validation ensures data integrity
  • Isolation: Concurrent queries see consistent snapshots
  • Durability: Committed changes persist in object storage
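
One common way to get atomic commits on top of immutable files is optimistic concurrency: a writer prepares new metadata, then swaps a single pointer only if nobody else committed first. The sketch below is a toy in-memory model of that idea, not Embucket's or Iceberg's code. `ToyCatalog`, `CommitConflict`, and `commit_with_retry` are hypothetical names, and an integer stands in for the pointer to the current table metadata.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0  # id of the current snapshot

    def current_snapshot(self):
        return self._current

    def compare_and_swap(self, expected, new):
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._current != expected:
                raise CommitConflict(
                    f"expected {expected}, found {self._current}"
                )
            self._current = new


def commit_with_retry(catalog, max_attempts=3):
    """Try to commit on top of the latest snapshot, re-reading it on conflict."""
    for _ in range(max_attempts):
        base = catalog.current_snapshot()
        try:
            catalog.compare_and_swap(base, base + 1)
            return base + 1
        except CommitConflict:
            continue  # someone else won the race; rebase and try again
    raise CommitConflict("gave up after repeated conflicts")
```

The swap either happens completely or not at all, which is what gives readers a consistent view: they see the old snapshot or the new one, never a partial write.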

Embucket’s stateless architecture supports three deployment patterns depending on your requirements.

Deploy one Embucket instance for development or small workloads:

  • Simple setup with minimal resource requirements
  • Suitable for development and testing environments
  • Single point of failure: queries are unavailable while the instance is down for maintenance

Deploy two or more Embucket instances behind a load balancer:

  • High availability and fault tolerance
  • Horizontal scaling for increased query throughput
  • Load balancer distributes queries across healthy nodes
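
Because every node can handle a complete query on its own, the load balancer only needs to pick any healthy node. The sketch below shows one simple policy, round-robin with health tracking. It is illustrative only: `RoundRobinBalancer` is a hypothetical class, node names are plain strings, and in practice you would normally use an existing load balancer with health checks rather than write one.

```python
import itertools


class RoundRobinBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node, skipping any that are down."""
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")
```

Since nodes hold no state, taking one out of rotation loses no data; queries are simply routed to the remaining nodes.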