Architecture

Embucket delivers a Snowflake-compatible lakehouse. This document explains how Embucket’s components work together to provide a complete analytics platform in a single binary.

You’ll learn about the system’s architecture, core components, data flow, and operational characteristics. This overview targets data engineers, platform architects, and anyone deploying Embucket in production.

Embucket integrates four key components into a single binary that delivers complete lakehouse capabilities.

Apache DataFusion powers Embucket’s SQL execution. DataFusion provides:

  • Vectorized processing: Columnar data processing using Apache Arrow
  • SQL compliance: ANSI SQL support with Snowflake-specific extensions
  • Extensibility: Custom functions and data source integrations

Each Embucket node runs a complete query engine, enabling horizontal scalability and fault tolerance.

Apache Iceberg manages table metadata and provides ACID guarantees:

  • Schema evolution: Add, drop, and change columns without rewriting data
  • Time travel: Query historical table snapshots
  • ACID transactions: Atomic commits with snapshot isolation
  • Partition pruning: Efficient query performance on large datasets

All table data stays in your object storage using Parquet files organized by Iceberg’s metadata structure.
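As a rough illustration of how partition pruning avoids reading irrelevant files, the sketch below filters a list of data files by comparing each file's value range against a query's date range. It is a toy model, not Embucket or Iceberg code: the `DataFile` class, the `prune` function, the `event_day` column, and the file paths are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file entry in table metadata."""
    path: str
    min_day: str  # smallest event_day value in the file (ISO date string)
    max_day: str  # largest event_day value in the file (ISO date string)


def prune(files, lo, hi):
    """Keep only files whose [min_day, max_day] range overlaps [lo, hi].

    ISO date strings sort lexicographically, so plain string comparison works.
    """
    return [f for f in files if f.max_day >= lo and f.min_day <= hi]


files = [
    DataFile("data/day=2024-01-01/a.parquet", "2024-01-01", "2024-01-01"),
    DataFile("data/day=2024-01-02/b.parquet", "2024-01-02", "2024-01-02"),
    DataFile("data/day=2024-01-03/c.parquet", "2024-01-03", "2024-01-03"),
]

# A query filtering on event_day BETWEEN 2024-01-02 AND 2024-01-03
# only needs the last two files; the first is skipped without being opened.
selected = prune(files, "2024-01-02", "2024-01-03")
print([f.path for f in selected])
```

The point of the sketch is that the decision uses metadata only: no Parquet file is read until it is known to overlap the query's filter.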

Embucket implements Snowflake’s SQL dialect and REST API:

  • SQL compatibility: Runs existing queries written in Snowflake-flavored SQL
  • REST API: v1 Snowflake REST API for driver compatibility
  • Tool integration: Works with dbt, Apache Superset, and BI tools
  • Protocol support: Works with native Snowflake drivers, which are built on top of the REST API
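Because the drivers talk to the REST API, a client can in principle point the standard Snowflake Python connector at an Embucket endpoint. The following is a minimal sketch under stated assumptions: the host, port, account, credentials, database, and schema are placeholders rather than documented Embucket defaults, and the helper names `connection_params` and `run_query` are invented for this example. The `host`, `port`, and `protocol` keyword arguments are standard options of `snowflake.connector.connect()`.

```python
def connection_params(host="localhost", port=3000):
    """Build keyword arguments for snowflake.connector.connect().

    Every value here is a placeholder assumption, not an Embucket default.
    """
    return {
        "host": host,
        "port": port,
        "protocol": "http",  # assumes a local endpoint without TLS
        "account": "placeholder_account",
        "user": "placeholder_user",
        "password": "placeholder_password",
        "database": "placeholder_db",
        "schema": "public",
    }


def run_query(sql, **overrides):
    """Open a connection, run one statement, and return all rows."""
    # Imported lazily so the module loads without the driver installed;
    # calling this function requires the snowflake-connector-python package.
    import snowflake.connector

    params = {**connection_params(), **overrides}
    conn = snowflake.connector.connect(**params)
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()
```

With a running instance and real connection details substituted in, a call such as `run_query("SELECT 1")` would exercise the same path that dbt and BI tools use through their drivers.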

Embucket provides ACID properties through Iceberg’s snapshot isolation:

  • Atomicity: All changes within a transaction succeed or fail together
  • Consistency: Schema validation ensures data integrity
  • Isolation: Concurrent queries see consistent snapshots
  • Durability: Committed changes persist in object storage
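The interplay of atomic commits and snapshot isolation can be sketched with a small in-memory model. This is a conceptual toy, not Embucket or Iceberg code: the `ToyTable` class and its methods are hypothetical, and a lock stands in for the atomic metadata swap that a real table format performs against its catalog.

```python
import threading


class ToyTable:
    """In-memory model of snapshot-based commits; not Embucket or Iceberg code."""

    def __init__(self):
        self._snapshots = [()]         # snapshot 0 is the empty table
        self._lock = threading.Lock()  # stands in for an atomic metadata swap

    def current_snapshot_id(self):
        return len(self._snapshots) - 1

    def read(self, snapshot_id=None):
        """Return rows as of a snapshot; defaults to the latest one."""
        if snapshot_id is None:
            snapshot_id = self.current_snapshot_id()
        return list(self._snapshots[snapshot_id])

    def commit(self, base_snapshot_id, new_rows):
        """Publish base + new_rows as a new snapshot, or fail with no effect."""
        with self._lock:
            if base_snapshot_id != self.current_snapshot_id():
                raise RuntimeError("conflict: table changed since base snapshot")
            merged = self._snapshots[base_snapshot_id] + tuple(new_rows)
            self._snapshots.append(merged)
            return self.current_snapshot_id()


table = ToyTable()
s1 = table.commit(0, [("alice", 1)])
pinned = table.current_snapshot_id()  # a reader pins snapshot 1
s2 = table.commit(s1, [("bob", 2)])   # a writer commits afterwards

print(table.read(pinned))  # the pinned reader still sees only alice
print(table.read())        # new readers see both rows
```

Snapshots are never modified after they are published, so a reader that pinned an older snapshot keeps a consistent view, and reading by snapshot ID is also what makes time travel possible. A writer working from a stale base gets an error and leaves the table unchanged, which is the all-or-nothing behavior described under atomicity.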

Embucket’s stateless architecture supports several deployment patterns, depending on your requirements.

Deploy one Embucket instance for development or small workloads:

  • Simple setup with minimal resource requirements
  • Suitable for development and testing environments
  • Single point of failure during maintenance

Deploy Embucket as an AWS Lambda function:

  • High availability and fault tolerance
  • Horizontal scaling for increased query throughput