Skip to content

Snowflake compatibility

Embucket provides Snowflake compatibility through SQL dialect support and API compatibility. This document explains how Embucket implements Snowflake compatibility, covers current limitations, and outlines development plans.

This document doesn’t cover Snowflake enterprise features, data governance capabilities, or advanced security configurations. The current focus remains on core SQL compatibility and API integration for analytical workloads.

Embucket supports the Snowflake SQL dialect and provides a Snowflake v1 REST API. You can use existing Snowflake clients and tools with Embucket.

Key compatibility features:

  • Snowflake v1 REST API compatibility
  • SQL dialect support through Apache DataFusion engine
  • Compatible with snowflake-connector-python and dependent tools
  • Integration with dbt, snowflake-cli, and Apache Superset

Embucket provides a Snowflake v1 REST API that works with any Snowflake client. The platform has tested compatibility primarily with snowflake-connector-python and tools that depend on it, including dbt, snowflake-cli, and Apache Superset.

Embucket provides a SQL engine based on Apache DataFusion - an Apache Arrow-native query engine. The engine works with any SQL client. Embucket aims for 100% SQL dialect compatibility with Snowflake but currently support a subset of SQL features. Embucket includes some built-in functions and data types that Snowflake doesn’t provide.

Embucket verifies compatibility using two test methods:

  • SQL Logic Tests: Verify SQL engine compatibility with Snowflake
  • dbt integration tests: Verify REST API compatibility with Snowflake

Embucket uses the dbt Gitlab project as a compatibility benchmark. The Embucket repository displays current compatibility badges for both SQL logic test suite and dbt Gitlab.

Snowflake provides a managed analytics database built on FoundationDB and object storage such as S3. Snowflake delivers a managed service for deploying and scaling analytics workloads. See their whitepaper for details.

Embucket provides an open source Snowflake alternative with the same ease of use and manageability. You gain open source and self-hosting benefits. The system uses open technologies including Apache DataFusion, Apache Iceberg, SlateDB, Apache Arrow, and Parquet.

These technology choices differ structurally:

  • Metadata storage: SlateDB on object storage instead of FoundationDB distributed database
  • Data format: Iceberg and Parquet format instead of proprietary format
  • Query execution: Single-node DataFusion SQL engine with Arrow in-memory representation
  • Single-node execution: Query processing limits to one node’s memory and CPU capacity
  • No distributed parallelism: Can’t distribute queries across many nodes
  • Single writer: Only one instance can write to a table simultaneously
  • Limited VARIANT support: Embucket stores VARIANT as JSON-serialized TEXT because Parquet, Iceberg, and Arrow don’t support VARIANT natively
  • Numeric type differences: Embucket uses different numeric types than Snowflake for type coercion
  • Timestamp precision: Fixed nanosecond precision differs from Snowflake’s variable precision
  • No collation/charset support: All text data uses UTF-8 encoding. Locale-aware collations and custom character sets remain unsupported for some time.
  • Different error format: Error messages don’t match Snowflake’s format

Embucket handles backslash escaping in string literals differently than Snowflake. This affects how you write queries that include backslashes or special characters in strings.

Key differences:

  • Escape character interpretation: Snowflake treats single backslashes literally in most cases, while Embucket interprets them as escape characters
  • Backslash repetition: You need more backslashes in Embucket to achieve the same result as Snowflake
  • Error handling: Some backslash patterns that work in Snowflake cause errors in Embucket

Snowflake:

SELECT * FROM VALUES ('\\b');
-- Returns: \b

Embucket:

SELECT * FROM VALUES ('\\\\b');
-- Returns: \b

Snowflake:

SELECT * FROM VALUES ('\b');
-- Returns: \x08 (backspace character, may appear as broken symbol)

Embucket:

SELECT * FROM VALUES ('\b');
-- Returns: \x08 (backspace character, may not display in UI)

Snowflake:

SELECT * FROM VALUES ('\\');
-- Returns: \

Embucket:

SELECT * FROM VALUES ('\\');
-- Error: Unterminated string literal

Embucket implements VARIANT as JSON-serialized TEXT on the storage layer. This approach becomes necessary because Parquet, Iceberg, and Arrow don’t support VARIANT natively.

  1. Create a table

    create table t2 (c1 variant) as values (parse_json('{"k1":1}'));
    +-------+
    | count |
    |-------|
    | 1 |
    +-------+
  2. Count rows

    select * from t2;
    +----------+
    | c1 |
    |----------|
    | {"k1":1} |
    +----------+

In the storage layer, Embucket stores VARIANT as JSON-serialized TEXT.

select arrow_typeof(c1) from t2;
+-------------------------------------+
| arrow_typeof(embucket.public.t2.c1) |
|-------------------------------------|
| Utf8 |
+-------------------------------------+

Embucket uses DataFusion for type coercion and may use different numeric types than Snowflake. This produces different results in some cases.

For aggregation functions on numeric types, Snowflake uses Decimal with dynamic precision and scale. Embucket uses Decimal128 for better compatibility and fewer errors. Decimal128 provides the closest behavior to Snowflake’s NUMBER type.

Embucket result:

select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
+--------------------------------------------+
| avg(t.a) | arrow_typeof(avg(t.a)) |
|-------------------+------------------------|
| 4.333333333333333 | Decimal128(7,5) |
+--------------------------------------------+

Snowflake result:

select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
+----------------------------------+
| AVG(A) | SYSTEM$TYPEOF(AVG(A)) |
|----------+-----------------------|
| 4.333333 | NUMBER(20,6)[SB16] |
+----------------------------------+

Embucket uses Arrow’s timestamp type: a 64-bit integer representing nanoseconds since the Unix epoch (1970-01-01 00:00:00 UTC). This fixed precision type always uses nanoseconds. Snowflake uses different precision.

  • Snowflake: Stores timezone offset per timestamp value
  • Embucket: Stores timezone offset per column

Embucket may return different results for the same query depending on the session timezone setting.

Embucket returns errors in a different format than Snowflake. Apache Arrow and DataFusion generate these errors and they don’t match Snowflake’s error format.

The development roadmap includes these compatibility improvements:

  • Numeric support: Add dynamic precision Decimal types for aggregation functions
  • VARIANT support: Add native VARIANT storage when underlying dependencies support it
  • Timestamp support: Improve timezone handling to match Snowflake’s per-value storage
  • Error format: Align error messages with Snowflake’s error format