Snowflake compatibility
Embucket provides Snowflake compatibility through SQL dialect support and API compatibility. This document explains how Embucket implements Snowflake compatibility, covers current limitations, and outlines development plans.
This document doesn’t cover Snowflake enterprise features, data governance capabilities, or advanced security configurations. The current focus remains on core SQL compatibility and API integration for analytical workloads.
Embucket supports the Snowflake SQL dialect and provides a Snowflake v1 REST API. You can use existing Snowflake clients and tools with Embucket.
Key compatibility features:
- Snowflake v1 REST API compatibility
- SQL dialect support through Apache DataFusion engine
- Compatible with snowflake-connector-python and dependent tools
- Integration with dbt, snowflake-cli, and Apache Superset
API compatibility
Section titled “API compatibility”Embucket provides a Snowflake v1 REST API that works with any Snowflake client. The platform has tested compatibility primarily with snowflake-connector-python and tools that depend on it, including dbt, snowflake-cli, and Apache Superset.
SQL engine compatibility
Section titled “SQL engine compatibility”Embucket provides a SQL engine based on Apache DataFusion - an Apache Arrow-native query engine. The engine works with any SQL client. Embucket aims for 100% SQL dialect compatibility with Snowflake but currently support a subset of SQL features. Embucket includes some built-in functions and data types that Snowflake doesn’t provide.
Compatibility testing
Section titled “Compatibility testing”Embucket verifies compatibility using two test methods:
- SQL Logic Tests: Verify SQL engine compatibility with Snowflake
- dbt integration tests: Verify REST API compatibility with Snowflake
Embucket uses the dbt Gitlab project as a compatibility benchmark. The Embucket repository displays current compatibility badges for both SQL logic test suite and dbt Gitlab.
Architecture differences
Section titled “Architecture differences”Snowflake provides a managed analytics database built on FoundationDB and object storage such as S3. Snowflake delivers a managed service for deploying and scaling analytics workloads. See their whitepaper for details.
Embucket provides an open source Snowflake alternative with the same ease of use and manageability. You gain open source and self-hosting benefits. The system uses open technologies including Apache DataFusion, Apache Iceberg, SlateDB, Apache Arrow, and Parquet.
These technology choices differ structurally:
- Metadata storage: SlateDB on object storage instead of FoundationDB distributed database
- Data format: Iceberg and Parquet format instead of proprietary format
- Query execution: Single-node DataFusion SQL engine with Arrow in-memory representation
Current limitations
Section titled “Current limitations”Architecture constraints
Section titled “Architecture constraints”- Single-node execution: Query processing limits to one node’s memory and CPU capacity
- No distributed parallelism: Can’t distribute queries across many nodes
- Single writer: Only one instance can write to a table simultaneously
Data type limitations
Section titled “Data type limitations”- Limited VARIANT support: Embucket stores VARIANT as JSON-serialized
TEXT
because Parquet, Iceberg, and Arrow don’t support VARIANT natively - Numeric type differences: Embucket uses different numeric types than Snowflake for type coercion
- Timestamp precision: Fixed nanosecond precision differs from Snowflake’s variable precision
- No collation/charset support: All text data uses UTF-8 encoding. Locale-aware collations and custom character sets remain unsupported for some time.
Error handling
Section titled “Error handling”- Different error format: Error messages don’t match Snowflake’s format
Backslash escaping
Section titled “Backslash escaping”Embucket handles backslash escaping in string literals differently than Snowflake. This affects how you write queries that include backslashes or special characters in strings.
Key differences:
- Escape character interpretation: Snowflake treats single backslashes literally in most cases, while Embucket interprets them as escape characters
- Backslash repetition: You need more backslashes in Embucket to achieve the same result as Snowflake
- Error handling: Some backslash patterns that work in Snowflake cause errors in Embucket
Getting literal backslashes
Section titled “Getting literal backslashes”Snowflake:
SELECT * FROM VALUES ('\\b');-- Returns: \b
Embucket:
SELECT * FROM VALUES ('\\\\b');-- Returns: \b
Handling special characters
Section titled “Handling special characters”Snowflake:
SELECT * FROM VALUES ('\b');-- Returns: \x08 (backspace character, may appear as broken symbol)
Embucket:
SELECT * FROM VALUES ('\b');-- Returns: \x08 (backspace character, may not display in UI)
Single trailing backslash
Section titled “Single trailing backslash”Snowflake:
SELECT * FROM VALUES ('\\');-- Returns: \
Embucket:
SELECT * FROM VALUES ('\\');-- Error: Unterminated string literal
VARIANT data type support
Section titled “VARIANT data type support”Embucket implements VARIANT as JSON-serialized TEXT
on the storage layer. This approach becomes necessary because Parquet, Iceberg, and Arrow don’t support VARIANT natively.
-
Create a table
create table t2 (c1 variant) as values (parse_json('{"k1":1}'));+-------+| count ||-------|| 1 |+-------+ -
Count rows
select * from t2;+----------+| c1 ||----------|| {"k1":1} |+----------+
Storage implementation details
Section titled “Storage implementation details”In the storage layer, Embucket stores VARIANT as JSON-serialized TEXT
.
select arrow_typeof(c1) from t2;
+-------------------------------------+| arrow_typeof(embucket.public.t2.c1) ||-------------------------------------|| Utf8 |+-------------------------------------+
Numeric type handling
Section titled “Numeric type handling”Embucket uses DataFusion for type coercion and may use different numeric types than Snowflake. This produces different results in some cases.
For aggregation functions on numeric types, Snowflake uses Decimal with dynamic precision and scale. Embucket uses Decimal128 for better compatibility and fewer errors. Decimal128
provides the closest behavior to Snowflake’s NUMBER
type.
Numeric function result differences
Section titled “Numeric function result differences”Embucket result:
select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
+--------------------------------------------+| avg(t.a) | arrow_typeof(avg(t.a)) ||-------------------+------------------------|| 4.333333333333333 | Decimal128(7,5) |+--------------------------------------------+
Snowflake result:
select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
+----------------------------------+| AVG(A) | SYSTEM$TYPEOF(AVG(A)) ||----------+-----------------------|| 4.333333 | NUMBER(20,6)[SB16] |+----------------------------------+
Timestamp handling
Section titled “Timestamp handling”Embucket uses Arrow’s timestamp type: a 64-bit integer representing nanoseconds since the Unix epoch (1970-01-01 00:00:00 UTC). This fixed precision type always uses nanoseconds. Snowflake uses different precision.
Timezone storage differences
Section titled “Timezone storage differences”- Snowflake: Stores timezone offset per timestamp value
- Embucket: Stores timezone offset per column
Embucket may return different results for the same query depending on the session timezone setting.
Error message format
Section titled “Error message format”Embucket returns errors in a different format than Snowflake. Apache Arrow and DataFusion generate these errors and they don’t match Snowflake’s error format.
Development roadmap
Section titled “Development roadmap”The development roadmap includes these compatibility improvements:
- Numeric support: Add dynamic precision Decimal types for aggregation functions
- VARIANT support: Add native VARIANT storage when underlying dependencies support it
- Timestamp support: Improve timezone handling to match Snowflake’s per-value storage
- Error format: Align error messages with Snowflake’s error format