Snowflake compatibility
Embucket provides Snowflake compatibility through SQL dialect support and REST API compatibility. This page covers core SQL compatibility and API integration for analytical workloads. Snowflake enterprise features, data governance capabilities, and advanced security configurations are out of scope.
Key features:
- Snowflake v1 REST API
- SQL dialect powered by Apache DataFusion
- Compatible with snowflake-connector-python and tools that depend on it
- Integration with dbt, snowflake-cli, and Apache Superset
API compatibility
Embucket exposes a Snowflake v1 REST API that works with any Snowflake client. The API handles login, session management, and query execution through the same endpoints that Snowflake clients expect. Testing focuses primarily on snowflake-connector-python and tools that depend on it, including dbt, snowflake-cli, and Superset.
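As a sketch of client-side usage, the function below runs a statement through snowflake-connector-python pointed at an Embucket endpoint. The host, port, and credentials are placeholder assumptions for a local deployment, not documented defaults:

```python
def run_query(sql: str, host: str = "localhost", port: int = 3000):
    """Run a statement against an Embucket endpoint via the Snowflake
    Python connector. Host, port, and credentials are placeholders."""
    # Imported lazily so the sketch reads without the package installed;
    # requires `pip install snowflake-connector-python`.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="embucket",      # placeholder credentials
        password="embucket",
        account="embucket",   # the client requires an account name
        host=host,            # the Embucket server, not *.snowflakecomputing.com
        port=port,
        protocol="http",      # local deployments often run without TLS
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```

Because the client speaks the same v1 REST protocol either way, only the connection parameters change relative to a real Snowflake account.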
Query engine
Embucket uses Apache DataFusion, an Apache Arrow-native analytical query engine, and targets full SQL dialect compatibility with Snowflake. Embucket supports a growing set of Snowflake SQL features, and DataFusion also includes some built-in functions that Snowflake doesn’t provide.
Compatibility testing
Embucket verifies compatibility through two test methods:
- SQL Logic Tests — Verify SQL engine compatibility by running SQL statements and comparing results against expected output.
- dbt integration tests — Verify REST API compatibility by running dbt models end-to-end. Embucket uses the dbt GitLab project as a compatibility benchmark.
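The SQL Logic Test pattern can be illustrated in a few lines: run a statement, normalize the result into a line-per-value text form, and compare it to an expected block. The sketch below uses sqlite3 purely as a stand-in engine; Embucket's actual harness runs against its own SQL engine:

```python
import sqlite3

def run_slt(sql: str, expected: str) -> bool:
    """Minimal SQL Logic Test step: execute, normalize, compare."""
    conn = sqlite3.connect(":memory:")
    rows = conn.execute(sql).fetchall()
    conn.close()
    # Flatten rows into one value per line, as SLT expected blocks do.
    actual = "\n".join(str(v) for row in rows for v in row)
    return actual == expected.strip()

print(run_slt("SELECT 1 + 1", "2"))        # True
print(run_slt("SELECT 'a' || 'b'", "ab"))  # True
```

Running the same statement file against Snowflake and against the engine under test turns any divergence in results into a test failure.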
Architecture differences
Snowflake operates as a managed analytics database built on FoundationDB and object storage. See the Snowflake whitepaper for details on its architecture. Embucket offers an open-source alternative that uses Apache DataFusion, Apache Iceberg, Apache Arrow, and Parquet.
The key structural differences include:
- Metadata storage — Embucket uses external catalogs such as S3 Tables instead of FoundationDB.
- Data format — Embucket stores data in Iceberg and Parquet instead of a proprietary format.
- Query execution — Embucket runs single-node DataFusion with Arrow in-memory representation instead of a distributed execution engine.
Current limitations
Architecture
- Single-node execution — Query processing uses only one node’s memory and CPU capacity.
- No distributed parallelism — The engine doesn’t distribute queries across nodes.
- Single writer — Only one Embucket instance can write to a table at a time.
Data types
- VARIANT — Stored as JSON-serialized TEXT because Parquet, Iceberg, and Arrow don’t support the VARIANT type natively.
- Numeric type coercion — Type promotion and coercion rules differ from Snowflake.
- Timestamp precision — Arrow uses a fixed nanosecond precision for timestamps.
- No collation or charset support — All text uses UTF-8 encoding. Snowflake’s collation and character set options aren’t supported.
Error handling
Error messages don’t match Snowflake’s error format. Arrow and DataFusion generate errors in their own format.
Backslash escaping
Backslash escaping differs from Snowflake in the following cases.
Literal backslashes:

Snowflake:

```sql
SELECT * FROM VALUES ('\\b');
```

```
\b
```

Embucket:

```sql
SELECT * FROM VALUES ('\\\\b');
```

```
\b
```

Special characters:

In both Snowflake and Embucket, the following produces a backspace character:

```sql
SELECT * FROM VALUES ('\b');
```

Single trailing backslash:

Snowflake:

```sql
SELECT * FROM VALUES ('\\');
```

```
\
```

Embucket:

```sql
SELECT * FROM VALUES ('\\');
```

```
Error: Unterminated string literal
```

VARIANT data type
Embucket stores VARIANT values as JSON-serialized TEXT on the storage layer. The following example demonstrates how to create, read, and inspect VARIANT data.
Create a table with a VARIANT column:

```sql
create table t2 (c1 variant) as values (parse_json('{"k1":1}'));
```

```
Count
-----
1
```

Read the data back:

```sql
select * from t2;
```

```
c1
----------
{"k1":1}
```

Inspect the Arrow type:

```sql
select arrow_typeof(c1) from t2;
```

```
arrow_typeof(t2.c1)
--------------------
Utf8
```
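The storage model can be mimicked in plain Python: the sketch below serializes a value to JSON text on "write" and parses it back on "read", which is what the TEXT-backed VARIANT column does at the storage layer.

```python
import json

variant_value = {"k1": 1}

stored = json.dumps(variant_value)    # what lands in storage as TEXT
print(stored, type(stored).__name__)  # {"k1": 1} str

restored = json.loads(stored)         # what a VARIANT read reconstructs
print(restored["k1"])                 # 1
```

The round trip preserves the value, but the physical type stays a string, which is why `arrow_typeof` reports `Utf8` above.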
Numeric type handling
DataFusion uses different numeric types than Snowflake. Embucket uses Decimal128 for the closest behavior to Snowflake’s NUMBER type.
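How a fixed decimal scale shapes a result like avg(1.0, 2.0, 10.0) can be sketched with Python's decimal module; the scales here mirror a Decimal128(7,5) result versus a NUMBER(20,6) result:

```python
from decimal import Decimal

# Exact average of 1.0, 2.0, 10.0 is 13/3, a non-terminating decimal.
exact = (Decimal("1.0") + Decimal("2.0") + Decimal("10.0")) / 3

print(exact.quantize(Decimal("0.00001")))   # 4.33333  (scale 5)
print(exact.quantize(Decimal("0.000001")))  # 4.333333 (scale 6)
```

The same exact value rounds to different digit counts depending on the declared scale, which is the kind of divergence the engine examples below show.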
Embucket:
```sql
select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
```

```
avg(a)            | system$typeof(avg(a))
------------------|----------------------
4.333333333333333 | Decimal128(7,5)
```

Snowflake:

```sql
select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
```

```
AVG(A)   | SYSTEM$TYPEOF(AVG(A))
---------|----------------------
4.333333 | NUMBER(20,6)[SB16]
```

Timestamp handling
Arrow represents timestamps as 64-bit integers with nanosecond precision. This precision stays constant and can’t change per value. Snowflake uses variable precision for timestamps.
Timezone handling also differs. Snowflake stores a timezone offset per timestamp value. Embucket stores a timezone offset per column. Query results may vary depending on the session timezone setting.
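The fixed-precision representation can be sketched in Python: a timestamp becomes a single int64 count of nanoseconds since the Unix epoch, which also bounds the representable range.

```python
from datetime import datetime, timezone

ts = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
ns = int(ts.timestamp()) * 1_000_000_000  # Arrow-style int64 nanoseconds
print(ns)  # 1704164645000000000

# Fixed nanosecond precision caps the range: the largest int64 value
# corresponds to a year around 2262.
print((2**63 - 1) // 1_000_000_000 // 86400 // 365 + 1970)  # 2262
```

A variable-precision format like Snowflake's avoids this trade-off by letting each value carry its own scale.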
Error message format
Arrow and DataFusion generate error messages that don’t match Snowflake’s error format. If your app parses Snowflake error codes or messages, expect differences when running against Embucket.
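One defensive approach is to classify errors by coarse substrings rather than exact Snowflake error codes. The message strings and marker list below are illustrative assumptions, not actual Embucket or DataFusion output:

```python
def is_syntax_error(message: str) -> bool:
    """Classify an error message by substring, not by Snowflake error code.
    The markers are examples; extend them for the engines you target."""
    markers = ("syntax error", "sql parser error", "unterminated string")
    return any(m in message.lower() for m in markers)

print(is_syntax_error("SQL parser error: Unterminated string literal"))  # True
print(is_syntax_error("Table 't2' not found"))                           # False
```

Substring matching survives format changes on either side at the cost of precision, so it suits code that only needs a broad error category.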
Development roadmap
The following areas remain under active development:
- VARIANT — Native storage support when upstream dependencies — Parquet, Iceberg, Arrow — add VARIANT support.
- Numeric types — Dynamic precision Decimal types for aggregation functions.
- Timestamps — Per-value timezone handling to match Snowflake behavior.
- Error format — Align error messages and codes with Snowflake’s error format.