
Snowflake compatibility

Embucket provides Snowflake compatibility through SQL dialect support and REST API compatibility. This page covers core SQL compatibility and API integration for analytical workloads. Snowflake enterprise features, data governance capabilities, and advanced security configurations are out of scope.

Key features:

  • Snowflake v1 REST API
  • SQL dialect powered by Apache DataFusion
  • Compatible with snowflake-connector-python and tools that depend on it
  • Integration with dbt, snowflake-cli, and Apache Superset

Embucket exposes a Snowflake v1 REST API that works with any Snowflake client. The API handles login, session management, and query execution through the same endpoints that Snowflake clients expect. Testing focuses primarily on snowflake-connector-python and tools that depend on it, including dbt, snowflake-cli, and Superset.

Embucket uses Apache DataFusion, an Apache Arrow-native analytical query engine. The project targets full SQL dialect compatibility with Snowflake and supports a growing set of Snowflake SQL features. DataFusion also includes some built-in functions that Snowflake doesn’t provide.

Embucket verifies compatibility through two test methods:

  • SQL Logic Tests — Verify SQL engine compatibility by running SQL statements and comparing results against expected output.
  • dbt integration tests — Verify REST API compatibility by running dbt models end-to-end. Embucket uses the dbt GitLab project as a compatibility benchmark.

Snowflake operates as a managed analytics database built on FoundationDB and object storage. See the Snowflake whitepaper for details on its architecture. Embucket offers an open-source alternative that uses Apache DataFusion, Apache Iceberg, Apache Arrow, and Parquet.

The key structural differences include:

  • Metadata storage — Embucket uses external catalogs such as S3 Tables instead of FoundationDB.
  • Data format — Embucket stores data in Iceberg and Parquet instead of a proprietary format.
  • Query execution — Embucket runs single-node DataFusion with Arrow in-memory representation instead of a distributed execution engine.
  • Single-node execution — Query processing uses only one node’s memory and CPU capacity.
  • No distributed parallelism — The engine doesn’t distribute queries across nodes.
  • Single writer — Only one Embucket instance can write to a table at a time.
  • VARIANT — Stored as JSON-serialized TEXT because Parquet, Iceberg, and Arrow don’t support the VARIANT type natively.
  • Numeric type coercion — Type promotion and coercion rules differ from Snowflake.
  • Timestamp precision — Arrow uses a fixed nanosecond precision for timestamps.
  • No collation or charset support — All text uses UTF-8 encoding. Snowflake’s collation and character set options aren’t supported.


Backslash escaping differs from Snowflake in the following cases.

Literal backslashes:

Snowflake:

SELECT * FROM VALUES ('\\b');
\b

Embucket:

SELECT * FROM VALUES ('\\\\b');
\b

Special characters:

In both Snowflake and Embucket, the following produces a backspace character:

SELECT * FROM VALUES ('\b');

Single trailing backslash:

Snowflake:

SELECT * FROM VALUES ('\\');
\

Embucket:

SELECT * FROM VALUES ('\\');
Error: Unterminated string literal
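When generating SQL client-side, the doubled escaping can be handled by a small helper. The following is an illustrative sketch (a hypothetical function, not part of Embucket or the Snowflake connector), assuming the behavior shown above: a literal backslash needs four backslashes in Embucket and two in Snowflake.

```python
def sql_string_literal(value: str, dialect: str = "embucket") -> str:
    """Quote a Python string as a SQL string literal.

    Embucket needs four backslashes per literal backslash ('\\\\b' -> \b);
    Snowflake needs two ('\\b' -> \b). Single quotes are doubled in both.
    """
    per_backslash = 4 if dialect == "embucket" else 2
    escaped = value.replace("\\", "\\" * per_backslash).replace("'", "''")
    return f"'{escaped}'"

# A literal backslash followed by 'b':
print(sql_string_literal("\\b", "embucket"))   # '\\\\b'
print(sql_string_literal("\\b", "snowflake"))  # '\\b'
```

Escaping at literal-construction time keeps the rest of the query-building code dialect-agnostic.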

Embucket stores VARIANT values as JSON-serialized TEXT on the storage layer. The following example demonstrates how to create, read, and inspect VARIANT data.

  1. Create a table with a VARIANT column

    create table t2 (c1 variant) as values (parse_json('{"k1":1}'));
    Count
    -----
    1
  2. Read the data back

    select * from t2;
    c1
    ----------
    {"k1":1}
  3. Inspect the Arrow type

    select arrow_typeof(c1) from t2;
    arrow_typeof(t2.c1)
    --------------------
    Utf8
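Because VARIANT values are plain JSON text at the storage layer, a client can reproduce the round-trip with ordinary JSON serialization. A sketch using Python’s standard json module (illustrative only, not Embucket’s internals):

```python
import json

# What parse_json('{"k1":1}') ends up storing: compact JSON text.
stored = json.dumps({"k1": 1}, separators=(",", ":"))
print(stored)  # {"k1":1}

# Reading the VARIANT back yields a string; parse it to navigate fields.
value = json.loads(stored)
print(value["k1"])  # 1
```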

DataFusion uses different numeric types than Snowflake. Embucket uses Decimal128 for the closest behavior to Snowflake’s NUMBER type.

Embucket:

select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
avg(a)            | system$typeof(avg(a))
------------------|----------------------
4.333333333333333 | Decimal128(7,5)

Snowflake:

select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);
AVG(A)   | SYSTEM$TYPEOF(AVG(A))
---------|----------------------
4.333333 | NUMBER(20,6)[SB16]
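The practical effect of the differing result types is a difference in scale. A rough analogy with Python’s decimal module, mimicking the fractional digits implied by the Decimal128(7,5) and NUMBER(20,6) annotations (this is not Embucket’s actual arithmetic):

```python
from decimal import Decimal

values = [Decimal("1.0"), Decimal("2.0"), Decimal("10.0")]
raw = sum(values) / len(values)

# A Decimal128(7,5) result keeps 5 fractional digits ...
embucket_like = raw.quantize(Decimal("0.00001"))
# ... while a NUMBER(20,6) result keeps 6.
snowflake_like = raw.quantize(Decimal("0.000001"))

print(embucket_like)   # 4.33333
print(snowflake_like)  # 4.333333
```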

Arrow represents timestamps as 64-bit integers with nanosecond precision. This precision stays constant and can’t change per value. Snowflake uses variable precision for timestamps.
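The fixed nanosecond precision also bounds the representable range: int64 nanoseconds since the Unix epoch run out in the year 2262. A quick check in Python, assuming the int64-nanosecond representation described above:

```python
from datetime import datetime, timezone

I64_MAX = 2**63 - 1  # Arrow timestamps: int64 nanoseconds since the epoch

latest = datetime.fromtimestamp(I64_MAX / 1e9, tz=timezone.utc)
print(latest.year)  # 2262
```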

Timezone handling also differs. Snowflake stores a timezone offset per timestamp value. Embucket stores a timezone offset per column. Query results may vary depending on the session timezone setting.

Arrow and DataFusion generate error messages that don’t match Snowflake’s error format. If your app parses Snowflake error codes or messages, expect differences when running against Embucket.

The following areas remain under active development:

  • VARIANT — Native storage support when upstream dependencies (Parquet, Iceberg, and Arrow) add VARIANT support.
  • Numeric types — Dynamic precision Decimal types for aggregation functions.
  • Timestamps — Per-value timezone handling to match Snowflake behavior.
  • Error format — Align error messages and codes with Snowflake’s error format.