Snowflake compatibility

Embucket provides Snowflake compatibility through SQL dialect support and API compatibility. This document explains how Embucket implements Snowflake compatibility, covers current limitations, and outlines development plans.

This document doesn’t cover Snowflake enterprise features, data governance capabilities, or advanced security configurations. The current focus remains on core SQL compatibility and API integration for analytical workloads.

Embucket supports the Snowflake SQL dialect and provides a Snowflake v1 REST API. You can use existing Snowflake clients and tools with Embucket.

Key compatibility features:

Snowflake v1 REST API compatibility
SQL dialect support through Apache DataFusion engine
Compatible with snowflake-connector-python and dependent tools
Integration with dbt, snowflake-cli, and Apache Superset

API compatibility

Embucket provides a Snowflake v1 REST API that works with any Snowflake client. The platform has tested compatibility primarily with snowflake-connector-python and tools that depend on it, including dbt, snowflake-cli, and Apache Superset.

SQL engine compatibility

Embucket provides a SQL engine based on Apache DataFusion - an Apache Arrow-native query engine. The engine works with any SQL client. Embucket aims for 100% SQL dialect compatibility with Snowflake but currently support a subset of SQL features. Embucket includes some built-in functions and data types that Snowflake doesn’t provide.

Architecture differences

Snowflake provides a managed analytics database built on FoundationDB and object storage such as S3. Snowflake delivers a managed service for deploying and scaling analytics workloads. See their whitepaper for details.

Embucket provides an open source Snowflake alternative with the same ease of use and manageability. You gain open source and self-hosting benefits. The system uses open technologies including Apache DataFusion, Apache Iceberg, Apache Arrow, and Parquet.

These technology choices differ structurally:

Metadata storage: Embucket currently defers metadata management to external catalogs (such as AWS S3 table buckets)
Data format: Iceberg and Parquet format instead of proprietary format
Query execution: Single-node DataFusion SQL engine with Arrow in-memory representation

Current limitations

Architecture constraints

Single-node execution: Query processing limits to one node’s memory and CPU capacity
No distributed parallelism: Can’t distribute queries across many nodes

Data type limitations

Limited VARIANT support: Embucket stores VARIANT as JSON-serialized TEXT because Parquet, Iceberg, and Arrow don’t support VARIANT natively
Numeric type differences: Embucket uses different numeric types than Snowflake for type coercion
Timestamp precision: Fixed nanosecond precision differs from Snowflake’s variable precision
No collation/charset support: All text data uses UTF-8 encoding. Locale-aware collations and custom character sets remain unsupported for some time.

Error handling

Different error format: Error messages don’t match Snowflake’s format

Backslash escaping

Embucket handles backslash escaping in string literals differently than Snowflake. This affects how you write queries that include backslashes or special characters in strings.

Key differences:

Escape character interpretation: Snowflake treats single backslashes literally in most cases, while Embucket interprets them as escape characters
Backslash repetition: You need more backslashes in Embucket to achieve the same result as Snowflake
Error handling: Some backslash patterns that work in Snowflake cause errors in Embucket

Getting literal backslashes

Snowflake:

SELECT * FROM VALUES ('\\b');
-- Returns: \b

Embucket:

SELECT * FROM VALUES ('\\\\b');
-- Returns: \b

Handling special characters

Snowflake:

SELECT * FROM VALUES ('\b');
-- Returns: \x08 (backspace character, may appear as broken symbol)

Embucket:

SELECT * FROM VALUES ('\b');
-- Returns: \x08 (backspace character, may not display in UI)

Single trailing backslash

Snowflake:

SELECT * FROM VALUES ('\\');
-- Returns: \

Embucket:

SELECT * FROM VALUES ('\\');
-- Error: Unterminated string literal

VARIANT data type support

Embucket implements VARIANT as JSON-serialized TEXT on the storage layer. This approach becomes necessary because Parquet, Iceberg, and Arrow don’t support VARIANT natively.

Create a table

create table t2 (c1 variant) as values (parse_json('{"k1":1}'));

+-------+
| count |
|-------|
| 1     |
+-------+

Count rows

select * from t2;

+----------+
| c1       |
|----------|
| {"k1":1} |
+----------+

Storage implementation details

In the storage layer, Embucket stores VARIANT as JSON-serialized TEXT.

select arrow_typeof(c1) from t2;

+-------------------------------------+
| arrow_typeof(embucket.public.t2.c1) |
|-------------------------------------|
| Utf8                                |
+-------------------------------------+

Numeric type handling

Embucket uses DataFusion for type coercion and may use different numeric types than Snowflake. This produces different results in some cases.

For aggregation functions on numeric types, Snowflake uses Decimal with dynamic precision and scale. Embucket uses Decimal128 for better compatibility and fewer errors. Decimal128 provides the closest behavior to Snowflake’s NUMBER type.

Numeric function result differences

Embucket result:

select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);

+--------------------------------------------+
| avg(t.a)          | arrow_typeof(avg(t.a)) |
|-------------------+------------------------|
| 4.333333333333333 | Decimal128(7,5)        |
+--------------------------------------------+

Snowflake result:

select avg(a), system$typeof(avg(a)) from values (1.0), (2.0), (10.0) as t(a);

+----------------------------------+
| AVG(A)   | SYSTEM$TYPEOF(AVG(A)) |
|----------+-----------------------|
| 4.333333 | NUMBER(20,6)[SB16]    |
+----------------------------------+

Timestamp handling

Embucket uses Arrow’s timestamp type: a 64-bit integer representing nanoseconds since the Unix epoch (1970-01-01 00:00:00 UTC). This fixed precision type always uses nanoseconds. Snowflake uses different precision.

Timezone storage differences

Snowflake: Stores timezone offset per timestamp value
Embucket: Stores timezone offset per column

Embucket may return different results for the same query depending on the session timezone setting.

Error message format

Embucket returns errors in a different format than Snowflake. Apache Arrow and DataFusion generate these errors and they don’t match Snowflake’s error format.

Development roadmap

The development roadmap includes these compatibility improvements:

VARIANT support: Add native VARIANT storage when underlying dependencies support it
Timestamp support: Improve timezone handling to match Snowflake’s per-value storage
Numeric support: Add dynamic precision Decimal types for aggregation functions
Error format: Align error messages with Snowflake’s error format