Iceberg catalog
This guide shows how to connect external Iceberg clients to Embucket’s Iceberg Catalog API and create tables using pyiceberg
. It doesn’t cover Apache Spark or Trino integrations, advanced Apache Iceberg features, or production deployment considerations.
- Install
pyiceberg
and ensure Embucket runs - Create a volume and database for data persistence
- Connect to Embucket’s Iceberg Catalog API
- Create and populate Iceberg tables
- Query tables through Embucket’s SQL interface
Prerequisites
Section titled “Prerequisites”Install pyiceberg
and run Embucket before starting this guide. If you haven’t already, follow the quick start to get started.
pip install 'pyiceberg[all]'
Create a volume and database
Section titled “Create a volume and database”Volumes specify where Embucket stores data. For this guide create a volume of type file
, name it demo
and specify path to a directory to store the data.
import requests
response = requests.post( "http://127.0.0.1:3000/v1/metastore/volumes", json={ "ident": "demo", "type": "file", "path": "/home/ec2-user/tmp/demo", })
CREATE EXTERNAL VOLUME IF NOT EXISTS demo STORAGE_LOCATIONS = ((NAME = 'demo' STORAGE_PROVIDER = 'file' PATH = '/home/ec2-user/tmp/demo'));
Next, create a database. Embucket requires a volume for each database. Create databases in the UI, using the internal API or SQL interface:
response = requests.post( "http://127.0.0.1:3000/v1/metastore/databases", json={ "ident": "demo", "volume": "demo", })
CREATE DATABASE IF NOT EXISTS demo EXTERNAL_VOLUME = 'demo';
Embucket uses database terminology, while Iceberg uses warehouse terminology. At the same time, Embucket provides an Iceberg Catalog API, that operate on warehouses, namespaces and tables. Following table shows the mapping between Embucket and Iceberg entities:
Embucket | Iceberg |
---|---|
Database | Warehouse |
Schema | Namespace |
Table | Table |
Run the following query to show databases:
$ snow sql -c local
> show databases;+----------------------------------------------------------------+| created_on | name | kind | database_name | schema_name ||------------+----------+----------+---------------+-------------|| None | slatedb | STANDARD | None | None || None | demo | STANDARD | None | None || None | embucket | STANDARD | None | None |+----------------------------------------------------------------+
Connect to Embucket
Section titled “Connect to Embucket”Use pyiceberg
to connect to the Embucket’s Iceberg Catalog API. It requires rest
catalog type, URI of the API and a warehouse name. Warehouse matches a database name in Embucket.
from pyiceberg.catalog import load_catalog
# Connect to Embucketcatalog = load_catalog( type="rest", uri="http://127.0.0.1:3000/catalog", warehouse="demo",)
Upload the dataset
Section titled “Upload the dataset”-
Download the NYC Taxi dataset as an example
Terminal window curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o yellow_tripdata_2023-01.parquet -
Load it into a PyArrow table:
import pyarrow.parquet as pqdf = pq.read_table("yellow_tripdata_2023-01.parquet") -
Create a new Iceberg table:
catalog.create_namespace("public")table = catalog.create_table("public.taxi_dataset",schema=df.schema,) -
Append the PyArrow table to the Iceberg table:
table.append(df)len(table.scan().to_arrow())
At this moment pyiceberg
SDK have written a data and committed an Iceberg transaction with Embucket Iceberg Catalog API. Query the data using SQL interface:
select * from demo.public.taxi_dataset limit 10;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee ||----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------|| 2 | 2023-01-22 21:37:32 | 2023-01-22 21:53:13 | 3.0 | 3.12 | 1.0 | N | 230 | 211 | 1 | 17.0 | 1.0 | 0.5 | 4.4 | 0.0 | 1.0 | 26.4 | 2.5 | 0.0 || 2 | 2023-01-22 21:56:16 | 2023-01-22 22:05:41 | 1.0 | 1.94 | 1.0 | N | 114 | 13 | 1 | 12.1 | 1.0 | 0.5 | 3.42 | 0.0 | 1.0 | 20.52 | 2.5 | 0.0 || 2 | 2023-01-22 21:39:05 | 2023-01-22 21:43:20 | 1.0 | 0.37 | 1.0 | N | 229 | 229 | 1 | 5.8 | 1.0 | 0.5 | 0.25 | 0.0 | 1.0 | 11.05 | 2.5 | 0.0 || 2 | 2023-01-22 21:50:14 | 2023-01-22 21:52:57 | 1.0 | 0.85 | 1.0 | N | 262 | 263 | 1 | 5.8 | 1.0 | 0.5 | 2.16 | 0.0 | 1.0 | 12.96 | 2.5 | 0.0 || 2 | 2023-01-22 21:57:31 | 2023-01-22 22:02:46 | 1.0 | 1.23 | 1.0 | N | 141 | 75 | 2 | 7.9 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 12.9 | 2.5 | 0.0 || 2 | 2023-01-22 21:22:44 | 2023-01-22 21:26:06 | 1.0 | 0.44 | 1.0 | N | 256 | 256 | 2 | 5.8 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 8.3 | 0.0 | 0.0 || 2 | 2023-01-22 21:38:28 | 2023-01-22 21:46:46 | 1.0 | 1.78 | 1.0 | N | 79 | 170 | 1 | 10.7 | 1.0 | 0.5 | 3.14 | 0.0 | 1.0 | 18.84 | 2.5 | 0.0 || 2 | 2023-01-22 21:49:46 | 2023-01-22 22:02:26 | 1.0 | 2.42 | 1.0 | N | 170 | 143 | 1 | 14.9 | 1.0 | 0.5 | 3.98 | 0.0 | 1.0 | 23.88 | 2.5 | 0.0 || 2 | 2023-01-22 21:59:58 | 2023-01-22 22:36:35 | 1.0 | 22.26 | 1.0 | N | 132 | 241 | 1 | 83.5 | 1.0 | 0.5 | 10.0 | 6.55 | 1.0 | 103.8 | 0.0 | 1.25 || 2 | 2023-01-22 21:15:44 | 2023-01-22 21:40:22 | 2.0 | 9.14 | 1.0 | N | 264 | 151 | 1 | 38.7 | 6.0 | 0.5 | 13.81 | 6.55 | 1.0 | 70.31 | 2.5 | 1.25 |+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
select count(*) from demo.PUBLIC.taxi_dataset;
+----------+| count(*) ||----------|| 3066766 |+----------+