Working with Apache Spark
This guide walks you through setting up Apache Spark with Embucket to create a complete data processing environment. You'll use Docker containers to deploy Spark alongside Embucket, which stores tables in the Apache Iceberg format for efficient data operations.
Embucket provides an Iceberg Catalog API that enables Apache Spark to read and write data seamlessly. This integration lets you process large datasets while maintaining data consistency and performance.
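Under the hood this is plain Spark configuration: you register an Iceberg REST catalog that points at Embucket's catalog endpoint. The sketch below shows those settings applied programmatically to a `SparkSession`; it assumes the catalog name (`demo`), endpoint, and credentials used in the Docker setup later in this guide, and it requires the Iceberg Spark runtime and AWS bundle jars on the classpath. The Jupyter image in that setup applies the same settings for you through `spark-defaults.conf`, so treat this as an illustration rather than a required step.

```python
from pyspark.sql import SparkSession

# Illustrative only: the same catalog settings the Docker setup below writes
# to spark-defaults.conf. Hostnames, keys, and the catalog name "demo" are
# the example values used throughout this guide.
spark = (
    SparkSession.builder.appName("embucket-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config("spark.sql.catalog.demo.uri", "http://embucket:3000/catalog")
    .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.demo.warehouse", "demo")
    .config("spark.sql.defaultCatalog", "demo")
    .getOrCreate()
)
```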
What you’ll learn
In this guide, you will:
- Deploy a complete Apache Spark environment with Embucket using Docker
- Configure Spark to connect with Embucket’s Iceberg catalog
- Load sample data into Embucket tables
- Query data using Spark SQL in Jupyter notebooks
Docker configuration
Create a `docker-compose.yml` file in your working directory. This configuration defines four services that work together to provide a complete data processing environment.
```yaml
services:
  # Apache Spark with Iceberg support and Jupyter notebook interface
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    depends_on:
      - embucket
      - minio
    networks:
      iceberg_net:
    volumes:
      # Local warehouse directory for data storage
      - ./warehouse:/home/iceberg/warehouse
      # Directory for Jupyter notebooks
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      # AWS credentials for S3 compatibility with minio
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - AWS_REGION=us-east-2
      # Spark memory configuration - adjust based on your system resources
      - SPARK_DRIVER_MEMORY=16g
      - SPARK_EXECUTOR_MEMORY=16g
    ports:
      # Jupyter notebook web interface
      - 8888:8888
    entrypoint: /bin/sh
    command: >
      -c "
      echo '
        spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
        spark.sql.catalog.demo.catalog-impl org.apache.iceberg.rest.RESTCatalog
        spark.sql.catalog.demo.uri http://embucket:3000/catalog
        spark.sql.catalog.demo.io-impl org.apache.iceberg.aws.s3.S3FileIO
        spark.sql.catalog.demo.warehouse demo
        spark.sql.catalog.demo.cache-enabled false
        spark.sql.catalog.demo.rest.access-key-id AKIAIOSFODNN7EXAMPLE
        spark.sql.catalog.demo.rest.secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
        spark.sql.catalog.demo.rest.signing-region us-east-2
        spark.sql.catalog.demo.rest.sigv4-enabled true
        spark.sql.catalog.demo.s3.endpoint http://warehouse.minio:9000
        spark.sql.defaultCatalog demo
        spark.eventLog.enabled true
        spark.eventLog.dir /home/iceberg/spark-events
        spark.history.fs.logDirectory /home/iceberg/spark-events
        spark.sql.catalog.demo.s3.path-style-access true
      ' > /opt/spark/conf/spark-defaults.conf
      && ./entrypoint.sh notebook
      "

  # Embucket server providing Iceberg catalog and web interface
  embucket:
    image: embucket/embucket
    container_name: embucket
    depends_on:
      mc:
        condition: service_healthy # Wait for minio setup to complete
    networks:
      iceberg_net:
    ports:
      # API server for catalog operations
      - 3000:3000
      # Web interface for data management
      - 8080:8080
    environment:
      # Configure S3-compatible storage backend
      - OBJECT_STORE_BACKEND=s3
      - SLATEDB_PREFIX=data/
      # S3 credentials matching minio configuration
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - AWS_REGION=us-east-2
      # S3 bucket and endpoint configuration
      - S3_BUCKET=mybucket
      - S3_ENDPOINT=http://warehouse.minio:9000
      - S3_ALLOW_HTTP=true # Required for local minio setup
      - CATALOG_URL=http://embucket:3000/catalog
    volumes:
      - ./tmp:/tmp

  # minio S3-compatible object storage
  minio:
    image: minio/minio
    container_name: minio
    environment:
      # minio admin credentials (must match AWS credentials above)
      - MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE
      - MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    volumes:
      # Local directory for persistent storage
      - ./warehouse:/warehouse
    networks:
      iceberg_net:
        aliases:
          # Network alias for internal container communication
          - warehouse.minio
    ports:
      # minio web console
      - 9001:9001
      # minio S3 API
      - 9000:9000
    command: ['server', '/warehouse', '--console-address', ':9001']

  # minio client for initial bucket setup
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - AWS_REGION=us-east-2
    # Wait for minio to be ready, recreate the mybucket bucket for Embucket data,
    # make it public, then keep the container running
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc alias set minio http://warehouse.minio:9000 AKIAIOSFODNN7EXAMPLE wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY) do echo '...waiting for minio...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/mybucket;
      /usr/bin/mc mb minio/mybucket;
      /usr/bin/mc anonymous set public minio/mybucket;
      tail -f /dev/null
      "
    healthcheck:
      test: ['CMD', '/usr/bin/mc', 'ls', 'minio/mybucket']
      interval: 10s
      timeout: 5s
      retries: 3

# Custom network for container communication
networks:
  iceberg_net:
```
Service overview
The Docker Compose configuration creates four interconnected services:
| Service | Purpose | Key Features |
| --- | --- | --- |
| `spark-iceberg` | Apache Spark runtime | Provides Spark with Iceberg support and Jupyter notebook interface |
| `embucket` | Catalog management | Main server that manages the Iceberg catalog and web interface |
| `minio` | Object storage | S3-compatible storage backend for table data files |
| `mc` | Storage initialization | minio client that sets up the initial storage bucket |
All services communicate through a custom Docker network, `iceberg_net`, and share consistent AWS credentials for secure authentication between components.
Start the services
Deploy all services using Docker Compose:
```bash
# Pull images and start services in detached mode
docker compose up -d

# Monitor startup progress (optional)
docker compose logs -f
```
The initial startup takes 2-3 minutes as Docker downloads images and initializes services. You’ll see log messages indicating each service’s startup progress.
Verify deployment
Check that all services are running:
```bash
# Check service status
docker compose ps
```
Expected output:
```
NAME            IMAGE                     STATUS         PORTS
embucket        embucket/embucket         Up             0.0.0.0:3000->3000/tcp, 0.0.0.0:8080->8080/tcp
mc              minio/mc                  Up (healthy)
minio           minio/minio               Up             0.0.0.0:9000->9000/tcp, 0.0.0.0:9001->9001/tcp
spark-iceberg   tabulario/spark-iceberg   Up             0.0.0.0:8888->8888/tcp
```
All services should show “Up” status. The `mc` service shows “healthy” when bucket initialization completes.
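If you prefer checking from Python, a quick reachability test against the Embucket API also works. This is a minimal sketch run from the host machine, assuming the `requests` package is available; it calls the same metastore endpoint used later in this guide (no volumes exist yet, so expect an empty listing):

```python
import requests

# The metastore volumes endpoint should answer once Embucket is up;
# at this stage no volumes have been created.
resp = requests.get("http://localhost:3000/v1/metastore/volumes", timeout=5)
print(resp.status_code)
print(resp.text)
```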
Access the web interfaces
Once deployment completes, access these interfaces:
| Interface | URL | Purpose |
| --- | --- | --- |
| Embucket Web UI | http://localhost:8080 | Manage databases, volumes, and tables |
| Jupyter Notebooks | http://localhost:8888 | Interactive Spark development |
| minio Console | http://localhost:9001 | Track object storage |
Use the minio console credentials: username `AKIAIOSFODNN7EXAMPLE`, password `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`.
Configure Embucket storage
Before loading data, configure Embucket with the necessary storage volumes and databases. You can use the web interface, API calls, or SQL for this setup.
Create a storage volume
A volume defines where Embucket stores your table data. Create an S3-compatible volume that connects to your minio instance.
- Open http://localhost:8080 in your browser
- Navigate to the Volumes section
- Click Create Volume
- Enter these details:
  - Name: `demo`
  - Type: `S3`
  - Bucket: `mybucket`
  - Endpoint: `http://warehouse.minio:9000`
  - Access Key: `AKIAIOSFODNN7EXAMPLE`
  - Secret Key: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
- Click Create to save the volume
Alternatively, create the volume with a direct call to the Embucket API:

```bash
curl -X POST http://localhost:3000/v1/metastore/volumes \
  -H "Content-Type: application/json" \
  -d '{
    "ident": "demo",
    "type": "s3",
    "credentials": {
      "credential_type": "access_key",
      "aws-access-key-id": "AKIAIOSFODNN7EXAMPLE",
      "aws-secret-access-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    },
    "bucket": "mybucket",
    "endpoint": "http://warehouse.minio:9000"
  }'
```
Or run the equivalent SQL statement:

```sql
CREATE EXTERNAL VOLUME demo STORAGE_LOCATIONS = ((
  NAME = 'demo'
  STORAGE_PROVIDER = 's3'
  ENDPOINT = 'http://warehouse.minio:9000'
  ACCESS_KEY = 'AKIAIOSFODNN7EXAMPLE'
  SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
  BUCKET = 'mybucket'
));
```
Create a database
Databases organize your tables logically within Embucket. Create a database that uses your storage volume.
- In the Embucket web interface, go to Databases
- Click Create Database
- Enter:
  - Name: `demo`
  - Volume: `demo`
- Click Create to save the database
Alternatively, create the database with an API call:

```bash
curl -X POST http://localhost:3000/v1/metastore/databases \
  -H "Content-Type: application/json" \
  -d '{"ident": "demo", "volume": "demo"}'
```
Or run the equivalent SQL statement:

```sql
CREATE DATABASE demo WITH EXTERNAL_VOLUME = 'demo';
```
Verify the configuration
Confirm that your volume and database exist:
```bash
# Check created volumes
curl -s http://localhost:3000/v1/metastore/volumes | jq

# Check created databases
curl -s http://localhost:3000/v1/metastore/databases | jq
```
Both commands should return JSON responses showing your `demo` volume and database configurations.
Work with data using Spark
Use Apache Spark through Jupyter notebooks to load data into Embucket and run queries on your tables.
Open Jupyter notebook
- Navigate to http://localhost:8888 in your browser
- The Jupyter notebook interface opens with a file browser
- Create a new notebook: click New > Python 3
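The notebooks in this image come with a ready-to-use `spark` session that already points at the Embucket-backed `demo` catalog (the later examples in this guide rely on it), so your first cell can simply confirm the session is live. A quick sanity check; output varies with your Spark version:

```python
# The preconfigured `spark` session is provided by the notebook image
print(spark.version)

# List the namespaces visible through the default catalog
spark.sql("SHOW DATABASES").show()
```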
Create databases and explore data
Set up a database structure for your data:
```sql
%%sql
-- Create a database for sample data
CREATE DATABASE IF NOT EXISTS nyc;

-- Show all available databases
SHOW DATABASES;
```
Check for sample data in the container:
```
!ls -l /home/iceberg/data/
```
Load sample data
Load NYC taxi data into an Iceberg table:
```python
# Read the parquet file
df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")

# Show basic dataset information
print(f"Records: {df.count():,}")
print(f"Columns: {len(df.columns)}")

# Display schema
print("\nSchema:")
df.printSchema()

# Show sample rows
print("\nSample data:")
df.show(3)

# Save data as an Iceberg table
df.write.mode("append").saveAsTable("nyc.yellow_taxis")
```
Query the data
Run SQL queries on your loaded data:
- Basic count query

  ```sql
  %%sql
  -- Basic count query
  SELECT COUNT(*) as total_trips
  FROM nyc.yellow_taxis;
  ```

- Trip analysis by passenger count

  ```sql
  %%sql
  SELECT
    passenger_count,
    COUNT(*) as trip_count,
    ROUND(AVG(trip_distance), 2) as avg_distance_miles,
    ROUND(AVG(total_amount), 2) as avg_fare_amount
  FROM nyc.yellow_taxis
  WHERE passenger_count BETWEEN 1 AND 6
  GROUP BY passenger_count
  ORDER BY passenger_count;
  ```

- Top pickup locations

  ```sql
  %%sql
  SELECT
    PULocationID as pickup_location,
    COUNT(*) as trips
  FROM nyc.yellow_taxis
  GROUP BY PULocationID
  ORDER BY trips DESC
  LIMIT 10;
  ```
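The `%%sql` magic is convenient, but the same statements can run from plain Python with `spark.sql()`, which returns a DataFrame you can keep processing or pull into pandas. A short sketch reusing the `nyc.yellow_taxis` table loaded above:

```python
# Run the passenger-count aggregation through spark.sql() instead of %%sql
trips_by_passengers = spark.sql("""
    SELECT passenger_count,
           COUNT(*) AS trip_count,
           ROUND(AVG(total_amount), 2) AS avg_fare_amount
    FROM nyc.yellow_taxis
    WHERE passenger_count BETWEEN 1 AND 6
    GROUP BY passenger_count
    ORDER BY passenger_count
""")

trips_by_passengers.show()

# For small result sets, convert to pandas for plotting or export
pdf = trips_by_passengers.toPandas()
```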
Use your own data
To work with your own data files:
- Copy files to the `notebooks` directory in your project folder
- Access them from `/home/iceberg/notebooks/notebooks/` in the container
- Load and process them using Spark:
```python
# Load different file formats

# CSV with headers
csv_df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/home/iceberg/notebooks/notebooks/data.csv")
)

# JSON files
json_df = spark.read.json("/home/iceberg/notebooks/notebooks/data.json")

# Parquet files
parquet_df = spark.read.parquet("/home/iceberg/notebooks/notebooks/data.parquet")

# Save to Iceberg table
csv_df.write.mode("overwrite").saveAsTable("demo.my_data")

# Create tables with specific schemas
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True),
])

# Read with schema
df_with_schema = spark.read.schema(schema).csv("/home/iceberg/notebooks/notebooks/structured_data.csv")
df_with_schema.write.mode("overwrite").saveAsTable("demo.structured_table")
```
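After writing, you can confirm from the same notebook that the new tables are registered in the Embucket-backed catalog. A brief check, assuming the table names from the examples above:

```python
# List tables in the demo namespace of the default catalog
spark.sql("SHOW TABLES IN demo").show()

# Spot-check one of the newly written tables
spark.table("demo.my_data").printSchema()
print(f"Rows: {spark.table('demo.my_data').count():,}")
```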