
Working with Apache Spark

This guide walks you through setting up Apache Spark with Embucket to create a complete data processing environment. You’ll use Docker containers to deploy Spark alongside Embucket, which stores table data in the Apache Iceberg table format.

Embucket provides an Iceberg REST catalog API that lets Apache Spark read and write Iceberg tables directly. The integration lets you process large datasets while maintaining data consistency and performance.

In this guide, you will:

  • Deploy a complete Apache Spark environment with Embucket using Docker
  • Configure Spark to connect with Embucket’s Iceberg catalog
  • Load sample data into Embucket tables
  • Query data using Spark SQL in Jupyter notebooks

Create a docker-compose.yml file in your working directory. This configuration defines four services that work together to provide a complete data processing environment.

services:
  # Apache Spark with Iceberg support and Jupyter notebook interface
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    depends_on:
      - embucket
      - minio
    networks:
      iceberg_net:
    volumes:
      # Local warehouse directory for data storage
      - ./warehouse:/home/iceberg/warehouse
      # Directory for Jupyter notebooks
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      # AWS credentials for S3 compatibility with minio
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - AWS_REGION=us-east-2
      # Spark memory configuration - adjust based on your system resources
      - SPARK_DRIVER_MEMORY=16g
      - SPARK_EXECUTOR_MEMORY=16g
    ports:
      # Jupyter notebook web interface
      - 8888:8888
    entrypoint: /bin/sh
    command: >
      -c "
        echo '
          spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
          spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
          spark.sql.catalog.demo.catalog-impl org.apache.iceberg.rest.RESTCatalog
          spark.sql.catalog.demo.uri http://embucket:3000/catalog
          spark.sql.catalog.demo.io-impl org.apache.iceberg.aws.s3.S3FileIO
          spark.sql.catalog.demo.warehouse demo
          spark.sql.catalog.demo.cache-enabled false
          spark.sql.catalog.demo.rest.access-key-id AKIAIOSFODNN7EXAMPLE
          spark.sql.catalog.demo.rest.secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
          spark.sql.catalog.demo.rest.signing-region us-east-2
          spark.sql.catalog.demo.rest.sigv4-enabled true
          spark.sql.catalog.demo.s3.endpoint http://warehouse.minio:9000
          spark.sql.catalog.demo.s3.path-style-access true
          spark.sql.defaultCatalog demo
          spark.eventLog.enabled true
          spark.eventLog.dir /home/iceberg/spark-events
          spark.history.fs.logDirectory /home/iceberg/spark-events
        ' > /opt/spark/conf/spark-defaults.conf && ./entrypoint.sh notebook
      "

  # Embucket server providing Iceberg catalog and web interface
  embucket:
    image: embucket/embucket
    container_name: embucket
    depends_on:
      mc:
        condition: service_healthy # Wait for minio setup to complete
    networks:
      iceberg_net:
    ports:
      # API server for catalog operations
      - 3000:3000
      # Web interface for data management
      - 8080:8080
    environment:
      # Configure S3-compatible storage backend
      - OBJECT_STORE_BACKEND=s3
      - SLATEDB_PREFIX=data/
      # S3 credentials matching minio configuration
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - AWS_REGION=us-east-2
      # S3 bucket and endpoint configuration
      - S3_BUCKET=mybucket
      - S3_ENDPOINT=http://warehouse.minio:9000
      - S3_ALLOW_HTTP=true # Required for local minio setup
      - CATALOG_URL=http://embucket:3000/catalog
    volumes:
      - ./tmp:/tmp

  # minio S3-compatible object storage
  minio:
    image: minio/minio
    container_name: minio
    environment:
      # minio admin credentials (must match AWS credentials above)
      - MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE
      - MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    volumes:
      # Local directory for persistent storage
      - ./warehouse:/warehouse
    networks:
      iceberg_net:
        aliases:
          # Network alias for internal container communication
          - warehouse.minio
    ports:
      # minio web console
      - 9001:9001
      # minio S3 API
      - 9000:9000
    command: ['server', '/warehouse', '--console-address', ':9001']

  # minio client for initial bucket setup
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      - AWS_REGION=us-east-2
    entrypoint: >
      /bin/sh -c "
        # Wait for minio to be ready
        until (/usr/bin/mc alias set minio http://warehouse.minio:9000 AKIAIOSFODNN7EXAMPLE wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY) do echo '...waiting for minio...' && sleep 1; done;
        # Clean up any existing bucket
        /usr/bin/mc rm -r --force minio/mybucket;
        # Create new bucket for Embucket data
        /usr/bin/mc mb minio/mybucket;
        # Set bucket permissions
        /usr/bin/mc anonymous set public minio/mybucket;
        # Keep container running
        tail -f /dev/null
      "
    healthcheck:
      test: ['CMD', '/usr/bin/mc', 'ls', 'minio/mybucket']
      interval: 10s
      timeout: 5s
      retries: 3

# Custom network for container communication
networks:
  iceberg_net:

The Docker Compose configuration creates four interconnected services:

Service         Purpose                  Key Features
spark-iceberg   Apache Spark runtime     Provides Spark with Iceberg support and a Jupyter notebook interface
embucket        Catalog management       Main server that manages the Iceberg catalog and web interface
minio           Object storage           S3-compatible storage backend for table data files
mc              Storage initialization   minio client that sets up the initial storage bucket

All services communicate over the custom Docker network iceberg_net and share the same example AWS-style credentials so that the components can authenticate with one another.
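The spark-iceberg service reads its catalog settings from the spark-defaults.conf file written at container startup. If you later connect a Spark session that you manage yourself, the same properties can be applied on the SparkSession builder instead. The snippet below is a minimal sketch that reuses the example endpoint, credentials placeholder, and catalog name from the Compose file; it assumes the Iceberg Spark runtime and AWS bundle JARs are already on the classpath (the tabulario/spark-iceberg image ships with them). From outside the Docker network, replace http://embucket:3000/catalog with http://localhost:3000/catalog.

# Sketch: configure a self-managed SparkSession against Embucket's Iceberg REST catalog.
# Values mirror the docker-compose example above; adjust them for your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("embucket-spark")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config("spark.sql.catalog.demo.uri", "http://embucket:3000/catalog")
    .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.demo.warehouse", "demo")
    .config("spark.sql.defaultCatalog", "demo")
    .getOrCreate()
)

The remaining S3 and SigV4 options from the Compose file can be added with further .config(...) calls in the same way.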

Deploy all services using Docker Compose:

# Pull images and start services in detached mode
docker compose up -d
# Monitor startup progress (optional)
docker compose logs -f

The initial startup takes 2-3 minutes as Docker downloads images and initializes services. You’ll see log messages indicating each service’s startup progress.
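If you prefer to script the wait rather than watch the logs, a small Python helper like the one below (a convenience sketch, not part of Embucket's tooling) can poll the published ports until they accept connections:

# Sketch: poll the locally published service ports until they accept TCP connections.
import socket
import time

def wait_for_port(port, host="localhost", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(2)
    return False

for name, port in [("embucket", 3000), ("minio", 9000), ("jupyter", 8888)]:
    print(name, "is ready" if wait_for_port(port) else "did not become reachable")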

Check that all services are running:

# Check service status
docker compose ps

Expected output:

NAME            IMAGE                     STATUS         PORTS
embucket        embucket/embucket         Up             0.0.0.0:3000->3000/tcp, 0.0.0.0:8080->8080/tcp
mc              minio/mc                  Up (healthy)
minio           minio/minio               Up             0.0.0.0:9000->9000/tcp, 0.0.0.0:9001->9001/tcp
spark-iceberg   tabulario/spark-iceberg   Up             0.0.0.0:8888->8888/tcp

All services should show “Up” status. The mc service shows “healthy” when bucket initialization completes.

Once deployment completes, access these interfaces:

Interface           URL                     Purpose
Embucket Web UI     http://localhost:8080   Manage databases, volumes, and tables
Jupyter Notebooks   http://localhost:8888   Interactive Spark development
minio Console       http://localhost:9001   Track object storage

Sign in to the minio console with username AKIAIOSFODNN7EXAMPLE and password wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY.

Before loading data, configure Embucket with the necessary storage volumes and databases. You can use either the web interface or API calls for this setup.

A volume defines where Embucket stores your table data. Create an S3-compatible volume that connects to your minio instance.

  1. Open http://localhost:8080 in your browser
  2. Navigate to the Volumes section
  3. Click Create Volume
  4. Enter these details:
    • Name: demo
    • Type: S3
    • Bucket: mybucket
    • Endpoint: http://warehouse.minio:9000
    • Access Key: AKIAIOSFODNN7EXAMPLE
    • Secret Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  5. Click Create to save the volume

Databases organize your tables logically within Embucket. Create a database that uses your storage volume.

  1. In the Embucket web interface, go to Databases
  2. Click Create Database
  3. Enter:
    • Name: demo
    • Volume: demo
  4. Click Create to save the database

Confirm that your volume and database exist:

# Check created volumes
curl -s http://localhost:3000/v1/metastore/volumes | jq
# Check created databases
curl -s http://localhost:3000/v1/metastore/databases | jq

Both commands should return JSON responses showing your demo volume and database configurations.
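The same checks can be run from Python with only the standard library, if that is more convenient than curl and jq. The snippet below hits the same read-only metastore endpoints from the host machine:

# Query the same metastore endpoints as the curl commands above.
import json
import urllib.request

for resource in ("volumes", "databases"):
    url = f"http://localhost:3000/v1/metastore/{resource}"
    with urllib.request.urlopen(url) as response:
        print(f"--- {resource} ---")
        print(json.dumps(json.load(response), indent=2))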

Use Apache Spark through Jupyter notebooks to load data into Embucket and run queries on your tables.

  1. Navigate to http://localhost:8888 in your browser
  2. The Jupyter interface opens with a file browser
  3. Create a new notebook: click New > Python 3, then run the quick check below to confirm the session can reach the catalog
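Before loading any data, a short first cell confirms the session is wired to the Embucket catalog. The notebook image provides a ready-made spark session (used throughout the examples below), and spark.sql.defaultCatalog should report the demo catalog configured in spark-defaults.conf:

# Sanity check: confirm the pre-created Spark session and the configured default catalog
print("Spark version:", spark.version)
print("Default catalog:", spark.conf.get("spark.sql.defaultCatalog"))
# Listing databases exercises the connection to Embucket's catalog
spark.sql("SHOW DATABASES").show()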

Set up a database structure for your data:

%%sql
-- Create a database for sample data
CREATE DATABASE IF NOT EXISTS nyc;
-- Show all available databases
SHOW DATABASES;

Check for the sample data bundled with the container image (run this command from a notebook cell):

!ls -l /home/iceberg/data/

Load NYC taxi data into an Iceberg table:

# Read the parquet file
df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")
# Show basic dataset information
print(f"Records: {df.count():,}")
print(f"Columns: {len(df.columns)}")
# Display schema
print("\nSchema:")
df.printSchema()
# Show sample rows
print("\nSample data:")
df.show(3)
# Save data as an Iceberg table
df.write.mode("append").saveAsTable("nyc.yellow_taxis")
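A quick read-back through the catalog confirms that the write landed in the nyc.yellow_taxis table:

# Read the table back through the Iceberg catalog and confirm the row count
taxis = spark.table("nyc.yellow_taxis")
print(f"Rows in nyc.yellow_taxis: {taxis.count():,}")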

Run SQL queries on your loaded data. A DataFrame API version of the second query follows these examples:

  1. Basic count query

    %%sql
    -- Basic count query
    SELECT COUNT(*) as total_trips
    FROM nyc.yellow_taxis;
  2. Trip analysis by passenger count

    %%sql
    SELECT
    passenger_count,
    COUNT(*) as trip_count,
    ROUND(AVG(trip_distance), 2) as avg_distance_miles,
    ROUND(AVG(total_amount), 2) as avg_fare_amount
    FROM nyc.yellow_taxis
    WHERE passenger_count BETWEEN 1 AND 6
    GROUP BY passenger_count
    ORDER BY passenger_count;
  3. Top pickup locations

    %%sql
    SELECT
    PULocationID as pickup_location,
    COUNT(*) as trips
    FROM nyc.yellow_taxis
    GROUP BY PULocationID
    ORDER BY trips DESC
    LIMIT 10;
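If you prefer Python over SQL, the same analysis can be expressed with the DataFrame API. As a sketch, the passenger-count breakdown from the second query looks like this:

# DataFrame API equivalent of the passenger-count analysis above
from pyspark.sql import functions as F

(
    spark.table("nyc.yellow_taxis")
    .where(F.col("passenger_count").between(1, 6))
    .groupBy("passenger_count")
    .agg(
        F.count(F.lit(1)).alias("trip_count"),
        F.round(F.avg("trip_distance"), 2).alias("avg_distance_miles"),
        F.round(F.avg("total_amount"), 2).alias("avg_fare_amount"),
    )
    .orderBy("passenger_count")
    .show()
)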

To work with your own data files:

  1. Copy files to the notebooks directory in your project folder
  2. Access them from /home/iceberg/notebooks/notebooks/ in the container
  3. Load and process using Spark:
# Load different file formats
# CSV with headers
csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv("/home/iceberg/notebooks/notebooks/data.csv")
# JSON files
json_df = spark.read.json("/home/iceberg/notebooks/notebooks/data.json")
# Parquet files
parquet_df = spark.read.parquet("/home/iceberg/notebooks/notebooks/data.parquet")

# Save to Iceberg table
csv_df.write.mode("overwrite").saveAsTable("demo.my_data")

# Create tables with specific schemas
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True)
])

# Read with schema
df_with_schema = spark.read.schema(schema).csv("/home/iceberg/notebooks/notebooks/structured_data.csv")
df_with_schema.write.mode("overwrite").saveAsTable("demo.structured_table")
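Because these are Iceberg tables, Spark can also query each table's metadata tables to inspect its commit history. For example, for the taxi table loaded earlier, the snapshots and history views show every write:

# Inspect Iceberg metadata tables for the taxi table created earlier
spark.sql("SELECT committed_at, snapshot_id, operation FROM nyc.yellow_taxis.snapshots").show(truncate=False)
spark.sql("SELECT * FROM nyc.yellow_taxis.history").show(truncate=False)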