Splitgraph has been acquired by EDB! Read the blog post.

Baking a dataset into a Docker image

For a small enough dataset (which takes less than a few hundred MB in Seafowl), a potential scaling and deployment strategy is building a Docker image that contains the Seafowl binary, the SQLite catalog and the actual Parquet files.

No writers, many readers architecture diagram

See our GitHub for an example Docker Compose stack for this.

This has some benefits:

  • Scaling without any need to set up and run a separate catalog and object store
  • Data is co-located with the Seafowl instance, making queries faster
  • Can deploy and automatically scale Seafowl to any serverless provider that lets you run a Docker container (e.g. Fly.io or Google Cloud Run).
  • Can build and deploy a dataset as part of your CI process
  • You can build a self-contained "data product" that is runnable and queryable by any Docker user.

And some drawbacks:

  • Doesn't support writes. To update your dataset, you need to redeploy your Seafowl instance.
  • Some providers might have a limit on the Docker image size

If these drawbacks are an issue, you might want to take a look at other scaling options.

Running Seafowl with --one-off

When you run Seafowl with --one-off, instead of starting a server, this runs a single command and exits Seafowl. We'll be using this in the context of a Docker image.

Building a Docker image

You can use the official Seafowl Docker image that is already configured to run Seafowl:

FROM splitgraph/seafowl:nightly

# Use a single RUN command in which we'll download and delete the source file
# so that it doesn't stay in a Docker layer.

RUN \
  curl https://www.cl.cam.ac.uk/research/dtg/weather/weather-raw.csv -o weather-raw.csv && \
  seafowl --one-off \
  "CREATE EXTERNAL TABLE cambridge_weather (\
  timestamp TIMESTAMP,\
  temp_c_10 INT NULL,\
  humidity_perc INT NULL,\
  dew_point_c_10 INT NULL,\
  pressure_mbar INT NULL,\
  mean_wind_speed_knots_10 INT NULL,\
  average_wind_bearing_degrees INT NULL,\
  sunshine_hours_100 INT,\
  rainfall_mm_1000 INT,\
  max_wind_speed_knots_10 INT NULL)\
  STORED AS CSV\
  LOCATION './weather-raw.csv';\
  CREATE TABLE cambridge_weather AS SELECT * FROM staging.cambridge_weather" && \
  rm weather-raw.csv

This Dockerfile:

You can now test this Docker image locally:

DOCKER_BUILDKIT=1 docker build -t seafowl-weather-test .
docker run -d --name seafowl-weather-test-1 -p 127.0.0.1:8080:8080 seafowl-weather-test

curl \
  -H "Content-Type: application/json" \
  http://localhost:8080/q -d@- <<EOF
{"query": "SELECT date_trunc('month', timestamp) AS month,
  AVG(temp_c_10 * 0.1) AS avg_temp
  FROM cambridge_weather GROUP BY 1 ORDER BY 2 ASC LIMIT 10"}
EOF

{"month":"2010-12-01 00:00:00","avg_temp":-0.18382749326145384}
{"month":"2018-02-01 00:00:00","avg_temp":1.0846681922196781}
{"month":"2010-01-01 00:00:00","avg_temp":1.3368279569892467}
{"month":"2021-01-01 00:00:00","avg_temp":1.5615591397849464}
{"month":"2017-01-01 00:00:00","avg_temp":1.7540322580645156}
{"month":"2009-01-01 00:00:00","avg_temp":2.276196898179361}
{"month":"2013-03-01 00:00:00","avg_temp":2.369500674763832}
{"month":"2019-01-01 00:00:00","avg_temp":2.500806451612902}
{"month":"1995-12-01 00:00:00","avg_temp":2.592955892034235}
{"month":"2022-01-01 00:00:00","avg_temp":2.6318611987381733}