Part 2: Deploying to Fly.io
Introduction
In this part of the tutorial, you'll:
- deploy Seafowl to Fly.io
- add a persistent volume to support writes
- add the dataset from the previous part
Set up Fly.io
Fly.io is a platform for deploying applications around the world. Its free tier allows us to deploy up to two shared-CPU VMs with 256 MB of RAM each, which is good enough for our purposes.
Follow the Fly docs to:
- Install `fly`, the CLI for managing Fly applications
- Use `fly` to sign up to Fly
Normally, you shouldn't need to add a payment method to Fly to use the free tier, but in some cases you might be asked for one to prevent abuse. If you stay below the free limits, your card won't be charged.
Deploy Seafowl
Make a new directory for our Fly app and change into it. It doesn't have to be inside the directory you created in the previous part.
mkdir seafowl-fly && cd seafowl-fly
To deploy Seafowl to Fly, we need a Docker image. Luckily, we provide pre-built Docker images with Seafowl¹.
Run this:
fly launch --image splitgraph/seafowl:nightly --now
(Tested on flyctl v0.1.63.)
Follow the `fly` CLI instructions to create and deploy the app:
- For the name, use anything you want (we used `seafowl`)
- Pick any region you want (preferably the one closest to you)
This is what the output should look like:
$ fly launch
Creating app in seafowl-fly
? App Name (leave blank to use an auto-generated name): seafowl
? Select organization: Seafowl
? Select region: lhr (London, United Kingdom)
Created app seafowl in organization personal
Wrote config file fly.toml
Deploying seafowl
==> Validating app configuration
--> Validating app configuration done
Services
TCP 80/443 ⇢ 8080
Searching for image 'splitgraph/seafowl:nightly' remotely...
image found: img_n37qpoxm8ynpmz69
Image: registry-1.docker.io/splitgraph/seafowl:nightly
Image size: 49 MB
==> Creating release
...
==> Monitoring deployment
1 desired, 1 placed, 1 healthy, 0 unhealthy [health checks: 1 total, 1 passing]
--> v1 deployed successfully
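If you want to double-check that Seafowl came up, you can tail the app's logs (an optional check; `fly logs` streams output from the running VM):

# Stream logs from the deployed app and look for Seafowl's startup output
fly logs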
Query the instance
You can now find out the domain your Seafowl instance was deployed to:
$ fly status
App
Name = seafowl
Owner = personal
Version = 2
Status = running
Hostname = seafowl.fly.dev
Run a query against it (replace `seafowl.fly.dev` with your domain):
$ curl -i \
-H "Content-Type: application/json" \
https://seafowl.fly.dev/q \
-d'{"query": "SELECT 2*2"}'
{"Int64(2) * Int64(2)":4}
Make the instance writeable
This Seafowl instance is read-only, but we do want it to have our dataset from the previous step.
First, we need to make your instance writeable². Seafowl can source its configuration from environment variables, so let's use that to enable writes.
Generate a new password and set it as a Fly secret:
# Generate a random 32-character alphanumeric password
pw=$(LC_ALL=C tr -dc A-Za-z0-9 < /dev/urandom | head -c 32)
# Hash it with SHA-256 (the first 64 characters of sha256sum's output are the hex digest)
pw_hash=$(echo -n "$pw" | sha256sum | head -c 64)
echo "Password: $pw"
export YOUR_PASSWORD=$pw
# Store the hash as a Fly secret; Seafowl picks it up as an environment variable
fly secrets set "SEAFOWL__FRONTEND__HTTP__WRITE_ACCESS=$pw_hash"
This will also redeploy the application to pick up the password and enable writes.
Note: You set the secret to the `sha256` hash of the plaintext `$YOUR_PASSWORD`. Make sure to save the plaintext value of `$YOUR_PASSWORD` (rather than its hash), because that's what you'll need to make authenticated requests to your Seafowl instance.
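To verify that writes now work, you can issue a throwaway DDL statement with the plaintext password as a bearer token (a quick sanity check; the `smoke_test` table name is just an example, and as before, substitute your own domain):

# An authenticated write: create and immediately drop a scratch table
curl -H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
https://seafowl.fly.dev/q \
-d'{"query": "CREATE TABLE smoke_test (id INT)"}'

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
https://seafowl.fly.dev/q \
-d'{"query": "DROP TABLE smoke_test"}'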
Add persistent storage
You can now write to Seafowl! However, since there's no persistent storage, your data will be lost as soon as the application is restarted or redeployed.
Let's add a volume so that Seafowl can persist writes to it. You get 1 GB for free with Fly, which is also the minimum volume size.
Use the same region as where you deployed Seafowl initially (in our case, it's `lhr`; if you aren't sure of your region, you can see it with `fly status`):
fly volumes create seafowl_data --region lhr --size 1
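You can confirm the volume was created (optional):

# List volumes for the current app; seafowl_data should show up in your region
fly volumes list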
Edit the `fly.toml` file and add this section to link the volume to Seafowl:
[mounts]
source="seafowl_data"
destination="/seafowl-data"
And redeploy your application!
fly deploy
The data that you now write to Seafowl will be persisted in your Fly.io volume.
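If you'd like to confirm the mount, one option is to run a command on the VM over SSH (this assumes `fly ssh` works for your account, which it generally does with recent flyctl versions):

# Show the filesystem mounted at Seafowl's data directory
fly ssh console -C "df -h /seafowl-data"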
Add the dataset
Do you remember the Supply Chains dataset from the previous part? Let's add it to Seafowl. Instead of downloading and re-uploading it, we can tell Seafowl to query it directly over HTTP:
$ curl -i \
-H "Content-Type: application/json" https://seafowl.fly.dev/q \
-H "Authorization: Bearer $YOUR_PASSWORD" \
-d@- <<EOF
{"query": "CREATE EXTERNAL TABLE supply_chains \
STORED AS PARQUET \
LOCATION 'https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet'"}
EOF
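The external table lands in Seafowl's `staging` schema, so as a quick check you can count its rows in place (reads don't need the bearer token; as before, substitute your own domain):

# Query the external table directly; Seafowl fetches the Parquet data over HTTP
curl -H "Content-Type: application/json" \
https://seafowl.fly.dev/q \
-d'{"query": "SELECT COUNT(*) FROM staging.supply_chains"}'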
You could now query it as-is (Seafowl will cache the parts of the Parquet file that you query), but it'll be faster if we also ingest it into Seafowl:
$ curl -v \
-H "Content-Type: application/json" https://seafowl.fly.dev/q \
-H "Authorization: Bearer $YOUR_PASSWORD" \
-d@- <<EOF
{"query": "CREATE TABLE supply_chains \
AS SELECT * FROM staging.supply_chains"}
EOF
This should take about a minute³.
Query it
Now, you can run exactly the same queries as in the previous part, except this time the dataset is queryable from anywhere in the world:
$ curl \
-H "Content-Type: application/json" https://seafowl.fly.dev/q \
-d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF
{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
{"country_of_production":"COTE D'IVOIRE","count":27636}
{"country_of_production":"COLOMBIA","count":20820}
{"country_of_production":"PERU","count":8484}
{"country_of_production":"GHANA","count":2111}
{"country_of_production":"BOLIVIA","count":806}
Congratulations! You now have a single-node Seafowl deployment running on Fly.io, serving up a SQL API to a real dataset!
What's next?
If you were to publish a dashboard that uses this API, every visit would make Seafowl rerun all of the queries the dashboard uses.
But even with an interactive dashboard, the same queries often get run multiple times. In this case, we can vastly improve performance with caching.
Seafowl doesn't cache query results itself; instead, it relies on HTTP caches: the browser's cache and CDNs.
In the next part of the guide, we'll query Seafowl's cached GET API. This is the main building block that we need to get Seafowl working with HTTP caches.
Footnotes
1. You can of course bring your own Dockerfile. See some examples on our GitHub for inspiration.
2. There are other ways you can add a dataset to Seafowl. For example, you can bake the dataset directly into the Docker image. This limits the size of your dataset but lets you scale Seafowl indefinitely. You can also upload a CSV or a Parquet file to Seafowl without using `CREATE EXTERNAL TABLE`.
3. Note that due to the strict free tier RAM limit on Fly (256 MB), this upload can very occasionally fail when Seafowl indexes the data. In that case, try again or let us know in our GitHub issues.