Splitgraph has been acquired by EDB! Read the blog post.

Part 3: Querying the cached GET API

Introduction

In this part of the tutorial, you'll query Seafowl's cached GET API, which will allow you to put Seafowl behind an HTTP cache or a CDN, as well as cache query results directly in the user's browser.

Cached GET API

Up until now, we've been sending POST requests to Seafowl. Those are usually not cached by anything.

Seafowl also has a special HTTP-cache-friendly API, available via GET requests. Using this API makes Seafowl query results look like a static HTTP asset. They then get automatically cached by browsers and CDNs.

To use it, we have to either supply the query via the URL itself, or alternatively include the hash of the query in the URL and smuggle the actual query as a query body or an HTTP header1.

Create a sample dataset

Since we'll be modifying the dataset to demonstrate cache invalidation, let's create a throwaway table for our experiments. Use the synthetic dataset from an earlier part of this tutorial:

curl -i \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
https://seafowl.fly.dev/q -d@-<<EOF
{"query": "CREATE TABLE test_cache (
    name VARCHAR,
    value DOUBLE,
    other_value INT
)"}
EOF

curl -i \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
https://seafowl.fly.dev/q -d@-<<EOF
{"query": "INSERT INTO test_cache
  VALUES
    ('Alpha', 2.3, 17),
    ('Alpha', 2.4, 18),
    ('Beta', 9.3, 42),
    ('Beta', 4.0, 18),
    ('Gamma', 6.8, 2),
    ('Gamma', 1.2, 8)
"}
EOF

Query the cached GET API from the command line

In this example we'll go with the approach of passing the query via the special X-Seafowl-Query header. Since we need to include the query hash in the URL, let's make a small script to simplify that:

#!/bin/bash -ex

# Replace with your host
SEAFOWL_HOST=${SEAFOWL_HOST-"https://seafowl.fly.dev"}
query=$1
etag=$2

# Hash the query
hash=$(echo -n "$query" | sha256sum | cut -f 1 -d " ")

headers=(-H "X-Seafowl-Query: $query")

# Add an ETag (if passed) for cache revalidation
if [[ -n "$etag" ]]; then
  headers+=(-H "If-None-Match: $etag")
fi

curl -i "${headers[@]}" "$SEAFOWL_HOST/q/$hash"

Pass a query to the script:

chmod +x query.sh
./query.sh "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1"

This should output:

+ curl -i -H 'X-Seafowl-Query: SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1' https://seafowl.fly.dev/q/1775f92c98b2917913a3acbfc36d230d0dc3f2b1cb182d1f5eb03cfc88265bb9
HTTP/2 200
content-type: application/octet-stream
etag: 06d033ece6645de592db973644cf7357255f24536ff7b03c3b2ace10736f7636
content-length: 83
date: Mon, 29 Aug 2022 07:45:59 GMT
server: Fly/99aa446d1 (2022-08-23)
via: 2 fly.io
fly-request-id: 01GBM90YYJXSWDRWVX09TAHSSQ-lhr

{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}

That doesn't seem cached to me?

Seafowl itself doesn't cache query results. Instead, it follows HTTP cache semantics. You'll see that it returned an ETag together with the query results2. This is a "fingerprint" for a certain version of the query result that intermediate caches can use to revalidate it.

When you write to a table, that changes the calculated ETag, letting caches know that their data is stale.

Normally, browsers and caches handle ETags for the user and we don't need to worry about managing them. But, to see how it works, let's pretend to be a browser and pass the ETag from the first response manually:

./query.sh \
  "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1" \
  "06d033ece6645de592db973644cf7357255f24536ff7b03c3b2ace10736f7636"

+ curl -i -H 'X-Seafowl-Query: SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1' -H 'If-None-Match: 06d033ece6645de592db973644cf7357255f24536ff7b03c3b2ace10736f7636' https://seafowl.fly.dev/q/1775f92c98b2917913a3acbfc36d230d0dc3f2b1cb182d1f5eb03cfc88265bb9
HTTP/2 304
...

Because we used the ETag to make a conditional HTTP request, Seafowl noticed that the dataset didn't change and gave us back a 304 Not Modified response without actually running the query.

Change the dataset

Update the dataset:

curl -i \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $YOUR_PASSWORD" \
https://seafowl.fly.dev/q -d@-<<EOF
{"query": "INSERT INTO test_cache
  VALUES
    ('Delta', 4.2, 83),
    ('Kappa', 9.3, 16)
"}
EOF

And run the previous query again:

./query.sh \
  "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1" \
  "06d033ece6645de592db973644cf7357255f24536ff7b03c3b2ace10736f7636"

...
content-type: application/octet-stream
etag: 46b1884167c4edd308bcf0c04163dd02d05c9742b35e86b57b5f7ed1b82f3850
content-length: 139
date: Mon, 29 Aug 2022 07:53:45 GMT
server: Fly/99aa446d1 (2022-08-23)
via: 2 fly.io
fly-request-id: 01GBM9F607PRD92W6WF4C5B7C0-lhr

{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}
{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}

Since the dataset changed, this invalidated the old ETag we had and made Seafowl rerun the query, giving us back the new results.

What's next?

In the next part of the guide, we'll add the Varnish HTTP cache to your Seafowl deployment to make it ready for prime time.

Or, do you have a domain name that's managed by Cloudflare? Let's put Seafowl behind Cloudflare and get query results cached and delivered globally.


  1. We recommend passing the query as a part of the URL path, as both the query header and GET request body methods have their own drawbacks. The former doesn't support special characters and newlines, whereas the GET request body can sometimes get dropped by clients and isn't supported by the fetch() API. More on the cached GET API in the guides section.
  2. Learn more about ETags on the MDN.