Splitgraph has been acquired by EDB! Read the blog post.

Accelerating queries with a cache / CDN

Seafowl also provides a read-only query execution endpoint that follows HTTP cache semantics. This lets you cache Seafowl query results:

  • in the user's browser
  • using an HTTP cache like Varnish
  • using a CDN like Cloudflare or Fastly

An example is available on our GitHub.

Having in mind that the URL is the main factor contributing to the HTTP response cache key, Seafowl provides a couple of ways to use this endpoint:

  1. Query in the URL path

    Simply put the URL-encoded query as the final path parameter:

    $ curl "http://localhost:8080/q/$query"
    

    This is the recommended approach, chiefly because it's less involved than the alternative:

  2. Query in the body or header

    This approach requires that you supply the SHA-256 hash of the query in the GET URL, and then either:

    • Pass the actual query as an X-Seafowl-Query header:

      #!/bin/bash -e
      SEAFOWL_HOST = "http://localhost:8080"
      query=$1
      hash=$(echo -n "$query" | sha256sum | cut -f 1 -d " ")
      
      curl -v\
      -H "X-Seafowl-Query: ${query}" \
      "$SEAFOWL_HOST/q/$hash"
      

      Note that the header value doesn't support newlines or non-printable / non-ASCII characters, so use encodeURIComponent() to encode special characters in your queries.

    • Otherwise, you can pass the query as a GET request body (though that can be silently dropped by some clients, including the fetch() API):

      jq -cn --arg q "$query" '{"query": $q}' | \
        curl -XGET \
        -H "Content-Type: application/json" \
        "$SEAFOWL_HOST/q/$hash" -d@-
      

Seafowl will then:

  • (Optional) Make sure the query matches the hash if it's provided. This is done so to avoid cache poisoning and because it can't recover the original query just from the hash.
  • Run the query
  • Return the result with an ETag that changes with the version of the dataset.

For example, to execute SELECT 'Hello, Seafowl' using the first method above run:

$ curl http://localhost:8080/q/SELECT%20\'Hello,%20Seafowl\'
{"Utf8(\"Hello, Seafowl\")":"Hello, Seafowl"}

Using the ETag for cache invalidation

You can pass the ETag back to Seafowl manually, performing cache revalidation using HTTP conditional requests.

This is mostly unnecessary if you are using the browser cache, an HTTP cache or a CDN (those handle conditional requests for you), but can be useful to debug caching.

curl -v \
  -H "If-None-Match: ${current_etag}" \
  "$SEAFOWL_HOST/q/$query"

Seafowl keeps track of all versions of all tables. If none of the tables in the query have been changed, Seafowl will return a 304 Not Modified response. Otherwise, it will re-execute the query and return the new result and the new ETag.

For example:

curl -i -H "If-None-Match: 4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945" \
  "http://localhost:8080/q/SELECT%20'Hello,%20Seafowl'"

HTTP/1.1 304 Not Modified


curl -i -H "If-None-Match: etag_changed" \
  "http://localhost:8080/q/SELECT%20'Hello,%20Seafowl'"

HTTP/1.1 200 OK
content-type: application/octet-stream
etag: 4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
...

Querying from the browser using the fetch() API

In the "default" cache mode, the fetch() call handles the ETag and the If-None-Match headers for you. This means that by default, query results will be cached in the client's browser without you having to do anything.

async function executeQueryCached(sql, host = "https://localhost:8080") {
  // Transform the query to make it a valid header
  // NOTES:
  //   - this will break legitimate newlines in your SQL (in string literals)
  //   - this won't handle characters that aren't a valid HTTP header
  const query = sql.trim().replace(/(?:\r\n|\r|\n)/g, " ");

  // NOTE: crypto.subtle.digest only available in secure contexts like
  // https or localhost. As an alternative, use something like:
  //
  //    const sha256 = require('https://cdnjs.cloudflare.com/ajax/libs/js-sha256/0.9.0/sha256.js');
  //    const hash = sha256(query)
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder("utf-8").encode(query)
  );
  const hash = [...new Uint8Array(digest)]
    .map((x) => x.toString(16).padStart(2, "0"))
    .join("");
  const response = await fetch(
    // Add an (ignored) extension to the URL to make Cloudflare treat the
    // query result as a static asset without using Page Rules.
    `${host}/q/${hash}.csv`,
    { headers: { "X-Seafowl-Query": query } }
  );
  const response_t = await response.text();
  return response_t ? response_t.trim().split("\n").map(JSON.parse) : [];
}

Putting Seafowl behind Varnish

The Varnish HTTP cache automatically handles the ETag header and sends conditional requests with If-None-Match.

See our GitHub for an example Docker Compose stack with Varnish and Seafowl that includes a sample VCL file.

Query results will then be cached automatically:

curl -v "$VARNISH_HOST/q/$query"

Putting Seafowl behind Cloudflare CDN

By default, Cloudflare only caches files with certain extensions and treats files without an extension as a dynamic resource.

Seafowl supports passing an optional extension to the GET request path. This extension is discarded, but can be used to trick Cloudflare into treating the resource as static and cacheable.

curl -v "$DOMAIN_BEHIND_CF/q/$query.csv"

Alternatively, you can add a Cloudflare page rule to cache Seafowl responses.

To verify the caching works as expected, make a request to Seafowl and note the Cloudflare response header. CF-Cache-Status: DYNAMIC means that the resource is uncacheable by Cloudflare. CF-Cache-Status: MISS or HIT means that the resource is cacheable.

Bypassing / busting the cache

You currently can't reconfigure a Seafowl instance to alter the ETag generation algorithm, change the returned cache headers (including adding different Cache-Control headers) or invalidate the HTTP cache in some other way.

You can query the uncached endpoint instead or bust the cache by adding an immaterial modification to your query. For example:

-- Comments are still included in the hash
-- so we can include a random value to bust the cache
-- abcdef12345
SELECT 1;

Note on authorization

Disabling Seafowl reads (read = "off") will also disable the cached GET endpoint. On the other hand, making the reads password-protected (read = "my_password_hash") will enforce request authorization via the usual Authorization: Bearer (my_password) header.