Part 4 (option 2): Delivering query results globally with Cloudflare
Introduction
In this part of the tutorial, we'll put your Seafowl deployment behind Cloudflare CDN. This will let you use Cloudflare's edge network to cache and deliver query results globally.
You need to have a domain that's managed by Cloudflare for this to work. As an alternative, you can also set up Varnish in front of Seafowl.
Add a certificate on Fly.io
Go to the Fly.io dashboard1 for your application, for example, https://fly.io/apps/seafowl.
Click on "Certificates", then "Add certificate":
Enter a hostname for your application. For example, if you have a domain
yourdomain.io
, you can use seafowl.yourdomain.io
. Then, click "Create".
We now need to verify your domain ownership with Fly.io and direct visitors to Seafowl. Click on "View" next to your certificate. You should see verification instructions.
Set up the subdomain on Cloudflare
Go to your Cloudflare dashboard:
Add the required CNAME (ownership verification) and A/AAAA records. Make sure Cloudflare proxying is disabled for both the CNAME record, and the A/AAAA records (grey cloud)!
Finish the setup
Go back to Fly.io and click "Check again" on the certificate dialog. You should see green bullet points next to "Confirm domain ownership" and "Direct visitors to application". This normally takes effect instantly, but could take several minutes:
Finally, enable proxying on Cloudflare again:
Troubleshooting: Redirect Loop
Try sending a simple query to your domain which is now proxied through Cloudflare to Fly:
curl -L -i \
-H "Content-Type: application/json" \
https://seafowl.yourdomain.io/q \
-d'{"query": "SELECT 2*3"}'
Do you get stuck in an infinite redirect loop? This can happen when Cloudflare
is trying to connect to http
of your Fly app, and Fly is redirecting it to
https
, causing Cloudflare to redirect you to the same URL you requested.
To fix it, edit fly.toml
to set force_https = false
:
# ...
[[services.ports]]
- force_https = true
+ force_https = false
handlers = ["http"]
Then redeploy your app:
flyctl deploy
Query Seafowl through Cloudflare
Use the query.sh
script from the previous part and pass
SEAFOWL_HOST=https://seafowl.yourdomain.io
to query Seafowl:
export SEAFOWL_HOST=https://demo.seafowl.io
./query.sh "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1"
HTTP/2 200
...
cf-cache-status: DYNAMIC
server: cloudflare
cf-ray: 74248dc99aca54e1-MAN
{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}
That's weird. The cf-cache-status: DYNAMIC
header means Cloudflare doesn't
think this resource is cacheable.
This is because Cloudflare caches responses based on their extension by default. Files without an extension aren't cached at all.
We could use Cloudflare's Page Rules to get it to cache everything on our domain. Or, we could add one of these extensions to our URL. Let's do that instead.
Edit query.sh
Change the last line in query.sh
to request a query result with a .csv
extension. Seafowl ignores the extension and doesn't use it to determine the
response format, but it will get Cloudflare to cache Seafowl responses.
- curl -i "${headers[@]}" "$SEAFOWL_HOST/q/$hash"
+ curl -i "${headers[@]}" "$SEAFOWL_HOST/q/$hash.csv"
Run the query again
./query.sh "SELECT name, SUM(value) AS sum FROM test_cache GROUP BY 1"
...
cache-control: max-age=14400
cf-cache-status: MISS
server: cloudflare
cf-ray: 7424955fa85f35fb-MAN
{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}
That's better. You'll notice Cloudflare injected a Cache-Control
header of its
own into the response. This is for the downstream clients, like your user's
browser, to know how long to keep the result in the cache for before contacting
Cloudflare again.
The second time you run the query, its results will be cached by Cloudflare2:
cf-cache-status: HIT
age: 58
{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}
Test cache invalidation
Let's add more rows to the dataset. Note that we can't use the cached GET endpoint to perform writes:
curl -i \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
https://demo.seafowl.io/q -d@-<<EOF
{"query": "INSERT INTO test_cache
VALUES
('Epsilon', 7.4, 49)
"}
EOF
Sadly, the new data won't propagate to Cloudflare immediately, as Seafowl
currently doesn't send Cache-Control
headers that can force Cloudflare to
always revalidate query results. If we were to wait for a few hours and rerun
the query, we'd see this3:
cf-cache-status: INVALIDATED
age: 0
{"name":"Epsilon","sum":49.0}
{"name":"Alpha","sum":35.0}
{"name":"Beta","sum":60.0}
{"name":"Gamma","sum":10.0}
{"name":"Delta","sum":83.0}
{"name":"Kappa","sum":16.0}
Troubleshooting: Error 520 from Cloudflare
This is a mysterious error which Cloudflare returns "when the origin server returns an empty, unknown, or unexpected response to Cloudflare.".
This has been known to happen when hosting Seafowl as a Fly app with a shared IPv4 address. The exact trigger of the bug appears to be sending an HTTP request with more than 4 request headers through Cloudflare to Fly, for example:
# this works fine and returns a 200
curl -i 'https://seafowl.yourdomain.io/q/7e323062f1a298c84cbfa790e4157c710b95bfe5e3f6573302881f0a049c485e' -X 'OPTIONS' \
-H 'access-control-request-headers: content-type,x-seafowl-query' \
-H 'access-control-request-method: GET' \
-H 'origin: http://localhost:8000' \
-H 'sec-fetch-mode: cors'
# this (one more header) returns a 520 and never reaches the app (no logs)
curl -i 'https://seafowl.yourdomain.io/q/7e323062f1a298c84cbfa790e4157c710b95bfe5e3f6573302881f0a049c485e' -X 'OPTIONS' \
-H 'access-control-request-headers: content-type,x-seafowl-query' \
-H 'access-control-request-method: GET' \
-H 'origin: http://localhost:8000' \
-H 'sec-fetch-mode: cors' \
-H 'another-header: yes-please'
The error does not happen when sending the request directly to your Fly app
(e.g. seafowl.fly.dev
).
You are most likely to notice this when querying from the browser, since it adds
more request headers than when you're just querying through curl
.
Until this bug is fixed, the best solution is to add a dedicated IPv4 address to
your Fly app, which costs $2 per month. Once you've signed up for the paid plan,
you can add a dedicated IPv4 address to your Fly app with the fly
CLI:
fly ips allocate-v4
Once you have a dedicated IPv4, make sure you edit the DNS settings in Cloudflare to use that address for your app.
Next steps
It's time to upgrade from the command line to the browser. In the next part, we'll put everything we learned together to query Seafowl directly from the user's browser and build a beautiful dynamic visualization with Observable.
- Thanks to this blog post for the step-by-step guide.↩
- This is not reliable: you might hit a different Cloudflare server and get
another
MISS
.↩ - See
the Cloudflare reference
for all possible values of
CF-Cache-Status
.↩