Part 1: Running Seafowl locally
Introduction
In this part of the tutorial, you'll:
- install Seafowl locally
- write some toy data to it
- load and query a real dataset in the Parquet format
Download a Seafowl binary for your system
Seafowl ships as a single binary for Linux, OSX and Windows. You can get release binaries from our GitHub nightly builds page:
Make a directory for Seafowl and its data:
mkdir seafowl
cd seafowl
Download the binary and unzip it:
# Pick one
TRIPLE="unknown-linux-gnu"
# TRIPLE="x86_64-apple-darwin"
# TRIPLE="pc-windows-msvc"
wget "https://nightly.link/splitgraph/seafowl/workflows/nightly/main/seafowl-nightly-$TRIPLE.zip"
unzip seafowl-nightly-x86_64-unknown-linux-gnu.zip
chmod +x ./seafowl # no need to do this if you're on Windows
Start Seafowl
You don't need a configuration file or any other services to start Seafowl. Just run:
./seafowl # seafowl.exe if you're on Windows
You will see Seafowl print out a startup log:
INFO seafowl > Starting Seafowl
INFO seafowl::config::schema > Writing to Seafowl will require a password. Randomly generated password: BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M
INFO seafowl::config::schema > The SHA-256 hash will be stored in the config as follows:
INFO seafowl::config::schema > [frontend.http]
INFO seafowl::config::schema > write_access = "a548f4bfe3c1d696f5f651f7036dd5e9f5e8642281f859d53109a4f27273b49e"
INFO seafowl > Writing a default configuration file to seafowl.toml
INFO seafowl > Starting the HTTP frontend on 127.0.0.1:8080
INFO seafowl > HTTP access settings: read any, write password
INFO warp::server > Server::run; addr=127.0.0.1:8080
INFO warp::server > listening on http://127.0.0.1:8080
Seafowl is going to store the data (the catalog and the object store2) in
your current working directory. It also wrote out a configuration file,
seafowl.toml
.
Run some simple queries
Now that we're up and running, let's run a basic SQL query against Seafowl's HTTP interface:
$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SELECT 1"}'
{"Int64(1)":1}
It works! Let's try some more complex expressions:
$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SELECT 2*2"}'
{"Int64(2) * Int64(2)":4}
$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SELECT 2*2 AS result"}'
{"result":4}
Let's list all tables in our database:
$ curl -H "Content-Type: application/json" localhost:8080/q -d'{"query": "SHOW TABLES"}'
{"table_catalog":"default","table_schema":"information_schema","table_name":"tables","table_type":"VIEW"}
{"table_catalog":"default","table_schema":"information_schema","table_name":"columns","table_type":"VIEW"}
There's nothing here apart from information_schema
tables, used by database
clients to programmatically inspect a database's structure. Let's fix that.
Write some data
Let's create a table and add some data to it.
$ curl -i -H "Content-Type: application/json" localhost:8080/q -d@-<<EOF
{"query": "CREATE TABLE first_table (
name VARCHAR,
value DOUBLE,
other_value INT
)"}
EOF
HTTP/1.1 403 Forbidden
content-length: 15
date: Mon, 22 Aug 2022 14:40:11 GMT
WRITE_FORBIDDEN
What happened? By default, Seafowl doesn't let anonymous users write data to your database. Instead, it requires a password to perform writes. You may have seen it flash in the logs when you first started Seafowl1:
INFO seafowl::config::schema > Writing to Seafowl will require a password. Randomly generated password: BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M
INFO seafowl::config::schema > The SHA-256 hash will be stored in the config as follows:
INFO seafowl::config::schema > [frontend.http]
INFO seafowl::config::schema > write_access = "a548f4bfe3c1d696f5f651f7036dd5e9f5e8642281f859d53109a4f27273b49e"
Let's try again, now with a password (use your own one here):
$ export YOUR_PASSWORD = "BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M"
$ curl -i \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
localhost:8080/q -d@-<<EOF
{"query": "CREATE TABLE first_table (
name VARCHAR,
value DOUBLE,
other_value INT
)"}
EOF
HTTP/1.1 200 OK
content-type: application/octet-stream
content-length: 0
date: Mon, 22 Aug 2022 14:45:43 GMT
That looks promising. Let's now write some test data into our table:
$ curl -i \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
localhost:8080/q -d@-<<EOF
{"query": "INSERT INTO first_table
VALUES
('Alpha', 2.3, 17),
('Alpha', 2.4, 18),
('Beta', 9.3, 42),
('Beta', 4.0, 18),
('Gamma', 6.8, 2),
('Gamma', 1.2, 8)
"}
EOF
HTTP/1.1 200 OK
content-type: application/octet-stream
content-length: 0
date: Mon, 22 Aug 2022 14:53:50 GMT
And run some queries on it:
$ curl -H "Content-Type: application/json" localhost:8080/q \
-d'{"query": "SELECT COUNT(*) FROM first_table"}'
{"COUNT(UInt8(1))":6}
$ curl -H "Content-Type: application/json" localhost:8080/q \
-d'{"query": "SELECT name, SUM(value) FROM first_table GROUP BY 1"}'
{"name":"Gamma","SUM(first_table.value)":10.0}
{"name":"Alpha","SUM(first_table.value)":35.0}
{"name":"Beta","SUM(first_table.value)":60.0}
$ curl -H "Content-Type: application/json" localhost:8080/q \
-d'{"query": "SELECT name, other_value, AVG(value) FROM first_table GROUP BY 1,
2 ORDER BY 3 DESC"}'
{"name":"Beta","other_value":9,"AVG(first_table.value)":42.0}
{"name":"Beta","other_value":4,"AVG(first_table.value)":18.0}
{"name":"Alpha","other_value":2,"AVG(first_table.value)":17.5}
{"name":"Gamma","other_value":1,"AVG(first_table.value)":8.0}
{"name":"Gamma","other_value":6,"AVG(first_table.value)":2.0}
Query some real data
We will now experiment with a real dataset. Trase's Supply Chains is a dataset of environmental risks related to commodity supply chains. The dashboard at https://explore.trase.earth/ is powered by Splitgraph, Seafowl's parent project, but we have mirrored the data to a single Parquet file that you'll now import to Seafowl.
Download the file:
wget https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet
Create an "external table" in Seafowl that points to this file:
$ curl \
-H "Content-Type: application/json" localhost:8080/q \
-H "Authorization: Bearer $YOUR_PASSWORD" \
-d@- <<EOF
{"query": "CREATE EXTERNAL TABLE supply_chains \
STORED AS PARQUET \
LOCATION '$(realpath trase-supply-chains.parquet)'"}
EOF
(here, the path is relative to the Seafowl server's working directory, so we
used realpath
to make it absolute).
External tables are temporary and are created in a special staging
schema2.
We can now query this Parquet file:
$ curl -H "Content-Type: application/json" localhost:8080/q \
-d'{"query": "SELECT COUNT(*) FROM staging.supply_chains"}'
{"COUNT(UInt8(1))":2989191}
$ curl \
-H "Content-Type: application/json" localhost:8080/q \
-d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM staging.supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF
{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
{"country_of_production":"COTE D'IVOIRE","count":27636}
{"country_of_production":"COLOMBIA","count":20820}
{"country_of_production":"PERU","count":8484}
{"country_of_production":"GHANA","count":2111}
{"country_of_production":"BOLIVIA","count":806}
Finally, let's actually ingest the table into Seafowl. Internally, Seafowl also stores tables as Parquet files, but it splits them into partitions and adds extra indexing information to accelerate queries3:
$ curl \
-H "Content-Type: application/json" localhost:8080/q \
-H "Authorization: Bearer BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M" \
-d@- <<EOF
{"query": "CREATE TABLE supply_chains \
AS SELECT * FROM staging.supply_chains"}
EOF
Now you can query the real Seafowl table:
$ curl \
-H "Content-Type: application/json" localhost:8080/q \
-d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF
{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
{"country_of_production":"COTE D'IVOIRE","count":27636}
{"country_of_production":"COLOMBIA","count":20820}
{"country_of_production":"PERU","count":8484}
{"country_of_production":"GHANA","count":2111}
{"country_of_production":"BOLIVIA","count":806}
What's next?
Next, we'll deploy a Seafowl instance to Fly.io, so that your application can run these queries from the user's browser — and from anywhere in the world.
- More on external tables in the dedicated guide.↩
- Exposing the password in the logs is a bad idea, since logs often get shipped to analysis/alerting tools and stored forever. We do this as a tradeoff between a fresh Seafowl instance being at least somewhat secure and not requiring the user to go through a setup step before starting it. In other cases, when you're configuring Seafowl's HTTP frontend, not explicitly enabling writes disables them.↩
- Find out more about how Seafowl stores data in the learning section.↩