Querying live tables stored in remote data sources
Another feature that Seafowl offers is the ability to query tables hosted on a number of supported remote data sources, in close analogy with the foreign data wrapper concept in Postgres. The supported data sources at present include PostgreSQL(-compatible), MySQL and SQLite databases. Let's demonstrate this capability by creating and querying a remote Postgres table through Seafowl.
Creating a remote table
For the purpose of this guide it's handy to use a readily available and publicly exposed table—a perfect opportunity to try out our sister project, the Splitgraph DDN. The DDN is an endpoint that enables anyone with a PostgreSQL client to query datasets hosted on the Splitgraph data platform.
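If you'd like to sanity-check your credentials first, any Postgres client can connect to the DDN directly with the same connection string we'll hand to Seafowl below (an optional step; the two environment variables are placeholders for your own API key and secret):
psql "postgresql://$SPLITGRAPH_API_KEY:$SPLITGRAPH_API_SECRET@data.splitgraph.com:5432/ddn" -c 'SELECT 1'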
To declare the remote table we employ the appropriate CREATE EXTERNAL TABLE statement variant (you'll need to specify your own DDN API credentials in the connection string below):
curl -H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
http://localhost:8080/q \
-d@-<<EOF
{"query": "
CREATE EXTERNAL TABLE seafowl_issue_reactions
STORED AS TABLE
OPTIONS ('name' '\"splitgraph-demo/seafowl:latest\".issue_reactions')
LOCATION 'postgresql://$SPLITGRAPH_API_KEY:$SPLITGRAPH_API_SECRET@data.splitgraph.com:5432/ddn'"}
EOF
The specified table name follows the fully-qualified table reference format used in Splitgraph. This particular table is auto-generated using one of the many supported Splitgraph connectors, and contains data on the reactions left on Seafowl GitHub issues.
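The same statement shape should work for the other supported sources too; for illustration, a hypothetical MySQL-backed table would differ only in the table name and connection string (the host, database and table names below are made-up placeholders, and the exact connection string format may vary with your setup):
curl -H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_PASSWORD" \
http://localhost:8080/q \
-d@-<<EOF
{"query": "CREATE EXTERNAL TABLE my_mysql_table \
STORED AS TABLE \
OPTIONS ('name' 'some_database.some_table') \
LOCATION 'mysql://user:password@mysql.example.com:3306/some_database'"}
EOF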
Inspecting the remote table
Note that all external tables are ephemeral and are stored in the staging schema; to verify this we can query the corresponding entry in the information_schema:
curl -H "Content-Type: application/json" \
http://localhost:8080/q -d@-<<EOF
{"query": "SELECT * FROM information_schema.tables WHERE table_name = 'seafowl_issue_reactions'"}
EOF
{"table_catalog":"default","table_name":"seafowl_issue_reactions","table_schema":"staging","table_type":"VIEW"}
Also note that the table is of type VIEW, signifying that it actually proxies queries to a live external table.
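Since all external tables land in staging, the same catalog can also give you a quick inventory of every remote table currently declared:
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT table_name FROM information_schema.tables WHERE table_schema = 'staging'"}
EOF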
It is important to note that our remote table declaration didn't include the column schema. In such cases Seafowl will perform schema introspection by itself upon table instantiation. You can double-check the inferred schema using the information_schema again:
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'seafowl_issue_reactions'"}
EOF
...
{"column_name":"content","data_type":"Utf8"}
{"column_name":"created_at","data_type":"Timestamp(Nanosecond, None)"}
{"column_name":"issue_number","data_type":"Int64"}
...
If the column schema is specified in the remote table declaration, Seafowl will automatically try to perform the necessary casting of any column whose type doesn't match the provided one. In addition, note that JSON fields are not supported in Seafowl for now, so Seafowl will convert those fields to strings (i.e. Utf8).
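For reference, here's a sketch of what an explicit-schema declaration might look like, assuming the usual CREATE EXTERNAL TABLE column list syntax (the seafowl_issue_reactions_typed name and the column types are illustrative choices; any mismatched columns would then be cast as described above):
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "CREATE EXTERNAL TABLE seafowl_issue_reactions_typed \
(content VARCHAR, created_at TIMESTAMP, issue_number BIGINT) \
STORED AS TABLE \
OPTIONS ('name' '\"splitgraph-demo/seafowl:latest\".issue_reactions') \
LOCATION 'postgresql://$SPLITGRAPH_API_KEY:$SPLITGRAPH_API_SECRET@data.splitgraph.com:5432/ddn'"}
EOF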
Querying the remote table
Finally, let's go ahead and actually query the table:
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT content, created_at, issue_number FROM staging.seafowl_issue_reactions"}
EOF
{"content":"hooray","created_at":"2022-09-26 13:41:52","issue_number":122}
{"content":"+1","created_at":"2022-10-26 19:18:07","issue_number":137}
{"content":"heart","created_at":"2022-10-26 19:18:10","issue_number":137}
{"content":"+1","created_at":"2022-10-26 19:18:35","issue_number":57}
{"content":"heart","created_at":"2022-10-26 19:18:42","issue_number":57}
{"content":"+1","created_at":"2022-10-26 21:35:08","issue_number":57}
{"content":"heart","created_at":"2022-10-27 09:25:51","issue_number":174}
{"content":"+1","created_at":"2022-11-03 09:41:13","issue_number":188}
{"content":"+1","created_at":"2022-11-04 12:31:14","issue_number":188}
Filter pushdown
Note that not only is the column projection pushed down into the remote data source, but so are some WHERE and LIMIT clauses. For example, in the query:
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT content FROM staging.seafowl_issue_reactions \
WHERE issue_number = 122 OR created_at >= '2022-10-26 19:18:10' LIMIT 2;"}
EOF
{"content":"hooray"}
{"content":"heart"}
all the filters have been pushed down to the remote, reducing the number of rows transmitted over the network. To see this explicitly, you can run an EXPLAIN query and look at the full_filters (and, if present, fetch) parameters of the table scan in the resulting plan:
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "EXPLAIN SELECT content FROM staging.seafowl_issue_reactions \
WHERE issue_number = 122 OR created_at >= '2022-10-26 19:18:10' LIMIT 2;"}
EOF
...
full_filters=[staging.seafowl_issue_reactions.issue_number = Int64(122) OR staging.seafowl_issue_reactions.created_at >= Utf8(\"2022-10-26 19:18:10\")], fetch=2
...
Not all WHERE clauses can be pushed down, however, which in practice puts an upper limit on the size of remote tables that can be queried efficiently. In cases where some filters are shippable but others are not, only the eligible ones will be pushed down; the remaining row filtering then takes place on the Seafowl instance itself.
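The same EXPLAIN technique is handy for checking any individual predicate: if a filter doesn't show up under full_filters in the plan, Seafowl will evaluate it locally after fetching the rows. For instance, to inspect a hypothetical LIKE predicate:
curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "EXPLAIN SELECT content FROM staging.seafowl_issue_reactions \
WHERE content LIKE '+%'"}
EOF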
Expanding the set of expressions that can be pushed down is a work in progress. If you find a bug in the existing filter pushdown logic, or have a need for one that isn't implemented yet, you can always open a Seafowl GitHub issue!