
Querying live tables stored in remote data sources

Another feature that Seafowl offers is the ability to query tables hosted on a number of supported remote data sources, closely analogous to the foreign data wrapper concept in Postgres. The currently supported data sources are PostgreSQL(-compatible), MySQL and SQLite databases. Let's demonstrate this capability by creating and querying a remote Postgres table through Seafowl.

Creating a remote table

For the purpose of this guide it's handy to use a readily available and publicly exposed table—a perfect opportunity to try out our sister project, the Splitgraph DDN. The DDN is an endpoint that enables anyone with a PostgreSQL client to query datasets hosted on the Splitgraph data platform.

To declare the remote table, we use the appropriate CREATE EXTERNAL TABLE statement variant (you'll need to specify your own DDN API credentials in the connection string below):

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_PASSWORD" \
  http://localhost:8080/q \
  -d@-<<EOF
{"query": "
  CREATE EXTERNAL TABLE seafowl_issue_reactions
  STORED AS TABLE
  OPTIONS ('name' '\"splitgraph-demo/seafowl:latest\".issue_reactions')
  LOCATION 'postgresql://$SPLITGRAPH_API_KEY:$SPLITGRAPH_API_SECRET@data.splitgraph.com:5432/ddn'"}
EOF
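Heredoc JSON like the above is easy to get wrong, since the SQL statement mixes single quotes with escaped double quotes. If you are scripting against the /q endpoint, it can be safer to build the request body programmatically; a minimal sketch in Python (the endpoint, credentials and table name are the ones from the example above):

```python
import json
import os

# Credentials come from the environment, as in the shell example above.
api_key = os.environ.get("SPLITGRAPH_API_KEY", "API_KEY")
api_secret = os.environ.get("SPLITGRAPH_API_SECRET", "API_SECRET")

# The statement mixes single quotes and double quotes; json.dumps
# handles the escaping that the heredoc above does by hand.
sql = (
    "CREATE EXTERNAL TABLE seafowl_issue_reactions "
    "STORED AS TABLE "
    "OPTIONS ('name' '\"splitgraph-demo/seafowl:latest\".issue_reactions') "
    f"LOCATION 'postgresql://{api_key}:{api_secret}@data.splitgraph.com:5432/ddn'"
)
body = json.dumps({"query": sql})

# POST `body` to http://localhost:8080/q with a
# "Content-Type: application/json" header, e.g. via urllib.request.
```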

The specified table name uses Splitgraph's fully-qualified table reference format. This particular table is auto-generated by one of the many supported Splitgraph connectors, and contains data on the reactions to Seafowl GitHub issues.

Inspecting the remote table

Note that all external tables are ephemeral and are stored in the staging schema; to verify this we can query the corresponding entry in the information_schema:

curl -H "Content-Type: application/json" \
  http://localhost:8080/q -d@-<<EOF
{"query": "SELECT * FROM information_schema.tables WHERE table_name = 'seafowl_issue_reactions'"}
EOF
{"table_catalog":"default","table_name":"seafowl_issue_reactions","table_schema":"staging","table_type":"VIEW"}

Also note that the table is of type VIEW, signifying that it proxies queries to a live external table.

Note that our remote table declaration didn't include a column schema. In such cases, Seafowl performs schema introspection itself when the table is instantiated. You can double-check the inferred schema by querying the information_schema again:

curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'seafowl_issue_reactions'"}
EOF
...
{"column_name":"content","data_type":"Utf8"}
{"column_name":"created_at","data_type":"Timestamp(Nanosecond, None)"}
{"column_name":"issue_number","data_type":"Int64"}
...
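Since the /q endpoint returns newline-delimited JSON, the inferred schema is easy to consume programmatically. A minimal sketch in Python, using the truncated response above as a captured sample rather than hitting a live endpoint:

```python
import json

# A captured (truncated) response from the information_schema.columns
# query above; in practice this is the body returned by /q.
response = """\
{"column_name":"content","data_type":"Utf8"}
{"column_name":"created_at","data_type":"Timestamp(Nanosecond, None)"}
{"column_name":"issue_number","data_type":"Int64"}
"""

# Each line is a standalone JSON object; build a column -> type mapping.
schema = {}
for line in response.splitlines():
    if line:
        row = json.loads(line)
        schema[row["column_name"]] = row["data_type"]

print(schema["issue_number"])  # -> Int64
```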

If the column schema is specified in the remote table declaration, Seafowl will automatically try to cast any column whose remote type doesn't match the declared one. In addition, note that Seafowl doesn't support JSON fields for now, so it converts such fields to strings (i.e. Utf8).

Querying the remote table

Finally, let's go ahead and actually query the table:

curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT content, created_at, issue_number FROM staging.seafowl_issue_reactions"}
EOF
{"content":"hooray","created_at":"2022-09-26 13:41:52","issue_number":122}
{"content":"+1","created_at":"2022-10-26 19:18:07","issue_number":137}
{"content":"heart","created_at":"2022-10-26 19:18:10","issue_number":137}
{"content":"+1","created_at":"2022-10-26 19:18:35","issue_number":57}
{"content":"heart","created_at":"2022-10-26 19:18:42","issue_number":57}
{"content":"+1","created_at":"2022-10-26 21:35:08","issue_number":57}
{"content":"heart","created_at":"2022-10-27 09:25:51","issue_number":174}
{"content":"+1","created_at":"2022-11-03 09:41:13","issue_number":188}
{"content":"+1","created_at":"2022-11-04 12:31:14","issue_number":188}
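The query results are newline-delimited JSON as well, so they are straightforward to post-process client-side. As an illustration, counting reactions per issue in Python (using the rows above as a captured sample):

```python
import json
from collections import Counter

# The rows returned by the query above, as newline-delimited JSON.
response = """\
{"content":"hooray","created_at":"2022-09-26 13:41:52","issue_number":122}
{"content":"+1","created_at":"2022-10-26 19:18:07","issue_number":137}
{"content":"heart","created_at":"2022-10-26 19:18:10","issue_number":137}
{"content":"+1","created_at":"2022-10-26 19:18:35","issue_number":57}
{"content":"heart","created_at":"2022-10-26 19:18:42","issue_number":57}
{"content":"+1","created_at":"2022-10-26 21:35:08","issue_number":57}
{"content":"heart","created_at":"2022-10-27 09:25:51","issue_number":174}
{"content":"+1","created_at":"2022-11-03 09:41:13","issue_number":188}
{"content":"+1","created_at":"2022-11-04 12:31:14","issue_number":188}
"""

rows = [json.loads(line) for line in response.splitlines() if line]
reactions_per_issue = Counter(row["issue_number"] for row in rows)
print(reactions_per_issue.most_common(2))  # -> [(57, 3), (137, 2)]
```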

Filter pushdown

Note that not only is the column projection pushed down into the remote data source, but some WHERE and LIMIT clauses are too. For example, in the query:

$ curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "SELECT content FROM staging.seafowl_issue_reactions \
WHERE issue_number = 122 OR created_at >= '2022-10-26 19:18:10' LIMIT 2;"}
EOF
{"content":"hooray"}
{"content":"heart"}

all of the filters have been pushed down to the remote data source, reducing the number of rows transmitted over the network. To see this explicitly, you can run an EXPLAIN query and look at the full_filters (and, if present, fetch) parameters of the table scan in the resulting plan:

$ curl -H "Content-Type: application/json" http://localhost:8080/q -d@-<<EOF
{"query": "EXPLAIN SELECT content FROM staging.seafowl_issue_reactions \
WHERE issue_number = 122 OR created_at >= '2022-10-26 19:18:10' LIMIT 2;"}
EOF
...
full_filters=[staging.seafowl_issue_reactions.issue_number = Int64(122) OR staging.seafowl_issue_reactions.created_at >= Utf8(\"2022-10-26 19:18:10\")], fetch=2
...
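If you want to verify pushdown mechanically, say in a test suite, you can scan the EXPLAIN output for these parameters. A sketch using the plan fragment above as a captured sample:

```python
import re

# A captured fragment of the EXPLAIN output above (table scan line).
plan_line = (
    'full_filters=[staging.seafowl_issue_reactions.issue_number = Int64(122) '
    'OR staging.seafowl_issue_reactions.created_at >= '
    'Utf8("2022-10-26 19:18:10")], fetch=2'
)

# Extract the filter expressions that were shipped to the remote source.
match = re.search(r"full_filters=\[(.*)\]", plan_line)
shipped_filters = match.group(1) if match else None

# The LIMIT shows up as the `fetch` parameter of the scan.
limit_pushed = "fetch=2" in plan_line
```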

Not all WHERE clauses can be pushed down, however; a filter that must run locally means the remote rows have to be fetched first, which effectively puts an upper limit on the size of remote tables that can be queried this way. In cases where some filters are shippable and others are not, only the eligible ones are pushed down, and the remaining row filtering takes place on the Seafowl instance itself.

Expanding the set of expressions that can be pushed down is a work in progress. If you find a bug in the existing filter pushdown logic, or need support for an expression that isn't implemented yet, you can always open a Seafowl GitHub issue!