Querying external CSV / Parquet data

As an alternative to uploading files to Seafowl, you can also create an "external" table on Seafowl that points directly to a:

  1. File on an HTTP server

     curl -v \
       -H "Content-Type: application/json" \
       -H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
       http://localhost:8080/q \
       -d@- <<EOF
     {"query": "
     CREATE EXTERNAL TABLE my_external_table (
       timestamp TIMESTAMP,
       temp_c_10 INT NULL,
       humidity_perc INT NULL,
       dew_point_c_10 INT NULL,
       pressure_mbar INT NULL,
       mean_wind_speed_knots_10 INT NULL,
       average_wind_bearing_degrees INT NULL,
       sunshine_hours_100 INT,
       rainfall_mm_1000 INT,
       max_wind_speed_knots_10 INT NULL)
       STORED AS CSV
       LOCATION 'https://www.cl.cam.ac.uk/research/dtg/weather/weather-raw.csv'"}
     EOF
    
  2. File in an S3/GCS bucket

     curl -v \
       -H "Content-Type: application/json" \
       -H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
       http://localhost:8080/q \
       -d@- <<EOF
     {"query": "
     CREATE EXTERNAL TABLE my_external_table
       STORED AS PARQUET
       OPTIONS ('access_key_id' '**', 'secret_access_key' '****', 'region' 'eu-west-1')
       LOCATION 's3://tutorial/trase-supply-chains.parquet'"}
     EOF
    

    The option keys used in the statement correspond to the config parameters for the target object store type (i.e. S3 or GCS).

    Note that the target bucket can be in an object store completely unrelated to the one configured for Seafowl. If you've already configured Seafowl to use S3/GCS storage, you can simply omit the OPTIONS clause to re-use those credentials and query a file in the same object store (possibly in a different bucket, as specified by the LOCATION clause).
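
    For example, assuming Seafowl itself is already configured with S3 credentials, a statement like the following sketch (the bucket and file names are hypothetical) re-uses them by omitting the OPTIONS clause:

    ```sql
    -- Re-uses the credentials from Seafowl's own object store configuration;
    -- the bucket and object names here are just examples.
    CREATE EXTERNAL TABLE other_table
      STORED AS PARQUET
      LOCATION 's3://some-other-bucket/data.parquet'
    ```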

You can then query the data as follows:

curl -v \
  -H "Content-Type: application/json" \
  http://localhost:8080/q \
  -d@- <<EOF
{"query": "
SELECT * FROM staging.my_external_table"}
EOF

Note that the table is created in a special ephemeral staging schema. External tables currently don't survive restarts and are mostly used to stage data for ingestion. If you have multiple Seafowl nodes, one node won't see external tables that are located on the other node.
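
As a sketch of that ingestion workflow (the target table name is hypothetical), you could persist the staged data into a regular, durable Seafowl table with a CREATE TABLE ... AS statement:

```sql
-- Copy the ephemeral external table into a regular Seafowl table
-- that survives restarts and is visible to all nodes.
CREATE TABLE ingested_weather AS
  SELECT * FROM staging.my_external_table
```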

Caching and range queries

Seafowl caches data loaded from external tables and uses HTTP Range requests to download only the required parts of a file. This is most effective when using formats like Parquet that come with an index that lets query engines seek to the data for specific columns.

Security note

Anyone who can run CREATE EXTERNAL TABLE can effectively read any local file/accessible HTTP resource through Seafowl.

For example, these are perfectly valid statements that will be executed by Seafowl without issues:

CREATE EXTERNAL TABLE shadow (
  username TEXT,
  password TEXT,
  lastchanged INT,
  minimum_age INT,
  maximum_age INT,
  warn_days INT,
  inactive_days INT,
  expiration_days INT
)
STORED AS CSV WITH DELIMITER ':'
LOCATION '/etc/shadow';

CREATE EXTERNAL TABLE some_resource (
  line TEXT
)
STORED AS CSV
LOCATION 'https://internal-service.service.dc1.consul';

By default, CREATE EXTERNAL TABLE requires authorization (like all other write operations). Keep this in mind if you allow untrusted users to write to Seafowl, since they will then also be able to create external tables pointing at arbitrary locations.

Reference

For more information on supported formats and the CREATE EXTERNAL TABLE syntax, see the reference.