Querying CSV / Parquet data over HTTP

As an alternative to uploading files to Seafowl, you can create an "external" table on Seafowl that points to a file on an HTTP server directly.

curl -v \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
  http://localhost:8080/q \
  [email protected] <<EOF
{"query": "
CREATE EXTERNAL TABLE cambridge_weather (
  timestamp TIMESTAMP,
  temp_c_10 INT NULL,
  humidity_perc INT NULL,
  dew_point_c_10 INT NULL,
  pressure_mbar INT NULL,
  mean_wind_speed_knots_10 INT NULL,
  average_wind_bearing_degrees INT NULL,
  sunshine_hours_100 INT,
  rainfall_mm_1000 INT,
  max_wind_speed_knots_10 INT NULL)
  STORED AS CSV
  LOCATION 'https://www.cl.cam.ac.uk/research/dtg/weather/weather-raw.csv'"}
EOF

You can then query the data as follows:

curl -v \
  -H "Content-Type: application/json" \
  http://localhost:8080/q \
  [email protected] <<EOF
{"query": "
SELECT * FROM staging.cambridge_weather ORDER BY timestamp DESC LIMIT 10"}
EOF

Note that the table is created in a special ephemeral staging schema. External tables currently don't survive restarts and are mostly used to stage data for ingestion. If you have multiple Seafowl nodes, one node won't see external tables that are located on the other node.

Caching and range queries

Seafowl caches data loaded from external tables and uses HTTP Range requests to download only the required parts of a file. This is most effective when using formats like Parquet that come with an index that lets query engines seek to the data for specific columns.

Security note

Anyone who can run CREATE EXTERNAL TABLE can effectively read any local file/accessible HTTP resource through Seafowl.

For example, these are perfectly valid statements that are going be executed by Seafowl without issues:

CREATE EXTERNAL TABLE shadow (
  username TEXT,
  password TEXT,
  lastchanged INT,
  minimum_age INT,
  maximum_age INT,
  warn_days INT,
  inactive_days INT,
  expiration_days INT
)
STORED AS CSV WITH DELIMITER ':'
LOCATION '/etc/shadow'
CREATE EXTERNAL TABLE some_resource (
  line TEXT
)
STORED AS CSV
LOCATION 'https://internal-service.service.dc1.consul'

By default, CREATE EXTERNAL TABLE requires authorization (together with all other write operations), but this can be a problem if you allow untrusted users to write to Seafowl.

Reference

For more information on supported formats and the CREATE EXTERNAL TABLE syntax, see the reference.