Querying external CSV / Parquet data
As an alternative to uploading files to Seafowl, you can also create an "external" table on Seafowl that points directly to one of the following:
File on an HTTP server
curl -v \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
CREATE EXTERNAL TABLE my_external_table (
    timestamp TIMESTAMP,
    temp_c_10 INT NULL,
    humidity_perc INT NULL,
    dew_point_c_10 INT NULL,
    pressure_mbar INT NULL,
    mean_wind_speed_knots_10 INT NULL,
    average_wind_bearing_degrees INT NULL,
    sunshine_hours_100 INT,
    rainfall_mm_1000 INT,
    max_wind_speed_knots_10 INT NULL)
STORED AS CSV
LOCATION 'https://www.cl.cam.ac.uk/research/dtg/weather/weather-raw.csv'"}
EOF
File in an S3/GCS bucket
curl -v \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
CREATE EXTERNAL TABLE my_external_table
STORED AS PARQUET
OPTIONS ('access_key_id' '**', 'secret_access_key' '****', 'region' 'eu-west-1')
LOCATION 's3://tutorial/trase-supply-chains.parquet'"}
EOF
The option keys used in the statement correspond to the config parameters for the target object store type (i.e. S3 or GCS).
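For example, assuming the endpoint config parameter is likewise accepted as an option key, a sketch of the same statement targeting an S3-compatible store such as MinIO might look like this (the endpoint URL is a placeholder):
curl -v \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
CREATE EXTERNAL TABLE my_external_table
STORED AS PARQUET
OPTIONS ('access_key_id' '**', 'secret_access_key' '****', 'endpoint' 'http://localhost:9000')
LOCATION 's3://tutorial/trase-supply-chains.parquet'"}
EOF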
Note that the target bucket may be in an object store completely unrelated to the one configured for Seafowl. If you've already configured Seafowl to use S3/GCS storage, you can simply omit the OPTIONS clause to re-use the configured credentials and query a file in the same object store (possibly in a different bucket, as specified by the LOCATION clause).
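For instance, if Seafowl is already configured against the same S3 store, a minimal sketch of such a credential-less statement (the bucket name here is a placeholder) would be:
curl -v \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
CREATE EXTERNAL TABLE my_external_table
STORED AS PARQUET
LOCATION 's3://other-bucket/trase-supply-chains.parquet'"}
EOF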
You can then query the data as follows:
curl -v \
-H "Content-Type: application/json" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
SELECT * FROM staging.my_external_table"}
EOF
Note that the table is created in a special ephemeral staging schema. External tables currently don't survive restarts and are mostly used to stage data for ingestion. If you have multiple Seafowl nodes, one node won't see external tables that were created on another node.
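To ingest the staged data into a regular, persistent Seafowl table, you can copy it over with a CREATE TABLE ... AS statement. Here is a minimal sketch, assuming a hypothetical target table name of weather:
curl -v \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
CREATE TABLE weather AS SELECT * FROM staging.my_external_table"}
EOF
Unlike the external table, the resulting table is persisted in Seafowl's regular storage.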
Caching and range queries
Seafowl caches data loaded from external tables and uses HTTP Range requests to download only the required parts of a file. This is most effective when using formats like Parquet that come with an index that lets query engines seek to the data for specific columns.
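For example, a query that projects only a couple of columns of a Parquet-backed external table (the column names below are hypothetical) lets Seafowl fetch just the byte ranges for those columns instead of the whole file:
curl -v \
-H "Content-Type: application/json" \
http://localhost:8080/q \
-d@- <<EOF
{"query": "
SELECT country, export_volume FROM staging.my_external_table LIMIT 10"}
EOF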
Security note
Anyone who can run CREATE EXTERNAL TABLE can effectively read any local file or accessible HTTP resource through Seafowl.
For example, these are perfectly valid statements that are going to be executed by Seafowl without issues:
CREATE EXTERNAL TABLE shadow (
username TEXT,
password TEXT,
lastchanged INT,
minimum_age INT,
maximum_age INT,
warn_days INT,
inactive_days INT,
expiration_days INT
)
STORED AS CSV WITH DELIMITER ':'
LOCATION '/etc/shadow'
CREATE EXTERNAL TABLE some_resource (
line TEXT
)
STORED AS CSV
LOCATION 'https://internal-service.service.dc1.consul'
By default, CREATE EXTERNAL TABLE requires authorization (together with all other write operations), but this can be a problem if you allow untrusted users to write to Seafowl.
Reference
For more information on supported formats and the CREATE EXTERNAL TABLE syntax, see the reference.