Uploading a CSV / Parquet file
Seafowl has an endpoint that you can use to upload CSV files and Parquet tables
as a standard multipart/form-data upload.
curl -v \
-H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
-F "data=@path/to/file.parquet" \
http://localhost:8080/upload/[schema_name]/[table_name]
The /upload endpoint follows the same authorization rules and configuration as
a standard write to Seafowl. See the HTTP endpoint guide for
more information on configuring it and setting up a password.
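For clients that cannot shell out to curl, the same request can be assembled by hand. The following is a minimal Python sketch using only the standard library. It assumes the file goes in a form field named data and the bearer token in the Authorization header, as in the curl command above. The helper names encode_multipart and build_upload_request are illustrative and not part of Seafowl.

```python
import uuid
import urllib.request


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value). Field names and the
    filename are inserted verbatim, so they must not contain quotes or
    line breaks (a simplification for this sketch).
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Text fields are written first, the file part last.
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        ).encode("utf-8"))
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8"))
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(base_url, schema_name, table_name, token,
                         filename, file_bytes, fields=None):
    """Build (but do not send) a POST request to /upload/<schema>/<table>."""
    body, content_type = encode_multipart(
        fields or {}, "data", filename, file_bytes
    )
    return urllib.request.Request(
        f"{base_url}/upload/{schema_name}/{table_name}",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": content_type,
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would send it. The sketch stops short of that so the request can be inspected without a running Seafowl instance.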
Special options for CSV files
Seafowl tries to infer the schema of a CSV file automatically. Inference is inherently ambiguous, so if the result is unsatisfactory, you can make the schema explicit by passing an extra form-data parameter, schema, containing the JSON representation of the Arrow schema:
curl -v \
  -H "Authorization: Bearer 2Ux0FMpIifxS4EQVxvBhyBQl9EfZ0Cq1" \
  -F 'schema={
    "fields": [
      {
        "name": "some_number",
        "type": {"name": "int", "isSigned": true, "bitWidth": 32},
        "nullable": true,
        "children": []
      },
      {
        "name": "some_name",
        "type": {"name": "utf8"},
        "nullable": true,
        "children": []
      }
    ]
  }' \
  http://localhost:8080/upload/[schema_name]/[table_name] \
  -F "data=@path/to/file.csv"
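Writing the schema JSON by hand gets error-prone as the number of columns grows. As a rough sketch, the same document can be generated with a few Python helpers. Only the int and utf8 type descriptors shown in the curl example are used here. The names int_type, utf8_type, field, and schema_json are hypothetical helpers, not part of Seafowl or Arrow.

```python
import json


def int_type(bit_width=32, signed=True):
    """Integer type descriptor, in the same shape as the curl example."""
    return {"name": "int", "isSigned": signed, "bitWidth": bit_width}


def utf8_type():
    """String type descriptor, in the same shape as the curl example."""
    return {"name": "utf8"}


def field(name, arrow_type, nullable=True):
    """One entry of the schema's "fields" list, with no nested children."""
    return {
        "name": name,
        "type": arrow_type,
        "nullable": nullable,
        "children": [],
    }


def schema_json(fields):
    """Serialize a list of fields as the value of the schema form field."""
    return json.dumps({"fields": fields})


# Reproduces the two-column schema from the curl example.
schema = schema_json([
    field("some_number", int_type(32)),
    field("some_name", utf8_type()),
])
print(schema)
```

The printed string is what would be sent as the schema form-data parameter alongside the CSV file.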
If, on the other hand, specifying the schema explicitly is too laborious, you can upload the file with the inferred schema and use the resulting table as the source for a new table, casting and renaming the columns as needed:
CREATE TABLE actual_data AS
SELECT
  column_1::int AS some_number,
  column_2::varchar AS some_name
FROM staging_table;
In addition, Seafowl assumes by default that headers are present in the file. If
they are not, you need to say so explicitly by passing another parameter,
-F "has_headers=false".
Alternative: CREATE EXTERNAL TABLE
If your file is hosted somewhere else that's accessible by Seafowl, you can
create an external table and then store that table in Seafowl using
CREATE TABLE AS. For example (data from here):
CREATE EXTERNAL TABLE data
STORED AS PARQUET
LOCATION 'https://parqueth-sample.s3.us-west-1.amazonaws.com/mainnet/transactions/dt=2021-07-01/blocks-0012738509-0012739509.parquet';
CREATE TABLE parqueth_sample AS SELECT * FROM staging.data;
Read more in the dedicated guide.