seafowl.toml configuration
Using environment variables
Seafowl supports sourcing configuration values from environment variables.
The environment variable format is SEAFOWL__[section]__[section]__[key]=value. Key and section names are separated by a double underscore (__). Dots in names must also be replaced with a double underscore.
Environment variables take precedence over the config file.
For example, setting SEAFOWL__FRONTEND__HTTP__WRITE_ACCESS=off is equivalent to setting the configuration parameter frontend.http.write_access=off.
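As a minimal illustration of this mapping, the environment variable from the example corresponds to the following config file fragment (a sketch, assuming the value is written as a TOML string):

```toml
# Same effect as setting SEAFOWL__FRONTEND__HTTP__WRITE_ACCESS=off in the environment
[frontend.http]
write_access = "off"
```

If both are present, the environment variable wins, since environment variables take precedence over the config file.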
object_store section
This section contains the configuration for the object store used by Seafowl to store data.
Select the object store by setting a type=... parameter and configure it by adding extra fields for the specific flavor.
type = "local"
Default. Store data files on the local filesystem.
data_dir
The directory to store data files in. Default ./seafowl-data.
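A minimal sketch of a local object store configuration, spelling out the documented default directory explicitly:

```toml
[object_store]
type = "local"
# Documented default; any directory writable by the Seafowl process should work
data_dir = "./seafowl-data"
```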
type = "memory"
Store the data in RAM. This does not support any other parameters.
Note that when using this option, restarting the process will lose all data. In addition, combining an in-memory catalog with a persistent object store (or vice versa) will lead to consistency issues.
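To avoid the consistency issue described above, an in-memory object store can be paired with an in-memory catalog. The following is a sketch of such a fully ephemeral setup, using the :memory: SQLite DSN described in the catalog section:

```toml
# Everything is lost when the process restarts
[object_store]
type = "memory"

[catalog]
type = "sqlite"
dsn = ":memory:"
```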
type = "s3"
Store data files in S3-compatible object storage such as S3 itself, MinIO, Cloudflare R2 etc.
⚠️ NOTE: If you're using actual AWS S3, do not specify endpoint; specify only region.
region
AWS S3 region. Optional.
access_key_id
AWS access key ID. Required.
secret_access_key
AWS secret access key. Required.
endpoint
Service endpoint for storage, for MinIO or other S3-like APIs. If using S3 itself, use the region parameter instead. Optional.
Example: https://localhost:9000
bucket
Name of the S3 bucket. Required.
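A sketch of an S3-compatible configuration pointing at a self-hosted endpoint. The bucket name and credentials below are hypothetical placeholders, not real values:

```toml
[object_store]
type = "s3"
# For MinIO or another S3-like API; omit this and set `region` instead when using AWS S3 itself
endpoint = "https://localhost:9000"
# Hypothetical placeholder credentials and bucket name
access_key_id = "my-access-key-id"
secret_access_key = "my-secret-access-key"
bucket = "seafowl"
```

Since environment variables take precedence over the config file, the credentials could presumably also be supplied as SEAFOWL__OBJECT_STORE__ACCESS_KEY_ID and SEAFOWL__OBJECT_STORE__SECRET_ACCESS_KEY, following the naming scheme described at the top of this page.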
type = "gcs"
Store data files in a GCS bucket.
bucket
Name of the GCS bucket. Required.
google_application_credentials
Path to the GCP JSON credentials file. Optional: the credentials can also be sourced from the env var GOOGLE_APPLICATION_CREDENTIALS, or from the metadata server in the case of GCP VMs.
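A sketch of a GCS configuration. The bucket name and credentials path are hypothetical placeholders:

```toml
[object_store]
type = "gcs"
# Hypothetical bucket name
bucket = "my-seafowl-bucket"
# Optional; hypothetical path. Leave out to fall back to the
# GOOGLE_APPLICATION_CREDENTIALS env var or the GCP metadata server
google_application_credentials = "/path/to/credentials.json"
```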
object_store.cache_properties section
This is an optional sub-section for the S3 object store, which enables caching of fetched object byte ranges. In addition, it performs range coalescing, by enforcing a minimum byte range threshold for fetching.
It stores the actual contents of the cached entries in a temporary directory on the local file system.
capacity
Maximum size of all objects in the cache. Defaults to 512 MB.
min_fetch_size
Determines the minimum range size for a byte fetch request. Defaults to 2 MB.
ttl_s
Time-to-live for the entries in the cache. Defaults to 3 minutes.
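A sketch of the caching sub-section with the documented defaults written out. The units are an assumption: this page gives the defaults as 512 MB, 2 MB and 3 minutes but does not state how the values are expressed, so the example assumes bytes for the sizes and seconds for the time-to-live:

```toml
[object_store]
type = "s3"
# ... S3 parameters as described above ...

[object_store.cache_properties]
# Assumed to be in bytes: 512 MB
capacity = 536870912
# Assumed to be in bytes: 2 MB
min_fetch_size = 2097152
# Assumed to be in seconds: 3 minutes
ttl_s = 180
```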
catalog section
This section contains the configuration for the catalog used by Seafowl to store metadata (table names and mappings to partitions, index for partition pruning, UDF definitions etc).
Select the catalog by setting a type=... parameter and configure it by adding extra fields for the specific flavor.
type = "sqlite"
Default. Store the catalog in a local SQLite file.
dsn
Path to the SQLite file or the connection string. Default ./seafowl-data/seafowl.sqlite.
You can use :memory: here to use an in-memory SQLite database. Note that when using this option, restarting the process will lose all data. In addition, combining an in-memory catalog with a persistent object store (or vice versa) will lead to consistency issues.
journal_mode
Journal mode used by SQLite. Default wal. One of delete, truncate, persist, memory, wal, off. See the SQLite documentation for more information.
journal_mode = 'delete' is required to make a Seafowl instance work against LiteFS as a leader (since LiteFS doesn't support wal).
read_only
Open the SQLite database in read-only mode. Using journal_mode = 'off' and read_only = true is required to make a Seafowl instance work against a LiteFS replica.
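The two LiteFS setups described above could be sketched as follows. Only one [catalog] table can appear in a given config file, so the replica variant is shown commented out. The DSN is the documented default and would need to point at the LiteFS-managed file in practice:

```toml
# Instance running against LiteFS as a leader
[catalog]
type = "sqlite"
dsn = "./seafowl-data/seafowl.sqlite"
journal_mode = "delete"

# Instance running against a LiteFS replica (use instead of the block above):
# [catalog]
# type = "sqlite"
# dsn = "./seafowl-data/seafowl.sqlite"
# journal_mode = "off"
# read_only = true
```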
type = "postgres"
Store the catalog in a PostgreSQL database.
dsn
Connection URI to the PostgreSQL database, in the format
postgresql://[user[:password]@][[host][:port][,...]][/dbname][?name=value[&...]]
Example: postgresql://user:secret@localhost
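A sketch of a PostgreSQL-backed catalog, reusing the example URI from above with a hypothetical database name appended:

```toml
[catalog]
type = "postgres"
# "seafowl" is a hypothetical database name; user and password are placeholders
dsn = "postgresql://user:secret@localhost/seafowl"
```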
frontend.http section
This section contains the configuration for the HTTP frontend used to query Seafowl from Web applications. Omit this section to disable the HTTP frontend altogether.
write_access
Settings for write access to Seafowl (execution of any non-SELECT/EXPLAIN queries). This can be either any (anyone can write), off (disabled) or a SHA-256 hash of a password.
By default, Seafowl will generate and write a password hash to this section (as well as the actual password in the logs) once when it starts up without detecting a config file.
If a config file already exists and this is omitted, it defaults to off.
To generate a new password, you can use this Bash snippet:
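# Generate a random 32-character alphanumeric password (override the length with $1)
pw=$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "${1:-32}")
# The first 64 characters of the sha256sum output are the hex digest
pw_hash=$(printf '%s' "$pw" | sha256sum | head -c 64)
printf 'Password: %s\nHash: %s\n' "$pw" "$pw_hash"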
read_access
Settings for read access to Seafowl (execution of SELECT/EXPLAIN queries). This can be either any (anyone can read), off (disabled) or a SHA-256 hash of a password. By default, this is set to any.
The read password can be different from the write password.
bind_host
IP address to bind the HTTP frontend to. Default 127.0.0.1. To expose Seafowl to other machines on the network, use 0.0.0.0 here.
bind_port
Port for the HTTP frontend. Default 8080.
upload_data_max_length
Maximum size (in MB) of uploads to Seafowl's /upload endpoint. Default 2 MB.
Note that Seafowl currently keeps the whole uploaded file in memory, making the
upload endpoint unsuitable for memory-constrained environments.
cache_control
The directives set as the Cache-Control header value for the cached GET endpoint. Optional, defaults to max-age=43200, public.
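Putting the parameters of this section together, a configuration for a publicly readable, password-protected-for-writes instance might look like the sketch below. The write_access value is a placeholder to be replaced with a real SHA-256 hash (for example, one produced by the Bash snippet above), and the upload limit is an arbitrary illustrative value:

```toml
[frontend.http]
bind_host = "0.0.0.0"
bind_port = 8080
# Anyone can run SELECT/EXPLAIN queries
read_access = "any"
# Placeholder: replace with the 64-character hex SHA-256 hash of the write password
write_access = "<sha256-hash-of-password>"
# In MB; arbitrary illustrative value, the documented default is 2
upload_data_max_length = 16
# Documented default, written out explicitly
cache_control = "max-age=43200, public"
```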
frontend.postgres section
This section contains the configuration for the PostgreSQL frontend used to query Seafowl from PostgreSQL clients. This endpoint doesn't support authentication or encryption and should only be used in development.
By default, this section is omitted and disabled.
bind_host
IP address to bind the PostgreSQL frontend to. Default 127.0.0.1. To expose Seafowl to other machines on the network, use 0.0.0.0 here.
bind_port
Port for the PostgreSQL frontend. Default 6432.
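A sketch that enables the PostgreSQL frontend with its documented defaults, keeping it bound to the loopback interface since the endpoint has no authentication or encryption:

```toml
[frontend.postgres]
# Loopback only; this frontend is intended for development use
bind_host = "127.0.0.1"
bind_port = 6432
```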
misc section
Miscellaneous Seafowl configuration.
max_partition_size
Maximum length (in rows) of a Parquet file (partition) to produce when writing Seafowl tables. Default 1048576 (1024x1024).
For more information on partitioning, see the learning section.
gc_interval
Interval (in hours) at which a cron task will run garbage collection of orphan partitions (effectively invoking VACUUM PARTITIONS). Default is 0 (i.e. the task is not run at all).
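A sketch of this section that keeps the default partition size and turns on daily garbage collection. The 24-hour interval is an arbitrary illustrative choice:

```toml
[misc]
# Documented default: 1024 x 1024 rows per Parquet file
max_partition_size = 1048576
# In hours; 0 (the default) disables the garbage collection task
gc_interval = 24
```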
runtime section
Various configuration settings related to executing queries.
max_memory
Guideline for the maximum amount of RAM (in MB) for DataFusion to use when executing queries, spilling data to disk during operations where there isn't enough memory. Note that DataFusion currently doesn't always respect this amount and it's not a guaranteed maximum RAM cap.
Default unlimited.
temp_dir
Override the temporary directory used to spill files during execution when DataFusion reaches the memory limit.
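A sketch of this section with a memory guideline and a custom spill directory. Both values are hypothetical and should be adjusted to the host:

```toml
[runtime]
# In MB; a guideline for DataFusion rather than a hard cap
max_memory = 4096
# Hypothetical path where DataFusion spills files once the memory limit is reached
temp_dir = "/var/tmp/seafowl-spill"
```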