How Seafowl stores data
Within the object store root directory Seafowl stores each table inside of its own sub-directory, named after UUID of that particular table. Inside that sub-directory Seafowl adheres to the Delta protocol, powered by the delta-rs Rust implementation:
$ tree ~/seafowl-data . ├── dc030c8d-7f75-4d32-bd99-f31d5013882f │ ├── _delta_log │ │ ├── 00000000000000000000.json │ │ └── 00000000000000000001.json │ ├── part-00000-43566caa-fe48-4c4d-a77c-1d4ee39139be-c000.snappy.parquet │ ├── part-00001-43566caa-fe48-4c4d-a77c-1d4ee39139be-c000.snappy.parquet │ └── part-00002-43566caa-fe48-4c4d-a77c-1d4ee39139be-c000.snappy.parquet ...
This means that Seafowl storage layer is compatible with the Delta lake standard, and can reap the benefits of its extended ecosystem.
A Delta table consist of a number of partitions. These are...
...Columnar Parquet files. Parquet is an efficient, typed format that stores data by column. This speeds up analytics queries:
- similar data (same column) is co-located, so it compresses better
- storing data in column batches means the query engine can skip over columns that aren't required by the query at all
- enables query vectorization: processing batches of values at once.
...Immutable. After Seafowl writes out a partition file, it never changes it. This doesn't mean that you can't write to Seafowl tables. Updating a table or inserting into it means Seafowl creates new partitions (possibly with all the data from the previous partition apart from the updated/deleted rows) and a new table version. It then links related existing and new partitions to the newly created table version.
This is done because the expectation is that the user will write to the table
rarely and in large batches (through uploads,
CREATE TABLE AS statements, bulk
INSERTs or using
DELETEs to modify existing rows). In addition,
immutability means that once a partition has been downloaded from object storage
to a Seafowl instance's cache, it doesn't need to worry about that partition
Seafowl tables are versioned. When you write to a Seafowl table, Seafowl actually creates a new table version and repoints the table to that latest version.
This can be used for time travel queries or recovering older versions of your data (though Seafowl currently doesn't expose this functionality, see this GitHub issue).
Seafowl uses table versioning to provide fast cache invalidation. When you put Seafowl behind a CDN or an HTTP cache, the cached HTTP API uses the versions of all tables in a certain query as an input to its ETag computation. When you update the table, its version changes, changing the ETag and forcing caches to ask Seafowl to re-execute queries.
This all happens automatically and means that Seafowl query results are cached in your user's browser, by HTTP caches or by CDNs and can be quickly checked for freshness and updated if required.
The catalog is a classical OLTP database (SQLite or PostgreSQL) where Seafowl, among other things, stores table versions, dropped tables pending for lazy physical deletion, function definitions and so on.
It is vaguely similar to the Hive Metastore.
CREATE TABLE AS?
When you run a query like:
INSERT INTO some_table VALUES (1, 'one'), (2, 'two')
CREATE TABLE some_table AS ( SELECT dimension_1, SUM(column_1) AS measure_1 FROM some_other_table GROUP BY 1 )
Seafowl executes the input to that query (e.g. the
VALUES clause or the
CREATE TABLE AS) and persists the output as one or more partition
files into the Delta table directory. Finally, it commits the ongoing changes
for a new table version and stores it in the catalog.
In the case of an
INSERT, the new version is going to inherit the partitions
from the previous version of the table, together with the new partitions. In the
case of a
CREATE TABLE, this is going to be the first version of a brand new
UPDATE some_table SET some_val = some_val + 1 WHERE some_key = 'to_update'
While Seafowl isn't optimized for frequent updates, it still supports them. In this case, it constructs a plan that:
- prunes away partitions that are refuted by the
- applies the provided updates to each row of the remaining partitions, thus creating new partitions
It then executes this plan and persists the newly created partitions to the object store, while re-using the partitions that were pruned away as is.
Finally, it commits a new table version out of the resulting added/removed partitions. Note that an update of one row will still result in a table version that only differs from its parent by one partition addition/removal to the table state.
DELETE some_table WHERE some_key = 'to_delete'
This is similar to the
UPDATE case, in that it also employs
partition pruning to scope down the affected partitions
only to those targeted by the
WHERE qualifier. The difference is that the plan
then simply filters out the exact rows in the affected partitions that match
that delete condition. This in turn leads to the creation of new substitute
The unchanged and the newly created partitions make it into the new table version.
All table versions for the specified table are deleted. However, Seafowl defers
the deletion of the underlying table directory from the object store (unless a
background job is
set up). To clean up the
dropped table physical remnants, you need to run
VACUUM DATABASE db_name.