Splitgraph has been acquired by EDB! Read the blog post.

How Seafowl stores data

Within the object store root directory Seafowl stores each table inside of its own sub-directory, named after UUID of that particular table. Inside that sub-directory Seafowl adheres to the Delta protocol, powered by the delta-rs Rust implementation:

$ tree ~/seafowl-data
.
├── dc030c8d-7f75-4d32-bd99-f31d5013882f
│   ├── _delta_log
│   │   ├── 00000000000000000000.json
│   │   └── 00000000000000000001.json
│   ├── part-00000-43566caa-fe48-4c4d-a77c-1d4ee39139be-c000.snappy.parquet
│   ├── part-00001-43566caa-fe48-4c4d-a77c-1d4ee39139be-c000.snappy.parquet
│   └── part-00002-43566caa-fe48-4c4d-a77c-1d4ee39139be-c000.snappy.parquet
...

This means that Seafowl storage layer is compatible with the Delta lake standard, and can reap the benefits of its extended ecosystem.

Partitions

A Delta table consist of a number of partitions. These are...

...Columnar Parquet files. Parquet is an efficient, typed format that stores data by column. This speeds up analytics queries:

  • similar data (same column) is co-located, so it compresses better
  • storing data in column batches means the query engine can skip over columns that aren't required by the query at all
  • enables query vectorization: processing batches of values at once.

...Immutable. After Seafowl writes out a partition file, it never changes it. This doesn't mean that you can't write to Seafowl tables. Updating a table or inserting into it means Seafowl creates new partitions (possibly with all the data from the previous partition apart from the updated/deleted rows) and a new table version. It then links related existing and new partitions to the newly created table version.

This is done because the expectation is that the user will write to the table rarely and in large batches (through uploads, CREATE TABLE AS statements, bulk INSERTs or using UPDATE/DELETEs to modify existing rows). In addition, immutability means that once a partition has been downloaded from object storage to a Seafowl instance's cache, it doesn't need to worry about that partition changing.

Table versions

Seafowl tables are versioned. When you write to a Seafowl table, Seafowl actually creates a new table version and repoints the table to that latest version. This can be used for time travel queries or recovering older versions of your data.

Seafowl uses table versioning to provide fast cache invalidation. When you put Seafowl behind a CDN or an HTTP cache, the cached HTTP API uses the versions of all tables in a certain query as an input to its ETag computation. When you update the table, its version changes, changing the ETag and forcing caches to ask Seafowl to re-execute queries.

This all happens automatically and means that Seafowl query results are cached in your user's browser, by HTTP caches or by CDNs and can be quickly checked for freshness and updated if required.

Catalog

The catalog is a classical OLTP database (SQLite or PostgreSQL) where Seafowl, among other things, stores table versions, dropped tables pending for lazy physical deletion, function definitions and so on.

It is vaguely similar to the Hive Metastore.

What happens...

...on INSERT / CREATE TABLE AS?

When you run a query like:

INSERT INTO some_table VALUES (1, 'one'), (2, 'two')

or:

CREATE TABLE some_table AS (
  SELECT dimension_1, SUM(column_1) AS measure_1
  FROM some_other_table
  GROUP BY 1
)

Seafowl executes the input to that query (e.g. the VALUES clause or the subquery in CREATE TABLE AS) and persists the output as one or more partition files into the Delta table directory. Finally, it commits the ongoing changes for a new table version and stores it in the catalog.

In the case of an INSERT, the new version is going to inherit the partitions from the previous version of the table, together with the new partitions. In the case of a CREATE TABLE, this is going to be the first version of a brand new table.

...on UPDATE?

UPDATE some_table
SET some_val = some_val + 1
WHERE some_key = 'to_update'

While Seafowl isn't optimized for frequent updates, it still supports them. In this case, it constructs a plan that:

  • prunes away partitions that are refuted by the WHERE qualifier
  • applies the provided updates to each row of the remaining partitions, thus creating new partitions

It then executes this plan and persists the newly created partitions to the object store, while re-using the partitions that were pruned away as is.

Finally, it commits a new table version out of the resulting added/removed partitions. Note that an update of one row will still result in a table version that only differs from its parent by one partition addition/removal to the table state.

...on DELETE?

DELETE some_table
WHERE some_key = 'to_delete'

This is similar to the UPDATE case, in that it also employs partition pruning to scope down the affected partitions only to those targeted by the WHERE qualifier. The difference is that the plan then simply filters out the exact rows in the affected partitions that match that delete condition. This in turn leads to the creation of new substitute partitions.

The unchanged and the newly created partitions make it into the new table version.

...on DROP?

DROP some_table

All table versions for the specified table are deleted. However, Seafowl defers the deletion of the underlying table directory from the object store for later. To clean up the dropped table physical remnants, you need to run VACUUM DATABASE db_name (unless a background job is set up).