Time travel queries
Modification of existing data is a core feature of Seafowl, and it effectively reduces to adding new and/or potentially replacing old partitions, followed by creating a new table version. Each table version is nothing more than an entry in our metadata store, linking it to the underlying partitions.
Any partition not touched by a write query ends up being a shared component
between the latest and previous version(s). We try to optimise for partition
re-use across different table versions: for
INSERTs we simply create new
partitions from the provided data; in the case of
DELETE only the
partitions targeted by the statement qualifier (i.e. the
WHERE clause, if any)
are replaced by new ones.
Querying older table versions
By default, Seafowl interprets a plain table reference in a SQL statement as pointing to the latest version for that table. However, given that all (meta)data for previous table versions is being tracked, there is also an option of referencing older table versions. In other words, one can query historical data (AKA table time travel), and do so by using the table function syntax with a single argument
SELECT * FROM my_table('2022-10-25T09:30:55+00:00')
The provided timestamp tells Seafowl to plan and execute the query using the
table version that was latest at the specified point in time. The time formats
accepted include RFC 3339 (as above), RFC 2822 and a frequently used
YY-MM-DD HH:MM:SS Z format.
Table version bookkeeping
There are no limitations on the table time-travel queries—one can mix and match
multiple table references with no version specifiers, or with ones using any of
the supported timestamp formats in a single
SELECT statement—apart from one.
Namely, Seafowl does not allow referencing table versions prior to the oldest
recorded one, and will instead error out with the message
No recorded table versions for the provided timestamp.
For this reason, but also as a more general tool aimed as a catalog and aide in
version selection, the
schema provides the
table_versions table, that can be queried to determine the
exact times when each new version was created, starting with the oldest one.
Finally, if one ends up with too many table versions and/or deems them obsolete,
they can easily be cleaned up using the
VACUUM command. In particular,
VACUUM TABLE my_table will purge all table versions except the last one, and
delete any resulting orphan partitions.
Assume we have just created a table
CREATE TABLE my_table (number INT, parity VARCHAR)
and we go on to add some data
INSERT INTO my_table VALUES (1, 'odd'), (2, 'even'), (3, 'odd'), (4, 'even')
This will result in the creation of two table versions, the initial (empty) one, and the one resulting from the insert
SELECT * FROM system.table_versions; table_schema | table_name | table_version_id | creation_time --------------+------------+------------------+--------------------- public | my_table | 1 | 2022-10-25 11:21:16 public | my_table | 2 | 2022-10-25 11:21:36 (2 rows)
Running another DML statement
DELETE FROM my_table WHERE parity = 'odd'
will result in yet another table version being created
SELECT * FROM system.table_versions; table_schema | table_name | table_version_id | creation_time --------------+------------+------------------+--------------------- public | my_table | 1 | 2022-10-25 11:21:16 public | my_table | 2 | 2022-10-25 11:21:36 public | my_table | 3 | 2022-10-25 11:22:48 (3 rows)
Querying the latest table version (3) gives us the expected results
SELECT * FROM my_table; number | parity --------+-------- 2 | even 4 | even (2 rows)
However, we can also go back and query the table as it was just prior to the
DELETE statement (2) using the time travel syntax
SELECT * FROM my_table('2022-10-25 11:22:00 +00:00'); number | parity --------+-------- 1 | odd 2 | even 3 | odd 4 | even (4 rows)
- Can't be used in DML statement subqueries.
- Can't use SQL expressions to specify the timestamp (e.g.
now() - INTERVAL '5 hours').