Partition pruning
Partition pruning is a common technique in analytical query engines: by consulting a query's filters and some cached metadata about each partition, the engine can avoid reading partitions that can't contain matching rows.
Example
Let's say we have a table of widgets sold in each country by year and month, partitioned as follows:
Partition 1
Year | Month | Country | Widget | Sales |
---|---|---|---|---|
2020 | 01 | US | A | 12 |
2020 | 01 | UK | A | 34 |
2020 | 01 | DE | B | 2 |
Partition 2
Year | Month | Country | Widget | Sales |
---|---|---|---|---|
2020 | 02 | US | A | 15 |
2020 | 02 | US | B | 23 |
2020 | 03 | US | B | 42 |
Partition 3
Year | Month | Country | Widget | Sales |
---|---|---|---|---|
2020 | 03 | US | B | 11 |
2020 | 03 | DE | B | 73 |
In this case, when Seafowl writes out a partition, it also stores the min-max values of every column in its catalog:
- Partition 1:
- Year: min/max 2020
- Month: min/max 01
- Country: min DE, max US
- Widget: min A, max B
- Sales: min 2, max 34
- Partition 2:
- Year: min/max 2020
- Month: min 02, max 03
- Country: min US, max US
- Widget: min A, max B
- Sales: min 15, max 42
- Partition 3:
- Year: min/max 2020
- Month: min/max 03
- Country: min DE, max US
- Widget: min B, max B
- Sales: min 11, max 73
Let's say the user tries to run this query:
SELECT country, widget, SUM(sales)
FROM widgets
WHERE year = '2020' AND month = '02'
GROUP BY 1, 2
Because month=02 doesn't fall within any other partition's range, we know we only need to read the second partition. This avoids the need to even download the other two partitions from the object store.
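To make this concrete, here's a minimal sketch (in Python, not Seafowl's actual implementation) of how the min/max values above can be checked against the query's equality filters:

```python
# A minimal sketch of min/max partition pruning, using the statistics listed above.
# Illustration only, not Seafowl's actual code.

# Per-partition (min, max) ranges for the columns the query filters on.
partition_stats = {
    1: {"year": ("2020", "2020"), "month": ("01", "01")},
    2: {"year": ("2020", "2020"), "month": ("02", "03")},
    3: {"year": ("2020", "2020"), "month": ("03", "03")},
}

def may_contain(stats, column, value):
    """True if the partition might contain rows where column == value."""
    lo, hi = stats[column]
    return lo <= value <= hi

# WHERE year = '2020' AND month = '02'
filters = {"year": "2020", "month": "02"}

to_scan = [
    pid
    for pid, stats in partition_stats.items()
    if all(may_contain(stats, col, val) for col, val in filters.items())
]
print(to_scan)  # [2] -- only Partition 2 needs to be downloaded
```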
Taking advantage of partition pruning
Seafowl stores rows in the order of insertion. To maximize the efficiency of partition pruning, you should order the data in your table by keys that you're going to be filtering on most often.
If you're periodically inserting data and filtering on its timestamp, this will take care of itself. You can also use CREATE TABLE AS to reorder the data in a table:
CREATE TABLE staging AS (SELECT * FROM my_table ORDER BY key_1 ASC, key_2 DESC);
DROP TABLE my_table;
ALTER TABLE staging RENAME TO my_table;
Exact and inexact pruning
DataFusion also has the concept of exact and inexact pruning.
Exact pruning for a given filter means that either:
- ALL values in a partition match it. DataFusion will scan through that partition but won't rerun the filter on the returned data.
- NO values in a partition match it. DataFusion won't scan through that partition at all.
Inexact pruning for a given filter means that either:
- SOME values in a partition match it. DataFusion will scan through that partition and rerun the filter on the returned data.
- NO values in a partition match that filter. DataFusion won't scan through that partition at all.
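To illustrate the distinction, here's a small sketch of how an equality filter could be classified against a partition's min/max range. It only illustrates the idea and isn't DataFusion's actual API:

```python
# Classifying `column = value` pruning from a [col_min, col_max] range.
# A sketch of the idea behind exact vs. inexact pruning, not DataFusion's API.

def classify(col_min, col_max, value):
    if value < col_min or value > col_max:
        return "skip"                 # exact: no row can match, don't scan
    if col_min == col_max == value:
        return "scan, keep all rows"  # exact: every row matches, no re-filtering
    return "scan, re-apply filter"    # inexact: some rows may match

# Month ranges from the example above, filter month = '02':
print(classify("01", "01", "02"))  # Partition 1 -> skip
print(classify("02", "03", "02"))  # Partition 2 -> scan, re-apply filter
print(classify("03", "03", "02"))  # Partition 3 -> skip
```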
Seafowl currently doesn't record metadata that would let exact pruning work. In the above example, you could imagine partitioning the data so that Partition 2 only contains rows with year=2020, month=02, and recording and maintaining these kinds of constraints in the catalog.
When pruning doesn't help
In the example above, note that partition 3 spans countries DE to US. While that partition doesn't have data with country=UK, Seafowl's partition pruning will still scan through it, since lexicographically, DE < UK < US.
There are some ways to fix this, for example by using a Bloom filter to quickly determine that a partition definitely doesn't contain a certain value.
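As a rough sketch of that idea (hypothetical; Seafowl doesn't currently do this), a small Bloom filter per partition and column can answer "definitely not present" or "possibly present", which would let the engine skip partition 3 for country=UK even though UK falls inside its DE..US range:

```python
# A minimal Bloom filter sketch (hypothetical, not part of Seafowl).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions from the value using seeded hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

# Build a filter over Partition 3's country values.
partition_3_countries = BloomFilter()
for country in ["US", "DE"]:
    partition_3_countries.add(country)

print(partition_3_countries.might_contain("US"))  # True: must scan
print(partition_3_countries.might_contain("UK"))  # False (barring a rare false positive): skip
```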
Aside: discussion on the economics of object stores
The pricing of object stores like AWS S3 has multiple components:
- Actual storage (priced as $ per GB per month)
- Requests (priced as $ per request)
- Egress (data leaving the region it's stored in, priced as $ per GB).
Some of these components are tiered (e.g. the first 10GB of storage per month might be free) or have multiple classes (e.g. write requests might be more expensive than reads or transfer to a different region with the same cloud provider might be cheaper than transfer to the Internet), but this is a standard structure.
How query engines optimize data downloads
For the "big three" object stores (S3/Azure/GCP), the price is dominated by egress fees, unless the data never leaves the cloud provider region that the application runs in. This encourages application developers to use the provider's compute as well.
In the context of a query engine downloading some Parquet files to satisfy a query, it makes sense to read as little data as possible from the object store to minimize transfer costs (both time- and money- wise). Query engines do this via a combination of:
- Caching: storing partitions on a compute node's local disk in order to decrease latency and not have to re-download the file on every query.
- Partition pruning: storing metadata that lets the query engine determine whether it needs to scan through any part of the file at all, without downloading the actual file.
- Row group pruning/individual column reads: using row group statistics in an individual Parquet file to skip over irrelevant blocks, as well as only downloading certain columns in a row group. This can be done by using byte range fetches (sketched below).
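As a sketch of the last point, a byte range fetch is just an HTTP GET with a Range header, which S3-compatible object stores support. The URL and byte offsets below are made up for illustration:

```python
# Hypothetical example: fetch a single ~1 MB column chunk out of a large Parquet
# file using an HTTP Range request instead of downloading the whole object.
import requests

url = "https://example-bucket.s3.amazonaws.com/my_table/partition-0002.parquet"

# Request only bytes 1048576..2097151 (offsets invented for the example).
resp = requests.get(url, headers={"Range": "bytes=1048576-2097151"})
assert resp.status_code == 206  # 206 Partial Content: the store honoured the range
column_chunk = resp.content     # ~1 MB instead of the whole file
print(len(column_chunk))
```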
No egress fees at all?
However, what happens when a new entrant like Cloudflare R2 doesn't charge for egress at all? This has some interesting implications for applications that can now be deployed on any compute vendor and in any region without costly egress fees.
Consider a 1GB Parquet file with 8 columns (of the same type) and 16 row groups. In a single row group, each column will take 1024 / 8 / 16 = 8MB. Assume we need to scan just the data for a single column over a link with a bandwidth of 128MB/s. We can either:
- download the whole 1GB Parquet file: 1 request; 1024 / 128 = 8 seconds
- use Range requests to download the column values for each of the 16 row groups: 16 requests; 16 x 8 / 128 = 1 second
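For reference, here's the same arithmetic as a small sketch, assuming the 1 GB file, 8 columns, 16 row groups and 128 MB/s figures above:

```python
# Whole-file download vs. per-row-group range reads, under the assumptions above.
file_mb = 1024
columns, row_groups = 8, 16
bandwidth_mb_s = 128

chunk_mb = file_mb / columns / row_groups  # 8 MB per column chunk

whole_file = {"requests": 1, "seconds": file_mb / bandwidth_mb_s}
range_reads = {"requests": row_groups,
               "seconds": row_groups * chunk_mb / bandwidth_mb_s}

print(whole_file)   # {'requests': 1, 'seconds': 8.0}
print(range_reads)  # {'requests': 16, 'seconds': 1.0}
```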
The second case is 8 times faster but 16 times more expensive in request charges. To take advantage of this different cost structure, we might need to rethink:
- how tables are partitioned into Parquet files (bigger partitions can be better)
- how tables are partitioned inside of Parquet files (files with a single row group are better, since that lets the engine read a single column with a single byte range fetch)
- how caching files on the compute node works (it might make sense to download and cache a whole Parquet file instead of a small region)
- how partition pruning works (we could use a byte range fetch to read a Parquet file footer and get away without storing partition ranges in the catalog, apart from latency issues, but now these small reads will form a large part of our costs)
Though note that doing nothing and proceeding as if we were using a big-three object store is still going to be much cheaper due to the lack of egress fees.