TPC-H Query 11 - Diving into Statistics
Things have been busy since joining Floe. I highly recommend you check out our blog there to see what we are up to.
I am not giving up on my TPC-H series! Today we are half-way through the workload and have arrived at Query 11. If I am succeeding in what I set out to do, regular readers now have a good grasp on:
- The large, often orders of magnitude, impact of good Query Optimisation.
- How to come up with good query plans manually and validate those made by machines.
- A basic grasp of statistics and what they mean for query planners.
Today's query is pretty simple. Your new skills will let you find the optimal query plan easily.
I am going to take this chance to talk about statistics and how they relate to Query 11. We will also be talking more about bloom filters and what they can do for your analytical workload.
Query 11
The query deals with a single nation via the n_name filter.
Interestingly, the subquery is not correlated with the outer query - which makes things
a lot simpler.
SELECT ps_partkey,
SUM(ps_supplycost * ps_availqty) AS v
FROM tpch.partsupp
INNER JOIN tpch.supplier
ON ps_suppkey = s_suppkey
INNER JOIN tpch.nation
ON s_nationkey = n_nationkey
WHERE n_name = 'JAPAN'
GROUP BY ps_partkey
HAVING SUM(ps_supplycost * ps_availqty)
> (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001
FROM tpch.partsupp
INNER JOIN tpch.supplier
ON ps_suppkey = s_suppkey
INNER JOIN tpch.nation
ON s_nationkey = n_nationkey
WHERE n_name = 'JAPAN')
ORDER BY v DESC
Notice that the subquery in the HAVING clause is touching exactly the same rows as the outer query.
Filter Selectivity
Two filters are present in the query.
The selectivity of these filters depends somewhat on data generation and the random pick of n_name.
| Filter | Selectivity | Cardinality |
|---|---|---|
| n_name = 'JAPAN' | 4% | 1 |
| SUM(...) > (SELECT SUM(...) * 0.0001) | ~4% | 1196 |
The last filter can only be applied after we group, so the only filter we are concerned with is the
one on n_name.
Use your already acquired skills to find the optimal query plan before reading on.
Optimal Join Order
For once, lineitem is not present in the query - the largest table is instead partsupp.
As per our heuristics, we let that table be the driving table in the tree (the one that probes into the others).
Reaching n_name can only be done by joining via supplier.
We are faced with a choice between two possible plans:
- The left deep: partsupp ⨝ supplier ⨝ nation
- The bushy: partsupp ⨝ (supplier ⨝ nation)
There are 10000 rows in supplier, and we know from statistics
that the filter on nation leaves only 1/25 of these.
That means that the hash table we build in the bushy join must contain roughly 10000/25 = 400
rows – which means the bushy join is almost certainly the better choice.
But how exactly do we know that without running the query?
Statistics and why they matter
Query optimisers rely on statistics sampled from data to make good decisions.
One of those statistics is the number of unique/distinct values in a column.
In the case of Query 11, statistics tell us that there are 25 different values of n_name.
We also know that there are 25 rows in the table nation.
That must then mean that if we filter on an exact value of n_name (as we do here)
then we can either get zero or one row from nation.
Most query optimisers make the assumption that if you ask for something, it is probably going to be there, so the estimate is 1 row.
All the engines in SQL Arena agree: the number of rows coming out of nation is
estimated as 1.
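If you want to see the estimate for yourself in PostgreSQL, a plain EXPLAIN will do (a sketch – the cost numbers differ per installation, so they are elided here):
EXPLAIN SELECT * FROM tpch.nation WHERE n_name = 'JAPAN';
--- expect a scan of nation with an estimated row count of 1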
PostgreSQL and SQL Server do even better: they know for a fact that the count is 1 row, not 0. How do SQL Server and PostgreSQL know that? They collect additional statistics about the data.
Both PostgreSQL and SQL Server sample a histogram of the values in each column.
In the case of PostgreSQL, we can query it like this:
SELECT n_distinct, cardinality(histogram_bounds) FROM pg_stats
WHERE tablename = 'nation'
AND attname = 'n_name'
--- returns:
--- n_distinct | cardinality
--- -----------+-------------
--- -1 | 25
The value -1 is another Postgres'ism - it just means: "All the values are unique".
We can inspect histogram_bounds directly to see what it contains.
We could even, if we felt like doing self harm, query the individual values with
something like this:
SELECT v.*
FROM pg_stats
CROSS JOIN LATERAL unnest(histogram_bounds::TEXT::TEXT[]) AS v
WHERE tablename = 'nation'
AND attname = 'n_name'
Yeah... ugly...
histogram_bounds is of anyarray type - so the above seems to be the
way to unnest it.
Thanks to Yurii Rashkovskii for showing me that you can raw cast the array to text
and then cast it back to an array.
Guessing cardinality of (supplier ⨝ nation) with MCV
Now that we know the filter on nation returns 1 row, good statistics
will tell us with near certainty that the reduction in rows (to 1/25)
applies to supplier too.
How do we know? Let us use PostgreSQL as a case study.
PostgreSQL collects something called "Most Common Value" (MCV) for each column.
By default, it will collect the 100 most common values in the column - this is configurable.
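For reference, the knob looks like this in PostgreSQL (a hedged example – the 200 is an arbitrary target, the default is 100):
ALTER TABLE tpch.supplier ALTER COLUMN s_nationkey SET STATISTICS 200;
ANALYZE tpch.supplier; --- re-sample the column with the larger target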
Since the cardinality of nation is only 25, the most common values of s_nationkey
completely cover the range of values in n_nationkey.
We can list those values with something like this:
SELECT MIN(freq::REAL), MAX(freq::REAL), COUNT(*)
FROM pg_stats
CROSS JOIN LATERAL unnest(most_common_freqs::TEXT::TEXT[]) AS f(freq)
WHERE tablename = 'supplier'
AND attname = 's_nationkey';
This gives us the exact frequency range for s_nationkey values:
min max count
0.0362 0.0438 25
We also know that there is a primary / foreign key relationship between supplier and nation.
Thus, we (and the optimiser) can reason this way:
- We pick exactly one row in nation
- That row will have an n_nationkey that is also in s_nationkey (because there are 25 distinct values in both tables)
- The selectivity of that key is in the range [3.6%; 4.4%]
- The result of supplier ⨝ nation must be around 400 rows - this is a hard boundary (see the sketch after this list)!
- The bushy join plan is therefore best and the hash table will be very small
- We can know almost exactly how much memory to allocate for the hash table of (supplier ⨝ nation)
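Here is a hedged sketch of that calculation as a query - it simply multiplies the MCV frequencies from above by the 10000 rows in supplier to bound the size of supplier ⨝ nation:
SELECT (MIN(freq::REAL) * 10000)::INT AS low_estimate,
       (MAX(freq::REAL) * 10000)::INT AS high_estimate
FROM pg_stats
CROSS JOIN LATERAL unnest(most_common_freqs::TEXT::TEXT[]) AS f(freq)
WHERE tablename = 'supplier'
  AND attname = 's_nationkey';
--- with the frequencies above, this lands at roughly 362 and 438 rows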
All this with some simple statistics in our metadata.
PostgreSQL estimation example
Here is PostgreSQL hitting a home run:
Estimate Actual Operator
- ∞ SEQUENCE
1 1 ├─GROUP BY SIMPLE AGGREGATE SUM(ps_supplycost * ps_availqty) * 0.0001
3 3 │ GROUP BY SIMPLE AGGREGATE PARTIALSUM(ps_supplycost * ps_availqty)
13333 10053 │ INNER JOIN HASH ON ps_suppkey = s_suppkey
400 377 │ │└INNER JOIN HASH ON s_nationkey = n_nationkey <--- GOOD!
1 1 │ │ │└TABLE SCAN nation_1 WHERE n_name = 'JAPAN'
10000 10000 │ │ TABLE SCAN supplier_1
999999 800000 │ TABLE SCAN partsupp_1
.... more...
SQL Server and DuckDB get the estimate right the same way.
A note on PostgreSQL plans: I have updated dbprove so it understands init plans (an odd PostgreSQL artefact we will talk about in a second).
ClickHouse - The usual story
Here is ClickHouse failing to make the right plan with its poor estimator:
Actual Operator
32493 PROJECT SUM(ps_supplycost * ps_availqty)
32493 SORT SUM(ps_supplycost * ps_availqty)
32493 FILTER SUM(ps_supplycost * ps_availqty) > 8654127.262542
32493 GROUP BY HASH ps_partkey AGGREGATE SUM(ps_supplycost * ps_availqty)
34557 PROJECT ps_supplycost * ps_availqty
34557 INNER JOIN HASH ON s_nationkey = n_nationkey
1 │└TABLE SCAN nation WHERE n_name = 'JAPAN'
799921 INNER JOIN HASH ON ps_suppkey = s_suppkey
10000 │└TABLE SCAN supplier
800000 TABLE SCAN partsupp
Notice how the bushy join isn't found.
Not the most interesting query to plan - but it is a great query for illustrating something else.
Bloom Filters
We know that the optimal plan is to bushy join (supplier ⨝ nation).
We also know that this reduces the amount of rows to only 4% of the original table.
Could we perhaps use this information to also reduce the scan of partsupp - the largest
table in the query?
The answer is yes, and the trick is called bloom filters. My colleague Makar Ivashko just wrote a blog post about those filters, so go read that if you don't know what a bloom filter is. Even if you do know what those filters are, the post shows some very cool optimisations you can pull off when you know as much about CPU pipelines as Makar does.
... Welcome back – now you know how a bloom filter works. How can it help us in our query?
We know that only 4% of all suppliers match the filter on n_name.
That also means that only 4% of partsupp match those suppliers.
In our query, we join to establish that – a total of 800K rows joined in our little 1GB
dataset.
But it turns out that we don't really need to join all those rows if we use bloom filters.
What if we did this:
- Build a bloom filter over s_suppkey after we have taken the filter on n_name
- Use that bloom filter to filter out rows in partsupp that don't match those suppliers
There are only around 400 suppliers that match the filter on n_name.
That means that even a very small bloom filter can save us a lot of work.
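Semantically, the bloom filter gives us a cheap, approximate version of this semi-join (a sketch of the idea - the real thing is a probabilistic filter applied during the scan of partsupp, not an extra join):
SELECT ps_partkey, ps_supplycost, ps_availqty
FROM tpch.partsupp
WHERE ps_suppkey IN (SELECT s_suppkey
                     FROM tpch.supplier
                     INNER JOIN tpch.nation
                     ON s_nationkey = n_nationkey
                     WHERE n_name = 'JAPAN');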
Neither PostgreSQL, SQL Server (in row mode) nor DuckDB uses bloom filters to support this query. DuckDB has experimental support for them but, as of the time of writing, they are not in mainline DuckDB.
Floe makes heavy use of bloom filters for query execution. We don't have our engine fully running at beta level yet (follow us to keep track of that). But we do have prototypes of the bloom filters, so I can tell you what the plan currently looks like.
It looks like this (the bloom can also be used on supplier, so you get a reduction there too):
10053 INNER JOIN HASH ON ps_suppkey = s_suppkey
377 │└INNER JOIN HASH ON s_nationkey = n_nationkey
1 │ │└TABLE SCAN nation_1 WHERE n_name = 'JAPAN'
399 │ TABLE SCAN supplier <--- Bloom happens here from nation
58281 TABLE SCAN partsupp <--- And here from supplier ⨝ nation
That's a nice reduction compared to:
10053 INNER JOIN HASH ON ps_suppkey = s_suppkey
377 │└INNER JOIN HASH ON s_nationkey = n_nationkey
1 │ │└TABLE SCAN nation WHERE n_name = 'JAPAN'
10000 │ TABLE SCAN supplier
800000 TABLE SCAN partsupp
Detour: Init plans and canonical forms
SQL Arena will be updated with shiny new Postgres 18 plans soon.
While working on that, I realised I had to expand my EXPLAIN tool to understand init plans.
What are they?
Consider this part of Query 11:
GROUP BY ps_partkey
HAVING SUM(ps_supplycost * ps_availqty)
--- Subquery here
> (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001
FROM tpch.partsupp
... etc...
This subquery returns a constant, scalar value. It is not correlated with the outer query - it is a constant value for the duration of the outer query execution.
For cases like this, PostgreSQL uses a concept called "init plans". The idea is that a single query is split into a series of queries. The smaller queries are called "init plans" and together they create the full query plan.
It looks like this in PostgreSQL EXPLAIN, and it is awful because your execution engine now needs to understand this concept:
Sort Key: (sum((partsupp.ps_supplycost * (partsupp.ps_availqty)::numeric))) DESC
InitPlan 1 <--- That's the init plan
-> Finalize Aggregate (cost=23727.91..23727.92 rows=1 width=32)
-> Gather (cost=23727.69..23727.90 rows=2 width=32)
What's the alternative?
You could just express the subquery as a simple cross join, like DuckDB and SQL Server do:
Operator
v--- Just a join, no need for init plans
INNER JOIN LOOP ON CAST(SUM(ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))) AS DECIMAL(38,6)) > SUBQUERY
│└GROUP BY SIMPLE AGGREGATE "first" #0, count_star()
│ PROJECT SUM(ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))) * 0.0001
│ GROUP BY SIMPLE AGGREGATE SUM(#0)
│ PROJECT ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))
│ INNER JOIN HASH ON ps_suppkey = s_suppkey
│ │└INNER JOIN HASH ON s_nationkey = n_nationkey
│ │ │└TABLE SCAN nation WHERE n_name = 'JAPAN'
│ │ TABLE SCAN supplier
│ TABLE SCAN partsupp WHERE ps_suppkey >= 1
│ v--- The thing that PostgreSQL turns into an init plan
GROUP BY HASH #0 AGGREGATE SUM(#1)
PROJECT ps_partkey, ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))
INNER JOIN HASH ON ps_suppkey = s_suppkey
│└INNER JOIN HASH ON s_nationkey = n_nationkey
│ │└TABLE SCAN nation WHERE n_name = 'JAPAN'
│ TABLE SCAN supplier
TABLE SCAN partsupp WHERE ps_suppkey >= 1
Notice how the engine understands that the HAVING subquery can be replaced
with a join to a single row.
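Written out as SQL, this "pure join" shape corresponds roughly to the following (a hand-written sketch of the plan above, not something you would normally type):
SELECT g.ps_partkey, g.v
FROM (SELECT ps_partkey,
             SUM(ps_supplycost * ps_availqty) AS v
      FROM tpch.partsupp
      INNER JOIN tpch.supplier ON ps_suppkey = s_suppkey
      INNER JOIN tpch.nation ON s_nationkey = n_nationkey
      WHERE n_name = 'JAPAN'
      GROUP BY ps_partkey) AS g
CROSS JOIN (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001 AS threshold
            FROM tpch.partsupp
            INNER JOIN tpch.supplier ON ps_suppkey = s_suppkey
            INNER JOIN tpch.nation ON s_nationkey = n_nationkey
            WHERE n_name = 'JAPAN') AS t
WHERE g.v > t.threshold
ORDER BY g.v DESC;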
Of course, for that to be effective, such joins need to be treated specially by the
execution engine of the database.
But by bringing the query into this "pure join" form you greatly reduce the amount
of special cases you need to deal with in the execution engine.
The advantage is that every query is just that: one query!
This idea of turning complex queries into simpler forms is called "canonicalisation" and it greatly simplifies engines.
You can read more examples of canonical forms in this blog post:
Summary
Today, I wandered off the path a bit to talk about how optimisers and execution engines actually do the work they do.
You learned:
- What types of statistics you can collect and how to use them to reason about queries
- How to use bloom filters to reduce the amount of work done in a query
- How bringing a query into a canonical form can simplify the database
This prepares us for the next post about Query 12, where we will talk about aggressively filtering large tables.
Have a productive week, and see you all soon.