TPC series - TPC-H Query 11 - X
Happy 2026 to everyone. It is good to be back writing - this time about Query 11 from TPC-H. We are now halfway through the workload and, if I am doing what I set out to do, you now have a good grasp of:
- The large, often orders of magnitude, impact of good Query Optimisation.
- How to come up with good query plans manually and validate those made by machines.
- Statistics and what they mean for query planners.
Query 11
The query deals with a single nation via the n_name filter.
Interestingly, the subquery is not correlated with the outer query.
SELECT ps_partkey,
       SUM(ps_supplycost * ps_availqty) AS v
FROM tpch.partsupp
INNER JOIN tpch.supplier
        ON ps_suppkey = s_suppkey
INNER JOIN tpch.nation
        ON s_nationkey = n_nationkey
WHERE n_name = 'JAPAN'
GROUP BY ps_partkey
HAVING SUM(ps_supplycost * ps_availqty)
       > (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001
          FROM tpch.partsupp
          INNER JOIN tpch.supplier
                  ON ps_suppkey = s_suppkey
          INNER JOIN tpch.nation
                  ON s_nationkey = n_nationkey
          WHERE n_name = 'JAPAN')
ORDER BY v DESC
Notice that the subquery in the HAVING clause is touching exactly the same rows as the outer query.
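One way to see (and exploit) this by hand is to compute the threshold with a window function over the grouped result, so the joined rows are produced only once. A minimal sketch - window functions over aggregates are accepted by PostgreSQL, SQL Server and DuckDB, though whether an optimiser performs this rewrite on its own is a separate question:

-- Sketch: join and group once, derive the global threshold from the
-- grouped totals with a window function.
SELECT ps_partkey, v
FROM (
    SELECT ps_partkey,
           SUM(ps_supplycost * ps_availqty)              AS v,
           SUM(SUM(ps_supplycost * ps_availqty)) OVER () AS total
    FROM tpch.partsupp
    INNER JOIN tpch.supplier
            ON ps_suppkey = s_suppkey
    INNER JOIN tpch.nation
            ON s_nationkey = n_nationkey
    WHERE n_name = 'JAPAN'
    GROUP BY ps_partkey
) AS grouped
WHERE v > total * 0.0001
ORDER BY v DESC;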
Filter Selectivity
Two filters are present in the query. Their selectivity depends somewhat on data generation and on the random pick of n_name.
There is also an interesting trap here. The data in TPC-H is made in such a way that
every combination of l_partkey and l_suppkey in lineitem has a corresponding match in partsupp.
But how can the optimiser know?
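A quick way to convince yourself of this on a generated data set is an anti-join - a sketch, which should return zero:

-- Sketch: every (l_partkey, l_suppkey) pair in lineitem should find a
-- matching row in partsupp, so no unmatched pairs are expected.
SELECT COUNT(*) AS unmatched_pairs
FROM tpch.lineitem
LEFT JOIN tpch.partsupp
       ON ps_partkey = l_partkey
      AND ps_suppkey = l_suppkey
WHERE ps_partkey IS NULL;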
And the selectivities are:
| Filter | Selectivity | Cardinality |
|---|---|---|
| n_name = 'JAPAN' | 4% | 1 |
| SUM(...) > (SELECT SUM(...) * 0.0001) | 4% | 1196 |
The last filter can only be applied after we group, so the only filter we are concerned with is the one on n_name.
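To see how that single filter cascades through the key relationships, here is a sketch of the row counts it leaves behind (the exact numbers depend on the scale factor and on which nation the generator picked):

-- Sketch: 1 of 25 nations (4%) keeps roughly 4% of supplier and,
-- through s_suppkey, roughly 4% of partsupp.
SELECT
    (SELECT COUNT(*) FROM tpch.nation
      WHERE n_name = 'JAPAN') AS nations_kept,
    (SELECT COUNT(*) FROM tpch.supplier
      JOIN tpch.nation ON s_nationkey = n_nationkey
      WHERE n_name = 'JAPAN') AS suppliers_kept,
    (SELECT COUNT(*) FROM tpch.partsupp
      JOIN tpch.supplier ON ps_suppkey = s_suppkey
      JOIN tpch.nation ON s_nationkey = n_nationkey
      WHERE n_name = 'JAPAN') AS partsupp_rows_kept;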
Optimal Join Order
There is only one filter worth taking in this query: p_name LIKE '%lace%'.
As always, lineitem is the largest stream of data - so we want to apply the p_name filter to it as early as possible.
By transitive closure of the keys in the data model, we know that a filter on ps_partkey is also a filter on l_partkey.
In other words: The filter on p_name also reduces partsupp.
Without any knowledge of statistics, it should be clear that the initial join order must be:
part -> partsupp (taking the reduction) -> lineitem
Since partsupp is much smaller than lineitem (by about an order of magnitude) it makes sense that any hash strategy
should construct a hash table on the result of part JOIN partsupp and then join that to lineitem.
That is indeed what SQL Server and PostgreSQL do.
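If you want to check this order by hand in PostgreSQL, one option is to stop the planner from reordering and write the joins in the order you intend - a sketch, where join_collapse_limit is PostgreSQL's setting for taking the join order as written, COUNT(*) stands in for the real select list and '%lace%' is the substitution parameter from the plans below:

SET join_collapse_limit = 1;   -- take the join order as written

EXPLAIN (ANALYZE)
SELECT COUNT(*)
FROM tpch.part
JOIN tpch.partsupp ON ps_partkey = p_partkey    -- reduction from part flows into partsupp
JOIN tpch.lineitem ON l_partkey = ps_partkey
                  AND l_suppkey = ps_suppkey    -- ...and from there into lineitem
WHERE p_name LIKE '%lace%';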
But note that if we believe that the keys in lineitem only point to a fraction of partsupp, we might just decide
to join lineitem with partsupp first.
Or worse... if you are ClickHouse...
ClickHouse once again embarrasses itself
Here is the plan that ClickHouse makes:
Estimate Actual Operator
- 175 PROJECT SUM(amount)
- 175 SORT nation, o_year
- 175 GROUP BY HASH nation, o_year AGGREGATE SUM(amount)
- 295609 PROJECT EXTRACT(YEAR FROM o_orderdate), ps_supplycost * l_quantity, n_name, 1 - l_discount, EXTRACT(YEAR FROM o_orderdate), l_extendedprice * (1 - l_discount), (l_extendedprice * (1 - l_discount)) - (ps_supplycost * l_quantity), (l_extendedprice * (1 - l_discount)) - (ps_supplycost * l_quantity)
- 295609 INNER JOIN HASH ON s_nationkey = n_nationkey
- 25 │└TABLE SCAN nation
- 295609 INNER JOIN HASH ON l_suppkey = s_suppkey
- 10000 │└TABLE SCAN supplier
- 295630 INNER JOIN HASH ON l_partkey = p_partkey
- 10886 │└TABLE SCAN part WHERE p_name LIKE '%lace%'
- 5419720 INNER JOIN HASH ON (l_suppkey,l_partkey) = (ps_suppkey,ps_partkey)
- 800000 │└TABLE SCAN partsupp
- 5419720 INNER JOIN HASH ON l_orderkey = o_orderkey
- 1500000 │└TABLE SCAN orders
- 5998820 TABLE SCAN lineitem
As the human query optimiser you have become, reflect for a moment on how terrible this plan is.
ClickHouse does not even bother picking up the reduction from the p_name filter until after it has done a series of joins.
In fact, does something stand out about this join tree for you?
If we follow the relationships between the tables in the data model, draw them as a tree and order the children by total table size, we get:
lineitem 6M
├orders 1.5M
├partsupp 800K
├part 200K
└supplier 10K
 └nation 25
Turn the above upside down, and you get the join order that ClickHouse picked.
Coincidence? Perhaps...
Continuing the join chain
After we have taken the filters, the rest of the join order does not matter as much.
Let's look at what tables we still need to join: orders, supplier and nation (via supplier).
orders First
All the databases with decent optimisers go after orders before the other two.
This makes sense, because we have a calculation we need to do: EXTRACT(YEAR FROM o_orderdate)
Since we have reduced lineitem via part, we know that a similar reduction has to occur in orders (because of the keys in the data model).
This means we should join orders next and get that calculation done. After that, we know there won't be any more row reductions - because from here on we are joining on FK/PK relationships without further filters.
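If you want to verify that on the data, here is a sketch - with o_orderkey being the primary key of orders, the join cannot drop lineitem rows, so the two counts should come out equal:

-- Sketch: equal counts mean every lineitem row found exactly one order.
SELECT COUNT(*)          AS lineitem_rows,
       COUNT(o_orderkey) AS rows_with_matching_order
FROM tpch.lineitem
LEFT JOIN tpch.orders ON o_orderkey = l_orderkey;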
supplier and nation
We can only reach nation by going after supplier first.
Whenever we see one of these "go via X to get to Y" it generally means there is an opportunity for a bushy join.
We could join nation with supplier and then join the result of that to lineitem.
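Spelled out by hand, that bushy branch could look like this - a sketch where MATERIALIZED is PostgreSQL syntax used only to make the intent explicit and COUNT(*) stands in for the real select list:

-- Sketch: build the tiny supplier x nation input first, then probe it
-- from lineitem.
WITH supplier_nation AS MATERIALIZED (
    SELECT s_suppkey, n_name
    FROM tpch.supplier
    JOIN tpch.nation ON s_nationkey = n_nationkey
)
SELECT COUNT(*)
FROM tpch.lineitem
JOIN supplier_nation ON l_suppkey = s_suppkey;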
Indeed, this is what DuckDB decides to do in this fragment:
9615 10000 │└INNER JOIN HASH ON s_nationkey = n_nationkey
25 25 │ │└TABLE SCAN nation
10000 10000 │ TABLE SCAN supplier
... Or we could be memory conservative at the expense of CPU time and do loop joins into supplier and then hash
join to nation (recall that nation is tiny - 25 rows).
This saves us the trouble of building a hash table on supplier - but it means we instead have to dig into indexes
with loop joins (which is a lot more expensive in CPU).
This is exactly what PostgreSQL does, which is probably not a surprise to you at this point:
148 245492 INNER JOIN HASH ON s_nationkey = n_nationkey
25 25 │└TABLE SCAN nation
148 245492 INNER JOIN LOOP ON s_suppkey = l_suppkey
148 245520 │└INNER JOIN LOOP ON o_orderkey = l_orderkey
148 245520 │ │└INNER JOIN LOOP ON l_partkey = ps_partkey <---- Notice the bad estimate here
46920 32792 │ │ │└INNER JOIN HASH ON ps_partkey = p_partkey
10100 10940 │ │ │ │└TABLE SCAN part WHERE p_name LIKE '%lace%'
774192 600000 │ │ │ TABLE SCAN partsupp
32792 229544 │ │ TABLE SEEK lineitem WHERE ps_suppkey = l_suppkey
245520 245520 │ TABLE SEEK orders
245520 245520 TABLE SEEK supplier
Notice that we still end up joining 245492 times to get the n_name column that we need for the final aggregate.
SQL Server will have none of this nonsense... Once again, it has a trick to play.
Aggregate before Join
Before we go into the details of SQL Server's optimisations, consider what its query plan looks like:
Estimate Actual Operator
175 175 PROJECT CASE WHEN globalagg1019 = 0 THEN NULL ELSE globalagg1021 END AS Expr1016
175 175 GROUP BY HASH , AGGREGATE SUM(partialagg1018) AS globalagg1019, SUM(partialagg1020) AS globalagg1021
175 175 SORT n_name, Expr1014
175 175 INNER JOIN MERGE ON n_nationkey = s_nationkey <---- Only 175 joins here!
25 25 │└TABLE SEEK nation
175 175 SORT s_nationkey v---- what is THIS?
175 175 GROUP BY HASH AGGREGATE COUNT(Expr1017 - ps_supplycost * l_quantity) AS partialagg1018, ...
34408 245491 INNER JOIN MERGE ON o_orderkey = l_orderkey
1500000 1499999 │└PROJECT EXTRACT(YEAR FROM o_orderdate) AS Expr1014
1500000 1499999 │ TABLE SEEK orders
SQL Server only does 175 joins to nation - as opposed to the 245K that both DuckDB and PostgreSQL do.
How does it pull this off? It reasons like this:
- I am eventually going to GROUP BY nation, o_year
  - nation is an alias for n_name and there are only 25 of those
  - o_year is an alias for EXTRACT(YEAR FROM o_orderdate) and I know there are only 7 of those (statistics again!)
- That must mean that in the worst case, there are only 25 * 7 = 175 rows possible in that GROUP BY
- There are also only 25 unique n_nationkey
- If I just group on n_nationkey first, then I can join to nation later and get the n_name (a hand-written version of this rewrite is sketched below)
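Written out by hand, the rewrite could look roughly like this - a sketch in PostgreSQL/DuckDB-flavoured SQL, where the '%lace%' parameter matches the plans above and pre_agg is just an illustrative alias:

-- Sketch: aggregate on the narrow (s_nationkey, o_year) key first,
-- then join the 25-row nation table to pick up n_name.
SELECT n_name AS nation, o_year, sum_profit
FROM (
    SELECT s_nationkey,
           EXTRACT(YEAR FROM o_orderdate) AS o_year,
           SUM(l_extendedprice * (1 - l_discount)
               - ps_supplycost * l_quantity) AS sum_profit
    FROM tpch.lineitem
    JOIN tpch.orders   ON o_orderkey = l_orderkey
    JOIN tpch.supplier ON s_suppkey = l_suppkey
    JOIN tpch.partsupp ON ps_suppkey = l_suppkey
                      AND ps_partkey = l_partkey
    JOIN tpch.part     ON p_partkey = l_partkey
    WHERE p_name LIKE '%lace%'
    GROUP BY s_nationkey, EXTRACT(YEAR FROM o_orderdate)
) AS pre_agg
JOIN tpch.nation ON n_nationkey = s_nationkey
ORDER BY nation, o_year DESC;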
This optimisation of aggregating before doing a join is relatively rare. It also seems to be very hard for the PostgreSQL community to wrap its head around this feature, which SQL Server has had since SQL Server 2005 (released in the year of its name).
This aggregate-before-join optimisation is often a powerful trick in analytical queries. TPC-H Query 9 checks whether it is present and rewards SQL Server with the lowest join count.
Estimating the join between partsupp and lineitem
Previously I mentioned that joining lineitem with partsupp does not reduce rows.
But there is no composite foreign key between these tables.
How can the optimiser know that there won't be a row reduction?
The answer is: it can't.
Let's look at what the three optimisers we care about make of this join.
If we simply execute this:
SELECT COUNT(*)
FROM lineitem
JOIN partsupp
ON l_partkey = ps_partkey AND l_suppkey = ps_suppkey
| Optimiser | Estimate | Actual | Estimated Selectivity |
|---|---|---|---|
| DuckDB | 12564486 | 5998820 | 209% |
| PostgreSQL | 608 | 5998820 | 0.01% |
| SQL Server | 801167 | 5998820 | 13% |
The actual selectivity is 100%.
Notice how these estimates are all over the place.
- PostgreSQL likely thinks that both ps_partkey and ps_suppkey are filters on lineitem and thus reduces the estimate by what it believes the combined selectivity is
- SQL Server similarly thinks something along those lines, but its algorithm for combining selectivities is much less aggressive when multiple filters are in play
- DuckDB, on the other hand, is using an entirely different estimation algorithm. From what I know about DuckDB, this is likely some kind of HyperLogLog overlap estimation
Trying to help the Optimiser
I tried doing this with PostgreSQL:
CREATE STATISTICS stat_lineitem1 (ndistinct) ON l_partkey, l_suppkey FROM tpch.lineitem;
CREATE STATISTICS stat_pspartsupp1 (ndistinct) ON ps_partkey, ps_suppkey FROM tpch.partsupp;
CREATE STATISTICS stat_lineitem2 (DEPENDENCIES) ON l_partkey, l_suppkey FROM tpch.lineitem;
CREATE STATISTICS stat_pspartsupp2 (DEPENDENCIES) ON ps_partkey, ps_suppkey FROM tpch.partsupp;
But unfortunately, this does not seem to influence the bad estimate - I get the same estimate. PostgreSQL people in my network, please tell me that I am doing something wrong here!
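For anyone checking along: what PostgreSQL actually collected can be inspected through the pg_stats_ext view once the tables have been analysed - a sketch:

-- Sketch: confirm the extended statistics exist and what they contain.
ANALYZE tpch.lineitem;
ANALYZE tpch.partsupp;

SELECT statistics_name, attnames, n_distinct, dependencies
FROM pg_stats_ext
WHERE statistics_name IN ('stat_lineitem1', 'stat_pspartsupp1',
                          'stat_lineitem2', 'stat_pspartsupp2');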
However, if I do this in SQL Server:
CREATE STATISTICS stat_lineitem1 ON tpch.lineitem (l_partkey, l_suppkey);
CREATE STATISTICS stat_pspartsupp1 ON tpch.partsupp (ps_partkey, ps_suppkey);
Then my estimate becomes 5967483 rows... as I like to say: "Bang on the money!"
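If you want to see what SQL Server bases that number on, the density vector behind the new statistics can be dumped directly - a sketch:

-- Sketch: the all-density for (l_partkey, l_suppkey) is the figure the
-- optimiser can now use when estimating this join.
DBCC SHOW_STATISTICS ('tpch.lineitem', stat_lineitem1) WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('tpch.partsupp', stat_pspartsupp1) WITH DENSITY_VECTOR;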
A post-script for Data Vaulters
TODO
Summary
In today's analysis of Query 9 we saw that optimisers can save many join operations if they understand how to aggregate data before joining it. But an optimiser can only do so safely if it has good statistics: unless you know how large an aggregate will be, it is dangerous to aggregate before you join - you may end up spending more time aggregating than you save by joining.
We also saw that estimating composite key joins can be challenging without composite statistics.

