The Database Doctor
Musing about Databases

TPC series - TPC-H Query 11

Happy 2026 to everyone. It is good to be back writing - this time about Query 11 from TPC-H. We are now halfway through the workload and, if I am doing what I set out to do, you now have a good grasp on:

Query 11

The query deals with a single nation, selected via the n_name filter. Interestingly, the subquery is not correlated with the outer query.

SELECT ps_partkey,
       SUM(ps_supplycost * ps_availqty) AS v
FROM tpch.partsupp
INNER JOIN tpch.supplier
  ON ps_suppkey = s_suppkey
INNER JOIN tpch.nation
  ON s_nationkey = n_nationkey
WHERE n_name = 'JAPAN'
GROUP BY ps_partkey
HAVING SUM(ps_supplycost * ps_availqty)
         > (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001
            FROM tpch.partsupp
            INNER JOIN tpch.supplier
              ON ps_suppkey = s_suppkey
            INNER JOIN tpch.nation
              ON s_nationkey = n_nationkey
            WHERE n_name = 'JAPAN')
ORDER BY v DESC

Notice that the subquery in the HAVING clause touches exactly the same rows as the outer query.
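
Since both blocks read exactly the same rows, the threshold could - hypothetically, TPC-H does not let you rewrite the benchmark queries - be computed in a single pass with a window function over the grouped result. A sketch of that rewrite (standard SQL that PostgreSQL, DuckDB and SQL Server should all accept):

SELECT ps_partkey, v
FROM (SELECT ps_partkey,
             SUM(ps_supplycost * ps_availqty) AS v,
             -- total over all groups, i.e. the same value the subquery computes
             SUM(SUM(ps_supplycost * ps_availqty)) OVER () * 0.0001 AS threshold
      FROM tpch.partsupp
      INNER JOIN tpch.supplier
        ON ps_suppkey = s_suppkey
      INNER JOIN tpch.nation
        ON s_nationkey = n_nationkey
      WHERE n_name = 'JAPAN'
      GROUP BY ps_partkey) AS grouped
WHERE v > threshold
ORDER BY v DESC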

Filter Selectivity

Two filters are present in the query: the n_name predicate and the HAVING clause.

The selectivity of these filters depends on the data generation and on the random pick of the nation (n_name).

There is also an interesting trap here. The data in TPC-H is made in such a way that every combination of l_partkey and l_suppkey in lineitem has a corresponding match in partsupp. But how can the optimiser know?
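
If you want to convince yourself of that property, a quick check is an anti-join counting lineitem rows without a partner in partsupp - on a correctly generated TPC-H database the count is 0:

SELECT COUNT(*) AS orphan_lineitem_rows
FROM tpch.lineitem
LEFT JOIN tpch.partsupp
  ON l_partkey = ps_partkey
 AND l_suppkey = ps_suppkey
WHERE ps_partkey IS NULL   -- rows with no matching partsupp entry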

And the selectivities are:

Filter                                 Selectivity  Cardinality
n_name = 'JAPAN'                       4%           1
SUM(...) > (SELECT SUM(...) * 0.0001)  4%           1196

The last filter can only be applied after we group, so the only filter we are concerned with is the one on n_name.
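
If you want to see where the 1196 comes from, a sketch like this (the exact number varies with scale factor and the randomly picked nation) counts the groups before and after the HAVING threshold:

WITH grouped AS (
    SELECT ps_partkey,
           SUM(ps_supplycost * ps_availqty) AS v
    FROM tpch.partsupp
    INNER JOIN tpch.supplier
      ON ps_suppkey = s_suppkey
    INNER JOIN tpch.nation
      ON s_nationkey = n_nationkey
    WHERE n_name = 'JAPAN'
    GROUP BY ps_partkey
), threshold AS (
    -- summing the group sums gives the same total as the subquery in the HAVING clause
    SELECT SUM(v) * 0.0001 AS t
    FROM grouped
)
SELECT COUNT(*)                               AS groups_before_having,
       SUM(CASE WHEN v > t THEN 1 ELSE 0 END) AS groups_after_having
FROM grouped, threshold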

Optimal Join Order

There is only one filter worth taking in this query: p_name LIKE '%grey%'. As always, lineitem is the largest stream of data - so we want to get to p_name as fast as possible. By transitive closure of the keys in the data model, we know that a filter on p_partkey is also a filter on ps_partkey and, in turn, on l_partkey. In other words: the filter on p_name also reduces partsupp.

Without any knowledge of statistics, it should be clear that the initial join order must be: part JOIN partsupp JOIN lineitem.

Since partsupp is much smaller than lineitem (by about an order of magnitude) it makes sense that any hash strategy should construct a hash table on the result of part JOIN partsupp and then join that to lineitem. That is indeed what SQL Server and PostgreSQL do.
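
If you want to experiment with the join order yourself, one way in PostgreSQL (the setting is real, the probe query below is just an illustration) is to pin the textual join order and look at the plan:

SET join_collapse_limit = 1;   -- keep the joins in the order they are written

EXPLAIN (ANALYZE)
SELECT COUNT(*)
FROM tpch.part
INNER JOIN tpch.partsupp
  ON ps_partkey = p_partkey
INNER JOIN tpch.lineitem
  ON l_partkey = ps_partkey
 AND l_suppkey = ps_suppkey
WHERE p_name LIKE '%grey%'

The planner still decides which side of each hash join becomes the build side, so you can see whether it really builds on the result of part JOIN partsupp.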

But note that if we believe that the keys in lineitem only point to a fraction of partsupp, we might just decide to join lineitem with partsupp first.

Or worse... if you are ClickHouse...

ClickHouse once again embarrasses itself

Here is the plan that ClickHouse makes:

Estimate    Actual  Operator
       -       175  PROJECT SUM(amount)
       -       175  SORT nation, o_year
       -       175  GROUP BY HASH nation, o_year AGGREGATE SUM(amount)
       -    295609  PROJECT EXTRACT(YEAR FROM o_orderdate), ps_supplycost * l_quantity, n_name, 1 - l_discount, EXTRACT(YEAR FROM o_orderdate), l_extendedprice * (1 - l_discount), (l_extendedprice * (1 - l_discount)) - (ps_supplycost * l_quantity), (l_extendedprice * (1 - l_discount)) - (ps_supplycost * l_quantity)
       -    295609  INNER JOIN HASH ON s_nationkey = n_nationkey
       -        25  │└TABLE SCAN nation
       -    295609  INNER JOIN HASH ON l_suppkey = s_suppkey
       -     10000  │└TABLE SCAN supplier
       -    295630  INNER JOIN HASH ON l_partkey = p_partkey
       -     10886  │└TABLE SCAN part WHERE p_name LIKE '%lace%'
       -   5419720  INNER JOIN HASH ON (l_suppkey,l_partkey) = (ps_suppkey,ps_partkey)
       -    800000  │└TABLE SCAN partsupp
       -   5419720  INNER JOIN HASH ON l_orderkey = o_orderkey
       -   1500000  │└TABLE SCAN orders
       -   5998820  TABLE SCAN lineitem

As the human query optimiser you have now become, reflect for a moment on how terrible this plan is.

ClickHouse does not even bring the filter on p_name into play until after it has done a series of joins. In fact, does something stand out about this join tree to you?

If we follow the relationships between the tables in the data model, draw them as a tree and order the children by total table size, we get:

lineitem    6M
├orders    1.5M
├partsupp  800K
├part      200K
└supplier   10K
 └nation     25

Turn the above upside down, and you get the join order that ClickHouse picked.

Coincidence? Perhaps...

Continuing the join chain

After we have taken the filters, the rest of the join order does not matter as much.

Let's look at the tables we still need to join: orders, supplier and nation.

orders First

All the databases with decent optimisers go after orders before the other two. This makes sense, because we have a calculation we need to do: EXTRACT(YEAR FROM o_orderdate).

Since we have reduced lineitem via part, we know that a similar reduction has to occur in orders (because of the keys in the data model). This means we should join orders next and get that calculation done. From there on, we know there won't be any more row reductions - because we are now joining on FK/PK relationships without further filters.

supplier and nation

We can only reach nation by going via supplier first. Whenever we see one of these "go via X to get to Y" situations, it generally means there is an opportunity for a bushy join.

We could join nation with supplier and then join the result of that to lineitem. Indeed, this is what DuckDB decides to do in this fragment:

    9615     10000  │└INNER JOIN HASH ON s_nationkey = n_nationkey
      25        25  │ │└TABLE SCAN nation
   10000     10000  │ TABLE SCAN supplier

... Or we could be memory conservative at the expense of CPU time and do loop joins into supplier and then a hash join to nation (recall that nation is tiny - 25 rows). This saves us the trouble of building a hash table on supplier - but it means we instead have to dig into indexes with loop joins, which costs a lot more CPU.
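
A quick way to compare the two flavours in PostgreSQL is to flip the planner switches and look at the plans. This is an illustrative probe on the supplier/nation leg, not the full query:

-- Hash variant: spend memory on a hash table over supplier
SET enable_nestloop = off;

-- Loop variant: spend CPU on index probes into supplier instead
-- SET enable_nestloop = on;
-- SET enable_hashjoin = off;
-- SET enable_mergejoin = off;

EXPLAIN (ANALYZE)
SELECT COUNT(*)
FROM tpch.lineitem
INNER JOIN tpch.supplier
  ON l_suppkey = s_suppkey
INNER JOIN tpch.nation
  ON s_nationkey = n_nationkey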

That loop-join strategy is exactly what PostgreSQL picks, which is probably not a surprise to you at this point:

     148    245492  INNER JOIN HASH ON s_nationkey = n_nationkey
      25        25  │└TABLE SCAN nation
     148    245492  INNER JOIN LOOP ON s_suppkey = l_suppkey
     148    245520  │└INNER JOIN LOOP ON o_orderkey = l_orderkey
     148    245520  │ │└INNER JOIN LOOP ON l_partkey = ps_partkey <---- Notice the bad estimate here
   46920     32792  │ │ │└INNER JOIN HASH ON ps_partkey = p_partkey
   10100     10940  │ │ │ │└TABLE SCAN part WHERE p_name LIKE '%lace%'
  774192    600000  │ │ │ TABLE SCAN partsupp
   32792    229544  │ │ TABLE SEEK lineitem WHERE ps_suppkey = l_suppkey
  245520    245520  │ TABLE SEEK orders
  245520    245520  TABLE SEEK supplier

Notice that we still end up joining 245492 times to get the n_name column that we need for the final aggregate.

SQL Server will have none of this nonsense... Once again, it has a trick to play.

Aggregate before Join

Before we go into the details of SQL Server's optimisations, consider what its query plan looks like:

Estimate    Actual  Operator
     175       175  PROJECT CASE WHEN globalagg1019 = 0 THEN NULL ELSE globalagg1021 END AS Expr1016
     175       175  GROUP BY HASH ,  AGGREGATE SUM(partialagg1018) AS globalagg1019, SUM(partialagg1020) AS globalagg1021
     175       175  SORT n_name, Expr1014
     175       175  INNER JOIN MERGE ON n_nationkey = s_nationkey   <---- Only 175 joins here!
      25        25  │└TABLE SEEK nation
     175       175  SORT s_nationkey            v---- what is THIS?
     175       175  GROUP BY HASH AGGREGATE COUNT(Expr1017 - ps_supplycost * l_quantity) AS partialagg1018, ...
   34408    245491  INNER JOIN MERGE ON o_orderkey = l_orderkey
 1500000   1499999  │└PROJECT EXTRACT(YEAR FROM o_orderdate) AS Expr1014
 1500000   1499999  │ TABLE SEEK orders

SQL Server only does 175 joins to nation - as opposed to the 245K that both DuckDB and PostgreSQL do.

How does it pull this off? It reasons like this: the final aggregate only needs n_name and the order year, and n_name is functionally determined by s_nationkey. So it can aggregate on s_nationkey and the year expression first - shrinking the stream to 175 rows (25 nations x 7 order years) - and only then join that tiny result to nation to pick up the name.

This optimisation of aggregating before doing a join is relatively rare. It also seems to be very hard for the PostgreSQL community to wrap its head around this feature, which SQL Server has had since SQL Server 2005 (released in the year of its name).

For example, see:

This aggregate-before-join optimisation is often a powerful trick in analytical queries. TPC-H Query 9 checks whether it is present and rewards SQL Server with the lowest join count.
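
If your database does not do this on its own, you can sometimes write the pre-aggregation by hand. A sketch of the idea for the Query 9 shape (column names follow the plans above; this is an illustration, not the official benchmark text):

SELECT n_name AS nation,
       o_year,
       sum_profit
FROM (SELECT s_nationkey,
             EXTRACT(YEAR FROM o_orderdate) AS o_year,
             SUM(l_extendedprice * (1 - l_discount)
                 - ps_supplycost * l_quantity) AS sum_profit
      FROM tpch.lineitem
      INNER JOIN tpch.part
        ON p_partkey = l_partkey
      INNER JOIN tpch.partsupp
        ON ps_partkey = l_partkey
       AND ps_suppkey = l_suppkey
      INNER JOIN tpch.orders
        ON o_orderkey = l_orderkey
      INNER JOIN tpch.supplier
        ON s_suppkey = l_suppkey
      WHERE p_name LIKE '%lace%'
      -- aggregate on s_nationkey (which determines n_name) before touching nation
      GROUP BY s_nationkey, EXTRACT(YEAR FROM o_orderdate)) AS pre
INNER JOIN tpch.nation
  ON n_nationkey = s_nationkey
ORDER BY nation, o_year DESC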

Estimating the join between partsupp and lineitem

Previously I mentioned that joining lineitem with partsupp does not reduce rows. But there is no composite foreign key between these tables. How can the optimiser know that there won't be a row reduction?

The answer is: It can't

Let's look at what the three optimisers we care about think of this join.

If we simply execute this:

SELECT COUNT(*) 
FROM lineitem 
JOIN partsupp 
  ON l_partkey = ps_partkey AND l_suppkey = ps_suppkey

Optimiser    Estimate   Actual    Estimated Selectivity
DuckDB       12564486   5998820   209%
PostgreSQL   608        5998820   0.01%
SQL Server   801167     5998820   13%

The actual selectivity is 100%: every row in lineitem matches exactly one row in partsupp.

Notice how these estimates are all over the place.
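
For intuition, here is a back-of-the-envelope version of the textbook independence-assumption estimate (each engine plugs in its own sampled ndistinct values and its own refinements, so none of them land exactly here):

estimated rows ≈ |lineitem| * |partsupp|
                 / ( max(ndv(l_partkey), ndv(ps_partkey)) * max(ndv(l_suppkey), ndv(ps_suppkey)) )
               ≈ 6,000,000 * 800,000 / (200,000 * 10,000)
               ≈ 2,400

That is three orders of magnitude below the ~6M rows that actually come out, simply because l_partkey and l_suppkey are anything but independent: the pair behaves like a foreign key into partsupp.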

Trying to help the Optimiser

I tried doing this with PostgreSQL:

CREATE STATISTICS stat_lineitem1 (ndistinct) ON l_partkey, l_suppkey FROM tpch.lineitem;
CREATE STATISTICS stat_pspartsupp1 (ndistinct) ON ps_partkey, ps_suppkey FROM tpch.partsupp;
CREATE STATISTICS stat_lineitem2 (dependencies) ON l_partkey, l_suppkey FROM tpch.lineitem;
CREATE STATISTICS stat_pspartsupp2 (dependencies) ON ps_partkey, ps_suppkey FROM tpch.partsupp;

But unfortunately, this does not seem to influence the bad estimate - I get exactly the same numbers. PostgreSQL people in my network, please tell me that I am doing something wrong here!
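
For what it is worth, you can at least confirm the statistics objects exist and were populated (standard PostgreSQL catalogs; remember to ANALYZE first). As far as I can tell, PostgreSQL currently only applies extended statistics to scan-level predicates and GROUP BY estimates, not to join selectivity, which would explain the unchanged estimate:

ANALYZE tpch.lineitem;
ANALYZE tpch.partsupp;

-- list the extended statistics and their multi-column ndistinct values
SELECT statistics_name, attnames, n_distinct
FROM pg_stats_ext
WHERE statistics_name LIKE 'stat_%';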

However, if I do this in SQL Server:

CREATE STATISTICS stat_lineitem1 ON tpch.lineitem (l_partkey, l_suppkey);
CREATE STATISTICS stat_pspartsupp1 ON tpch.partsupp (ps_partkey, ps_suppkey);

Then the estimate becomes 5967483 rows... as I like to say: "Bang on the money!"
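
You can peek at what SQL Server learned with DBCC SHOW_STATISTICS - the "All density" of the (l_partkey, l_suppkey) pair is essentially 1 / ndistinct of the combination, which is what lets the join estimate land this close:

-- show the density vectors of the two multi-column statistics created above
DBCC SHOW_STATISTICS ('tpch.lineitem', stat_lineitem1) WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('tpch.partsupp', stat_pspartsupp1) WITH DENSITY_VECTOR;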

A post-script for Data Vaulters

TODO

Summary

In today's analysis of Query 9 we saw that optimisers can save many join operations if they understand how to aggregate data before joining it. But the optimiser can only do this safely if it has good statistics: unless you know roughly how large the aggregate will be, it is dangerous to aggregate before you join - you may end up spending more time aggregating than you save on the joins.

We also saw that estimating composite key joins can be challenging without composite statistics.