The Database Doctor
Musing about Databases

TPC series - TPC-H Query 8 - Loops, merges and hash

In today's look at TPC-H Q08, we're going to once again walk through an optimal join order example.

The query is quite boring, but it does allow me to talk about a few other things related to optimiser costing: join algorithms (or, as some would call them, "physical operations").

Query 8

The query visits nearly every table in the data model. It has a few simple filters:

SELECT o_year,
       SUM(CASE
           WHEN nation = 'GERMANY'
               THEN volume
           ELSE 0
           END) / SUM(volume) AS mkt_share
FROM (SELECT EXTRACT(YEAR FROM o_orderdate)     AS o_year,
             l_extendedprice * (1 - l_discount) AS volume,
             n2.n_name                          AS nation
      FROM tpch.lineitem
      INNER JOIN tpch.part ON l_partkey = p_partkey
      INNER JOIN tpch.supplier ON l_suppkey = s_suppkey
      INNER JOIN tpch.nation n2 ON n2.n_nationkey = s_nationkey
      INNER JOIN tpch.orders ON l_orderkey = o_orderkey
      INNER JOIN tpch.customer ON o_custkey = c_custkey
      INNER JOIN tpch.nation n1 ON c_nationkey = n1.n_nationkey
      INNER JOIN tpch.region ON n1.n_regionkey = r_regionkey
      WHERE r_name = 'EUROPE'
        AND o_orderdate BETWEEN '1995-01-01' AND '1996-12-31'
        AND p_type = 'MEDIUM PLATED BRASS') AS all_nations
GROUP BY o_year
ORDER BY o_year;

Selectivity of Filters:

Let us do the selectivity analysis that has become routine. We have:

Filter                                             Selectivity  Cardinality
r_name = 'EUROPE'                                  20%          1
o_orderdate BETWEEN '1995-01-01' AND '1996-12-31'  29%          428K
p_type = 'MEDIUM PLATED BRASS'                     < 1%         1.3K

Note how juicy the p_type filter is; we want to take it early!

Also, note that while the r_name filter is great, we can't really get to it until we have joined through customer and nation (alias n1).
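
If you want to verify these numbers yourself, the counts are easy to check. A minimal sketch, assuming the same tpch schema used in the query above and a scale factor 1 dataset:

-- Row counts behind the selectivity table (tpch schema as in Q08, SF1 assumed)
SELECT COUNT(*) FROM tpch.region
WHERE r_name = 'EUROPE';                                  -- 1 of 5 regions = 20%

SELECT COUNT(*) FROM tpch.orders
WHERE o_orderdate BETWEEN '1995-01-01' AND '1996-12-31';  -- ~428K of 1.5M orders = ~29%

SELECT COUNT(*) FROM tpch.part
WHERE p_type = 'MEDIUM PLATED BRASS';                     -- ~1.3K of 200K parts, < 1%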

Optimal Join Order

As is usual, lineitem is our largest stream. It drives the rest of the join ordering — we want it to be on the probe side of joins (or drive seeks in loop joins).

The best filter is the one on part via p_type = 'MEDIUM PLATED BRASS'. We take this filter early by joining from lineitem to part, reducing the lineitem stream to less than 1% of its original size. An optimiser with good statistics can be reasonably confident in this estimate.

With high certainty, the optimiser knows that the stream of lineitem JOIN part is now smaller than orders - even when we take the filter of o_orderdate BETWEEN '1995-01-01' AND '1996-12-31' into account.

So far, we're joining: lineitem -> part -> orders.

If we choose a hash join to orders, we also know that we want lineitem to be on the build side of that join - but only after it has been reduced by the join to part.
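
If you want to experiment with this ordering yourself, PostgreSQL lets you pin it: with join_collapse_limit set to 1, the planner keeps explicit JOINs in the order they are written. A sketch of just the lineitem -> part -> orders core, assuming the tpch schema above:

SET join_collapse_limit = 1;  -- keep the explicit JOIN order as written

EXPLAIN (ANALYZE)
SELECT COUNT(*)
FROM tpch.lineitem
INNER JOIN tpch.part ON l_partkey = p_partkey      -- take the juicy p_type filter first
INNER JOIN tpch.orders ON l_orderkey = o_orderkey  -- then the o_orderdate filter
WHERE p_type = 'MEDIUM PLATED BRASS'
  AND o_orderdate BETWEEN '1995-01-01' AND '1996-12-31';

Comparing the output with and without the pinned order is a quick way to see whether the optimiser agrees with our hand-crafted plan.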

ClickHouse Plan — how to fail at Estimation

Before we look at the last filter, r_name = 'EUROPE', I think it's worth checking out the ClickHouse query plan.

Here it is:

Estimate    Actual  Operator
       -         2  PROJECT sumIf(volume,(nation = 'FRANCE')) / SUM(volume), sumIf(volume,(nation = 'FRANCE')) / SUM(volume)
       -         2  SORT o_year
       -         2  GROUP BY HASH o_year AGGREGATE sumIf(volume,(nation = 'FRANCE')), SUM(volume)
       -      2400  PROJECT EXTRACT(YEAR FROM o_orderdate), n_name, 1 - l_discount, EXTRACT(YEAR FROM o_orderdate), l_extendedprice * (1 - l_discount), nation = 'FRANCE', l_extendedprice * (1 - l_discount)
       -      2400  INNER JOIN HASH ON s_nationkey = n_nationkey
       -        25  │└TABLE SCAN nation
       -      2050  INNER JOIN HASH ON n_regionkey = r_regionkey  <---- Take the filter here
       -         1  │└TABLE SCAN region WHERE r_name = 'EUROPE' 
       -     10392  INNER JOIN HASH ON c_nationkey = n_nationkey
       -        25  │└TABLE SCAN nation
       -     10392  INNER JOIN HASH ON o_custkey = c_custkey
       -    150000  │└TABLE SCAN customer                         <---- Greedily looking for the next filter
       -     10392  INNER JOIN HASH ON l_suppkey = s_suppkey
       -     10000  │└TABLE SCAN supplier
       -     10398  INNER JOIN HASH ON l_partkey = p_partkey
       -      1353  │└TABLE SCAN part WHERE p_type = 'SMALL POLISHED NICKEL'
       -   1549566  INNER JOIN HASH ON l_orderkey = o_orderkey
       -    428930  │└TABLE SCAN orders WHERE (o_orderdate >= '1995-01-01') AND (o_orderdate <= '1996-12-31')
       -   5998820  TABLE SCAN lineitem

As we have seen before in TPC-H Q07, ClickHouse will only look for left-deep trees. But that's not the worst bit: it also can't estimate even simple filters!

The join order at the bottom of the tree is: lineitem -> orders -> part (followed by supplier, customer and the nation/region joins).

Instead of taking the big reduction to 1% via part first, ClickHouse picks the join to orders. Why does it do this? Most likely because it has no idea what the selectivity of the filters is (= poor statistics), so it makes a guess that results in a lot more work: the intermediate stream after joining orders is around 1.5M rows, where joining part first would have left only around 40K (as the DuckDB plan below shows).
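
You can put numbers on that bad guess by counting the two candidate intermediate results directly. A sketch, again using the tpch schema from the query; the counts line up roughly with the actuals in the plans:

-- Option A: join lineitem to the filtered part first (the reduction we want)
SELECT COUNT(*)
FROM tpch.lineitem
INNER JOIN tpch.part ON l_partkey = p_partkey
WHERE p_type = 'MEDIUM PLATED BRASS';                      -- roughly 40K rows survive

-- Option B: join lineitem to the filtered orders first (what ClickHouse does)
SELECT COUNT(*)
FROM tpch.lineitem
INNER JOIN tpch.orders ON l_orderkey = o_orderkey
WHERE o_orderdate BETWEEN '1995-01-01' AND '1996-12-31';   -- roughly 1.5M rows survive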

Bushing out r_name and customer

Recall that the optimal join order so far is: lineitem -> part -> orders.

At this point, we could directly join to customer and greedily chase the reduction in rows we get from going on to nation and then region, taking the r_name = 'EUROPE' filter.

Being smarter about the r_name filter

Instead of joining the still large stream to customer directly, we could instead do this join on its own first:

nation -> region (taking r_name filter)

The results of this join can then be used to join to the previous stream of: lineitem -> part -> orders -> customer.

This method reduces the number of joins: the stream above no longer needs to be joined first to nation and then to region, only to the result of the already filtered nation.
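
In SQL terms, the bushy branch boils down to pre-filtering nation through region, so the customer side of the stream only has to meet that small result. A sketch of just that branch (the eu_nations alias is mine, not part of the original query):

-- The small, filtered build side: nation JOIN region, taking the r_name filter early
SELECT c_custkey
FROM tpch.customer
INNER JOIN (SELECT n_nationkey
            FROM tpch.nation n1
            INNER JOIN tpch.region ON n1.n_regionkey = r_regionkey
            WHERE r_name = 'EUROPE') AS eu_nations
        ON c_nationkey = eu_nations.n_nationkey;

The big stream then meets this already filtered result once, instead of visiting nation and region one at a time.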

Postgres and DuckDB plans

The PostgreSQL and DuckDB plans are similar, differing only in join type (loop vs. hash).

Here is DuckDB:

Estimate    Actual  Operator
       -         2  SORT o_year
    1752         2  PROJECT o_year, mkt_share
       -         2  GROUP BY HASH #0 AGGREGATE SUM(#1), SUM(#2)
    1961      2323  PROJECT o_year, CASE WHEN(nation = 'FRANCE') THEN volume ELSE 0.000000 END, volume
    1961      2323  PROJECT o_year, volume, nation
    1961      2323  INNER JOIN HASH ON s_nationkey = n_nationkey
      25        25  │└TABLE SCAN nation
    2040      2323  INNER JOIN HASH ON s_suppkey = l_suppkey
    1748      2323  │└INNER JOIN HASH ON c_nationkey = n_nationkey  <---- Bushy Join to get filter
       5         5  │ │└INNER JOIN HASH ON n_regionkey = r_regionkey
       1         1  │ │ │└TABLE SCAN region WHERE r_name = 'EUROPE'
      25         5  │ │ TABLE SCAN nation
    9093      8482  │ INNER JOIN HASH ON c_custkey = o_custkey
    8601     11654  │ │└INNER JOIN HASH ON o_orderkey = l_orderkey
   42624     40975  │ │ │└INNER JOIN HASH ON l_partkey = p_partkey
    1419      1362  │ │ │ │└TABLE SCAN part WHERE p_type = 'SMALL POLISHED NICKEL'
 5998820   5991648  │ │ │ TABLE SCAN lineitem WHERE l_suppkey >= 1
  300000    428372  │ │ TABLE SCAN orders WHERE o_orderdate >= '1995-01-01' AND o_orderdate <= '1996-12-31'
  150000    108026  │ TABLE SCAN customer WHERE c_custkey >= 3
   10000      9989  TABLE SCAN supplier

This is the optimal plan for this query: it pushes the lowest number of rows through each node.

DuckDB and PostgreSQL need around 6M joins to run the query (roughly the sum of rows entering each join operator in the plans), while ClickHouse needs 7.6M for the same query.

Optimising Join Count or Memory?

We've seen how PostgreSQL and DuckDB find the optimal plan for Q08. And at this point, you may be wondering: What about SQL Server — the other competitor in our arena?

First, recall the lesson we learned in TPC-H Q03:

SQL Server makes this plan:

Estimate    Actual  Operator
       2         2  SORT Expr1015
       2         ∞  PROJECT Expr1017 / Expr1018 AS Expr1019
       2         2  PROJECT CASE WHEN Expr1029 = 0 THEN NULL ELSE Expr1030 END AS Expr1017, CASE WHEN Expr1031 = 0 THEN NULL ELSE Expr1032 END AS Expr1018
       2         2  GROUP BY HASH AGGREGATE COUNT(CASE WHEN n_name as n_name = 'FRANCE'THEN l_extendedprice * (1. - l_discount) ELSE 0.0000 END) AS Expr1029, SUM(CASE WHEN n_name as n_name = 'FRANCE'THEN l_extendedprice * (1. - l_discount) ELSE 0.0000 END) AS Expr1030, COUNT(Expr1020) AS Expr1031, SUM(Expr1020) AS Expr1032
    2697      2323  INNER JOIN HASH ON n_nationkey as n_nationkey = s_nationkey
      25        25  │└TABLE SEEK nation
    8047      2323  INNER JOIN MERGE ON s_suppkey = l_suppkey
   10000     10000  │└TABLE SEEK supplier
    2697      2323  SORT l_suppkey
    2697      2323  INNER JOIN HASH ON r_regionkey = n_regionkey as n_regionkey
       1         1  │└TABLE SEEK region WHERE r_name = 'EUROPE'
   13486     11656  INNER JOIN HASH ON n_nationkey as n_nationkey = c_nationkey
      25        25  │└TABLE SEEK nation
   13486     11656  INNER JOIN MERGE ON c_custkey = o_custkey
  150000    149968  │└TABLE SEEK customer
   13486     11656  SORT o_custkey
   13486     11656  INNER JOIN MERGE ON o_orderkey = l_orderkey
  427698    428381  │└PROJECT datepart(year, o_orderdate) AS Expr1015
  427698    428381  │ TABLE SEEK orders WHERE o_orderdate >= '1995-01-01' AND o_orderdate <= '1996-12-31'
   41682     40978  SORT l_orderkey
   41682     40978  INNER JOIN HASH ON p_partkey = l_partkey
    1381      1362  │└TABLE SEEK part WHERE p_type = 'SMALL POLISHED NICKEL'
 5998820   5998820  PROJECT l_extendedprice * (1. - l_discount) AS Expr1020
 5998820   5998820  TABLE SCAN lineitem

Unlike ClickHouse, SQL Server knows in which order to take the filters.

But why does SQL Server pick a Sort/Merge join, and why doesn't it bother with the bushy nation / region join?

Merging, Hashing and Looping

The three classic join algorithms are:

Sort and Merge: makes sure both sides of the join are sorted the same way and then merges the streams. People who have programmed SAS will be quite familiar with this method. As a rule of thumb, Sort/Merge is almost never worth it when other joins are available - particularly not when both sides need to be sorted.

Build and Hash: is the bread and butter join of analytical systems. You build a hash table on one side, then probe into that hash table with the other. This join type is very friendly to vectorised execution and also minimises memory latency. It is also branch-prediction friendly for the CPU. However, it does use slightly more memory: the hash table (unlike the sorted stream) has some overhead in terms of pointer tracking and arrays.

Loop and Seek: This join type is typically used in OLTP systems, and also by some old, but strong, analytical systems like Teradata. If indexes allow it, loop joins have very low memory cost and thus allow for higher concurrency. But their CPU consumption is significantly higher than that of hash joins.

Hybrid approaches exist that combine these methods — but that is the stuff for another blog entry.

Notice that SQL Server picks Merge join. We've previously seen how PostgreSQL sometimes does the same. This preference for merge seems to be a leftover from the time when you needed to save as much memory as possible.
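
PostgreSQL makes it easy to see what happens when one algorithm is taken off the menu: the planner has a switch per join method. A sketch using its enable_* settings (turning one off strongly discourages that method for the current session; the query is just a representative join on the tpch schema above, not the full Q08):

SET enable_mergejoin = off;  -- pretend merge join does not exist
SET enable_hashjoin  = on;
SET enable_nestloop  = on;

EXPLAIN (ANALYZE)            -- ANALYZE reports per-node memory for hash and sort operators
SELECT COUNT(*)
FROM tpch.lineitem
INNER JOIN tpch.part ON l_partkey = p_partkey
WHERE p_type = 'MEDIUM PLATED BRASS';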

SQL Server, like most advanced databases, allows us to say: "run the query as if merge join wasn't available to you". That lets us compare the memory use of a strategy with merge joins against one with hash joins.
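
In SQL Server that instruction is a query-level hint: appending OPTION with the allowed join types restricts the optimiser's physical choices for the whole statement. A sketch on a small representative join (the full Q08 statement takes the same hint at its very end):

SELECT COUNT(*)
FROM tpch.lineitem
INNER JOIN tpch.part ON l_partkey = p_partkey
WHERE p_type = 'MEDIUM PLATED BRASS'
OPTION (HASH JOIN, LOOP JOIN);  -- merge joins are no longer on the menu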

Forcing hash and loop joins on Q08 this way (via OPTION (LOOP JOIN, HASH JOIN)), the answer is:

For the small, 1GB TPC-H dataset, the runtime is nearly identical (4.5s on my laptop).

Hence, I suspect that SQL Server picks the merge join because it knows that it uses less memory and that the runtimes are in the same ballpark.

Interestingly, even when forced to use only hash and loop joins, SQL Server does not manage to find the bushy tree to region / nation.

As a result, PostgreSQL and DuckDB take the winner's seat for Q08 in the Arena!

Summary

Today we further honed our skills in "human query optimising". We found the optimal join order by taking filters in the right order and learned how to spot an opportunity to make a bushy join. The query shape we engineered is the same one that PostgreSQL and DuckDB found - and quite similar to SQL Server's plan.

We saw that understanding the selectivity of filters is crucial to save CPU cycles when executing complex queries and learned that sometimes, a database will pick a lower memory option instead of the plan we want.

We also saw that even on super simple queries like this one, bad estimation can increase the number of joins (and the CPU cycles needed) significantly. In the case of Q08, ClickHouse ends up doing roughly 25% more work than the other query optimisers.

If you're a regular follower of my blog (and thanks for hanging around), you will soon be able to answer the question: "is the query plan I'm getting the one I want?" - even for complex queries.

We will practise more together in future blogs. I look forward to being on this journey with you.