The Database Doctor
Musing about Databases

TPC series - TPC-H Query 5 - Transitive Closure and Join Order Dependencies

Welcome back to the TPC-H analysis. If this is your first time, I highly recommend that you visit the previous blogs in the series first.

They're here (and I look forward to seeing you back in this blog):

For my regulars: We've now learned how to manually search for query plans that are optimal or close to optimal.

Today we will apply this knowledge and see if PostgreSQL, SQL Server, and DuckDB can work out how to run Query 5 optimally.

Query 5

The query is a classic, almost canonical, analytical query — with a small twist.

Like many TPC-H queries, it has a filter on o_orderdate. This filter is a quick test for partitioning elimination. You use metadata about min/max values inside disk blocks (or Parquet files) to check if the block/file contains the range you're looking for. If the metadata tells you that the data can't be in the block - you don't read the block/file.
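
To make the idea concrete, here is a minimal sketch of the pruning test. The block_stats table is hypothetical - in a real engine this metadata lives in block headers or Parquet footers - but the logic is exactly this:

-- Hypothetical per-block metadata: one row per disk block / Parquet file
-- block_stats(block_id, min_orderdate, max_orderdate)
-- A block can only contain matching rows if its [min, max] interval
-- overlaps the filter interval ['1994-01-01', '1995-01-01').
SELECT block_id
FROM block_stats
WHERE max_orderdate >= '1994-01-01'  -- block ends after the filter starts
  AND min_orderdate < '1995-01-01';  -- block starts before the filter ends
-- Any block NOT returned here can be skipped without reading it.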

This is such a blatantly obvious optimisation that I haven't mentioned it in previous blogs, because I didn't want to insult your intelligence. But I recently learned that this optimisation of using min/max metadata to avoid looking at actual data is apparently something the Iceberg crew thinks is an achievement worth mentioning.
It isn't - databases from the late 1970s knew this trick, and we've known how to match values against intervals since the invention of mathematics.

More interestingly, the query also contains a filter on r_name.

SELECT n_name
     , SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM tpch.lineitem
         INNER JOIN tpch.supplier
                    ON l_suppkey = s_suppkey
         INNER JOIN tpch.orders
                    ON l_orderkey = o_orderkey
         INNER JOIN tpch.customer
                    ON o_custkey = c_custkey
         INNER JOIN tpch.nation
                    ON s_nationkey = n_nationkey
         INNER JOIN tpch.region
                    ON r_regionkey = n_regionkey
WHERE c_nationkey = s_nationkey
  AND r_name = 'ASIA'
  AND o_orderdate >= '1994-01-01'
  AND o_orderdate < '1995-01-01'
GROUP BY n_name
ORDER BY revenue DESC;

As always, we start by looking at the selectivity of filters in the query:

Selectivity of Filters

Filter                                                        Selectivity   Cardinality
o_orderdate >= '1994-01-01' AND o_orderdate < '1995-01-01'            14%        213946
r_name = 'ASIA'                                                        20%             5
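
If you want to reproduce the o_orderdate numbers yourself, here is a quick sketch (dividing the two counts gives the selectivity; the cardinalities above assume scale factor 1):

-- Count qualifying rows and total rows in orders in one pass
SELECT SUM(CASE
               WHEN o_orderdate >= '1994-01-01'
                   AND o_orderdate < '1995-01-01' THEN 1
               ELSE 0 END) AS qualifying_rows
     , COUNT(*)            AS total_rows
FROM tpch.orders;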

o_orderdate is a juicy filter on a column in the second-largest table. But r_name = 'ASIA' also has value, because it can implicitly filter other tables via the relationship from region to nation and from there to supplier.

Query Plan Analysis

Our largest table is lineitem, so we shall pick that as the driving table.

We also see that simply joining orders with lineitem first and taking the filter on o_orderdate would already lead to a healthy reduction to 14% of the rows.

It would appear the initial join should be between lineitem and orders.

But, Query 5 contains a trap. It tests for a clever little optimisation that allows us to do even better.

Join order dependency

As the query is written, if we want to take the filter on r_name = 'ASIA' we must follow this join path in the tree:

Another way to think about this is that r_name = 'ASIA' is a filter on supplier, which in turn filters lineitem.
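
In SQL terms, that supplier-side path is just this fragment of the original query, shown in isolation:

-- The only way to reach r_name = 'ASIA' as the query is written:
-- lineitem -> supplier -> nation -> region
SELECT l_orderkey, l_extendedprice, l_discount, n_name
FROM tpch.lineitem
         INNER JOIN tpch.supplier
                    ON l_suppkey = s_suppkey
         INNER JOIN tpch.nation
                    ON s_nationkey = n_nationkey
         INNER JOIN tpch.region
                    ON r_regionkey = n_regionkey
WHERE r_name = 'ASIA';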

Based on what we know so far, our best plan would be something like this:

But there is more to be gained here.

Transitive Closure

If we look at the join dependency graph, we can see that it isn't a tree - it is a directed, acyclic graph.

Notice this little filter:

WHERE c_nationkey = s_nationkey

This creates a dependency between customer and supplier.

Diagramming, we have this graph:

        customer ---------- nation 
      /                             \
orders                              (c_nationkey = s_nationkey) -- region (with r_name = 'ASIA') 
      \                             /
        lineitem - supplier - nation

From relational algebra, we know that equality predicates are transitive: if a = b and b = c, then a = c.

We can use this transitive nature to reason about the query in a different way. The equality between c_nationkey and s_nationkey implies that the region filter applies to both customer and supplier via the nation table.
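
Spelled out, the derivation looks like this. This is a sketch of the reasoning, not output from any engine - the derived join is something the optimiser can infer, not something written in the query:

-- Predicates present in the query:
--   c_nationkey = s_nationkey      (the WHERE clause)
--   s_nationkey = n_nationkey      (the supplier-to-nation join)
-- By transitivity the optimiser can derive a predicate that appears nowhere in the SQL:
--   c_nationkey = n_nationkey
-- ...which means the path customer -> nation -> region (r_name = 'ASIA')
-- is just as valid as the written path supplier -> nation -> region:
SELECT c_custkey, c_nationkey, n_name
FROM tpch.customer
         INNER JOIN tpch.nation
                    ON c_nationkey = n_nationkey  -- the derived join
         INNER JOIN tpch.region
                    ON r_regionkey = n_regionkey
WHERE r_name = 'ASIA';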

This gives us another way to collect the r_name = 'ASIA' filter in the plan:

In other words, we don't have to join via supplier to harvest the r_name filter. We can instead go via customer and filter that table first. customer is a much larger table than supplier (by 15x), so it is better to apply an aggressive filter on the larger table to reduce the rowcount as early as possible.

Since orders is already used to filter lineitem, we can improve the filter further by pre-joining customer, nation and region to orders before we join to lineitem. That allows us to combine the two filters into a single, more powerful filter on lineitem.

We still have to apply the filter on c_nationkey = s_nationkey later in the query, because any rows we find in supplier must be reduced to the c_nationkey values we found in customer.
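
Written out as SQL, the plan we are aiming for looks roughly like the sketch below. None of the engines rewrite the query this way internally, and the filtered_orders name is just for illustration - it is simply the join order described above made explicit, with c_nationkey carried along so the deferred c_nationkey = s_nationkey check can still be applied when we finally join supplier:

WITH filtered_orders AS (SELECT o_orderkey, c_nationkey, n_name
                         FROM tpch.orders
                                  INNER JOIN tpch.customer
                                             ON o_custkey = c_custkey
                                  INNER JOIN tpch.nation
                                             ON c_nationkey = n_nationkey
                                  INNER JOIN tpch.region
                                             ON r_regionkey = n_regionkey
                         WHERE r_name = 'ASIA'
                           AND o_orderdate >= '1994-01-01'
                           AND o_orderdate < '1995-01-01')
SELECT n_name
     , SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM tpch.lineitem
         INNER JOIN filtered_orders
                    ON l_orderkey = o_orderkey
         INNER JOIN tpch.supplier
                    ON l_suppkey = s_suppkey
                        AND s_nationkey = c_nationkey -- the deferred filter
GROUP BY n_name
ORDER BY revenue DESC;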

Knowing just a little relational algebra allows us to "move the join around" by following the structure of the graph.

Does this actually happen in real life and how big is the impact?

Actual Query Plans

Let us see how today's competitors fare.

I've extended my query plan tooling. It can now understand SQL Server query plans and render them in the compact format we use on this blog. Adding query plan support for a new engine is an involved task taking several days. If you feel like getting involved in this work for your favourite database engine - reach out to me on LinkedIn and I will happily coach you through the process. Fair warning: you will need to be reasonably fluent in C++ to pull this off.

Patreon supporters of "Joins without Borders" get priority when I do driver implementations on my weekends - which is when I build drivers that are not related to my day job.

With that said - let us have a look at some real life query plans.

PostgreSQL

First, here is the plan:

Estimate    Actual  Operator
      25         5  SORT SUM(l_extendedprice * ('1' - l_discount))
      25         5  GROUP BY SORT n_name AGGREGATE SUM(l_extendedprice * ('1' - l_discount))
     125        25  GROUP BY SORT n_name AGGREGATE PARTIAL SUM(l_extendedprice * ('1' - l_discount))
    8700      6990  SORT n_name
    8700      6990  INNER JOIN HASH ON (l_suppkey = s_suppkey) AND (n_nationkey = s_nationkey)
   10000     10000  │└SCAN supplier
  217380    171090  INNER JOIN HASH ON o_custkey = c_custkey
   12500     30049  │└INNER JOIN HASH ON c_nationkey = n_nationkey
       5         5  │ │└INNER JOIN HASH ON n_regionkey = r_regionkey
       1         1  │ │ │└SCAN region WHERE r_name = 'EUROPE'
      25        25  │ │ SCAN nation
   62500    150000  │ SCAN customer
 1086900    856050  INNER JOIN HASH ON l_orderkey = o_orderkey
  350695    214015  │└SCAN orders WHERE (o_orderdate >= '1995-01-01') AND (o_orderdate < '1996-01-01')
 7498290   5998820  SCAN lineitem

It is not a complete failure — but it isn't good either.

PostgreSQL realizes that taking the filter on orders first is worth it. It is also able to infer that it is better to use nation / region to filter customer instead of supplier.

But PostgreSQL isn't able to infer that we could have used the filter on customer to also filter orders before joining to lineitem.

Side note: I suspect that MySQL would in fact pick a worse plan than PostgreSQL here - but I have not tooled it yet. Look me up if you feel passionate about MySQL / MariaDB.

SQL Server

With our new tooling, we can now look at SQL Server's plan:

Estimate    Actual  Operator
       5         5  SORT Expr1013
       5         5  PROJECT CASE WHEN Expr1019 = 0 THEN NULL ELSE Expr1020 END AS Expr1013
       5         5  GROUP BY HASH AGGREGATE COUNT(Expr1014) AS Expr1019, SUM(Expr1014) AS Expr1020
    7625      6988  INNER JOIN HASH ON s_suppkey = l_suppkey AND n_nationkey = s_nationkey
   10000     10000  │└SCAN supplier
  190648    171091  INNER JOIN HASH ON o_orderkey = l_orderkey
   42305     42896  │└INNER JOIN HASH ON c_custkey = o_custkey
   30000     30049  │ │└INNER JOIN HASH ON c_nationkey = n_nationkey
       5         5  │ │ │└INNER JOIN LOOP ON n_regionkey = r_regionkey
       1         5  │ │ │ │└SCAN region WHERE r_name = 'EUROPE'
      25        25  │ │ │ SCAN nation
  150000    150000  │ │ SCAN customer
  212096    214017  │ SCAN orders WHERE o_orderdate >= '1995-01-01' AND o_orderdate < '1996-01-01'
 5998820   5998820  PROJECT l_extendedprice * (1. - l_discount) AS Expr1014
 5998820   5998820  SCAN lineitem

Side note: Interestingly, SQL Server appears to have a bug in the way it reports clustered index scans. According to the query plan, the actual row count from region is 5, but if you look at what actually comes into the join, it is one row.

SQL Server does exactly the right thing: it infers that a bushy join, combining the filters into orders before joining lineitem, is the optimal strategy.

Also note how good the row estimation is. The statistics quality in this database is outstanding.

DuckDB

DuckDB, despite having estimates that are a bit off, finds the same good query plan that SQL Server does:

Estimate    Actual  Operator
  272721         5  SORT SUM(l_extendedprice * (1 - l_discount))
  272721         5  GROUP BY HASH #0 AGGREGATE SUM(#1)
  287133      6988  PROJECT n_name, l_extendedprice * (1.000 - l_discount)
  287133      6988  INNER JOIN HASH ON l_suppkey = s_suppkey AND n_nationkey = s_nationkey
   10000     10000  │└SCAN supplier
  246102    171079  INNER JOIN HASH ON l_orderkey = o_orderkey
   60992     42896  │└INNER JOIN HASH ON o_custkey = c_custkey
   28846     30048  │ │└INNER JOIN HASH ON c_nationkey = n_nationkey
       5         5  │ │ │└INNER JOIN HASH ON n_regionkey = r_regionkey
       1         1  │ │ │ │└SCAN region WHERE r_name = 'EUROPE'
      25         5  │ │ │ SCAN nation
  150000    108053  │ │ SCAN customer WHERE c_custkey >= 3
  300000    213997  │ SCAN orders WHERE o_orderdate >= '1995-01-01' AND o_orderdate < '1996-01-01'
 5998820   5998029  SCAN lineitem WHERE l_suppkey >= 1

But there is a small difference worth commenting on.

Deferring Expression Evaluation

The query includes this calculation:

l_extendedprice * (1 - l_discount)

SQL Server and DuckDB make two different tradeoffs here:

DuckDB says: I don't want to calculate this expression until I've reduced the row count (via the joins). It calculates the expression very late in the plan, where the rowcount is only 6988.

Estimate    Actual  Operator
  287133      6988  PROJECT n_name, l_extendedprice * (1.000 - l_discount)
  287133      6988  INNER JOIN HASH ON l_suppkey = s_suppkey AND n_nationkey = s_nationkey

SQL Server makes a different choice. It realizes that l_extendedprice and l_discount are only needed to calculate this expression. Instead of flowing both columns up the query tree, it collapses them into the expression directly after the scan:

Estimate    Actual  Operator
 5998820   5998820  PROJECT l_extendedprice * (1. - l_discount) AS Expr1014
 5998820   5998820  SCAN lineitem

This does mean that SQL Server has to do the maths almost six million times (as opposed to around 7K times for DuckDB). The tradeoff is that SQL Server then carries a single computed column instead of two, and consumes less memory in the rest of the plan.

The impact of good planning in Q5 — and in general

I can now make numerical comparisons of the work each database engine must do to run a plan.

For example, we can compare how many times the database engine has to hash values and look things up in those hash tables.

We get this:

Engine                     Total Rows Joined    Rows Hash Build
SQL Server and DuckDB                6533953             254041
PostgreSQL                           7175985            1067194

What a staggering difference! PostgreSQL, with its poor query plan, needs to make 4x more hash table inserts than SQL Server and DuckDB. It also ends up doing more actual joining.

Consider for a moment what that means...

For a query like this, the database will spend a great deal of CPU time joining. It joins almost as much as it scans. Making joins effective is an ongoing research area that keeps improving. But fast joining (and scanning) is also a problem that runs into the limits of physics. We've previously seen how DuckDB and Yellowbrick do a great job here. But if your query planner is bad - all that work is for nothing.

Finding a 4x performance improvement in join speed - to make up for bad planning - is hard. A 4x drop in performance is a 4x multiplier on your cloud bill. And if that is something you are paid to care about, you probably need to care about query planning.

Bad query planning - like bad programming patterns (Java coders, I'm looking at you) - can wipe out all the technical progress the industry has made in CPUs and computer architecture.

Query planning really matters!

Hand rolling a Query Plan

There is a slight improvement we can make to the query that I haven't seen any optimiser do.

Even though we already took most of the filter value by joining to customer, there is a little bit more to be had by also joining to supplier. This optimisation is worth doing because nation and region are tiny tables - duplicating the join and scan work is not a big deal.

We could rewrite the query to this:

WITH supplier_push AS (SELECT s.*
                       FROM tpch.supplier AS s
                                INNER JOIN tpch.nation
                                           ON s_nationkey = n_nationkey
                                INNER JOIN tpch.region
                                           ON n_regionkey = r_regionkey
                       WHERE r_name = 'EUROPE')
SELECT n_name,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM tpch.lineitem
INNER JOIN tpch.orders
  ON l_orderkey = o_orderkey
INNER JOIN tpch.customer
  ON o_custkey = c_custkey
INNER JOIN tpch.nation
  ON c_nationkey = n_nationkey
INNER JOIN tpch.region
  ON n_regionkey = r_regionkey
WHERE r_name = 'EUROPE'
  AND o_orderdate >= '1995-01-01'
  AND o_orderdate < '1996-01-01'
  AND l_suppkey IN (SELECT s_suppkey FROM supplier_push)
GROUP BY n_name
ORDER BY revenue DESC;

The improvement is quite small, but here it is:

Engine                     Total Rows Joined    Rows Hash Build
SQL Server and DuckDB                6533953             254041
PostgreSQL                           7175985            1067194
Hand-rolled                          6495272             280732

We spend a bit more hash build work to reduce the amount of joining. It is a tiny improvement and likely not one that optimisers have pursued much.

Here is the plan (which I ran on DuckDB):

Estimate    Actual  Operator
   48784         5  SORT SUM(l_extendedprice * (1 - l_discount))
   48784         5  GROUP BY HASH #0 AGGREGATE SUM(#1)
   49220     34761  PROJECT n_name, l_extendedprice * (1.000 - l_discount)
   49220     34761  LEFT SEMI JOIN HASH ON l_suppkey = #0
    1923      2033  │└PROJECT s_suppkey
    1923      2033  │ INNER JOIN HASH ON s_nationkey = n_nationkey
       5         5  │ │└INNER JOIN HASH ON n_regionkey = r_regionkey
       1         1  │ │ │└SCAN region WHERE r_name = 'EUROPE'
      25         5  │ │ SCAN nation
   10000      7192  │ SCAN supplier
  246102    170984  INNER JOIN HASH ON l_orderkey = o_orderkey
   60992     42896  │└INNER JOIN HASH ON o_custkey = c_custkey
   28846     30048  │ │└INNER JOIN HASH ON c_nationkey = n_nationkey
       5         5  │ │ │└INNER JOIN HASH ON n_regionkey = r_regionkey
       1         1  │ │ │ │└SCAN region WHERE r_name = 'EUROPE'
      25         5  │ │ │ SCAN nation
  150000    108053  │ │ SCAN customer WHERE c_custkey >= 3
  300000    213997  │ SCAN orders WHERE o_orderdate >= '1995-01-01' AND o_orderdate < '1996-01-01'
 5998820   5995036  SCAN lineitem WHERE l_suppkey >= 1

Bloom Filter Teaser

Now, there is another optimisation you can do with a query like this (and many others). We can see it in action if we build a column store index on lineitem in SQL Server.

First, we do this:

CREATE CLUSTERED COLUMNSTORE INDEX cix_lineitem ON tpch.lineitem;

Let us first reflect on why it is a bit odd to even consider this idea.

Why would a column store index affect the plan shape? All the tables except one (region, which is looped) are being scanned anyway — and column stores are just really fast scan machines. We're already doing hash joins, which is one of the things column stores are good at too.

But in the case of SQL Server - running a plan that involves a column store scan enables an entirely new set of tricks in the query planner. These tricks could be applied to row store tables as well - but they just aren't.

With the column store index in place, we get this plan:

Estimate    Actual  Operator
       5         5  SORT Expr1013
       5         5  PROJECT CASE WHEN globalagg1023 = 0 THEN NULL ELSE globalagg1025 END AS Expr1013
       5         5  GROUP BY HASH AGGREGATE SUM(partialagg1022) AS globalagg1023, SUM(partialagg1024) AS globalagg1025
       5         5  INNER JOIN HASH ON r_regionkey = n_regionkey
       1         1  │└SCAN region WHERE r_name = 'EUROPE'
       2         5  GROUP BY HASH AGGREGATE COUNT(Expr1014) AS partialagg1022, SUM(Expr1014) AS partialagg1024
    3812      6988  INNER JOIN HASH ON n_nationkey = s_nationkey AND s_suppkey = l_suppkey
   10000     10000  │└SCAN supplier
   95324    171091  FILTER BLOOM(n_nationkey,l_suppkey)
   95324    171091  INNER JOIN HASH ON o_orderkey = l_orderkey
   21152     42896  │└INNER JOIN HASH ON c_nationkey = n_nationkey
      25        25  │ │└SCAN nation
   21152     42896  │ INNER JOIN HASH ON c_custkey = o_custkey
  150000    150000  │ │└SCAN customer
   21209    214017  │ SCAN orders WHERE o_orderdate >= '1995-01-01' AND o_orderdate < '1996-01-01'
   59988    171091  PROJECT l_extendedprice * (1. - l_discount) AS Expr1014
   59988    171091  SCAN lineitem WHERE BLOOM(l_orderkey)

There are some new operators in here that we will get to later. It also seems that the filter on r_name = 'EUROPE' is not even considered useful - at least not at first glance. And what is that filter on lineitem that suddenly appears?

How does this new plan perform against the plans we've seen so far?

Engine                     Total Rows Joined    Rows Hash Build
SQL Server and DuckDB                6533953             254041
PostgreSQL                           7175985            1067194
Hand-rolled                          6495272             280732
SQL Server Column Store               599100             373992

The query plan we get here is amazing! It is not just a little bit better than the others - it does 10x less work! That is the difference between a ruinous cloud bill on a complex scale-out engine and running your entire business on a laptop.

This is why you pay your database specialist well — because they save you money!

Summary

Today we've seen that optimisers can use the transitive-closure properties of joins to move filters around in the query. Optimisers that do this well can significantly reduce the amount of work needed to run the query.

We also saw that there appears to be "something else" going on when we enable certain optimisations in SQL Server. We've seen that "something else" before when we looked at Yellowbrick query plans. Whatever trick is being played, it makes a huge difference to query planning.

It is late here, so I will have to leave you with the lessons learned from today's blog. Soon we will talk about bloom filters - which is the name of the "something else" we saw SQL Server do.

Until then, keep analysing and keep the debate going...