Database Doctor
Writing on databases, performance, and engineering.

TPC-H Query 11 - Diving into Statistics

Times have been busy after joining Floe. I highly recommend you check out our blog there to see what we are up to.

I am not giving up on my TPC-H series! Today we are half-way through the workload and have arrived at Query 11. If I am succeeding in what I set out to do, regular readers now have a good grasp on:

Today's query is pretty simple. Your new skills will let you find the optimal query plan easily.

I am going to take this chance to talk about statistics and how they relate to Query 11. We will also be talking more about bloom filters and what they can do for your analytical workload.

Query 11

The query deals with a single nation via the n_name filter. Interestingly, the subquery is not correlated with the outer query - which makes things a lot simpler.

SELECT ps_partkey,
       SUM(ps_supplycost * ps_availqty) AS v
FROM tpch.partsupp
INNER JOIN tpch.supplier
  ON ps_suppkey = s_suppkey
INNER JOIN tpch.nation
  ON s_nationkey = n_nationkey
WHERE n_name = 'JAPAN'
GROUP BY ps_partkey
HAVING SUM(ps_supplycost * ps_availqty)
         > (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001
            FROM tpch.partsupp
            INNER JOIN tpch.supplier
              ON ps_suppkey = s_suppkey
            INNER JOIN tpch.nation
              ON s_nationkey = n_nationkey
            WHERE n_name = 'JAPAN')
ORDER BY v DESC

Notice that the subquery in the HAVING clause is touching exactly the same rows as the outer query.

Filter Selectivity

Two filters are present in the query. Their selectivity depends somewhat on data generation and on the random pick of n_name.

Filter                                   Selectivity  Cardinality
n_name = 'JAPAN'                         4%           1
SUM(...) > (SELECT SUM(...) * 0.0001)    ~4%          1196

The last filter can only be applied after we group, so the only filter we need to worry about when planning the joins is the one on n_name.
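If you want to check the numbers yourself, a quick sanity query along these lines reproduces the reduction the n_name filter gives us (exact counts depend on the scale factor and the randomly picked nation):

SELECT COUNT(DISTINCT s_suppkey) AS matching_suppliers,
       COUNT(*)                  AS matching_partsupp_rows
FROM tpch.partsupp
INNER JOIN tpch.supplier
  ON ps_suppkey = s_suppkey
INNER JOIN tpch.nation
  ON s_nationkey = n_nationkey
WHERE n_name = 'JAPAN'

At scale factor 1, expect roughly 1/25 of the 10000 suppliers and of the 800000 partsupp rows.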

Use your already acquired skill to find the optimal query plan before reading on.

Optimal Join Order

For once, lineitem is not present in the query - the largest table is instead partsupp. As per our heuristics, we let that table be the driving table in the tree (the one that probes into the others).

Reaching n_name can only be done by joining via supplier.

We are faced with a choice between two possible plans: a left-deep plan where partsupp probes a hash table built on all of supplier and the result is then joined to the filtered nation, or a bushy plan where supplier is first joined to the filtered nation and partsupp probes the hash table built on that much smaller result.

There are 10000 rows in supplier, and we know from statistics that the filter on nation leaves only 1/25 of these. That means the hash table we build in the bushy join must contain roughly 10000/25 ≈ 400 rows – which makes the bushy join almost certainly the better choice. But how exactly do we know that without running the query?

Statistics and why they matter

Query optimisers rely on statistics sampled from the data to make good decisions. One of those statistics is the number of unique/distinct values in a column. In the case of Query 11, statistics tell us that there are 25 different values of n_name.

We also know that there are 25 rows in the table nation. That must then mean that if we filter on an exact value of n_name (as we do here) then we can either get zero or one row from nation.

Most query optimisers make the assumption that if you ask for something, it is probably going to be there, so the estimate is 1 row.

All the engines in SQL Arena agree: the number of rows coming out of nation is estimated as 1.
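You can see the estimate for yourself with a plain EXPLAIN. A minimal sketch (the cost numbers and the exact scan operator will vary with your setup):

EXPLAIN SELECT * FROM tpch.nation WHERE n_name = 'JAPAN'

--- Seq Scan on nation  (cost=... rows=1 width=...)
--- 25 rows / 25 distinct values = 1 estimated row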

PostgreSQL and SQL Server do even better: they know for a fact that the count is 1 row, not 0. How do PostgreSQL and SQL Server know that? They collect additional statistics about the data.

Both PostgreSQL and SQL Server sample a histogram of the values in each column.

In the case of PostgreSQL, we can query it like this:

SELECT n_distinct, cardinality(histogram_bounds) FROM pg_stats
WHERE tablename = 'nation'
  AND attname = 'n_name'

--- returns:
--- n_distinct | cardinality
--- -----------+-------------
---         -1 |         25

The value -1 is another Postgres'ism: negative numbers express the distinct count as a fraction of the table's row count, and -1 simply means "all the values are unique".
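For a column with only a handful of repeated values, you get a positive count instead. A quick sketch (assuming ANALYZE has run on the table) that should report about 25 for s_nationkey:

SELECT n_distinct
FROM pg_stats
WHERE tablename = 'supplier'
  AND attname = 's_nationkey'

--- n_distinct
--- ----------
---         25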

We can inspect histogram_bounds directly to see what it contains. We could even, if we felt like doing self harm, query the individual values with something like this:

SELECT v.*
FROM pg_stats
CROSS JOIN LATERAL unnest(histogram_bounds::TEXT::TEXT[]) AS v
WHERE tablename = 'nation'
 AND attname = 'n_name'

Yeah... ugly...

histogram_bounds is of the anyarray type - so the above seems to be the way to unnest it. Thanks to Yurii Rashkovskii for showing me that you can cast the raw array to text and then cast it back to an array.

Guessing cardinality of (supplier ⨝ nation) with MCV

Now that we know the filter on nation returns 1 row, good statistics will tell us with near certainty that the reduction in rows (to 1/25) applies to supplier too.

How do we know? Let us use PostgreSQL as a case study.

PostgreSQL collects something called "Most Common Value" (MCV) for each column. By default, it will collect the 100 most common values in the column - this is configurable. Since the cardinality of nation is only 25, the most common values of s_nationkey completely cover the range of values in n_nationkey.

We can list those values with something like this:

SELECT MIN(freq), MAX(freq), COUNT(*)
FROM pg_stats
CROSS JOIN LATERAL unnest(most_common_freqs) AS f(freq)
WHERE tablename = 'supplier'
  AND attname = 's_nationkey';

This gives us the exact frequency range for s_nationkey values:

min     max     count
0.0362  0.0438  25
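
We can combine those frequencies with the row-count estimate in pg_class to reproduce the optimiser's number. A sketch along these lines - assuming JAPAN has n_nationkey = 12, as in the standard dbgen data - should land somewhere around 400:

SELECT c.reltuples * f.freq AS estimated_suppliers
FROM pg_class AS c,
     pg_stats AS s,
     LATERAL unnest(s.most_common_vals::TEXT::TEXT[], s.most_common_freqs) AS f(val, freq)
WHERE c.relname = 'supplier'
  AND s.tablename = 'supplier'
  AND s.attname = 's_nationkey'
  AND f.val = '12'   --- n_nationkey of JAPAN in the standard data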

We also know that there is a primary / foreign key relationship between supplier and nation.

Thus, we (and the optimiser) can reason this way: every supplier row references exactly one nation, the MCV list tells us that each n_nationkey value accounts for roughly 4% of supplier, and the filter on n_name keeps exactly one nation. Joining supplier against that single nation row should therefore keep roughly 4% of the 10000 suppliers - about 400 rows.

All this with some simple statistics in our metadata.

PostgreSQL estimation example

Here is PostgreSQL hitting a home run:

Estimate    Actual  Operator
       -         SEQUENCE
       1         1  ├─GROUP BY SIMPLE AGGREGATE SUM(ps_supplycost * ps_availqty) * 0.0001
       3         3  GROUP BY SIMPLE AGGREGATE PARTIALSUM(ps_supplycost * ps_availqty)
   13333     10053  INNER JOIN HASH ON ps_suppkey = s_suppkey
     400       377  │ │└INNER JOIN HASH ON s_nationkey = n_nationkey    <--- GOOD!
       1         1  │ │ │└TABLE SCAN nation_1 WHERE n_name = 'JAPAN'
   10000     10000  │ │ TABLE SCAN supplier_1
  999999    800000  TABLE SCAN partsupp_1
.... more...

SQL Server and DuckDB get the estimate right the same way.

A note on PostgreSQL plans: I have updated dbprove so it understands init plans (an odd PostgreSQL artefact we will talk about in a second).

ClickHouse - The usual story

Here is ClickHouse failing to make the right plan with its poor estimator:

Actual  Operator
 32493  PROJECT SUM(ps_supplycost * ps_availqty)
 32493  SORT SUM(ps_supplycost * ps_availqty)
 32493  FILTER SUM(ps_supplycost * ps_availqty) > 8654127.262542
 32493  GROUP BY HASH ps_partkey AGGREGATE SUM(ps_supplycost * ps_availqty)
 34557  PROJECT ps_supplycost * ps_availqty
 34557  INNER JOIN HASH ON s_nationkey = n_nationkey
     1  │└TABLE SCAN nation WHERE n_name = 'JAPAN'
799921  INNER JOIN HASH ON ps_suppkey = s_suppkey
 10000  │└TABLE SCAN supplier
800000  TABLE SCAN partsupp

Notice how the bushy join isn't found.

Not the most interesting query to plan - but it is a great query for illustrating something else.

Bloom Filters

We know that the optimal plan is to bushy join (supplier ⨝ nation). We also know that this reduces the number of rows to only 4% of the original table.

Could we perhaps use this information to also reduce the scan of partsupp - the largest table in the query?

The answer is yes, and the trick is called bloom filters. My colleague Makar Ivashko just wrote a blog post about those filters - go read that if you don't know what a bloom filter is. Even if you do, the post shows some very cool optimisation tricks you can pull off when you know as much about CPU pipelines as Makar does.

... Welcome back – now you know how a bloom filter works. How can it help us in our query?

We know that only 4% of all suppliers match the filter on n_name. That also means that only 4% of partsupp match those suppliers. In our query, we join to establish that – a total of 800K rows joined in our little 1GB dataset. But it turns out that we don't really need to join all those rows if we use bloom filters.

What if we did this: while building the hash table for (supplier ⨝ nation), we also build a small bloom filter over the surviving supplier keys, and push that filter down into the scan of partsupp, so that rows which cannot possibly join are discarded before they ever reach the hash join?

There are only around 400 suppliers that match the filter on n_name. That means that even a very small bloom filter can save us a lot of work.

Neither PostgreSQL, SQL Server (in row mode) nor DuckDB uses bloom filters to support this query. DuckDB has experimental support for them, but at the time of writing they are not in mainline DuckDB.

Floe makes heavy use of bloom filters for query execution. We don't have our engine fully running at beta level yet (follow us to keep track of that). But we do have prototypes of the bloom filters, so I can tell you what the plan currently looks like.

It looks like this (the bloom filter can also be used on supplier, so you get a reduction there too):

 10053  INNER JOIN HASH ON ps_suppkey = s_suppkey
   377  │└INNER JOIN HASH ON s_nationkey = n_nationkey
     1  │ │└TABLE SCAN nation_1 WHERE n_name = 'JAPAN'
   399  TABLE SCAN supplier  <--- Bloom happens here from nation
 58281  TABLE SCAN partsupp    <--- And here from supplier ⨝ nation

That's a nice reduction compared to:

 10053    INNER JOIN HASH ON ps_suppkey = s_suppkey
   377    │└INNER JOIN HASH ON s_nationkey = n_nationkey
     1    │ │└TABLE SCAN nation WHERE n_name = 'JAPAN'
 10000    TABLE SCAN supplier
800000    TABLE SCAN partsupp

Detour: Init plans and canonical forms

SQL Arena will be updated with shiny new Postgres 18 plans soon. While working on that, I realised I had to expand my EXPLAIN tool to understand init plans.

What are they?

Consider this part of Query 11:

GROUP BY ps_partkey
HAVING SUM(ps_supplycost * ps_availqty)
        --- Subquery here
         > (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001
            FROM tpch.partsupp
            ... etc...

This subquery returns a constant, scalar value. It is not correlated with the outer query - it is a constant value for the duration of the outer query execution.

For cases like this, PostgreSQL uses a concept called "init plans". The idea is that a single query is split into a series of plans: the smaller ones - the init plans - are evaluated once and their results are fed into the main plan as constants; together they make up the full query plan.

It looks like this in PostgreSQL EXPLAIN, and it is awful because your execution engine now needs to understand this concept:

  Sort Key: (sum((partsupp.ps_supplycost * (partsupp.ps_availqty)::numeric))) DESC
  InitPlan 1  <--- That's the init plan
    ->  Finalize Aggregate  (cost=23727.91..23727.92 rows=1 width=32)
          ->  Gather  (cost=23727.69..23727.90 rows=2 width=32)

What's the alternative?

You could just express the subquery as a simple cross join, like DuckDB and SQL Server do:

Operator
  v--- Just a join, no need for init plans
INNER JOIN LOOP ON CAST(SUM(ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))) AS DECIMAL(38,6)) > SUBQUERY
│└GROUP BY SIMPLE AGGREGATE "first" #0, count_star()
PROJECT SUM(ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))) * 0.0001
GROUP BY SIMPLE AGGREGATE SUM(#0)
PROJECT ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))
INNER JOIN HASH ON ps_suppkey = s_suppkey
│ │└INNER JOIN HASH ON s_nationkey = n_nationkey
│ │ │└TABLE SCAN nation WHERE n_name = 'JAPAN'
│ │ TABLE SCAN supplier
TABLE SCAN partsupp WHERE ps_suppkey >= 1
v--- The thing that PostgreSQL turns into an init plan
GROUP BY HASH #0 AGGREGATE SUM(#1)
PROJECT ps_partkey, ps_supplycost * CAST(ps_availqty AS DECIMAL(18,0))
INNER JOIN HASH ON ps_suppkey = s_suppkey
│└INNER JOIN HASH ON s_nationkey = n_nationkey
│ │└TABLE SCAN nation WHERE n_name = 'JAPAN'
TABLE SCAN supplier
TABLE SCAN partsupp WHERE ps_suppkey >= 1

Notice how the engine understands that the HAVING subquery can be replaced with a join to a single row. Of course, for that to be effective, such joins need to be treated specially by the execution engine of the database. But by bringing the query into this "pure join" form, you greatly reduce the number of special cases you need to deal with in the execution engine.
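In plain SQL, the rewrite amounts to something like this - a hand-written sketch of the idea, not the exact form any of the engines produce. The HAVING subquery becomes a one-row derived table that the grouped outer query is cross joined against:

SELECT agg.ps_partkey, agg.v
FROM (SELECT ps_partkey,
             SUM(ps_supplycost * ps_availqty) AS v
      FROM tpch.partsupp
      INNER JOIN tpch.supplier
        ON ps_suppkey = s_suppkey
      INNER JOIN tpch.nation
        ON s_nationkey = n_nationkey
      WHERE n_name = 'JAPAN'
      GROUP BY ps_partkey) AS agg
CROSS JOIN (SELECT SUM(ps_supplycost * ps_availqty) * 0.0001 AS threshold
            FROM tpch.partsupp
            INNER JOIN tpch.supplier
              ON ps_suppkey = s_suppkey
            INNER JOIN tpch.nation
              ON s_nationkey = n_nationkey
            WHERE n_name = 'JAPAN') AS sub
WHERE agg.v > sub.threshold
ORDER BY agg.v DESC

Both sides are ordinary aggregates, so once the engine knows how to join and aggregate, it needs no extra machinery for scalar subqueries.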

The advantage is that every query is just that: one query!

This idea of turning complex queries into simpler forms is called "canonicalisation" and it greatly simplifies engines.

You can read more examples of canonical forms in this blog post:

Summary

Today, I wandered off the path a bit to talk about how optimisers and execution engines actually do the work they do.

You learned:

This prepares us for the next post about Query 12, where we will talk about aggressively filtering large tables.

Have a productive week, and see you all soon.