The Database Doctor
Musing about Databases

TPC series - TPC-H Query 6 - Expression Optimisation

And now, for something completely different.

This week on TPC-H query analysis - we are not going to look at join ordering. Today's query does not have any joins.

But as we shall see, there is still performance to be had for a clever query optimiser.

Query 6

Here is the simple query.

SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM tpch.lineitem
WHERE l_shipdate >= '1994-01-01'
  AND l_shipdate < '1995-01-01'
  AND l_discount BETWEEN 0.03 AND 0.05
  AND l_quantity < 24
;

No joins, just a few filters.

As always, we start with a selectivity analysis.

Selectivity of Filters

Filter | Selectivity | Cardinality
l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01' | 14% | 856332
l_discount BETWEEN 0.03 AND 0.05 | 30% | 1799915
l_quantity < 24 | 46% | 2817735

You will notice that the l_shipdate filter is by far the most selective of the three - it lets through only 14% of the rows, while l_quantity < 24 lets through almost half.

Due to the way TPC-H is generated, these filters are statistically independent. That means that when you apply all three filters, you reduce the rowcount coming out of lineitem to: 14% * 30% * 46% ~ 2%.
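
If you want to sanity-check that 2%, the arithmetic is a one-liner in Python:

# Combined selectivity of the three filters, assuming independence
# (which holds for TPC-H generated data)
print(0.14 * 0.30 * 0.46)   # ~0.019 - roughly 2% of lineitem survives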

Predicate Evaluation Order

We aren't allowed indexes on any of these columns in TPC-H. We are, however, allowed to partition the data on l_shipdate if we so desire. But for this blog - which is, after all, educational - I'm going to assume all data is randomly tossed into one large table. The arguments I put forward here still hold with partitioning applied to the workload - the results are just less pronounced the more you optimise. In our test there is no ordering in the insertion of records; all data is completely randomised.

If we think about table scanning as looping over the records in the table, we will want to skip records as fast as we can.

That means we should evaluate the most selective filter first - taking the l_shipdate range before anything else. In Python code, what we want is something like this:

for row in lineitem:
    if not ('1994-01-01' <= row.l_shipdate < '1995-01-01'):
        continue
    if not (0.03 <= row.l_discount <= 0.05):
        continue
    if not (row.l_quantity < 24):
        continue
    emit_row(row)

This idea of bailing out as early as possible is called "short-circuit evaluation".

Given decent statistics - particularly a histogram - a query optimiser should be able to determine that this ordering is the best one.
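
To make that concrete, here is a toy sketch - not how any real optimiser is implemented - of picking the order from the estimated selectivities: with independent predicates, the expected number of predicate evaluations per row is minimised by putting the most selective predicate first.

import itertools

# Estimated fraction of rows each predicate lets through (from the table above)
predicates = {
    "l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01'": 0.14,
    "l_discount BETWEEN 0.03 AND 0.05":                         0.30,
    "l_quantity < 24":                                          0.46,
}

def expected_checks(order):
    """Expected predicate evaluations per row for a given short-circuit order."""
    cost, survivors = 0.0, 1.0
    for p in order:
        cost += survivors              # this predicate runs on whatever survived so far
        survivors *= predicates[p]     # independence assumption
    return cost

best = min(itertools.permutations(predicates), key=expected_checks)
print(best, expected_checks(best))     # shipdate range first -> ~1.18 checks per row

Evaluating l_quantity < 24 first instead costs about 1.6 checks per row - roughly 35% more work, before we even start talking about CPU-level effects.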

Short-Circuit Evaluation vs. Branch Elimination

When you use SIMD vectorisation for execution, you have to ask an important question:
Is it worth short-circuiting filter evaluation or is it better to just evaluate all filters and combine them?

Consider the optimal filter evaluation order again — this time with some comments:

for row in lineitem:
    if not ('1994-01-01' <= row.l_shipdate < '1995-01-01'):
        continue                                         # rejects 86% of all rows

    if not (0.03 <= row.l_discount <= 0.05):
        continue                                         # only reached for 14% of rows

    if not (row.l_quantity < 24):
        continue                                         # only reached for 14% * 30% ~ 4% of rows

    emit_row(row)                                        # 14% * 30% * 46% ~ 2% of rows make it here

Might it be better to do this instead:

for row in lineitem:
    should_emit = True
    should_emit &= (row.l_quantity < 24)
    should_emit &= (0.03 <= row.l_discount <= 0.05)
    should_emit &= ('1994-01-01' <= row.l_shipdate < '1995-01-01')
    if should_emit:
        emit_row(row)                                    # 2% of the time

Why would you ever consider this latter method - isn't it doing more work? Because CPU execution pipelines don't like branches. A CPU will try to guess which instruction comes next ("branch prediction") and start fetching and executing it speculatively. If the CPU guesses right, the pipeline keeps running at full speed. If it guesses wrong, the pipeline has to be flushed and the right instructions fetched - and that wait can be long in CPU terms: tens of cycles for the flush, and far more if the instructions or data are not in cache. The branches in the short-circuit loop depend on the data, so the predictor will guess wrong a lot of the time. It can be cheaper to remove those hard-to-predict branches altogether and pay the CPU cost of doing a few too many integer compares.
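
To illustrate what "evaluate everything and combine" looks like in practice, here is a small NumPy sketch - not any particular engine's implementation, using made-up columns and an assumed encoding of l_shipdate as days since 1992-01-01:

import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Fake columns with roughly TPC-H-like distributions (assumed, for illustration)
l_quantity = rng.integers(1, 51, n)            # 1..50
l_discount = rng.integers(0, 11, n) / 100.0    # 0.00 .. 0.10
l_shipdate = rng.integers(0, 2526, n)          # days since 1992-01-01

# Each predicate is evaluated over the whole column; the bitmaps are then
# AND'ed together. There is no per-row branch for the CPU to mispredict.
mask  = l_quantity < 24
mask &= (l_discount >= 0.03) & (l_discount <= 0.05)
mask &= (l_shipdate >= 731) & (l_shipdate < 1096)   # 1994-01-01 .. 1995-01-01

print(mask.mean())   # ~0.02 - about 2% of the rows survive, as expected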

Incidentally, this reordering of expressions (which the SQL standard allows) is also why this is an error:

SELECT *
FROM foo
WHERE denominator <> 0
  AND numerator / denominator = 0.5 -- This might cause a division by zero error!

The database is not required to evaluate denominator <> 0 before it evaluates numerator / denominator. This is not a bug - it is just the way databases work!

The safe and correct way to do this is:

SELECT *
FROM foo
WHERE CASE WHEN denominator <> 0 THEN numerator / denominator ELSE NULL END = 0.5

Narrow Scans and Column Stores

Here is the DDL for lineitem, with rough comments on how large each column is on disk

CREATE TABLE tpch.lineitem
(
    l_orderkey      INT,               -- 4B
    l_partkey       INT,               -- 4B
    l_suppkey       INT,               -- 4B
    l_linenumber    INT,               -- 4B
    l_quantity      DECIMAL(15, 2),    -- 8B (more in Postgres)
    l_extendedprice DECIMAL(15, 2),    -- 8B
    l_discount      DECIMAL(15, 2),    -- 8B
    l_tax           DECIMAL(15, 2),    -- 8B
    l_returnflag    VARCHAR(1),        -- 3B
    l_linestatus    VARCHAR(1),        -- 3B
    l_shipdate      DATE,              -- 3-4B
    l_commitdate    DATE,              -- 3-4B
    l_receiptdate   DATE,              -- 3-4B
    l_shipinstruct  VARCHAR(25),       -- 27B
    l_shipmode      VARCHAR(10),       -- 12B
    l_comment       VARCHAR(44)        -- 46B
);

The size of a row in this table is roughly ~150B

But, for this query we only need four of those columns:

l_extendedprice    -- 8B
l_discount         -- 8B
l_quantity         -- 8B
l_shipdate         -- 4B

Total ~28B - which is only around 20% of the total row size.

Column stores are a killer optimisation for this query because they allow us to visit only 20% of the data, even if we fully scan the table.
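
The back-of-the-envelope maths, assuming the scale factor 1 lineitem table (about 6 million rows, which is consistent with the cardinalities above):

rows = 6_001_215                       # lineitem rows at scale factor 1 (assumed)
full_row_bytes = 150                   # rough row width from the DDL comments
needed_bytes   = 8 + 8 + 8 + 4         # l_extendedprice, l_discount, l_quantity, l_shipdate

print(rows * full_row_bytes // 10**6)  # ~900 MB touched by a full row-store scan
print(rows * needed_bytes   // 10**6)  # ~168 MB touched by a column-store scan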

Actual Database Runs

I used two different versions of the query to test whether the database can re-order the expressions.

This one, I call Ship First:

-- Ship First

SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM tpch.lineitem
WHERE l_shipdate >= '1994-01-01'
  AND l_shipdate < '1995-01-01'
  AND l_discount BETWEEN 0.03 AND 0.05
  AND l_quantity < 24;

And this one is: Quantity First

-- Quantity First

SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM tpch.lineitem
WHERE l_quantity < 24
  AND l_discount BETWEEN 0.03 AND 0.05
  AND l_shipdate >= '1994-01-01'
  AND l_shipdate < '1995-01-01';

If the predicates are evaluated in the order they are written, then we expect Ship First to be faster - it puts the most selective filter first.

Let me stress this: As a user, I shouldn't care. The two queries should have identical speed.

In the following, I will run every query 10 times, to make sure run-to-run fluctuations don't skew the results.

To see if databases actually do these optimizations - and if they even matter - I wanted my first test to be a really slow database where any runtime difference would be clearly visible. PostgreSQL is the obvious choice given those constraints!

PostgreSQL Results

First, the data. Timings are measured with EXPLAIN ANALYSE, taking the Execution Time: line of the output. The initial run was discarded to remove any I/O cost from the equation.

Run | Ship First | Quantity First
1 | 272 ms | 340 ms
2 | 277 ms | 345 ms
3 | 273 ms | 347 ms
4 | 285 ms | 336 ms
5 | 284 ms | 338 ms
6 | 280 ms | 348 ms
7 | 267 ms | 349 ms
8 | 278 ms | 327 ms
9 | 271 ms | 362 ms
10 | 274 ms | 358 ms
Median | 275.5 ms | 347 ms

There is very clearly a difference in runtime here.

Note that Ship First - the variant that puts the most selective filter first - is also the faster one, which indicates that Postgres evaluates the expression in the order it was written rather than reordering it by selectivity.

We can even see the evaluation order in the EXPLAIN output - each plan lists the filters in exactly the order the query was written. Let me show you the crime against computer science aesthetics that is PostgreSQL explain:

Ship First:

->  Parallel Seq Scan on lineitem_opt  (cost=0.00..58855.72 rows=6 width=64) (actual time=0.021..129.697 rows=24148 loops=5)
      Filter: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01'::date) AND (l_discount >= 0.03) AND (l_discount <= 0.05) AND (l_quantity < '24'::numeric))
      Rows Removed by Filter: 1175616

Quantity First:

->  Parallel Seq Scan on lineitem  (cost=0.00..134871.36 rows=29960 width=12) (actual time=0.055..272.606 rows=24148 loops=5)
      Filter: ((l_quantity < '24'::numeric) AND (l_discount >= 0.03) AND (l_discount <= 0.05) AND (l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01'::date))
      Rows Removed by Filter: 1175616

Different plans, even after "optimisation".

Side note: Rows Removed by Filter is reported as a per-loop average - you have to multiply it by loops=5 to get the real number. Yet another way PG explain manages to confuse.

PostgreSQL: Is it the missing Column Store?

Because I'm such a nice person, I want to give PostgreSQL a chance to redeem itself from the general embarrassment this blog series has caused it.

I want it to be a good database, I really do — I just can't find evidence that it is.

Let us help it out a bit. Remember how I told you that column stores are a killer optimization for a query like this? How about we do this to level the playing field:

CREATE TABLE tpch.lineitem_opt
AS
SELECT l_extendedprice
     , l_discount
     , l_quantity
     , l_shipdate
FROM tpch.lineitem;

If we now use tpch.lineitem_opt for the query, we get: ~190 ms for Ship First (versus 275 ms).

Certainly better, but not what we would expect from reducing scan sizes by 80%.

Let us look at someone who understands performance a bit better...

ClickHouse Results

ClickHouse uses columnar storage and columnar evaluation. It also uses vectorised filter execution - which indicates that it might not short-circuit filters. My initial guess was that the two queries would run at the same speed.

Timings are measured with --time in clickhouse client.

Run | Ship First | Quantity First
1 | 33 ms | 33 ms
2 | 29 ms | 29 ms
3 | 32 ms | 25 ms
4 | 30 ms | 25 ms
5 | 32 ms | 26 ms
6 | 29 ms | 26 ms
7 | 28 ms | 27 ms
8 | 30 ms | 27 ms
9 | 31 ms | 26 ms
10 | 29 ms | 26 ms
Median | 30 ms | 26 ms

Query 6 is the bread and butter of analytical engines — and ClickHouse doesn't disappoint. A massive 10x improvement over PostgreSQL on the same dataset on the same physical machine.

And since we're here anyway and for my audience, who I know love it...

...Quack Quack...

DuckDB Results

Timings are measured with EXPLAIN ANALYSE and .timing on.

Run | Ship First | Quantity First
1 | 26 ms | 31 ms
2 | 24 ms | 25 ms
3 | 24 ms | 27 ms
4 | 27 ms | 26 ms
5 | 25 ms | 26 ms
6 | 25 ms | 26 ms
7 | 25 ms | 26 ms
8 | 26 ms | 26 ms
9 | 26 ms | 26 ms
10 | 25 ms | 27 ms
Median | 25 ms | 26 ms

The difference between the two variants is smaller than in ClickHouse. I ran an unpaired t-test, removing the first Quantity First datapoint (31 ms looks like an outlier). Even with the data corrected in this way, we are still 95% confident that Ship First is faster.
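
For anyone who wants to check the statistics, here is roughly what that test looks like in Python, using the run timings from the table above and SciPy's unpaired t-test:

from scipy.stats import ttest_ind

ship_first     = [26, 24, 24, 27, 25, 25, 25, 26, 26, 25]
quantity_first = [25, 27, 26, 26, 26, 26, 26, 26, 27]   # first run (31 ms) dropped as an outlier

t, p = ttest_ind(ship_first, quantity_first)
print(t, p)   # p comes out below 0.05, so the difference is significant at the 95% level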

Interestingly, DuckDB appears to favour the same evaluation order as PostgreSQL - the variant that puts the most selective filter first wins.

The ballpark runtimes of DuckDB and ClickHouse are very similar here - which is encouraging in terms of how far we have come in execution engine technology.

SQL Server

I have not yet instrumented SQL Server for multiple runs the way I have the other databases.

But, from manually running the two queries and relying on sys.dm_exec_requests reporting, I can see that:

Data | Ship First | Quantity First
Average | 53 ms | 60 ms

Again, we see a clear difference in runtime (note: I can't get the median in SQL Server stats this way)

Take these numbers with a grain of salt - I'm on SQL Server Express, and it can only use 4 of my 14 cores. Assuming linear scaling, SQL Server would land somewhere in the 15 ms range. Pretty respectable for an "old" engine that isn't cool any more.

Summary

Query 6 represents an interesting case of optimising expression evaluation order. This type of optimisation is more typically found in compilers for general-purpose programming languages than in databases.

Although such optimisation is a well-explored subject in the literature, none of the databases I tested showed any ability to reorder the expression - despite there clearly being a significant benefit to doing so.

You can read the rest of the TPC-H series here: