In today's look at TPCH-H Q08, we're going to once again walk through an optimal join
order example.
The query is quite boring, but it does allow me to talk about a few other things related
to optimiser costing: Join Algorithms (or as some would call them: "Physical operations").
Our database journey makes a brief stop. We need to appreciate an important design decision every database must make: Should I use row or batch execution?
Depending on the database vendor - you may get one or the other - or even both. What are the trade-offs? Let's take a look.
I recommend reading my previous posts before exploring this one, if you have not done so already:
Part 1: Joins
Part 2: Aggregates
In preparation for exploring the design space, we must now consider a slightly more complicated query than those we've seen so far:
SELECT a.x, SUM(b.y)
FROM a
INNER JOIN b ON a.x = b.y
GROUP BY a.x
This query does two operations:
Join a.x = b.y
Aggregate function SUM on b.y using grouping column a.x
There are a few more operations in play here that you may not think about as operations yet:
Finding rows in a for the join
Finding rows in b for the join
Sending rows to the client
Depending on the access method (=algorithm) chosen for the join, what exactly happens in that "finding" of rows depends on the database. For our example, I will assume a hash join is used.
As you can see at this point, writing out what a query does and how it does it, gets tedious and wordy. We are only looking at a simple query so far - and using plain English to explain what it does is getting complicated.
Because database people live with databases every day, the good database vendors have come up with another way to render what happens in queries. With a bit of training, you can easily learn how to read this.