In today's look at TPC-H Q08, we're going to once again walk through an optimal join
order example.
The query is quite boring, but it does allow me to talk about a few other things related
to optimiser costing: Join Algorithms (or as some would call them: "Physical operations").
It is time to resume the TPC-H series and look at Query 7.
We will learn how query optimisers can decompose filters and reason about
the structure of expressions to reduce join work.
This query is also a good way to learn how query optimisers use statistics and
constraints.
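To make that concrete before we dive in, here is a hand-written sketch of the kind of rewrite I mean. The tables and values follow the TPC-H schema with the usual example nation pair, but this is a simplified illustration rather than the full query:

```sql
-- Simplified shape of the Q7 nation predicate: the OR references both
-- sides of the join at once, so it cannot be applied before the join as-is.
SELECT n1.n_name AS supp_nation,
       n2.n_name AS cust_nation
FROM   nation n1,
       nation n2
WHERE  (n1.n_name = 'FRANCE'  AND n2.n_name = 'GERMANY')
    OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE');

-- A good optimiser decomposes the OR into weaker, single-table filters
-- that it can push below the join:
--   n1.n_name IN ('FRANCE', 'GERMANY')
--   n2.n_name IN ('FRANCE', 'GERMANY')
-- Each scan of nation then returns only two rows, and the original OR is
-- re-checked on the (now tiny) join result.
```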
This is the first blog post where we can use SQL Arena to look at query plans.
The SQL Arena is an ongoing project where I am using the tooling I have written to generate
comparable query plans between various database engines.
All the work is open source; details are in the link above.
Today we are looking at Q04, which on the surface is similar to Q17.
Like Q17, Q04 has a correlated subquery that can be de-correlated using a join.
But sometimes, a regular INNER JOIN is not enough to de-correlate a query.
You need special join types that don't appear as part of the standard SQL syntax.
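As a preview, here is a simplified sketch of the pattern against the TPC-H schema (the date filter from the real Q04 is omitted to keep it short):

```sql
-- Correlated form: conceptually, the subquery is re-evaluated for every
-- order, with the outer o_orderkey plugged into it.
SELECT o_orderpriority, COUNT(*) AS order_count
FROM   orders
WHERE  EXISTS (SELECT 1
               FROM   lineitem
               WHERE  l_orderkey   = o_orderkey
                 AND  l_commitdate < l_receiptdate)
GROUP BY o_orderpriority;

-- De-correlated form: the EXISTS becomes a join on l_orderkey = o_orderkey.
-- A plain INNER JOIN would count an order once per matching line item, so
-- the engine uses a semi-join instead, which emits each order at most once -
-- one of those join types you cannot spell directly in standard SQL.
```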
I want to teach you an important skill that will serve you well as a database specialist.
One blog entry is not going to be enough, but here is my goal:
When you look at an SQL query in the future, you should be able to answer this question:
"What query plan do I want this query to have?"
You will also be able to make an educated guess about what join order is best.
As you build up optimisation tricks in your arsenal - you will also be able to make guesses about what additional tricks
the database could play (and you can force those tricks yourself if the database does not play them for you).
Queries can be incredibly complex and still work very well - that is the beauty of SQL.
But, no matter the complexity - the method I'm going to teach you will allow you to reason about the query.
It starts with the selectivity of filters, and Query 3 from TPC-H is a good way to illustrate it.
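To give a flavour of what I mean by selectivity, here is a quick, runnable sanity check against a TPC-H database. The filter values below are the common Q3 defaults; your substitution parameters may differ:

```sql
-- Fraction of each table that survives its Q3 filter. Small fractions mean
-- highly selective filters - the tables you want the optimiser to start from.
SELECT
  (SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING') * 1.0
    / (SELECT COUNT(*) FROM customer)                                   AS customer_selectivity,
  (SELECT COUNT(*) FROM orders   WHERE o_orderdate < DATE '1995-03-15') * 1.0
    / (SELECT COUNT(*) FROM orders)                                     AS orders_selectivity,
  (SELECT COUNT(*) FROM lineitem WHERE l_shipdate  > DATE '1995-03-15') * 1.0
    / (SELECT COUNT(*) FROM lineitem)                                   AS lineitem_selectivity;
```

Roughly speaking (TPC-H data is quite uniform), the market segment filter keeps a small slice of customer while the date filters keep a large chunk of orders and lineitem - so a plan that starts from the most selective side has far less data to drag through the joins.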
The great promise databases make to programmers is: "Tell me what you want and I will figure out the
fastest way to do it."
A database is a computer science engine - it knows tricks and optimisations that the average programmer has
not heard about...
Sometimes...
Some queries look "easy" to programmers — but databases often need to apply a method called de-correlation to
make them effective.
Even back in the 90s, the great minds of the TPC council knew how to design queries that test for this optimisation.
Today we will learn how to spot these cases and what happens when databases fail at optimising them.
When talking about Data Lakes and how people access them - we must address some of the misconceptions that made
them popular in the first place.
One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from
discussions on LinkedIn) that it costs less CPU to turn your data model into a flat table and serve that to users -
than it does to join every time you access the table.
It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle
the HTTP overhead) and nearly infinite storage. But is there really such a thing as "sacrificing disk space to save
the CPU cost of joins"?
After the wonderful feedback on the previous blog about Iceberg - it is now time to switch
gears.
Databases are more than row storage engines. They are algorithm machines, helping programmers solve tricky
scalability problems that would otherwise not surface until the data is in production. They do this by
using their big bag of heuristics and tuning methods - a form of distilled computer science sprinkled onto your
application.
Benchmarks, particularly those made by the TPC council, are co-evolving with the database industry. As new optimisations
are discovered, benchmarks are updated to test for these optimisations to see just how far databases can be pushed. This
kind of benchmarking allows database engine developers to track how they are doing against the state of the art.
Just like in biology, we can learn a lot about the species of today by studying their fossils and ancestry. Database
benchmarks are the predators of the savannah - the environment that databases evolved to survive in. Each little
detail of a benchmark gives us a clue about the genetic makeup of databases - even the ones alive today.
Our first visit to history is the TPC-H benchmark - come with me on a journey to discover the origins of our data DNA.
I have been spending a lot of time thinking about testing - and reviewing tests - lately. At a superficial level, testing looks simple: write a test matrix, code the tests, run them, learn from the results, refine. But we often overlook the nuances.
Today, I take a break from my usual talk about databases to share my thoughts on test incentives, CI/CD, unit testing and system level engineering.