Posts with tag: query-planning

Introducing the TPC series - TPC-H Query 1: Column Storage and Local Aggregation

July 14, 2025

After the wonderful feedback on the previous blog about Iceberg - it is now time to switch gears.

Databases are more than row storage engines. They are algorithm machines, helping programmers solve highly scalable, tricky problems that would otherwise not be discovered until the data is in production. They do this by using their big bag of heuristics and tuning methods - a form of distilled computer science sprinkled onto your application.

Benchmarks, particularly those made by the TPC council, are co-evolving with the database industry. As new optimisations are discovered, benchmarks are updated to test for these optimisations to see just how far database can be pushed. This kind of benchmarking allows database engine developers to track how they are doing against the state of the art.

Just like in biology, we can learn a lot about the species of today by studying their fossils and ancestry. Database benchmarks are the predators of the savannah - the environment that databases evolved to survive in. Each little detail of a benchmark gives us a clue about the genetic makeup of databases - even the ones in present time.

Our first visit to history is the TPC-H benchmark - come with me on a journey to discover the origins of our data DNA.

Databases are Just Loops - Row and Batch execution

May 30, 2024

Our database journey makes a brief stop. We need to appreciate an important design decision every database must make: Should I use row or batch execution?

Depending on the database vendor - you may get one or the other - or even both. What are the trade-offs? Let's take a look.

I recommend reading my previous posts before exploring this one, if you have not done so already:

Part 1: Joins
Part 2: Aggregates

In preparation for exploring the design space, we must now consider a slightly more complicated query than those we've seen so far:

SELECT a.x, SUM(b.y)
FROM a
INNER JOIN b ON a.x = b.y
GROUP BY a.x

This query does two operations:

Join a.x = b.y
Aggregate function SUM on b.y using grouping column a.x

There are a few more operations in play here that you may not think about as operations yet:

Finding rows in a for the join
Finding rows in b for the join
Sending rows to the client

Depending on the access method (=algorithm) chosen for the join, what exactly happens in that "finding" of rows depends on the database. For our example, I will assume a hash join is used.

As you can see at this point, writing out what a query does and how it does it, gets tedious and wordy. We are only looking at a simple query so far - and using plain English to explain what it does is getting complicated.

Because database people live with databases every day, the good database vendors have come up with another way to render what happens in queries. With a bit of training, you can easily learn how to read this.

What is Cost in a Query Planner?

December 5, 2021

If you ever looked at query plan from a SQL databases - you are likely to have come into something called "cost". You may even have heard that most advanced database use "cost based optimisers" But what exactly is cost and how does the database use it?