Recent Posts - page 1

TPC-H Query 12 and Query 14 - The limits of Estimation

May 31, 2026

Today, we are going to talk about estimating filter selectivity and how to fail at it. Two queries are similar enough that we will analyse them together: Q12 and Q14.

Like listening to a song from the '80s, it is good to be back to TPC-H. I have been busy doing work with Floe. If you are a regular reader of this blog, I think you will like this article I wrote recently:

Joining in Spark and Databricks

How LLMs changed my tooling, flow states and repo structure

March 24, 2026

Like most programmers - I am highly opinionated about what tools I use and how I want those tools to work. There have been very few changes to my preferences for at least 15 years. But with the coming of LLMs - I can no longer defend my old habits. This old dog had to learn new tricks - this is my story.

TPC-H Query 11 – Diving into Statistics

February 18, 2026

Times have been busy after joining Floe. I highly recommend you check out our blog there to see what we are up to.

I am not giving up on my TPC-H series! Today we are half-way through the workload and have arrived at Query 11. If I am succeeding in what I set out to do, regular readers now have a good grasp on:

The large, often orders of magnitude, impact of good Query Optimisation.
How to come up with good query plans manually and validate those made by machines.
A basic grasp of statistics and what they mean for query planners.

Today's query is pretty simple. Your new skills will let you find the optimal query plan easily.

I am going to take this chance to talk about statistics and how they relate to Query 11. We will also be talking more about bloom filters and what they can do for your analytical workload.

I am joining Floe

February 2, 2026

database

Dear readers. I am delighted to announce that I have joined the company Floe.

Floe will be building a disaggregated query optimiser, a new execution engine and a caching layer that will make Iceberg suck less in the cloud. It ties in perfectly with my vision and my deep interest in query optimisation. We believe that it is possible, with some clever engineering, to run ad-hoc queries directly on top of your lakehouse.

And we have got the team to pull it off!

Database Services and Disaggregation

January 1, 2026

Iceberg and Parquet, for all their flaws, have shown us a fascinating path forward for the database industry: Disaggregation. Apache Arrow is quickly moving us in the direction of common interchange formats inside the database and on the wire.

It's now possible to imagine a future where databases aren't single systems from one vendor, but made by combining multiple components, from different contributors, into a single coherent system.

This idea isn't new, and I claim no credit for observing it. But I'd like to share my perspective on it — since that's what I do here.

TPC-H Query 10 - Histograms and Functional Dependency

December 20, 2025

Welcome back to the TPC-H series, dear reader. And happy holidays to those of you who've already shut down.

In today's educational blog, I'm going to teach you about:

The importance of histograms
When not to do bushy joins
Functional dependencies and how they speed up queries
Bloom filters

This is a lot of ground to cover in the roughly 5-15 minutes I have your attention. Every deep dive starts at the surface, so let us jump into the deep sea.

TPC-H Query 9 - Composite Key Joins and Pre-aggregation

December 8, 2025

Today's query will give us a new insight about query optimisers, because one of the joins contains a little extra surprise: Composite key joins. We will also learn about a new, strong optimisation that we haven't seen before: Aggregating before joining.

This is the first time we encounter some serious work on partsupp and its strange relationship to lineitem.

Let us proceed in the familiar way.

TPC-H Query 8 - Loops, merges and hash

December 1, 2025

In today's look at TPC-H Q08, we're going to once again walk through an optimal join order example.

The query is quite boring, but it does allow me to talk about a few other things related to optimiser costing: Join Algorithms (or as some would call them: "Physical operations").

TPC-H Query 7 - Optimiser Reasoning

November 26, 2025

It is time to resume the TPC-H series and look at Query 7.

We will learn about how query optimisers can decompose filters and reason about the structure of expressions to reduce join work.

This query is also a good way to teach us about how query optimisers use statistics and constraints.

This is the first blog where we can now use SQL Arena to look at query plans. The SQL Arena is an ongoing project where I am using the tooling I have written to generate comparable query plans between various database engines. All the work is open source, details in the link above.

TPC-H Query 6 - Expression Optimisation

October 13, 2025

And now, for something completely different.

This week on TPC-H query analysis - we are not going to look at join ordering. Today's query does not have any joins.

But as we shall see, there is still performance to be had for a clever query optimiser.