Dear readers.
I am delighted to announce that I have joined the company Floe.
Floe will be building a disaggregated query optimiser, a new execution engine
and a caching layer that will make Iceberg suck less in the cloud.
It ties in perfectly with my vision and my deep interest in query optimisation.
We believe that it is possible, with some clever engineering, to run ad-hoc
queries directly on top of your lakehouse.
Iceberg and Parquet, for all their flaws, have shown us a fascinating path forward for the
database industry: Disaggregation.
Apache Arrow is quickly moving us in the direction of common interchange formats inside the database
and on the wire.
It's now possible to imagine a future where databases aren't single systems from one vendor, but made by combining
multiple components, from different contributors, into a single coherent system.
This idea isn't new, and I claim no credit for observing it.
But I'd like to share my perspective on it — since that's what I do here.
Welcome back to the TPC-H series, dear reader.
And happy holidays to those of you who've already shut down.
In today's educational blog, I'm going to teach you about:
The importance of histograms
When not to do bushy joins
Functional dependencies and how they speed up queries
Bloom filters
This is a lot of ground to cover in the around 5-15 minutes I have your attention.
Every deep dive starts at the surface — let us jump into the deep sea.
Today's query will give us a new insight about about query optimisers — because one of the joins contains a little extra surprise: Composite key joins.
We will also learn about a new, strong optimisation that we haven't seen before: Aggregating before joining.
This is the first time we encounter some series work on partsupp and its strange relationship to lineitem
In today's look at TPCH-H Q08, we're going to once again walk through an optimal join
order example.
The query is quite boring, but it does allow me to talk about a few other things related
to optimiser costing: Join Algorithms (or as some would call them: "Physical operations").
It is time to resume the TPC-H series and look at Query 7.
We will learn about how query optimisers can decompose filters and reason about
the structure of expressions to reduce join work.
This query is also a good way to teach us about how query optimisers use statistics and
constraints.
This is the first blog where we can now use SQL Arena to look at
query plans.
The SQL Arena is an ongoing project where I am using the tooling I have written to generate
comparable query plans between various database engines.
All the work is open source, details in the link above.
Today we are looking at a Q04 — which on the surface is similar to Q17.
Like Q17, Q04 has a correlated subquery that can be de-correlated using a join.
But sometimes, a regular INNER JOIN is not enough to de-correlate a query.
You need special join types that don't appear as part of the standard SQL syntax.
There are some in our industry who can read a schema and instantly see the joins.
They dream in sets, write SQL queries routinely and reach for window functions without hesitation.
And then, there are the others...
For them, the problem isn’t laziness, or stubbornness, or lack of training.
Tragically, they were born this way.