Dear readers,
I am delighted to announce that I have joined the company Floe.
Floe will be building a disaggregated query optimiser, a new execution engine
and a caching layer that will make Iceberg suck less in the cloud.
It ties in perfectly with my vision and my deep interest in query optimisation.
We believe that it is possible, with some clever engineering, to run ad-hoc
queries directly on top of your lakehouse.
Iceberg and Parquet, for all their flaws, have shown us a fascinating path forward for the
database industry: Disaggregation.
Apache Arrow is quickly moving us in the direction of common interchange formats inside the database
and on the wire.
It's now possible to imagine a future where databases aren't single systems from one vendor, but are instead assembled by combining
multiple components, from different contributors, into a single coherent system.
This idea isn't new, and I claim no credit for observing it.
But I'd like to share my perspective on it — since that's what I do here.
When talking about Data Lakes and how people access them - we must address some of the misconceptions that made
them popular in the first place.
One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from
discussions on LinkedIn) that it costs less CPU to turn your data model into a flat table and serve that to users -
than it does to join every time you access the table.
It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle
the HTTP overhead) and nearly infinite storage. But is there really such a thing as "sacrificing disk space to save
the CPU cost of joins"?
Let us finally look at what is so wrong with the Iceberg spec and why this simply isn't a serious attempt at solving
the metadata problem of large Data Lakes.
Iceberg: The great unifying vision finally allowing us to escape the vendor lock-in of our database engines. One
table and metadata format to find them ... And in the darkness bind them!
I love the idea! But I loathe the spec.
In this post, I’ll explain why you should be deeply skeptical of Iceberg as it exists today. I’ll argue that its design
flaws are so severe, we're watching a new HADOOP-style disaster unfold in real time.
My reaction is visceral, but not simple. It requires a historical lens, which is why we must split this into a historical
and a current-day part. (Also, splitting it gives LinkedIn followers time to bring out their pitchforks)
In my last post about high speed DML, I talked about how it is possible to modify tables at the
kind of speeds that a modern SSD can deliver. I sketched an outline of an algorithm that
can easily get us into the >10GB/sec INSERT speed range (with an array of SSDs). With the right hardware and a low
latency transaction log - we may even reach into the 100GB/sec range for DML. That's without
scaling out to multiple nodes.
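To get a feel for why those numbers are plausible, here is a back-of-envelope calculation. The drive count, per-drive speed and overhead factor below are illustrative assumptions of mine, not measurements from that post:

# Back-of-envelope: aggregate write bandwidth of a small SSD array.
# Illustrative assumptions only - substitute the numbers for your own hardware.
per_drive_write_gb_s = 6.0   # roughly what a fast PCIe 4.0 NVMe drive sustains sequentially
drive_count = 4              # a modest array in a single node

raw_gb_s = per_drive_write_gb_s * drive_count
usable_gb_s = raw_gb_s * 0.6  # leave headroom for checksums, formatting and the transaction log

print(f"raw: {raw_gb_s:.0f} GB/s, usable: {usable_gb_s:.0f} GB/s")
# raw: 24 GB/s, usable: 14 GB/s - comfortably above the 10 GB/s mark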
But, we also saw that when we DELETE or UPDATE data, it leaves behind tombstones in the data,
markers that tell us: "This data is dead and no longer visible". I alluded to the problem of cleaning up these data
graves.
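To make the marker concrete, here is a minimal sketch of how a tombstone could be represented. The field and function names are mine, purely for illustration - they are not from the earlier post:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:
    key: int
    payload: bytes
    deleted_by: Optional[int] = None   # id of the transaction that deleted this row, if any

def is_visible(row: Row, committed: set) -> bool:
    # A row is dead once the transaction that deleted it has committed.
    return row.deleted_by is None or row.deleted_by not in committed

The dead rows still occupy space on disk and still cost CPU on every scan, which is why they cannot be left lying around.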
It is time to talk about cleaning up these tombstones. Get your shovels ready - because the
work we are about to do will be unpleasant.
After a brief intermezzo about testing (read about my thoughts here: Testing is Hard and we often use the wrong
Incentives) - it is time to continue our journey together, exploring databases and
all the wonderful things they can do.
To fully enjoy this blog entry, it will be useful if you first read the previous posts:
Why are Databases so Hard to Make? Part 1 – CPU usage
Why are Databases so Hard to Make? Part 2 – Logging to Disk
We have previously seen that optimising database CPU usage requires a solid foundation in computer science and algorithms. To properly scale queries - you must master single-threaded, classical algorithms as well as modern concurrency and high-speed locking.
Today, we take a break from the heavy computer science algorithms and zoom into transaction logs - a crucial component of every database. Our journey takes us to the land of I/O and hardware knowledge.
On the surface, the problem we try to solve appears simple and straightforward. But at this point, you know that this isn't how we travel together.
In our previous blogs, we have visited the idea that "databases are just loops". At this point, my dear readers may rightfully ask: "if those databases are indeed just loops - why is it they are so hard to make?" "Why have we not nailed databases once and for all?" The answer is complicated - but I want to take a stab at it.
Today, we look at the CPU usage of databases - delving deeper into what those loops actually look like. We will see how you can use "just loops" thinking to make predictions about which database you want to deploy. You will learn how to sanity-check existing database platforms - to cut through all the bullshit in the market.
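To show what I mean by "just loops" thinking, here is a deliberately naive sketch of a filtered scan. The names are placeholders of my own, not any engine's API:

def scan_filter(table, predicate):
    # "Just a loop": read every row, keep the ones that pass the predicate.
    # Reasoning about how fast this loop can possibly run - rows per second
    # per core, bytes touched per row - is the heart of "just loops" thinking.
    for row in table:
        if predicate(row):
            yield row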
Our database journey makes a brief stop. We need to appreciate an important design decision every database must make: Should I use row or batch execution?
Depending on the database vendor - you may get one or the other - or even both. What are the trade-offs? Let's take a look.
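As a preview, here is a minimal sketch of the two models side by side - the function names are mine and purely illustrative. Row execution pays a per-row overhead in every operator; batch execution amortises that overhead over a chunk of rows:

# Row-at-a-time: the operator is invoked once per row.
def sum_rows(rows):
    total = 0
    for row in rows:              # per-row call overhead, poor cache behaviour
        total += row["y"]
    return total

# Batch (vectorised): the operator is invoked once per chunk of rows.
def sum_batches(batches):
    total = 0
    for batch in batches:         # one call per batch of, say, 1024 values
        total += sum(batch["y"])  # tight inner loop over a column
    return total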
I recommend reading my previous posts before exploring this one, if you have not done so already:
Part 1: Joins
Part 2: Aggregates
In preparation for exploring the design space, we must now consider a slightly more complicated query than those we've seen so far:
SELECT a.x, SUM(b.y)
FROM a
INNER JOIN b ON a.x = b.y
GROUP BY a.x
This query does two operations:
Join a.x = b.y
Aggregate function SUM on b.y using grouping column a.x
There are a few more operations in play here that you may not think about as operations yet:
Finding rows in a for the join
Finding rows in b for the join
Sending rows to the client
What exactly happens in that "finding" of rows depends on the access method (= algorithm) the database chooses for the join. For our example, I will assume a hash join is used.
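If you would like to see roughly what that assumption means, here is a minimal hash join plus aggregation sketch for the query above. Plain dictionaries stand in for the build and probe sides; this is illustrative, not how any particular engine implements it:

from collections import defaultdict

def hash_join_sum(a_rows, b_rows):
    # Build phase: hash table over a, keyed on the join column ("finding rows in a").
    build = defaultdict(list)
    for a in a_rows:
        build[a["x"]].append(a)

    # Probe phase: stream b and look up matches ("finding rows in b").
    sums = defaultdict(int)
    for b in b_rows:
        for a in build.get(b["y"], []):
            sums[a["x"]] += b["y"]        # GROUP BY a.x, SUM(b.y)

    # These result rows are what gets sent to the client.
    return [{"x": x, "sum_y": s} for x, s in sums.items()]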
As you can see at this point, writing out what a query does and how it does it gets tedious and wordy. We are only looking at a simple query so far - and using plain English to explain what it does is getting complicated.
Because database people live with databases every day, the good database vendors have come up with another way to render what happens in queries. With a bit of training, you can easily learn how to read this.