Iceberg, The Right Idea - The Wrong Spec - Part 2 of 2: The Spec
Let us finally look at what is so wrong with the Iceberg spec and why this simply isn't a serious attempt at solving the metadata problem of large Data Lakes.
In my last post about high speed DML, I talked about how it is possible to modify tables at the
kind of speeds that a modern SSD can deliver. I sketched an outline of an algorithm that
can easily get us into the >10GB/sec INSERT speed range (with an array of SSDs). With the right hardware and a low
latency transaction log - we may even reach into the 100GB/sec range for DML. That's without
scaling out to multiple nodes.
But, we also saw that when we DELETE or UPDATE data, it leaves behind tombstones in the data,
markers that tell us: "This data is dead and no longer visible". I alluded to the problem of cleaning up these data
graves.
It is time to talk about cleaning up these tombstones. Get your shovels ready - because the work we are about to do will be unpleasant.
After a brief intermezzo about testing (read about my thoughts here: Testing is Hard and we often use the wrong Incentives) - it is time to continue our journey together, exploring databases and all the wonderful things they can do.
To fully enjoy this blog entry, it will be useful if you first read the previous posts:
Let's get to it...
We have previously seen that optimising database CPU usage requires a solid foundation in computer science and algorithms. To properly scale queries - you must master single-threaded, classical algorithms as well as modern concurrency and high speed locking.
Today, we take a break from the heavy computer science algorithms and zoom into transaction logs - a crucial component of every database. Our journey is to the land of I/O and hardware knowledge.
On the surface, the problem we try to solve appears simple and straightforward. But at this point, you know that this isn't how we travel together.