Iceberg, The Right Idea - The Wrong Spec - Part 2 of 2: The Spec
Let us finally look at what is so wrong with the Iceberg spec and why this simply isn't a serious attempt at solving the metadata problem of large Data Lakes.
Iceberg: The great unifying vision finally allowing us to escape the vendor lock-in of our database engines. One table and metadata format to find them ... And in the darkness bind them!
I love the idea! But I loathe the spec.
In this post, I’ll explain why you should be deeply skeptical of Iceberg as it exists today. I’ll argue that its design flaws are so severe that we are watching a new Hadoop-style disaster unfold in real time.
My reaction is visceral, but not simple. It requires a historical lens, which is why we must split this into a historical part and a current-day part. (Also, splitting it gives LinkedIn followers time to bring out their pitchforks.)
In my last post about high-speed DML, I talked about how it is possible to modify tables at the kind of speeds a modern SSD can deliver. I sketched an outline of an algorithm that can easily get us into the >10GB/sec INSERT speed range (with an array of SSDs). With the right hardware and a low-latency transaction log - we may even reach into the 100GB/sec range for DML. That's without scaling out to multiple nodes.
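To put those numbers in perspective, here is a rough back-of-envelope calculation. The per-drive throughput and efficiency figures are my own assumptions for a typical PCIe 4.0 NVMe drive, not measurements from the original post:

```python
# Back-of-envelope: how many NVMe drives does it take to feed >10 GB/sec of INSERTs?
# The numbers below are assumptions for a typical PCIe 4.0 NVMe SSD, not benchmarks.
SEQ_WRITE_GB_PER_SEC = 5.0   # sustained sequential write per drive (assumed)
DRIVES_IN_ARRAY = 8          # a modest striped array

raw_bandwidth = SEQ_WRITE_GB_PER_SEC * DRIVES_IN_ARRAY
print(f"Raw array write bandwidth: {raw_bandwidth:.0f} GB/sec")

# Even if the storage engine only achieves half the raw bandwidth
# (page headers, checksums, transaction log writes), we still clear
# the 10 GB/sec mark quoted above with room to spare.
ENGINE_EFFICIENCY = 0.5      # assumed overhead factor
print(f"Achievable INSERT rate: {raw_bandwidth * ENGINE_EFFICIENCY:.0f} GB/sec")
```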
But we also saw that when we DELETE or UPDATE data, we leave behind tombstones in the data - markers that tell us: "This data is dead and no longer visible". I alluded to the problem of cleaning up these data graves.
It is time to talk about cleaning up these tombstones. Get your shovels ready - because the work we are about to do will be unpleasant.
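To make the idea concrete, here is a minimal sketch of the general technique: rows live in an immutable segment, a DELETE only records the victim's position in a tombstone set, and a later compaction pass rewrites the segment without the dead rows. This is an illustration of the pattern, not the engine from the DML post:

```python
# Minimal illustration of tombstones and their cleanup (compaction).
segment = [  # an immutable run of rows, addressed by position
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
]
tombstones = set()   # positions of rows that are dead but still on disk

def delete(position: int) -> None:
    """DELETE never touches the segment - it just marks the row as dead."""
    tombstones.add(position)

def scan() -> list:
    """Readers must filter out tombstoned rows on every scan."""
    return [row for pos, row in enumerate(segment) if pos not in tombstones]

def compact() -> None:
    """Cleanup: rewrite the segment without dead rows and drop the tombstones."""
    global segment, tombstones
    segment = scan()
    tombstones = set()

delete(1)              # kill "bob"
print(scan())          # bob is hidden from readers, but still stored on disk
compact()              # the shovel work: physically remove the dead rows
print(len(segment))    # 2 - the grave is gone
```

The unpleasant part, of course, is that compaction means rewriting data you already wrote once - and deciding when and how to do that is where the real work lies.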
After a brief intermezzo about testing (read my thoughts here: Testing is Hard and we often use the wrong Incentives) - it is time to continue our journey into databases and all the wonderful things they can do.
To fully enjoy this blog entry, it will be useful if you first read the previous posts in this series.
Let's get to it...
These days, columnar storage formats are getting a lot more attention in relational databases. Parquet, with its superior compression, is quickly taking over from CSV formats. SAP Hana, Vertica, Yellowbrick, Databricks, SQL Azure and many others all promote columnar representations as the best option for analytical workloads.
It is not hard to see why. But there are tradeoffs.
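To make the compression claim concrete, here is a small sketch that writes the same synthetic table to CSV and to Parquet and compares the file sizes. It assumes pyarrow is installed; the exact ratio depends on your data, but repetitive, low-cardinality columns are where columnar encoding shines:

```python
# Write the same synthetic table as CSV and as Parquet and compare file sizes.
# Requires pyarrow (pip install pyarrow); the ratio will vary with your data.
import csv
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(42)
n = 1_000_000
countries = ["DK", "SE", "NO", "DE", "UK"]
rows = [(i, random.choice(countries), random.randint(1, 100)) for i in range(n)]

# Plain CSV: every value is spelled out as text, row by row.
with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "country", "quantity"])
    writer.writerows(rows)

# Parquet: the columnar layout lets dictionary and run-length encoding, plus
# general-purpose compression (zstd here), work on one column at a time.
table = pa.table({
    "order_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "quantity": [r[2] for r in rows],
})
pq.write_table(table, "sales.parquet", compression="zstd")

print(f"CSV:     {os.path.getsize('sales.csv') / 1e6:.1f} MB")
print(f"Parquet: {os.path.getsize('sales.parquet') / 1e6:.1f} MB")
```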