Iceberg, The Right Idea - The Wrong Spec - Part 2 of 2: The Spec
Let us finally look at what is so wrong with the Iceberg spec and why this simply isn't a serious attempt at solving the metadata problem of large Data Lakes.
Iceberg: The great unifying vision finally allowing us to escape the vendor lock-in of our database engines. One table and metadata format to find them ... And in the darkness bind them!
I love the idea! But I loathe the spec.
In this post, I’ll explain why you should be deeply skeptical of Iceberg as it exists today. I’ll argue that its design flaws are so severe that we are watching a new Hadoop-style disaster unfold in real time.
My reaction is visceral, but not simple. It requires a historical lens, which is why we must split this into a historical part and a current-day part. (Also, splitting it gives LinkedIn followers time to bring out their pitchforks.)
In my last post about high-speed DML, I talked about how it is possible to modify tables at the kind of speeds a modern SSD can deliver. I sketched an outline of an algorithm that can easily get us into the >10GB/sec INSERT speed range (with an array of SSDs). With the right hardware and a low-latency transaction log - we may even reach into the 100GB/sec range for DML. That's without scaling out to multiple nodes.
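To put those numbers in perspective, here is a rough back-of-envelope calculation. The per-drive throughput and efficiency figures are my own assumptions for a typical PCIe 4.0 NVMe drive, not measurements from the original post:

```python
# Back-of-envelope: how many NVMe drives does it take to feed >10 GB/sec of INSERTs?
# The numbers below are assumptions for a typical PCIe 4.0 NVMe SSD, not benchmarks.
SEQ_WRITE_GB_PER_SEC = 5.0   # sustained sequential write per drive (assumed)
DRIVES_IN_ARRAY = 8          # a modest striped array

raw_bandwidth = SEQ_WRITE_GB_PER_SEC * DRIVES_IN_ARRAY
print(f"Raw array write bandwidth: {raw_bandwidth:.0f} GB/sec")

# Even if the storage engine only achieves half the raw bandwidth
# (page headers, checksums, transaction log writes), we still clear
# the 10 GB/sec mark quoted above with room to spare.
ENGINE_EFFICIENCY = 0.5      # assumed overhead factor
print(f"Achievable INSERT rate: {raw_bandwidth * ENGINE_EFFICIENCY:.0f} GB/sec")
```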
But we also saw that when we DELETE or UPDATE data, we leave behind tombstones in the data - markers that tell us: "This data is dead and no longer visible". I alluded to the problem of cleaning up these data graves.
It is time to talk about cleaning up these tombstones. Get your shovels ready - because the work we are about to do will be unpleasant.
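To make the idea concrete, here is a minimal sketch of the general technique: rows live in an immutable segment, a DELETE only records the victim's position in a tombstone set, and a later compaction pass rewrites the segment without the dead rows. This is an illustration of the pattern, not the engine from the DML post:

```python
# Minimal illustration of tombstones and their cleanup (compaction).
segment = [  # an immutable run of rows, addressed by position
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
]
tombstones = set()   # positions of rows that are dead but still on disk

def delete(position: int) -> None:
    """DELETE never touches the segment - it just marks the row as dead."""
    tombstones.add(position)

def scan() -> list:
    """Readers must filter out tombstoned rows on every scan."""
    return [row for pos, row in enumerate(segment) if pos not in tombstones]

def compact() -> None:
    """Cleanup: rewrite the segment without dead rows and drop the tombstones."""
    global segment, tombstones
    segment = scan()
    tombstones = set()

delete(1)              # kill "bob"
print(scan())          # bob is hidden from readers, but still stored on disk
compact()              # the shovel work: physically remove the dead rows
print(len(segment))    # 2 - the grave is gone
```

The unpleasant part, of course, is that compaction means rewriting data you already wrote once - and deciding when and how to do that is where the real work lies.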
After a brief intermezzo about testing (read my thoughts here: Testing is Hard and we often use the wrong Incentives) - it is time to continue our journey into databases and all the wonderful things they can do.
To fully enjoy this blog entry, it will be useful if you first read the previous posts in this series.
Let's get to it...
These days, columnar storage formats are getting a lot more attention in relational databases. Parquet, with its superior compression, is quickly taking over from CSV formats. SAP Hana, Vertica, Yellowbrick, Databricks, SQL Azure and many others all promote columnar representations as the best option for analytical workloads.
It is not hard to see why. But there are tradeoffs.
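To make the compression claim concrete, here is a small sketch that writes the same synthetic table to CSV and to Parquet and compares the file sizes. It assumes pyarrow is installed; the exact ratio depends on your data, but repetitive, low-cardinality columns are where columnar encoding shines:

```python
# Write the same synthetic table as CSV and as Parquet and compare file sizes.
# Requires pyarrow (pip install pyarrow); the ratio will vary with your data.
import csv
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(42)
n = 1_000_000
countries = ["DK", "SE", "NO", "DE", "UK"]
rows = [(i, random.choice(countries), random.randint(1, 100)) for i in range(n)]

# Plain CSV: every value is spelled out as text, row by row.
with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "country", "quantity"])
    writer.writerows(rows)

# Parquet: the columnar layout lets dictionary and run-length encoding, plus
# general-purpose compression (zstd here), work on one column at a time.
table = pa.table({
    "order_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "quantity": [r[2] for r in rows],
})
pq.write_table(table, "sales.parquet", compression="zstd")

print(f"CSV:     {os.path.getsize('sales.csv') / 1e6:.1f} MB")
print(f"Parquet: {os.path.getsize('sales.parquet') / 1e6:.1f} MB")
```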