Iceberg and Parquet, for all their flaws, have shown us a fascinating path forward for the
database industry: Disaggregation.
Apache Arrow is quickly moving us in the direction of common interchange formats inside the database
and on the wire.
It's now possible to imagine a future where databases aren't single systems from one vendor, but made by combining
multiple components, from different contributors, into a single coherent system.
This idea isn't new, and I claim no credit for observing it.
But I'd like to share my perspective on it, since that's what I do here.
When talking about Data Lakes and how people access them, we must address some of the misconceptions that made
them popular in the first place.
One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from
discussions on LinkedIn) that it costs less CPU to turn your data model into a flat table and serve that to users
than it does to join every time you access the table.
It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle
the HTTP overhead) and nearly infinite storage. But is there really such a thing as "sacrificing disk space to save
the CPU cost of joins"?
It is difficult to find words that accurately describe the cruelty, selfishness and outright evil on display from
the White House these days. The guiding principle of Gordon Gekko: "Greed.. is GOOD", has
finally reached a crescendo. As long as you are greedy, you can be above the law, and the sickness of our society
is reflected in the mentally deranged leaders we elect.
I think we must accept that it is possible to have a coherent world interpretation
through the lens of "Might makes Right". It's the way sociopaths view others,
it is the system that dictators will have you accept. In such a world view, "trust" simply does not
exist, except in the short-term wielding of power to create fear. In a "might makes
right" world, every transaction is a zero-sum game. To have a winner, there must be a loser.
You can choose to submit yourself to might, or you can fight back. Today, I want to talk about a few
bugs in the "might is right" world view that we can exploit as engineers. Hopefully, we can grapple
together with how we can collectively dignify our species again.
Does breaking systems into smaller parts you can understand individually really make them easier to manage and scale?
Today, we explore the pitfalls of this obsession and draw lessons from nature and other
fields of engineering. As always, this will be a controversial take. If it weren't, why would
you bother reading this instead of some LLM-generated nonsense on LinkedIn?
Python has now infected computer science departments and data analysts across the planet. The resulting ecosystem is
a mess of libraries that are often poorly designed or outright harmful.
Recently, I have had to write a few libraries of my own and this has taught me a lot about what makes a good Python
library. In this series of blog entries, I will share this knowledge and tell you about the lessons I learned so
you don't have to suffer through them.