I want to teach you an important skill that will serve you well as a database specialist.
One blog entry is not going to be enough, but here is my goal:
When you look at an SQL query in the future, you should be able to answer this question:
"What query plan do I want this query to have?"
You will also be able to make an educated guess about what join order is best.
As you hoard optimisation tricks in your arsenal, you will also be able to guess what additional tricks
the database could play (and you can force those tricks if the database does not play them for you).
Queries can be incredibly complex and still perform very well; that is the beauty of SQL.
But no matter the complexity - the method I'm going to teach you will allow you to reason about the query.
It starts with the selectivity of filters, and Query 3 from TPC-H is a good way to illustrate it.
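For reference, here is Query 3 (the "Shipping Priority" query) with the validation substitution parameters from the spec; the benchmark returns only the first 10 result rows. The selectivity comments are my own rough estimates for the standard data generator - notice that each filter hits a different one of the three tables:

```sql
SELECT l_orderkey,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue,
       o_orderdate,
       o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'       -- one of 5 segments: keeps ~20% of customer
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < DATE '1995-03-15' -- keeps roughly half of orders
  AND l_shipdate > DATE '1995-03-15'  -- keeps roughly half of lineitem
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate;
```

Ask yourself: given those three selectivities, which table would you drive the joins from?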
Today's blog is written in collaboration with my friend Martin Broholm Andersen.
Martin and I are both C++ programmers, but we come from different programming traditions.
These differences provide us with a rich source of subjects to argue about.
We both learn a lot from these arguments.
I've been working for years on old CMake projects (from the 2.x era). Martin has been busy moving legacy
.sln / .vcxproj projects at his workplace to modern, CMake 3.x style projects.
His reasons for moving to CMake weren't to be able to cross-compile or lock down binaries - but to scale,
reorganize and merge the large code base he manages.
Martin is a Visual Studio user; I use CLion.
We hope our combined experience will serve you well no matter what tooling you prefer.
Martin's knowledge of modern CMake far exceeds mine. As he was melding his mind into mine, we both
learned (and argued) a lot. This blog contains the result of our discussions, and our final agreement on best practices
for creating modern-style repos that use CMake 3.x.
We expect that this blog will be updated and expanded to serve as a reference for both experienced developers and new
users who are just getting started with CMake and C++.
The repo containing our template CMake project can be found on GitHub:
Modern CMake.
As is the tradition on Database Doctor - I encourage disagreement and discussion.
The great promise databases make to programmers is: "Tell me what you want and I will figure out the
fastest way to do it."
A database is a computer science engine - it knows tricks and optimisations that the average programmer has
not heard of...
Sometimes...
Some queries look "easy" to programmers - but databases often need to apply a method called de-correlation to
execute them efficiently.
Even back in the '90s, the great minds of the TPC council knew how to design queries that test for this optimisation.
Today we will learn how to spot these cases and what happens when databases fail at optimising them.
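As a taste of what is to come, here is a minimal sketch of my own (it reuses the TPC-H orders table, but it is an illustration, not one of the benchmark queries): a correlated subquery, and the de-correlated join a good optimiser should rewrite it into.

```sql
-- Correlated form: executed naively, the subquery runs once per order.
SELECT o_orderkey
FROM orders o
WHERE o_totalprice > (SELECT AVG(o2.o_totalprice)
                      FROM orders o2
                      WHERE o2.o_custkey = o.o_custkey);

-- De-correlated form: compute each customer's average exactly once, then join.
SELECT o.o_orderkey
FROM orders o
JOIN (SELECT o_custkey, AVG(o_totalprice) AS avg_price
      FROM orders
      GROUP BY o_custkey) a
  ON a.o_custkey = o.o_custkey
WHERE o.o_totalprice > a.avg_price;
```

The two forms return the same rows - but run naively, the first one re-scans orders once per outer row, while the second aggregates once and joins.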
When talking about Data Lakes and how people access them - we must address some of the misconceptions that made
them popular in the first place.
One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from
discussions on LinkedIn) that it costs less CPU to turn your data model into one flat table and serve that to users
than it does to join every time you access the data.
It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle
the HTTP overhead) and nearly infinite storage. But is there really such a thing as "sacrificing disk space to save
the CPU cost of joins"?
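To make the claim concrete, here is what the trade-off looks like in SQL (hypothetical customers and orders tables; the names are mine):

```sql
-- Normalised model: pay for the join every time you query.
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- Flattened model: pay for the join once at load time, store
-- customer_name redundantly on every order row, and re-pay
-- whenever a customer row changes.
CREATE TABLE orders_flat AS
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;
```

Whether the first form really costs more CPU than serving the second is exactly the question we will put to the test.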
After the wonderful feedback on the previous blog about Iceberg - it is now time to switch
gears.
Databases are more than row storage engines. They are algorithm machines, helping programmers solve tricky
scalability problems that would otherwise not surface until the data is in production. They do this by
using their big bag of heuristics and tuning methods - a form of distilled computer science sprinkled onto your
application.
Benchmarks, particularly those made by the TPC council, are co-evolving with the database industry. As new optimisations
are discovered, benchmarks are updated to test for them and to see just how far databases can be pushed. This
kind of benchmarking allows database engine developers to track how they are doing against the state of the art.
Just like in biology, we can learn a lot about the species of today by studying their fossils and ancestry. Database
benchmarks are the predators of the savannah - the environment that databases evolved to survive in. Each little
detail of a benchmark gives us a clue about the genetic makeup of databases - even the ones of the present day.
Our first visit to history is the TPC-H benchmark - come with me on a journey to discover the origins of our data DNA.
Let us finally look at what is so wrong with the Iceberg spec and why this simply isn't a serious attempt at solving
the metadata problem of large Data Lakes.
Iceberg: The great unifying vision finally allowing us to escape the vendor lock-in of our database engines. One
table and metadata format to find them ... And in the darkness bind them!
I love the idea! But I loathe the spec.
In this post, I'll explain why you should be deeply skeptical of Iceberg as it exists today. I'll argue that its design
flaws are so severe that we're watching a new HADOOP-style disaster unfold in real time.
My reaction is visceral, but not simple. It requires a historical lens, which is why we must split this into a historical
and a current-day part. (Also, splitting it gives LinkedIn followers time to bring out their pitchforks.)
It is difficult to find words that accurately describe the cruelty, selfishness and outright evil on display from
the White House these days. The guiding principle of Gordon Gekko: "Greed... is GOOD" has
finally reached a crescendo. As long as you are greedy, you can be above the law, and the sickness of our society
is reflected in the mentally deranged leaders we elect.
I think we must accept that it is possible to have a coherent interpretation of the world
through the lens of "Might makes Right". It is the way sociopaths view others, and
it is the system that dictators will have you accept. In such a world view, "trust" simply does not
exist - except in the short-term wielding of power to create fear. In a might-makes-right
world, every transaction is a zero-sum game. To have a winner, there must be a loser.
You can choose to submit yourself to might - or you can fight back. Today, I want to talk about a few
bugs in the "might is right" world view that we can exploit as engineers. Hopefully, we can grapple
together with how we can collectively dignify our species again.
Does breaking systems into smaller parts you can understand individually really make them easier to manage and scale?
Today, we explore the pitfalls of this obsession and draw lessons from nature and other
fields of engineering. As always, this will be a controversial take - if it weren't, why would
you bother reading this instead of some LLM-generated nonsense on LinkedIn?
Python has now infected computer science departments and data analysts across the planet. The resulting ecosystem is
a mess of libraries that are often poorly designed or outright harmful.
Recently, I have had to write a few libraries of my own and this has taught me a lot about what makes a good Python
library. In this series of blog entries, I will share this knowledge and tell you about the lessons I learned so
you don't have to suffer through them.