Database Doctor
Writing on databases, performance, and engineering.

Recent Posts - page 2

Cover

TPC-H Query 3 - Join Ordering and Heap Sorting

I want to teach you an important skill that will serve your well as a database specialist. One blog entry is not going to be enough, but here is my goal:

When you look at an SQL query in the future, you should be able to answer this question: "What query plan do I want this query to have?" You will also be able to make an educated guess about what join order is best. As you hoard optimisation tricks in your arsenal - you will also be able to make guesses about what additional tricks the database could play (and you can force those tricks if the database does not do it for you).

Queries can be incredibly complex and still work very well, the beauty of SQL. But, no matter the complexity - the method I'm going to teach you will allow you to reason about the query. It starts with the selectivity of filters and Query 3 from TPC-H is a good way to illustrate it.

Read More...

Cover

Modern CMake for C++ Projects

Today's blog is written in collaboration with my friend Martin Broholm Andersen.

Martin and I are both C++ programmers, but we come from different programming traditions. These differences provide us with a rich source of subjects to argue about. We both learn a lot from these arguments.

I've been working for years on old Cmake projects (from the 2.x era). Martin has been busy moving legacy .sln / .vcxproj projects at his workplace to modern, CMake 3.x style projects. His reasons for moving to CMake weren't to be able to cross compile or lock down binaries – but to scale, reorganize and merge the large code base he manages.

Martin is a Visual Studio user, I use CLion. We hope our mixed experience will give you a good experience no matter what tooling you prefer.

Martin's knowledge of modern CMake far exceeds mine. As he was melding his mind into mine, we both learned (and argued) a lot. This blog contains the result of our discussions, and our final agreement on best practices for creating modern style repos that use Cmake 3.x.

We expect that this blog will be updated and expanded to serve as a reference for both experienced developers and new users who're just getting started with Cmake and C++.

The repo containing our template CMake project can be found on GitHub:

  • Modern Cmake.

As is the tradition on Database Doctor – I encourage disagreement and discussions.

Read More...

Cover

TPC-H Query 2 and 17 - De-correlation

The great promise databases make to programmers is: "Tell me what you want and I will figure out the fastest way to do it."

A database is a computer science engine — it knows things and optimisations that the average programmer has not heard about...

Sometimes...

Some queries look "easy" to programmers — but databases often need to apply a method called de-correlation to make them effective. Even back in the 90ies, the great minds of the TPC council knew how to design queries that look for this optimisation.

Today we will learn how to spot these cases and what happens when databases fail at optimising them.

Read More...

Cover

Joins are NOT Expensive! - Raw Reading

When talking about Data Lakes and how people access them - we must address some of the misconceptions that made them popular in the first place.

One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from discussions on LinkedIn) that it costs less CPU to turn your data model into a flat table and serve that to users - than it does to join everytime you access the table.

It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle the HTTP overhead) and nearly infinite storage. But, is there really such a thing as "sacrificing disk space to save the CPU cost of joins?"

Today, let us put this to the test.

Read More...

Cover

Introducing the TPC series - TPC-H Query 1: Column Storage and Local Aggregation

After the wonderful feedback on the previous blog about Iceberg - it is now time to switch gears.

Databases are more than row storage engines. They are algorithm machines, helping programmers solve highly scalable, tricky problems that would otherwise not be discovered until the data is in production. They do this by using their big bag of heuristics and tuning methods - a form of distilled computer science sprinkled onto your application.

Benchmarks, particularly those made by the TPC council, are co-evolving with the database industry. As new optimisations are discovered, benchmarks are updated to test for these optimisations to see just how far database can be pushed. This kind of benchmarking allows database engine developers to track how they are doing against the state of the art.

Just like in biology, we can learn a lot about the species of today by studying their fossils and ancestry. Database benchmarks are the predators of the savannah - the environment that databases evolved to survive in. Each little detail of a benchmark gives us a clue about the genetic makeup of databases - even the ones in present time.

Our first visit to history is the TPC-H benchmark - come with me on a journey to discover the origins of our data DNA.

Read More...

Cover

Iceberg, The Right Idea - The Wrong Spec - Part 1 of 2: History

Iceberg: The great unifying vision finally allowing us to escape the vendor lock-in of our database engines. One table and metadata format to find them ... And in the darkness bind them!

I love the idea! But I loathe the spec.

In this post, I’ll explain why you should be deeply skeptical of Iceberg as it exists today. I’ll argue that its design flaws are so severe, we're watching a new HADOOP-style disaster unfold in real time.

My reaction is visceral, but not simple. It requires a historical lens, which is why we must split this into a historical and a current day part. (Also, splitting it gives LinkedIn followers time to bring out their pitchforks)

Read More...

Greed vs Bravery Based Engineering

It is difficult to find words that accurately describe the cruelty, selfishness and outright evil on display from the White House these days. The guiding principle of Gordon Gekko: "Greed.. is GOOD", has finally reached a crescendo. As long as you are greedy - you can be above the law and the sickness of our society is refleced in the mentally deranged leaders we elect.

I think we must accept that it is possible to have a coherent world interpretation through the lens of "Might makes Right". It's the way sociopaths view others, it is the system that dictators will have you accept. In such a world view, "trust" simply does not exist - except in the short term wielding of power to create fear. In a might makes right world - every transaction is a zero sum game. To have a winner, there must be a loser.

You can choose to submit yourself to might - or you can fight back. Today, I want to talk about a few bugs in the "might is right" based world view that we can exploit as engineers. Hopefully, we can grapple together on how we can collectivity dignify our species again.

Read More...

Cover

Coupling, Complexity, and Coding

Why is the IT industry obsessed with decoupling?

Does breaking systems into smaller parts you can understand individually really make them easier to manage and scale?

Today, we explore the pitfalls of this obsession and draw lessons from nature and other fields of engineering. As always, this will be a controversial take - if it wasn't, why would you bother reading this instead of some LLM generated nonsense on LinkedIn?

Read More...

Making Decent Python Libraries - Part 1

Python has now infected computer science departments and data analysts across the planet. The resulting ecosystem is a mess of libraries - that are often poorly designed out outright harmful.

Recently, I have had to write a few libraries of my own and this has taught me a lot about what makes a good Python library. In this series of blog entries, I will share this knowledge and tell you about the lessons I learned so you don't have to suffer through them.

Read More...