The Database Doctor
Musing about Databases

Why are Databases so Hard to Make? Part 1 - CPU usage

In our previous blog entries, we have visited the idea that "databases are just loops". At this point, my dear readers may rightfully ask: "If those databases are indeed just loops - why are they so hard to make?" "Why have we not nailed databases once and for all?" The answer is complicated - but I want to take a stab at it.

Today, we look at the CPU usage of databases - delving deeper into what those loops actually look like. We will see how you can use "just loops" thinking to make predictions about which database you want to deploy. You will learn how to sanity check existing database platforms - and cut through all the bullshit in the market.

Making Loops faster

When you want to execute a SQL query faster, you can:

  1. Use the best possible data structure
  2. Make the loops faster on the CPU they run on
  3. Use multiple CPU cores to parallelise the loops.
    • Typically, we call this "Scaling up"
  4. Use multiple machines (or Pods, if you are a Kubernetes person) to create distributed execution of the query
    • We will call this "scaling out"

A lot of databases, including Postgres and MySQL, use single threaded execution to run queries. One query, one thread. Yes, I know Postgres is getting better in this area and some things are now multi threaded. Improving single threaded execution is about optimising items 1 and 2 above. A lot of performance can be gained by being clever about these two steps. They are also the simplest steps to understand.

Using Good Data Structures

Let us go back to the query we saw at the very beginning of this journey: (see Databases are just Loops under the Hood – Part 1: Joins)

SELECT x, y
FROM a
INNER JOIN b ON a.x = b.y

Recall that we could express the query as this Python loop:

a: list[Row] = [<some rows>]
b: list[Row] = [<some other rows>]
output: list[Row] = []
for a_row in a:
  for b_row in b:
    if a_row.x == b_row.y:
      output.append(Row(x=a_row.x, y=b_row.y))

It is time to think like a computer scientist!

You realise that the above execution is of time complexity O(|a| * |b|). This is obviously brain dead! A simple improvement is to turn b into a dictionary. We can rewrite to:

a: list[Row] = [<some rows>]
b: list[Row] = [<some other rows>]
b_hash = {r.y: r for r in b}
output: list[Row] = []
for a_row in a:
  if a_row.x in b_hash:
    output.append(Row(x=a_row.x, y=b_hash[a_row.x].y))

What we have here, is a "hash join access method" (see: Databases are just Loops – Part 2: GROUP BY).

Building a dictionary (= hash table) over the rows in b turns the algorithm from time complexity O(|a| * |b|) into time complexity O(|a| + |b|). For any non-trivial size of b, we gain orders of magnitude of execution speed with just this trick. In other words: Having Hash Joins available in your database is a big deal.
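
To get a feel for the difference, here is a minimal sketch you can run yourself - plain integers instead of Row objects, and sizes kept small enough that the nested loop actually finishes. Even at 2,000 rows per side, the dictionary version wins by orders of magnitude:

import time

a = list(range(2_000))
b = list(range(1_000, 3_000))

# O(|a| * |b|): 4 million comparisons
start = time.time()
nested = [(x, y) for x in a for y in b if x == y]
nested_time = time.time() - start

# O(|a| + |b|): build a dictionary once, then probe it once per row of a
start = time.time()
b_hash = {y: y for y in b}
hashed = [(x, b_hash[x]) for x in a if x in b_hash]
hash_time = time.time() - start

print(f"nested loop: {nested_time:.3f}s  hash join: {hash_time:.4f}s  "
      f"same result: {sorted(nested) == sorted(hashed)}")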

There were databases in history that did not have hash joins (for example, older versions of MySQL). For those databases, you needed to be extra careful with your indexes. If hash join is not available, your only other options are loop join and merge join - both can be very expensive (they have their uses, which is a subject for another blog entry).

Takeaway so far: Be a proper computer scientist and pick smart data structures.

In the rest of this blog entry, we will use hash joins as our sample use case. I picked them because they are crucial to nearly every modern database. They are also easy to understand for most programmers - we have all used hash tables before. Hash joins are the fastest joins available to us when we want to perform a lot of join operations in a single query. We will use a specific way of thinking in the next sections to reason about the hash join. This way of thinking can also be applied to other database operations.

Single Threaded Execution

I would like you to focus on the join loop above and ask yourself two questions: how many CPU cycles does one trip around this loop cost, and how many joins per second can a single CPU core push through it?

Look again:

for a_row in a:
  if a_row.x in b_hash:
    output.append(Row(x=a_row.x, y=b_hash[a_row.x].y))

You might now be thinking something like: "I have no idea" or "How can I possibly answer those questions unless I run it on an actual machine?" or the very worst, the willfully ignorant: "There is no answer to this - it depends".

If so, you may at this point in your career only have spent coding time in Python, Java, JavaScript, R or another high level language. Maybe you are even an architect or designer telling other people how to build big computer systems! Maybe you didn't think much about these subjects - because modern programming languages allow you to get work done, make money and not think about these things. If you are that person and still reading this: now is a good time to wake up and learn how a modern computer actually works. Because we are about to see just how incredible modern hardware is - if you can wield it.

I am going to introduce you to a way of thinking that database engine developers learn very quickly: Putting boundaries on things.

Let us start with an overly optimistic, upper boundary. How fast could this loop possibly go under optimal conditions?

We can make some pretty strong assumptions about the physical world we live in (assumptions that won't change every year "because IT moves so fast" and all the similar lies they told you):

  • A modern server core runs at roughly 4-5 GHz - call it 4-5B cycles per second
  • Reading data that is already in the L1 cache costs a handful of cycles
  • Reading data from main memory (DRAM) costs in the order of 100-300 cycles

A modern CPU will use pipelining of instructions - which means it can execute more than one CPU instruction per cycle. Keep that in mind - but pretend for a moment that a CPU can run around 1 instruction/cycle. What can we now say about the upper boundary of the loop above?

What must happen?

For every single row in a we must, at the very least:

  • Read a_row.x - hopefully already sitting in a register or the L1 cache
  • Compute a hash of a_row.x - a few instructions
  • Read the matching bucket of the hash table - a handful of cycles if we are lucky and hit L1 cache
  • Compare the keys and, on a match, construct and append the output row - a few more instructions

All of the above are extraordinarily generous assumptions about how fast we can do this. The numbers roughly add up to something in the range of 10-15 CPU cycles for a single join.

We now have a very optimistic, upper boundary on our theoretical join for single threaded execution. Combining our data we get:

4-5B cycles/sec divided by 10-15 cycles/join ≈ a few hundred million joins/sec - an absolute, best case upper boundary for a single core.

At this point, my performance optimising readers may object: "Wait, can we shave off some instructions by using SIMD to calculate multiple hashes at the same time?". You are correct - we can. But it turns out this is rarely worth it in the real world. The reason is that the memory read of the hash table tends to dominate. The actual hash calculation is in the noise. Why?

In the above example, we assumed the read happens from L1 cache. The L1 cache of a modern CPU is tiny - typically in the order of a few tens to hundreds of KB. No real workload for hash join fits in L1 cache. What happens if we have to hit L2 or L3 cache? Or even main memory?

If we are being more realistic, the actual joins/sec you can do on a modern, top end, best in class CPU is around ~10-50M joins/sec. We get this number by taking 4-5B cycles/sec and dividing it by 100-300 cycles/memory lookup.
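
The arithmetic is simple enough to put in a few lines of Python. The constants below are the same assumptions we have been using in this post - adjust them for your own CPU:

cycles_per_sec = 4.5e9        # a modern ~4-5 GHz core

# Optimistic boundary: every probe hits L1 cache, ~10-15 cycles per join
optimistic = cycles_per_sec / 12.5

# Realistic boundary: every probe misses to DRAM, ~100-300 cycles per lookup
realistic_low = cycles_per_sec / 300
realistic_high = cycles_per_sec / 100

print(f"optimistic upper bound: {optimistic:,.0f} joins/sec")
print(f"realistic range       : {realistic_low:,.0f} - {realistic_high:,.0f} joins/sec")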

The memory cost of this kind of lookup is so dominant that even Python has a chance at keeping up. Doing a simple loop like the above with 10K rows in b_hash and 100M rows in a, I get around ~3M joins/sec in Python. Not too bad for a language that is not focused on performance.

Single threaded takeaway: You can join at roughly 10M rows/sec - give or take an order of magnitude - on a single CPU core.

Benchmark it in Python yourself

This is the program I ran. I ran it in Python 3.12. Curious to know what numbers you get:

import random
import time

lookup_size = 10000
a_rows = [i % lookup_size for i in range(100000000)]
b_hash = {i: i for i in range(lookup_size)}

random.shuffle(a_rows)
matches = 0
start_time = time.time()
for a in a_rows:
    if a in b_hash:
        matches ^= (b_hash[a] + a)  # Do something that is really cheap

end_time = time.time()
print(f"checksum {matches} found at speed: {len(a_rows) / (end_time - start_time)} joins/sec")

Intermezzo - How powerful are Modern Databases and CPUs?

We have seen the limits of single threaded execution now. Before we talk about scalability - I want to stop for a moment to ponder the numbers.

Reiterating:

On a single core, with a good implementation - you can find and read >10M items in a large hash table per second. If the circumstances are just right - you can go 10x faster than that. That is assuming the hash table is in memory of course... We can talk disk another day.

What kind of business problem can you solve with that kind of speed?

At the time of writing, there are around 7B people on this planet. Assume half of them have a credit card. Assume they use that credit card an average of 10 times per day (those are some dream consumers for you right there).

Let us say that a credit card transaction between you and me is:

  • A lookup of my account and a lookup of your account
  • An update of my balance and an update of your balance

I am going to assume those updates are also hash lookups. Which is reasonable in a memory resident database. One credit card transaction is 4 hash lookups (give or take).

Back of the envelope, we can now calculate the global credit card throughput for our species:

3.5B card holders × 10 transactions/day ≈ 35B credit card transactions per day

Turning that into transactions/sec:

35B transactions/day / 86,400 seconds/day ≈ 400K transactions/sec

We are assuming one transaction is 4 hash table lookups, so that is roughly 1.6M hash table lookups/sec.
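
The whole back-of-envelope calculation fits in a few lines of Python - the numbers are the assumptions above, nothing more:

card_holders = 7_000_000_000 // 2      # half the planet has a credit card
tx_per_day = card_holders * 10         # 10 transactions per card holder per day
tx_per_sec = tx_per_day / 86_400       # seconds in a day
lookups_per_sec = tx_per_sec * 4       # 4 hash lookups per transaction

print(f"{tx_per_day:,} tx/day  ≈ {tx_per_sec:,.0f} tx/sec  ≈ {lookups_per_sec:,.0f} lookups/sec")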

In other words: From the perspective of the database - a single CPU core, with a good implementation of hash lookups, could handle all credit card traffic on the planet. The rest is overhead from apps! You could run the world's banking systems on your iPhone!

Think about that a bit, before you burn a power budget the size of Poland's to run your little crypto pyramid scheme and claim that crypto currency is the new money.

Multi Threaded Execution - Scaling Up

What if 10M hash lookups/sec just isn't enough?

The first path open to us is multi core execution. A modern CPU, like the beautiful AMD Epyc, can be acquired relatively cheaply. It has 160 cores - some serious compute power on a single package! Modern CPUs can also hold significantly larger hash tables in their L3 caches. We are moving into the GB range of L3 caches - L3 cache is the new DRAM. This is all great news for database developers!

It appears, at first glance, that a 100x improvement in query execution is in reach for the database engineer who is clever about multi threading.

Our use of Python as the "sample language" is now getting a little tricky. Python hides too many details about concurrency - you can't "see" the execution at the level of detail needed for this type of analysis. But let me try to see how far the analogy goes...

Consider again this loop:

b_hash = {r.y: r for r in b}  # Assume very fast (i.e. b is small)

for a_row in a:
  if a_row.x in b_hash:
    output.append(Row(x=a_row.x, y=b_hash[a_row.x].y))

I am going to assume that b_hash is so fast to create that the time it takes does not factor into the runtime of the query that contains the loop. I leave it as an exercise for the reader to think about how to make creation of the hash table parallel - or you can just read on.

Using libraries like concurrent.futures, we can write something like this (ChatGPT wrote this - I am not sure it works):

with concurrent.futures.ThreadPoolExecutor() as executor:
   output = list(
      executor.map(lambda a_row: b_hash[a_row.x] if a_row.x in b_hash else None, a))

This doesn't tell us anything about what must happen behind the scenes! But we can infer this by thinking a bit deeper.

If presented with a list of rows from a to join in parallel, we must:

  • Split a into chunks - ideally one per CPU core
  • Hand each chunk to a thread and let it probe the shared, read-only hash table
  • Collect the partial output from every thread and stitch it back together into output

For trivial cases, like the list parallelism above, this operation scales nearly linearly. For non-trivial cases you will need locks and coordination between threads. Creating low impact thread coordination is an extraordinarily tricky piece of hacking - which deserves a blog entry of its own.
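
Here is a minimal sketch of that chunking structure, with my own function names and plain integer keys instead of Row objects. Note that in CPython you need processes rather than threads to get real CPU parallelism for a loop like this (the GIL serialises pure-Python threads), and the call should live under an if __name__ == "__main__": guard on platforms that spawn workers:

import concurrent.futures

def probe_chunk(chunk: list[int], b_hash: dict[int, int]) -> list[tuple[int, int]]:
    # Each worker probes its own chunk of a against the (read-only) hash table
    return [(x, b_hash[x]) for x in chunk if x in b_hash]

def parallel_hash_join(a: list[int], b_hash: dict[int, int], workers: int = 8) -> list[tuple[int, int]]:
    # Split a into one chunk per worker
    chunk_size = max(1, (len(a) + workers - 1) // workers)
    chunks = [a[i:i + chunk_size] for i in range(0, len(a), chunk_size)]
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        partial_results = list(executor.map(probe_chunk, chunks, [b_hash] * len(chunks)))
    # Stitching the partial outputs back together is the small, serial tail
    output: list[tuple[int, int]] = []
    for part in partial_results:
        output.extend(part)
    return output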

Amdahl's Law - The Pain that keeps Giving

A principle called Amdahl's Law comes into play when designing for multi core execution: the fraction of the work that cannot be parallelised puts an upper limit on how much extra speed you can get by adding more cores.
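
The law itself is one line of arithmetic: if a fraction p of the work can be parallelised across N cores, the best possible speedup is 1 / ((1 - p) + p / N). A quick sketch, using the 160-core CPU from above:

def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    # The serial fraction caps the total speedup, no matter how many cores
    # you throw at the parallel fraction
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

print(amdahl_speedup(0.95, 160))   # ~18x  - 5% serial work wastes most of the cores
print(amdahl_speedup(0.99, 160))   # ~62x
print(amdahl_speedup(1.00, 160))   # 160x - only perfect parallelism scales linearly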

Consider the case of building a large hash table in parallel. We need to do the following operations:

  • Allocate a hash table big enough to hold the rows of b
  • Split the rows of b between the threads
  • Let every thread insert its share of the rows - which means taking locks, using atomic instructions, or building private partitions that get merged at the end
  • Resize and rehash if the table turns out to be too small

The time we need to synchronise with other threads is time we did not spend actually building the hash table. That time is not zero. Because that time would not need to be spent in single threaded execution, we are no longer scaling linearly.
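
To make that concrete, here is the most naive way of letting several threads insert into a single Python dict - a sketch of the problem, not of how a real engine does it. Every insert pays for a lock, even when nobody is competing for it, and that is work the single threaded build never had to do. Real engines avoid the global lock by partitioning the table or using lock-free structures, but some coordination always remains:

import threading

shared_table: dict[int, int] = {}
table_lock = threading.Lock()

def insert_rows(keys: list[int]) -> None:
    # Called from multiple threads: every insert acquires and releases the lock
    for key in keys:
        with table_lock:
            shared_table[key] = key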

Multi Threaded Execution Takeaway: With the current generation of CPU cores, you can probably do around 1B joins/sec in a single query. This is an extraordinary amount of throughput and it is only going to get bigger as time goes by.

Amdahl's Law Exercises

Here are some SQL queries that are challenging to execute in parallel. Think about how much extra speed you will get by throwing cores at the problem and how you would write a program to make use of all cores.

Every 10th row:

-- Every 10th row

SELECT * FROM (
  SELECT ROW_NUMBER() OVER (ORDER BY x) AS rn
  FROM BigTable
) AS subquery
WHERE rn % 10 = 0

Finding duplicate values:

(particularly interesting when there is a great number of distinct x and very few duplicates)

SELECT x
FROM BigTable
GROUP BY x
HAVING COUNT(*) > 1

Sending ordered rows to the client without a LIMIT or TOP:

SELECT x
FROM BigTable
ORDER BY y

Scaling Out - You need a Good Reason

We have come to a train station on our journey. On the platform - we see a train parked with the name: "The Hype". Are we going to get on it to speed up our journey even further?

Scale out computing has, for a very long time, been a dream we all aspired to. The idea is simple: I want to build a compute cluster out of cheap, easily available, cost effective hardware. I then want to treat that cluster as if it were a single database system. This, in principle, also removes all single points of failure. Because it is simple to understand - but hard to implement - it makes a lot of people feel very smug when they talk about it.

Software architects have argued for this idea for longer than anyone can remember. With the cloud, it sounds ever more tempting. Because what is a cloud, if not a collection of someone else's hardware, built of cheap components, that you can quickly get more of if you want? At least that is the principle...

If you have been thinking carefully about the multi threading example - you might be able to infer what is coming: Amdahl's Law is merciless. Remember that accessing DRAM when the data is not in L3 cache is expensive - in the order of 100-300 CPU cycles. During those cycles, the CPU is doing nothing - except burning your cloud credits. Accessing the data across a network card - even a very powerful one - takes at least 10x longer, maybe 100x (depending on how smart you are). That is a lot of compute down the drain.

"But Doctor", I hear you say, "isn't that what shuffle operations are supposed to solve?"... I'm glad you asked.. let us dig in

Shuffle - The Secret, 5th Operation

When we design databases, we typically think of four operations that we must optimise our loops for:

  • Scanning (and filtering) rows
  • Joining
  • Aggregating (GROUP BY)
  • Sorting (ORDER BY)

There is a hidden operation that only distributed database engines have to use. It is called: "Shuffling" - sometimes also known as "exchange" or "distribute". To understand shuffle, we need to visit our join query again. This time, let us EXPLAIN it on a distributed database system

EXPLAIN
SELECT x, y
FROM a
INNER JOIN b ON a.x = b.y

If we assume that a and b are very large and that their data is spread all over a distributed cluster, we may get something like this:

SELECT
INNER HASH JOIN ON (a.x = b.y)
|-SHUFFLE (a.x)
| SCAN a
|-SHUFFLE (b.y)
  SCAN b

On a single node system (i.e. not distributed), the same query would look like this:

SELECT
INNER HASH JOIN ON (a.x = b.y)
|-SCAN a
|-SCAN b

What is SHUFFLE used for in the distributed case?

A Shuffle Example

Think about what needs to happen if we want to hash join across multiple machines. If Machine 1 is looking at a row from a - how does it know whether there is a matching row in b on Machine 2?

To understand what is going on with a shuffle, think of x and y values as being suits of playing cards - diamonds (♦) , spades (♠), clubs (♣) and hearts (♥). Our sample scenario has two machines (Machine 1 and Machine 2).

Our distributed query optimiser may decide as follows:

If Machine 1 observes a black suit card, it knows that it should send that card to Machine 2 - if the card is a red suit, Machine 1 will just keep it.

Similarly, if Machine 2 observes a red suit card, it sends it to Machine 1; if it observes a black suit card, it keeps it.

With these decisions, Machine 1 and Machine 2 both know that when they join on x and y, any matches will already be on the relevant machine - we get the right results without having to synchronise between members of the cluster.

We can illustrate this with the actual execution plan for each machine:

Machine 1:

Output1 Machine1 (♦ and ♥)                 
♦♥      SELECT                                              
♦♥      HASH JOIN ON (a.x = b.y)        
♦♥      |-SHUFFLE (a.x)            
♦♥♠♣    | SCAN a                   
♦♥      |-SHUFFLE (b.y)            
♦♥♠♣      SCAN b                   

Machine 2:

Output2 Machine2 (♠ and ♣)          
♠♣      SELECT                            
♠♣      HASH JOIN ON (a.x = b.y)    
♠♣      |-SHUFFLE (a.x)  
♦♥♠♣    | SCAN a                    
♠♣      |-SHUFFLE (b.y) 
♦♥♠♣      SCAN b

Shuffle is Expensive

What needs to happen in a shuffle is this:

  • For every outgoing row, compute which machine it belongs to (typically by hashing the join key)
  • Serialise the rows into per-machine network buffers
  • Transmit the buffers across the network
  • Receive, deserialise and buffer the rows on the other side before feeding them into the local join

These are all memory intensive operations - copying data around and handling buffers.
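
The routing decision itself is just a hash. The card-suit example above is the two-machine special case of this sketch (the function name is mine), which generalises to any number of machines:

def shuffle_partition(keys: list[int], num_machines: int) -> list[list[int]]:
    # Every machine applies the same routing function, so rows with matching
    # join keys always land on the same target machine
    outgoing: list[list[int]] = [[] for _ in range(num_machines)]
    for key in keys:
        outgoing[hash(key) % num_machines].append(key)
    # In a real engine, each buffer would now be serialised and pushed onto
    # the network (RDMA or UDP-based protocols rather than plain TCP/IP)
    return outgoing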

While the cost of actually transmitting the data can be partially offloaded to the network card - it cannot be fully offloaded. For high speed databases, TCP/IP is too slow and costs too much CPU to use (try to run four 10Gb NICs at full speed on your Linux box and run htop). Typically, RDMA or custom protocols on top of UDP are used instead.

Shuffling is not free. It is not as expensive as a join, but it typically contributes significantly to the overall cost of a query and of the distributed system as a whole. This means distributed queries that operate on the same data as a non-distributed system cost more overall CPU (i.e. it makes your cloud bill larger).

Scaling out Takeaway: When scaling out databases, it is possible to reach a nearly unlimited amount of performance. However, the aggregate CPU cost per query is significantly higher in a distributed system than when running the same query in a scale-up system.

The problem of Skew

This blog entry is already getting too long. But I want to put something on the table (pun intended) for you to think about.

In our example we had x and y represent suits in a large deck of cards. There are four suits available in this domain (i.e. ♦♥♠♣).

How would we scale the query to more than 4 machines?

What would happen if 90% of all the data was hearts (♥)? How would this scale with the algorithm outlined?

Summary

Today, we have gone over some of the advanced tricks databases use to optimise CPU usage when running loops.

We have seen that it is possible to do some rough calculations to predict how long things should take. This type of "order of magnitude" reasoning is a useful way to think about computation in general. It allows you to sanity check your own algorithms.

We have seen that there is a CPU cost incurred when moving to multi threaded execution and that this cost goes up even more if we move to scale-out solutions.

Hopefully, I have at least convinced you that those "just loops" are a little more complicated than meets the eye.