Recent Posts - page 3

Making Decent Python Libraries - Part 1

December 16, 2024

Python has now infected computer science departments and data analysts across the planet. The resulting ecosystem is a mess of libraries - that are often poorly designed out outright harmful.

Recently, I have had to write a few libraries of my own and this has taught me a lot about what makes a good Python library. In this series of blog entries, I will share this knowledge and tell you about the lessons I learned so you don't have to suffer through them.

Why are Databases so Hard to Make? - Digging up Graves

October 4, 2024

In my last post about high speed DML, I talked how it is possible to modify tables at the kind of speeds that a modern SSD can deliver. I sketched an outline of an algorithm that can easily get us into the >10GB/sec INSERT speed range (with an array of SSD). With the right hardware and a low latency transaction log - we may even reach into the 100GB/sec range for DML. That's without scaling out to multiple nodes.

But, we also saw that when we DELETE or UPDATE data, it leaves behind tombstones in the data, markers that tell us: "This data is dead and no longer visible". I alluded to the problem of cleaning up these data graves.

It is time to talk about cleaning up these tombstones. Get your shovels ready - because the work we are about to do will be unpleasant.

Why are Databases so Hard to Make? - High Speed DML

September 24, 2024

After a brief intermezzo about testing (read about my thoughts here: Testing is Hard and we often use the wrong Incentives) - it is time to continue our journey together to where we will explore databases and all the wonderful things they can do.

To fully enjoy this blog entry, it will be useful if you first read the previous posts:

Why are Databases so hard to Make? Part 1 - CPU usage
Why are Databases so Hard to Make? Part 2 – Logging to Disk

Let's get to it...

Testing is Hard and we often use the wrong Incentives

June 30, 2024

I have been spending a lot of time thinking about testing and reviewing testing lately. At a superficial level - testing looks simple: Write test matrix, code tests, run tests, learn from results, refine. But we often overlook the nuances.

Today, I take a break from my usual talk about databases to share my thoughts on test incentives, CI/CD, unit testing and system level engineering.

Why are Databases so Hard to Make? - Logging to Disk

June 23, 2024

We have previously seen that optimising database CPU usage requires a solid foundation in computer science and algorithms. To properly scale queries - you must master single threaded, classical algorithms as well as modern concurrency and high speed locking.

Today, we take a break from the heavy computer science algorithms and zoom into transaction logs - a crucial component of every database. Our journey is to the land of I/O and hardware knowledge.

On the surface, the problem we try to solve appears simple and straightforward. But at this point, you know that this isn't how we travel together.

Why are Databases so Hard to Make? - CPU usage

June 9, 2024

In our previous blogs, we have visited the idea that "databases are just loops". At this point, my dear readers may rightfully ask: "if those database are indeed just loops - why is it they are so hard to make?" "Why have we not nailed databases once and for all?" The answer is complicated - but I want to take a stab at it.

Today, we look at the CPU usage of databases - delving deeper into what those loops actually look like. We will see how you use "just loops" thinking to make predictions about what database you want to deploy. You will learn how to sanity existing database platforms - to cut through all the bullshit in the market.

Databases are Just Loops - Row and Batch execution

May 30, 2024

Our database journey makes a brief stop. We need to appreciate an important design decision every database must make: Should I use row or batch execution?

Depending on the database vendor - you may get one or the other - or even both. What are the trade-offs? Let's take a look.

I recommend reading my previous posts before exploring this one, if you have not done so already:

Part 1: Joins
Part 2: Aggregates

In preparation for exploring the design space, we must now consider a slightly more complicated query than those we've seen so far:

SELECT a.x, SUM(b.y)
FROM a
INNER JOIN b ON a.x = b.y
GROUP BY a.x

This query does two operations:

Join a.x = b.y
Aggregate function SUM on b.y using grouping column a.x

There are a few more operations in play here that you may not think about as operations yet:

Finding rows in a for the join
Finding rows in b for the join
Sending rows to the client

Depending on the access method (=algorithm) chosen for the join, what exactly happens in that "finding" of rows depends on the database. For our example, I will assume a hash join is used.

As you can see at this point, writing out what a query does and how it does it, gets tedious and wordy. We are only looking at a simple query so far - and using plain English to explain what it does is getting complicated.

Because database people live with databases every day, the good database vendors have come up with another way to render what happens in queries. With a bit of training, you can easily learn how to read this.

Databases are just Loops - GROUP BY

May 28, 2024

In my previous post - I introduced the idea that you can think of database queries as a series of loops. Let me take this ideas even further - introducing more complex database concepts in terms more familiar to computer programmers.

Databases can do this cool thing called "aggregation". You may think of it like GROUP BY. You will have seen queries like this:

SELECT x, SUM(v) AS agg_y
FROM a
GROUP BY x

What is going on here? And how can we think of this like a loop?

Aggregate or GROUP BY?

To establish some terminology: You will hear me refer to GROUP BY as "aggregation" in the following passages.

We will call SUM(v) above an "aggregate function" and we will call x a "grouping column". We will refer to the entire loop as the "aggregation loop"

Databases are just Loops - Joins

May 24, 2024

I decided to write this post in response to my recent discussions with Matt Martin. Matt and I have been sparring lately over software performance. During one of these discussions, Matt said the magic words in the title of this post. I hypothesise that this quote carries of lot of explaining power for those who are not fully familiar with databases.

Let us play with this idea a bit! I'm going to use Python to show examples of "database loops". Obviously, no one in their sane mind would implement a database in Python. But Python has the great advantage of being readable to nearly everyone. Let's go...

Finding things in Data Models with Property Bags

April 13, 2024

When designing database schemas, you may find yourself needing to create a property bag table that is in a 1-n relationship with an entity table. It isn't always viable to just add more columns to a table because you may not know all the properties you want in advance. The problem with these property bags is that they can be hard to query.

In this blog entry, we will explore the property bag data modelling use case and provide some tricks that will make it very simple to query those bags.