The Database Doctor
Musing about Databases

Why are Databases so Hard to Make? Part 5 - Log Sequencing

In a previous post we saw how databases write their logs to disk. The transaction log acts as an aggregator of a lot of small writes, turning them into larger writes. At the same time, the log plays an important role in the guarantee that a database has to make. To reiterate ad nauseam, the guarantee is:

If you send me a statement that changes data and I acknowledge that statement, whatever data you changed is now safely on disk

This guarantee is the cornerstone of relational databases, the thing which makes them so powerful and part of the reason they run the world in ways you can bet your life on.

In my blog on high speed DML I talked about how we can make this guarantee while still achieving high speed writes. In that blog (go read it, if you have not) I alluded to the fact that transaction logs can also be used for resumable replication. If you paid careful attention to that argument - you might have some beef with the details. I did in fact pull a fast one on you - which I will remedy today.

Let us talk about how databases sequence time.

The Order of Things

What we know about transaction logs so far:

  1. In order to make data persistent, I must put an entry into the transaction log (either the data itself or the allocation of that data)
  2. The transaction log is the master record of where data is in the database
  3. The log writes go directly to disk, bypassing any memory buffering. When we say "safely on disk" we mean it!
  4. The writes in the transaction logs are sequential, with each data modification appending to the log

Does something about the above stand out as odd?

Why do the writes need to be sequential (bullet 4)? Why can't I just write "somewhere" on disk, using the aggregation tricks we already discussed in the logging to disk blog? In fact, wouldn't having multiple logging structures potentially lead to a more scalable database implementation?

Where is the Metadata?

We want a database which supports replicas (and their related concept: backups) easily and efficiently. Typically, there will be one primary database and one or more secondary replicas in a production system. Not only does that provide disaster recovery - it can also be used for scaling read traffic.

To maintain a replica, we need a way to track changes to every database object. Those changes can then be applied to the secondary when they occur on the primary. Change tracking could be done by comparing everything in the primary with everything on the secondary on a regular basis - and then transmitting the difference.

Note: we don't need to compare data in every table, since we already track allocation of data separately from the data itself. We just need to compare the metadata of the database. Think of this as comparing snapshots of the primary metadata with snapshots of the secondary metadata.

Some back of the envelope calculations:

Metadata as Snapshots - Thought Experiment

Assume I need 16 bytes to track the location of a disk block, along with its version number (so I can see what has changed). We already assumed that disk blocks are at least 8KB in size - and can go all the way to 100MB or above per block.

A 1TB database made up of 100MB disk blocks would need 10,000 metadata entries to track all allocations. That's 160KB of metadata. At 100TB (not unreasonable for a large database) we are looking at 100x that - around 16MB of metadata. This assumes we only track changes to entire blocks of 100MB. Doing a full "delta" between the primary and secondary is viable at these sizes; we can easily keep the metadata memory resident, and comparing two data structures of that size can be done pretty fast on a modern machine. It is not a good use of computing resources, but it is doable.
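To make the arithmetic concrete, here is a minimal sketch of the calculation. The entry size and block sizes are the assumptions stated above, nothing more:

ENTRY_BYTES = 16            # location + version number per tracked block, as assumed above
MB = 10**6
TB = 10**12

def metadata_bytes(db_bytes, block_bytes):
    # One metadata entry per allocated block
    return (db_bytes // block_bytes) * ENTRY_BYTES

print(metadata_bytes(1 * TB, 100 * MB))     # 160,000 bytes - about 160KB
print(metadata_bytes(100 * TB, 100 * MB))   # 16,000,000 bytes - about 16MB
print(metadata_bytes(1 * TB, 8 * 1024))     # ~2GB if we tracked every 8KB block instead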

If we use allocations of 100MB per block, any change to data smaller than 100MB will need special tracking. We are not going to tell the secondary: "Hey, we changed one row in this 100MB block - go fetch all the data and figure out which one it was". That would not only be inefficient, it would be embarrassing to admit we even tried that.

To improve the situation, we could have each metadata block contain additional metadata about the rows that changed inside that block. But - you are still losing this metadata game. An evil opponent would now be able to write a query which changes half the rows in every disk block. This would result in a lot of extra metadata - orders of magnitude more. Doing a "full delta" between the primary and the secondary would now be computationally infeasible. We would have a very bizarre form of unexplainable behaviour, where some queries that are "just wrong" would grind the replica to a halt.

What about the concurrency impact of algorithms of this form? If we want to compare the primary metadata with the secondary metadata, we need an immutable view of both sides while we perform the compare. But we don't want the primary to freeze its metadata while we do this (it would become read-only if we did). That means we need a way to version metadata while we query it and keep track of the changes that occur while we read it. Not trivial to implement.

Metadata as a Delta, not a Snapshot

The snapshot thought experiment should hopefully have given you some optimisation ideas. An obvious optimisation would be to use a data structure to keep track of the metadata which changed since the last replication event. The primary would keep track of the state of the replica and decide what to transmit based on the tracked delta.

What happens if a replica decides to go missing for a long time or has a network disconnect?

The metadata delta we maintain for the replica will grow large. Humour me as I go all "cloudy" and dismiss that with: "we can throw cloud hardware at that". But something more sinister is happening. To bring a replica up to date - we must apply all the pending metadata changes in a single transaction - potentially shipping a lot of data over the network. That results in a massive burst of replication traffic. By the time we are done applying the changes - we may have built up yet another pile of pending metadata. This "catch up" phase could go on for a long time - potentially forever if the primary is writing fast. We will also need a large amount of disk or memory to keep the pending changes around on the replica until they are applied.

In the worst case, where the primary has changed all data since the replica went offline, the replica will need twice the storage of the primary to come back online:

  1. One copy of the storage for the current state of the replica (which we need to roll back to if we fail to apply the delta). This allows people to read the replica while we sort out the mess.
  2. A copy of the delta which must be applied in a single transaction

What happens if we have more than one replica? We could keep a list of changes for each replica. Each of these lists contains a lot of redundant information - but they won't contain exactly the same information. This is because each replica may be at a different "point in time" compared with the primary. The size of the metadata we need to keep is roughly proportional to the number of replicas we have. This is not scalable for read heavy workloads with many replicas - nor is it easy to manage for the poor DBA who needs to track it.
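A sketch of why this hurts (the names and structures here are mine, purely for illustration): with one change list per replica, every change is stored once per lagging replica, so memory grows with change rate, replica lag and replica count all multiplied together.

# Hypothetical per-replica delta tracking - every change is duplicated into
# each replica's private list and only freed when that replica catches up.
pending = {"replica_1": [], "replica_2": [], "replica_3": []}

def record_change(change):
    for change_list in pending.values():
        change_list.append(change)          # same change stored once per replica

def replica_caught_up(name):
    pending[name].clear()                   # frees only this replica's copy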

Metadata as an Ordered Buffer

What if every change to the database metadata added an element to an ordered buffer? The head of the buffer would contain the oldest change and the tail the latest change. If we gave each change a unique, sequential identifier, a replica would only need to keep track of what change it has last seen - and then read the buffer from that point onwards.

Let us call each delta on the primary Dx. Assume we have a primary with 10 changes like this:

Primary Buffer
D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10

Here, D1 is the oldest change and D10 is the latest change.

We have two replicas. Replica 1 has fallen far behind and has only seen D2. The other, Replica 2, is reasonably up-to-date and has seen D7.

It is now trivial to bring both replicas in sync with the primary:

Replica 1: 
Apply: D3->D4->D5->D6->D7->D8->D9->D10

Replica 2:
Apply: D8->D9->D10 

But wait, it gets better! We don't need to wait for all the changes to be applied before we allow users to see the updated values. We don't even need to copy the sum of the deltas over the wire in one shot.

We can apply one change at a time, and after every change - the replica is in a state consistent with that delta. In other words: There existed a point in time where the data looked exactly like what we observe on the replica. For example, we may apply D3->D4 to Replica 1 and what the users will see is the data "as it looked" on the primary after D4 occurred. This is called a transactionally consistent view of the data (the C in ACID).

Once all replicas are up-to-date - we can then remove the changes from the Primary Buffer. This way of replicating data is incredibly powerful, because we have de-duplicated all the delta information. We can now maintain any number of replicas with near zero cost.
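Here is a minimal sketch of the mechanism (the class and function names are mine, not any particular engine's): the primary numbers every delta, and each replica only remembers the id of the last delta it applied.

def apply_delta(delta):
    pass                                    # stand-in for applying one delta to local storage

class Primary:
    def __init__(self):
        self.buffer = []                    # ordered list of (change_id, delta)
        self.next_id = 1

    def record(self, delta):
        self.buffer.append((self.next_id, delta))
        self.next_id += 1

    def changes_since(self, last_seen):
        # Everything with an id greater than last_seen, in order
        return [(cid, d) for cid, d in self.buffer if cid > last_seen]

    def truncate(self, oldest_still_needed):
        # Drop deltas every replica (and backup) has already consumed
        self.buffer = [(cid, d) for cid, d in self.buffer if cid >= oldest_still_needed]

class Replica:
    def __init__(self):
        self.last_seen = 0

    def catch_up(self, primary):
        for cid, delta in primary.changes_since(self.last_seen):
            apply_delta(delta)
            self.last_seen = cid            # after every delta we sit at a consistent point

In the example above, Replica 1 would call catch_up with last_seen = 2 and Replica 2 with last_seen = 7 - same buffer, no duplicated bookkeeping.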

If I want to take incremental backups of the data - the data structure above makes it trivial. All I need to do is read the deltas I have not seen since I last took a backup and copy the data pointed at by that metadata into the backup media.
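Continuing the sketch above, an incremental backup is just another consumer of the same buffer; backup_media here is a hypothetical object with a write() method, standing in for whatever the backup target is.

def incremental_backup(primary, last_backed_up, backup_media):
    # Copy only the deltas we have not backed up yet, oldest first
    for cid, delta in primary.changes_since(last_backed_up):
        backup_media.write(delta)
        last_backed_up = cid
    return last_backed_up                   # remember this for the next incremental backup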

Metadata Buffers and Log Sequencing

Hopefully, I have convinced you that a metadata buffer is not only a good idea, it is the optimal data structure for tracking changes in a database.

Earlier, I said:

  1. The writes in the transaction logs are sequential, with each data modification appending to the log

Recall that the transaction log is the master record of all changes to database allocations. It observes every change. If we always append to the log - the log is also the metadata buffer we just described. If all changes, no matter what table they touch, go to the same transaction log - we have essentially sequenced changes in the database into a single, linear space.

The A in the ACID property of a database requires that a transaction touching multiple tables must either make all its changes fully visible at the same time or none of them. By writing all changes to the same log file, we achieve this.
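A tiny sketch of that idea (my own simplification, not any engine's actual record format): every change carries its transaction id, and a transaction only "happens" when its commit record lands in the same sequential log.

log = [
    ("change", "tx1", "table_a", "row 1 -> v2"),
    ("change", "tx2", "table_b", "row 7 -> v9"),
    ("change", "tx1", "table_c", "row 3 -> v5"),
    ("commit", "tx1"),
    # tx2 never reached its commit record - perhaps the server crashed here
]

committed = {rec[1] for rec in log if rec[0] == "commit"}

# On replay, either all of a transaction's changes apply or none of them do
replay = [rec for rec in log if rec[0] == "change" and rec[1] in committed]
print(replay)        # only tx1's two changes, spanning both tables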

By sequencing all changes - we have (almost by accident) implemented something very powerful.

Time Travel

Look at this sequence again:

Primary Buffer
D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10

Assume I am in a state where I know all changes up to and including D1.

If I apply D2 to that state, I get the state of the database as it was after D2. I have effectively travelled to the point in time that D2 represents and I would look at the data "as it looked" on that day. In fact, I could travel to any point in time represented by D1 up to D10 by just applying the log changes to the database.

Taking this to its extreme conclusion: if we keep every delta ever written to the log, we can travel to any point in the history of the database.

Obviously, reconstructing the database in this way is not cheap. But we can optimise by regularly taking snapshots of the resulting metadata. For example, we could take snapshots every 3 changes:

Buffer:      D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10     
        
Snapshot: S0                S1                S2                S3

If I want to travel to the point in time represented by D5 - I can start from S1 and apply D4->D5. Our previous back-of-the-envelope calculation showed us that metadata is a fraction of the total database size. As long as we don't take snapshots too frequently, the cost of keeping these snapshots around is manageable.
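Here is a sketch of what that time travel looks like in code (snapshot spacing and data structures are mine, chosen to match the diagram above):

SNAPSHOT_EVERY = 3

# D1..D10 as in the diagram above; each delta is a dict of row -> new value
deltas = {n: {f"row_{n}": f"after D{n}"} for n in range(1, 11)}

# Build S0..S3: snapshots[k] is the full state after delta number k * SNAPSHOT_EVERY
snapshots = {0: {}}
state = {}
for n in range(1, 10):
    state.update(deltas[n])
    if n % SNAPSHOT_EVERY == 0:
        snapshots[n // SNAPSHOT_EVERY] = dict(state)

def state_at(target):
    # Start from the newest snapshot at or before the target, then roll forward
    start = (target // SNAPSHOT_EVERY) * SNAPSHOT_EVERY
    result = dict(snapshots[start // SNAPSHOT_EVERY])
    for n in range(start + 1, target + 1):
        result.update(deltas[n])
    return result

print(state_at(5))   # S1 (the state after D3) with D4 and D5 applied on top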

Readers who are not yet asleep might think of another optimisation: Keep both the before and after value of the change in the log. We could make the log reversible and get something like this:

Redo: D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10
Undo: D1 <- D2 <- D3 <- D4 <- D5 <- D6 <- D7 <- D8 <- D9 <- D10

Assume I find myself at D10 after an "oh s***" event where a user issued a massive DELETE and committed it. I could use the reverse delta and rewind by applying D9<-D10 - which brings the database back to D9 - before the DELETE. Databases typically refer to the reverse delta as the "undo".
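A sketch of what a reversible log entry buys us (the record layout and row name are mine, for illustration only): keep both the before and after image, and undoing a delta is just re-applying the before image.

log = [
    {"row": "order_42", "before": "status=open",    "after": "status=shipped"},  # D9
    {"row": "order_42", "before": "status=shipped", "after": None},              # D10: the big DELETE
]

state = {"order_42": None}        # where we are after D10 - the row is gone

def undo(entry, state):
    state[entry["row"]] = entry["before"]     # rewind one delta using its before image

undo(log[1], state)               # undo D10
print(state)                      # {'order_42': 'status=shipped'} - back at D9, before the DELETE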

Log Sequence Numbers and Friends

At this point, I hope you can visualise the database log as a sequence of changes that have occurred in the database. It is a logical clock ordered by some integer value. Typically, databases will use a 64-bit integer for this sequence number. We call this number the "Log Sequence Number" (LSN). By picking a 64-bit integer to represent it, we ensure the LSN increases with every write and never wraps around. Because as we all know, you never run out of 64-bit integers no matter how many writes you do.

Another way to think of the LSN is as a virtual address into wall clock time. As we saw, keeping the transaction log around forever allows us to travel to any "LSN point in time" of the database. Obviously, storing an ever-growing log on disk may be prohibitively expensive. But, we can mark an "event horizon" that is the furthest point in the past we will travel to.
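In code, the LSN is nothing more exotic than a monotonically increasing counter handed out with every log write. A minimal sketch (the structures are mine), including an event horizon that drops the oldest history:

import itertools

lsn_allocator = itertools.count(start=1)    # 64-bit in a real engine; Python ints just keep growing
log = {}                                    # lsn -> log record

def append(record):
    lsn = next(lsn_allocator)
    log[lsn] = record
    return lsn

def set_event_horizon(horizon_lsn):
    # Forget everything older than the horizon - we can no longer travel there
    for lsn in [l for l in log if l < horizon_lsn]:
        del log[lsn]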

Transaction Sequence Numbers

Most databases use something called a transaction id or transaction sequence number to supplement the LSN.

The transaction number is typically used, together with locks, to control who can see what in a database. Though the concept is very similar to the LSN - it is not the same thing.

Databases have different ways to handle the relationship between the LSN and the transaction number. Let us look at a few case studies.

SQL Server: LSN and XSN

When a transaction starts in SQL Server, it is assigned a unique, 64-bit XSN. The database has a map of all the currently active transactions and their XSN. Transactions touching data will add a version pointer to that data, telling readers how to undo the changes the transaction made. The version pointer contains the XSN of the transaction modifying the data. The version pointers are stored in the tempdb database (which can be a source of contention for high speed workloads).

When data is read from a page in memory, the algorithm roughly goes like this:

if not memory_page.has_version_pointer():
    # Unmodified data, no other writers. We can read this, because there are no changes to it

    return memory_page

if memory_page.version_pointer.xsn < transaction.xsn:
    # Someone touched this page before me. Can I read the data safely?

    if memory_page.version_pointer.xsn not in transaction.active_snapshot:
        # Data was modified by someone who has committed. We can read the data

        return memory_page

# Data was modified by someone and they have not committed, we need to find the previous version

return reconstruct_old_data(memory_page.version_pointer)

The version pointer is a "dirty" marker - it tells readers that they may need to look somewhere other than the page they are currently looking at for the data they want to read.

Because SQL Server stores data that is committed to disk in its data files (along with the metadata of its allocation), we know that when we read a page from disk - it must have been committed. This greatly simplifies visibility checks. When SQL server flushes committed data to disk, it also writes the LSN of that data to the page. This makes it easy for backup and replication to find the changed data (we just have to find all the LSN bigger than the one we already backed up).

Note that SQL Server's algorithm is optimised for short-lived transactions where the version data can be purged quickly. It is also optimised for a read heavy workload because we generally try to keep the most current version of the disk block quickly accessible.

DB2: LSN and UOW ID

DB2, these days mostly forgotten, uses a concept similar to SQL Server. It has an LSN keeping track of the write ordering. It calls its version of the transaction number the "Unit of Work ID" (UOW ID).

Many of the old DB2 researchers ended up working on SQL Server - so it is hardly surprising the concepts are similar.

MySQL with InnoDB - LSN and db_trx_id

Again, we are dealing with something very similar to DB2 and SQL Server. InnoDB uses an LSN to track the write ordering and uses transaction numbering (which they call db_trx_id) to track the visibility of data.

Writers modifying rows provide undo versions of those rows. The LSN is written to the data files when the block is flushed from memory.

Postgres LSN and txid

Like the databases we have seen so far, Postgres uses an LSN to track changes in the database. The LSN, as expected, is the logical clock which orders the writes. It can be inspected with pg_current_wal_lsn(). Postgres calls the transaction log the "Write Ahead Log" (WAL) and its workings generally map to everything you know about transaction logs so far. The LSN in Postgres is a 64-bit integer.

Unfortunately, that's where the sanity ends.

Like SQL Server, Postgres has the notion of a transaction number. In Postgres, it is called: txid. The txid is a 32-bit value (for bizarre, historical reasons). Since any database worth running will have more than 4B transactions arriving at its door, Postgres has come up with a hacky way to "wrap" the txid around. This is called a "transaction wraparound" and it's a sick implementation that I will spare you the details of describing.

From here, it gets even more bizarre. Unlike most other databases, Postgres stores two values on every row (or "tuple", as Postgres likes to call it because it's old school): xmin and xmax. The xmin is the txid that brought the row into existence, the xmax (if set) is the txid that deleted the row. Why would you use a good naming convention for central concepts in your database?

Similar to other databases, when a transaction starts - it is assigned a txid and it keeps track of all other active transactions at that time. Unlike other databases, the min and max txid for concurrently running transactions are also captured in the snapshot - and called: xmin and xmax - not to be confused with the related concept on the row. For an in-depth treatment of the data structure, see this excellent blog by Jan Nidzwetzki: Introduction to Snapshots and Tuple Visibility in PostgreSQL

A running transaction follows (roughly) this algorithm to decide if data is visible:

if row.xmax < transaction.xmin:
    # Deleted data before my snapshot

    return None

if row.xmax is None and row.xmin not in transaction.active_xid:
    # Data is not deleted, and it is not currently being changed

    return row

if row.xmax >= transaction.xmax:
    # A transaction started after the snapshot was taken and has deleted the row. We can still see it

    return row

if row.xmax in transaction.active_xid:
    # A transaction is currently deleting the row. But it isn't me, so I can see it

    return row

if row.xmin in transaction.active_xid:
    # A transaction is currently creating the row. But it isn't me, so I can't see it

    return None

if row.xmin < transaction.xmin:
    # Row was created before the snapshot was taken, I can see it

    return row

if row.xmin > transaction.xmax:
    # Row was created after the snapshot was taken, I can't see it

    return None

As you might be able to infer, this is a mess! Since txids are only 32 bits - you end up with a lot of bookkeeping and the need to "freeze" txids. There is also the issue of having to store the xmin and xmax on every row - which adds overhead to the storage. You can think of xmax as a form of tombstone in Postgres - so it could be argued that xmax is needed. But, once xmin is on disk and the transaction it represents has committed, it is just wasted space. Fortunately, if you are running Postgres you probably don't have a big database - so you might not care about the extra bytes.

Fortunately, Postgres does write the LSN that last modified the page to disk (in the pd_lsn field of the page header), which means you can reconstruct the write order of the database (and also means you don't need the xmin and xmax).

Postgres' transaction logic, while solid from an ACID perspective, is a very poor implementation of transaction management. By now you can probably see why.

Let us put the final nail in the coffin and see what state-of-the-art looks like.

Oracle SCN

Oracle uses a System Change Number (SCN) to track changes in the database. Unlike most other databases, Oracle has chosen (wisely, I claim) to unify the "transaction clock" and the "log clock".

When a transaction starts in Oracle, it reads the current SCN. On the surface, the SCN is similar to the LSN in other databases: it is a monotonically increasing value that represents a specific point in the write timeline of the database. The SCN you read when the transaction starts is called the START_SCN.

When the transaction commits, it gets assigned a COMMIT_SCN and all data it touched will be updated with that value.

When a transaction reads data, it follows this very simple algorithm:

if transaction.start_scn >= disk_block.commit_scn:
    # Data was committed before the transaction started

    return disk_block
else:
    # Data was committed after the transaction started, not visible

    return None

How do we get into a state where this is possible? How on earth does the COMMIT_SCN get written to disk, so we can use it this way?

In Oracle, like in other databases, each transaction has a transaction number - but that is where the resemblance ends. Oracle calls this unique number the xid. What exactly the xid really is doesn't actually matter - just think of it like a pointer.

When a transaction touches data in Oracle, it will write to a special part of the block metadata called the "Interested Transaction List" (ITL). It will add the xid of the writing transaction to the ITL. This tells readers of that block: "If you are reading this block - the data you are looking at has been modified, you should reconstruct it by looking at the ITL".

When a transaction commits in Oracle, it receives a COMMIT_SCN. This SCN is the current SCN of the database at the time of the commit (i.e. the logical clock) and is unique to that transaction commit. The COMMIT_SCN is written to the redo log (which is Oracle's version of the transaction log) along with the xid. With this information safely on disk, we now have a few choices:

  1. Immediately visit the metadata of blocks touched by the transaction and set the COMMIT_SCN in the block (clearing out the ITL entry in the process)
  2. Do nothing and wait for the next reader of the block to come along and set the COMMIT_SCN for me.
  3. Visit the blocks with ITL entries when we have some CPU time to spare and clean up the entries and set the COMMIT_SCN on the block.

All of these operations can be done without taking any locks and can be perfectly amortised over time.
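Here is a rough sketch of option 2 from the list above (my own simplification of the bookkeeping, not Oracle's actual structures): the commit only records xid -> COMMIT_SCN durably, and blocks are stamped lazily by whoever reads them next.

committed_scn = {}                  # xid -> COMMIT_SCN; in reality made durable via the redo log

class Block:
    def __init__(self):
        self.itl = set()            # xids of transactions that have touched this block
        self.commit_scn = 0

def commit(xid, current_scn):
    committed_scn[xid] = current_scn        # one durable write; no need to revisit blocks now

def read(block):
    for xid in list(block.itl):
        scn = committed_scn.get(xid)
        if scn is not None:                 # the writer has committed: stamp the block, clean the ITL
            block.commit_scn = max(block.commit_scn, scn)
            block.itl.discard(xid)
    return block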

This is an extraordinarily elegant way to handle logical transaction clocks. It also opens the path to an almost trivial implementation of time travel - which Oracle calls "Flashback".

Summary

Today, I have hit you with a wall of text about how databases sequence time. If you are still with me: Well done and thanks!

A summary for the TL;DR crowd:

  1. Writing all changes sequentially to a single transaction log turns the log into an ordered buffer of deltas - ideal for replication, incremental backup and time travel
  2. Each log write gets a monotonically increasing, 64-bit Log Sequence Number (LSN) - the logical clock of the database
  3. Replicas and backups only need to remember the last LSN they have seen to catch up cheaply - no matter how many of them there are
  4. Transaction numbers (XSN, UOW ID, db_trx_id, txid) are related to, but distinct from, the LSN - they control what concurrent transactions can see
  5. Oracle's SCN unifies the transaction clock and the log clock, which makes visibility checks and time travel (Flashback) remarkably simple

Enjoy the rest of your weekend - and don't hesitate to catch me on LinkedIn for the next online fight.

PS: if this information is worth paying for - tip me on Patreon. Surely, if stupid YouTubers can get paid to post content about cats - this blog might be worth a few bucks to you?