Why are Databases so Hard to Make? Part 5 - Log Sequencing
In a previous post we saw how databases write their logs to disk. The transaction log acts as an aggregator of many small writes, turning them into larger ones. At the same time, the log plays an important role in the guarantee a database has to make. To reiterate ad nauseam, the guarantee is:
If you send me a statement that changes data and I acknowledge that statement, whatever data you changed is now safely on disk
This guarantee is the cornerstone of relational databases, the thing which makes them so powerful and part of the reason they run the world in ways you can bet your life on.
In my blog on high speed DML I talked about how we can make this guarantee while still achieving high speed writes. In that blog (go read it, if you have not) I alluded to the fact that transaction logs can also be used for resumable replication. If you paid careful attention to that argument - you might have some beef with the details. I did in fact pull a fast one on you - which I will remedy today.
Let us talk about how databases sequence time.
The Order of Things
What we know about transaction logs so far:
- In order to make data persistent, I must put an entry into the transaction log (either the data itself or the allocation of that data)
- The transaction log is the master record of where data is in the database
- The log writes go directly to disk, bypassing any memory buffering. When we say "safely on disk" we mean it!
- The writes in the transaction logs are sequential, with each data modification appending to the log
Does something about the above stand out as odd?
Why do the writes need to be sequential (bullet 4)? Why can't I just write "somewhere" on disk, using the aggregation tricks we already discussed in the logging to disk blog? In fact, wouldn't having multiple logging structures potentially lead to a more scalable database implementation?
Where is the Metadata?
We want a database which supports replicas (and their related concept: backups) easily and efficiently. Typically, there will be one primary database and one or more secondary replicas in a production system. Not only does that provide disaster recovery - it can also be used for scaling read traffic.
To maintain a replica, we need a way to track changes to every database object. Those changes can then be applied to the secondary when they occur on the primary. Change tracking could be done by comparing everything in the primary with everything on the secondary on a regular basis - and then transmitting the difference.
Note: we don't need to compare data in every table, since we already track allocation of data separately from the data itself. We just need to compare the metadata of the database. Think of this as comparing snapshots of the primary metadata with snapshots of the secondary metadata.
Some back of the envelope calculations:
Metadata as Snapshots - a Thought Experiment
Assume I need 16 bytes to track the location of a disk block, along with its version number (so I can see what has changed). We already assumed that disk blocks are at least 8KB in size - and can go all the way to 100MB or above per block.
A 1TB database made up of 100MB disk blocks would need 10,000 metadata entries to track all allocations. That's 160KB of metadata. At 100TB (not unreasonable for a large database) we are looking at 100x that - going into 16MB of metadata. This assumes we only track changes to entire blocks of 100MB. Doing a full "delta" between the primary and secondary is viable at these sizes; at the very least, we can keep the metadata memory resident. Comparing two 160KB data structures can be done very fast on a modern machine. It is not a good use of computing resources, but it is doable. With enough CPU cores, it is even doable with 16MB of metadata.
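If you want to sanity check these numbers yourself, the arithmetic fits in a few lines of Python (a throwaway sketch; the entry and block sizes are just the assumptions from above):

```python
ENTRY_BYTES = 16            # location + version number per tracked block
BLOCK_BYTES = 100 * 10**6   # 100MB allocation blocks

def metadata_bytes(db_bytes: int) -> int:
    """Metadata needed to track every allocation block in the database."""
    blocks = db_bytes // BLOCK_BYTES
    return blocks * ENTRY_BYTES

print(metadata_bytes(10**12))  # 1TB database   -> 160,000 bytes (160KB)
print(metadata_bytes(10**14))  # 100TB database -> 16,000,000 bytes (16MB)
```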
If we use allocations of 100MB per block, any change to data smaller than 100MB will need special tracking. We are not going to tell the secondary: "Hey, we changed one row in this 100MB block - go fetch all the data and figure out which one it was". That would not only be inefficient, it would be embarrassing to admit we even tried that.
To improve the situation, we could have each metadata block contain additional metadata about the rows that changed inside that block. But - you are still losing this metadata game. An evil opponent would now be able to write a query which changes half the rows in every disk block. This would result in a lot of extra metadata - orders of magnitude more. Doing a "full delta" between the primary and the secondary would now be computationally infeasible. We would have a very bizarre form of inexplicable behaviour, where some queries that are "just wrong" would grind the replica to a halt.
What about the concurrency impact of algorithms of this form? If we want to compare the primary metadata with the secondary metadata, we need an immutable view of both sides while we perform the compare. But we don't want the primary to freeze its metadata while we do this (it would become read-only if we did). That means we need a way to version metadata while we query it and keep track of the changes that occur while we read it. Not trivial to implement.
Metadata as a Delta, not a Snapshot
The snapshot thought experiment should hopefully have given you some optimisation ideas. An obvious optimisation would be to use a data structure to keep track of the metadata which changed since the last replication event. The primary would keep track of the state of the replica and decide what to transmit based on the tracked delta.
What happens if a replica decides to go missing for a long time or has a network disconnect?
The metadata delta we maintain for the replica will grow large. Humour me as I go all "cloudy" and dismiss that with: "we can throw cloud hardware at it". But something more sinister is happening. To bring a replica up to date - we must apply all the pending metadata changes in a single transaction - potentially shipping a lot of data over the network. That results in a massive burst of replication traffic. By the time we are done applying the changes - we may have built up yet another pile of pending metadata. This "catch up" phase could go on for a long time - potentially forever if the primary is writing fast. We will also need a large amount of disk or memory to keep the pending changes around on the replica until they are applied.
In the worst case, where the primary has changed all data since the replica went offline, the replica will need twice the storage of the primary to come back online:
- One copy of the storage for the current state of the replica (which we need to roll back to if we fail to apply the delta). This allows people to read the replica while we sort out the mess.
- A copy of the delta which must be applied in a single transaction
What happens if we have more than one replica? We could keep a list of changes for each replica. Each of these lists contains a lot of redundant information - but they won't contain exactly the same information. This is because each replica may be at different "points in time" as compared with the primary. The size of the metadata we need to keep is roughly proportional to the number of replicas we have. This is not scalable for read heavy workloads with many replicas - nor is it easy to manage for the poor DBA who needs to track it.
Metadata as an Ordered Buffer
What if every change to the database metadata added an element to an ordered buffer? The head of the buffer would contain the oldest change and the tail the latest change. If we gave each change a unique, sequential identifier, a replica would only need to keep track of what change it has last seen - and then read the buffer from that point onwards.
Let us call each delta on the primary Dx. Assume we have a primary with 10 changes like this:

```
Primary Buffer
D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10
```

Here, D1 is the oldest change and D10 is the latest change. We have two replicas. Replica 1 has fallen far behind and has only seen D2. The other, Replica 2, is reasonably up-to-date and has seen D7.

It is now trivial to bring both replicas in sync with the primary:

```
Replica 1:
Apply: D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10

Replica 2:
Apply: D8 -> D9 -> D10
```
But wait, it gets better! We don't need to wait for all the changes to be applied before we allow users to see the updated values. We don't even need to copy the sum of the deltas over the wire in one shot.
We can apply one change at a time, and after every change - the replica is in a state consistent with that delta. In other words: there existed a point in time where the data looked exactly like what we observe on the replica. For example, we may apply D3 -> D4 to Replica 1 and what the users will see is the data "as it looked" on the primary after D4 occurred. This is called a transactionally consistent view of the data (the C in ACID).

Once all replicas are up-to-date - we can then remove the changes from the Primary Buffer. This way of replicating data is incredibly powerful, because we have de-duplicated all the delta information. We can now maintain any number of replicas at near zero cost.
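As a minimal sketch of the idea (the class and its names are mine, not anything a real database exposes), each replica is just an integer cursor into one shared, ordered buffer, and the primary can discard everything below the slowest cursor:

```python
from collections import deque

class ChangeBuffer:
    """Ordered buffer of (sequence_number, delta) pairs shared by all replicas."""

    def __init__(self):
        self.buffer = deque()
        self.next_seq = 1

    def append(self, delta):
        self.buffer.append((self.next_seq, delta))
        self.next_seq += 1

    def changes_since(self, last_seen):
        # Everything a replica that has seen up to last_seen still needs
        return [(seq, d) for seq, d in self.buffer if seq > last_seen]

    def truncate(self, replica_positions):
        # Drop the changes that every replica has already applied
        oldest_needed = min(replica_positions)
        while self.buffer and self.buffer[0][0] <= oldest_needed:
            self.buffer.popleft()

buf = ChangeBuffer()
for delta in ["D1", "D2", "D3", "D4"]:
    buf.append(delta)

replica_1, replica_2 = 2, 3           # last sequence number each replica has seen
print(buf.changes_since(replica_1))   # [(3, 'D3'), (4, 'D4')]
buf.truncate([replica_1, replica_2])  # D1 and D2 are no longer needed by anyone
```

Note how adding a replica costs nothing but one more integer - the deltas themselves are shared.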
If I want to take incremental backups of the data - the data structure above makes it trivial. All I need to do is read the deltas that have arrived since my last backup and copy the data pointed at by that metadata onto the backup media.
Metadata Buffers and Log Sequencing
Hopefully, I have convinced you that a metadata buffer is not only a good idea, it is the optimal data structure for tracking changes in a database.
Earlier, I said:
- The writes in the transaction logs are sequential, with each data modification appending to the log
Recall that the transaction log is the master record of all changes to database allocations. It observes every change. If we always append to the log - the log is also the metadata buffer we just described. If all changes, no matter what table they touch, go to the same transaction log - we have essentially sequenced changes in the database into a single, linear space.
The A in the ACID property of a database requires that a transaction touching multiple tables must either make all changes fully visible at the same time or none of them must be visible. By writing all changes to the same log file we achieve this:
- Rollback: Mark the transaction as aborted in the log and undo the changes it has made (we have those in the log)
- Commit: We mark the transaction as committed and signal that anyone on the replica who has observed this log entry can assume the previous log entries for the same transaction are now valid
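To make the commit rule concrete, here is a hedged sketch of how a replica could consume such a log. The record shapes are invented for illustration; the point is only that no change becomes visible until the commit record for its transaction has been observed:

```python
def apply_log(log_entries, visible_state):
    """Apply a sequenced log, honouring transaction boundaries.

    Each entry is a (txid, kind, payload) tuple, where kind is one of
    'change', 'commit' or 'abort'. Changes stay pending until we see
    the commit record for their transaction.
    """
    pending = {}  # txid -> changes that are not yet visible
    for txid, kind, payload in log_entries:
        if kind == "change":
            pending.setdefault(txid, []).append(payload)
        elif kind == "commit":
            # All earlier log entries for this txid become visible at once
            for key, value in pending.pop(txid, []):
                visible_state[key] = value
        elif kind == "abort":
            pending.pop(txid, None)  # the changes are simply discarded
    return visible_state

log = [
    (1, "change", ("a", 10)),
    (2, "change", ("b", 20)),
    (1, "commit", None),
    (2, "abort", None),
]
print(apply_log(log, {}))  # {'a': 10} - transaction 2 never becomes visible
```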
By sequencing all changes - we have (almost by accident) implemented something very powerful.
Time Travel
Look at this sequence again:
```
Primary Buffer
D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10
```
Assume I am in a state where I know all changes as per D1. If I apply D2 to that state, I get the state of the database as it was after D2. I have effectively travelled to the point in time that D2 represents and I would look at data "as it looked" on that day. In fact, I could travel to any point in time represented by D1 up to D10 by just applying the log changes to the database.
Taking this to its extreme conclusion:
- If D1 represents the very first change to the database
- And if I keep the transaction log forever
- Then: I can travel to any point in time in the history of the database
Obviously, reconstructing the database in this way is not cheap. But we can optimise by regularly taking snapshots of the resulting metadata. For example, we could take snapshots every 3 changes:
```
Buffer:      D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10
Snapshot: S0               S1                S2                S3
```
If I want to travel to the point in time represented by D5 - I can start from S1 and apply D4 -> D5. Our previous back of the envelope calculation showed us that metadata is a fraction of the total database size. As long as we don't take snapshots too frequently, the cost of keeping these snapshots around is manageable.
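In sketch form (the dictionaries are stand-ins for real snapshot and log storage), point-in-time reconstruction is: find the newest snapshot at or before the target change, then replay the deltas from there:

```python
def travel_to(target, snapshots, deltas):
    """Reconstruct the state as of change number `target`.

    snapshots: {change_number: state} taken after that change
    deltas:    {change_number: {key: new_value}} for every change
    """
    # Newest snapshot at or before the target change
    base = max(n for n in snapshots if n <= target)
    state = dict(snapshots[base])
    # Replay every delta after the snapshot, up to and including the target
    for n in range(base + 1, target + 1):
        state.update(deltas[n])
    return state

snapshots = {0: {}, 3: {"x": 3}, 6: {"x": 6}, 9: {"x": 9}}  # S0, S1, S2, S3
deltas = {n: {"x": n} for n in range(1, 11)}                # D1 .. D10
print(travel_to(5, snapshots, deltas))  # {'x': 5} - starts at S1, applies D4, D5
```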
Readers who are not yet asleep might think of another optimisation: Keep both the before and after value of the change in the log. We could make the log reversible and get something like this:
```
Redo: D1 -> D2 -> D3 -> D4 -> D5 -> D6 -> D7 -> D8 -> D9 -> D10
Undo: D1 <- D2 <- D3 <- D4 <- D5 <- D6 <- D7 <- D8 <- D9 <- D10
```
Assume I find myself at D10 after an "oh s***" event where a user issued a massive DELETE and committed it. I could use the reverse delta and rewind by applying D9 <- D10 - which brings the database back to D9 - before the DELETE. Databases typically refer to the reverse delta as the "undo".
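A reversible delta only needs to carry the before-image next to the after-image. A minimal sketch (the shapes are invented for illustration):

```python
def apply_redo(state, delta):
    # Roll the state forward to the after-images
    for key, (_before, after) in delta.items():
        state[key] = after

def apply_undo(state, delta):
    # Roll the state back to the before-images
    for key, (before, _after) in delta.items():
        state[key] = before

state = {"row42": "important data"}
d10 = {"row42": ("important data", None)}  # the massive DELETE as (before, after)

apply_redo(state, d10)  # the "oh s***" moment: {'row42': None}
apply_undo(state, d10)  # rewind D10
print(state)            # {'row42': 'important data'}
```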
Log Sequence Numbers and Friends
At this point, I hope you can visualise the database log as a sequence of changes that have occurred in the database. It is a logical clock ordered by some integer value. Typically, databases will use a 64-bit integer for this sequence number. We call this number the "Log Sequence Number" (LSN). By picking a 64-bit integer to represent it, we ensure the LSN increases with every write and never wraps around. Because as we all know, you never run out of 64-bit integers no matter how many writes you do.
Another way to think of the LSN is as a virtual address into wall clock time. As we saw, keeping the transaction log around forever allows us to travel to any "LSN point in time" of the database. Obviously, storing an ever-growing log on disk may be prohibitively expensive. But we can mark an "event horizon" that is the furthest point in the past we will travel to.
Transaction Sequence Numbers
Most databases use something called a transaction id or transaction sequence number to supplement the LSN. The transaction number is typically used, together with locks, to control who can see what in a database. Though the concept is very similar to the LSN - it is not the same thing.

- The LSN orders the writes that have already occurred in the database. It is the logical clock of the writes. You use it to construct replicas and backups and to time travel.
- The transaction number is a memory construct, used to keep track of which transactions are currently in progress and who can see what data. It is another logical clock that tells current transactions what they can see.

Databases have different ways to handle the relationship between the LSN and the transaction number. Let us look at a few case studies.
SQL Server: LSN and XSN
When a transaction starts in SQL Server, it is assigned a unique, 64-bit XSN. The database has a map of all the currently active transactions and their XSN. Transactions touching data will add a version pointer to that data, telling readers how to undo the changes the transaction made. The version pointer contains the XSN of the transaction modifying the data. The version pointers are stored in the tempdb database (which can be a source of contention for high speed workloads).
When data is read from a page in memory, the algorithm roughly goes like this:
```python
if not memory_page.has_version_pointer():
    # Unmodified data, no other writers. We can read this, because there are no changes to it
    return memory_page

if memory_page.version_pointer.xsn < transaction.xsn:
    # Someone touched this page before me. Can I read the data safely?
    if memory_page.version_pointer.xsn not in transaction.active_snapshot:
        # Data was modified by someone who has committed. We can read the data
        return memory_page

# Data was modified by someone who has not committed, we need to find the previous version
return reconstruct_old_data(memory_page.version_pointer)
```
The version pointer is a "dirty" marker - it tells readers that the data they want may not be on the page they are currently looking at, and that they need to look elsewhere for it.
Because SQL Server stores data that is committed to disk in its data files (along with the metadata of its allocation), we know that when we read a page from disk - it must have been committed. This greatly simplifies visibility checks.

When SQL Server flushes committed data to disk, it also writes the LSN of that data to the page. This makes it easy for backup and replication to find the changed data (we just have to find all pages with an LSN bigger than the one we already backed up).
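In sketch form (the page layout is invented for illustration), the incremental backup scan is just a filter on the stamped LSN:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    lsn: int  # LSN stamped on the page when it was flushed

def incremental_backup(pages, last_backup_lsn):
    """Pages changed since the previous backup, plus the new watermark."""
    changed = [p for p in pages if p.lsn > last_backup_lsn]
    new_watermark = max((p.lsn for p in changed), default=last_backup_lsn)
    return changed, new_watermark

pages = [Page(1, lsn=90), Page(2, lsn=140), Page(3, lsn=175)]
changed, watermark = incremental_backup(pages, last_backup_lsn=100)
print([p.page_id for p in changed], watermark)  # [2, 3] 175
```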
Note that SQL Server's algorithm is optimised for short-lived transactions where the version data can be purged quickly. It is also optimised for a read heavy workload because we generally try to keep the most current version of the disk block quickly accessible.
DB2: LSN and UOW ID

DB2, these days mostly forgotten, uses a concept similar to SQL Server. It has an LSN keeping track of the write ordering. It calls the concept of a transaction number the "Unit of Work ID" (UOW ID).
Many of the old DB2 researchers ended up working on SQL Server - so it is hardly surprising the concepts are similar.
MySQL with InnoDB: LSN and db_trx_id
Again, we are dealing with something very similar to DB2 and SQL Server. InnoDB uses an LSN to track the write ordering and uses transaction numbering (which it calls db_trx_id) to track the visibility of data. Writers modifying rows provide undo versions of those rows. The LSN is written to the data files when the block is flushed from memory.
Postgres: LSN and txid
Like the databases we have seen so far, Postgres uses an LSN to track changes in the database. The LSN, as expected, is the logical clock which orders the writes. It can be inspected with pg_current_wal_lsn(). Postgres calls the transaction log the "Write Ahead Log" (WAL) and its workings generally map to everything you know about transaction logs so far. The LSN in Postgres is a 64-bit integer.
Unfortunately, that's where the sanity ends.
Like SQL Server, Postgres has the notion of a transaction number. In Postgres, it is called the txid. The txid is a 32-bit value (for bizarre, historical reasons). Since any database worth running will have more than 4B transactions arriving at its door, Postgres has come up with a hacky way to "wrap" the txid around. This is called "transaction wraparound" and it is a sick implementation that I will spare you the details of.
From here, it gets even more bizarre. Unlike most other databases, Postgres stores two values for each row: xmin and xmax. They are stored for every row (or "tuple", as Postgres likes to call it, because it is old school). The xmin is the txid that brought the row into existence; the xmax (if set) is the txid that deleted the row. Why would you use a good naming convention for central concepts in your database?
Similar to other databases, when a transaction starts - it is assigned a txid and it keeps track of all other active transactions at that time. Unlike other databases, the min and max txid of the concurrently running transactions are also captured in the snapshot - and called xmin and xmax - not to be confused with the related concepts on the row.
For an in-depth treatment of the data structure, see this excellent blog by Jan Nidzwetzki:
Introduction to Snapshots and Tuple Visibility in PostgreSQL
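To make the algorithm below self-contained, here is a rough sketch of the two structures it consults. The field names follow the Postgres concepts described above, but the Python shapes are mine (the transaction in the algorithm carries the snapshot fields):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Row:
    data: object
    xmin: int                   # txid that brought the row into existence
    xmax: Optional[int] = None  # txid that deleted the row, if any

@dataclass
class Snapshot:
    xmin: int  # oldest txid still active when the snapshot was taken
    xmax: int  # first txid that had not yet started at snapshot time
    active_xid: set = field(default_factory=set)  # txids in flight at snapshot time
```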
A running transaction follows (roughly) this algorithm to decide if data is visible:
```python
if row.xmax is not None and row.xmax < transaction.xmin:
    # Row was deleted before my snapshot was taken
    return None

if row.xmax is None and row.xmin not in transaction.active_xid:
    # Row is not deleted and it is not currently being changed
    return row

if row.xmax is not None and row.xmax >= transaction.xmax:
    # A transaction that started after my snapshot was taken has deleted the row. I can still see it
    return row

if row.xmax in transaction.active_xid:
    # A transaction is currently deleting the row. But it isn't me, so I can still see it
    return row

if row.xmin in transaction.active_xid:
    # A transaction is currently creating the row. But it isn't me, so I can't see it
    return None

if row.xmin >= transaction.xmax:
    # Row was created after my snapshot was taken, I can't see it
    return None

# Row was created and committed before my snapshot was taken, I can see it
return row
```
As you might be able to infer, this is a mess! Since txids are only 32 bits - you end up with a lot of bookkeeping and the need to "freeze" txids. There is also the issue of having to store the xmin and xmax on every row - which adds overhead to the storage. You can think of xmax as a form of tombstone in Postgres - so it could be argued that xmax is needed. But once xmin is on disk and the transaction it represents has committed, it is just wasted space. Fortunately, if you are running Postgres you probably don't have a big database - so you might not care about the extra bytes.
Fortunately, Postgres does write the LSN that last modified the page to disk (in the pd_lsn field of the page header). Which means you can reconstruct the write order of the database (and also means you don't need the xmin and xmax).
Postgres' transaction logic, while solid from an ACID perspective, is a very poor implementation of transaction management. By now you can probably see why.
Let us put the final nail in the coffin and see what state-of-the-art looks like.
Oracle SCN
Oracle uses a System Change Number (SCN) to track changes in the database. Unlike most other databases, Oracle has chosen (wisely, I claim) to unify the "transaction clock" and the "log clock".

When a transaction starts in Oracle, it reads the current SCN. On the surface, the SCN is similar to the LSN in other databases: it is a monotonically increasing value that represents a specific point in the write timeline of the database. The SCN you read when the transaction starts is called the START_SCN.
When the transaction commits it gets assigned a COMMIT_SCN and all data it touched will be updated with that value.
When a transaction reads data, it follows this very simple algorithm:
```python
if transaction.start_scn >= disk_block.commit_scn:
    # Data was committed before the transaction started
    return disk_block
else:
    # Data was committed after the transaction started, not visible
    return None
```
How do we get into a state where this is possible? How on earth does the COMMIT_SCN get written to disk, so we can use it this way?

In Oracle, like in other databases, each transaction has a transaction number - but that is where the resemblance ends. Oracle calls this unique number the xid. What exactly the xid really is doesn't actually matter - just think of it like a pointer.

When a transaction touches data in Oracle, it will write to a special part of the block metadata called the "Interested Transaction List" (ITL). It will add the xid of the writing transaction to the ITL. This tells readers of that block: "If you are reading this block - the data you are looking at has been modified, you should reconstruct it by looking at the ITL".
When a transaction commits in Oracle, it receives a COMMIT_SCN. This SCN is the current SCN of the database at the time of the commit (i.e. the logical clock) and is unique to that transaction commit. The COMMIT_SCN is written to the redo log (which is Oracle's version of the transaction log) along with the xid. With this information safely on disk, we now have a few choices:

- Immediately visit the metadata of blocks touched by the transaction and set the COMMIT_SCN in the block (clearing out the ITL entry in the process)
- Do nothing and wait for the next reader of the block to come along and set the COMMIT_SCN for me
- Visit the blocks with ITL entries when we have some CPU time to spare, clean up the entries and set the COMMIT_SCN on the block
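Here is a hedged sketch of the "let the next reader do the work" variant (all the names are mine - the real mechanism lives deep in Oracle's block internals). The reader resolves the xid via the commit record and stamps the block, so nobody has to repeat the lookup:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    data: object
    commit_scn: Optional[int] = None       # SCN of the last committed change, once known
    itl: set = field(default_factory=set)  # xids of transactions interested in this block

def read_block(block, committed_scn_by_xid):
    """Lazily replace ITL entries with the COMMIT_SCN of their transaction.

    committed_scn_by_xid maps xid -> COMMIT_SCN, as recorded in the redo log.
    """
    for xid in list(block.itl):
        scn = committed_scn_by_xid.get(xid)
        if scn is not None:
            # The transaction committed: stamp the block and clear the ITL entry
            block.commit_scn = max(block.commit_scn or 0, scn)
            block.itl.discard(xid)
    return block

block = Block(data="row values", itl={"xid-17"})
read_block(block, {"xid-17": 50_123})
print(block.commit_scn, block.itl)  # 50123 set()
```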
All of these operations can be done without taking any locks and can be perfectly amortised over time.
This is an extraordinarily elegant way to handle logical transaction clocks. It also opens the path to an almost trivial implementation of time travel - which Oracle calls "Flashback".
Summary
Today, I have hit you with a wall of text about how databases sequence time. If you are still with me: Well done and thanks!
A summary for the TL;DR crowd:
- Databases order all writes using a unique, 64-bit integer for every transaction written. This ordering is typically called the LSN (but Oracle calls it the SCN).
- This ordering, reflected on disk, is close to optimal as compared to letting the database just write data anywhere.
- The LSN is a logical clock of the data.
- Using this logical clock, we can replicate, backup and time travel in the database.
- Most databases use a separate, logical clock to control currently running transactions that is related to the LSN, but not the same as the LSN. Oracle is special this way and unifies the two concepts into the SCN.
Enjoy the rest of your weekend - and don't hesitate to catch me on LinkedIn for the next online fight.
PS: if this information is worth paying for - tip me on Patreon. Surely, if stupid YouTubers can get paid to post content about cats - this blog might be worth a few bucks to you?