The Database Doctor
Musing about Databases

Row or Column based Storage?

These days, columnar storage formats are getting a lot more attention in relational databases. Parquet, with its superior compression, is quickly taking over from CSV formats. SAP Hana, Vertica, Yellowbrick, Databricks, SQL Azure and many others all promote columnar representations as the best option for analytical workloads.

It is not hard to see why. But there are tradeoffs.

Row Groups

Both column and row based storage will store rows in groups. This optimises disk access patterns. If you stored a single row at a time, you would have to write very small disk blocks every time you want to persist it. Even with modern SSD storage - this isn't an effective way to access disk. And no, memory mapped files does not get you out of this conundrum - for reasons I might blog about later.

Row Groups in Row Based Storage

In a row based store, the number or rows you store in a row group is typically influenced by the optimal block size of the disk storage you use. The optimal block size for traditional block storage s in the single to double digit KB. For example, SQL Server will use 8KB and Oracle allows you to use a variable value - but typically one that is between 4KB and 64K.

These small block row groups allow the you to quickly read and change single rows. You simply read a small block from disk (or memory), change its content and writing it back to disk. You may decide to implement compression inside the block. But since the amount of mutual entropy in a KB sized data chunk is normally low. The resulting compression rates is often low too - perhaps 2-3x if you are fortunate.

Row Groups in Column Based Storage

In a column based storage, you store each column (sometime groups of columns) in it's own disk block. That way, you can read (from disk) a subset of columns of the columns making up the table without having to read the entire table.

Consider what this disk layout means when you need to read more than one column from the table. For example, take this simple SELECT statement:

SELECT a, b
FROM Foo
WHERE x = 42

To evaluate this statement on a columnar storage, you must read x, a and b from the column based storage. But, you must also eliminate all the a and b values that do not belong in a row where x = 42. If all columns are stored in a long stream on disk without any grouping into the rows they belong to - you will end up performing a massive join between a, b and x column streams based on their row offsets. That's not very effective. Because of this, your column based storage typically stores a set of rows together in a row group. That way, you can read row group, decompressed it, filter it and recombine the columns into rows without needing to "align" all the columns together across the entire table. You can also have multiple threads fetch row groups in parallel without having to coordinate between the threads.

Consider what this observation means for the arrangement on disk. Each column may compress very differently, depending on amount of entropy in that column. Ideally, you want to store each on its own disk blocks (so you don't have to read the entire row every time you read a subset of columns).

You seem to have a series of conflicting design requirements:

You typically solve this by storing large row groups and packing columns intelligently inside the disk blocks that make up a row group. A column based row group and its disk blocks is typically several MB in size. A single row group may contain millions of rows.

The downside is of course that changes to a single row in a row group require a full rewrite of the entire row group or a large fraction of it.

Column based storage for Analytical workloads

A columnar representation on disk has many advantages over a row based representation for analytical workloads

All these factors make columnar storage the preferred way to organise data on disk in analytical databases.

Row based storage for OLTP and Real time Systems

In OLTP databases you care about low latency response times, high concurrency and real time data ingestion. Data changes quickly and all the time. Design parameters like these make the tradeoff between column and row based storage is less obvious. To make matters worse, an OLTP query is typically looking only at a few rows instead of a big set of rows - and you have thousands if not millions of these queries hitting your system every second. You really can't afford to issue large disk reads for each of them - even if you have the best storage systems in the world.

In summary, your OLTP system that need real time access to data at very low latency (typically in the single digit milliseconds) will often use row based storage. It is faster to change row based storage, its faster to access single rows, and it requires less background operations to keep it neatly packed on disk and in memory. But, you pay when you perform analytical queries and your storage size per row is larger.

Summary - and Hybrid storage

You probably want to have your cake and eat it too. Which is why hybrid row/column storage is a thing. There are a range of options for mixing up the two strategies that are beyond the scope of this blog entry.

Some ideas to get you thinking