Databases and Disaggregation
Iceberg and Parquet, for all their flaws, have shown us a fascinating path forward for the database industry: Disaggregation. Apache Arrow is quickly moving us in the direction of common interchange formats inside the database and on the wire.
It's now possible to imagine a future where databases aren't single systems from one vendor, but are instead assembled from multiple components, from different contributors, into a single coherent system.
This idea isn't new, and I claim no credit for observing it. But I'd like to share my perspective on it — since that's what I do here.
Why Disaggregate?
Historically, in the era of DB2, Oracle, SQL Server and Teradata – a database came from one vendor only. Irrespective of what you think about Cathedrals and Bazaars, there's something to be said for having the full stack under the control of a single entity. Dictatorships get stuff done! A small team of highly skilled individuals can more easily grasp the architectural impact of making changes to a complex, coupled system. And databases are some of the most complex things we build in the industry.
On the other hand, because databases are so fiendishly complex, we could benefit from recruiting highly specialised talent from across the globe. As Linus Torvalds has proven: you can build world-class software as a community with contributors specialising in different parts of the stack.
Why not build databases the way Linus builds Linux?
Open source has shown us that for teams to function autonomously, robust agreements are essential.
What Components make up a database?
Nearly all databases, even the non-relational ones, follow the same modular, architectural pattern – irrespective of who made the database.
Let's walk through each component and see what the state of the art looks like. As we do, I'll share my perspective on what a future, specialised component team might contribute to the space.
Parsing and Routing
Parsing is the act of taking a piece of text, SQL in the case of databases, and turning it into a structured AST that machines can reason about.
While this sounds like a "solved problem", it is not!
Shortcutting Parse
When a simple statement arrives, such as INSERT INTO foo VALUES (...), it's possible to shortcut parsing by taking fast paths through the parser the moment you realise you're dealing with a simple INSERT.
For workloads that rely on heavy singleton insert activity (ex: poorly designed Kafka queues – as if there were another design category), this can be a big win in terms of CPU and latency.
Similarly, we can recognise queries of the form SELECT ... FROM foo WHERE x = <some const>.
State-of-the-art parsers will recognise a query as simple, auto-parameterise it and take shortcuts through the parse process. The AST for such simple queries is cacheable without needing any user configuration. High-speed web servers play similar tricks.
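As a sketch (the table, columns and the $1 marker are illustration only), auto-parameterisation means statements that differ only in their constants share one cached AST:

SELECT id, name FROM foo WHERE x = 42;
SELECT id, name FROM foo WHERE x = 117;
-- Both normalise to a single cached, parameterised form:
SELECT id, name FROM foo WHERE x = $1;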
Most open source databases don't employ these tricks.
Routing
Web servers achieve millions of requests per second per node without breaking a sweat. Most databases struggle to do that, even for simple queries. Caching systems such as memcached are a notable exception. Why this difference?
Many databases don't properly use async programming to scale request handling and instead rely on blocking network I/O. They thus drown themselves in context switching and kernel time instead of using modern programming techniques. A future, disaggregated parser/router could do a lot better here. What does your architectural world look like when you can serve 1M or even 10M read requests per second from Postgres with 10,000 concurrent users, without needing PgBouncer – all from a single server?
Combined with high-speed parsing, you can also route queries depending on their complexity to different compute resources.
Commercial databases that do deploy these tricks often beat dedicated key/value caching systems in throughput per node. For example, here's Oracle TimesTen going at 45M reads/second on a single node.
Rewriting
Rewrites are a form of optimisation that doesn't need a query planner. They're straightforward, rule-based changes to the parsed AST. A good rewriter not only improves performance, it also simplifies the rest of the stack.
For example, consider this:
SELECT ...
FROM foo
WHERE (a = 1+1 OR a = 3) OR a IN (2, 5, 6)
A good parser/rewriter, before we even care about query planning or relational algebra, can realise that this is the same as:
SELECT ...
FROM foo
WHERE a IN (2, 3, 5, 6)
Bringing the query into such a form is often useful to later steps in the query planning and execution process.
When possible, the rewriter can also constant-fold expressions (ex: the 1+1 case above).
Using De Morgan's laws, we can turn complex expressions into simpler forms more suitable for fast execution and data access.
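A minimal example, with made-up predicates: pushing the NOT through an OR leaves the executor with two cheap comparisons instead of a negated disjunction:

SELECT ... FROM foo WHERE NOT (a = 1 OR b = 2);
-- De Morgan rewrites this to:
SELECT ... FROM foo WHERE a <> 1 AND b <> 2;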
Since databases are good at joining, we can also use rewrites to turn queries into more practical, canonical forms – replacing complex expressions with joins and standard, relational operators.
For example, we can further rewrite the above to:
SELECT ...
FROM foo
JOIN (SELECT 2 AS __a UNION ALL SELECT 3 UNION ALL SELECT 5 UNION ALL SELECT 6) AS v ON a = __a
We could also rewrite this:
SELECT a
FROM foo
WHERE b in (SELECT x FROM bar)
To this:
SELECT a
FROM foo
SEMI JOIN bar ON b = x
With the right rewrites, we can entirely remove the need for a special "in-list evaluator" operator in the query engine. Every in-list can be turned into a join. I leave the general form of this IN-list rewriting as an exercise to the reader – or as a subject to be fought over on LinkedIn.
Take some time to play with any open source database: you will realise that open source databases are in a bad place with regard to rewrites.
Syntax Sugar
SQL is notoriously verbose. But it doesn't have to be that way. Syntax sugar is what makes a language feel good. Parsers can easily provide it.
Consider this great invention by DuckDB:
SELECT a, b, SUM(x)
FROM foo
GROUP BY ALL
How about this annoying, very non-DRY, pattern:
SELECT a + b AS x, (a + b) * c AS y
FROM foo;
I think it might have been Teradata that first came up with the idea that you could write the above as:
SELECT a + b AS x, x * c AS y
Much better, right? A good parser could add syntax sugar to SQL without the underlying components even noticing it and without breaking any contracts.
Error Messages
I hope to live in a world where SQL, despite its flaws, becomes easier to use for more people.
But databases are notoriously bad at providing users with helpful error messages. Consider this relatively benign case from PostgreSQL:
CREATE TABLE foo (x1 INT, x2 INT);
CREATE TABLE bar (x3 INT);
SELECT x3 FROM foo;
-- ERROR: column "x3" does not exist
-- Hint: Perhaps you meant to reference the column "foo.x1" or the column "foo.x2".
SELECT *
FROM foo
JOIN bar USING (x3);
-- ERROR: column "x3" specified in USING clause does not exist in left table
The first error is useful (though we didn't need the foo prefix in the message).
Why isn't the second error suggesting that we might mean another column?
Imagine you get the second error in a large query — what on earth does "the left table" even mean?
Why didn't PostgreSQL say something like: "column x3 doesn't exist in foo, did you mean x1 or x2?"
Parsers have the context to provide better error messages. Despite 50 years of database industry research, we're still missing basic usability features.
Query Lifecycle Management
There's a crucial question every serious consumer of a database must ask themselves: "Where did my query spend its time?"
You want a high-precision breakdown of time into:
- How long did planning take?
- How long did I wait for other queries to complete before I could start executing?
- Which node in the execution plan used the most time?
- Were I/O and CPU sufficient, and did the computation suffer from data skew?
- How many operations/second am I getting for each node in the plan?
- Did the query wait for a stupid Python program to spool 1M rows back to the client?
- Did the query end up waiting for a cluster node to restart?
- Etc...
This is an instrumentation problem – and simple tools like Grafana aren't solving it. This isn't about log files. A query must be a first-class citizen in the architecture. The metadata (itself a tabular structure) must contain a data model that allows for easy analysis of queries, both individual queries and queryable aggregates over entire workloads.
Leaders in this space have done great work already.
Consider SQL Server's sys schema and Oracle's V$ and DBA_ views as good examples of the details you need.
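To make this concrete, here's the kind of question SQL Server lets you ask in plain SQL (the TOP 10 and column selection are my choice, not a recommendation):

SELECT TOP 10
    qs.execution_count,
    qs.total_worker_time / 1000  AS total_cpu_ms,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    qs.total_logical_reads,
    st.text                      AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;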
Consider PostgreSQL's pg_stat_activity and pg_stat_statements as a core example of misunderstanding the problem.
Reading and merging log files and piecing them together to form a full picture of a query (or a group of queries) is simply not good enough. And we don't want to escape into some open source tool with a UX designed by committee to view the data. We already have a database language for analysing data – that’s what SQL is for!
We have to do better, and we could even consider standardising on a common data model for query diagnostics.
The idea of having standard system views exists in the ANSI-SQL standard with the INFORMATION_SCHEMA schema.
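For instance, this metadata query runs largely unmodified across engines today – evidence that standardised, queryable system views are achievable:

SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name = 'foo';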
Workload Management (WLM)
Cloud providers sell us the idea that compute resources are infinitely scalable. Elasticity and scale: the great snake oil of our times. Of course, infinitely scalable compute resources also cost infinite money, at cloud markup pricing.
We, who live in the real world, know that you can't just allow queries to consume as many resources as they want. Anyone who has worked on a large system has seen inexperienced users run queries that not only consume a lot of CPU, but also disturb other users who need to get work done. With the advent of LLMs, we now have an entirely new class of village idiots: The "AI" exploratory queries that talk to our data via MCP. It's a jungle out there, and someone needs to put some rules in place.
Workload management is the discipline of controlling who gets to run, when they run and how many resources they can use when they do. The ANSI-SQL standard is silent on this topic.
There's significant room for open source database vendors to contribute to the discussion and help provide a standard framework for query workload management. We need to move beyond the "let me limit CPU of this process" and towards a more declarative approach to controlling compute resources.
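SQL Server's Resource Governor gives a taste of what a declarative approach can look like (the pool and group names below are mine):

-- Cap ad-hoc users at 20% CPU and modest memory grants:
CREATE RESOURCE POOL adhoc_pool WITH (MAX_CPU_PERCENT = 20);
CREATE WORKLOAD GROUP adhoc_group
    WITH (IMPORTANCE = LOW, REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)
    USING adhoc_pool;
-- A classifier function then routes incoming sessions into the group.
ALTER RESOURCE GOVERNOR RECONFIGURE;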
Note that there's a misaligned incentive for cloud vendors here. If you're in the business of selling a cloud database and benefitting from bundling compute resources with your database – why would you want your users to be in fine control of CPU usage?
We need an advanced cgroup implementation made for database engines.
Metadata management
I could write an entire blog entry about how hard maintaining good metadata is — and I probably will.
For now, let me provide some areas where a disaggregated metadata manager could help everyone.
Detailed, inexpensive statistics
Several algorithms exist that allow us to maintain good statistics over large tables. Examples include:
- Theta Sketches
- Hyper Log Log
- Various histogram sampling methods (VLDB has several good papers on this)
- LLM-based, adaptive learning of correlations
- Correlated column statistics
These data structures are absent even in large, big data systems that'd greatly benefit from them. The Iceberg spec is notably silent about these powerful data structures.
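Some engines do already expose sketch-backed statistics at the SQL level. DuckDB, for instance, ships a HyperLogLog-backed approximate distinct count (the events table here is hypothetical):

SELECT approx_count_distinct(user_id) AS approx_ndv
FROM events;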
Garbage Collection and Background Tasks
A good metadata manager could at least partially deal with the garbage collection problem that arises from trickle feed and small update workloads.
Note that I'm not endorsing poorly designed space management systems such as Iceberg or PostgreSQL VACUUM. What I am saying is that if you must live with these poor implementations, a good metadata manager can help you deal with the problem at the right rate.
Query Planner
My blog should be sufficient evidence that query planning in open source databases is very far from being a solved problem.
A disaggregated planning team could focus on understanding the fiendishly complex problem of query planning without having to worry about other parts of the stack.
Such a disaggregated planner already exists, ex: GPOrca. Unfortunately, this project seems to have been mostly abandoned. There's also some exciting work from Andy Pavlo's optd.
EXPLAIN instrumentation
As my work on SQL Arena has clearly shown – it's desirable to have a unified way to view query plans for SQL statements.
I'm not the first to notice this (ex: Towards a Unified Query Plan Representation).
Additionally, there's more work to be done in exposing the internal reasoning of the query optimiser to the human (or robotic) consumer of query plans.
Better Estimation and Adaptive Planning
Using the detailed statistics described above, optimisers can make much better decisions about query plans. The more we know about the underlying data, the smarter our query plans can be. And smart query plans mean faster queries and happier users.
But even the best statistics can miss subtle correlations in the data. This is where adaptive planning comes in: you try to get better at planning by gradually learning about the data you're querying.
Several strategies exist that we can use:
Sampling queries and deferred planning are in use by engines like Snowflake and Teradata. When the planner realises that its confidence in an estimate is low, it can run queries against the data to get a better idea. It's also possible to make query plans that defer full execution until intermediate result cardinalities are known. This couples the planner with the execution engine – creating a feedback loop.
Post-query sampling is used by BigQuery, SQL Server, Oracle and likely others. The idea here is to keep statistics about previous queries and the sizes of intermediate results. The query planner can then learn from this information to make better decisions about future query execution.
There's a space for innovation, standardisation and decoupling here. You can provide a general format for storing sampled information about the data model that's accessible in an open way to multiple database engines. We could even put that data into the metadata store in a standard format.
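A minimal sketch of what such a shared catalog could look like – every name below is invented for illustration, this isn't an existing standard:

CREATE SCHEMA shared_stats;
CREATE TABLE shared_stats.column_profile (
    table_name     TEXT,
    column_name    TEXT,
    collected_at   TIMESTAMP,
    row_count      BIGINT,
    ndv_estimate   BIGINT,   -- e.g. backed by a HyperLogLog sketch
    sketch_type    TEXT,     -- 'hll', 'theta', 'histogram', ...
    sketch_payload BLOB      -- serialised sketch, mergeable across engines
);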
Execution Engine
Historically, execution engines have fallen into two, broad categories:
- Row / Streaming / OLTP
- Columnar / Batch / Analytical
Row-based and OLTP systems are solved problems
Row-based engines tend to run OLTP systems. Their preferred join is the nested loop. You tune data access with B-tree indexes, and full normalisation is the name of the game. These systems concern themselves with low latency and very high user concurrency.
I claim with a bit of hand-waving that OLTP systems are a solved problem.
It's realistic to expect tens or hundreds of thousands of transactions per second – even on single node systems. The amount of throughput available on modern hardware is insane and isn't a limiting factor for OLTP. Of course, that kind of throughput isn't available to PostgreSQL users (and only barely in reach for MySQL), and the open source community could certainly do better in this space to compete with the big players. The improvements we can expect from open source OLTP systems come from better space management, async programming techniques, I/O optimisation, better write-ahead logs (including replication streams) and general tuning of the hot paths. These are problems with known solutions that have been around for decades in engines like SQL Server and Oracle. Someone has yet to create an execution engine that can compete with the big boys at a more reasonable cost – or for free. You don't even need to innovate, you just need to mimic.
In a well-indexed OLTP system with good statistics, a "good enough" query plan is generally easy to find (as long as you don't run analytical queries on the system). But there's some small progress to be made in creating better OLTP query plans, particularly when combined with adaptive query planning.
Columnar and Analytical Engines leave a lot to be desired
We typically use columnar engines for analytical systems. They benefit greatly from SIMD optimisations, which for some hot paths provide about an order of magnitude improvement in performance. Their bread-and-butter join is the hash join. They often access data via coarse-grained, columnar indexes and stored metadata like bloom filters and min/max values. Users will typically apply some kind of denormalisation to eliminate the need for joins between two large tables. Your main concern here is throughput and dealing with fiendishly complex queries – such as those generated by ETL flows ("pipelines" for you data engineers out there) or LLMs exploring your data on behalf of a user prompt.
Columnar engines typically receive the most attention from open source and new database vendors – exactly because this isn't a solved problem. I suspect there's also a coolness factor in play here. SIMD execution is just really fun (if you're the kind of masochist I am)!
There's a space for making significant contributions to the industry in Columnar Execution Engines. I'm not aware of any databases that apply all the latest technology to boost execution performance.
Here are some examples.
Hash Join Research
Anyone who has ever tried to optimise joins per CPU cycle in a columnar system will understand that a single hash table data structure isn't best for all cases. To achieve optimal performance, an execution engine must implement different hash tables for different situations. For example, if you know that one side of the join is a unique key, you can do better than a general hash table implementation.
Yellowbrick uses three different types of hash tables to speed up joins. It chooses the exact implementation based on the observed cardinality of the input. DuckDB has hybrid hashes for the same reason.
This is an area of very active research – and it has a very high cool factor too.
A decoupled execution engine could provide a pluggable join node interface with a well-defined contract. Programmers could then contribute their own, optimised algorithms for the join node. We could have "fastest join node" competitions, and the entire industry would benefit from it. Letting someone focus only on a small (but important) part of the Execution Engine would open the talent pool to more players.
There’s an equally large area of research for aggregates.
Adaptive Execution
Bad query plans are a notorious source of problems in databases. As data sizes grow, so does the impact of bad planning: A single bad query plan is the difference between a massive cloud bill and being able to run the query on your laptop.
Using the techniques described in the query planner section, you can eliminate many of the worst issues. But you can do even better by allowing the planner to interact with the Execution Engine while the query is running (creating a managed coupling via a standard interface).
For example, you can change the execution strategy if you discover that the query you carefully planned and estimated doesn't work well for the data you're actually processing.
Simple solutions (not adopted by most systems) exist that already deliver great value. For example, you can swap the build and probe sides of a hash join if the build side turns out to be bigger than you expected. Teradata and BigQuery can do this. SQL Server can't – but it can dynamically choose between loop and hash join at execution time.
Single Node vs. Distributed Execution
DuckDB and SQL Server made an interesting bet on the future: single node scale grows with better hardware at a speed that makes distributed computation a tiny part of the market that isn't worth pursuing. If you spend time tuning for a single node, you can achieve some ridiculously high performance numbers. You can buy 192 cores in an AMD processor today. That's an insane amount of compute power if you know how to wield it. You can move hundreds of millions of rows every second on such a node. It fits under your desk and costs around 5-10K USD (not per month, but to acquire and keep for years). You can trawl through data at 100GB/second (and that's before you decompress it).
Computing at the edge is powerful enough to run nearly all non-LLM workloads.
Alas, the cloud vendors don't benefit from a future of massive compute at the edge. They need you to believe that scalability only comes from distributed computation. Of course, they're right that acquiring 8–16 core servers in large amounts is cheaper than buying 192 core machines. Distributed computation also provides built-in failure domain protection – all your compute is no longer in a single, fallible box.
Can we agree that consuming "just the compute we need" and giving it back to the pool is the way we want to operate? That renting is better than owning? If so, that opens up a world of spot instances and spot market pricing. We can mass-produce small servers at scale, and they're broadly appealing to both small and large players. We could even cut x64/x86 and Intel/AMD out of the market and just run on ARM (disclaimer: I'm the embarrassed owner of Intel shares). With the right software, swarms of small machines can serve nearly any scale we need. The efficiency of the market should make compute cheaper if we all share it and rely on mass-produced hardware this way.
Yes? Cui bono?
No? You're thinking critically about the uncomfortable questions now?
- About the profits these monopolies can command?
- Do they actually pass the savings from their mass purchasing power on to you, the consumer?
- Are you paying a fair price for the "added value" subscription required to manage distributed software?
- What's the long-term impact of your data being held ransom by a closed data format stored inside an object storage system hosted by a company that bends the knee to the Trump administration?
- Is the complexity explosion of making these disjoint, amateur products work together worth it – particularly once you start paying "data engineers" to build pipelines every time you need a new report?
- How much work do you, as the customer, absorb in labour cost to make the service deliver your business value?
- Can you reliably achieve the uptime you need when the vendor providing your hardware doesn't have incentives aligned with your interest in keeping your business online?
Stop thinking about all this and pretend to be a good capitalist while paying dues to your feudal, cloud lords.
Don't get me wrong: There's obviously a case for distributed computation. Having recently abandoned Windows, I'm in love with my M5 (=ARM) processor. But a database must make the best use of the individual scale nodes first before falling back to adding more nodes to the cluster.
And of course, there's an entire component to be made to handle high-speed transfer of Arrow buffers over unstable networks. TCP isn't the answer here.
Storage Engine and Cache
My little cloud rant brings me to the final part of the stack that we can disaggregate: storage!
With Iceberg, we have an open, disaggregated way of storing data in a columnar format. It's a format designed to run on top of object stores – because that is the hell we're currently living in. One fine day, history might start moving forward again, and we'll get reasonably priced, distributed storage which allows block-level read/write access. Perhaps we'll even get persistent memory one day, and this entire discussion won't age well.
For now, abandon all hope ye who use cloud: you're stuck with object stores and ephemeral, block-based NVMe. It's dark and depressing, but it's the reality we're in together.
That now raises the question: "If we buy into the premise that we're stuck with Iceberg - can we do something decent with NVMe caching to claw our way out of the self-made circle of hell we're in?"
Can we emulate block-based database storage with NVMe on top of object stores and Parquet to provide high-speed, cached access to data? For reading data, my intuition tells me that we can do a decent job. But it will be difficult to pull off. Even small cloud nodes can scan data at several GB/second from their local NVMe drives. We should be able to use this speed to provide a sort of "NVMe buffer pool" that'd make our data lakehouse look like a real database. Perhaps we don't need to think of lakehouses as upstream, pipeline-requiring precursors to real database reporting?
Could we disaggregate the storage caching engine from the storage format?
What doesn't follow from Disaggregation
I've gone through more text than in my usual blogs. This topic is dear to my heart, and I hope I made sense to you.
Before I sign off for the day, I'd like to make some architectural observations relevant to this discussion.
I've sketched out several database components. Developers could work on each of these independently – picking the area of specialisation they prefer. Whenever you componentise the world in this way, you must think very carefully about the interfaces between these components – so they can remain stable over time. A future of disaggregated databases will need strong contracts in place, designed by people with a solid foundation in database engineering – we have work to do here!
Note that just because something is an isolatable component, it does not follow that it's also a good candidate for being a microservice. Just because your FPS shooter allows modding – it doesn't follow that those mods must run at the end of some RPC service endpoint. Like games, databases are high-speed compute systems, and those systems don't play well with too many network hops. You can have pluggable components in the same process space; you don't need a network between every component. This is a fact we seem to have forgotten in this "modern" architecture era. There are some obvious cases where separating components into different servers/services might make sense. For example, query planning often has interesting memory behaviours that might not play well with an execution engine that needs low-level access to the CPU. We can probably afford a network hop between the query planner and the execution engine. Once we reach the execution engine, we typically expect to spend at least a few tens of milliseconds on each query – and maybe that means we can afford another network hop to the storage engine.
You'll no doubt have noticed the implied coupling between query planners and execution engines. To make the right plan, the planner must know the rough CPU consumption of each execution operation. That CPU consumption will depend on the specific execution engine — the engine needs a way to signal that. The execution engine may also need to contact the planner to re-plan queries that aren't running the way the planner expected they would. The stack isn't a strict hierarchy: components interact with each other in both directions.
Finally, and perhaps most importantly for the open source community: Components in different parts of the stack don't have to be written in the same programming language. We don't need to have a Rust vs. C++ fight here as long as we agree on how components talk to each other. That, in turn, should allow us to draw talent from a much larger pool of developers.
You could even prototype some components and "mods" in Python (a terrible programming language loved by data engineers) and JavaScript (a language that should have died with Netscape). This might speed up prototyping of new ideas as we could again draw on a larger talent pool.
Summary
In this first blog entry of 2026, I ventured into predictions about a future dominated by disaggregated databases.
I believe that it's possible to split traditional databases into smaller components that we can develop and deploy independently. Despite 50 years of engineering, databases are still far from being a "solved problem". I've provided several research and engineering directions that I think are worth exploring.
Databases are fiendishly complex systems, and I hope that by disaggregating them, we can allow programmers to specialise in smaller parts of the stack. This could create a future where we assemble databases from highly optimised components which will each be more powerful than anything we've seen before. It would also allow vendors to compete in specific parts of the stack instead of building entire engines themselves.
And perhaps, with this world, history can finally move forward again! We'll be able to deploy databases in the cloud (or on premises) that can have the kind of functionality and convenience my generation was fortunate enough to grow up with.
When I allow myself to dream, I could even imagine a world where we don't constantly reinvent the wheel and repeat the same 10-year reinvention cycle. The vicious cycle that brought us Hadoop, NoSQL, Kafka, ORM and Spark – only to rediscover why people invented relational databases and SQL in the first place.
Welcome to 2026 – let's get disaggregating!

