When talking about Data Lakes and how people access them, we must address some of the misconceptions that made
them popular in the first place.
One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from
discussions on LinkedIn) that it costs less CPU to flatten your data model into a single table and serve that to users
than it does to join every time you query the data.
It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle
the HTTP overhead) and nearly infinite storage. But is there really such a thing as "sacrificing disk space to save
the CPU cost of joins"?
When designing database schemas, you may find yourself needing to create a property bag table that is in a 1-n
relationship with an entity table. It isn't always viable to just add more columns to a table because you may not know
all the properties you want in advance. The problem with these property bags is that they can be hard to query.
In this blog entry, we will explore the property bag data modelling use case and provide some tricks that will make it
very simple to query those bags.
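To make the shape of the problem concrete, here is a minimal sketch of a property bag in a 1-n relationship with an entity table, using Python's built-in sqlite3 module. The table names (entity, entity_property), the column names, the sample rows and the helper pivot_properties are all made up for illustration. The pivot shown is one common way to read a bag back as columns and is not necessarily the technique developed later in this entry.

```python
import sqlite3

# Hypothetical schema: one row per entity, one row per (entity, property) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE entity_property (
    entity_id  INTEGER NOT NULL REFERENCES entity(entity_id),
    prop_name  TEXT NOT NULL,
    prop_value TEXT,
    PRIMARY KEY (entity_id, prop_name)
);
INSERT INTO entity VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO entity_property VALUES
    (1, 'color', 'silver'),
    (1, 'ram_gb', '16'),
    (2, 'color', 'black');
""")


def pivot_properties(conn, prop_names):
    """Return one row per entity with one column per requested property.

    Each property becomes MAX(CASE WHEN ... END), which picks the single
    matching value per entity, or NULL when the entity lacks the property.
    Property names are bound as parameters rather than spliced into the SQL.
    """
    select_list = ["e.entity_id", "e.name"] + [
        f"MAX(CASE WHEN p.prop_name = ? THEN p.prop_value END) AS prop_{i}"
        for i in range(len(prop_names))
    ]
    sql = f"""
        SELECT {", ".join(select_list)}
        FROM entity AS e
        LEFT JOIN entity_property AS p ON p.entity_id = e.entity_id
        GROUP BY e.entity_id, e.name
        ORDER BY e.entity_id
    """
    return conn.execute(sql, list(prop_names)).fetchall()


rows = pivot_properties(conn, ["color", "ram_gb"])
for row in rows:
    print(row)
```

Note that every value comes back as text and that the caller has to know the property names up front, which is part of what makes these bags awkward to query.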
A central benefit of modeling databases for analytical purposes and speed is that you can restore the sanity, often lost in the source system, that comes from using good keys in each table.
But what exactly does it mean for a key to be "good"?