Today we are looking at a Q04 — which on the surface is similar to Q17.
Like Q17, Q04 has a correlated subquery that can be de-correlated using a join.
But sometimes, a regular INNER JOIN is not enough to de-correlate a query.
You need special join types that don't appear as part of the standard SQL syntax.
When talking about Data Lakes and how people access them - we must address some of the misconceptions that made
them popular in the first place.
One of the largest misconceptions is this: "Joins are expensive". It is apparently believed (as I discovered from
discussions on LinkedIn) that it costs less CPU to turn your data model into a flat table and serve that to users -
than it does to join everytime you access the table.
It is certainly true that object stores allow nearly infinite I/O (although at a very high CPU cost to handle
the HTTP overhead) and nearly infinite storage. But, is there really such a thing as "sacrificing disk space to save
the CPU cost of joins?"
I decided to write this post in response to my recent discussions with Matt Martin. Matt and I have been sparring lately over software performance. During one of these discussions, Matt said the magic words in the title of this post. I hypothesise that this quote carries of lot of explaining power for those who are not fully familiar with databases.
Let us play with this idea a bit! I'm going to use Python to show examples of "database loops". Obviously, no one in their sane mind would implement a database in Python. But Python has the great advantage of being readable to nearly everyone. Let's go...
Most relational databases are both by human SQL query writers and tools that automatically generate queries. Query tools frequently add joins into the queries, even when those joins are not needed. This results in unnecessary query slowdowns and increase CPU usage.
For example, consider the classic Star Schema model with a fact table in the middle and dimension tables surrounding it. From a reporting tool perspective, it is useful to have a view on top of that schema which joins all the tables and presents the user with the illusion of a single table. This approach of creating large join views has some drawbacks that can only be offset by query optimisers if the database designer has correctly declared key constraints in the data model.