Mash's Musings


How to grok SQL

Published Apr. 27, 2021

The rigidity of SQL is both a blessing and a curse. Once you "get it", you really get it, there are almost no surprises. But its inflexible syntax and difficulty to test also means that it can take some time to grok and lead to lots of silent errors.

Confusingly, SQL is not executed top to bottom, the manner in which a human would read it. Reasoning about and reading SQL in the actual execution order, as outlined below, provides much greater clarity. One general tip is to use CTEs (Common Table Expressions AKA WITH statements) liberally to keep things organized.

FROM

The most important part of any query and the source of most mistakes. Reasoning about this involves visualizing, mentally or on paper, the full expansion of each join. Make sure you can state exactly what the the primary keys are on each table before joining. Write out some rows you expect to be dropped with inner joins, some rows you expect to be empty with outer joins, and some rows you expect to be duplicated/expanded if you're not joining two tables at the exact same level of aggregation. Joining at the wrong level of aggregation is really easy and won't return any errors, quietly returning a completely incorrect number.

By the end of this step you should have a sense of some giant megatable you're creating and be satisfied with what each row in the megatable represents. This is likely different than what your left-most table started with. Most of the columns in your megatable will never be used and the columns you joined on will likely be duplicated—this is okay.

WHERE

Once you have your megatable, decide which rows you don't need. This part is relatively well understood. The most common mistake seen stems from trying to filter on fields in joined tables when nothing was actually joined. Again, having a clear visualization of the megatable will help avoid this. Also be explicit about how you handle nulls: if A is null and B is 1, A != B may not do what you expect...

GROUP BY

You should know what level of aggregation you need for your output—only group by those fields! This is the second most common source of confusion. If you find yourself starting to group by numeric fields, stop! If you find yourself grouping by fields that are 1:1 or 1:many with your main field, stop! Really common mistakes would be grouping by city, state, and country or grouping by person_id, person_name, and title. In both of these situations, what you're really trying to do is just group by city or person_id first (within a CTE) and then join in the additional fields; the additional grouping fields are completely redundant! Adding all sorts of unnecessary groupings makes the query perform worse and adds confusion.

HAVING

Did you know that this even exists? WHERE but applied at the level of GROUP BY. Only want to find duplicates? Only care about groups above a certain threshold? This is where you encode it.

SELECT

While this seems like the meat and potatoes when first learning, this is really near the end of the process and should be straightforward after thinking through everything else. Be frugal with what fields you're pulling; a SELECT * might be quick to write but could slow you down 10x in run time. Window functions can get tricky and are outside of the scope of this post.

ORDER BY

You're almost there! Given how late this is in the execution, you can order by all the fancy new fields you defined in the last step (aside from window functions, womp womp).

LIMIT

You made it! Try to be frugal here as well since you probably just want to make sure things are working. Forgetting this can lead to some long wait times as your front-end tries to download 1 GB files when a few KB would have sufficed.