I wrote recently about the path of learning SQL and I wanted to take a little excursion into what step 6 can look like. In that recent post, I provided examples of when one can use window functions. We generally want to compare things between days or some other aggregated measure. Unfortunately, as anyone working with data knows, the source data is often at a transactional level with many many records per day. This problem is easily handled by CTEs: first we aggregate our records by day as a CTE and then we apply the window functions to the aggregated CTE. But, did you know that this can (in most SQL engines) be elegantly and concisely written without the use of a CTE? Let's take a look at an example.
Let's say that we have a table of temperature recordings and we want to
compare the change in daily highs. We have a table
temps
that records the time
and
temp
at given reading. And because we are sane and
sensible, we record our temperature in Celsius.
time | temp |
---|---|
2021-07-01 12:14:42 | 20.0 |
2021-07-01 17:01:14 | 25.0 |
2021-07-02 06:08:22 | 18.0 |
2021-07-02 20:00:11 | 23.0 |
2021-07-03 04:39:21 | 14.0 |
2021-07-03 11:30:00 | 21.0 |
2021-07-03 22:09:43 | 18.0 |
The first step would involve figuring out the max temperatures for each day.
SELECT date_trunc('day', time) AS day, max(temp) AS max_temp
FROM temps
GROUP BY 1
ORDER BY 1
Which gets us:
day | max_temp |
---|---|
2021-07-01 | 25.0 |
2021-07-02 | 23.0 |
2021-07-03 | 21.0 |
Now with a little window function wizardry, we can compute our day-over-day change.
WITH daily AS (
SELECT date_trunc('day', time) AS day, max(temp) AS max_temp
FROM temps
GROUP BY 1
ORDER BY 1
)
SELECT *,
lag(max_temp) OVER(ORDER BY day) AS yesterday_max,
max_temp - lag(max_temp) OVER(ORDER BY day) AS dod_change
FROM daily
Which gets us:
day | max_temp | yesterday_max | dod_change |
---|---|---|---|
2021-07-01 | 25.0 | ||
2021-07-02 | 23.0 | 25.0 | -2.0 |
2021-07-03 | 21.0 | 23.0 | -2.0 |
All is well and we have what we need. But we can do better than that!
SELECT date_trunc('day', time) AS day,
max(temp) AS max_temp,
lag(max(temp)) OVER(ORDER BY date_trunc('day', time)) AS yesterday_max,
max(temp) - lag(max(temp)) OVER(ORDER BY date_trunc('day', time)) AS dod_change
FROM temps
GROUP BY 1
ORDER BY 1
day | max_temp | yesterday_max | dod_change |
---|---|---|---|
2021-07-01 | 25.0 | ||
2021-07-02 | 23.0 | 25.0 | -2.0 |
2021-07-03 | 21.0 | 23.0 | -2.0 |
Slick! If you look carefully, all we really did was sub in
max(temp)
and date_trunc('day', time)
into our
window function where we previously had used the aliased fields
max_temp
and day
respectively. This nesting
can look quite confusing at first, especially if you start doing things
like sum(count(*)) OVER()
to get the total count of records
by day but I think it is wonderfully clear when you understand what's
happening.
Like all things, practice makes perfect. Don't expect to get this right the first time. But next time you see this pattern of aggregation then window function, stretch yourself and see if you can do it in one shot. Who knows who you can become with all the extra time saved from writing those 5 extra lines of code!?