The past months, I have been working on a feature which deal with grouping records by date and summing up values.
I have learn a couple of date functions in PostgreSQL that are very useful when it comes to grouping records together based on datetime column.
In this post, we will go through a bit of the context of the feature I work on, and we will walk through the process of implementing it. Along the way, I’ll share the date function I learned.
A bit of context on the feature I worked on. We are building a steps counter feature into our mobile, integrating Apple Health and Google Fit. The data provided by Apple Health is especially interesting as they provide the steps dataset in an hourly interval.
So on the database, we will have a
steps table as follow:
|started_at||timestamp with time zone|
|ended_at||timestamp with time zone|
And this is how the records will looks like:
We then have a endpoint which provide aggregated data of users steps count for a specific date range (this week/month, previous week/month and etc). So basically, to implement the feature we need to:
- Group users steps by date range (week, month, year).
- Sum up users
- Populate labels according to provided date range.
- When range is week, the labels should indicate
- When range is month, the labels should indicate
- When range is year, the labels should indecate
- When range is week, the labels should indicate
Since we are storing the timestamp as
UTC and our users are using
timezone, we will also need to consider the timezone while querying the data.
But for the simplicity of this post, I’ll skip that part, since dealing with
different timezone can be a post for another day.
Let the lesson begins
Okay, enough of boring context. By implementing the feature above, we are going to learn the following date functions in PostgreSQL:
- Truncate date with
- Extract date parts, such as weekday, month and year with
- Format date with
For those who want to get their hands dirty, you can run the following scripts to setup your database environment to play around with as you go through the article:
$ createdb step_db $ psql step_db
Then, copy and paste the following SQL statement into your
CREATE TABLE "steps" ( "id" SERIAL, "count" integer, "start_at" timestamptz, "end_at" timestamptz, "user_id" integer, PRIMARY KEY ("id"));
- Insert data from
INSERT INTO "steps" (count, start_at, end_at, user_id) SELECT floor(random() * 50 + 1)::int, d, d + interval '59 minutes 59 seconds', 1 FROM generate_series('2020-01-01'::timestamptz, '2020-01-07 00:00:00'::timestamptz, interval '1 hour') as d;
Do change the series date range if you want to generate more data.
Grouping records by day with
Since, our steps value are inserted as hourly instead of daily. We need to write a query to group the steps count by the date and sum it up. So ideally, the result should return something like:
At the very first sight, you might thought we could do something like this and it’s done:
SELECT sum(count) as total_count, start_at as date FROM steps GROUP BY start_at ORDER BY start_at;
But then, after running this query, it will actually return something like this:
The initial query we wrote, its grouping by datetime instead of date. To achieve what we want, we need to group by just the date of the row.
To do that, We can use
date_trunc function in PostgreSQL to truncate a timestamp
up to part of the timestamp like
hour, and etc (For the full
options, refer here).
SELECT sum(count) as total_count, date_trunc('day', start_at) as date FROM steps GROUP BY date_trunc('day', start_at) ORDER BY date_trunc('day', start_at);
which will return:
Close enough to what we want, but how do we remove the time component of the timestamp when SQL return the result? We can format it.
Now we have the query for grouping the rows by date and get the sum of the
steps count. The next step is to learn how we can format the timestamp. In
PostgreSQL, there is a
to_char formatting function that can convert various
data type to formatted string. And it works on timestamp also.
For example, we can use:
SELECT to_char('2020-01-01'::timestamptz, 'YYYY-MM-DD'); -- => to_char -- ------------ -- 2020-01-01
to format the timestamp to return just it’s date component. Now, we can update our previous SQL query to use it:
SELECT sum(count) as total_count, to_char(start_at, 'YYYY-MM-DD') as date FROM steps GROUP BY to_char(start_at, 'YYYY-MM-DD') ORDER BY to_char(start_at, 'YYYY-MM-DD');
which return this result, exactly what we wanted:
Extract date part
So, let’s say we wanted to show the label of the count in the format of
weekdays if the results show is in the week range (from Monday to Sunday). We
could also achieve it with
to_char like this:
SELECT sum(count) as total_count, to_char(start_at, 'YYYY-MM-DD') as date, to_char(start_at, 'Dy') as label FROM steps GROUP BY to_char(start_at, 'YYYY-MM-DD'), to_char(start_at, 'Dy') ORDER BY to_char(start_at, 'YYYY-MM-DD');
We need to add the formating to the
GROUP BY also since we are selecting it.
With this, the result will be as follow:
We are using the template patterns of
Dy here which formatted the date to the
day name such as Mon, Tue, Wed (For the full list of patterns available, refer
to the here).
But, be aware of this approach when we use
ORDER BY with
to_char that return alphabet
instead of integer in string. For example, let say we want to group by month and
sum the steps count, we might do something like this:
SELECT sum(count) as total_count, to_char(start_at, 'YYYY-MM') as date, --- 'Mon' template pattern will format 2020-01-01 to 'Jan' to_char(start_at, 'Mon') as label FROM steps GROUP BY to_char(start_at, 'YYYY-MM'), to_char(start_at, 'Mon') ORDER BY to_char(start_at, 'Mon');
This will not return the order of rows from Jan to Dec, instead it would look something like this:
where the label is sorted alphabetically since the column type is string.
One way to prevent this from happening is used the approach shown above, where we
ORDER BY to_char(start_at, 'YYYY-MM') instead, since the value formatted is
integer in string, it will still be sorted as expected.
The other approach would be using
date_part to extract part of the timestamp
to_char. The difference between
date_part always return number (to be exact is
double precision) type.
to_char you have more options to format your date.
SELECT sum(count) as total_count, to_char(start_at, 'YYYY-MM') as date, date_part('month', start_at) as label FROM steps GROUP BY to_char(start_at, 'YYYY-MM'), date_part('month', start_at) ORDER BY date_part('month', start_at);
Through the post, we have gone through the business requirement and implemented it iteratively using different date function in PostgreSQL. To sum up:
date_truncto truncate date. It is useful when our data is stored in date time granularity and we want it to group by date/month/year.
to_charto format date. It is very powerful as we can also used it to extract part of the date, use it in grouping and etc. But, beware when you are using it in
ORDER BYclause as it is sorted alpabetically.
date_partto extract part of the date. It is useful when we need to extract part of the date as integer. Perhaps, we want to use it for calculation or in
Utilizing these few function I have managed to implement aggregation of data for different period such as:
- Showing daily steps for the current/past week where labels are Mon, Tues, …, Sun
- Showing daily steps for the current/past month where labels are 1, 2, …, 31
- Showing monthly steps for the current/past year where labels are Jan, Feb, …, Dec
There are probably more use cases of these date functions that I haven’t come across yet. So, don’t let this post limit your usage.