posts

Apr 14, 2020

Essential Date Functions for Grouping in PostgreSQL

Estimated Reading Time: 6 minutes (1249 words)

The past months, I have been working on a feature which deal with grouping records by date and summing up values.

I have learn a couple of date functions in PostgreSQL that are very useful when it comes to grouping records together based on datetime column.

In this post, we will go through a bit of the context of the feature I work on, and we will walk through the process of implementing it. Along the way, I’ll share the date function I learned.

Context

A bit of context on the feature I worked on. We are building a steps counter feature into our mobile, integrating Apple Health and Google Fit. The data provided by Apple Health is especially interesting as they provide the steps dataset in an hourly interval.

So on the database, we will have a steps table as follow:

ColumnType
idint
countint
started_attimestamp with time zone
ended_attimestamp with time zone
user_idint

And this is how the records will looks like:

FieldValue
id1
count34
started_at2020-02-26 13:00:00+08
ended_at2020-02-26 14:00:00+08
user_id1
FieldValue
id2
count319
started_at2020-02-26 12:00:00+08
ended_at2020-02-26 13:00:00+08
user_id1

We then have a endpoint which provide aggregated data of users steps count for a specific date range (this week/month, previous week/month and etc). So basically, to implement the feature we need to:

Since we are storing the timestamp as UTC and our users are using GMT +8 timezone, we will also need to consider the timezone while querying the data. But for the simplicity of this post, I’ll skip that part, since dealing with different timezone can be a post for another day.

Let the lesson begins

Okay, enough of boring context. By implementing the feature above, we are going to learn the following date functions in PostgreSQL:

Setup

For those who want to get their hands dirty, you can run the following scripts to setup your database environment to play around with as you go through the article:

$ createdb step_db
$ psql step_db

Then, copy and paste the following SQL statement into your psql console.

  1. Create steps table.
CREATE TABLE "steps" (
  "id" SERIAL,
  "count" integer,
  "start_at" timestamptz,
  "end_at" timestamptz,
  "user_id" integer,
  PRIMARY KEY ("id"));
  1. Insert data from 2020-01-01 to 2020-01-07 at UTC.
INSERT INTO "steps" (count, start_at, end_at, user_id)
SELECT floor(random() * 50 + 1)::int, d, d + interval '59 minutes 59 seconds', 1
FROM generate_series('2020-01-01'::timestamptz,
                         '2020-01-07 00:00:00'::timestamptz,
                         interval '1 hour') as d;

Do change the series date range if you want to generate more data.

Grouping records by day with date_trunc

Since, our steps value are inserted as hourly instead of daily. We need to write a query to group the steps count by the date and sum it up. So ideally, the result should return something like:

total_countdate
34232020-01-01
45232020-01-02

At the very first sight, you might thought we could do something like this and it’s done:

SELECT sum(count) as total_count, start_at as date
FROM steps
GROUP BY start_at
ORDER BY start_at;

But then, after running this query, it will actually return something like this:

total_countdate 
312020-01-01 00:00:00+00
172020-01-01 01:00:00+00
312020-01-01 02:00:00+00

The initial query we wrote, its grouping by datetime instead of date. To achieve what we want, we need to group by just the date of the row.

To do that, We can use date_trunc function in PostgreSQL to truncate a timestamp up to part of the timestamp like day, month, hour, and etc (For the full options, refer here).

SELECT sum(count) as total_count, date_trunc('day', start_at) as date
FROM steps
GROUP BY date_trunc('day', start_at)
ORDER BY date_trunc('day', start_at);

which will return:

total_countdate
5442020-01-01 00:00:00+00
7122020-01-02 00:00:00+00

Close enough to what we want, but how do we remove the time component of the timestamp when SQL return the result? We can format it.

Format Date

Now we have the query for grouping the rows by date and get the sum of the steps count. The next step is to learn how we can format the timestamp. In PostgreSQL, there is a to_char formatting function that can convert various data type to formatted string. And it works on timestamp also.

For example, we can use:

SELECT to_char('2020-01-01'::timestamptz, 'YYYY-MM-DD');
-- =>  to_char
--     ------------
--     2020-01-01

to format the timestamp to return just it’s date component. Now, we can update our previous SQL query to use it:

SELECT sum(count) as total_count, to_char(start_at, 'YYYY-MM-DD') as date
FROM steps
GROUP BY to_char(start_at, 'YYYY-MM-DD')
ORDER BY to_char(start_at, 'YYYY-MM-DD');

which return this result, exactly what we wanted:

total_countdate
5442020-01-01
7122020-01-02

Extract date part

So, let’s say we wanted to show the label of the count in the format of weekdays if the results show is in the week range (from Monday to Sunday). We could also achieve it with to_char like this:

SELECT sum(count) as total_count,
  to_char(start_at, 'YYYY-MM-DD') as date,
  to_char(start_at, 'Dy') as label
FROM steps
GROUP BY to_char(start_at, 'YYYY-MM-DD'), to_char(start_at, 'Dy')
ORDER BY to_char(start_at, 'YYYY-MM-DD');

We need to add the formating to the GROUP BY also since we are selecting it. With this, the result will be as follow:

total_countdatelabel
5442020-01-01Wed
7122020-01-02Thu

We are using the template patterns of Dy here which formatted the date to the day name such as Mon, Tue, Wed (For the full list of patterns available, refer to the here).

But, be aware of this approach when we use ORDER BY with to_char that return alphabet instead of integer in string. For example, let say we want to group by month and sum the steps count, we might do something like this:

SELECT sum(count) as total_count,
  to_char(start_at, 'YYYY-MM') as date,
  --- 'Mon' template pattern  will format 2020-01-01 to 'Jan'
  to_char(start_at, 'Mon') as label
FROM steps
GROUP BY to_char(start_at, 'YYYY-MM'), to_char(start_at, 'Mon')
ORDER BY to_char(start_at, 'Mon');

This will not return the order of rows from Jan to Dec, instead it would look something like this:

total_countdatelabel
179362020-04Apr
191602020-08Aug
39212020-12Dec
177142020-02Feb

where the label is sorted alphabetically since the column type is string.

One way to prevent this from happening is used the approach shown above, where we ORDER BY to_char(start_at, 'YYYY-MM') instead, since the value formatted is integer in string, it will still be sorted as expected.

The other approach would be using date_part to extract part of the timestamp instead of to_char. The difference between date_part from to_char is date_part always return number (to be exact is double precision) type. With to_char you have more options to format your date.

SELECT sum(count) as total_count,
  to_char(start_at, 'YYYY-MM') as date,
  date_part('month', start_at) as label
FROM steps
GROUP BY to_char(start_at, 'YYYY-MM'), date_part('month', start_at)
ORDER BY date_part('month', start_at);

which return:

total_countdatelabel
187012020-011
177142020-022
194952020-033

Wrap Up

Through the post, we have gone through the business requirement and implemented it iteratively using different date function in PostgreSQL. To sum up:

Utilizing these few function I have managed to implement aggregation of data for different period such as:

There are probably more use cases of these date functions that I haven’t come across yet. So, don’t let this post limit your usage.