3

Window Functions in MongoDB 5.0

 1 year ago
source link: https://www.percona.com/blog/window-functions-in-mongodb-5-0/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Percona Database Performance Blog

Window Functions in MongoDB 5.0

I have already presented in previous posts some of the new features available on MongoDB 5.0: resharding and time series collections. Please have a look if you missed them:

MongoDB 5.0 Time Series Collections

Resharding in MongoDB 5.0

In this article, I would like to present another new feature: window functions.

Window functions are quite popular on relational databases, they permit the run of a window across sorted documents producing calculations over each step of the window. Typical use cases are calculating rolling averages, correlation scores, or cumulative totals. You can achieve the same result even with older versions of MongoDB or with databases where window functions are not available. But this comes at the cost of more complexity because multiple queries are usually required, and saving somewhere temporary data is needed as well.

Instead, the window functions let you run a single query and get the expected results in a more efficient and elegant way.

Let’s see how the feature works on MongoDB 5.0.

The window functions

A new aggregation stage $setWindowFields is available on MongoDB 5.0. This is the one that provides the window functions capability.

The following is the syntax of the stage:

Shell
  $setWindowFields: {
    partitionBy: <expression>,
    sortBy: {
      <sort field 1>: <sort order>,
      <sort field 2>: <sort order>,
      <sort field n>: <sort order>
    output: {
      <output field 1>: {
        <window operator>: <window operator parameters>,
        window: {
          documents: [ <lower boundary>, <upper boundary> ],
          range: [ <lower boundary>, <upper boundary> ],
          unit: <time unit>
      <output field 2>: { ... },
      <output field n>: { ... }
  • partitionBy (optional): some expression to group the document. If omitted by default all the documents are grouped into a single partition
  • sortBy (required in some cases ): sorting the documents. Uses the $sort syntax
  • output (required): specifies the documents to append to the result set. Basically, this is the parameter that provides the result of the window function
  • window (optional): defines the inclusive window boundaries and how the boundaries should be used for the calculation of the window function result

Well, the definitions may look cryptic but a couple of simple examples will clarify how you can use them.

The test dataset

I have a Percona Server for MongoDB 5.0 running and I got some public data about COVID-19 infections, hospitalizations, and other info from Italy. The data are available on a per-day and per-region basis from the following link: https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni.

I loaded just a few months’ data spanning 2021 and 2022. Data is labeled in Italian, so I created a similar and reduced collection just for the needs of this article.

Here is a sample of the documents:

Shell
> db.covid.find({"region":"Lombardia"}).sort({"date":1}).limit(5)
{ "_id" : ObjectId("62ab5f7d017d030e4cb314e9"), "region" : "Lombardia", "total_cases" : 884125, "date" : ISODate("2021-10-01T15:00:00Z") }
{ "_id" : ObjectId("62ab5f7d017d030e4cb314fe"), "region" : "Lombardia", "total_cases" : 884486, "date" : ISODate("2021-10-02T15:00:00Z") }
{ "_id" : ObjectId("62ab5f7d017d030e4cb31516"), "region" : "Lombardia", "total_cases" : 884814, "date" : ISODate("2021-10-03T15:00:00Z") }
{ "_id" : ObjectId("62ab5f7d017d030e4cb31529"), "region" : "Lombardia", "total_cases" : 884920, "date" : ISODate("2021-10-04T15:00:00Z") }
{ "_id" : ObjectId("62ab5f7d017d030e4cb3153d"), "region" : "Lombardia", "total_cases" : 885208, "date" : ISODate("2021-10-05T15:00:00Z") }

Each document contains the daily number of total COVID infections from the beginning of the pandemic for a specific Italian region.

Calculate daily new cases

Let’s create our first window function.

Since we have in the collection only the number of total cases, we would like to calculate the number of new cases per day. This way we can understand if the status of the pandemic is getting worse or improving.

You can achieve that by issuing the following aggregation pipeline:

Shell
> db.covid.aggregate( [
{ $setWindowFields: {
    partitionBy : "$region",
    sortBy: { date: 1 },
    output: {
      previous: {
        $push: "$total_cases",
        window: {
          range: [-1, -1],
          unit: "day"
{ $unwind:"$previous"},
{ $addFields: {
    new_cases: {
      $subtract: ["$total_cases","$previous"]
{ $match: { "region": "Lombardia" } },
{ $project: { _id:0, region:1, date:1, new_cases: 1}  }
{ "region" : "Lombardia", "date" : ISODate("2021-10-02T15:00:00Z"), "new_cases" : 361 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-03T15:00:00Z"), "new_cases" : 328 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-04T15:00:00Z"), "new_cases" : 106 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-05T15:00:00Z"), "new_cases" : 288 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-06T15:00:00Z"), "new_cases" : 449 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-07T15:00:00Z"), "new_cases" : 295 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-08T15:00:00Z"), "new_cases" : 293 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-09T15:00:00Z"), "new_cases" : 284 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-10T15:00:00Z"), "new_cases" : 278 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-11T15:00:00Z"), "new_cases" : 87 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-12T15:00:00Z"), "new_cases" : 306 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-13T15:00:00Z"), "new_cases" : 307 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-14T15:00:00Z"), "new_cases" : 273 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-15T15:00:00Z"), "new_cases" : 288 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-16T15:00:00Z"), "new_cases" : 432 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-17T15:00:00Z"), "new_cases" : 297 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-18T15:00:00Z"), "new_cases" : 112 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-19T15:00:00Z"), "new_cases" : 412 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-20T15:00:00Z"), "new_cases" : 457 }
{ "region" : "Lombardia", "date" : ISODate("2021-10-21T15:00:00Z"), "new_cases" : 383 }

The pipeline also contains stages to make the output more readable. Let’s focus on the $setWindowFields anyway.

In the first stage, we define the window function in order to create for each document a new field containing the total cases from the previous day. The field was obviously named previous.

Then we’ll use this information in the following stages to simply calculate the difference between the total cases of “today” and “yesterday”. Then we get the daily increase.

Take a look at how the window function has been created. We used $push to fill the new field with the value of total_cases. In the window document, we defined the range as [-1,-1]. These numbers represent the lower and upper boundaries of the window and they both correspond to the previous (-1) document in the window. It spans only one document: yesterday. In this case, the usage of sortBy is relevant because it tells MongoDB which order to consider the documents in the windows. The trick of defining the range as [-1,-1] to get yesterday’s data is possible because the documents are properly sorted.

Calculate moving average

Let’s now calculate the moving average. We’ll consider the last week of data to calculate the average of new cases on a daily basis. This kind of parameter was a very popular one during the peak of the pandemic to trigger a lot of discussions around the forecasts and to address the decisions of the governments. Well, it’s a simplification. There were also other relevant parameters, but the moving average was one of them.

To calculate the moving average we need the daily new cases we have calculated in the previous example. We can reuse those values in different ways like adding another “$setWindowField” stage in the previous pipeline, adding the new_cases field on existing documents, or creating another collection as I did for simplicity this way using the $out stage:

Shell
> db.covid.aggregate( [ { $setWindowFields: { partitionBy : "$region", sortBy: { date: 1 }, output: { previous: { $push: "$total_cases", window: { range: [-1, -1],  unit: "day" } } } } }, { $unwind:"$previous"},  { $addFields: { new_cases: { $subtract: ["$total_cases","$previous"] } } }, { $project: { region:1, date:1, new_cases: 1} }, { $out: "covid_daily"  }  ] )

Now we can calculate the moving average on the covid_daily collection. Let’s do it with the following aggregation:

Shell
> db.covid_daily.aggregate([
{ $setWindowFields: {
    partitionBy : "$region",
    sortBy : { date: 1 },
    output: {
      moving_average: {
        $avg: "$new_cases",
        window: {
          range: [-6, 0],
          unit: "day"
{ $project: { _id:0  } }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-02T15:00:00Z"), "new_cases" : 49, "moving_average" : 49 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-03T15:00:00Z"), "new_cases" : 36, "moving_average" : 42.5 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-04T15:00:00Z"), "new_cases" : 14, "moving_average" : 33 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-05T15:00:00Z"), "new_cases" : 35, "moving_average" : 33.5 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-06T15:00:00Z"), "new_cases" : 61, "moving_average" : 39 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-07T15:00:00Z"), "new_cases" : 54, "moving_average" : 41.5 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-08T15:00:00Z"), "new_cases" : 27, "moving_average" : 39.42857142857143 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-09T15:00:00Z"), "new_cases" : 48, "moving_average" : 39.285714285714285 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-10T15:00:00Z"), "new_cases" : 19, "moving_average" : 36.857142857142854 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-11T15:00:00Z"), "new_cases" : 6, "moving_average" : 35.714285714285715 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-12T15:00:00Z"), "new_cases" : 55, "moving_average" : 38.57142857142857 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-13T15:00:00Z"), "new_cases" : 56, "moving_average" : 37.857142857142854 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-14T15:00:00Z"), "new_cases" : 45, "moving_average" : 36.57142857142857 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-15T15:00:00Z"), "new_cases" : 41, "moving_average" : 38.57142857142857 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-16T15:00:00Z"), "new_cases" : 26, "moving_average" : 35.42857142857143 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-17T15:00:00Z"), "new_cases" : 39, "moving_average" : 38.285714285714285 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-18T15:00:00Z"), "new_cases" : 3, "moving_average" : 37.857142857142854 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-19T15:00:00Z"), "new_cases" : 45, "moving_average" : 36.42857142857143 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-20T15:00:00Z"), "new_cases" : 54, "moving_average" : 36.142857142857146 }
{ "region" : "Abruzzo", "date" : ISODate("2021-10-21T15:00:00Z"), "new_cases" : 72, "moving_average" : 40 }

Note we have defined the range boundaries as [-6,0] in order to span the last week’s documents for the current document.

Notes about window functions

We have used unit: “day” in the window definition, but this option field can also have other values like year, quarter, month, week, day, hour, and so on.

There are multiple operators that can be used with $setWindowFields: $avg, $count, $first, $last, $max, $min, $derivative, $sum, $rank and many others you can check on the documentation.

There are a few restrictions about window functions usage. Please have a look at the official documentation in case you hit some of them.

Conclusion

The new window function is a very good feature deployed on MongoDB 5.0. It could make life easier for a lot of developers.

For getting more details and to check the restrictions you can, have a look at the following page:

https://www.mongodb.com/docs/manual/reference/operator/aggregation/setWindowFields/

Percona Server for MongoDB 5.0 is a drop-in replacement for MongoDB Community. You can use it for free and you can rely on enterprise-class features like encryption at rest, LDAP authentication, auditing, and many others. You can also rely on all new features of MongoDB Community 5.0, including window functions.

Take a look at Percona Server for MongoDB.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK