Window Functions in SQL

Posted Apr 8, 2023 Updated Apr 8, 2023

By Shashwat Pandey

10 min read

A window function in SQL is a type of function that performs a calculation across a set of rows in a table, and returns a result for each row based on that calculation. The calculation is performed over a “window” of rows that is defined by the ORDER BY clause of the query.

This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row; the rows retain their separate identity.

The window function operates on the result set produced by the query, rather than on the individual rows of the table. This allows the function to perform calculations that involve multiple rows without grouping them into aggregates.

Some common window functions are:

ROW_NUMBER()
RANK()
DENSE_RANK()
SUM()
AVG()
MAX()
NTILE()

SELECT column1,
       column2,
       SUM(column2) OVER (PARTITION BY column1 ORDER BY column3) AS some_name_for_the_new_column
FROM database.table
WHERE <some_condition>

The first part of the above aggregation, SUM(), looks a lot like any other aggregation. Adding “OVER” designates it as a window function.

You can read it as “take the sum of __ over the entire result set in order by __”

If you’d like to narrow the window from the entire dataset to individual groups within the dataset, you can use PARTITION BY.

ROW_NUMBER()

Suppose we have a table named “employees” with columns “id”, “name”, “department”, and “salary”. We want to assign a unique row number to each employee, based on their salary, so that the employee with the highest salary gets assigned row number 1.

We can use the ROW_NUMBER() function along with the ORDER BY clause to achieve this.

SELECT id, name, department, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM employees;

This query selects all columns from the "employees" table, and adds a new column named "row_num" that contains the row number assigned to each employee based on their salary. The ROW_NUMBER() function is applied over the window of all rows, and is ordered by the “salary” column in descending order, so that the employee with the highest salary gets assigned row number 1.

The result of this query would be a table with columns "id", "name", "department", "salary", and "row_num", where each row represents an employee and their corresponding row number based on their salary.

RANK() & DENSE_RANK()

Suppose we have a table named "students" with columns "id", "name", and "score". We want to rank the students based on their scores, and assign the same rank to students with the same score. We can use the RANK() function for this.

SELECT id, name, score, RANK() OVER (ORDER BY score DESC) AS rank
FROM students;

This query selects all columns from the "students" table, and adds a new column named "rank" that contains the rank assigned to each student based on their score. The RANK() function is applied over the window of all rows, and is ordered by the "score" column in descending order. Students with the same score will be assigned the same rank, and the next rank will be skipped. For example, if two students have the same highest score, they will both be assigned a rank of 1, and the next student with the next lower score will be assigned a rank of 3 (skipping rank 2).

Now, suppose we want to assign consecutive ranks to students with the same score, without skipping any ranks. We can use the DENSE_RANK() function for this.

SELECT id, name, score, DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM students;

This query selects all columns from the "students" table, and adds a new column named "dense_rank" that contains the dense rank assigned to each student based on their score. The DENSE_RANK() function is applied over the window of all rows, and is ordered by the "score" column in descending order. Students with the same score will be assigned the same dense rank, and the next rank will be the next consecutive number. For example, if two students have the same highest score, they will both be assigned a dense rank of 1, and the next student with the next lower score will be assigned a dense rank of 2.

Difference between the ROW_NUMBER() and RANK() functions

Suppose we have a table named "sales" with columns "region", "product", and "sales_amount". We want to assign a unique identifier to each sale in the table, and rank the sales based on their sales amount within each region.

To assign a unique identifier to each sale, we can use the ROW_NUMBER() function. To rank the sales based on their sales amount within each region, we can use the RANK() function.

SELECT region, product, sales_amount,
ROW_NUMBER() OVER (ORDER BY sales_amount DESC) AS row_num,
RANK() OVER (PARTITION BY region ORDER BY sales_amount DESC) AS rank
FROM sales;

In this query, the ROW_NUMBER() function is applied over the entire table to assign a unique identifier to each sale based on its sales amount, ordered by sales_amount in descending order. This means that each sale in the table will have a unique row number.

The RANK() function, on the other hand, is applied over each region separately to assign a rank to each sale within its region based on its sales amount, ordered by sales_amount in descending order. This means that each sale in a region with the same sales amount will be assigned the same rank, and the next rank will be skipped.

For example, suppose we have the following sales data:

Region	Product	Sales Amount
West	A	100
West	B	200
West	C	200
East	A	150
East	B	150
East	C	250

The result of the above query would be:

Region	Product	Sales Amount	row_num	rank
West	B	200	1	1
West	C	200	2	1
West	A	100	3	3
East	C	250	4	1
East	A	150	5	2
East	B	150	6	2

As we can see, each sale in the table is assigned a unique row number based on its sales amount. In the West region, the sales for products B and C have the same sales amount and are assigned the same rank of 1, while the sale for product A has a lower sales amount and is assigned a rank of 3. In the East region, the sale for product C has the highest sales amount and is assigned a rank of 1, while the sales for products A and B have the same sales amount and are assigned the same rank of 2.

LAG and LEAD

Suppose we have a table named "sales" with columns "date", "product", and "sales_amount". We want to calculate the difference in sales amounts for each product between consecutive days.

To do this, we can use the LAG() and LEAD() functions. The LAG() function allows us to access the value of a previous row, while the LEAD() function allows us to access the value of a subsequent row, based on a specified column and ordering within a window.

SELECT date, product, sales_amount,
sales_amount - LAG(sales_amount) OVER (PARTITION BY product ORDER BY date) AS sales_diff_prev,
LEAD(sales_amount) OVER (PARTITION BY product ORDER BY date) - sales_amount AS sales_diff_next
FROM sales;

In this query, we use the LAG() function to calculate the difference in sales amounts for each product between the current day and the previous day. We use the LEAD() function to calculate the difference in sales amounts for each product between the current day and the next day.

The PARTITION BY clause is used to partition the window by product, so that the LAG() and LEAD() functions are applied to the rows with the same product. The ORDER BY clause is used to order the rows within each partition by date, so that we can access the previous and next sales amounts for each product.

For example, suppose we have the following sales data:

Date	Product	Sales Amount
2022-01-01	A	100
2022-01-02	A	150
2022-01-03	A	200
2022-01-01	B	50
2022-01-02	B	70
2022-01-03	B	90

The result of the above query would be:

Date	Product	Sales Amount	sales_diff_prev	sales_diff_next
2022-01-01	A	100	NULL	50
2022-01-02	A	150	50	50
2022-01-03	A	200	50	NULL
2022-01-01	B	50	NULL	20
2022-01-02	B	70	20	20
2022-01-03	B	90	20	NULL

As we can see, the sales_diff_prev column shows the difference in sales amounts between the current day and the previous day for each product, while the sales_diff_next column shows the difference in sales amounts between the current day and the next day for each product. For product A, the sales amounts increase by 50 and then 50 again, while for product B, the sales amounts increase by 20 and then 20 again.

NTILE()

The NTILE() function is a window function in SQL that allows you to divide a result set into a specified number of groups, or "tiles" based on a specified column and ordering within a window.

Suppose we have a table named "employees" with columns "employee_id", "employee_name", and "salary". We want to divide the employees into four equal groups based on their salaries.

To do this, we can use the NTILE() function. The NTILE() function takes an integer argument specifying the number of tiles to divide the result set into. The result set is divided into this number of groups based on the values in the specified column and ordering within a window.

SELECT employee_id, employee_name, salary,
NTILE(4) OVER (ORDER BY salary DESC) AS salary_group
FROM employees;

In this query, we use the NTILE() function to divide the result set into four groups based on the salaries in descending order. The OVER clause is used to specify the window, which in this case is the entire result set. The ORDER BY clause is used to order the rows within the window by salary in descending order.

For example, suppose we have the following employee data:

Employee ID	Employee Name	Salary
1	Alice	50000
2	Bob	60000
3	Charlie	70000
4	Dave	80000
5	Eve	90000
6	Frank	100000

The result of the above query would be:

Employee ID	Employee Name	Salary	Salary Group
6	Frank	100000	1
5	Eve	90000	1
4	Dave	80000	2
3	Charlie	70000	2
2	Bob	60000	3
1	Alice	50000	4

As we can see, the employees are divided into four groups based on their salaries, with each group containing two employees. The salary_group column shows the group number for each employee. The employees with the highest salaries are in group 1, while the employees with the lowest salaries are in group 4.

We can use window aliases to simplify the above example. Here’s how we can refactor the example:

SELECT employee_id, employee_name, salary,
NTILE(4) OVER salary_window AS salary_group
FROM employees
WINDOW salary_window AS (ORDER BY salary DESC);

In this refactored query, we define a window alias named "salary_window" that orders the rows by salary in descending order. We then use this window alias in the NTILE() function to divide the result set into four groups based on the salaries in the salary_window window.

The result of the query is the same as the previous example. The only difference is that we use a window alias to simplify the syntax and make the query more readable.

Thanks for reading !!!

Window Functions in SQL

Window Functions in SQL

ROW_NUMBER()

RANK() & DENSE_RANK()

Difference between the ROW_NUMBER() and RANK() functions

LAG and LEAD

NTILE()

Recommend

SpringBoot线程池和Java线程池的实现原理 - twilight0402

How to miss a delivery date

大模型进化加速，高质量数据成稀缺性来源 | 见智研究

Grep | New kind of search engine

打造数字产业集群新优势

【CV晨读】字节或成中国最赚钱互联网公司；微信QQ故障被定性为一级事故

Teaching ChatGPT to Speak my Son’s Invented Language

玩家称低价区XGP已无法叠加多年最高限制13个月

Here's why people shouldn't freak out yet about losing their jobs to AI automati...

不装了？无视工信部的“最后通牒”，苹果正式表态：我就是行规|隐私保护|苹果公司|财务...

About Joyk