4

Window Functions in SQL

 1 year ago
source link: https://shashwatpandey4.github.io//posts/window-functions/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Window Functions in SQL

Posted Apr 8, 2023 Updated Apr 8, 2023
By Shashwat Pandey
10 min read

A window function in SQL is a type of function that performs a calculation across a set of rows in a table, and returns a result for each row based on that calculation. The calculation is performed over a “window” of rows that is defined by the ORDER BY clause of the query.

This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row; the rows retain their separate identity.

The window function operates on the result set produced by the query, rather than on the individual rows of the table. This allows the function to perform calculations that involve multiple rows without grouping them into aggregates.

Some common window functions are:

  1. ROW_NUMBER()
  2. RANK()
  3. DENSE_RANK()
  4. SUM()
  5. AVG()
  6. MAX()
  7. NTILE()
SELECT column1,
       column2,
       SUM(column2) OVER (PARTITION BY column1 ORDER BY column3) AS some_name_for_the_new_column
FROM database.table
WHERE <some_condition>

The first part of the above aggregation, SUM(), looks a lot like any other aggregation. Adding “OVER” designates it as a window function.

You can read it as “take the sum of __ over the entire result set in order by __”

If you’d like to narrow the window from the entire dataset to individual groups within the dataset, you can use PARTITION BY.

ROW_NUMBER()

Suppose we have a table named “employees” with columns “id”, “name”, “department”, and “salary”. We want to assign a unique row number to each employee, based on their salary, so that the employee with the highest salary gets assigned row number 1.

We can use the ROW_NUMBER() function along with the ORDER BY clause to achieve this.

SELECT id, name, department, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM employees;

This query selects all columns from the "employees" table, and adds a new column named "row_num" that contains the row number assigned to each employee based on their salary. The ROW_NUMBER() function is applied over the window of all rows, and is ordered by the “salary” column in descending order, so that the employee with the highest salary gets assigned row number 1.

The result of this query would be a table with columns "id", "name", "department", "salary", and "row_num", where each row represents an employee and their corresponding row number based on their salary.

RANK() & DENSE_RANK()

Suppose we have a table named "students" with columns "id", "name", and "score". We want to rank the students based on their scores, and assign the same rank to students with the same score. We can use the RANK() function for this.

SELECT id, name, score, RANK() OVER (ORDER BY score DESC) AS rank
FROM students;

This query selects all columns from the "students" table, and adds a new column named "rank" that contains the rank assigned to each student based on their score. The RANK() function is applied over the window of all rows, and is ordered by the "score" column in descending order. Students with the same score will be assigned the same rank, and the next rank will be skipped. For example, if two students have the same highest score, they will both be assigned a rank of 1, and the next student with the next lower score will be assigned a rank of 3 (skipping rank 2).

Now, suppose we want to assign consecutive ranks to students with the same score, without skipping any ranks. We can use the DENSE_RANK() function for this.

SELECT id, name, score, DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM students;

This query selects all columns from the "students" table, and adds a new column named "dense_rank" that contains the dense rank assigned to each student based on their score. The DENSE_RANK() function is applied over the window of all rows, and is ordered by the "score" column in descending order. Students with the same score will be assigned the same dense rank, and the next rank will be the next consecutive number. For example, if two students have the same highest score, they will both be assigned a dense rank of 1, and the next student with the next lower score will be assigned a dense rank of 2.

Difference between the ROW_NUMBER() and RANK() functions

Suppose we have a table named "sales" with columns "region", "product", and "sales_amount". We want to assign a unique identifier to each sale in the table, and rank the sales based on their sales amount within each region.

To assign a unique identifier to each sale, we can use the ROW_NUMBER() function. To rank the sales based on their sales amount within each region, we can use the RANK() function.

SELECT region, product, sales_amount,
ROW_NUMBER() OVER (ORDER BY sales_amount DESC) AS row_num,
RANK() OVER (PARTITION BY region ORDER BY sales_amount DESC) AS rank
FROM sales;

In this query, the ROW_NUMBER() function is applied over the entire table to assign a unique identifier to each sale based on its sales amount, ordered by sales_amount in descending order. This means that each sale in the table will have a unique row number.

The RANK() function, on the other hand, is applied over each region separately to assign a rank to each sale within its region based on its sales amount, ordered by sales_amount in descending order. This means that each sale in a region with the same sales amount will be assigned the same rank, and the next rank will be skipped.

For example, suppose we have the following sales data:

RegionProductSales Amount
WestA100
WestB200
WestC200
EastA150
EastB150
EastC250

The result of the above query would be:

RegionProductSales Amountrow_numrank
WestB20011
WestC20021
WestA10033
EastC25041
EastA15052
EastB15062

As we can see, each sale in the table is assigned a unique row number based on its sales amount. In the West region, the sales for products B and C have the same sales amount and are assigned the same rank of 1, while the sale for product A has a lower sales amount and is assigned a rank of 3. In the East region, the sale for product C has the highest sales amount and is assigned a rank of 1, while the sales for products A and B have the same sales amount and are assigned the same rank of 2.

LAG and LEAD

Suppose we have a table named "sales" with columns "date", "product", and "sales_amount". We want to calculate the difference in sales amounts for each product between consecutive days.

To do this, we can use the LAG() and LEAD() functions. The LAG() function allows us to access the value of a previous row, while the LEAD() function allows us to access the value of a subsequent row, based on a specified column and ordering within a window.

SELECT date, product, sales_amount,
sales_amount - LAG(sales_amount) OVER (PARTITION BY product ORDER BY date) AS sales_diff_prev,
LEAD(sales_amount) OVER (PARTITION BY product ORDER BY date) - sales_amount AS sales_diff_next
FROM sales;

In this query, we use the LAG() function to calculate the difference in sales amounts for each product between the current day and the previous day. We use the LEAD() function to calculate the difference in sales amounts for each product between the current day and the next day.

The PARTITION BY clause is used to partition the window by product, so that the LAG() and LEAD() functions are applied to the rows with the same product. The ORDER BY clause is used to order the rows within each partition by date, so that we can access the previous and next sales amounts for each product.

For example, suppose we have the following sales data:

DateProductSales Amount
2022-01-01A100
2022-01-02A150
2022-01-03A200
2022-01-01B50
2022-01-02B70
2022-01-03B90

The result of the above query would be:

DateProductSales Amountsales_diff_prevsales_diff_next
2022-01-01A100NULL50
2022-01-02A1505050
2022-01-03A20050NULL
2022-01-01B50NULL20
2022-01-02B702020
2022-01-03B9020NULL

As we can see, the sales_diff_prev column shows the difference in sales amounts between the current day and the previous day for each product, while the sales_diff_next column shows the difference in sales amounts between the current day and the next day for each product. For product A, the sales amounts increase by 50 and then 50 again, while for product B, the sales amounts increase by 20 and then 20 again.

NTILE()

The NTILE() function is a window function in SQL that allows you to divide a result set into a specified number of groups, or "tiles" based on a specified column and ordering within a window.

Suppose we have a table named "employees" with columns "employee_id", "employee_name", and "salary". We want to divide the employees into four equal groups based on their salaries.

To do this, we can use the NTILE() function. The NTILE() function takes an integer argument specifying the number of tiles to divide the result set into. The result set is divided into this number of groups based on the values in the specified column and ordering within a window.

SELECT employee_id, employee_name, salary,
NTILE(4) OVER (ORDER BY salary DESC) AS salary_group
FROM employees;

In this query, we use the NTILE() function to divide the result set into four groups based on the salaries in descending order. The OVER clause is used to specify the window, which in this case is the entire result set. The ORDER BY clause is used to order the rows within the window by salary in descending order.

For example, suppose we have the following employee data:

Employee IDEmployee NameSalary
1Alice50000
2Bob60000
3Charlie70000
4Dave80000
5Eve90000
6Frank100000

The result of the above query would be:

Employee IDEmployee NameSalarySalary Group
6Frank1000001
5Eve900001
4Dave800002
3Charlie700002
2Bob600003
1Alice500004

As we can see, the employees are divided into four groups based on their salaries, with each group containing two employees. The salary_group column shows the group number for each employee. The employees with the highest salaries are in group 1, while the employees with the lowest salaries are in group 4.

We can use window aliases to simplify the above example. Here’s how we can refactor the example:

SELECT employee_id, employee_name, salary,
NTILE(4) OVER salary_window AS salary_group
FROM employees
WINDOW salary_window AS (ORDER BY salary DESC);

In this refactored query, we define a window alias named "salary_window" that orders the rows by salary in descending order. We then use this window alias in the NTILE() function to divide the result set into four groups based on the salaries in the salary_window window.

The result of the query is the same as the previous example. The only difference is that we use a window alias to simplify the syntax and make the query more readable.

Thanks for reading !!!


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK