42

Why window function in SQL is so important that you should learn it right now?

 4 years ago
source link: https://www.tuicool.com/articles/Q7JV3qr
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Why window function in SQL is so important that you should learn it right now?

Oct 26 ·7min read

In my last article, I mentioned that I was asked about how to use a window function many times. And after mastering it, I am sure that I should learn much earlier because it can help me carry out more in-depth analysis through SQL.

My last article about using SQL for data analysis

Not like group by function which will reduce the number of rows, window function can perform the aggregation for each row without reducing. There are a number of different window functions. But today I will tell you basic knowledge on how to use window function and how you can apply in your data analysis.

The dataset that I will use today is “student-mat” from “Student Alcohol Consumption” obtained from Kaggle. Then the dataset is stored in dataset.student_mat. You can download the dataset from the below link:

In general, there are two types of window functions; one is built-in function like row_number, dense_rank, lag, and lead; another is aggregation just like usual aggregation function. For window function, there is a significant keyword “over” . Once you notice “over” within a query, you can tell there is a window function.

In the first part, I will explain how to use the aggregate window function since I believe you already have some knowledge of using normal aggregation function. After knowing the basics of window function it is easier to understand built-in functions.

  1. Aggregate window function
  2. Built-in window function

Aggregate window function

The basic syntax for an aggregate window function is a normal aggregation function but following an over clause statement. Here I will use the data to demonstrate. The first three columns in the dataset are school, sex, and age. And now you want to know the age difference for each sex versus overall in each school. Then you can use the window function.

The first step is to calculate the average age per school. This can be obtained from ‘avg(age) over (partition by school)’ . The first part before over is the same as the normal average function. The second part after over is called “partition by” clause which is required to be enclosed by parentheses. “Partition by school” means all records with the same values in school are selected and carry the calculation.

select school, sex,age, 
avg(age) over (partition by school) as age_avg_by_shl , 
age - avg(age) over (partition by school) as age_diff_from_shl_avg 
from dataset.student_mat
yInEjer.png!web
age_avg_by_shl is not a constant in the whole column but varies by school.

16.5215 for school equal to “GP” and 18.0217 for school equal to MS match with normal average function group by the school.

fEBbyq7.png!web

So now you can calculate the difference between each student and the average, as what age_diff_from_shl_avg performs in the query.

NvErEfI.png!web

Finally, you can get the average age difference by school and sex.

select school, sex, round(avg(age),2) as avg_age , round(avg(age_avg_by_shl),2) as avg_age_school, round(avg(age_diff_from_shl_avg),2) as avg_age_diff_from_shl
from 
(
select school, sex,age, avg(age) over (partition by school) as age_avg_by_shl , age - avg(age) over (partition by school) as age_diff_from_shl_avg 
from dataset.student_mat 
) a
group by school, sex
iYZJVju.png!web

From now you can tell the age for females is higher than males in GP but not in MS.

There is another column address showing whether the student lives in urban or rural areas. Then you can use a window function to calculate the percentage of students living in each area by school and area.

The first step is to calculate the number of students in each combination of school and address as a usual group by aggregation. Then to calculate the number of students over the total number of students in each school and in each address area by using window functions. “sum(student_cnt) over (partition by school)” will return the total numbers of students for GP and MS while “student_cnt / sum(student_cnt) over (partition by address)” will return the total numbers of students for urban and rural areas. As a result, you can get the percentage of students by school and address area.

select school, address, student_cnt
, student_cnt / sum(student_cnt) over (partition by school) as percent_by_shl
, student_cnt / sum(student_cnt) over (partition by address) as percent_by_address
from 
(
select school, address, count(1) as student_cnt
from dataset.student_mat 
group by school, address
) a
2e2QFje.png!web

So you can tell there are more students in GP living in an urban area while there are more students in MS living in a rural area. And if you find a student living in urban area, you are almost certain that this student is from GP since over 93% of urban students study in GP. This already becomes a predictive analysis that you can use to classify the school for the student by its address only.

The third example of the aggregation window function is a bit more complex. Since the dataset is called “Student Alcohol Consumption”, of course, we should do some analyses on it. There are two categorical columns “Dalc” and “Walc” showing consumption on workday and weekend. Then we can find out if alcohol consumption will impact the final result indicated by column “g3”.

To simplify, I will just add up two columns together rather than separating them. Then to calculate the overall average of g3 and also the difference between individual g3 and the overall average of g3. To calculate the overall average of g3, here we cannot use group by function since this will reduce the number of rows. Instead, we use the window function to obtain and put the result on each row.

The second part after over for syntax “avg(g3) over ()” is empty because we don’t need to categorize the dataset. We need the whole dataset to calculate the overall average. Therefore the parentheses are empty inside.

Now you can calculate the average difference for each wkalc group.

select wkalc, round(avg(avg_g3_overall),2) as avg_g3_overall, round(avg(g3),2) as avg_g3, round(avg(g3_diff),2) as avg_g3_diff
from 
(
select (Dalc + Walc) as wkalc, g3, avg(g3) over () as  avg_g3_overall, g3 - avg(g3) over () as g3_diff
from dataset.student_mat
) a 
group by wkalc
order by wkalc

There is no clear relationship between alcohol consumption and g3 results. So let’s drink (?)

Built-in window function

Now I hope you know basic usage on aggregation window function. Next, I will talk about the built-in window function. Below is the list for 11 built-in window functions :

CUME_DIST(), DENSE_RANK(), FIRST_VALUE(), LAG(), LAST_VALUE() LEAD(), NTH_VALUE(), NTILE(), PERCENT_RANK(), RANK(), ROW_NUMBER()

I will not explain all in detail. You can get the detailed information from the below link:

I will use rank() as an example to demonstrate how to use it.

Similar to the aggregation window function, an over clause statement is required.

There is a categorical column “studytime” from 1 to 4. And we want to know if study time is a factor of g3. Apart from calculating the average of g3 by studytime, we can use rank() to get the order.

After getting the average of g3 for each studytime, we can use the average result to rank the order by descending order. Then we can order the ranking to get the comparison directly. This method is useful especially when there are many groups that it is not possible to get ranking by naked eyes.

select studytime, avg_g3, rank() over (order by avg_g3 desc) as ranking 
from 
(
select studytime, avg(g3) as avg_g3
from dataset.student_mat
group by studytime
) a
order by ranking

So it is still true the studying more will get better results.

PS: There are other two built-in functions which also provide a ranking, one is dense_rank(), another is row_number(). The difference is how they return ranks when there are rows with the same values. You can know more from the below link:

https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_rank

Ending

Mastering window functions provides you much more in-depth data analysis and sometimes it can even help you perform predictive analytics. As I said in my last article, mastering SQL is necessary for landing a data analyst position and window function is constantly asked during an interview (I can’t remember how many times I was asked about it during interviews). Therefore you should also start learning window functions after reading my article. I hope you can master it as well and get your dream job. That’s all for today’s article. If you enjoy it, please leave your comment and give me a clap. And share this article to let more people know the importance of window function. See you next time.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK