4 Essential SQL Window Functions and Examples for a Data Scientist Interview in 2020

by Leon Dec. 6, 2020

Introduction


About the author: Leon is a data science and machine learning executive from a FAANG (Facebook, Apple, Amazon, Netflix, Google) company in Silicon Valley. Prior to his current role, he ran the machine learning org at the EdTech company Chegg and worked as a research scientist at amazon.com, building large-scale machine learning systems.

He has interviewed thousands of data science or machine learning candidates as a hiring manager or part of a hiring committee in his career.


WINDOW functions are a family of SQL functions that come up quite often during data scientist job interviews. However, writing a bug-free SQL query using a WINDOW function can be quite challenging for any candidate, especially those who are just getting started with SQL.

In this article, I want to show you some typical WINDOW-function interview questions, common patterns and strategies to tackle them, and step-by-step solutions for a few examples.

Outline

I am going to break down this article into 4 sections:

  1. In the first section, I will go through some of the basic WINDOW functions built from regular aggregate functions, such as AVG, MIN/MAX, COUNT, and SUM, to get you started with the concept.
  2. For section 2, I will focus on ranking-related functions, such as ROW_NUMBER, RANK, and DENSE_RANK. Those functions are extremely useful when generating ranking indexes within underlying groups, and you should become very fluent in using them before entering a data scientist interview.
  3. In the third section, I will talk about how to generate statistics (e.g. percentiles, quartiles, median, etc.) with the NTILE function, which is a very common task as a data scientist.
  4. In the last section, let’s focus on LAG and LEAD, two functions that are super important if you are interviewing for a role that requires handling time series data.

Let’s get started.


Section 1. WINDOW functions based on regular aggregate functions (AVG, MIN/MAX, SUM, COUNT)

Window functions are a family of functions that perform calculations across a set of rows that are somehow related to the current row.

This is comparable to the type of calculation that can be done with an aggregate function, but unlike regular aggregate functions, window functions do not group several rows into a single output row — the rows retain their own identities.

Behind the scenes, the window functions actually process more than just the current row of the query results.


All examples in this article are based on data from a movie DVD rental business. In this first example, our goal is to compare each movie DVD's replacement cost to the average cost of movies sharing the same MPAA rating.

SELECT
    title,
    rating,
    replacement_cost,
    AVG(replacement_cost) OVER(PARTITION BY rating) AS avg_cost
FROM film;

For those of you who are not based in the United States, MPAA rating is a film rating system that determines a film’s suitability for certain audiences based on a film’s content. For example, G means it’s appropriate for all ages, while PG-13 contains materials that could be inappropriate for children under 13.

There is no GROUP BY clause for the AVG function, so how does the SQL engine know which rows to use to compute the average? The answer is the PARTITION BY clause inside OVER(): we are computing the average within each unique value of rating.

In the final output, every row carries the average cost for its rating, so you can perform analysis such as dividing the replacement cost by the average cost to find the relative expense of every movie compared to others sharing the same rating.
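As a quick sketch of that follow-up analysis (assuming the same film table from the playground, with cost_ratio as a made-up alias), the ratio can be computed directly in the query:

-- Relative expense: each title's replacement cost versus the average for its MPAA rating
SELECT
    title,
    rating,
    replacement_cost,
    AVG(replacement_cost) OVER(PARTITION BY rating) AS avg_cost,
    replacement_cost / AVG(replacement_cost) OVER(PARTITION BY rating) AS cost_ratio
FROM film;

A cost_ratio above 1 means the title is more expensive to replace than a typical movie carrying the same rating.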


https://sqlpad.io/playground/

All tables in this article are available on SQLPad’s online SQL playground. If you want to follow along and submit queries against those tables, please feel free to go to sqlpad.io/playground and have some fun.


Let’s take a look at another example. In this one, I want to compare every movie’s length (in minutes) to the maximum movie length in its category.

SELECT 
 title, 
 name, 
 length, 
 MAX(length) OVER(PARTITION BY name) AS max_length
FROM (
 SELECT F.title, C.name, F.length
 FROM film F
 INNER JOIN film_category FC
 ON FC.film_id = F.film_id
 INNER JOIN category C
 ON C.category_id = FC.category_id
) X;

It’s very similar to the first example, but I combined a MAX function with OVER and PARTITION BY to create a window function, which computes the maximum movie length inside the same movie category.

Take the first row, the movie Story Side: its length is 163 minutes, and the maximum length of an Action movie (its category) is 185 minutes. Comparing each movie’s length to that maximum gives a sense of how long this specific movie is relative to others in the same category, which can be a very useful piece of information, since movies from different categories can have quite different lengths.
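If you want that comparison as an explicit column, a minimal extension of the query above (same tables and join, with length_ratio as a made-up alias) might look like this:

-- Each movie's length as a fraction of the longest movie in its category
SELECT
 title,
 name,
 length,
 MAX(length) OVER(PARTITION BY name) AS max_length,
 length * 1.0 / MAX(length) OVER(PARTITION BY name) AS length_ratio
FROM (
 SELECT F.title, C.name, F.length
 FROM film F
 INNER JOIN film_category FC ON FC.film_id = F.film_id
 INNER JOIN category C ON C.category_id = FC.category_id
) X;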

SELECT
 film_id,
 title,
 length,
 SUM(length) OVER(ORDER BY film_id) AS running_total,
 SUM(length) OVER() AS overall,
 SUM(length) OVER(ORDER BY film_id) * 100.0 /SUM(length) OVER() AS running_percentage
FROM film
ORDER BY film_id;

Let’s take a look at this example. This one is a little complicated. We are calculating a running sum with a window function.

Assuming it’s a holiday season, and I want to binge-watch all of the 1000 movies, starting from movie id=1. After finishing each movie, I want to find out what my overall progress is. I can use SUM, and OVER to calculate a running total of time to get my overall progress.

Notice that there is no PARTITION BY clause because I am not grouping those movies into any sub-categories. I want to compute my overall progress but not based on any subgroups or categories. I am quite ambitious, don’t you think 😃?

Another thing to notice is that if I don’t add anything inside OVER(), I actually get the total number of minutes for the entire movie catalog. As you can see from the second-to-last column (overall), every row has the same value of 115267; but after I add the ORDER BY clause, I get the running total of minutes up to that specific row (the running_total column).
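For reference, the running total relies on the default window frame that comes with ORDER BY. Writing the frame out explicitly is a common interview follow-up, and since film_id is unique, this sketch should produce the same numbers as the query above:

-- Running total with the window frame written out explicitly
SELECT
 film_id,
 title,
 length,
 SUM(length) OVER(
  ORDER BY film_id
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
 ) AS running_total
FROM film
ORDER BY film_id;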

Again, please feel free to go to sqlpad’s playground and play with this film table until you become comfortable with the syntax.

If you are interested in practicing the WINDOW functions we just covered, here are 4 exercises to reinforce your learning.

Time to complete: ~30-45 mins.


Section 2: ROW_NUMBER, RANK, DENSE_RANK

Let’s go through some of the most important WINDOW functions: ROW_NUMBER and RANK.

SELECT
 F.film_id,
 F.title,
 F.length,
 ROW_NUMBER() OVER(ORDER BY length DESC) AS row_num
FROM film F
ORDER BY row_num;

The goal of this example is to create a ranking index based on the length of a movie for the entire movie catalog.

As you can see, the ROW_NUMBER function generates a sequence of integers, starting from 1, for each row.

And you might have noticed that movies with the same length were assigned different row numbers (the assignment among tied rows is arbitrary behind the scenes), so each row still gets a unique number.
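If you need the numbering to be reproducible from run to run, a common remedy is to add a unique column as a tie-breaker in the ORDER BY; for example, falling back to film_id:

-- Break ties on length deterministically by falling back to film_id
SELECT
 F.film_id,
 F.title,
 F.length,
 ROW_NUMBER() OVER(ORDER BY F.length DESC, F.film_id) AS row_num
FROM film F
ORDER BY row_num;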

SELECT
 F.film_id,
 F.title,
 F.length,
 C.name AS category,
 ROW_NUMBER() OVER(PARTITION BY C.name ORDER BY F.length DESC) row_num
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
ORDER BY C.name, row_num;

Let’s take a look at another example. Instead of comparing a movie’s length to all other movies from the entire catalog, we can rank them within each movie category, with the help of PARTITION BY.

Combining ROW_NUMBER with OVER and PARTITION BY is a pattern that comes up frequently in advanced SQL. Mastering it can make your daily data processing work much easier.

For example, imagine you are working at an e-commerce company with a global business. Your boss asks you to send her a list of best sellers for each country. You can use ROW_NUMBER and PARTITION BY to easily generate this list, as sketched below.
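Here is a minimal sketch of that report, assuming a hypothetical sales table with country, product_id, and units_sold columns (not part of the playground data), keeping the top 3 products per country:

-- Top 3 best sellers per country (sales is a hypothetical table)
WITH ranked AS (
 SELECT
  country,
  product_id,
  units_sold,
  ROW_NUMBER() OVER(PARTITION BY country ORDER BY units_sold DESC) AS row_num
 FROM sales
)
SELECT country, product_id, units_sold
FROM ranked
WHERE row_num <= 3
ORDER BY country, row_num;

The filter on row_num has to live outside the CTE, because a window function cannot be referenced directly in the WHERE clause of the query that defines it.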

SELECT
 F.film_id,
 F.title,
 F.length,
 RANK() OVER(ORDER BY length DESC) AS ranking
FROM film F
ORDER BY ranking;

Let’s take a look at the RANK function, which is very similar to ROW_NUMBER. The difference is that RANK assigns the same value to all tied rows, and the rank after a tie skips ahead to one plus the number of rows before it. Notice how it jumps from 1 to 11: the first ten movies are tied, so the next rank is 11.

The largest rank can be as high as the total number of rows (we have 1,000 movie titles here), since the last row is not tied with any other.

SELECT
 F.film_id,
 F.title,
 F.length,
 C.name AS category,
 RANK() OVER(PARTITION BY C.name ORDER BY F.length DESC) ranking
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
ORDER BY C.name, ranking;

Similarly, we can also generate rankings within a subgroup with the help of PARTITION BY.

SELECT
 F.film_id,
 F.title,
 F.length,
 DENSE_RANK() OVER(ORDER BY length DESC) AS ranking
FROM film F
ORDER BY ranking;

The last function I want to show you is DENSE_RANK. It is very similar to RANK but differs in how it handles ties: it continues with the next consecutive value rather than leaving a gap.

As you can see here, the first two rows (two movies with the same length) both get a value of 1; instead of jumping to 3, the next dense_rank value is 2.
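To see how the three functions differ on the same ordering, one query against the film table is enough (the aliases row_num, rank_num, and dense_rank_num are made up for this sketch):

-- ROW_NUMBER, RANK, and DENSE_RANK side by side on the same ordering
SELECT
 film_id,
 title,
 length,
 ROW_NUMBER() OVER(ORDER BY length DESC) AS row_num,
 RANK() OVER(ORDER BY length DESC) AS rank_num,
 DENSE_RANK() OVER(ORDER BY length DESC) AS dense_rank_num
FROM film
ORDER BY length DESC;

Tied lengths get distinct row_num values, identical rank_num values with a gap afterwards, and identical dense_rank_num values with no gap.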

SELECT
 F.film_id,
 F.title,
 F.length,
 C.name AS category,
 DENSE_RANK() OVER(PARTITION BY C.name ORDER BY F.length DESC) ranking
FROM film F
INNER JOIN film_category FC
ON FC.film_id = F.film_id
INNER JOIN category C
ON C.category_id = FC.category_id
ORDER BY C.name, ranking;

Similarly, if we combine PARTITION BY with DENSE_RANK, we get consecutive rankings inside each category. The largest dense_rank value is the number of distinct values inside a partition.

In summary, ROW_NUMBER, RANK, and DENSE_RANK are 3 functions that are very helpful for generating rankings. I have used ROW_NUMBER quite often at work as a data scientist, and I have also used RANK occasionally when dealing with ties (which is rare).

Time for some exercises: I have prepared 3 exercises to help with your understanding.

Time to complete: ~30-45 mins.

Section 3: NTILE

In this section, I am going to show you how to create statistics using NTILE.

NTILE is a very useful function, especially for data analytics professionals. For example, as a data scientist, you probably need to create robust statistics such as quartiles, quintiles, and medians in your daily job, and NTILE makes it very easy to generate those numbers.

NTILE takes the number of buckets as an argument and then distributes the rows across that many buckets as evenly as possible, based on how the rows are partitioned and ordered inside the OVER clause.

SELECT
 film_id,
 title,
 length,
 NTILE(100) OVER(ORDER BY length DESC) AS percentile
FROM film
ORDER BY percentile;

Let’s take a look at example 1, where we created 100 buckets and ordered all of the movies by their length in descending order. Therefore, the longest ones are assigned to bucket 1, and the shortest ones to bucket 100.
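One thing to keep in mind: a window function cannot appear in the WHERE clause of the same query, so if you only want, say, the movies in the top 1% by length, wrap the NTILE query in a subquery first. A sketch against the same film table:

-- Keep only the movies that fall into the top 1% by length
SELECT film_id, title, length
FROM (
 SELECT
  film_id,
  title,
  length,
  NTILE(100) OVER(ORDER BY length DESC) AS percentile
 FROM film
) X
WHERE percentile = 1
ORDER BY length DESC;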


For the second example, we created a few more statistics such as DECILES (10 buckets), and QUARTILES (4 buckets), and we also partitioned them by MPAA ratings, so the statistics are relative to each unique MPAA rating.
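The query for this second example is not shown in the text; a sketch that should reproduce it, assuming the same film table, looks like this:

-- Percentile, decile, and quartile of each movie's length within its MPAA rating
SELECT
 film_id,
 title,
 rating,
 length,
 NTILE(100) OVER(PARTITION BY rating ORDER BY length DESC) AS percentile,
 NTILE(10) OVER(PARTITION BY rating ORDER BY length DESC) AS decile,
 NTILE(4) OVER(PARTITION BY rating ORDER BY length DESC) AS quartile
FROM film
ORDER BY rating, length DESC;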

NTILE is a very straightforward window function that can be very useful for your daily job as a data scientist. Let’s do some exercises to help you remember its syntax and reinforce your learning in this lecture.

Some exercises for fun.

Time to complete: ~30-45 mins.


Section 4: LAG, LEAD

In the last section, I will walk you through two WINDOW functions: LAG and LEAD, which are extremely useful for dealing with time-related data.

The main difference between LAG and LEAD is that LAG gets data from previous rows, while LEAD is the opposite, which fetches data from the following rows.

For example, we can use either of the two functions to compute month-over-month growth. As a data analytics professional, you are very likely to work on time-related data, and if you can use LAG or LEAD efficiently, you will be a very productive data scientist.

Their syntax is very similar to other window functions. Instead of focusing on the format of the syntax, let me show you a couple of examples.

WITH daily_revenue AS (
 SELECT
  DATE(payment_ts) AS date,
  SUM(amount) AS revenue
 FROM payment
 WHERE DATE(payment_ts) >= '2020-05-24'
 AND DATE(payment_ts) <= '2020-05-31'
 GROUP BY DATE(payment_ts)
)
SELECT
 date,
 revenue,
 LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_sales,
 revenue * 1.0 / LAG(revenue, 1) OVER (ORDER BY date) AS dod
FROM daily_revenue
ORDER BY date;
  1. In the first step, we created daily movie rental revenue with a CTE (common table expression).
  2. In the second step, we appended the previous day’s revenue to the current day using the LAG function.
  3. Notice that the first row of the last 2 columns is empty: there is no previous day, since May 24th is the first date available.
  4. We also specified the offset, which is 1, so we fetch the row immediately before the current one. If you change this number to 2, you compare the current day’s revenue to the day before the previous day.
  5. Finally, we divided the current day’s revenue by the previous day’s, giving us the daily revenue growth.
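The month-over-month comparison mentioned at the start of this section follows the same shape. Here is a sketch against the same payment table, simply aggregating by month instead of by day (it assumes a PostgreSQL-style DATE_TRUNC function and a made-up mom alias):

-- Month-over-month revenue growth using LAG
WITH monthly_revenue AS (
 SELECT
  DATE_TRUNC('month', payment_ts) AS month,
  SUM(amount) AS revenue
 FROM payment
 GROUP BY DATE_TRUNC('month', payment_ts)
)
SELECT
 month,
 revenue,
 LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
 revenue * 1.0 / LAG(revenue, 1) OVER (ORDER BY month) AS mom
FROM monthly_revenue
ORDER BY month;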
WITH daily_revenue AS (
 SELECT
  DATE(payment_ts) AS date,
  SUM(amount) AS revenue
 FROM payment
 WHERE DATE(payment_ts) >= '2020-05-24'
 AND DATE(payment_ts) <= '2020-05-31'
 GROUP BY DATE(payment_ts)
)
SELECT
 date,
 revenue,
 LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_sales,
 LEAD(revenue, 1) OVER (ORDER BY date) * 1.0 / revenue AS dod
FROM daily_revenue
ORDER BY date;

Let’s take a look at another example. It’s very similar to the previous one, but instead of appending the previous day’s revenue, we used the LEAD function with an offset of 1 to get the next day’s movie rental revenue.

We then divided the next day’s revenue by the current day’s revenue to get the day over day growth.
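If you prefer an explicit percentage change rather than a ratio, a small variation of the same query works (same payment table and date range as above, with pct_change as a made-up alias):

-- Day-over-day growth expressed as a percentage change
WITH daily_revenue AS (
 SELECT
  DATE(payment_ts) AS date,
  SUM(amount) AS revenue
 FROM payment
 WHERE DATE(payment_ts) >= '2020-05-24'
 AND DATE(payment_ts) <= '2020-05-31'
 GROUP BY DATE(payment_ts)
)
SELECT
 date,
 revenue,
 LEAD(revenue, 1) OVER (ORDER BY date) AS next_day_sales,
 (LEAD(revenue, 1) OVER (ORDER BY date) - revenue) * 100.0 / revenue AS pct_change
FROM daily_revenue
ORDER BY date;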


For this lecture, you can try the following 2 exercises to help you get familiar with the syntax.

Time to complete: ~45 mins to 1 hour.


Summary

Great job! If you have followed along through all the examples, you have now seen most of the common WINDOW functions and patterns. Congratulations!

WINDOW functions are a family of SQL utilities that come up quite often during data scientist job interviews, and writing a bug-free WINDOW function query can be quite challenging. It takes time and practice to truly master these functions, and that’s why I created sqlpad.io, where you can practice a list of 80 SQL interview questions. Please feel free to give it a try.

If you want more help on WINDOW functions, feel free to check out my cracking the SQL interview for data scientist course, and if you want some extra help for your job search, you can also hire me as your job search mentor.

