Practical SQL for Data Analysis
source link: https://hakibenita.com/sql-for-data-analysis
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Practical SQL for Data Analysis
What you can do without Pandas
Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job.
SQL databases has been around since the 1970s. Some of the smartest people in the world worked on making it easy to slice, dice, fetch and manipulate data quickly and efficiently. SQL databases have come such a long way, that many developers and data scientists lost track of what they can do with the database they already have!
In this article I demonstrate how to use SQL to perform fast and efficient data analysis.
Victoria Holmes" />Illustration by Victoria HolmesTable of Contents
SQL vs Pandas Performance
Imagine a simple table with 1M users, each with a username and an indication if the user was activated or not. A simple data analysis task would be to answer how many activated and inactivated users are there?
⚙ Benchmark setupLet's start with a naive approach using pandas:
import psycopg2 import pandas as pd connection = psycopg2.connect(dbname='db') with connection.cursor() as cursor: cursor.execute('SELECT * FROM users') df = pd.DataFrame( cursor.fetchall(), columns=['id', 'username', 'activated'], ) result = df.groupby(by='activated').count() print(result)
The script uses psycopg2 to create a connection to the database. It then fetches data from the users table into a pandas dataframe, and calls groupby
to get the counts for activated and inactivated users.
This script produces the following output:
id username activated False 900029 900029 True 99971 99971
We got an answer to our question, but at what cost?
Pandas True Cost
Let's execute this function again, but this time look at the memory usage:
(venv) $ python test_pandas_naive.py id username activated False 900029 900029 True 99971 99971 Filename: test_pandas_naive.py Line # Mem usage Increment Occurences Line Contents ============================================================ 3 38.8 MiB 38.8 MiB 1 @profile 4 def run(): 5 41.2 MiB 2.3 MiB 1 import psycopg2 6 78.2 MiB 37.1 MiB 1 import pandas as pd 7 8 78.6 MiB 0.4 MiB 1 connection = psycopg2.connect(dbname='db') 9 78.6 MiB 0.0 MiB 1 with connection.cursor() as cursor: 10 179.9 MiB 101.3 MiB 1 cursor.execute('SELECT * FROM users') 11 386.0 MiB 12.7 MiB 2 df = pd.DataFrame( 12 373.4 MiB 193.5 MiB 1 cursor.fetchall(), 13 373.4 MiB 0.0 MiB 1 columns=['id', 'username', 'activated'], 14 ) 15 16 386.0 MiB 0.0 MiB 1 result = df.groupby(by='activated').count() 17 386.0 MiB 0.0 MiB 1 print(result)
To view the memory usage of the program we use the package memory-profiler. We used this technique in the past to find the fastest way to load data into PostgreSQL using Python.
The output shows the overall memory usage for each row in the program, as well as the additional memory added by each row in the column "Increment". The output for this program reveals some interesting finds:
-
Pandas alone consumes ~37M of memory: Just importing pandas, before even doing anything with it, consumes a significant amount of memory. For comparison, importing psycopg2 only adds 2.3MB of memory to the program.
-
Fetching the data into memory consumed an additional ~300MB: When we fetched the data into memory, and then into a pandas dataframe, the program occupied an additional 300MB. For reference, the size of the table in the database is only 65MB.
If we ignore the 38MB consumed by the profiler itself, the program consumed 347MB of memory, and executing this script without the profiler took 1.101s to complete.
Removing Unnecessary Data
Our quick analysis showed that fetching the data consumed the most memory. To optimize that, we can try to fetch less data. For example, we don't really use the username column, so maybe we can not fetch it from the database:
(venv) $ python test_pandas.py id activated False 900029 True 99971 Filename: test_pandas.py # Mem usage Increment Occurences Line Contents ======================================================== 3 38.7 MiB 38.7 MiB 1 @profile 4 def run(): 5 41.0 MiB 2.3 MiB 1 import psycopg2 6 78.3 MiB 37.2 MiB 1 import pandas as pd 7 8 78.6 MiB 0.3 MiB 1 connection = psycopg2.connect(dbname='db') 9 78.6 MiB 0.0 MiB 1 with connection.cursor() as cursor: 10 132.1 MiB 53.6 MiB 1 cursor.execute('SELECT id, activated FROM users') 11 142.1 MiB -90.7 MiB 2 df = pd.DataFrame( 12 232.8 MiB 100.7 MiB 1 cursor.fetchall(), 13 232.8 MiB 0.0 MiB 1 columns=['id', 'activated'], 14 ) 15 16 142.1 MiB 0.0 MiB 1 result = df.groupby(by='activated').count() 17 142.1 MiB 0.0 MiB 1 print(result)
By explicitly providing a list of columns to the query and fetching only what we need, the program now consumes only 232MB, or 193MB without the overhead of the profiler. This is an improvement from the previous attempt which consumed 347MB of memory.
Executing the script without the profiler took 0.839s compared to the previous program which took 1.1s.
Aggregating in the Database
The most memory in the program is still the data being fetched into memory. What if instead of first fetching the data and aggregating using pandas, we would aggregate the data in the database, and create a pandas dataframe from the results:
$ python test_db.py activated cnt 0 False 900029 1 True 99971 Filename: test_db.py # Mem usage Increment Occurences Line Contents ======================================================== 3 38.6 MiB 38.6 MiB 1 @profile 4 def run(): 5 41.0 MiB 2.3 MiB 1 import psycopg2 6 78.0 MiB 37.0 MiB 1 import pandas as pd 7 8 78.3 MiB 0.4 MiB 1 connection = psycopg2.connect(dbname='db') 9 78.3 MiB 0.0 MiB 1 with connection.cursor() as cursor: 10 78.3 MiB 0.0 MiB 1 cursor.execute(''' 11 SELECT activated, count(*) AS cnt 12 FROM users 13 GROUP BY activated 14 ''') 15 78.3 MiB 0.0 MiB 2 result = pd.DataFrame( 16 78.3 MiB 0.0 MiB 1 cursor.fetchall(), 17 78.3 MiB 0.0 MiB 1 columns=['activated', 'cnt'], 18 ) 19 79.3 MiB 1.0 MiB 1 print(result)
This is a big leap compared to the previous attempt. Doing the processing in the database and fetching aggregated results consumed only 79MB of memory, or 40MB if we remove the overhead of the profiler. This is a big improvement!
Executing the script without the profiler took 0.380s, which is twice as fast as the previous program which took 0.839s.
Removing Pandas
At this point the only significant memory hog is pandas itself. Just for fun and reference, let's see what the program consumes without pandas:
$ python test_db_plain.py ([(False, 900029), (True, 99971)],) Filename: test_db_plain.py # Mem usage Increment Occurences Line Contents ======================================================== 3 38.9 MiB 38.9 MiB 1 @profile 4 def run(): 5 41.2 MiB 2.4 MiB 1 import psycopg2 6 7 41.7 MiB 0.5 MiB 1 connection = psycopg2.connect(dbname='db') 8 41.7 MiB 0.0 MiB 1 with connection.cursor() as cursor: 9 41.7 MiB 0.0 MiB 1 cursor.execute(''' 10 SELECT activated, count(*) AS cnt 11 FROM users 12 GROUP BY activated 13 ''') 14 41.7 MiB 0.0 MiB 1 result = cursor.fetchall(), 15 16 41.7 MiB 0.0 MiB 1 print(result)
After removing pandas and keeping the results as a python list of tuples, the program consumes 41MB of memory, or just 2.8MB if we ignore the profiler overhead. This is a huge difference from where we started!
The timing is also much lower. Without the profile this program completes in just 0.114s. That's 70% less than the previous attempt using pandas, and overall 90% faster than the first program.
Results Summary
This is the a summary of the results:
Program Peak Memory % Memory Diff Runtime % Runtime Diff Pandas with entire table 347 MB
1.101s
Pandas with only necessary data 193 MB -44% 0.839s -23% Pandas with aggregation in database 40 MB -80% 0.380s -54% No Pandas, aggregation in database 2.3 MB -94% 0.114s -70%
This benchmark does not mention the memory consumed by the database itself - this is intentional. Databases usually consume a configurable amount of memory, and than manage allocations between different buffers and system components internally. Over the years, databases have gotten pretty good at managing their memory so you won't have to. Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!
Pandas and SQL: Better Together!
Programs that consume a lot of memory are a huge pain. Developers need powerful development environments, iterations are slower and the entire process takes more time. From an infrastructure perspective, resources cost money, and the more you scale the more you have to pay. The costs pile up pretty quickly.
All of this is not to say that Pandas is unnecessary, or that it can be replaced. Pandas provide great benefits and it has proven itself as being incredibly valuable. The same thing can be said for databases.
To take advantage of both worlds and create lightweight programs that are also fast, use SQL and Pandas together!
I'm focusing on Pandas and Numpy because they are the most popular, but the concepts described in the article apply to other tools and languages such as R, Julia, Matlab, SASS and so on. To make the argument even more compelling, I include interactive Hex Notebooks you can experiment with on your own.
Basics
The SQL query language was invented more than 40 years ago, and it is the most popular language for querying relational data. SQL is defined in an ANSI standard but there are still subtle differences between popular database engines such as PostgreSQL, MySQL, Oracle, SQL Server and others.
These are the common clauses of an SQL query:
SELECT <expressions> FROM <tables> JOIN <to other table> ON <join condition> WHERE <predicates> GROUP BY <expressions> HAVING <predicate> ORDER BY <expressions> LIMIT <number of rows>
In PostgreSQL only the SELECT
clause is really mandatory, so you can mix and match to do what you want.
Common Table Expressions
It's sometimes useful to split a large query into smaller steps. Using SQL, you can define a common table expression or "CTE" in short, with the WITH
clause:
WITH emails AS ( SELECT '[email protected]' AS email ) SELECT * FROM emails; email ─────────────────── [email protected]
You can have multiple CTE's in a single query, and they can even depend on each other:
WITH emails AS ( SELECT '[email protected]' AS email ), normalized_emails AS ( SELECT lower(email) AS email FROM emails ) SELECT * FROM normalized_emails; email ─────────────────── [email protected]
Common table expressions are a great way to split a big query into smaller chunks, perform recursive queries and even to cache intermediate results!
Generating Data
Generating data is very handy. Sometimes you need to generate data for practice, sometime you need to generate a time series or a small table to join to. There are several ways to generate data in SQL:
UNION ALL
WITH dt AS ( SELECT 1 AS id, 'haki' AS name UNION ALL SELECT 2, 'benita' ) SELECT * FROM dt; id │ name ────┼──────── 1 │ haki 2 │ benita
Using UNION ALL
you can combine, or concatenate, the results of multiple queries.
Concatenating query results is very common, but it can be a bit tedious for generating data.
VALUES LIST
WITH dt AS ( SELECT * FROM ( VALUES (1, 'haki'), (2, 'benita') ) AS t(id, name) ) SELECT * from dt;
Using the VALUES
keyword you can provide a list of rows, and then define names and types using a "table alias list" t(..)
. The t
can be any name. Using a VALUES
list is very useful when you need to generate small sets of data, or as the documentation calls it, "constants table".
UNNEST
To generate small sets of one dimensional data, you can unnest
a PostgreSQL array:
WITH dt AS ( SELECT unnest(array[1, 2]) AS n ) SELECT * FROM dt; n ─── 1 2
This is more restricting than VALUES
as it can only produce a one dimensional table of the same datatype, but we are going to use it later.
GENERATE_SERIES
To generate large amounts of data, PostgreSQL provides a table function called generate_series
:
WITH dt AS ( SELECT * FROM generate_series(0, 5) AS t(n) ) SELECT * FROM dt; n ── 0 1 2 3 4 5
The function generate_series
accepts three arguments: start, stop and step. In the example above we did not specify a step, so the default 1
was used. We can provide a different step to generate a different series:
WITH dt AS ( SELECT * FROM generate_series( 0, -- start 10, -- stop 2 -- step ) AS t(n) ) SELECT * FROM dt; n ──── 0 2 4 6 8 10
To generate a list of even numbers, we set the step to 2.
The function generate_series
is not restricted just to integers, it can be used for other types as well. One common examples is generating date ranges:
WITH daterange AS ( SELECT * FROM generate_series( '2021-01-01 UTC'::timestamptz, -- start '2021-01-02 UTC'::timestamptz, -- stop interval '1 hour' -- step ) AS t(hh) ) SELECT * FROM daterange; hh ──────────────────────── 2021-01-01 00:00:00+00 2021-01-01 01:00:00+00 2021-01-01 02:00:00+00 ... 2021-01-01 22:00:00+00 2021-01-01 23:00:00+00 2021-01-02 00:00:00+00
To generate a 24 hour range we provided generate_series
with a start and end data, and set the step to a 1 hour interval.
GENERATE_SERIES with row numbers
As mentioned above, generate_series
is a "table function". There is a little trick with table functions to include row numbers in the result:
WITH daterange AS ( SELECT * FROM generate_series( '2021-01-01'::timestamptz, -- start '2021-01-02'::timestamptz, -- stop interval '1 hour' -- step ) WITH ORDINALITY AS t(hh, n) ) SELECT * FROM daterange; hh │ n ────────────────────────┼──── 2021-01-01 00:00:00+00 │ 1 2021-01-01 01:00:00+00 │ 2 2021-01-01 02:00:00+00 │ 3 ... 2021-01-01 22:00:00+00 │ 23 2021-01-01 23:00:00+00 │ 24 2021-01-02 00:00:00+00 │ 25
Using WITH ORDINALITY
, the results now include another column with the row number.
Random
To generate random numbers PostgreSQL provides a random
function that returns a value between 0 and 1:
SELECT random(); 0.5917508391168769
To generate values at different ranges you can random
in an expression:
-- Random float between 0 and 100 SELECT random() * 100; 59.17508391168769 -- Random integer between 1 and 100 SELECT ceil(random() * 100); 59 -- Random integer between 11 and 100 SELECT 10 + ceil(random() * 90); 59
It's a common mistake to use round
instead of ceil
or floor
to generate a range of integers. Using round
may produce inconsistent distribution. Consider the following query to generate random integers in the range 0 - 4 using round
instead ceil
:
SELECT round(random() * 3) AS n, count(*) FROM generate_series(0, 1000) GROUP BY 1; n │ count ──┼─────── 0 │ 150 1 │ 328 2 │ 341 3 │ 182
Notice how the values 0 and 3 are coming up less than 1 and 2. Using round, random values less than 0.5 will be rounded down to 0, and random numbers greater than 2.5 will be rounded up to 3, while for example, random values between 0.5 and 1.5 will be rounded to 1. This makes the edges less likely to come up.
01230123
Random distribution using roundThis problem can be solved by either rounding up or down. Consider the same query using ceil
:
SELECT ceil(random() * 3) AS n, count(*) FROM generate_series(0, 1000) GROUP BY 1; n │ count ──┼─────── 1 │ 328 2 │ 339 3 │ 334
Using ceil
produces more evenly distributed random numbers.
0123123
Random distribution using ceilRandom Choice
You can use the random
function to pick a random value from a list of values:
SELECT (array['red', 'green', 'blue'])[ceil(random() * 3)] AS color FROM generate_series(1, 5); color ─────── green green blue green blue
The expression defines an array of colors, and then uses random
to get a random element from the array. Notice that in PostgreSQL, arrays start at 1:
-- In PostgreSQL arrays start at 1 SELECT (array['red', 'green', 'blue'])[1]; array ─────── red
Sampling
Sampling a random portion of a table is a very common when training a model. A simple way to fetch a random portion of a table is combining random
with LIMIT
:
db=# WITH sample AS ( SELECT * FROM users ORDER BY random() LIMIT 10000 ) SELECT count(*) FROM sample; count ─────── 10000 (1 row) Time: 205.643 ms
To sample 10K random rows from the table you first sort in a random order, and then take the first 10K rows.
Using random
to sample data is great, but for very large datasets it can be inefficient. PostgreSQL provides other methods of sampling a proportion of a table, which are more suited for large tables.
PostgreSQL provides two sampling methods, SYSTEM
and BERNOULLI
. To sample a table, use the TABLESAMPLE
keyword in the FROM
clause, and provide the sampling method along with it's arguments. For example, sampling 10% of the table using the SYSTEM
sampling method:
db=# WITH sample AS ( SELECT * FROM users TABLESAMPLE SYSTEM(10) ) SELECT count(*) FROM sample; count ─────── 95400 (1 row) Time: 13.690 ms
The SYSTEM
sampling method works by sampling blocks rather than rows, which makes it very fast. The table we sampled contains 1M rows, and the sample returned slightly less than 10K rows. For large datasets it's not uncommon to compromise accuracy for performance.
Another sampling method provided by PostgreSQL is BERNOULLI
:
db=# WITH sample AS ( SELECT * FROM users TABLESAMPLE BERNOULLI(10) ) SELECT count(*) FROM sample; count ──────── 100364 (1 row) Time: 54.593 ms
Unlike the SYSTEM
sampling method, BERNOULLI
works at the row level which makes it a bit slower, but the results are better distributed.
These are the timings for sampling 10% of table with 1M rows using different sampling methods:
Sampling Method Timingrandom()
205ms
BERNOULLI
54ms
SYSTEM
13ms
If you need to sample from a large table consider using TABLESAMPLE
.
Example: Train / Test Split with SQL
A common task when analyzing data is to split a dataset for training and testing. The training dataset is used to train the model, and the test dataset is used to evaluate the model.
To put what you've seen so far to practice, generate a transactions table with some random data:
CREATE TABLE transaction AS SELECT id, '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at, round(10 + 90 * random()) as charged_amount, random() > 0.6 as reported_as_fraud FROM generate_series(1, 10) AS id ORDER BY 1;
The transaction table include the date and amount of the transaction, and an indication whether the transaction was reported as fraudulent.
Before we move on, let's break it down:
SELECT '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at;
Add a random number of days between 0 and 365 to January 1st, 2021 to produce a random date in that year.
SELECT round(10 + 90 * random()) as charged_amount;
Produce a random round charged amount between 10 and 100.
SELECT random() > 0.6 as reported_as_fraud;
Produce the parameter we want to estimate. In our fake data, we want to have 40% fraudulent transactions. Using an expression we produce a boolean value which will evaluate to true ~40% of the times.
This is what the data looks like:
db=# SELECT * FROM transaction; id │ billed_at │ charged_amount │ reported_as_fraud ────┼─────────────────────┼────────────────┼─────────────────── 1 │ 2021-05-22 00:00:00 │ 54 │ t 2 │ 2021-05-31 00:00:00 │ 63 │ f 3 │ 2021-11-11 00:00:00 │ 26 │ t 4 │ 2021-07-04 00:00:00 │ 64 │ t 5 │ 2021-02-27 00:00:00 │ 90 │ t 6 │ 2021-05-21 00:00:00 │ 20 │ t 7 │ 2021-07-29 00:00:00 │ 69 │ t 8 │ 2021-02-24 00:00:00 │ 20 │ f 9 │ 2021-05-07 00:00:00 │ 36 │ f 10 │ 2021-05-05 00:00:00 │ 38 │ f
To test a model which classifies transactions as fraudulent, we want to split the table into a training and test datasets. One way to do that is adding a column, but we are going to create two separate tables instead.
To create a table similar to an existing table in PostgreSQL, you can use the following commands:
CREATE TABLE transaction_training AS TABLE transaction WITH NO DATA; CREATE TABLE transaction_test AS TABLE transaction WITH NO DATA;
This is a really handy syntax! We simply tell PostgreSQL to create a table similar to another table, but with no data.
Next, we want to split the data in the transaction
table between transaction_training
and transaction_test
. We want our training set to include 80% of the rows, in this case 8 rows:
WITH training_transaction_ids AS ( INSERT INTO transaction_training SELECT * FROM transaction ORDER BY random() LIMIT 8 RETURNING id ) INSERT INTO transaction_test SELECT * FROM transaction WHERE id NOT IN (SELECT id FROM training_transaction_ids);
To populate data for training we select from the transaction
table, shuffle the rows using ORDER BY random()
and then insert into transaction_training
just the first 8 rows.
To insert only the remaining rows into the test table, we keep the ids of the training rows by specifying RETURNING id
in a common table expression (the WITH
clause). We then insert rows into transaction_test
and exclude rows in training_transaction_ids
. For more on this technique check out how to implement complete processes using WITH
and RETURNING
.
This is the result:
db=# SELECT * FROM transaction_training; id │ billed_at │ charged_amount │ reported_as_fraud ────┼─────────────────────┼────────────────┼─────────────────── 6 │ 2021-05-21 00:00:00 │ 20 │ t 4 │ 2021-07-04 00:00:00 │ 64 │ t 5 │ 2021-02-27 00:00:00 │ 90 │ t 2 │ 2021-05-31 00:00:00 │ 63 │ f 10 │ 2021-05-05 00:00:00 │ 38 │ f 3 │ 2021-11-11 00:00:00 │ 26 │ t 9 │ 2021-05-07 00:00:00 │ 36 │ f 7 │ 2021-07-29 00:00:00 │ 69 │ t (8 rows) db=# SELECT * FROM transaction_test; id │ billed_at │ charged_amount │ reported_as_fraud ────┼─────────────────────┼────────────────┼─────────────────── 1 │ 2021-05-22 00:00:00 │ 54 │ t 8 │ 2021-02-24 00:00:00 │ 20 │ f (2 rows)
And there you have it, a training dataset and a test dataset with SQL, directly in the database!
Descriptive Statistics
When you get a fresh data set, the first thing you usually want to do is get familiar with it. Some people call this "Exploratory data analysis", or EDA for short. Pandas, as well as other languages and tools, provide some utility functions to produce descriptive statistics.
Describing a Series
Describing a numeric series using pandas:
>>> import pandas as pd >>> s = pd.Series([1, 2, 3]) >>> s.describe() count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0 dtype: float64
To generate descriptive statistics in SQL, you can use the following query:
WITH s AS ( SELECT * FROM (VALUES (1), (2), (3)) AS t(n) ) SELECT count(*), avg(n), stddev(n), min(n), percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n), max(n) FROM s; count │ avg │ stddev │ min │ percentile_cont │ max ──────┼────────┼───────────┼─────┼─────────────────┼───── 3 │ 2.0000 │ 1.0000000 │ 1 │ {1.5,2,2.5} │ 3
Basic aggregate functions in SQL produced a similar output to that of Pandas. The interesting part here is function percentile_cont
.
The function percentile_cont
is an Ordered-Set Aggregate Function, meaning, it operates with respect to some order. To illustrate, in the query above you can replace both min
and max
with percentile_cont
:
WITH s AS ( SELECT * FROM (VALUES (1), (2), (3)) AS t(n) ) SELECT percentile_cont(array[ 0, -- <--- min 0.25, 0.5, 0.75, 1 -- <--- max ]) WITHIN GROUP (ORDER BY n), FROM s; percentile_cont ───────────────── {1,1.5,2,2.5,3}
Another common use for percentile_cont
is to find the median of a sequence of numbers:
WITH s AS (SELECT * FROM generate_series(1, 10) AS t(n)) SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY n), percentile_cont(0.5) WITHIN GROUP (ORDER BY n) FROM s; percentile_disc │ percentile_cont ─────────────────┼───────────────── 5 │ 5.5
The query demonstrates two types of medians:
percentile_disc
returns the value that 50% of the table is less than. Notice that5
is present in the table.percentile_cont
returns a value that 50% of the values are less than. The value 5.5 is not present in the table, it's the value between 5 and 6 which divides the values in the table in the middle (0.5).
Both functions can accept an array of values, in which case they will return a corresponding list of results.
Describing a Categorical Series
Previously we described a list of numbers. This time we want to describe a list of categorical values. For example, pandas will produce the following output:
>>> s = pd.Series(['a', 'a', 'b', 'c']) >> s.describe() count 4 unique 3 top a freq 2 dtype: object
Using SQL we can produce similar output:
WITH s AS (SELECT unnest(array['a', 'a', 'b', 'c']) AS v) SELECT count(*), count(DISTINCT V) AS unique, mode() WITHIN GROUP (ORDER BY V) AS top FROM s; count │ unique │ top ───────┼────────┼───── 4 │ 3 │ a
To calculate the number of unique values you used COUNT(DISTINCT ...)
. To get the value that appears most often in the series, i.e. the one with the highest frequency, you used another ordered set function called mode
.
Subtotals
Another useful technique to analyze data is producing sub totals. We already saw how to apply aggregate functions on the table, but how about multiple aggregation levels in the same query?
Let's imagine a table of employees. For each employee we keep the name, the role and the department they work at:
WITH emp AS ( SELECT * FROM (VALUES ('Haki', 'R&D', 'Manager'), ('Dan', 'R&D', 'Developer'), ('Jax', 'R&D', 'Developer'), ('George', 'Sales', 'Manager'), ('Bill', 'Sales', 'Developer'), ('David', 'Sales', 'Developer') ) AS t( name, department, role ) ) SELECT * FROM emp; name │ department │ role ────────┼────────────┼─────────── Haki │ R&D │ Manager Dan │ R&D │ Developer Jax │ R&D │ Developer George │ Sales │ Manager Bill │ Sales │ Developer David │ Sales │ Developer
To find the number of employees with each role in each departments, you can use GROUP BY
:
WITH emp AS ( /* ... */ ) SELECT department, role, COUNT(*) FROM emp GROUP BY department, role; department │ role │ count ────────────┼───────────┼─────── R&D │ Developer │ 2 R&D │ Manager │ 1 Sales │ Manager │ 1 Sales │ Developer │ 2
Rollup
What if we want to also get the number of employees in each department, in all roles:
WITH emp AS ( /* ... */ ) SELECT department, role, COUNT(*) FROM emp GROUP BY ROLLUP(department), role; department │ role │ count ────────────┼───────────┼─────── R&D │ Developer │ 2 R&D │ Manager │ 1 Sales │ Manager │ 1 Sales │ Developer │ 2 R&D │ ¤ │ 3 -- <-- Total for R&D Sales │ ¤ │ 3 -- <-- Total for Sales
Notice the use of the subclause ROLLUP
in the GROUP BY
clause.
To add a subtotal for each department, we tell the database to "rollup" by the department field. The database then added two additional aggregate results, one for each department.
The database can actually produce sub totals in several levels. For example, to add the grand total of the number of employees in all departments, we can tell the database to "rollup" the role field as well:
WITH emp AS ( /* ... */ ) SELECT department, role, COUNT(*) FROM emp GROUP BY ROLLUP(department, role); department │ role │ count ────────────┼───────────┼─────── ¤ │ ¤ │ 6 -- <-- Grand total R&D │ Developer │ 2 R&D │ Manager │ 1 Sales │ Manager │ 1 Sales │ Developer │ 2 R&D │ ¤ │ 3 -- <-- Total for R&D Sales │ ¤ │ 3 -- <-- Total for Sales
The query now includes several subtotals. To identify the aggregate level for each row, use the function GROUPING
:
WITH emp AS ( /* ... */ ) SELECT department, role, COUNT(*), GROUPING(department) AS department_subtotal, GROUPING(department, role) AS grand_total FROM emp GROUP BY ROLLUP(department), role; department │ role │ count │ department_subtotal │ grand_total ────────────┼───────────┼───────┼─────────────────────┼───────────── Sales │ Developer │ 2 │ 0 │ 0 Sales │ Manager │ 1 │ 0 │ 0 R&D │ Developer │ 2 │ 0 │ 0 R&D │ Manager │ 1 │ 0 │ 0 ¤ │ Manager │ 2 │ 1 │ 2 ¤ │ Developer │ 4 │ 1 │ 2
Cube
When talking about subtotals, or aggregates at multiple levels, OLAP usually comes to mind. OLAP cube is a technique where all the subtotals are pre-calculated to make retrieval faster. Using ROLLUP
we can achieve this by providing all possible combinations, but there is an easier way to do that:
WITH emp AS ( /* ... */ ) SELECT department, role, COUNT(*) FROM emp GROUP BY CUBE(department, role); department │ role │ count ────────────┼───────────┼─────── ¤ │ ¤ │ 6 -- <-- Grand Total R&D │ Developer │ 2 R&D │ Manager │ 1 Sales │ Manager │ 1 Sales │ Developer │ 2 R&D │ ¤ │ 3 -- <-- Subtotal for R&D department Sales │ ¤ │ 3 -- <-- Subtotal for Sales department ¤ │ Manager │ 2 -- <-- Subtotal for Manager role ¤ │ Developer │ 4 -- <-- Subtotal for Developer role
CUBE
generates subtotals for all possible combinations. In the examples above, using CUBE
added an additional subtotal for each department.
Grouping Sets
Both CUBE
and ROLLUP
are syntactic sugar of GROUPING SETS
:
WITH emp AS ( /* ... */ ) SELECT department, role, COUNT(*) FROM emp GROUP BY GROUPING SETS ( (), -- <-- Grand total (role), -- <-- Subtotal by role (department), -- <-- Subtotal by department (role, department) -- <-- No aggregation, the row itself ); department │ role │ count ────────────┼───────────┼─────── ¤ │ ¤ │ 6 Sales │ Developer │ 2 Sales │ Manager │ 1 R&D │ Developer │ 2 R&D │ Manager │ 1 ¤ │ Manager │ 2 ¤ │ Developer │ 4 R&D │ ¤ │ 3 Sales │ ¤ │ 3
Using GROUPING SETS
you can tell the database exactly which subtotals to generate. The query above is generating all possible combinations, so it's equivalent to CUBE
.
Pivot Tables
Pivot tables are a technique to reshape data, and pandas includes a very powerful pivot_table
function:
>>> import pandas as pd >>> df = pd.DataFrame({ ... 'name': ['Haki', 'Dan', 'Jax', 'George', 'Bill', 'David'], ... 'department': ['R&D', 'R&D', 'R&D', 'Sales', 'Sales', 'Sales',], ... 'role': ['Manager', 'Developer', 'Developer', 'Manager', 'Developer', 'Developer'], ... }) >>> pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count') department R&D Sales role Developer 2 2 Manager 1 1
Conditional Expressions
To recreate the "pivot" above in SQL, do the following:
WITH emp AS ( SELECT * FROM (VALUES ('Haki', 'R&D', 'Manager'), ('Dan', 'R&D', 'Developer'), ('Jax', 'R&D', 'Developer'), ('George', 'Sales', 'Manager'), ('Bill', 'Sales', 'Developer'), ('David', 'Sales', 'Developer') ) AS t( name, department, role ) ) SELECT role, SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D", SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales" FROM emp GROUP BY role; role │ R&D │ Sales ───────────┼─────┼─────── Manager │ 1 │ 1 Developer │ 2 │ 2
Using CASE
, you constructed a "Conditional Expression" that returns the value 1 to sum
only for a specific department. By adding a conditional expression for every department, you "reshaped" the data to a pivot table.
Pivot Table
Aggregate Expressions
Using CASE
is flexible but it's a bit tedious. Applying conditions on aggregates is so useful that SQL added special syntax for it:
WITH emp AS ( /* ... */ ) SELECT role, COUNT(*) FILTER (WHERE department = 'R&D') as "R&D", COUNT(*) FILTER (WHERE department = 'Sales') as "Sales" FROM emp GROUP BY role;
This way of reshaping data into "pivot tables" is very common for visualizing and analyzing data. There is no doubt Pandas is much more flexible and comfortable when it comes to pivot tables. However, the process of turning rows into columns is also very common in ETL processes, where data is denormalized.
The bottom line is that Pandas may be better to quickly analyze and visualize small sets of data, but ETL processes may benefit from doing this process in the database using conditional or aggregate expressions.
Running and Cumulative Aggregation
Aggregations over a sliding window are very common, usually on a time series. For example, traders use moving averages as an indication of a stock's trend, running sums can be used to backtest an anomaly detection strategy and so on.
To illustrate, take this table listing daily temperatures:
WITH temperatures AS ( SELECT * FROM (VALUES ('2021-01-01'::date, 10), ('2021-01-02'::date, 12), ('2021-01-03'::date, 13), ('2021-01-04'::date, 14), ('2021-01-05'::date, 18), ('2021-01-06'::date, 15), ('2021-01-07'::date, 16), ('2021-01-08'::date, 17) ) as t(t, c) ) SELECT * FROM temperatures; t │ c ────────────┼──── 2021-01-01 │ 10 2021-01-02 │ 12 2021-01-03 │ 13 2021-01-04 │ 14 2021-01-05 │ 18 2021-01-06 │ 15 2021-01-07 │ 16 2021-01-08 │ 17
Window Functions
Say you want to compare each day to the hottest day ever:
WITH temperatures AS ( /* ... */ ) SELECT *, MAX(c) OVER (PARTITION BY 1) AS hottest_temperature FROM temperatures; t │ c │ hottest_temperature ────────────┼────┼───────────────────── 2021-01-01 │ 10 │ 18 2021-01-02 │ 12 │ 18 2021-01-03 │ 13 │ 18 2021-01-04 │ 14 │ 18 2021-01-05 │ 18 │ 18 2021-01-06 │ 15 │ 18 2021-01-07 │ 16 │ 18 2021-01-08 │ 17 │ 18
By adding the OVER (PARTITION ...)
clause to the aggregate function MAX
, you turned it into a window function. Window functions operate on a set of rows determined by the PARTITION
clause. Since you used a constant value PARTITION BY 1
, the function operates on all the rows.
To complete the query, use the result of the window function in an expression:
WITH temperatures AS ( /* ... */ ) SELECT *, (c::float / MAX(c) OVER (PARTITION BY 1) - 1) * 100 AS compared_to_hottest_day FROM temperatures; t │ c │ compared_to_hottest_day ────────────┼────┼──────────────────────── 2021-01-01 │ 10 │ -44.44444444444444 2021-01-02 │ 12 │ -33.333333333333336 2021-01-03 │ 13 │ -27.77777777777778 2021-01-04 │ 14 │ -22.22222222222222 2021-01-05 │ 18 │ 0 2021-01-06 │ 15 │ -16.666666666666664 2021-01-07 │ 16 │ -11.111111111111116 2021-01-08 │ 17 │ -5.555555555555558
If you're are not sure what is the purpose of casting the temperature to float, make sure to read my tip about dividing integers in SQL.
Sliding Window
Comparing each day's temperature against the hottest temperature ever can be useful, but more often than not, you want to compare a value to a limited period, or in other words, a sliding window.
Sliding window
To find the highest temperature in the last three days for example, you can add a frame clause:
WITH temperatures AS ( /* ... */ ) SELECT *, MAX(c) OVER ( ORDER BY t ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS hottest_temperature_last_three_days FROM temperatures; t │ c │ hottest_temperature_last_three_days ────────────┼────┼───────────────────────────────────── 2021-01-01 │ 10 │ 10 2021-01-02 │ 12 │ 12 2021-01-03 │ 13 │ 13 2021-01-04 │ 14 │ 14 2021-01-05 │ 18 │ 18 2021-01-06 │ 15 │ 18 2021-01-07 │ 16 │ 18 2021-01-08 │ 17 │ 17
Once again you used a window function, but this time you added a frame clause to it, stating the window should include 2 previous rows and the current one.
Try it!
The frame syntax is very flexible, and it is not restricted to ROWS
. The query above can be expressed using a range frame as well:
WITH temperatures AS ( /* ... */ ) SELECT *, MAX(c) OVER ( ORDER BY t RANGE BETWEEN '2 days' PRECEDING AND '0 days' FOLLOWING ) AS hottest_temperature_last_three_days FROM temperatures; t │ c │ hottest_temperature_last_three_days ────────────┼────┼───────────────────────────────────── 2021-01-01 │ 10 │ 10 2021-01-02 │ 12 │ 12 2021-01-03 │ 13 │ 13 2021-01-04 │ 14 │ 14 2021-01-05 │ 18 │ 18 2021-01-06 │ 15 │ 18 2021-01-07 │ 16 │ 18 2021-01-08 │ 17 │ 17
Notice how nice the RANGE
syntax is... it reads like an actual sentence!
Linear Regression
Another common tool for analyzing data is linear regression.
Linear Regression
For example, performing linear regression using Pandas and Scipy:
>>> import pandas as pd >>> import scipy.stats >>> df = pd.DataFrame([[1.2, 1], [2, 1.8], [3.1, 2.9]]) >>> slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(df[0], df[1]) (1.0 -0.2000000000000004 1.0 9.003163161571059e-11 0.0)
Most developers probably don't expect the database to have statistical functions, but PostgreSQL does:
WITH t AS (SELECT * FROM (VALUES (1.2, 1.0), (2.0, 1.8), (3.1, 2.9) ) AS t(x, y)) SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept, sqrt(regr_r2(y, x)) AS r FROM t; slope │ intercept │ r ────────────────────┼──────────────────────┼─── 1.0000000000000002 │ -0.20000000000000048 │ 1
Using statistical aggregate functions in PostgreSQL we got results similar to scipy.
Interpolation
Data cleaning is an important part of any data job, and handling missing values is a big part of that.
Fillna, forward fill and backward fill
Fill with Constant
The simplest way to fill in missing data is with some constant value. Using Pandas for example, this is done using the fillna
function:
>>> import pandas as pd >>> import numpy as np >>> df = pd.DataFrame(['A', 'B', np.NaN, 'D', np.NaN, np.NaN, 'G']) >>> df.fillna('X') 0 0 A 1 B 2 X 3 D 4 X 5 X 6 G
In SQL, if your missing values are NULL, you can use a condition expression CASE
, or use the shorter COALESCE
function:
WITH tb AS ( SELECT * FROM (VALUES (1, 'A' ), (2, 'B' ), (3, null), (4, 'D' ), (5, null), (6, null), (7, 'G' ) ) AS t(n, v) ) SELECT n, coalesce(v, 'X') AS v FROM tb; n │ v ───┼─── 1 │ A 2 │ B 3 │ X 4 │ D 5 │ X 6 │ X 7 │ G
The function COALESCE
accepts any number of arguments and return the first one that is not NULL.
Back and Forward Fill
Filling values with constants is easy, but not always possible. Another common interpolation technique is filling empty values with previous or following non-missing values.
Pandas offers several variations on back and forward filling, for example:
>>> import pandas as pd >>> import numpy as np >>> df = pd.DataFrame(['A', 'B', np.NaN, 'D', np.NaN, np.NaN, 'G']) >>> df.fillna(method='ffill') # or df.ffill() >>> 0 0 A 1 B 2 B 3 D 4 D 5 D 6 G >>> df.fillna(method='bfill') # or df.bfill() or df.backfill() 0 0 A 1 B 2 D 3 D 4 G 5 G 6 G
To achieve the same using SQL, you can use a subquery:
WITH tb AS ( SELECT * FROM (VALUES (1, 'A' ), (2, 'B' ), (3, null), (4, 'D' ), (5, null), (6, null), (7, 'G' ) ) AS t(n, v) ) SELECT *, -- Find the next not null value coalesce(v, ( SELECT v FROM tb AS tb_ WHERE tb_.n < tb.n AND v IS NOT NULL ORDER BY n DESC LIMIT 1 )) AS ffill_v, -- Find the previous not null value coalesce(v, ( SELECT v FROM tb as tb_ WHERE tb_.n > tb.n AND v IS NOT NULL ORDER BY n ASC LIMIT 1 )) as bfill_v FROM tb; n │ v │ ffill_v │ bfill_v ───┼───┼─────────┼───────── 1 │ A │ A │ A 2 │ B │ B │ B 3 │ ¤ │ B │ D 4 │ D │ D │ D 5 │ ¤ │ D │ G 6 │ ¤ │ D │ G 7 │ G │ G │ G
The SQL version is a bit longer, but it is fairly expressive, and it gives great flexibility.
NOTE: It's tempting to use the window function LEAD
and LAG
here, but these function can only be used when filling single row gaps. Once you have more than one consecutive missing row, LEAD
and LAG
may leave you with missing values.
Linear Interpolation
Another common interpolation technique for discrete data is linear interpolation.
>>> import pandas as pd ... import numpy as np ... ... df = pd.DataFrame([ ... (np.datetime64('2021-01-01'), 10), ... (np.datetime64('2021-01-02'), 12), ... (np.datetime64('2021-01-03'), np.NaN), ... (np.datetime64('2021-01-04'), 14), ... (np.datetime64('2021-01-05'), np.NaN), ... (np.datetime64('2021-01-06'), np.NaN), ... (np.datetime64('2021-01-07'), 18), ... (np.datetime64('2021-01-08'), 15) ... ], columns=('t', 'c')) >>> # Assume data is evenly distributed >>> df['c'].interpolate('linear') 0 10.000000 1 12.000000 2 13.000000 3 14.000000 4 15.333333 5 16.666667 6 18.000000 7 15.000000
Linear interpolation works by filling missing values along a linear line between two known coordinates (x1, y1) and (x2, y2). In this case, The two known coordinates are the last and the next known date and temperatures.
You already found the next and previous known value for each row when you implemented back and forward fill:
WITH temperatures AS ( SELECT * FROM (VALUES ('2021-01-01'::date, 10), ('2021-01-02'::date, 12), ('2021-01-03'::date, null), ('2021-01-04'::date, 14), ('2021-01-05'::date, null), ('2021-01-06'::date, null), ('2021-01-07'::date, 18), ('2021-01-08'::date, 15) ) as t(t, c) ), temperatures_with_previous_values AS ( SELECT *, -- Last known temperature ( SELECT array[extract('epoch' FROM t), c] FROM temperatures as temperatures_ WHERE temperatures_.t < temperatures.t AND c IS NOT NULL ORDER BY temperatures_.t DESC LIMIT 1 ) AS last_known_temperature, -- Next known temperature ( SELECT array[extract('epoch' FROM t), c] FROM temperatures as temperatures_ WHERE temperatures_.t > temperatures.t AND c IS NOT NULL ORDER BY temperatures_.t ASC LIMIT 1 ) AS next_known_temperature FROM temperatures ) SELECT * FROM temperatures_with_previous_values; t │ c │ last_known_temperature │ next_known_temperature ────────────┼────┼────────────────────────┼──────────────────────── 2021-01-01 │ 10 │ ¤ │ {1609545600,12} 2021-01-02 │ 12 │ {1609459200,10} │ {1609718400,14} 2021-01-03 │ ¤ │ {1609545600,12} │ {1609718400,14} 2021-01-04 │ 14 │ {1609545600,12} │ {1609977600,18} 2021-01-05 │ ¤ │ {1609718400,14} │ {1609977600,18} 2021-01-06 │ ¤ │ {1609718400,14} │ {1609977600,18} 2021-01-07 │ 18 │ {1609718400,14} │ {1610064000,15} 2021-01-08 │ 15 │ {1609977600,18} │ ¤
There are two main differences here from what you've done before:
-
You converted the date to a number: This is called "epoch", the number of seconds since 1970. To convert the date you used the function
extract('epoch' FROM t)
. -
You keep two values from the previous and next row: To implement linear interpolation we need coordinates, which are both the date and the temperature. To return multiple values from the previous row, you constructed an array
array[extract('epoch' FROM t), c]
.
To calculate a missing value with two known coordinates using linear interpolation, use the following formula:
y = y0 + (x - x0) * ((y1 - y0) / (x1 - x0))
You already have all the data available, so just organize it a bit:
WITH temperatures AS ( /* ... */ ), temperatures_with_previous_values AS ( /* ... */ ), -- This step is just for convenience temperatures_prep AS ( SELECT t, c, extract('epoch' from t) as x, last_known_temperature[1] as x0, last_known_temperature[2] as y0, next_known_temperature[1] as x1, next_known_temperature[2] as y1 FROM temperatures_with_previous_values ) SELECT t, c, x, x0, y0, x1, y1 FROM temperatures_prep; t │ c │ x │ x0 │ y0 │ x1 │ y1 ────────────┼────┼────────────┼────────────┼────┼────────────┼──── 2021-01-01 │ 10 │ 1609459200 │ ¤ │ ¤ │ 1609545600 │ 12 2021-01-02 │ 12 │ 1609545600 │ 1609459200 │ 10 │ 1609718400 │ 14 2021-01-03 │ ¤ │ 1609632000 │ 1609545600 │ 12 │ 1609718400 │ 14 2021-01-04 │ 14 │ 1609718400 │ 1609545600 │ 12 │ 1609977600 │ 18 2021-01-05 │ ¤ │ 1609804800 │ 1609718400 │ 14 │ 1609977600 │ 18 2021-01-06 │ ¤ │ 1609891200 │ 1609718400 │ 14 │ 1609977600 │ 18 2021-01-07 │ 18 │ 1609977600 │ 1609718400 │ 14 │ 1610064000 │ 15 2021-01-08 │ 15 │ 1610064000 │ 1609977600 │ 18 │ ¤ │ ¤
Now that you have all the data neatly organized, you can use the formula to calculate missing values:
WITH temperatures AS ( /* ... */ ), temperatures_with_previous_values AS ( /* ... */ ), temperatures_prep AS ( /* ... */ ) SELECT t, c, CASE WHEN c IS NOT NULL THEN c ELSE y0 + (x - x0) * ((y1 - y0) / (x1 - x0)) END AS interpolated_c FROM temperatures_prep ; t │ c │ interpolated_c ────────────┼────┼──────────────────── 2021-01-01 │ 10 │ 10 2021-01-02 │ 12 │ 12 2021-01-03 │ ¤ │ 13 2021-01-04 │ 14 │ 14 2021-01-05 │ ¤ │ 15.333333333333334 2021-01-06 │ ¤ │ 16.666666666666668 2021-01-07 │ 18 │ 18 2021-01-08 │ 15 │ 15
And there it is, the missing temperatures were filled using linear interpolation.
The complete queryBinning
Binning, or "bucketing", is a technique to group values together.
Binning
Custom Binning
Custom binning is most common for categorical data or for discrete data when ranges are pre-determined.
Image you have a table with student grades, and you want to classify them to letter grades A-F:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ) SELECT CASE WHEN grade < 60 THEN 'F' WHEN grade < 70 THEN 'D' WHEN grade < 80 THEN 'C' WHEN grade < 90 THEN 'B' ELSE 'A' END AS letter_grade, COUNT(*) FROM grades GROUP BY letter_grade ORDER BY letter_grade; letter_grade │ count ──────────────┼─────── A │ 29 B │ 10 C │ 12 D │ 10 F │ 39
Custom binning can also use expressions to categorize data into custom groups. In the american grade system for example, the letter grade can also be calculated based on the percentile, and not the absolute grade:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ), percent_grades AS ( SELECT percent_rank() OVER (ORDER BY grade) as percent_grade FROM grades ) SELECT CASE WHEN percent_grade < 0.6 THEN 'F' WHEN percent_grade < 0.7 THEN 'D' WHEN percent_grade < 0.8 THEN 'C' WHEN percent_grade < 0.9 THEN 'B' ELSE 'A' END AS letter_grade, COUNT(*) FROM percent_grades GROUP BY letter_grade ORDER BY letter_grade;
To find the relative grade of every student based on the grades of all other students, you used the window function percent_rank
. The function returns a value between 0 and 1 that represents the rank of the current row relative to all other rows.
Custom binning is mostly useful for when the data is familiar, or within a known set of values. When exploring unknown or unbound sets of data there are other binning techniques you can use.
Equal Height Binning
Say you need to divide students to groups based on their grade, and you want every group to have roughly the same number of students. To achieve this, PostgreSQL provides a function called NTILE
:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ), grades_with_tiles AS ( SELECT *, ntile(10) OVER (ORDER BY grade) AS bucket FROM grades ) SELECT min(grade) AS from_grade, max(grade) AS to_grade, count(*) AS cnt, bucket FROM grades_with_tiles GROUP BY bucket ORDER BY from_grade; from_grade │ to_grade │ cnt │ bucket ────────────┼──────────┼─────┼──────── 40 │ 41 │ 10 │ 1 41 │ 45 │ 10 │ 2 47 │ 53 │ 10 │ 3 53 │ 61 │ 10 │ 4 61 │ 70 │ 10 │ 5 71 │ 79 │ 10 │ 6 79 │ 87 │ 10 │ 7 89 │ 95 │ 10 │ 8 95 │ 99 │ 10 │ 9 99 │ 100 │ 10 │ 10
Divding values into bins or buckets with roughly the same frequency is called "Equal Height Binning". Notice how each group holds exactly 10 rows.
Try it!
The function NTILE
is a window function. It accepts the number of buckets, in this case 10, and an order by clause in which to divide the range by. Window functions can't be used as a group by key, so you need to use either a subquery or a CTE to add the "bucket" field.
Equal Width Binning
So far you divided students to groups based on arbitrary letter grades (custom binning) and to equally sizes groups based on their grades (equal width binning). None of these grouping technics gives you a good sense of the data distribution. One way to visualize the data and get a sense of how grades are distributed is using a histogram.
Histogram
To draw a histogram you need to divide grades into equal width ranges. Grades range from 0 to 100, so you can split the range to 10 bars of 10 each:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ) SELECT floor((grade - 1) / 10) as bucket, min(grade) from_grade, max(grade) to_grade, count(*) FROM grades GROUP BY bucket ORDER BY bucket; bucket │ from_grade │ to_grade │ count ────────┼────────────┼──────────┼─────── 3 │ 40 │ 40 │ 7 4 │ 41 │ 50 │ 20 5 │ 51 │ 58 │ 12 6 │ 61 │ 70 │ 11 7 │ 71 │ 79 │ 11 8 │ 81 │ 90 │ 14 9 │ 92 │ 100 │ 25
To assign each grade to the right bucket we used a little arithmetics. This worked out nicely because the arithmetics here are fairly simple, but what if you wanted smaller buckets? Say 20 buckets of width 5? or 25 buckets of width 4? That would have made the calculation more complicated.
To simplify the task of assigning values into equal width buckets within a predefined range, PostgreSQL provides the function width_bucket
:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ) SELECT width_bucket(grade, 0 ,101, 20) as bucket, (width_bucket(grade, 0 ,101, 20) - 1) * 5 as low_bound, width_bucket(grade, 0 ,101, 20) * 5 as high_bound, count(*) FROM grades GROUP BY bucket ORDER BY bucket; bucket │ low_bound │ high_bound │ count ───────┼───────────┼────────────┼─────── 8 │ 35 │ 40 │ 7 9 │ 40 │ 45 │ 13 10 │ 45 │ 50 │ 7 11 │ 50 │ 55 │ 8 12 │ 55 │ 60 │ 4 13 │ 60 │ 65 │ 7 14 │ 65 │ 70 │ 4 15 │ 70 │ 75 │ 7 16 │ 75 │ 80 │ 4 17 │ 80 │ 85 │ 6 18 │ 85 │ 90 │ 8 19 │ 90 │ 95 │ 7 20 │ 95 │ 100 │ 18
Ths function width_bucket
accepts the value to assign, two arguments for the lower and higher bounds of the range, and the number of buckets to divide the range into.
You should be careful with the values you set for the higher and lower bounds. The higher bound of the range is exclusive, meaning, if you set the higher bound to 100, grades that equal 100 will be considered out of range, and will result in an additional bucket. This is why the query above uses 101 as the higher bound.
To calculate the higher bound of each bucket, we multiply the index of the bucket by the width. To get the lower bound, we multiple the width by the index minus one, or in other words, the higher bound of the previous bucket.
Histograms are great for visualization, but if you try to draw a histogram from the result above you won't be able to get a real sense of the distribution because you might have gaps. Notice for example, how the range above starts with bucket 9, which is not the first bucket. This is because in our grades table, no one got a grade which is less than 40.
The function width_bucket
is useful, but we already do most of the hard work, so might as well generate the buckets on our own using generate_series
:
SELECT bucket, (bucket - 1) * 5 + 1 AS low_bound, bucket * 5 AS high_bound FROM generate_series(1, 20) AS bucket; bucket │ low_bound │ high_bound ───────┼───────────┼──────────── 1 │ 1 │ 5 2 │ 6 │ 10 3 │ 11 │ 15 4 │ 16 │ 20 5 │ 21 │ 25 6 │ 26 │ 30 7 │ 31 │ 35 8 │ 36 │ 40 9 │ 41 │ 45 10 │ 46 │ 50 11 │ 51 │ 55 12 │ 56 │ 60 13 │ 61 │ 65 14 │ 66 │ 70 15 │ 71 │ 75 16 │ 76 │ 80 17 │ 81 │ 85 18 │ 86 │ 90 19 │ 91 │ 95 20 │ 96 │ 100
The query generates 20 ranges of 5 within a range 1 to 100. To create the histogram, use this table as an axis and join it to the grades:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ), buckets AS ( SELECT bucket, (bucket - 1) * 5 + 1 AS low_bound, bucket * 5 AS high_bound FROM generate_series(1, 20) AS bucket ) SELECT bucket, low_bound, high_bound, COUNT(grade) AS cnt FROM buckets LEFT JOIN grades ON grade BETWEEN low_bound AND high_bound GROUP BY bucket, low_bound, high_bound ORDER BY bucket; bucket │ low_bound │ high_bound │ cnt ───────┼───────────┼────────────┼───── 1 │ 1 │ 5 │ 0 2 │ 6 │ 10 │ 0 3 │ 11 │ 15 │ 0 4 │ 16 │ 20 │ 0 5 │ 21 │ 25 │ 0 6 │ 26 │ 30 │ 0 7 │ 31 │ 35 │ 0 8 │ 36 │ 40 │ 7 9 │ 41 │ 45 │ 13 10 │ 46 │ 50 │ 7 11 │ 51 │ 55 │ 8 12 │ 56 │ 60 │ 4 13 │ 61 │ 65 │ 7 14 │ 66 │ 70 │ 4 15 │ 71 │ 75 │ 7 16 │ 76 │ 80 │ 4 17 │ 81 │ 85 │ 6 18 │ 86 │ 90 │ 8 19 │ 91 │ 95 │ 7 20 │ 96 │ 100 │ 18
To make sure you don't have any gaps in the data, you LEFT JOIN
ed the grades to the generated axis table buckets
. As a result, some rows do not have a value. Using COUNT(*)
count rows, so buckets with no grades return 1. To overcome that, count only rows with grades using COUNT(grade)
.
Try it!
To finish off with a bang, you can enhance your query with a little ascii chart to display the histogram straight in the terminal:
WITH grades AS ( SELECT round(70 + sin(n) * 30)::int AS grade FROM generate_series(1, 100) AS n ), buckets AS ( SELECT bucket, (bucket - 1) * 5 + 1 AS low_bound, bucket * 5 AS high_bound FROM generate_series(1, 20) AS bucket ) SELECT low_bound || ' - ' || high_bound as bounds, COUNT(grade) AS cnt, repeat('■', COUNT(grade)::int) as chart FROM buckets LEFT JOIN grades ON grade BETWEEN low_bound AND high_bound GROUP BY bucket, low_bound, high_bound ORDER BY bucket; bounds │ cnt │ chart ─────────┼─────┼──────────────────── 1 - 5 │ 0 │ 6 - 10 │ 0 │ 11 - 15 │ 0 │ 16 - 20 │ 0 │ 21 - 25 │ 0 │ 26 - 30 │ 0 │ 31 - 35 │ 0 │ 36 - 40 │ 7 │ ■■■■■■■ 41 - 45 │ 13 │ ■■■■■■■■■■■■■ 46 - 50 │ 7 │ ■■■■■■■ 51 - 55 │ 8 │ ■■■■■■■■ 56 - 60 │ 4 │ ■■■■ 61 - 65 │ 7 │ ■■■■■■■ 66 - 70 │ 4 │ ■■■■ 71 - 75 │ 7 │ ■■■■■■■ 76 - 80 │ 4 │ ■■■■ 81 - 85 │ 6 │ ■■■■■■ 86 - 90 │ 8 │ ■■■■■■■■ 91 - 95 │ 7 │ ■■■■■■■ 96 - 100 │ 18 │ ■■■■■■■■■■■■■■■■■■
And there you have it! A histogram with SQL, right there in your terminal...
Take Away
If there is one thing you take away from this article, it should be this - use the best tool for the job!. Pandas is great, and SQL is also great. Each has its strengths and weaknesses, and you have a better chance of creating an optimal data pipeline if know both!
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK