3

使用SQL统计箱形图数据

 2 years ago
source link: https://www.biaodianfu.com/sql-boxplot.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

数据, 术→技巧

使用SQL统计箱形图数据

钱魏Way · 2021-09-06 · 14 次浏览

箱形图常用于数据的可视化,先前的文章中介绍过使用Python生成箱形图。箱形图中通常包含的数据有:

  • 最小值(不包含异常值)Lower Whisker = Q1-1.5(Q3-Q1)
  • 最大值(不包含异常值)Upper Whisker = Q3+1.5(Q3-Q1)
  • 平均值(可通过设置显示)
  • 下四分位Q1
  • 上四分位Q3

除了可视化外,箱型图有时也会用于异常检测等场景。今天主要介绍的是如何通过SQL(Hive SQL)来获取箱形图的值。

方式一:使用分位函数获取近似值

hive中有两个函数percentile和percentile_approx,可以用来计算分位数。

  • percentile:percentile(col, p) col是要计算的列(值必须为int类型),p的取值为0-1,若为5即中位数。
  • percentile_approx:percentile_approx(col, p)。列为数值类型可以使int也可以使float。percentile_approx还有一种形式percentile_approx(col, p,B),参数B控制内存消耗的近似精度,B越大,结果的精度越高。默认值为10000。当col字段中的distinct值的个数小于B时,结果就为准确的百分位数。

代码实现:

SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
, q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
SELECT product, count(1) AS quantity, avg(price) AS price_mean
, max(price) AS price_max, min(price) AS price_min
, percentile_approx(price, 0.5) AS price_median
, percentile_approx(price, 0.25) AS q1
, percentile_approx(price, 0.75) AS q3
FROM testdb.product_price
GROUP BY product
SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
    , q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
    SELECT product, count(1) AS quantity, avg(price) AS price_mean
        , max(price) AS price_max, min(price) AS price_min
        , percentile_approx(price, 0.5) AS price_median
        , percentile_approx(price, 0.25) AS q1
        , percentile_approx(price, 0.75) AS q3
    FROM testdb.product_price
    GROUP BY product
) t

方式二:使用窗口函数获取精确值

WITH details AS (
SELECT product, price, ROW_NUMBER() OVER (PARTITION BY od ORDER BY price) AS rn
, SUM(1) OVER (PARTITION BY product ) AS total
FROM testdb.product_price
quartiles AS (
SELECT product, price
, AVG(CASE
WHEN rn >= FLOOR(total / 2.0) / 2.0
AND rn <= FLOOR(total / 2.0) / 2.0 + 1
THEN price / 1.0
ELSE NULL
END) OVER (PARTITION BY product ) AS q1
, AVG(CASE
WHEN rn >= total / 2.0
AND rn <= total / 2.0 + 1
THEN price / 1.0
ELSE NULL
END) OVER (PARTITION BY product ) AS median
, AVG(CASE
WHEN rn >= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0
AND rn <= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0 + 1
THEN price / 1.0
ELSE NULL
END) OVER (PARTITION BY product ) AS q3
FROM details
SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
, q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
SELECT product, count(1) AS order_count, avg(price) AS price_mean
, max(price) AS price_max, min(price) AS price_min
, AVG(median) AS price_median, AVG(q1) AS q1
, AVG(q3) AS q3
FROM quartiles
GROUP BY product
WITH details AS (
        SELECT product, price, ROW_NUMBER() OVER (PARTITION BY od ORDER BY price) AS rn
            , SUM(1) OVER (PARTITION BY product ) AS total
        FROM testdb.product_price
    ), 
    quartiles AS (
        SELECT product, price
            , AVG(CASE 
                WHEN rn >= FLOOR(total / 2.0) / 2.0
                    AND rn <= FLOOR(total / 2.0) / 2.0 + 1
                THEN price / 1.0
                ELSE NULL
            END) OVER (PARTITION BY product ) AS q1
            , AVG(CASE 
                WHEN rn >= total / 2.0
                    AND rn <= total / 2.0 + 1
                THEN price / 1.0
                ELSE NULL
            END) OVER (PARTITION BY product ) AS median
            , AVG(CASE 
                WHEN rn >= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0
                    AND rn <= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0 + 1
                THEN price / 1.0
                ELSE NULL
            END) OVER (PARTITION BY product ) AS q3
        FROM details
    )
SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
    , q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
    SELECT product, count(1) AS order_count, avg(price) AS price_mean
        , max(price) AS price_max, min(price) AS price_min
        , AVG(median) AS price_median, AVG(q1) AS q1
        , AVG(q3) AS q3
    FROM quartiles
    GROUP BY product
) t

参考链接:


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK