36

Smoothing Data In SQL

 4 years ago
source link: https://www.tuicool.com/articles/MFRRVzf
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Implement A Moving Average to Smooth Your Data in SQL

Aug 14 ·4min read

A problem found throughout the world of data is how to distinguish signal from noise. When dealing with data that comes in a sequence, such as time series data (the most familiar example but by no means the only example), a frequent method of dealing with the problem is to apply a smoother.

Smoothers can range from the very simple to very sophisticated, but we will concentrate on the simpler end of the spectrum.

All these smoothers, no matter how simple or how complicated have in common that they rely on the data appearing in a particular order. To calculate the smoother, you need to be able to refer to multiple rows around the row in focus, something SQL was not always well set up to do.

In more recent years, major implementations of SQL including MySQL, SQLite, PostgreSQL and SQL Server have added Lead() and Lag() functions allowing calculations to be performed on rows appearing before and after a row of interest. We can make use of these functions to implement simple smoothers in SQL, such as a weighted moving average.

Like all the windowing functions, SQL gives you flexibility on how to specify the data that the function applies to using the ‘PARTITION’ and ‘OVER’ key words. In semantic terms

SELECT Index, Classifier, Value, LAG(Value) OVER (Partition By Classifier ORDER BY Index)
FROM YourTable

Note that the default offset is one place but you can specify the offset within the Lag function e.g. Lag(Value,2) is two rows behind the original row.

However, we can’t return multiple window function variables in the same column, so we can’t do the arithmetic we need to do to calculate the weighted moving average itself. Therefore we need to create a Common Table Expression with to provide the lagged columns, and run a subsequent query to calculate the weighted moving average itself. Again, in purely semantic terms:

WITH Base_Table
AS
(SELECT Index, Classifier, Value 
  ,LAG(Value) OVER (Partition By Classifier ORDER BY Index) as LagVal1
  ,LAG(Value,2) OVER (Partition By Classifier ORDER BY Index) as LagVal2FROM YourTable)SELECT Index, Classifier
            , (Value + LagVal1 + LagVal2)/3 as MovingAverage
From Base_Table

However, to illustrate properly we need a good example. For better or worse, I’m an Australian from Victoria, so the football code I follow is Australian Rules (less likely to be true if I was an Australian from NSW or Queensland). My team is the Essendon Bombers who recently recorded one of the biggest losses of their history against the Western Bulldogs, only managing to score 33 points to the Western Bulldogs’ 137 points.

By way of self-flagellation, I thought it would instructive to calculate the moving average of the scores of both teams in recent matches in order to better see how each of them are trending.

Firstly, we need to create a table containing the scores from the last several matches.

CREATE TABLE AFL_SCORES(  Team VarChar(30)
, RoundNum Int
, Score Int
)
;INSERT INTO AFL_SCORESVALUES
  ('Essendon', 14, 71)
, ('Essendon', 15, 77)
, ('Essendon', 16, 76)
, ('Essendon', 17, 86)
, ('Essendon', 18, 96)
, ('Essendon', 19, 106)
, ('Essendon', 20, 67)
, ('Essendon', 21, 33)
, ('Western Bulldogs', 14, 73)
, ('Western Bulldogs', 15, 66)
, ('Western Bulldogs', 16, 71)
, ('Western Bulldogs', 17, 74)
, ('Western Bulldogs', 18, 89)
, ('Western Bulldogs', 19, 113)
, ('Western Bulldogs', 20, 80)
, ('Western Bulldogs', 21, 137);

Then we have the query itself, following the pattern introduced above of a CTE to create the lagged columns and a subsequenty query to calculate the moving average itself. Note that this is the simple moving average, but in many cases a weighted moving average is used, with a variety of different weighting schemes to suit differing tastes.

WITH LagsTable
AS
(SELECT   Team
          , RoundNum
          , Score
          , Lag(Score)    OVER (Partition By Team Order By RoundNum) AS ScoreLag1
          , Lag(Score,2)  OVER (Partition By Team Order By RoundNum) AS ScoreLag2FROM AFL_SCORES)SELECT   Team
         , RoundNum
         , (Score+ScoreLag1+ScoreLag2)/3 As MovingAverageFROM LagsTable

Giving the result:

AFfURr3.png!web

Possibly a little bit of comfort for Essendon fans there in the sense that it appears the Western Bulldogs have been improving for several rounds in a row, a stronger trend than Essendon’s more recent collapse in scoring.

In any case, a useful, if personally painful way to illustrate how SQL can be used to calculate a moving average with window functions.

Robert de Graaf’s book, Managing Your Data Science Projects , is out now through Apress.

Follow Robert on Twitter


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK