Statistics is the Grammar of Data Science — Part 3/5
source link: https://www.tuicool.com/articles/hit/amAvyyi
Statistics refresher to kick-start your Data Science journey
This is the third article in the ‘Statistics is the Grammar of Data Science’ series, covering measures of location (percentiles and quartiles) and moments.
Revision
Bookmarks to the rest of the articles for easy access:
Part 3: Measures of Location | Moments (this article)
Part 4: Covariance | Correlation
Part 5: Conditional Probability | Bayes’ Theorem
Measures of Location
Percentiles
Percentiles divide ordered data into hundredths. In a sorted dataset, the p-th percentile is the value below which p% of the data falls.
The 50th percentile is the median.
For instance, imagine the growth chart of baby girls from birth until 2 years old. By following the 98th percentile line, we can see that 98% of one-year-old baby girls weigh less than 11.5 kg.
Another popular example is a country’s income distribution. The 99th percentile is the income below which 99% of the country earns, while the remaining 1% earns more. In the case of the UK on the graph below, this is £75,000.
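To make the definition concrete, here is a minimal sketch using Python's standard-library `statistics.quantiles`. The weights are made-up illustrative values (not read from the growth chart), and the `percentile` helper is a hypothetical convenience wrapper, not a library function.

```python
import statistics

# Hypothetical baby weights in kg, purely illustrative.
weights = [7.9, 8.4, 8.8, 9.1, 9.3, 9.6, 9.9, 10.2, 10.8, 11.6]

# With n=100, quantiles() returns the 99 cut points for percentiles 1..99.
# method="inclusive" treats the sample as containing its own min and max.
cut_points = statistics.quantiles(weights, n=100, method="inclusive")


def percentile(p):
    """Return the p-th percentile (1 <= p <= 99) of the sample above."""
    return cut_points[p - 1]


p50 = percentile(50)
p90 = percentile(90)
median = statistics.median(weights)

# Count how many observations fall below the 90th percentile.
below_p90 = sum(w < p90 for w in weights)

print(f"50th percentile: {p50:.2f} kg, median: {median:.2f} kg")
print(f"90th percentile: {p90:.2f} kg, values below it: {below_p90} of {len(weights)}")
```

Different tools interpolate between data points in slightly different ways, so small samples can give slightly different percentile values depending on the method chosen.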
Quartiles
Quartiles are special percentiles, which divide the data into quarters. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median is both the second quartile, Q2, and the 50th percentile.
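The correspondence between quartiles and percentiles can be checked with a short sketch, again assuming the standard-library `statistics.quantiles` with the "inclusive" method and a made-up sample.

```python
import statistics

# Hypothetical sample, purely illustrative.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]

# n=4 yields the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# n=100 yields percentiles 1..99; index p-1 holds the p-th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(f"Q1={q1}, Q2={q2}, Q3={q3}")
print(f"P25={p25}, P50={p50}, P75={p75}, median={statistics.median(data)}")
```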
Interquartile Range (IQR)
The IQR is a number that indicates how spread out the middle half (i.e. the middle 50%) of the dataset is, and it can help identify outliers. It is the difference between Q3 and Q1.
IQR = Q3 - Q1
Generally speaking, outliers are those data points that fall outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR.
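The rule above can be sketched as a small function. The name `iqr_outliers`, its parameter `k` and the sample values are assumptions made for illustration; the quartiles come from the standard-library `statistics.quantiles`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the IQR, the (lower, upper) fences and the points outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return iqr, (lower, upper), outliers


# Hypothetical sample with one suspiciously large value.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]
iqr, fences, outliers = iqr_outliers(sample)

print(f"IQR={iqr}, fences={fences}, outliers={outliers}")
```

The factor 1.5 is a convention rather than a law, which is why it is exposed as the parameter `k` here.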
Box Plots
Box plots (also called box and whisker plots) illustrate:
- how concentrated the data is, and
- how far the extreme values are from most of the data.
A box plot consists of a scaled horizontal or vertical axis and a rectangular box.
The minimum and maximum values are the endpoints of the axis (-15 and 5 in this case). Q1 marks one end of the blue box and Q3 the other, with the median usually drawn as a line inside the box.
The ‘whiskers’ (shown in purple) extend from the ends of the box to the smallest and largest data values. Some box plots also have dots marking outlier values (shown in red). In those cases, the whiskers do not extend to the minimum and maximum values, but stop at the most extreme values that are not outliers.
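The numbers behind such a plot can be computed without drawing anything. The sketch below assumes the common 1.5 × IQR convention for deciding where the whiskers stop; `box_plot_stats` is a hypothetical helper, and the sample only borrows the endpoints -15 and 5 from the example above, with the remaining values invented.

```python
import statistics


def box_plot_stats(values, k=1.5):
    """Compute the box, whisker ends and outliers of a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers reach the most extreme points that still lie within the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical sample: only the endpoints -15 and 5 mirror the example.
stats = box_plot_stats([-15, -6, -4, -3, -2, -1, 0, 1, 2, 3, 5])
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this sample the lower whisker stops at -6 rather than at the minimum, because -15 falls outside the lower fence and is reported as an outlier instead.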
Box Plots on a Normal Distribution
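There is a subtle nuance with box plots on normal distributions: quartile 1 (Q1) and quartile 3 (Q3) should not be confused with one standard deviation either side of the mean. On a normal distribution, Q1 and Q3 lie about 0.6745 standard deviations below and above the mean, and the box between them holds 50% of the data, with 25% in each tail beyond it. The familiar figures of 34.135% on each side of the mean, and 68.27% in total, describe the area within one standard deviation of the mean, which is a wider interval than the box.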
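A quick numerical check of these figures, as a sketch using Python's standard-library `statistics.NormalDist`: it computes where the quartiles of a standard normal distribution fall and how much of the data lies between them, alongside the share lying within one standard deviation of the mean.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartile positions, expressed in standard deviations from the mean.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)

# Share of the data between the quartiles.
mass_between_quartiles = std_normal.cdf(q3) - std_normal.cdf(q1)

# Share of the data within one standard deviation of the mean.
mass_within_one_sigma = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

print(f"Q1 = {q1:.4f} sigma, Q3 = {q3:.4f} sigma")
print(f"Between Q1 and Q3: {mass_between_quartiles:.2%}")
print(f"Within one sigma:  {mass_within_one_sigma:.2%}")
```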
Moments
Moments describe various aspects of the nature and shape of our distribution.
#1: The first moment is the mean of the data, which describes the location of the distribution.
#2: The second moment (taken about the mean) is the variance, which describes the spread of the distribution. The higher the variance, the more spread out the data.
#3: The third moment is the skewness, which is basically a measure of how lopsided a distribution is. A positive skew means the bulk of the data leans left and the distribution has a long right tail, which pulls the mean to the right of the bulk of the data. A negative skew is the mirror image:
#4: The fourth moment is the kurtosis, which describes how thick the tails are and how sharp the peak is. It indicates how likely it is to find extreme values in our data: a higher kurtosis makes outliers more likely. This sounds a lot like spread (variance) but is subtly different: variance measures the overall spread, whereas kurtosis measures how much of that spread comes from rare, extreme values.
We can see that the curves with the higher, sharper peaks have the higher kurtosis, i.e. the topmost curve has a higher kurtosis than the bottommost curve.
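As a rough sketch, the four moments can be computed directly from their definitions. The function name `moments` and the two samples are illustrative assumptions. The code uses the population (divide by n) forms, and reports plain kurtosis, for which a normal distribution scores 3, rather than excess kurtosis.

```python
import statistics


def moments(values):
    """Return (mean, variance, skewness, kurtosis) using population formulas."""
    n = len(values)
    mean = statistics.fmean(values)
    # Second central moment: average squared distance from the mean.
    variance = sum((x - mean) ** 2 for x in values) / n
    std = variance ** 0.5
    # Third and fourth standardized moments.
    skewness = sum(((x - mean) / std) ** 3 for x in values) / n
    kurtosis = sum(((x - mean) / std) ** 4 for x in values) / n
    return mean, variance, skewness, kurtosis


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

sym_mean, sym_var, sym_skew, sym_kurt = moments(symmetric)
rt_mean, rt_var, rt_skew, rt_kurt = moments(right_tailed)

print(f"symmetric:    mean={sym_mean:.2f} var={sym_var:.2f} "
      f"skew={sym_skew:.2f} kurt={sym_kurt:.2f}")
print(f"right-tailed: mean={rt_mean:.2f} var={rt_var:.2f} "
      f"skew={rt_skew:.2f} kurt={rt_kurt:.2f}")
```

The symmetric sample has a skewness of zero, while the single large value in the second sample gives it a positive skew and pulls its mean to the right of most of its data.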
That’s all folks! This was a rather short article; we learnt how important percentiles are, as they indicate where we stand in relation to everyone else. Then we saw a special category, called quartiles, and their application to finding outliers. Finally, we explored the four ‘moments’ which describe a curve’s shape.
Thanks for reading! Part 4 is coming soon…
I regularly write about Technology & Data on Medium — if you would like to read my future posts then please ‘Follow’ me!