Statistics is the Grammar of Data Science — Part 3/5

 5 years ago
source link: https://www.tuicool.com/articles/hit/amAvyyi

Statistics refresher to kick start your Data Science journey

This is the 3rd article of the ‘Statistics is the Grammar of Data Science’ series, covering Measures of Location (percentiles and quartiles) and Moments.

Revision

Bookmarks to the rest of the articles for easy access:

Part 2: Data Distributions

Part 3: Measures of Location | Moments (this article)

Part 4: Covariance | Correlation

Part 5: Conditional Probability | Bayes’ Theorem

Measures of Location

Percentiles

Percentiles divide ordered data into hundredths. In a sorted dataset, the p-th percentile is the value below which p percent of the data falls.

The 50th percentile is the median.
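
As a minimal sketch of this definition, using only the Python standard library: the helper name `percent_below` and the sample (the integers 1 to 100) are made up for illustration. Note that libraries differ slightly in how they interpolate between data points, so percentile values can vary a little depending on the method chosen.

```python
from statistics import median, quantiles

def percent_below(data, value):
    """Return the percentage of observations strictly less than `value`."""
    return 100 * sum(x < value for x in data) / len(data)

# Hypothetical sample: the integers 1..100
data = list(range(1, 101))

# 50 of the 100 values are smaller than 51
below_51 = percent_below(data, 51)

# quantiles(..., n=100) returns the 99 cut points P1..P99
cuts = quantiles(data, n=100, method="inclusive")
p50 = cuts[49]  # the 50th percentile

print(below_51)
print(p50, median(data))  # the 50th percentile and the median coincide
```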

For instance, imagine the growth chart of baby girls from birth until 2 years old. By following the lines, we can see that 98% of one-year-old baby girls weigh less than 11.5 kg.

Girls’ growth chart. Courtesy: World Health Organisation Child Growth Standards

Another popular example is a country’s income distribution. The 99th percentile is the income below which 99% of the country earns, while the remaining 1% earns more. In the case of the UK on the graph below, this is £75,000.

UK income distribution. Courtesy: Wikipedia

Quartiles

Quartiles are special percentiles, which divide the data into quarters. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median is called both the second quartile, Q2, and the 50th percentile.
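
A short sketch of the three quartiles on a made-up sample of nine values, using `statistics.quantiles` from the Python standard library. The `"inclusive"` method is just one of several common conventions; other methods can give slightly different Q1 and Q3 on small samples.

```python
from statistics import median, quantiles

# Hypothetical sample, already sorted for readability
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 yields the three cut points Q1, Q2 and Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == median(data))  # Q2 is the median
```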

Interquartile Range (IQR)

The IQR is a number that indicates how spread out the middle half (i.e. the middle 50%) of the dataset is, and it can help identify outliers. It is the difference between Q3 and Q1.

IQR = Q3 - Q1

IQR. Courtesy: Wikipedia

Generally speaking, outliers are those data points that fall outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR.
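
The 1.5 × IQR rule can be sketched as follows. The function name `iqr_outliers`, its parameter `k` and the sample data are assumptions made for illustration, and the quartiles use the standard library's `"inclusive"` convention.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the k * IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return outliers, (lower, upper)

# Hypothetical sample with one suspiciously large value
data = [2, 4, 4, 5, 7, 8, 9, 11, 40]
outliers, fences = iqr_outliers(data)

print(fences)    # the range outside which points count as outliers
print(outliers)
```

Here Q1 = 4 and Q3 = 9, so the IQR is 5 and the range runs from -3.5 to 16.5; only the value 40 falls outside it.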

Box Plots

Box plots (also called box and whisker plots) illustrate:

  • how concentrated the data is, and
  • how far the extreme values are from most of the data.

Elements of a boxplot. Courtesy: Wikimedia

A box plot consists of a scaled horizontal or vertical axis and a rectangular box.

The minimum and maximum values are the endpoints of the axis (-15 and 5 in this case). Q1 marks one end of the blue box and Q3 marks the other.

The ‘whiskers’ (shown in purple) extend from the ends of the box to the smallest and largest data values. Some box plots also mark outlier values with dots (shown in red). In those cases, the whiskers do not extend to the minimum and maximum values; they stop at the most extreme points that are not classed as outliers.
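
To make the elements of a box plot concrete, here is a sketch that computes the numbers a plotting library would draw, without drawing anything. The function `boxplot_stats` and the sample are hypothetical, and the whisker rule shown (whiskers stop at the most extreme points within 1.5 × IQR of the box) is one common convention, not the only one.

```python
from statistics import quantiles

def boxplot_stats(data, k=1.5):
    """Compute the box, whisker ends and outliers for a box plot."""
    values = sorted(data)
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers reach the most extreme points that are still inside the fences
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in values if x < lower_fence or x > upper_fence],
    }

# Hypothetical sample with one extreme value
stats = boxplot_stats([2, 4, 4, 5, 7, 8, 9, 11, 40])
print(stats)
```

With this sample the upper whisker ends at 11 rather than at the maximum, and 40 would be drawn as a separate dot.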

Boxplots on a Normal Distribution

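There is a subtle nuance with boxplots on normal distributions: quartile 1 (Q1) and quartile 3 (Q3) should not be confused with one standard deviation either side of the mean. On a normal distribution, Q1 and Q3 sit at about 0.6745σ below and above the mean, so the box still holds the middle 50% of the data, with 25% between each quartile and the median. The wider interval from −1σ to +1σ covers 68.27% of the data, i.e. 34.135% on each side of the mean. The limits Q1 - 1.5 × IQR and Q3 + 1.5 × IQR fall at about ±2.698σ and enclose roughly 99.3% of the data.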

Comparison of a boxplot of a nearly normal distribution (top) and a PDF for a normal distribution (bottom). Courtesy: Wikipedia
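
As a numerical check of where the quartiles and the 1.5 × IQR limits sit on a normal curve, here is a small sketch using `statistics.NormalDist` from the Python standard library. It works on the standard normal distribution (mean 0, standard deviation 1), so every position is expressed in units of σ; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

# Quartile positions, in units of sigma (about -0.6745 and +0.6745)
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1  # about 1.349 sigma

# Share of the data inside the box, and inside one sigma of the mean
inside_box = z.cdf(q3) - z.cdf(q1)           # 0.50 by construction
within_one_sigma = z.cdf(1) - z.cdf(-1)      # about 0.6827

# Position of the 1.5 * IQR limits and the share of data they enclose
upper_fence = q3 + 1.5 * iqr                 # about 2.698 sigma
within_fences = z.cdf(upper_fence) - z.cdf(-upper_fence)  # about 0.9930

print(round(q3, 4), round(iqr, 4))
print(round(inside_box, 4), round(within_one_sigma, 4), round(within_fences, 4))
```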

Moments

Moments describe various aspects of the nature and shape of our distribution.

#1 — The first moment is the mean of the data, which describes the location of the distribution.

#2 — The second moment (taken about the mean) is the variance, which describes the spread of the distribution. The higher the variance, the more spread out the data is.
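
A quick sketch of the first two moments on two made-up samples that share the same mean but differ in spread. It uses `fmean` and `pvariance` (the population variance) from the Python standard library.

```python
from statistics import fmean, pvariance

# Two hypothetical samples with the same mean but different spread
narrow = [9, 10, 10, 11]
wide = [2, 6, 14, 18]

# First moment: the mean (location)
mean_narrow, mean_wide = fmean(narrow), fmean(wide)

# Second central moment: the variance (spread)
var_narrow, var_wide = pvariance(narrow), pvariance(wide)

print(mean_narrow, mean_wide)
print(var_narrow, var_wide)
```

Both samples are centred on 10, but the second has a far larger variance because its values lie much further from the mean.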

#3 — The third moment is the skewness, which measures how lopsided a distribution is. A positive skew means the bulk of the data leans to the left, with a long right tail, so the mean sits to the right of the bulk of our data. A negative skew is the mirror image:

Skewness. Courtesy: Wikipedia
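
Skewness can be sketched as the third standardized moment: the average cubed deviation from the mean, divided by the standard deviation cubed. The function name `skewness` and the sample are made up for illustration, and this is the simple population formula with no small-sample correction.

```python
from statistics import fmean, median

def skewness(data):
    """Population skewness: third central moment / (variance ** 1.5)."""
    mu = fmean(data)
    n = len(data)
    m2 = sum((x - mu) ** 2 for x in data) / n  # variance
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical sample with a long right tail
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [-x for x in right_tailed]  # mirror image

skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)

print(skew_right > 0, skew_left < 0)
print(fmean(right_tailed), median(right_tailed))  # mean lies right of the median
```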

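#4 — The fourth moment is the kurtosis, which describes how thick the tails are and how sharp the peak is. It indicates how likely it is to find extreme values in our data: the higher the kurtosis, the more likely outliers are. This sounds a lot like spread (variance), but it is subtly different: variance measures the overall spread, whereas kurtosis measures how much of that spread comes from rare, extreme values.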

Kurtosis illustration of three curves. Courtesy: Wikipedia

We can see that the curves with sharper peaks have a higher kurtosis, i.e. the topmost curve has a higher kurtosis than the bottommost curve.
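
Kurtosis can be sketched as the fourth standardized moment: the average fourth-power deviation from the mean, divided by the variance squared. A normal distribution scores 3 on this scale. The function name `kurtosis` and both samples are made up for illustration.

```python
from statistics import fmean

def kurtosis(data):
    """Population kurtosis: fourth central moment / (variance ** 2)."""
    mu = fmean(data)
    n = len(data)
    m2 = sum((x - mu) ** 2 for x in data) / n  # variance
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2

# Hypothetical flat sample: values evenly spread, no extremes
flat = [1, 2, 3, 4, 5]

# Hypothetical peaked sample: most values at the centre, two extreme ones
peaked = [3, 3, 3, 3, 3, 3, 3, 3, -7, 13]

kurt_flat = kurtosis(flat)
kurt_peaked = kurtosis(peaked)

print(kurt_flat, kurt_peaked)
```

The peaked sample, with its sharp centre and two far-out values, scores higher than the flat one, which matches the reading of the illustration.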

That’s all folks! This was a rather short article; we learnt how important percentiles are, as they indicate where we stand in relation to everyone else. Then we saw a special category, called quartiles, and their application in finding outliers. Finally, we explored the four ‘moments’ which describe a curve’s shape.

Thanks for reading! Part 4 is coming soon…

I regularly write about Technology & Data on Medium — if you would like to read my future posts then please ‘Follow’ me!

