
Foundation statistics terms decoded

 4 years ago
source link: https://towardsdatascience.com/foundation-statistics-terms-decoded-f1def3721c1e?gi=32425222a16d

Explained in VS style, first left then right!

Apr 27 · 3 min read


Believe me! You can!

Anyone who embarks on the long and difficult journey of becoming well versed in statistical learning methods (SLM) runs into overly technical explanations of some very basic terminology. This post aims to debunk the popular myth that statistics is ‘difficult’ by explaining a few foundation terms in plain English.

Example case: suppose we would like to determine the success of a Medium story (D) based on different parameters like reading grade level (I), words per sentence (I), time to read (I), etc. (A Medium article reference is at the end.)

1. Dependent (Response) VS Independent (Predictor) variables

The probability of success of a story is the dependent variable, or the response (marked as D). The variables used to predict it (marked as I) are the independent variables, or the predictors.

2. Prediction VS Inference methods

If we only care about getting the best possible output predictions, irrespective of the relationship between the dependent and the independent variables, we will go for a prediction method (generally complex and less interpretable, like an ensemble of many decision trees). But if the objective is also to understand the nature of the relationships between the response and the predictors, we will use an inference method (easy to understand, like linear regression).

3. Parametric VS Non-Parametric methods

A parametric method is a two-step approach. First, we make an assumption about the functional form (the shape) of the relationship between the response and the predictors. Let’s say we assume here that it is linear:

Probability of success = a + b(Reading grade level) + c(Words per sentence) + ... + n(Time to read)

The second step is to estimate the coefficients (a to n) from the training data using a technique like OLS (ordinary least squares). Because the problem is reduced to estimating a fixed set of coefficients, this model-based approach is parametric in nature.
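Here is a minimal sketch of those two steps, using a single predictor and the closed-form OLS formulas. The numbers and the names `fit_ols`, `grade_level` and `success_prob` are made up for illustration; they are not real Medium data.

```python
# Hypothetical data: reading grade level and an observed success rate per story.
grade_level = [6, 8, 10, 12, 14]
success_prob = [0.80, 0.70, 0.65, 0.50, 0.45]


def fit_ols(x, y):
    """Estimate intercept a and slope b of y = a + b*x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


a, b = fit_ols(grade_level, success_prob)
print(f"success = {a:.3f} + ({b:.3f}) * grade_level")
```

With these invented numbers the slope comes out negative, so the fitted line says success falls as the grade level rises. Reading a coefficient this way is exactly what a simple parametric model makes easy.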

Whereas, in non-parametric methods, we do not make any explicit assumptions about the functional form of the relationship between the response and the predictors. Such approaches can have a major advantage over parametric approaches: by avoiding that assumption, they have the potential to accurately fit a wider range of possible shapes. The trade-off is that they typically need many more observations to do so.
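One simple non-parametric method is k-nearest-neighbours regression: no formula is assumed, and a prediction is just the average response of the most similar training stories. The sketch below is illustrative only; the data and the name `knn_predict` are made up.

```python
# Hypothetical data: reading grade level and an observed success rate per story.
grade_level = [6, 8, 10, 12, 14]
success_prob = [0.80, 0.70, 0.65, 0.50, 0.45]


def knn_predict(x, y, query, k=2):
    """Predict the response at `query` as the mean response of its k nearest neighbours."""
    # Sort the training points by distance to the query and keep the closest k.
    nearest = sorted(zip(x, y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(yi for _, yi in nearest) / k


prediction = knn_predict(grade_level, success_prob, query=9, k=2)
print(prediction)
```

Notice that there are no coefficients to estimate here. The training data itself is the model, which is why such methods can follow almost any shape but need plenty of data.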

4. Supervised VS Unsupervised learning methods

The problem we are talking about relates the success of a Medium story to the predictors described above, which then helps us determine whether a story will be successful or not. To develop this model, we must have access to proper training data in which the success results are available for a bunch of already published stories. We use that data to train the model, which will then predict the success of future stories. This is a supervised learning approach.
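To make the "train on known results, then predict new ones" idea concrete, here is a small sketch using a nearest-centroid classifier. This is just one simple supervised method chosen for illustration; the labelled data and the names `train` and `predict` are hypothetical.

```python
# Hypothetical labelled training data: (reading grade level, was the story successful?)
training = [(6, "yes"), (7, "yes"), (8, "yes"), (13, "no"), (14, "no"), (15, "no")]


def train(examples):
    """Learn one centroid (mean grade level) per class from labelled examples."""
    groups = {}
    for grade, label in examples:
        groups.setdefault(label, []).append(grade)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def predict(centroids, grade):
    """Assign a new story to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - grade))


model = train(training)
print(predict(model, 9))   # a new, unseen story with grade level 9
print(predict(model, 12))  # a new, unseen story with grade level 12
```

The key point is that `train` needs the "yes"/"no" column. Without those known outcomes there would be nothing to supervise the learning.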

However, suppose we come across a problem where no outcomes have been recorded at all. Say we know the characteristics of the country heads Trump has met, but not whether he liked any of them. With no response to learn from, we cannot train a predictive model. What we can do is look for structure in the data itself, such as groups of similar leaders. This is an unsupervised learning approach. (A model that keeps updating as the outcome of each new meeting comes in is usually called online learning; it still relies on those outcomes, so it remains supervised.)
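A common unsupervised technique is k-means clustering, which groups observations using only their predictors and no response at all. The sketch below is a bare-bones version under stated assumptions: the data is invented, the name `kmeans` is hypothetical, and the starting centroids are simply the first k points so that the result is repeatable.

```python
# Hypothetical unlabelled data: (reading grade level, words per sentence) per story.
stories = [(6, 10), (7, 12), (6, 11), (14, 25), (15, 27), (13, 24)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns one cluster index per point."""
    # Assumption: start from the first k points (real code usually randomises this).
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


labels = kmeans(stories, k=2)
print(labels)
```

The algorithm is never told which stories succeeded. It only discovers that the stories fall into two groups, and it is up to us to interpret what those groups mean.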

Last but definitely not least,

5. Regression VS Classification

Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values, like words per sentence or time to read the article. Categorical variables take on values in one of K different classes or categories, like Yes and No for a variable such as whether the title is in sentence case.

In simple terms, we refer to problems with a quantitative response as regression problems, while those involving a categorical response are often referred to as classification problems.
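The rule of thumb above can be written as a tiny check on the response values. This is only a sketch: the name `problem_type` and the example responses are hypothetical, and the check would be fooled by classes coded as numbers (for example 0 and 1), so treat it as an illustration of the idea rather than a reliable test.

```python
def problem_type(response_values):
    """Return 'regression' for a quantitative response, 'classification' otherwise."""
    quantitative = all(
        # bool is a subclass of int in Python, so exclude it explicitly.
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in response_values
    )
    return "regression" if quantitative else "classification"


print(problem_type([120, 45, 300]))        # a numeric response, e.g. a count
print(problem_type(["Yes", "No", "Yes"]))  # a categorical response
```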

I referred to the following Medium story to learn about the predictors of the success of a Medium story. It’s a nice read.

Thank you for reading :) Watch this space for more on statistics, data analytics and machine learning!

