Important Topics in Machine Learning That Every Data Scientist Must Know

Evaluating the Basics of Machine Learning

Jul 24 · 7 min read

Image by Trist’n Joseph

Machine learning (ML) is “an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.” ML algorithms are used to find patterns in data that generate insight and help make data-driven decisions and predictions. These types of algorithms are employed every day to make critical decisions in medical diagnosis, stock trading, transportation, legal matters and much more. It is easy to see why data scientists place ML on such a high pedestal: it provides a means of making high-priority decisions that guide better business choices and smarter actions, in real time and without human intervention.

Now, ML models do not necessarily ‘learn’ the way humans do. Rather, these algorithms use computational methods to learn information directly from data without relying on a predetermined equation as a model. To do this, the algorithms look for a pattern in the data and develop a target function that best maps an input variable, x, to a target variable, y. It must be noted here that the true form of the target function is usually unknown. If the function were known, then ML would not be needed.

Therefore, the idea is to determine the best estimate of this target function by conducting sound inference on the sample data and then applying and optimizing the ML technique appropriate for the situation at hand. Different situations require that different assumptions be made about the form of the function being estimated. Additionally, different ML algorithms make different assumptions about the shape of the function and, thus, how it should be optimized. Understandably, it is easy to get overwhelmed by how much there is to learn in ML. So, in this post, I discuss two important topics in ML that every data scientist should know.

1. The Type of Learning

ML algorithms are often categorized as either supervised or unsupervised, and this broadly refers to whether the dataset being used is labelled or not. Supervised ML algorithms apply what has been learned in the past to new data by using labelled examples to predict future outcomes. Essentially, the correct answer is known for these types of problems, and the estimated model’s performance is judged on whether or not the predicted output is correct. In contrast, unsupervised ML algorithms are those developed when the information used to train the model is neither classified nor labelled. These algorithms work by attempting to make sense of the data by extracting features and patterns that can be found within the sample.

Semi-supervised learning also exists, and it takes the middle ground between supervised and unsupervised learning. That is, a small portion of the data is labelled and the remainder is not.

Supervised learning is useful when the task given is a classification or regression problem. Classification problems refer to grouping observations or input data into discrete ‘classes’ based on particular criteria developed by the model. A typical example of this would be predicting whether an email is spam or non-spam. The model would be developed and trained on a dataset containing both spam and non-spam emails, where each observation is appropriately labelled.
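
To make the spam example concrete, here is a minimal sketch of a supervised classifier. I am assuming a tiny, made-up set of labelled emails and a simple naive Bayes word-count model; the data, the `train` and `predict` helpers, and the choice of naive Bayes are all illustrative assumptions rather than a recommended production approach.

```python
import math
from collections import Counter

# Hypothetical, hand-made training set: (email text, label) pairs.
TRAINING_EMAILS = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("lunch on monday with the team", "non-spam"),
    ("project report for the meeting", "non-spam"),
]


def train(emails):
    """Count how often each word appears under each label."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of emails
    for text, label in emails:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_emails = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: share of training emails carrying this label.
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Laplace (add-one) smoothing avoids log(0) for unseen words.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_EMAILS)
print(predict("free prize now", model))
print(predict("agenda for the monday meeting", model))
```

Because every training email carries a label, the model’s predictions can be checked directly against known answers, which is what makes the problem supervised.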

Regression problems, on the other hand, refer to the process of accepting a set of input data and determining a continuous quantity as the output. A common example of this is predicting an individual’s income, given their education level, gender, and the total number of hours worked.
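
As a rough illustration of regression, the sketch below fits a straight line by ordinary least squares. To keep it short I assume a single predictor (hours worked) and a handful of made-up income figures; `fit_line` is a hypothetical helper, and a real income model would use more predictors and far more data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: weekly hours worked and weekly income.
hours_worked = [20, 30, 40, 50]
weekly_income = [400, 650, 800, 1050]

intercept, slope = fit_line(hours_worked, weekly_income)

# Predict a continuous quantity for an unseen input (35 hours).
predicted_income = intercept + slope * 35
print(intercept, slope, predicted_income)
```

The output is a number on a continuous scale rather than a class label, which is the distinction between regression and classification.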

Unsupervised learning is most appropriate when the answer to a particular question is more or less unknown. These algorithms are mainly used for clustering and anomaly detection because it is possible to detect similarities across observations without knowing exactly what each observation refers to. For example, one can look at the colour, size, and shape of various flowers and then roughly separate them into groups without truly knowing the species of each flower. Additionally, consider a credit card company monitoring consumer behaviour. It would be possible to detect fraudulent transactions by monitoring where transactions have occurred. For example, consider a credit card that is frequently used in New York. If, on a particular day, the card is used in New York, Los Angeles and Hong Kong, then this could be considered an anomaly and the system should alert the relevant parties.
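
The flower example can be sketched with a small clustering routine. Here I assume k-means as the clustering method, two made-up measurements per flower, and a naive initialisation that uses the first k points as starting centroids; none of these choices come from a specific dataset, and `k_means` is a hypothetical helper written for this post.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Group points into k clusters without using any labels."""
    # Assumption: naive initialisation using the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Made-up (petal length, petal width) measurements; no species labels.
flowers = [
    (1.4, 0.2), (4.7, 1.4), (1.3, 0.2),
    (4.5, 1.5), (1.5, 0.3), (5.0, 1.7),
]

labels, centroids = k_means(flowers, k=2)
print(labels)
```

The algorithm never sees a species name, yet the small-petal and large-petal flowers end up in separate groups.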

2. Model Fitting

Fitting a model refers to having an algorithm determine the relationship between the predictors and the outcome so that future values can be predicted. Recall that models are developed using training data, which is ideally a large random sample that accurately reflects the population. This necessary step comes with some undesirable risks. Fully accurate models are difficult to estimate because sample data are subject to random noise. This random noise, together with the assumptions made by the researcher, can cause ML models to learn spurious patterns in the data. If one tries to combat this risk by constraining the model too heavily, the model may fail to learn enough from the data. These issues are known as overfitting and underfitting, respectively, and the goal is to find an appropriate balance between simplicity and complexity.

Overfitting occurs when a model learns ‘too much’ from the training data, including random noise. Such models pick up very intricate patterns within the data, but this negatively affects performance on new data. The noise picked up in the training data does not carry over to new or unseen data, so the model is unable to generalize the patterns it found. Certain ML models are more prone to overfitting than others, including nonlinear and nonparametric models. For these types of models, overfitting can be reduced by altering the model itself. Consider a polynomial model of degree 4. It is possible to reduce overfitting by lowering the degree to, say, 3, provided that acceptable results are still produced. Alternatively, overfitting can be limited by using cross-validation to choose the model’s complexity or by applying regularization to the model parameters.
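
A small sketch can show what learning the noise looks like. Instead of the polynomial example, I assume a nonparametric ‘memorizer’ (1-nearest-neighbour) compared against a straight-line fit. The true relationship is taken to be y = 2x, and a fixed alternating offset of +1 and -1 stands in for random noise so the result is reproducible; all names and numbers are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def memorizer_predict(x, train_xs, train_ys):
    """1-nearest-neighbour: return the target of the closest training point."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[nearest]


# Training data: y = 2x plus an alternating +1/-1 offset (stand-in for noise).
train_xs = list(range(10))
train_ys = [2 * x + (1 if x % 2 == 0 else -1) for x in train_xs]

# Unseen data: new inputs with noise-free targets from the true relationship.
test_xs = [x + 0.4 for x in range(9)]
test_ys = [2 * x for x in test_xs]

intercept, slope = fit_line(train_xs, train_ys)

memorizer_train = mse(
    [memorizer_predict(x, train_xs, train_ys) for x in train_xs], train_ys
)
memorizer_test = mse(
    [memorizer_predict(x, train_xs, train_ys) for x in test_xs], test_ys
)
line_train = mse([intercept + slope * x for x in train_xs], train_ys)
line_test = mse([intercept + slope * x for x in test_xs], test_ys)

print("memorizer train/test:", memorizer_train, memorizer_test)
print("line      train/test:", line_train, line_test)
```

The memorizer reproduces the training data perfectly, noise included, but does worse than the simpler line on the unseen points. That gap between training and test error is the usual symptom of overfitting.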

Underfitting, on the other hand, occurs when a model is unable to learn a sufficient amount of information from the training data. Such models are unable to determine suitable patterns within the data, and this negatively affects performance on new data. Since very little is learned, the model cannot apply much to unseen data and is unable to generalize observations for the research problem at hand. Commonly, underfitting is the result of model misspecification and can be fixed by using a more appropriate ML algorithm. For example, if a linear equation is used to estimate a nonlinear problem, underfitting will occur. Cross-validation can also help diagnose underfitting, and adjusting how strongly the parameters are regularized (typically by relaxing the regularization) can help correct it.

Cross-validation is a technique used to evaluate a model’s fit by training several models on different subsets of the sample dataset and evaluating each one on the complementary subset that was held out from its training.
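
The procedure can be sketched as k-fold cross-validation. I am assuming contiguous, unshuffled folds, a straight-line model, and made-up data with an alternating +1/-1 offset standing in for noise; `k_fold_splits` and `cross_validate` are hypothetical helpers, and in practice the data would normally be shuffled before splitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


def cross_validate(xs, ys, k):
    """Return the validation mean squared error of each fold."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        # Fit only on the training part of this fold.
        intercept, slope = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        # Evaluate only on the held-out part.
        errors = [(intercept + slope * xs[i] - ys[i]) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return scores


xs = list(range(10))
ys = [2 * x + (1 if x % 2 == 0 else -1) for x in xs]

scores = cross_validate(xs, ys, k=5)
print(scores)
print("average validation error:", sum(scores) / len(scores))
```

Every observation is used for validation exactly once, so the average score gives a steadier estimate of performance on unseen data than a single train/test split.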

Regularization refers to the process of adding information to a model, typically as a constraint on its parameters, in order to combat poor model performance. This can be done by specifying that a parameter follows a particular distribution, such as a normal distribution rather than a uniform distribution, or by giving a range of values that a parameter must fall within.
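
Both ideas can be sketched on a one-parameter model, y = w * x. I assume an L2 (ridge) penalty for the first, which corresponds to placing a zero-mean normal prior on w, and a simple clamp for the second, which keeps w inside an allowed range. The data, the penalty value, and the range are made up for illustration.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimising sum((y - w*x)^2) + penalty * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w * (sxx + penalty) = sxy.
    return sxy / (sxx + penalty)


def clamp(value, lower, upper):
    """Force a parameter to fall within the range [lower, upper]."""
    return max(lower, min(upper, value))


# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

unregularized = ridge_slope(xs, ys, penalty=0)
regularized = ridge_slope(xs, ys, penalty=10)
bounded = clamp(unregularized, lower=0.0, upper=1.5)

print(unregularized, regularized, bounded)
```

A larger penalty pulls the slope towards zero, trading a little training accuracy for a simpler model. For this one-parameter least-squares problem, clamping the unconstrained estimate gives the best value within the allowed range.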

Machine learning models are extremely powerful, but with great power comes great responsibility. Developing the most appropriate ML model requires that the researcher adequately understand the problem at hand and which techniques are suitable given the circumstances. Understanding whether a problem is supervised or unsupervised provides some insight into which type of ML algorithm to use, while understanding model fit can prevent poor performance once the model is deployed. Happy modelling!