

Important Topics in Machine Learning That Every Data Scientist Must Know
source link: https://towardsdatascience.com/important-topics-in-machine-learning-that-every-data-scientist-must-know-9e387d880b3a?gi=e267a73e99e3

Evaluating the Basics of Machine Learning
Jul 24 · 7 min read
Machine learning (ML) is “an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.” ML algorithms are used to find patterns in data that generate insight and help make data-driven decisions and predictions. These types of algorithms are employed every day to make critical decisions in medical diagnosis, stock trading, transportation, legal matters and much more. It is easy to see, therefore, why data scientists place ML on such a high pedestal: it provides a medium for high-priority decisions that can guide better business and smarter actions in real time, without human intervention.
Now, ML models do not necessarily ‘learn’ the way humans do. Rather, these algorithms use computational methods to learn information directly from data without relying on a predetermined equation as a model. To do this, the algorithms are made to determine a pattern in the data and develop a target function that best maps an input variable, x, to a target variable, y. Note that the true form of the target function is usually unknown. If the function were known, ML would not be needed.
Therefore, the idea is to determine the best estimate of this target function by conducting sound inference about the sample data, and then to apply and optimize the appropriate ML technique for the situation at hand. Different situations require that different assumptions be made about the form of the function being estimated. Additionally, different ML algorithms make different assumptions about the shape of the function and, thus, how it should be optimized. Understandably, it is easy to get overwhelmed by how much there is to learn in ML. So, in this post, I discuss two important topics in ML that every data scientist should know.
1. The Type of Learning
ML algorithms are often categorized as either supervised or unsupervised, and this broadly refers to whether the dataset being used is labelled or not. Supervised ML algorithms apply what has been learned in the past to new data by using labelled examples to predict future outcomes. Essentially, the correct answer is known for these types of problems, and the estimated model’s performance is judged on whether or not the predicted output is correct. In contrast, unsupervised ML algorithms are those developed when the information used to train the model is neither classified nor labelled. These algorithms attempt to make sense of the data by extracting features and patterns that can be found within the sample.
Semi-supervised learning also exists, and it takes the middle ground between supervised and unsupervised learning. That is, a small portion of the data is labelled and the remainder is not.
Supervised learning is useful when the task given is a classification or regression problem. Classification problems refer to grouping observations or input data into discrete ‘classes’ based on particular criteria developed by the model. A typical example of this would be predicting whether an email is spam or non-spam. The model would be developed and trained on a dataset containing both spam and non-spam emails, where each observation is appropriately labelled.
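To make the spam example concrete, here is a minimal sketch of a supervised classifier. It uses a small multinomial naive Bayes model with add-one smoothing, which is only one of many possible choices and is not named in the article. The four training emails and the function names `train_naive_bayes` and `predict` are made up for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labelled training set: every observation carries its answer.
emails = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("please review the project report", "non-spam"),
]
model = train_naive_bayes(emails)
print(predict("claim your free prize", model))
print(predict("monday project meeting", model))
```

The labels are what make this supervised: the model is judged by whether its predicted class matches the known class of each email.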
Regression problems, on the other hand, refer to the process of accepting a set of input data and determining a continuous quantity as the output. A common example of this is predicting an individual’s income, given their education level, gender, and the total amount of hours worked.
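As a rough illustration of regression, the sketch below fits a straight line by ordinary least squares to predict a continuous income value from hours worked. It uses a single predictor to stay short, whereas the example above mentions several. The data points and the names `fit_line`, `hours`, and `income` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sample: weekly hours worked and weekly income.
hours = [20, 30, 40, 50]
income = [410, 590, 820, 980]

slope, intercept = fit_line(hours, income)
predicted = slope * 45 + intercept  # continuous output for an unseen input
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Unlike the classifier, the output here is a quantity on a continuous scale rather than one of a fixed set of classes.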
Unsupervised learning is most appropriate when the answer to a particular question is more or less unknown. These algorithms are mainly used for clustering and anomaly detection because it is possible to detect similarities across observations without knowing exactly what each observation refers to. For example, one can look at the colour, size, and shape of various flowers and then roughly separate them into groups without truly knowing the species of each flower. Additionally, consider a credit card company monitoring consumer behaviour. It would be possible to detect fraudulent transactions by monitoring where transactions have occurred. For example, suppose a credit card is frequently used in New York. If, on a particular day, the card is used in New York, Los Angeles and Hong Kong, then this could be considered an anomaly and the system should alert the relevant parties.
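The flower example can be sketched with k-means clustering, one common unsupervised technique (the article does not name a specific algorithm). The measurements below are invented, and the function `kmeans` with its deterministic starting rule is an illustrative assumption rather than a standard library routine.

```python
import math

def kmeans(points, k, n_iter=100):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Deterministic start: the first point, then the point farthest
    # from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical (petal length, petal width) measurements, with no species labels.
flowers = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (5.1, 1.8), (4.9, 1.5)]
labels, centroids = kmeans(flowers, k=2)
print(labels)
```

The algorithm never sees a species name. It only groups observations that lie close together, which is the sense in which the flowers are separated "without truly knowing the species".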
2. Model Fitting
Fitting a model refers to making an algorithm determine the relationship between the predictors and the outcome so that future values can be predicted. Recall that the models are developed using training data, which is ideally a large random sample that accurately reflects a population. This necessary action comes with some very undesirable risks. Fully accurate models are difficult to estimate because sample data are subject to random noise. This random noise, along with the number of assumptions made by the researcher, has the potential to cause ML models to learn fake patterns within the data. If one tries to combat this risk by making too few assumptions, it can cause the model to not learn enough information from the data. These issues are known as overfitting and underfitting, and the goal is to determine an appropriate mix between simplicity and complexity.
Overfitting occurs when a model learns ‘too much’ from the training data, including random noise. Models are then able to determine very intricate patterns within the data, but this negatively affects the performance on new data. The noise picked up in the training data does not apply to new or unseen data, and the model is unable to generalize the patterns found. Certain ML models are more prone to overfitting than others, and these include both nonlinear and nonparametric models. For these types of models, overfitting can be overcome by altering the model itself. Consider a polynomial model of degree 4. It is possible to reduce overfitting by lowering the degree to, say, 3, provided acceptable results are still produced. Alternatively, overfitting can be detected with cross-validation and limited by regularizing the model parameters.
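A small sketch can show the symptom of overfitting: near-perfect training error paired with worse error on new data. It compares a 1-nearest-neighbour model (a nonparametric model that memorizes the training set) against a straight-line fit. The data are invented: the training targets follow y = 2x with alternating noise of +1 and -1, and the test targets are assumed noise-free purely to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def nearest_neighbor_predict(x, train_xs, train_ys):
    """Return the target of the single closest training point."""
    i = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[i]

def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]          # y = 2x with noise +1, -1, +1, ...
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]      # assumed noise-free for illustration

slope, intercept = fit_line(train_x, train_y)

def line(x):
    return slope * x + intercept

def knn(x):
    return nearest_neighbor_predict(x, train_x, train_y)

knn_train_mse = mse([knn(x) for x in train_x], train_y)
knn_test_mse = mse([knn(x) for x in test_x], test_y)
line_train_mse = mse([line(x) for x in train_x], train_y)
line_test_mse = mse([line(x) for x in test_x], test_y)

print("1-NN  train/test MSE:", knn_train_mse, round(knn_test_mse, 3))
print("line  train/test MSE:", round(line_train_mse, 3), round(line_test_mse, 3))
```

The memorizing model reproduces every training point, noise included, so its training error is zero, yet it does worse than the simpler line on the unseen points.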
Underfitting, on the other hand, occurs when a model is unable to learn a sufficient amount of information from the training data. The models are then unable to determine suitable patterns within the data, and this negatively affects the performance on new data. Since very little is learned, the model cannot apply much to unseen data and it is unable to generalize observations for the research problem at hand. Commonly, underfitting is a result of model misspecification and can be fixed by using a more appropriate ML algorithm. For example, if a linear equation is used to estimate a nonlinear problem, underfitting will occur. Cross-validation can also help diagnose underfitting, and loosening any regularization placed on the parameters can help correct it.
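The linear-versus-nonlinear case can be sketched as follows. A straight line is fitted to data generated from y = x², and then the same least-squares routine is given the squared input as its feature, which stands in here for "a more appropriate" model. The data and the helper names are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def training_mse(features, ys):
    """Fit a line to (features, ys) and return its error on the same data."""
    slope, intercept = fit_line(features, ys)
    preds = [slope * f + intercept for f in features]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]             # a nonlinear relationship

linear_mse = training_mse(xs, ys)                       # misspecified: y ~ x
quadratic_mse = training_mse([x ** 2 for x in xs], ys)  # better suited: y ~ x^2

print("linear model MSE:   ", linear_mse)
print("quadratic model MSE:", quadratic_mse)
```

The linear model performs poorly even on the data it was trained on, which is the signature of underfitting. Changing the form of the model, not collecting more data, is what removes the error here.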
Cross-validation is a technique used to evaluate a model’s fit by training several models on different subsets of the sample dataset and evaluating each one on the complementary subset that was held out from its training.
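One common variant is k-fold cross-validation, sketched below for a straight-line model. The fold assignment (every k-th point goes to the same fold), the data, and the name `k_fold_mse` are assumptions made for illustration. In practice the data would usually be shuffled before being split.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_mse(xs, ys, k):
    """Average held-out mean squared error of a line fit over k folds."""
    fold_errors = []
    for fold in range(k):
        # Points whose index is congruent to `fold` mod k are held out;
        # the remaining points form the training subset.
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        errors = [(slope * xs[i] + intercept - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical noisy sample around a straight line.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.2, 3.9, 7.1, 9.8, 13.2, 15.9]
cv_error = k_fold_mse(xs, ys, k=3)
print(cv_error)
```

Because every point is scored by a model that never saw it, the averaged error is a fairer estimate of performance on new data than the training error alone.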
Regularization refers to the process of adding information to a model parameter in order to combat poor model performance. This can be done by specifying that a parameter follows a particular distribution, such as a normal distribution rather than a uniform distribution, or by giving a range of values that a parameter must fall within.
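Both forms of added information can be sketched for a one-parameter model y ≈ w·x. An L2 (ridge) penalty is used as the concrete stand-in for a distributional assumption, since it corresponds to assuming a zero-mean normal prior on w. The range restriction is shown as simple clipping of the fitted value, which is the exact constrained solution for this single-parameter case. The article does not name either technique, and the data and function names are illustrative.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimizing sum((y - w*x)^2) + penalty * w^2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (
        sum(x * x for x in xs) + penalty
    )

def clip(value, low, high):
    """Force a parameter to lie within the range [low, high]."""
    return max(low, min(high, value))

xs = [1, 2, 3]
ys = [2, 4, 6]

unregularized = ridge_slope(xs, ys, penalty=0)   # ordinary least squares
regularized = ridge_slope(xs, ys, penalty=14)    # shrunk toward zero
bounded = clip(unregularized, -1.5, 1.5)         # range restriction

print(unregularized, regularized, bounded)
```

A larger penalty expresses a stronger belief that the parameter should be close to zero, so the fitted slope shrinks as the penalty grows.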
Machine learning models are extremely powerful, but with great power comes great responsibility. Developing the most appropriate ML model requires that the researcher adequately understands the problem at hand and what techniques will be suitable given the circumstance. Understanding whether a problem is supervised or unsupervised will provide some insight into what type of ML algorithm will be used; while understanding the model fit can prevent poor model performance when deployed. Happy modelling!
References:
machinelearningmastery.com/how-machine-learning-algorithms-work/
Other Useful Material:
simplilearn.com/importance-of-machine-learning-for-data-scientists-article
towardsdatascience.com/important-topics-in-machine-learning-you-need-to-know-21ad02cc6be5