Basic Steps of Machine Learning

Inside Ai

Through the Lens of Oncology

Jul 27 · 6 min read

Let’s start our deeper dive into the basics of machine learning. We will begin with the fundamental theory behind the use of machine learning and the outline we will use for approaching novel problems. We will culminate with a discussion of potential problems that could be solved with some common ML models.

From NCI on Unsplash

Why Machine Learning?

Machine learning requires a new way of approaching problems. In traditional programming, a human with some domain knowledge must derive a program that transforms an input to a desired output. In this case, the input and transformation are known.

In machine learning, it is the transformation that is unknown. That is, we provide the inputs and the resulting outputs (at least in supervised learning), and the algorithm learns the transformation that connects them. As our modeling becomes more complex, working out the necessary transformation for every feature by hand through conventional programming becomes intractable.

Differences between conventional programming and machine learning are highlighted by their output. Link

Furthermore, as our datasets become larger and more complicated, it becomes difficult for a human to identify novel, significant patterns in the data. Machines can make sense of this dimensionality more easily. That said, there is a limit to the number of features we can include in our model due to the “curse of dimensionality”. As the number of dimensions increases, the distance between our points grows and the data becomes sparse. We will discuss later why this matters and how to combat the “curse”.
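To make the sparsity claim concrete, here is a minimal sketch (not from the original post) that measures the average distance between random points as the number of dimensions grows. The function name, point count, and dimensions are arbitrary choices for illustration.

```python
import math
import random

def mean_pairwise_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs

# Same number of points, more dimensions: the points drift further apart.
for dims in (2, 10, 100):
    print(dims, round(mean_pairwise_distance(50, dims), 2))
```

With the number of points held fixed, the average distance keeps growing with the number of features, which is one way to see why high-dimensional data tends to be sparse.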

How should we approach an ML problem?

I’ll start by saying that this post describes a generalized ML workflow. The relative use cases, strengths, and weaknesses of various models are outside the scope of this piece, but resources are available with a quick Google search. Here, for example.

1 — Get Data

This could take many forms. If you have access to electronic health records (EHR), great. If you have experimental data from a single study or data across a meta-study, even better. Generally, the more you can examine at first, the better. During data cleaning and EDA we will get a better feel for the data and will make more informed decisions about what to include, or not.

APIs are a great resource. Here are some options to check out:

HealthData.gov: COVID-19, EHR, Medicare, etc.
HealthIT.gov: EHR, state-wide Performance Indexes, Clinician Churn, etc.
Rapid API: a collection of healthcare-related APIs on a wide range of topics

2 — Process Your Data

Once you have a collection of data, you need to pre-process it for use in a machine learning model. Some models have very specific requirements for the input data (like no missing entries), so make sure you read the documentation for your chosen model and package, if applicable.

There are some steps to be taken regardless of model type. Some processes you should be familiar with:

Handle missing values: you can drop NaN values, impute the median or mean from similar subsets, or flag the missing value with its own class or a chosen stand-in value.
Remove duplicate entries: Pandas has a drop_duplicates() method.
Check for inconsistencies: for example, misspellings that add extra categories to a column, or a numerical value stored as a ‘string’ or ‘object’ (check with df.column.dtype).
Filter out influential outliers: a word of warning, you should not remove an outlier unless it is genuinely distorting your model. Not all outliers will be influential to your model’s accuracy.
If the target is categorical, check for class imbalances.
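As a rough illustration of the checks listed above, here is a small pandas sketch. The DataFrame, its column names, and its values are all made up for the example; real EHR data will need more careful handling.

```python
import pandas as pd

# Hypothetical patient records; every column name and value is invented.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age": ["54", "61", "61", None, "47"],           # numbers stored as strings
    "tumor_stage": ["II", "ii", "ii", "III", "I"],   # inconsistent casing
    "relapsed": [0, 1, 1, 0, 0],
})

df = df.drop_duplicates()                            # remove exact duplicate rows
df["age"] = pd.to_numeric(df["age"])                 # fix the dtype
df["age_missing"] = df["age"].isna()                 # flag missing values before imputing
df["age"] = df["age"].fillna(df["age"].median())     # impute with the median
df["tumor_stage"] = df["tumor_stage"].str.upper()    # merge 'ii' into 'II'

print(df.dtypes)
print(df["relapsed"].value_counts(normalize=True))   # quick class-balance check
```

Median imputation is only one of the options mentioned in the list; dropping the rows or keeping the missing-value flag as its own feature may suit your data better.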

Another process you will likely need to do is feature selection and engineering. This could include creating dummy variables for categorical columns, removing features that exhibit multicollinearity (highly correlated predictors), or creating a new column, derived from other columns, that may provide new insights.
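Here is a short sketch of those three ideas with pandas. The columns (weight, height, smoking status) are hypothetical and chosen only to make the example readable.

```python
import pandas as pd

# Hypothetical features; names and values are illustrative only.
df = pd.DataFrame({
    "weight_kg": [70.0, 82.0, 55.0, 95.0],
    "weight_lb": [154.3, 180.8, 121.3, 209.4],   # redundant with weight_kg
    "height_m": [1.75, 1.80, 1.60, 1.70],
    "smoker": ["never", "former", "current", "never"],
})

# Engineered feature: body mass index derived from two existing columns.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Inspect correlations to spot redundant (multicollinear) features.
corr = df[["weight_kg", "weight_lb", "height_m"]].corr().abs()
print(corr.round(2))
df = df.drop(columns=["weight_lb"])              # drop one of the correlated pair

# Dummy variables for the categorical column; drop_first avoids a redundant dummy.
df = pd.get_dummies(df, columns=["smoker"], drop_first=True)
print(df.columns.tolist())
```

Which member of a correlated pair to drop, and whether to drop one at all, depends on the model and on domain knowledge.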

Here’s a great article to delve deeper into feature engineering, including scaling, log transformations, one-hot encoding, etc.

3 — Choose a Model

Finally, time for a model! Again, there are many resources to determine what model will fit your needs. Here’s a great summary from scikit-learn.

Based on our use case, we will need to determine whether our model should use supervised or unsupervised learning. For supervised learning, we are attempting to train a model that properly transforms inputs to corresponding known outputs. This means we need to know what the output is, which is not always the case. When the outputs are unknown, we use unsupervised learning, in which we let the model find the relationships within our data.

Some examples of supervised learning models are regression and classification models, like logistic regression, support vector machines (SVMs), or neural networks.

For unsupervised learning, we have no known outputs. These models are often used in clustering analysis and dimensionality reduction.

4 — Training

So now we have a model chosen and data collected and processed. We are ready to start training the model. We do this by first splitting our dataset into training, validation, and test sets. The exact ratio of train/validate/test data depends on what model you use. Models with fewer hyperparameters are easier to tune and thus may need less validation data.
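The split itself can be sketched in a few lines of plain Python. The 70/15/15 ratio and the function name below are assumptions for illustration, not a recommendation from the original post.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the rows once, then slice them into three disjoint sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the data arrives in some order (by date or by hospital, for example). For heavily imbalanced classes, a stratified split may be a better fit than this purely random one.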

Depending on your package, you will train in different ways. As always, check the documentation. scikit-learn and Keras models have a .fit() method, while PyTorch requires writing your own training loop.
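As one example of the .fit() style, here is a minimal scikit-learn sketch. The data is synthetic (generated by make_classification), and logistic regression is just a stand-in for whichever model you chose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a processed dataset: 500 rows, 10 numeric features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                      # training happens here
print("validation accuracy:", model.score(X_val, y_val))
```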

5 — Evaluation

There are a few metrics you’ve probably heard of that you can use to determine the success of your model:

Accuracy: fraction of all outputs that are correctly classified, (True Positives + True Negatives) / Total
Precision: ratio of True Positives to all predicted positives (True Positives + False Positives)
Recall: ratio of True Positives to total actual positives (True Positives + False Negatives)

These are all very similar at first glance. Precision matters most when False Positives are costly. Recall matters most when False Negatives are costly, such as a missed diagnosis. As always, keeping your business understanding and goals in mind should drive your evaluation metrics.

I caution against using accuracy as your main, or only, evaluation metric. As an example, let’s say you are developing a model to predict if a patient has a rare disease that only affects 0.01% of the population. Due to the large class imbalance, a model can simply predict “no disease” every time and achieve an accuracy of 99.99%! But, as you can guess, this is not a useful model, because it never identifies a single patient with the disease.
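The rare-disease example can be worked through directly. The sketch below assumes a hypothetical population of 10,000 patients with exactly one positive case (0.01%) and a model that always predicts “no disease”. Returning 0.0 when a ratio is undefined is a convention chosen here, not a universal rule.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Undefined when nothing is predicted positive; report 0.0 by convention.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1] + [0] * 9999      # one sick patient in 10,000
y_pred = [0] * 10000           # the model always says "no disease"

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("accuracy: ", accuracy(tp, tn, fp, fn))   # 0.9999
print("precision:", precision(tp, fp))          # 0.0
print("recall:   ", recall(tp, fn))             # 0.0
```

The accuracy looks excellent while recall is zero, which is exactly the failure that accuracy alone hides.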

A note on using your validation set to tune your model: as you iteratively improve your model, be aware that evaluating against the same validation set over and over will lead to overfitting to that set, and the model will not generalize well to new data (the actual test set). One way to combat this is to rotate which data serves as the validation set via cross validation, such as k-fold cross validation. In short, you split the training data into k subsets (folds), hold out one fold as the validation set for a single fitting, and rotate to a different fold for the next.
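The fold rotation in k-fold cross validation can be sketched without any library. The generator below is a hypothetical helper that only produces index lists; fitting and scoring a model on each split is left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]    # k roughly equal, disjoint folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

for fold_number, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5)):
    print(fold_number, "validate on", sorted(val_idx))
```

Every sample is used for validation exactly once, and the metric is typically averaged over the k fittings. The held-out test set should stay outside this loop entirely.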

Here’s a great, humorous use of machine learning computer vision: chihuahua or muffin?

6 — Tuning

Before we begin, we must distinguish between parameters and hyperparameters. Parameters are what the model itself learns during training: the weights and biases for different features, which are adjusted through error minimization (for example, gradient descent).

Hyperparameters are what the human user sets, for example, the number of iterations the model goes through, the learning rate, the number and composition of hidden layers, dropout, the optimizer, the loss function, etc. These hyperparameters are how we actually tune our model. With a lot of patience and trial and error, we can find a better setup that improves our metrics. Keep in mind that depending on our goal, inference or prediction, this process could look very different. Perhaps we need to add complexity to get a little boost in accuracy, or we need to remove complexity to make our model more explainable (i.e. less of a black box).
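To illustrate the distinction, here is a toy gradient descent for a one-feature linear model. The weight w and bias b are the parameters the training loop learns, while learning_rate and n_iterations are hyperparameters we choose. The data and default values are made up for the sketch.

```python
def fit_line(xs, ys, learning_rate=0.05, n_iterations=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0                     # parameters: learned by the loop below
    n = len(xs)
    for _ in range(n_iterations):       # hyperparameter: number of iterations
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w     # hyperparameter: learning rate
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Rerunning with a much smaller n_iterations or a different learning_rate changes how close w and b get to 2 and 1, which is the kind of trial and error that tuning involves.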

Here are some resources to delve deeper into tuning.

7 — Prediction

Now that we have our model trained and tuned, we can start making predictions! This is what we’ve been building to. Exactly what we are predicting depends on what our model is. Perhaps we are using a CNN to predict the presence of cancer from histological slides, or using EHR data to develop a classification model that predicts patient hospitalizations.

To Recap Our Process

Collect and Prepare Your Data
Choose a Model
Train, Evaluate, Tune, Repeat
Predict!

So now you can go forth and data science with a plan in mind! Keep in mind that this is not a linear process; you should expect to repeat these steps as many times as necessary for your goals.

My previous article on the applications of machine learning in oncology can be found here:

Machine Learning & AI Applications in Oncology

Recent advancements in oncology have led to exciting options for cancer treatment and long-term remission. However…
