
The 4 steps necessary before fitting a machine learning model



A plain, object-oriented approach to data processing.


Photo by chuttersnap on Unsplash

There are many steps in a common machine learning pipeline, and much thought goes into architecting it: problem definition, data acquisition, error detection and data cleaning, etc. In this story, we begin with the assumption that we have a clean, ready-to-go dataset.

With that in mind, we outline the four steps necessary before fitting any machine learning model. We then implement those steps in PyTorch, using a common syntax for invoking multiple method calls: method chaining. The goal is to define a simple yet generalizable API that transforms any raw dataset into a format that is ready to be consumed by a machine learning model.

To this end, we will use the builder pattern, which constructs a complex object using a step-by-step approach.

The builder pattern is a design pattern, which provides a flexible solution to object creation problems in object-oriented programming. Its aim is to separate the construction of a complex object from its representation.

So, what are those 4 things? In its simplest form, processing data before modelling includes four distinct actions:

  1. Load the data
  2. Split into train/valid/test sets
  3. Label the data tuples
  4. Obtain batches of data

In the following sections, I analyze those four steps one by one and implement them in code. Our goal is to finally create a PyTorch DataLoader, an abstraction PyTorch uses to represent an iterable over a dataset. Having a DataLoader is the first step in setting up the training loop. So, without further ado, let us get our hands dirty.

Loading the data

For this example, we use a mock dataset that is kept in a pandas DataFrame. Our goal is to create one PyTorch DataLoader for the training set and one for the validation set. Thus, let us build a class named DataLoaderBuilder that is responsible for building those loaders.
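A minimal sketch of such a class, assuming the builder does nothing more than store a torch.Tensor, could look like this:

    import torch

    class DataLoaderBuilder:
        def __init__(self, data: torch.Tensor):
            # The builder's only state, for now, is the raw data tensor.
            self.data = data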

We see that the only operation of the DataLoaderBuilder is to store a data variable, whose type is torch.Tensor. So now, we need a way to initialize it from a pandas DataFrame. For that, we use a Python classmethod.

A classmethod is a plain Python method, but instead of receiving an instance (self) as the first argument, it receives the class itself. Thus, given a pandas DataFrame, we turn the DataFrame into a PyTorch tensor and instantiate the DataLoaderBuilder class, which is passed to the method as the cls argument. Optionally, we can keep only the columns of the DataFrame we care about. After defining it, we patch it onto the main DataLoaderBuilder class.
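A sketch of such a classmethod, assuming the name from_df and an optional columns argument (both are illustrative choices, not a fixed API), could be:

    import pandas as pd
    import torch

    @classmethod
    def from_df(cls, df: pd.DataFrame, columns=None):
        # Optionally keep only the columns we care about.
        if columns is not None:
            df = df[columns]
        # Turn the DataFrame into a float tensor and instantiate the
        # class passed in as `cls` (here, DataLoaderBuilder).
        data = torch.tensor(df.values, dtype=torch.float)
        return cls(data)

    # Patch the classmethod onto the main class.
    DataLoaderBuilder.from_df = from_df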

Splitting into Training & Validation

For this example, we split the dataset into two sets: training and validation. It is easy to extend the code and split it into three sets: training, validation, and testing.

We want to split the dataset randomly, keeping some percentage of the data for training and setting aside what is left for validation. To this end, we use PyTorch's SubsetRandomSampler. You can read more about this sampler and many more sampling methods in the official PyTorch documentation.

By default, we keep 90% of the data for training and we split across rows (axis=0). Another detail in the code is that we return self. Thus, after creating the train_data and valid_data splits, we return the whole builder object. This will permit us to use method chaining in the end.
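A possible implementation, assuming the method name split_by_rand_pct (again an illustrative name) and that iterating over the sampler is how the random row order is obtained:

    from torch.utils.data import SubsetRandomSampler

    def split_by_rand_pct(self, train_pct=0.9):
        # Iterating over a SubsetRandomSampler yields the row
        # indices (axis=0) in random order.
        indices = list(SubsetRandomSampler(range(self.data.shape[0])))
        cut = int(train_pct * len(indices))
        # Keep train_pct of the rows for training, the rest for validation.
        self.train_data = self.data[indices[:cut]]
        self.valid_data = self.data[indices[cut:]]
        # Return the builder itself to allow method chaining.
        return self

    DataLoaderBuilder.split_by_rand_pct = split_by_rand_pct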

Label the Dataset

Next, we should label the dataset. Most of the time, we use some feature variables to predict a dependent variable (i.e. the target). That is, of course, called supervised learning. The label_by_func method annotates the dataset according to a given function. After this call, the dataset is usually converted to (features, target) tuples.

We see that the label_by_func method accepts a function as an argument and applies it to the train and valid sets. Our job is to design a function that serves our purposes whenever we want to label a dataset of some form. Later, in the “putting it all together” example, we show how simple it is to create such a function.
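A minimal sketch of this method could look like this:

    def label_by_func(self, func):
        # Apply the user-supplied labelling function to both splits;
        # after this, each split is a (features, target) tuple.
        self.train_data = func(self.train_data)
        self.valid_data = func(self.valid_data)
        return self

    DataLoaderBuilder.label_by_func = label_by_func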

Create Batches

Finally, only one step is left: breaking the dataset into batches. For this, we can leverage PyTorch's TensorDataset and DataLoader classes.

This is the last method in the chain; thus, we name it “build”. It creates the train and valid datasets, and having them, it is easy to instantiate the corresponding PyTorch DataLoaders with a known batch size. Keep in mind that we have now labelled the data; thus, self.train_data is a tuple of features and a target variable. Consequently, self.train_data[0] keeps the features and self.train_data[1] holds the target.
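A sketch of the build method, assuming it returns the two DataLoaders at the end of the chain (the return value and the default batch size are assumptions):

    from torch.utils.data import DataLoader, TensorDataset

    def build(self, batch_size=64):
        # self.train_data[0] holds the features, self.train_data[1] the target.
        train_ds = TensorDataset(self.train_data[0], self.train_data[1])
        valid_ds = TensorDataset(self.valid_data[0], self.valid_data[1])
        # Wrap each dataset in a DataLoader with a known batch size.
        self.train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
        self.valid_dl = DataLoader(valid_ds, batch_size=batch_size)
        return self.train_dl, self.valid_dl

    DataLoaderBuilder.build = build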

Having that in place, let us put it all together with a simple example.

In this example, we create a dummy dataset of three columns, where the last column stores the target or dependent variable. We then define a get_label function that pulls the last column and creates a features-target tuple. Finally, using method chaining, we can easily create the data loaders we need from a given pandas DataFrame.
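Under the assumptions above, the end-to-end example could look like this (the column names and batch size are arbitrary):

    import numpy as np
    import pandas as pd

    # A dummy dataset of three columns; the last one stores the target.
    df = pd.DataFrame(np.random.rand(100, 3), columns=['x1', 'x2', 'y'])

    def get_label(data):
        # Split the tensor into features (all but the last column)
        # and target (the last column).
        return data[:, :-1], data[:, -1]

    train_dl, valid_dl = (DataLoaderBuilder.from_df(df)
                          .split_by_rand_pct(train_pct=0.9)
                          .label_by_func(get_label)
                          .build(batch_size=16))

    xb, yb = next(iter(train_dl))
    print(xb.shape, yb.shape)  # e.g. torch.Size([16, 2]) torch.Size([16])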

Conclusion

In this story, we saw the four necessary steps of data processing before fitting any model, assuming that the dataset is clean. Although this is a toy example, it can be used and extended to cover a wide variety of machine learning problems.

Also, there are steps that are not covered in this article (e.g. data normalization, or augmentation for computer vision), but the goal of the story is to provide a general idea of how to structure code that solves a relevant problem.

My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium, LinkedIn or @james2pl on Twitter.

