
The First Two Questions Every Data Scientist Must Answer

source link: https://towardsdatascience.com/the-first-two-questions-every-data-scientist-must-answer-4434028b2146?gi=a90bb3e7abc2

Selecting the right model and the composition of the training dataset are constant challenges in every data science project.


Building machine learning applications in the real world is a never-ending process of selecting and refining the right elements of a specific solution. Among those elements, the choice of model and the composition of the training dataset are, arguably, the two most important decisions data scientists need to make when architecting deep learning solutions. How do you decide which deep learning model to use for a specific problem? How do you know whether you are using the right training dataset or whether you should gather more data? Those two questions are the common denominator across all stages of the lifecycle of a deep learning application:

— What model should I use?

— How much training data should I gather?

Even though there is no magic answer to either question, there are several ideas that can guide your decision-making process. Let's start with the selection of the correct deep learning model.

Selecting a Baseline Model

The first thing to figure out when exploring an artificial intelligence (AI) problem is whether it is a deep learning problem at all. Many AI scenarios are perfectly addressable using basic machine learning algorithms. However, if the problem falls into the category of “AI-complete” scenarios such as vision analysis, speech translation, natural language processing, or others of a similar nature, then we need to start thinking about how to select the right deep learning model.

Identifying the correct baseline model for a deep learning problem is a complex task that can be segmented into two main parts:

I) Select the core learning algorithm.

II) Select the optimization algorithm that complements the algorithm selected in step I.

The choice of deep learning algorithm is closely tied to the structure of the training dataset. Again, there is no silver bullet for selecting the right algorithm for a deep learning problem, but the following design guidelines should help with the decision:

a) If the input dataset consists of images or similar topological structures, then the problem can be tackled using convolutional neural networks (CNNs) (see my previous articles about CNNs).

b) If the input is a fixed-size vector, we should be thinking of a feed-forward network with fully connected layers.

c) If the input is sequential in nature, then the problem is better suited to recurrent or recursive neural networks.
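To make the three guidelines above concrete, here is a minimal Keras sketch of what each baseline might look like. The input shapes, layer sizes, and number of classes are illustrative assumptions, not prescriptions from the article.

```python
# A minimal sketch of the three baseline choices above (illustrative sizes only).
from tensorflow import keras
from tensorflow.keras import layers

# a) Image-like (topological) input -> convolutional neural network
cnn = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),          # assumed 64x64 RGB images
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),   # assumed 10 classes
])

# b) Fixed-size vector input -> feed-forward (fully connected) network
mlp = keras.Sequential([
    layers.Input(shape=(100,)),               # assumed 100-dimensional feature vector
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# c) Sequential input -> recurrent network
rnn = keras.Sequential([
    layers.Input(shape=(None, 50)),           # variable-length sequences of 50-dim steps
    layers.LSTM(64),
    layers.Dense(10, activation="softmax"),
])
```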


Those principles are mostly applicable to supervised deep learning algorithms. However, plenty of scenarios can benefit from unsupervised deep learning models. In areas such as natural language processing or image analysis, unsupervised learning can be a useful technique for discovering relevant characteristics of the input dataset and structuring it accordingly.
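As one illustration of that idea, a simple autoencoder is a common unsupervised model (one possibility among many, not one prescribed by the article) for learning a compact representation of unlabeled data that can later feed a supervised model. The input and code dimensions below are assumptions.

```python
# A minimal autoencoder sketch for unsupervised feature learning with Keras.
# The 784-dimensional input and 32-dimensional code are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
code = layers.Dense(32, activation="relu")(inputs)         # learned compact features
reconstruction = layers.Dense(784, activation="sigmoid")(code)

autoencoder = keras.Model(inputs, reconstruction)
encoder = keras.Model(inputs, code)                        # reusable feature extractor
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10)     # trains without labels
# features = encoder.predict(x_new)                        # features for a downstream model
```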

In terms of the optimization algorithm, you can rarely go wrong with stochastic gradient descent (SGD). Variations of SGD, such as those that add momentum or a learning-rate decay schedule, are very popular in the deep learning space. Adam is, arguably, the most popular alternative to plain SGD, especially when combined with CNNs.
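A minimal sketch of those optimizer choices in Keras follows. The learning rates, momentum value, and decay schedule are illustrative assumptions to be tuned for your own problem.

```python
# SGD with momentum and learning-rate decay, versus Adam (illustrative settings).
from tensorflow import keras

# Plain SGD with momentum and an exponential learning-rate decay schedule
schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10_000, decay_rate=0.9)
sgd = keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# Adam, a popular alternative, especially with CNNs
adam = keras.optimizers.Adam(learning_rate=1e-3)

# Either optimizer plugs into the same compile call, e.g. for the CNN sketched earlier:
# cnn.compile(optimizer=sgd, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```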

Now we have an idea of how to select the right deep learning algorithm for a specific scenario. The next step is to validate the structure of the training dataset, which is the focus of the rest of this article.

Building the Right Training Dataset

Structuring a proper training dataset is an essential part of building effective deep learning models, but one that is particularly hard to get right. Part of the challenge comes from the intrinsic relationship between a model and its training dataset. If a model performs below expectations, it is often hard to determine whether the cause lies in the model itself or in the composition of the training dataset. While there is no magic formula for creating the perfect training dataset, there are some patterns that can help.

When confronted with a poorly performing deep learning model, data scientists should first determine whether optimization efforts should focus on the model itself or on the training data. In most real-world scenarios, optimizing a model is far cheaper than gathering additional clean data and retraining the algorithm. From that perspective, data scientists should make sure the model has been properly optimized and regularized before considering collecting additional data.

Typically, the first thing to check when a deep learning algorithm is underperforming is whether it is using the entire training dataset. Very often, data scientists are shocked to find that a misbehaving model is only using a fraction of the training data. At that point, a logical step is to increase the capacity of the model (the number of potential hypotheses it can formulate) by adding extra layers and additional hidden units per layer. Another idea to explore in that scenario is to optimize the model's hyperparameters. If none of those ideas work, then it might be time to consider gathering more training data.
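A minimal sketch of that capacity-first progression, reusing the feed-forward baseline from earlier: the widths and depths below are assumptions, and in practice you would adjust them while watching validation error.

```python
# Increasing model capacity (width and depth) before reaching for more data.
# Layer sizes and counts are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(hidden_units=64, hidden_layers=2, input_dim=100, num_classes=10):
    """Feed-forward baseline whose capacity is controlled by width and depth."""
    model = keras.Sequential([layers.Input(shape=(input_dim,))])
    for _ in range(hidden_layers):
        model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

small = build_mlp(hidden_units=64, hidden_layers=2)    # starting point
larger = build_mlp(hidden_units=256, hidden_layers=4)  # more capacity if underfitting
```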


The process of enriching a training dataset can be cost-prohibitive in many scenarios. To mitigate that, data scientists should implement a data wrangling pipeline that is constantly labeling new records. Semi-supervised learning strategies can also help incorporate unlabeled records into the training dataset.
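One common semi-supervised pattern (one option among many, not prescribed by the article) is pseudo-labeling: a model trained on the labeled portion labels the unlabeled records, and only its confident predictions are added to the training set. The sketch below assumes an already trained Keras classifier named `model`, a NumPy array `x_unlabeled`, and an arbitrary 0.95 confidence threshold.

```python
# A minimal pseudo-labeling sketch.
# Assumes `model` is a trained classifier with softmax outputs and
# `x_unlabeled` is a NumPy array of unlabeled records.
import numpy as np

probs = model.predict(x_unlabeled)                 # class probabilities per record
confidence = probs.max(axis=1)
pseudo_labels = probs.argmax(axis=1)

keep = confidence > 0.95                           # arbitrary confidence threshold
x_pseudo, y_pseudo = x_unlabeled[keep], pseudo_labels[keep]

# The confidently labeled records can then be appended to the labeled training set:
# x_train = np.concatenate([x_train, x_pseudo])
# y_train = np.concatenate([y_train, y_pseudo])
```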

The imperative question in scenarios that require extra training data is always: how much data? Assuming the composition of the training dataset doesn't vary drastically with new records, we can estimate the appropriate size of the new training dataset by monitoring its correlation with the generalization error. A basic principle in that situation is to grow the training dataset on a logarithmic scale, for example by doubling the number of instances each time. In some cases, we can improve the training dataset simply by creating variations of it using noise generation models or regularization techniques such as Bagging (read my recent article about Bagging).
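A minimal sketch of that doubling schedule: train on progressively larger subsets and track validation performance to see when additional data stops helping. The helper `build_mlp` and the arrays `x_train`, `y_train`, `x_val`, `y_val` are assumptions carried over from the earlier sketches.

```python
# Growing the training set on a logarithmic schedule and monitoring
# generalization. Assumes x_train/y_train/x_val/y_val exist and build_mlp()
# is the helper from the capacity sketch above.
size = 1_000                                   # assumed starting subset size
results = []
while size <= len(x_train):
    model = build_mlp()
    model.fit(x_train[:size], y_train[:size], epochs=10, verbose=0)
    _, val_acc = model.evaluate(x_val, y_val, verbose=0)
    results.append((size, val_acc))
    size *= 2                                  # double the number of instances each time

for size, val_acc in results:
    print(f"{size:>7} examples -> validation accuracy {val_acc:.3f}")
```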

Building machine learning solutions is a constant trial-and-error exercise. Recent techniques such as neural architecture search are definitely helping to address some of the challenges of model selection and dataset sizing, but they still require a lot of work before they are widely adopted. For now, selecting the right model and the right training dataset remains one of the biggest challenges data scientists face when building machine learning solutions in the real world.

