50 Deep Learning Interview Questions, Part 1/2

Practice Problems and Solutions on Deep Learning.

Below are 25 questions on deep learning that can help you test your knowledge and serve as a good review resource for interview preparation.

1. Why is it necessary to introduce non-linearities in a neural network?

Solution: Otherwise, we would have a composition of linear functions, which is itself a linear function, so the whole network collapses to a linear model. Such a model has a much smaller effective number of parameters and is therefore limited in the complexity it can model.
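To see this concretely, here is a minimal NumPy sketch (the shapes are arbitrary, chosen only for illustration) showing that two stacked linear layers without an activation collapse into a single linear map:

```python
# Two stacked linear layers with no non-linearity are equivalent to one linear layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input vector
W1 = rng.normal(size=(8, 4))       # first "layer"
W2 = rng.normal(size=(3, 8))       # second "layer"

deep = W2 @ (W1 @ x)               # two layers, no activation in between
shallow = (W2 @ W1) @ x            # one equivalent linear layer

print(np.allclose(deep, shallow))  # True: the extra depth added no expressive power
```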

2. Describe two ways of dealing with the vanishing gradient problem in a neural network.

Solution:

  • Using ReLU activation instead of sigmoid.
  • Using Xavier initialization.
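A minimal sketch of both ideas together, assuming PyTorch: ReLU activations between layers, with Xavier (Glorot) initialization applied to each linear layer.

```python
# ReLU activations plus Xavier initialization, two common remedies for vanishing gradients.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)  # keeps activation variance roughly stable across layers
        nn.init.zeros_(layer.bias)
```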

3. What are some advantages of using a CNN (convolutional neural network) rather than a DNN (dense neural network) in an image classification task?

Solution: While both models can capture the relationships between nearby pixels, CNNs have the following properties:

  • It is translation invariant — because the same filter weights are applied across the whole image, the exact location of a pattern is irrelevant to the filter.
  • It is less likely to overfit — the typical number of parameters in a CNN is much smaller than that of a DNN (see the sketch after this list).
  • It gives us a better understanding of the model — we can look at the filters’ weights and visualize what the network “learned”.
  • It has a hierarchical nature — it learns complex patterns by composing simpler ones.
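As a rough illustration of the parameter-count argument (assuming PyTorch; the layer sizes are arbitrary), compare a single 3×3 convolution with a dense layer producing a comparable output from a flattened 32×32 RGB image:

```python
# Parameter counts: a small conv layer vs. a dense layer with a comparable output size.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)  # 16*3*3*3 + 16 = 448 parameters
dense = nn.Linear(3 * 32 * 32, 16 * 30 * 30)                     # ~44 million parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv), count(dense))                                 # 448 vs. 44251200
```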

4. Describe two ways to visualize features of a CNN in an image classification task.

Solution:

  • Input occlusion — cover part of the input image and see which regions affect the classification the most. For instance, given a trained image classification model, occlude different parts of a dog image: if the unoccluded image is classified as a dog with 98% probability while an occluded version reaches only 65%, the occluded region contains information that is important for the classification.
  • Activation Maximization — the idea is to create an artificial input image that maximizes the target response, using gradient ascent (sketched below).
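Here is the activation-maximization sketch referenced above, assuming PyTorch; `model` and `target_class` are hypothetical placeholders for a trained classifier that outputs class scores and for the class whose response we want to maximize.

```python
# Gradient ascent on the input image to maximize the score of a chosen class.
import torch

def activation_maximization(model, target_class, steps=200, lr=0.1, shape=(1, 3, 224, 224)):
    model.eval()
    image = torch.zeros(shape, requires_grad=True)    # start from a blank image
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        score = model(image)[0, target_class]         # the response we want to maximize
        (-score).backward()                           # gradient ascent via a negated objective
        optimizer.step()
    return image.detach()
```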

5. Is trying the following learning rates: 0.1, 0.2, …, 0.5 a good strategy to optimize the learning rate?

Solution: No. It is recommended to search on a logarithmic scale, e.g. 0.0001, 0.001, 0.01, 0.1, since the learning rate's effect spans several orders of magnitude and a linear grid explores only one of them.
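For instance, log-spaced candidate learning rates can be generated with NumPy (the range here is arbitrary):

```python
# Sample candidate learning rates on a logarithmic scale.
import numpy as np

candidates = np.logspace(-4, -1, num=4)   # array([0.0001, 0.001, 0.01, 0.1])
print(candidates)
```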

6. Suppose you have a NN with 3 layers and ReLU activations. What will happen if we initialize all the weights with the same value? What if we only had 1 layer (i.e. linear/logistic regression)?

Solution: If we initialize all the weights to the same value, we will not be able to break the symmetry; i.e., every weight receives the same gradient update, so all the neurons in a layer stay identical and the network cannot learn. In the 1-layer scenario, however, the cost function is convex (linear/logistic regression), so the weights will converge to the optimal point regardless of their initial value (although convergence may be slower).
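A quick sketch of the symmetry problem, assuming PyTorch (the constant 0.5 and the layer sizes are arbitrary): with identical initial weights, every hidden unit receives an identical gradient, so the units never differentiate.

```python
# With all weights equal, every row of the first layer's gradient is identical.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)          # every weight and bias gets the same value

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(net(x), y)
loss.backward()

print(net[0].weight.grad)              # all rows identical: the symmetry is never broken
```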

7. Explain the idea behind the Adam optimizer.

Solution: Adam, or adaptive moment estimation, combines two ideas to improve convergence: per-parameter adaptive updates, which give faster convergence, and momentum, which helps to avoid getting stuck in saddle points.
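A minimal sketch of the Adam update rule for a single parameter vector (standard default hyperparameters assumed; this is an illustration, not a full optimizer implementation):

```python
# One Adam step: momentum (first moment) plus per-parameter scaling (second moment).
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```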

8. Compare batch, mini-batch and stochastic gradient descent.

Solution: Batch gradient descent computes the gradient using the entire dataset, mini-batch gradient descent uses a small sample of data points, and SGD updates the weights using a single data point at a time. The trade-off is between how precise the gradient estimate is and how much data we can keep in memory. Moreover, using mini-batches rather than the entire dataset has a regularizing effect, since it adds random noise at each step.
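A minimal sketch of one epoch of mini-batch gradient descent in NumPy; `grad_fn` is a hypothetical function returning the gradient on a batch. Setting `batch_size=len(X)` recovers batch gradient descent, and `batch_size=1` recovers SGD.

```python
# One epoch of mini-batch gradient descent.
import numpy as np

def run_epoch(w, X, y, grad_fn, batch_size=32, lr=0.01):
    idx = np.random.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - lr * grad_fn(w, X[batch], y[batch])
    return w
```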

9. What is data augmentation? Give examples.

Solution: Data augmentation is a technique for increasing the amount of training data by applying manipulations to the original data. For images, for instance, one can rotate the image, reflect (flip) it, or add Gaussian blur.
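For example, the augmentations above can be expressed with torchvision (assuming it is available); the exact parameters are arbitrary:

```python
# A small augmentation pipeline: rotation, horizontal flip, Gaussian blur.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])
# Passed as the `transform` argument of a dataset, these are applied on the fly
# to each training image, so every epoch sees slightly different inputs.
```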

10. What is the idea behind GANs?

Solution: GANs, or generative adversarial networks, consist of two networks (D, G), where D is the “discriminator” network and G is the “generator” network. The goal is to create data — images, for instance — that are indistinguishable from real ones. Suppose we want to generate images of cats. The network G generates images, and the network D classifies images according to whether they are real cats or not. The cost function of G is constructed so that it tries to “fool” D, i.e. so that D classifies G's outputs as real cats.

11. What are the advantages of using Batchnorm?

Solution: Batchnorm accelerates the training process. It also has a regularizing effect, as a byproduct of the noise introduced by the per-batch statistics.

12. What is multi-task learning? When should it be used?

Solution: Multi-task learning is useful when we have a small amount of data for some task and would benefit from training the model on a large dataset from another, related task. The parameters of the models are shared — either in a “hard” way (i.e. literally the same parameters) or in a “soft” way (i.e. through regularization / a penalty term in the cost function).

13. What is end-to-end learning? Give a few of its advantages.

Solution: End-to-end learning usually refers to a model that takes the raw data and directly outputs the desired outcome, with no intermediate tasks or feature engineering. It has several advantages, among them: there is no need to hand-craft features, and it generally leads to a model with lower bias.

14. What happens if we use a ReLU activation and then a sigmoid as the final layer?

Solution: Since ReLU always outputs a non-negative value, the input to the final sigmoid is never negative, so its output is always at least 0.5 — the network will constantly predict the same class for all inputs!

15. How can we solve the exploding gradient problem?

Solution: A simple solution to the exploding gradient problem is gradient clipping — taking the gradient to be ±M when its absolute value is bigger than M, where M is some large number.
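A minimal sketch of gradient clipping inside a PyTorch training step, with M = 5.0 and the layer sizes chosen arbitrarily for illustration:

```python
# Clip gradients after backward() and before the optimizer step.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)   # element-wise clip to [-M, M]
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)    # alternative: rescale the whole gradient
optimizer.step()
```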

16. Is it necessary to shuffle the training data when using batch gradient descent?

Solution: No, because the gradient is calculated at each epoch using the entire training data, so shuffling does not make a difference.

17. When using mini-batch gradient descent, why is it important to shuffle the data?

Solution: Otherwise, suppose we train a NN classifier with two classes, A and B, and that all the examples of class A appear before all the examples of class B. Without shuffling, each mini-batch would contain only a single class, so the gradient estimates would be biased and the network would first fit class A and then drift toward class B instead of learning both together.

18. Describe some hyperparameters for transfer learning.

Solution: How many layers of the pre-trained model to keep, how many new layers to add, and how many of the layers to freeze.
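A minimal transfer-learning sketch, assuming torchvision: keep a pre-trained ResNet-18 backbone, freeze its layers, and replace the head with a new trainable classifier (10 output classes chosen arbitrarily).

```python
# Freeze a pre-trained backbone and add a new classification head.
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                    # freeze all pre-trained layers
model.fc = nn.Linear(model.fc.in_features, 10)     # new, trainable head for 10 classes
```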

19. Is dropout used on the test set?

Solution: No! Dropout is used only during training; it is a regularization technique that is applied in the training process and is turned off at test time.
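A quick sketch, assuming PyTorch, showing that dropout is active in training mode and becomes the identity in evaluation mode:

```python
# Dropout behaves differently in train and eval modes.
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()
print(drop(x))   # roughly half the entries zeroed, the survivors scaled by 1/(1-p)
drop.eval()
print(drop(x))   # identity: dropout is switched off at test time
```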

20. Explain why dropout in a neural network acts as a regularizer.

Solution: There are several (related) explanations of why dropout works. It can be seen as a form of model averaging — at each step we “turn off” part of the model and average over the resulting sub-networks. It also adds noise, which naturally has a regularizing effect. Finally, it leads to more sparsity in the weights and essentially prevents co-adaptation of neurons in the network.

21. Give examples in which a many-to-one RNN architecture is appropriate.

Solution: A few examples are sentiment analysis and gender recognition from speech.

22. When can’t we use a BiLSTM? Explain what assumption has to be made.

Solution: In any bi-directional model, we assume that we have access to the next elements of the sequence at a given “time” step. This is the case for text data (e.g. sentiment analysis, translation), but it is not the case for real-time time-series data, where future values are not yet available.

23. True/false: adding L2 regularization to an RNN can help with the vanishing gradient problem.

Solution: False! Adding L2 regularization shrinks the weights towards zero, which can actually make the vanishing gradient problem worse in some cases.

24. Suppose the training error/cost is high and that the validation cost/error is almost equal to it. What does it mean? What should be done?

Solution: This indicates underfitting. One can add more parameters, increase the complexity of the model, or lower the regularization.

25. Describe how L2 regularization can be explained as a sort of weight decay.

Solution: Suppose our cost function is C(w), and that we add a penalization term c|w|². When using gradient descent (with step size 1 for simplicity), the iterations will look like

w = w - grad(C)(w) - 2cw = (1 - 2c)w - grad(C)(w)

In this equation, the weight is multiplied by a factor smaller than 1, so it "decays" toward zero at every step before the usual gradient update is applied.
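A small numerical check of this equivalence in NumPy (the learning rate, penalty coefficient, and gradient values are arbitrary; here the step size is kept explicit rather than set to 1):

```python
# One gradient step on C(w) + c*|w|^2 equals decaying w by (1 - 2*lr*c) plus a plain gradient step.
import numpy as np

lr, c = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
grad_C = np.array([0.5, 0.5, 0.5])                 # hypothetical gradient of the unregularized cost

step_l2 = w - lr * (grad_C + 2 * c * w)            # gradient step on the penalized cost
step_decay = (1 - 2 * lr * c) * w - lr * grad_C    # "weight decay" form of the same update

print(np.allclose(step_l2, step_decay))            # True
```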

