
Machine Learning Foundations Homework 4

source link: https://lasttillend.github.io/machine/learning/2020/05/14/MLF-HW4.html

May 14, 2020 • 唐涵

Website of Machine Learning Foundations by Hsuan-Tien Lin: https://www.csie.ntu.edu.tw/~htlin/mooc/.

Question 1

Overfitting and Deterministic Noise

Deterministic noise depends on H, as some models approximate f better than others. Assume H′⊂H and that f is fixed. In general (but not necessarily in all cases), if we use H′ instead of H, how does deterministic noise behave?

  • In general, deterministic noise will decrease.
  • In general, deterministic noise will increase.
  • In general, deterministic noise will be the same.
  • If $d_{vc}(\mathcal{H}') \le \frac{1}{2} d_{vc}(\mathcal{H})$, deterministic noise will increase, else it will decrease.
  • If $d_{vc}(\mathcal{H}') \le \frac{1}{2} d_{vc}(\mathcal{H})$, deterministic noise will decrease, else it will increase.

Sol: In general, deterministic noise will increase because our hypothesis set gets simpler, which makes it less likely to obtain a good g (recall that deterministic noise is the difference between the best hypothesis h∗ in the given hypothesis set and the target function f).

Question 2

Regularization With Polynomials

Polynomial models can be viewed as linear models in a space Z, under a nonlinear transform Φ:X→Z. Here, Φ transforms the scalar x into a vector z of Legendre polynomials, z=(1,L1(x),L2(x),…,LQ(x)). Our hypothesis set will be expressed as a linear combination of these polynomials,

$$\mathcal{H}_Q = \left\{ h \,\middle|\, h(x) = \mathbf{w}^T \mathbf{z} = \sum_{q=0}^{Q} w_q L_q(x) \right\},$$

where $L_0(x) = 1$.

Consider the following hypothesis set defined by the constraint:

$$\mathcal{H}(Q, c, Q_o) = \{ h \mid h(x) = \mathbf{w}^T \mathbf{z} \in \mathcal{H}_Q;\ w_q = c \text{ for } q \ge Q_o \},$$

which of the following statements is correct?

  • $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,1,4) = \mathcal{H}_3$
  • $\mathcal{H}(10,1,3) \cap \mathcal{H}(10,1,4) = \mathcal{H}_1$
  • $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4) = \mathcal{H}_2$
  • $\mathcal{H}(10,0,3) \cup \mathcal{H}(10,0,4) = \mathcal{H}_4$
  • none of the other choices

Sol:

$\mathcal{H}(10,0,3)$: $w_q = 0$ for all $q \ge 3$, so it contains polynomials of the form $\sum_{q=0}^{2} w_q L_q(x)$;

$\mathcal{H}(10,0,4)$: $w_q = 0$ for all $q \ge 4$, so it contains polynomials of the form $\sum_{q=0}^{3} w_q L_q(x)$.

Therefore, $\mathcal{H}(10,0,3) \cap \mathcal{H}(10,0,4) = \mathcal{H}_2$ is the answer.
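
As a quick numerical illustration (not part of the original solution; the coefficient values below are arbitrary), one can check with numpy.polynomial.legendre that a hypothesis with $w_q = 0$ for $q \ge 3$ is determined entirely by its first three Legendre coefficients, i.e., it is effectively a member of $\mathcal{H}_2$:

import numpy as np
from numpy.polynomial import legendre

w = np.zeros(11)           # coefficients w_0, ..., w_10
w[:3] = [0.5, -1.0, 2.0]   # a member of H(10, 0, 3): w_q = 0 for q >= 3

x = np.linspace(-1, 1, 5)
print(legendre.legval(x, w))        # h(x) = sum_q w_q L_q(x)
print(legendre.legval(x, w[:3]))    # identical values: h lies in H_2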

Question 3

Regularization and Weight Decay

Consider the augmented error

$$E_{aug}(\mathbf{w}) = E_{in}(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w},$$

with some λ>0.

If we want to minimize the augmented error $E_{aug}(\mathbf{w})$ by gradient descent with learning rate $\eta$, which of the following is the correct update rule?

  • $\mathbf{w}(t+1) \longleftarrow \mathbf{w}(t) - \eta \nabla E_{in}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow \left(1 - \frac{2\eta\lambda}{N}\right)\mathbf{w}(t) - \eta \nabla E_{in}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow \left(1 + \frac{2\eta\lambda}{N}\right)\mathbf{w}(t) - \eta \nabla E_{in}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow \left(1 + \frac{\eta\lambda}{N}\right)\mathbf{w}(t) - \eta \nabla E_{in}(\mathbf{w}(t))$
  • $\mathbf{w}(t+1) \longleftarrow \left(1 - \frac{\eta\lambda}{N}\right)\mathbf{w}(t) - \eta \nabla E_{in}(\mathbf{w}(t))$

Sol: Since

$$\nabla E_{aug}(\mathbf{w}) = \nabla E_{in}(\mathbf{w}) + \frac{\lambda}{N} \cdot 2\mathbf{w},$$

we have the update rule

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta \nabla E_{aug}(\mathbf{w}(t)) = \left(1 - \frac{2\eta\lambda}{N}\right)\mathbf{w}(t) - \eta \nabla E_{in}(\mathbf{w}(t)).$$
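
A minimal sketch of this update rule in code (not from the homework; the data, $\eta$, and $\lambda$ below are arbitrary, and $E_{in}$ is taken to be the usual mean squared error just for illustration):

import numpy as np

# A toy E_in: mean squared error on synthetic data.
def grad_Ein(w, X, y):
    N = len(y)
    return (2 / N) * X.T @ (X @ w - y)

# One gradient-descent step on E_aug, i.e. the weight-decay update above.
def weight_decay_step(w, X, y, eta=0.1, lambd=1.0):
    N = len(y)
    return (1 - 2 * eta * lambd / N) * w - eta * grad_Ein(w, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

w = np.zeros(3)
for _ in range(1000):
    w = weight_decay_step(w, X, y)
print(w)   # shrunk towards 0 relative to the unregularized least-squares fit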

Question 4

Let $\mathbf{w}_{lin}$ be the optimal solution for plain-vanilla linear regression and $\mathbf{w}_{reg}(\lambda)$ be the optimal solution for the formulation above. Select all the correct statements below:

  • $\|\mathbf{w}_{reg}(\lambda)\| = \|\mathbf{w}_{lin}\|$ for any $\lambda > 0$
  • $\|\mathbf{w}_{reg}(\lambda)\|$ is a non-decreasing function of $\lambda$ for $\lambda \ge 0$
  • $\|\mathbf{w}_{reg}(\lambda)\|$ is a non-increasing function of $\lambda$ for $\lambda \ge 0$
  • $\|\mathbf{w}_{reg}(\lambda)\| > \|\mathbf{w}_{lin}\|$ for any $\lambda > 0$
  • none of the other choices

Sol:

From the constrained minimization of Ein:

$$\min_{\mathbf{w}} E_{in}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} \le C,$$

which is equivalent to the unconstrained minimization of $E_{aug}$, we can deduce that if $\mathbf{w}_{lin}$ satisfies the constraint $\mathbf{w}^T \mathbf{w} \le C$, then $\mathbf{w}_{reg} = \mathbf{w}_{lin}$; otherwise, $\|\mathbf{w}_{reg}\| < \|\mathbf{w}_{lin}\|$.

As for the monotonicity, increasing $\lambda$ corresponds to decreasing $C$, which tightens the constraint on $\mathbf{w}$; hence $\|\mathbf{w}_{reg}(\lambda)\|$ is non-increasing in $\lambda$.
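
A small numerical check of this monotonicity claim on synthetic data (the data and $\lambda$ grid below are arbitrary placeholders; the regularizer here is $\lambda \mathbf{w}^T\mathbf{w}$ rather than $\frac{\lambda}{N}\mathbf{w}^T\mathbf{w}$, which does not affect the monotonicity):

import numpy as np
from numpy.linalg import norm, pinv

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=50)

for lambd in [0, 0.01, 0.1, 1, 10, 100]:
    w_reg = pinv(X.T @ X + lambd * np.eye(4)) @ X.T @ y
    print(lambd, norm(w_reg))   # the norm shrinks as lambda grows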

Question 5

Leave-One-Out Cross-Validation

You are given the data points $(-1, 0), (\rho, 1), (1, 0)$, $\rho \ge 0$, and a choice between two models: constant [$h_0(x) = b_0$] and linear [$h_1(x) = a_1 x + b_1$]. For which value of $\rho$ would the two models be tied using leave-one-out cross-validation with the squared error measure?

  • $\sqrt{\sqrt{3}+4}$
  • $\sqrt{\sqrt{3}-1}$
  • $\sqrt{9+4\sqrt{6}}$
  • $\sqrt{9-\sqrt{6}}$
  • none of the other choices

Sol:

$$E_{CV, h_0} = \frac{1}{3}\sum_{n=1}^{3} e_n = \frac{1}{3}\sum_{n=1}^{3} \left[ h_0^-(x_n) - y_n \right]^2.$$

For the first model, the constant model, the best $b_0$ we can choose over the training set (the two remaining points) is the average of their $y$ values, i.e.,

$$h_0^-(x_1) = 1/2, \quad h_0^-(x_2) = 0, \quad h_0^-(x_3) = 1/2.$$

Then substitute into $E_{CV, h_0}$:

$$E_{CV, h_0} = \frac{1}{3}\left[ (1/2 - 0)^2 + (0 - 1)^2 + (1/2 - 0)^2 \right] = \frac{1}{3}\left[ \frac{1}{4} + 1 + \frac{1}{4} \right] = \frac{1}{2}.$$

As for the second model, the best line using two points is exactly the line through them, and hence

$$h_1^-(x_1) = \frac{1}{\rho - 1}(x_1 - 1) = \frac{-2}{\rho - 1}, \quad h_1^-(x_2) = 0, \quad h_1^-(x_3) = \frac{1}{\rho + 1}(x_3 + 1) = \frac{2}{\rho + 1}.$$

Then

$$E_{CV, h_1} = \frac{1}{3}\sum_{n=1}^{3} \left[ h_1^-(x_n) - y_n \right]^2 = \frac{1}{3}\left[ \left(\frac{-2}{\rho - 1}\right)^2 + 1 + \left(\frac{2}{\rho + 1}\right)^2 \right].$$

Equating $E_{CV, h_0}$ and $E_{CV, h_1}$ and solving for $\rho$ gives $\rho = \sqrt{9 + 4\sqrt{6}}$.
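
A quick numerical check of the tie (simply re-evaluating the two leave-one-out errors derived above at the claimed $\rho$):

import numpy as np

rho = np.sqrt(9 + 4 * np.sqrt(6))

# Leave-one-out errors of the constant and linear models derived above.
E_cv_h0 = ((1/2 - 0)**2 + (0 - 1)**2 + (1/2 - 0)**2) / 3
E_cv_h1 = ((-2 / (rho - 1))**2 + (0 - 1)**2 + (2 / (rho + 1))**2) / 3

print(E_cv_h0, E_cv_h1)   # both equal 0.5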

Question 6

Learning Principles

In Problems 6-7, suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night baseball game. (Assume there are no ties.) You keenly watch each Monday and, to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week’s prediction, a payment of NTD 1,000 is required. Which of the following statements is true?

  • To make sure that at least one person receives correct predictions on all 5 games from him, at least 64 letters are sent before the fifth game.
  • If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, the sender should target to begin with at least 5 people.
  • To make sure that at least one person receives correct predictions on all 5 games from him, after the first letter ‘predicts’ the outcome of the first game, the sender should target at least 16 people with the second letter.
  • none of the other choices.

Sol: If we just randomly guess the result, then the probability of getting the correct result is $1/2$. So the probability of making 5 consecutive correct ‘predictions’ by random guessing is $1/32$.

The ‘strategy’ is as follows:

  1. Before the first game: mail out 32 letters, 16 predicting that A wins and the other 16 predicting that B wins, so half of them will be correct.
  2. Before the second game: among the 16 recipients whose first prediction was correct, send 8 of them letters predicting that A wins and the other 8 letters predicting that B wins.
  3. Repeat the same process until the fifth game.

After the fifth game, there will be exactly one person who has received correct predictions on all 5 games. We then mail that person the letter requesting payment.

So the correct answer is the third statement.

Question 7

Suppose the cost of printing and mailing each letter is NTD 10, and the sender sends out the minimum number of letters. How much money can be made if the above ‘fraud’ succeeds once, that is, if one of the recipients does send him NTD 1,000 to receive the prediction of the 6th game?

  • NTD 400
  • NTD 340
  • NTD 460
  • NTD 430
  • NTD 370

Sol: The total number of letters the sender needs to send is

32+16+8+4+2+1=63,

where the last ‘1’ is the letter with the payment request attached. So the money that can be made is

1000 − 630 = 370 (NTD).
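
A one-line arithmetic check (all values taken from the problem statement and the letter-count above):

letters = 32 + 16 + 8 + 4 + 2 + 1   # prediction letters plus the final payment letter
profit = 1000 - 10 * letters
print(letters, profit)              # 63, 370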

Question 8

For Problems 8-10, we consider the following. In our credit card example, the bank starts with some vague idea of what constitutes a good credit risk. So as customers x1,x2,⋯,xN arrive, the bank applies its vague idea to approve credit cards for some of these customers based on a formula a(x). Then, only those who get credit cards are monitored to see if they default or not. For simplicity, suppose that the first N=10,000 customers were given credit cards by the credit approval function a(x). Now that the bank knows the behavior of these customers, it comes to you to improve its algorithm for approving credit. The bank gives you the data (x1,y1),…,(xN,yN). Before you look at the data, you do mathematical derivations and come up with a credit approval function. You now test it on the data and, to your delight, obtain perfect prediction.

What is M, the size of your hypothesis set?

  • We have no idea about it

Sol: Since we come up with one particular credit approval function before looking at the data, this is the only function we are considering, so the size of the hypothesis set is M=1.

Question 9

With such an M, what does the Hoeffding bound say about the probability that the true average error rate of g is worse than 1% for N=10,000?

Sol: By the Hoeffding inequality (with a single, fixed hypothesis),

$$P\left( |E_{in}(g) - E_{out}(g)| > \epsilon \right) \le 2\exp(-2\epsilon^2 N).$$

With $\epsilon = 0.01$ and $N = 10{,}000$, the bound evaluates to $2\exp(-2\epsilon^2 N) = 2e^{-2} \approx 0.2707$.
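
A direct numerical evaluation of the expression above (just a sanity check of the arithmetic):

import numpy as np

N, eps = 10_000, 0.01
bound = 2 * np.exp(-2 * eps**2 * N)   # 2 * exp(-2) ≈ 0.2707
print(bound)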

Question 10

You assure the bank that you have got a system g for approving credit cards for new customers, which is nearly error-free. Your confidence is given by your answer to the previous question. The bank is thrilled and uses your g to approve credit for new customers. To their dismay, more than half their credit cards are being defaulted on. Assume that the customers that were sent to the old credit approval function and the customers that were sent to your g are indeed i.i.d. from the same distribution, and the bank is lucky enough (so the ‘bad luck’ that “the true error of g is worse than 1%” does not happen).

Select all the true claims for this situation.

  • By applying a(x) XOR g(x) to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
  • By applying a(x) OR g(x) to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
  • By applying a(x) AND g(x) to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.
  • By applying a(x) XNOR g(x) to approve credit for new customers, the performance of the overall credit approval system can be improved with the guarantee provided by the previous problem.

Sol: The data we have are not clean, in the sense that they were filtered by the function a: we only observed the label y for customers that a approved. Our prediction function g is therefore reliable only on customers who would also be approved by a. Hence we should use the conjunction a(x) AND g(x) to approve credit.

Question 11

Virtual Examples and Regularization

Consider linear regression with virtual examples. That is, we add $K$ virtual examples $(\tilde{\mathbf{x}}_1, \tilde{y}_1), (\tilde{\mathbf{x}}_2, \tilde{y}_2), \dots, (\tilde{\mathbf{x}}_K, \tilde{y}_K)$ to the training data set, and solve

$$\min_{\mathbf{w}} \frac{1}{N+K}\left( \sum_{n=1}^{N} (y_n - \mathbf{w}^T \mathbf{x}_n)^2 + \sum_{k=1}^{K} (\tilde{y}_k - \mathbf{w}^T \tilde{\mathbf{x}}_k)^2 \right).$$

We will show that using some ‘special’ virtual examples, which was claimed to be a possible way to combat overfitting in Lecture 9, is related to regularization, another possible way to combat overfitting discussed in Lecture 10. Let $\tilde{X} = [\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \dots\ \tilde{\mathbf{x}}_K]^T$ and $\tilde{\mathbf{y}} = [\tilde{y}_1, \tilde{y}_2, \dots, \tilde{y}_K]^T$.

What is the optimal $\mathbf{w}$ for the optimization problem above, assuming that all the inversions exist?

  • $(X^T X + \tilde{X}^T \tilde{X})^{-1}(X^T \mathbf{y} + \tilde{X}^T \tilde{\mathbf{y}})$
  • $(X^T X + \tilde{X}^T \tilde{X})^{-1}(\tilde{X}^T \tilde{\mathbf{y}})$
  • $(X^T X)^{-1}(\tilde{X}^T \tilde{\mathbf{y}})$
  • $(X^T X)^{-1}(X^T \mathbf{y} + \tilde{X}^T \tilde{\mathbf{y}})$
  • none of the other choices

Sol:

$$\frac{1}{N+K}\left( \sum_{n=1}^{N} (y_n - \mathbf{w}^T \mathbf{x}_n)^2 + \sum_{k=1}^{K} (\tilde{y}_k - \mathbf{w}^T \tilde{\mathbf{x}}_k)^2 \right) = \frac{1}{N+K}\left[ \|\mathbf{y} - X\mathbf{w}\|^2 + \|\tilde{\mathbf{y}} - \tilde{X}\mathbf{w}\|^2 \right].$$

Taking the gradient and setting it to 0:

$$\frac{1}{N+K}\left[ (-X)^T \cdot 2(\mathbf{y} - X\mathbf{w}) + (-\tilde{X})^T \cdot 2(\tilde{\mathbf{y}} - \tilde{X}\mathbf{w}) \right] = 0$$
$$\Rightarrow (X^T X + \tilde{X}^T \tilde{X})\mathbf{w} = X^T\mathbf{y} + \tilde{X}^T\tilde{\mathbf{y}}$$
$$\Rightarrow \mathbf{w} = (X^T X + \tilde{X}^T \tilde{X})^{-1}(X^T\mathbf{y} + \tilde{X}^T\tilde{\mathbf{y}}).$$
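
A small sanity check of this closed form (random placeholder matrices, not the homework data): stacking the real and virtual examples and solving ordinary least squares should give the same $\mathbf{w}$:

import numpy as np
from numpy.linalg import pinv, lstsq

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))    # 'real' examples
y = rng.normal(size=30)
Xt = rng.normal(size=(5, 3))    # 'virtual' examples
yt = rng.normal(size=5)

# Closed form derived above vs. plain least squares on the stacked data.
w_closed = pinv(X.T @ X + Xt.T @ Xt) @ (X.T @ y + Xt.T @ yt)
w_stacked = lstsq(np.vstack([X, Xt]), np.concatenate([y, yt]), rcond=None)[0]

print(np.allclose(w_closed, w_stacked))   # True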

Question 12

For what $\tilde{X}$ and $\tilde{\mathbf{y}}$ will the solution of this linear regression be equal to

$$\mathbf{w}_{reg} = \arg\min_{\mathbf{w}} \frac{\lambda}{N}\|\mathbf{w}\|^2 + \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2?$$

Sol: Take $\tilde{X} = \sqrt{\lambda}\, I$ and $\tilde{\mathbf{y}} = \mathbf{0}$; then

$$\|\tilde{\mathbf{y}} - \tilde{X}\mathbf{w}\|^2 = \lambda\|\mathbf{w}\|^2.$$

Thus, in this case these two solutions are equal.
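
A quick numerical check of this equivalence (random placeholder data; $\lambda$ chosen arbitrarily):

import numpy as np
from numpy.linalg import pinv

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
lambd = 0.5

X_tilde = np.sqrt(lambd) * np.eye(3)   # K = d virtual examples
y_tilde = np.zeros(3)

w_virtual = pinv(X.T @ X + X_tilde.T @ X_tilde) @ (X.T @ y + X_tilde.T @ y_tilde)
w_ridge = pinv(X.T @ X + lambd * np.eye(3)) @ X.T @ y

print(np.allclose(w_virtual, w_ridge))   # True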

Question 13

Experiment with Regularized Linear Regression and Validation

Consider regularized linear regression (also called ridge regression) for classification.

$$\mathbf{w}_{reg} = \arg\min_{\mathbf{w}} \frac{\lambda}{N}\|\mathbf{w}\|^2 + \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2.$$

Run the algorithm on the following data set, hw4_train.dat, as $\mathcal{D}$,

and the following set, hw4_test.dat, for evaluating $E_{out}$.

Because the data sets are for classification, please consider only the 0/1 error for all the problems below.

Let $\lambda = 10$. Which of the following gives the corresponding $E_{in}$ and $E_{out}$?

  • $E_{in} = 0.015, E_{out} = 0.020$
  • $E_{in} = 0.030, E_{out} = 0.015$
  • $E_{in} = 0.020, E_{out} = 0.010$
  • $E_{in} = 0.035, E_{out} = 0.020$
  • $E_{in} = 0.050, E_{out} = 0.045$

Sol:

import pandas as pd
import numpy as np
from numpy.linalg import pinv
def load_data(filename):
    df = pd.read_csv(filename, header=None, sep='\s+')
    X_df, y_df = df.iloc[:, :-1], df.iloc[:, -1]
    X, y = X_df.to_numpy(), y_df.to_numpy()
    n, _ = X.shape
    X = np.c_[np.ones((n, 1)), X]
    return X, y
# Read in training and test data
X, y = load_data('hw4_train.dat')
X_test, y_test = load_data('hw4_test.dat')
def ridge_reg(X, y, lambd):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ]
        lambd: float
    Returns:
        w: ndarray of shape = [d + 1, ]
    """
    _, d = X.shape
    w = pinv(X.T @ X + lambd * np.eye(d)) @ X.T @ y
    return w
w_ridge = ridge_reg(X, y, lambd=10)
w_ridge

Output:

array([-0.93238149,  1.04618645,  1.046171  ])

def calc_err(X, y, w):
    """
    Args:
        X: ndarray of shape = [N, d + 1]
        y: ndarray of shape = [N, ] 
        w: ndarray of shape = [d + 1, ]
    Returns:
        err: float
    """
    
    y_hat = np.sign(X @ w)
    err = np.mean(y_hat != y)
    return err
E_in = calc_err(X, y, w_ridge)
E_out = calc_err(X_test, y_test, w_ridge)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.05
E_out: 0.045

Question 14

Among $\log_{10}\lambda = \{2, 1, 0, -1, \dots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{in}$? Compute $\lambda$ and its corresponding $E_{in}$ and $E_{out}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

lambd_vals = np.logspace(start=2, stop=-10, num=13)
E_in_list = []
E_out_list = []

for lambd in lambd_vals:
    w = ridge_reg(X, y, lambd)
    E_in = calc_err(X, y, w)
    E_out = calc_err(X_test, y_test, w)
    E_in_list.append(E_in)
    E_out_list.append(E_out)
for lambd, E_in, E_out in zip(lambd_vals, E_in_list, E_out_list):
    print(f"lambda={lambd:>6}, E_in={E_in:>6}, E_out={E_out:>6}")

Output:

lambda= 100.0, E_in=  0.24, E_out= 0.261
lambda=  10.0, E_in=  0.05, E_out= 0.045
lambda=   1.0, E_in= 0.035, E_out=  0.02
lambda=   0.1, E_in= 0.035, E_out= 0.016
lambda=  0.01, E_in=  0.03, E_out= 0.016
lambda= 0.001, E_in=  0.03, E_out= 0.016
lambda=0.0001, E_in=  0.03, E_out= 0.016
lambda= 1e-05, E_in=  0.03, E_out= 0.016
lambda= 1e-06, E_in= 0.035, E_out= 0.016
lambda= 1e-07, E_in=  0.03, E_out= 0.015
lambda= 1e-08, E_in= 0.015, E_out=  0.02
lambda= 1e-09, E_in= 0.015, E_out=  0.02
lambda= 1e-10, E_in= 0.015, E_out=  0.02

Thus, the answer is $\log_{10}\lambda = -8$, with $E_{in} = 0.015$ and $E_{out} = 0.020$.

Question 15

Among $\log_{10}\lambda = \{2, 1, 0, -1, \dots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{out}$? Compute $\lambda$ and its corresponding $E_{in}$ and $E_{out}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol: From the table above, the minimum $E_{out}$ is $0.015$, attained at $\log_{10}\lambda = -7$, with $E_{in} = 0.030$.

Question 16

Now split the given training examples into the first 120 examples for $\mathcal{D}_{train}$ and the remaining 80 for $\mathcal{D}_{val}$.

Ideally, you should randomly do the 120/80 split. Because the given examples are already randomly permuted, however, we will use a fixed split for the purpose of this problem.

Run the algorithm on $\mathcal{D}_{train}$ to get $g_\lambda^-$, and validate $g_\lambda^-$ with $\mathcal{D}_{val}$.

Among $\log_{10}\lambda = \{2, 1, 0, -1, \dots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{train}(g_\lambda^-)$? Compute $\lambda$ and the corresponding $E_{train}(g_\lambda^-)$, $E_{val}(g_\lambda^-)$, and $E_{out}(g_\lambda^-)$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

# Split train/val= 120/80
X_train, X_val = X[:120], X[120:]
y_train, y_val = y[:120], y[120:]


E_train_list = []
E_val_list = []
E_out_list = []

for lambd in lambd_vals:
    w = ridge_reg(X_train, y_train, lambd)
    E_train = calc_err(X_train, y_train, w)
    E_val = calc_err(X_val, y_val, w)
    E_out = calc_err(X_test, y_test, w)
    E_train_list.append(E_train)
    E_val_list.append(E_val)
    E_out_list.append(E_out)
for lambd, E_train, E_val, E_out in zip(lambd_vals, E_train_list, E_val_list, E_out_list):
    print(f"lambda={lambd:>6}, E_train={E_train:>6f}, E_val={E_val:>6f}, E_out={E_out:>6f}")

Output:

lambda= 100.0, E_train=0.341667, E_val=0.412500, E_out=0.414000
lambda=  10.0, E_train=0.075000, E_val=0.125000, E_out=0.080000
lambda=   1.0, E_train=0.033333, E_val=0.037500, E_out=0.028000
lambda=   0.1, E_train=0.033333, E_val=0.037500, E_out=0.022000
lambda=  0.01, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 0.001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda=0.0001, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-05, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-06, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-07, E_train=0.033333, E_val=0.037500, E_out=0.021000
lambda= 1e-08, E_train=0.000000, E_val=0.050000, E_out=0.025000
lambda= 1e-09, E_train=0.000000, E_val=0.100000, E_out=0.038000
lambda= 1e-10, E_train=0.008333, E_val=0.125000, E_out=0.040000

So the answer is $\log_{10}\lambda = -8$, with $E_{train} = 0$, $E_{val} = 0.05$, and $E_{out} = 0.025$.

Question 17

Among $\log_{10}\lambda = \{2, 1, 0, -1, \dots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{val}(g_\lambda^-)$? Compute $\lambda$ and the corresponding $E_{train}(g_\lambda^-)$, $E_{val}(g_\lambda^-)$, and $E_{out}(g_\lambda^-)$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol: From the table above, the closest answer is $\log_{10}\lambda = 0$ (i.e., $\lambda = 1$), with $E_{train} \approx 0.033$, $E_{val} = 0.0375$, and $E_{out} = 0.028$.

Question 18

Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_\lambda$. Compute $E_{in}(g_\lambda)$ and $E_{out}(g_\lambda)$, and then select the closest answer.

  • $E_{in}(g_\lambda) = 0.035, E_{out}(g_\lambda) = 0.020$
  • $E_{in}(g_\lambda) = 0.055, E_{out}(g_\lambda) = 0.020$
  • $E_{in}(g_\lambda) = 0.015, E_{out}(g_\lambda) = 0.020$
  • $E_{in}(g_\lambda) = 0.045, E_{out}(g_\lambda) = 0.030$
  • $E_{in}(g_\lambda) = 0.025, E_{out}(g_\lambda) = 0.030$

Sol:

lambd = 1
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.035
E_out: 0.02

Question 19

Now split the given training examples into five folds, the first 40 being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted.

Among $\log_{10}\lambda = \{2, 1, 0, -1, \dots, -8, -9, -10\}$, what is the $\lambda$ with the minimum $E_{cv}$, where $E_{cv}$ comes from the five folds defined above? Compute $\lambda$ and the corresponding $E_{cv}$, and then select the closest answer. Break the tie by selecting the largest $\lambda$.

Sol:

num_of_folds = 5
E_cv_list = []

for lambd in lambd_vals:
    sum_of_cv_error = 0
    for k in range(num_of_folds):
        k_th_val_fold = slice(k * 40, (k + 1) * 40)
        k_th_train_fold = np.r_[slice(0, k * 40), slice((k + 1) * 40, 200)]
        X_val, y_val = X[k_th_val_fold], y[k_th_val_fold]
        X_train, y_train = X[k_th_train_fold], y[k_th_train_fold]
        w = ridge_reg(X_train, y_train, lambd)
        sum_of_cv_error += calc_err(X_val, y_val, w)
    E_cv = sum_of_cv_error / num_of_folds
    E_cv_list.append(E_cv)

for lambd, E_cv in zip(lambd_vals, E_cv_list):
    print(f"lambda={lambd:>6}, E_cv={E_cv:>6f}")

Output:

lambda= 100.0, E_cv=0.290000
lambda=  10.0, E_cv=0.060000
lambda=   1.0, E_cv=0.035000
lambda=   0.1, E_cv=0.035000
lambda=  0.01, E_cv=0.035000
lambda= 0.001, E_cv=0.035000
lambda=0.0001, E_cv=0.035000
lambda= 1e-05, E_cv=0.035000
lambda= 1e-06, E_cv=0.035000
lambda= 1e-07, E_cv=0.035000
lambda= 1e-08, E_cv=0.030000
lambda= 1e-09, E_cv=0.050000
lambda= 1e-10, E_cv=0.050000

So the answer is $\log_{10}\lambda = -8$, with $E_{cv} = 0.03$.

Question 20

Run the algorithm with the optimal $\lambda$ of the previous problem on the whole $\mathcal{D}$ to get $g_\lambda$. Compute $E_{in}(g_\lambda)$ and $E_{out}(g_\lambda)$, and then select the closest answer.

  • $E_{in}(g_\lambda) = 0.025, E_{out}(g_\lambda) = 0.020$
  • $E_{in}(g_\lambda) = 0.035, E_{out}(g_\lambda) = 0.030$
  • $E_{in}(g_\lambda) = 0.005, E_{out}(g_\lambda) = 0.010$
  • $E_{in}(g_\lambda) = 0.015, E_{out}(g_\lambda) = 0.020$
  • $E_{in}(g_\lambda) = 0.045, E_{out}(g_\lambda) = 0.020$

Sol:

lambd = 1e-8
w = ridge_reg(X, y, lambd)
E_in = calc_err(X, y, w)
E_out = calc_err(X_test, y_test, w)

print(f"E_in: {E_in}\nE_out: {E_out}")

Output:

E_in: 0.015
E_out: 0.02
