Back-propagation in a convolution layer with numpy implementation
source link: https://numbersmithy.com/back-propagation-in-a-convolution-layer-with-numpy-implementation/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
The goal
In a previous post we implemented 2D and 3D convolutions using
numpy
. That is one major building block of a convolution neural
network (CNN). However, to train a new CNN one also needs to
implement error back-propagation, which will be the topic of this post.
Recap on convolution layer and notations
Before going into the back-propagation, it is necessary to understand what the forward computation is doing. Below is a brief recapture on the forward pass and some notations.
For a convolution layer which is the l
th layer in the CNN, we denote:
- f(l): height and width of the convolution kernel/filter.
- n(l)c: number of filters in the convolution layer.
- n(l−1)H×n(l−1)W×n(l−1)c: size of the input to the convolution layer, i.e. activation of layer l−1, as the product of height (n(l−1)H), width (n(l−1)W) and number of channels (n(l−1)c).
- n(l)H×n(l)W×n(l)c: size of the output of the convolution layer.
- w(l,k): the kth convolution kernel/filter of layer l: w(l,k)∈Rf(l)×f(l).
- z(l,k): convolution result from the kth filter: z(l,k)∈Rn(l)H×n(l)W.
- z(l): stacked convolution results from all filters: z(l)∈Rn(l)H×n(l)W×n(l)c.
- b(l,k): bias term for the kth filter in the layer: b(l,k)∈R.
- b(l): bias term for all the filters in the layer: b(l)∈Rn(l)c.
- a(l,k): result of the activation function for the kth filter in the layer: a(l,k)∈Rn(l)H×n(l)W.
- a(l): result of the activation function for all the filters in the layer: a(l)∈Rn(l)H×n(l)W×n(l)c.
Take for instance the convolution layer shown in Figure 1 below. The input to the layer is denoted a(l−1), which is also the activation of layer l−1. In the case shown in Figure xxx a(l−1) is a 2D array, but recall that in general a(l−1) is a 3D data volume.
Figure 1 Schematic of the convolution computation in a convolution layer.
w(l,k) is the kth convolution kernel/filter in the convolution layer.
The convolution process is given as:
z(l,k)=a(l−1)⊗w(l,k)
where ⊗ is the convolution operator.
Then a bias term b(l,k)∈R is added to the convolution result:
z(l,k)=a(l−1)⊗w(l,k)+b(l,k)
Then, the convolution result is passed to an activation function g():
a(l,k)=g(z(l,k))=g(a(l−1)⊗w(l,k)+b(l,k))
After repeating the same process for all of the n(l)c filters in the layer and stacking up the results, the output of the convolution layer is a(l)∈Rn(l)H×n(l)W×n(l)c.
Finally, we denote the loss function of the CNN as J, and the error term of layer l as:
∂J∂z(l)≡δ(l)
Back-propagation in a 2D convolution layer
Computation of weight gradients
Take again the above example in Figure 1. To simplify the derivations
I’m omitting the filter index k
and assuming that there is only a single
filter in the convolution layer.
Let’s write out the convolution process of z(l)=a(l−1)⊗w(l):
{z1,1=a1,1w1,1+a1,2w1,2+a2,1w2,1+a2,2w2,2z1,2=a1,2w1,1+a1,3w1,2+a2,2w2,1+a2,3w2,2z2,1=a2,1w1,1+a2,2w1,2+a3,1w2,1+a3,2w2,2z2,2=a2,2w1,1+a2,3w1,2+a3,2w2,1+a3,3w2,2 [Eq. 1]
Note that the z
terms are for layer l
and a
terms are from
layer l-1
, I’m omitting the superscripts for brevity. The subscript
indices denote the row,column
positions of the pixels, as shown in
Figure 1.
To update the weights we need to compute their gradients ∂J∂w(l)i,j, where l denotes layer l, i and j denote the row/column indices of a number in the filter.
By applying the chain rule on the 1st element of the weight matrix w1,1 and using Equation 1:
∂J∂w1,1=∂J∂z1,1∂z1,1∂w1,1+∂J∂z1,2∂z1,2∂w1,1+∂J∂z2,1∂z2,1∂w1,1+∂J∂z2,2∂z2,2∂w1,1 [Eq 2]
Note I’m omitting the superscript (l) for all the w and z terms in the above equation.
Notice that ∂J∂z1,1 is δ1,1 by definition, and ∂z1,1∂w1,1=a1,1 from Equation 1. We can similarly replace all other terms in Equation 2 to get:
∂J∂w1,1=δ1,1a1,1+δ1,2a1,2+δ2,1a2,1+δ2,2a2,2
That gives the gradient for the first element in the weight matrix. Repeating the same process we get the gradients for all the elements in weight w(l):
{∂J∂w(l)1,1=δ(l)1,1a(l−1)1,1+δ(l)1,2a(l−1)1,2+δ(l)2,1a(l−1)2,1+δ(l)2,2a(l−1)2,2∂J∂w(l)1,2=δ(l)1,1a(l−1)1,2+δ(l)1,2a(l−1)1,3+δ(l)2,1a(l−1)2,2+δ(l)2,2a(l−1)2,3∂J∂w(l)2,1=δ(l)1,1a(l−1)2,1+δ(l)1,2a(l−1)2,2+δ(l)2,1a(l−1)3,1+δ(l)2,2a(l−1)3,2∂J∂w(l)2,2=δ(l)1,1a(l−1)2,2+δ(l)1,2a(l−1)2,3+δ(l)2,1a(l−1)3,2+δ(l)2,2a(l−1)3,3
I’ve added all superscripts to make it clearer which term is from which layer.
It is a pretty complicated equation set at first look. However, it turns out that it can be expressed more neatly as:
∂J∂w(l)=a(l−1)⊗δ(l)
namely, a convolution between the error term of the layer δ(l) and the input to the layer a(l−1).
Compare this with the gradient computation in an ordinary neural network layer, where
∂J∂θ(l)=δ(l)⋅a(l−1)T
we have a similar structure: gradients of a weight (∂J∂θ(l)) in a layer is proportional to its input (a(l−1)) and the error of the output (δ(l)).
From Equation 3, we see that to compute the gradients for the weights in a layer, we need to know the inputs into the layer (a(l−1)), which we can get by saving a copy during the forward propagation process; and the error term δ(l) of the layer. For the latter we need to propagate the error all the way from the last layer in the network backwards. This is derived in the next part.
Computation of error back propagation
We now derive the process of propagating the error backwards across a convolution layer. Using again the schematic in Figure 1, we are looking for an expression of δ(l−1), given an in-coming error of δ(l).
Recall that the error term is defined as δ(l−1)≡∂J∂z(l−1).
Using the chain rule, we get:
δ(l−1)≡∂J∂z(l−1)=∂J∂a(l−1)∂a(l−1)∂z(l−1)
Note that ∂a(l−1)∂z(l−1) is the derivative of the activation function on z(l−1):
∂a(l−1)∂z(l−1)=g′(z(l−1))
therefore, it is useful to also save a copy of z(l−1) during the forward propagation process, and it is also required to know the derivative of the activation function.
Then we need to work out the ∂J∂a(l−1) term in Equation 4.
Notice that a(l−1) is the input to the convolution layer. Using the chain rule again on the 1st element of a(l−1):
∂J∂a(l−1)1,1=∂J∂z(l)1,1∂z(l)1,1∂a(l−1)1,1=δ(l)1,1w(l)1,1
where ∂J∂z(l)1,1=δ(l)1,1 is from the definition of δ, and ∂z(l)1,1∂a(l−1)1,1=w(l)1,1 is from Equation 1.
Write out all elements in ∂J∂a(l−1), we have:
{∂J∂a(l−1)1,1=δ(l)1,1w(l)1,1∂J∂a(l−1)1,2=δ(l)1,1w(l)1,2+δ(l)1,2w(l)1,1∂J∂a(l−1)1,3=δ(l)1,2w(l)1,2∂J∂a(l−1)2,1=δ(l)1,1w(l)2,1+δ(l)2,1w(l)1,1⋯⋯∂J∂a(l−1)3,3=δ(l)2,2w(l)2,2
It is another monstrous equation set. Fortunately, it can be expressed more neatly again as a convolution:
∂J∂a(l−1)=δ(l)⊗fRot180(w(l))
where:
- ⊗f denotes a “full mode” convolution. See the Convolution modes section in this post for an illustration. In a nutshell, in a “full mode” convolution we compute a dot product between the convolution kernel and the underlying data subset whenever they have any overlap, and as a result, the result from a “full mode” convolution has a larger size than the input. This is also the case as shown in Figure 1.
- Rot180() is a function that rotates a matrix by 180 degree, or equivalently, flips the matrix horizontally and vertically. Note that in the case of 3D convolution, this does not flip the channel dimension.
Given Equation 5 and Equation 4, we get the formula for error back propagation across a convolution layer:
δ(l−1)=δ(l)⊗fRot180(w(l))⊙g′(z(l−1))
Again, it is interesting to note a similar structure in the back propagation in an ordinary neural network:
δ(l−1)=θ(l)T⋅δ(l)⊙g′(z(l−1))
Back-propagation in a 3D convolution layer
The derivations in the above section have made a few simplifications:
- We assume that the input is a 2D array.
- We assume that there is only 1 filter in the convolution layer.
This section will build on top of the previous section and generalize it to cases when the input is a 3D data volume and the convolution layer has more than 1 filters.
Computation of weight gradients
As a concrete example, Figure 2 below shows a convolution layer where:
- the input is a 3D data volume: a(l−1)∈R3×3×3.
- there are 4 filters in the convolution layer, each is a 3D array: w(l,k)∈R2×2×3,k=1,2,3,4.
- for each filter, the convolution produces a 2D slab: z(l,k)∈R2×2,k=1,2,3,4.
- After stacking up all 4 convolution results, the total convolution result is z(l)∈R2×2×4.
Figure 2 Schematic a convolution layer with 3D input and 4 filters.
Equation 3 in the above section shows that to get the gradients of filter weights in a 2D convolution with a single filter, we do a convolution between the input (a(l−1)) and the error (δ(l)).
In the case of 3D convolution, it is noticed that there exists a branching-off of the “data flow”: each filter w(l,k) produces a separate convolution result that is independent to all other filters. Consequently, the weight gradients are computed independently for each filter, using the shared input a(l−1) and the corresponding error term δ(l,k):
∂J∂w(l,k)=a(l−1)⊗δ(l,k),k=1,2,3,4
Note that the shape of a(l−1) is 3×3×3, that of δ(l,k) is 2×2×1. Therefore the convolution is done 3 times, each for a different channel. This way, we recover the correct shape of the filter: w(l,k)∈R2×2×3.
Computation of error back propagation
Previously, we have shown in Equation 6 that in a 2D convolution layer with a single filter, the back propagated error across a convolution layer consists of 3 parts:
- the error of the layer output δ(l),
- the (rotated) corresponding filter Rot180(w(l)), and
- the derivative of the input activation function g′(z(l−1)).
In the case of 3D convolution with multiple filters, the forward pass is a “diverging” flow pattern (see Figure 2), therefore, the backward propagation is a “converging” flow: errors from all filters contribute to the error term of δ(l−1):
δ(l−1)=4∑k=1[δ(l,k)⊗fRot180(w(l,k))⊙g′(z(l−1))]
Again, notice that the shape of δ(l,k) is 2×2×1, that of w(l,k) is 2×2×3. Therefore, the convolution is performed 3 times, recovering the channel dimension of length 3.
Computation of bias gradients
The bias update is relatively easier, because the bias term is added into the equation after the convolution.
For a convolution layer with n(l)c filters, the bias term is a vector of length n(l)c: b(l)∈Rn(l)c.
Given
z(l,k)=a(l−1)⊗w(l,k)+b(l,k)
the bias term in filter k contributes to all the elements in the error term of δ(l,k), therefore, the derivative wrt to bias is given as:
∂J∂b(l,k)=∑i∑j∂J∂z(l,k)i,j∂z(l,k)i,j∂b(l,k)=∑i∑jδ(l,k)i,j
numpy implementation – serial version
We have shown numpy
implementations of 2D and 3D convolutions. One
new type of computation that has not been explicitly covered is the
"full mode" convolution, whose numpy
implementation will be covered
first. Then I show the function that computes the gradient weights and
one that computes the error back propagation.
In this section, the functions are serial, i.e. they handle a single input sample at a time. The next section shows vectorized versions — functions handle multiple input samples at once.
All the codes can be found in this repo.
Full mode convolution
Code first:
def fullConv3D(var, kernel, stride): '''Full mode 3D convolution using stride view. Args: var (ndarray): 2d or 3d array to convolve along the first 2 dimensions. kernel (ndarray): 2d or 3d kernel to convolve. If <var> is 3d and <kernel> is 2d, create a dummy dimension to be the 3rd dimension in kernel. Keyword Args: stride (int): stride along the 1st 2 dimensions. Default to 1. Returns: conv (ndarray): convolution result. Note that the kernel is not filpped inside this function. ''' stride = int(stride) ny, nx = var.shape[:2] ky, kx = kernel.shape[:2] # interleave 0s var2 = interLeave(var, stride-1, stride-1) # pad boundaries nout, pad_left, pad_right = compSize(ny, ky, stride) var2 = padArray(var2, pad_left, pad_right) # convolve conv = conv3D3(var2, kernel, stride=1, pad=0) return conv
The interLeave()
function inserts empty rows/columns filled up 0s
into an array. This is meaningful only when the stride is greater
than 1. During the forward propagation process, if the convolution
stride is 2
, then 1 row/column is skipped and a smaller convolution
result is created. During the back propagation stage, we use a full
mode convolution to propagate the error backwards (see Equation 6 and
Equation 8). In order to get back the original shape, we need to add
back those skipped rows/columns. This is what interLeave()
does.
In addition to these extra rows/columns in the interior of the array,
we also need to pad the exterior edges. The number of rows/columns to
pad are computed using the compSize()
function. Then padArray()
is
called to do the 0-padding.
Lastly, we use the strided-view trick again to perform a convolution
using the conv3D3()
function.
The interLeave()
, compSize()
functions are given
below. padArray()
and conv3D3()
have already been introduced in
this post. (NOTE that the conv3D3()
function is the serial
version in the above post).
def interLeave(arr, sy, sx): '''Interleave array with rows/columns of 0s. Args: arr (ndarray): input 2d or 3d array to interleave in the first 2 dimensions. sy (int): number of rows to interleave. sx (int): number of columns to interleave. Returns: result (ndarray): input <arr> array interleaved with 0s. E.g. arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] interLeave(arr, 1, 2) -> [[1, 0, 0, 2, 0, 0, 3], [0, 0, 0, 0, 0, 0, 0], [4, 0, 0, 5, 0, 0, 6], [0, 0, 0, 0, 0, 0, 0], [7, 0, 0, 8, 0, 0, 9]] ''' ny, nx = arr.shape[:2] shape = (ny+sy*(ny-1), nx+sx*(nx-1))+arr.shape[2:] result = np.zeros(shape) result[0::(sy+1), 0::(sx+1), ...] = arr return result def compSize(n, f, s): '''Compute the shape of a full convolution result Args: n (int): length of input array x. f (int): length of kernel. s (int): stride. Returns: nout (int): lenght of output array y. pad_left (int): number padded to the left in a full convolution. pad_right (int): number padded to the right in a full convolution. E.g. x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] f = 3, s = 2. A full convolution is done on [*, *, 0], [0, 1, 2], [2, 3, 4], ..., [6, 7, 8], [9, 10, *]. Where * is missing outside of the input domain. Therefore, the full convolution y has length 6. pad_left = 2, pad_right = 1. ''' nout = 1 pad_left = f-1 pad_right = 0 idx = 0 # index of the right end of the kernel while True: idx_next = idx+s win_left = idx_next-f+1 if win_left <= n-1: nout += 1 idx = idx+s else: break pad_right = idx-n+1 return nout, pad_left, pad_right
Compute weight gradients
Code first:
def computeGradients(self, delta, act): '''Compute gradients of cost wrt filter weights Args: delta (ndarray): errors in filter ouputs. act (ndarray): activations fed into filter. Returns: grads (ndarray): gradients of filter weights. grads_bias (ndarray): 1d array, gradients of biases. The theoretical equation of gradients of filter weights is: \partial J / \partial W^{(l)} = a^{(l-1)} \bigotimes \delta^{(l)} where: J : cost function of network. W^{(l)} : weights in filter. a^{(l-1)} : activations fed into filter. \bigotimes : convolution in valid mode. \delta^{(l)} : errors in the outputs from the filter. Computation in practice is more complicated than the above equation. ''' nc_out = delta.shape[-1] # number of channels in outputs nc_in = act.shape[-1] # number of channels in inputs grads = np.zeros_like(self.filters) for ii in range(nc_out): deltaii = np.take(delta, ii, axis=-1) gii = grads[ii] for jj in range(nc_in): actjj = act[:, :, jj] gij = conv3D3(actjj, deltaii, stride=1, pad=0) gii[:, :, jj] += gij grads[ii] = gii # gradient clip gii = np.clip(gii, -self.clipvalue, self.clipvalue) grads_bias = np.sum(delta, axis=(0, 1)) # 1d return grads, grads_bias
A few points to note:
- This is defined as a class method, since we going to use it to build a CNN class later.
- The
delta
argument is the (\delta^{(l)}) term,act
is the (a^{(l-1)}) term. Both are assumed to be 2D(h, w)
or 3D(h, w, c)
. - As explained in previous section, the gradient for each filter is
computed independently, thus the outer
for
loop across the number of filtersnc_out
. - For each filter, a convolution is performed for each channel of the
input array, thus the inner
for
loop across the channel dimensionnc_in
. - We also clip the computed gradients within a pre-specified range:
gii = np.clip(gii, -self.clipvalue, self.clipvalue)
. - The bias gradients are computed according to Equation 9.
Compute error back propagation
Code first:
def backPropError(self, delta_in, z): '''Back-propagate errors Args: delta_in (ndarray): delta from the next layer in the network. z (ndarray): weighted sum of the current layer. Returns: result (ndarray): delta of the current layer. The theoretical equation for error back-propagation is: \delta^{(l)} = \delta^{(l+1)} \bigotimes_f Rot(W^{(l+1)}) \bigodot f'(z^{(l)}) where: \delta^{(l)} : error of layer l, defined as \partial J / \partial z^{(l)}. \bigotimes_f : convolution in full mode. Rot() : is rotating the filter by 180 degrees, i.e. a kernel flip. W^{(l+1)} : weights of layer l+1. \bigodot : Hadamard (elementwise) product. f() : activation function of layer l. z^{(l)} : weighted sum in layer l. Computation in practice is more complicated than the above equation. ''' # number of channels of input to layer l weights nc_pre = z.shape[-1] # number of channels of output from layer l weights nc_next = delta_in.shape[-1] result = np.zeros_like(z) # loop through channels in layer l for ii in range(nc_next): # flip the kernel kii = self.filters[ii, ::-1, ::-1, ...] deltaii = delta_in[:, :, ii] # loop through channels of input for jj in range(nc_pre): slabij = fullConv3D(deltaii, kii[:, :, jj], self.stride) result[:, :, jj] += slabij result = result*dReLU(z) return result
A few points to note:
- Again, this is defined as a class method.
- The
delta_in
argument is the (\delta^{(l)}) term, andz
is the (z^{(l-1)}) term. - As explained in previous section, errors from all filters contribute
to the error at the input side of the convolution layer, thus the outer
for
loop across the filters. - For each filter, we flip the kernel horizontally and vertically
kii = self.filters[ii, ::-1, ::-1, ...]
, before performing the full mode convolution using thefullConv3D()
function introduced above. - The activation function used is ReLU function, and its derivative
is defined in the
dReLU()
function, given below:
def dReLU(x): '''Gradient of ReLU''' return 1.*(x > 0)
numpy implementation – vectorized version
In this section, the functions are vectorized, i.e. they handle multiple samples at a time. The dimensions are sample images are assumed to be in this order:
(m, h, w, c)
where:
m
: number of samples in the batch.h
: height.w
: width.c
: channels.
All the codes can be found in this repo.
Full mode convolution
Code first:
def fullConv3D(var, kernel, stride): '''Full mode 3D convolution using stride view. Args: var (ndarray): 4d array to convolve along the last 3 dimensions. Shape of the array is (m, hi, wi, ci). Where m: number of records. hi, wi: height and width of input image. ci: channels of input image. kernel (ndarray): 4d filter to convolve with. Shape is (f1, f2, ci, co). where f1, f2: filter size. co: number of filters. Keyword Args: stride (int): stride along the mid 2 dimensions. Default to 1. Returns: conv (ndarray): convolution result. Note that the kernel is not filpped inside this function. ''' if np.ndim(var) != 4: raise Exception("<var> dimension should be 4.") if np.ndim(kernel) != 4: raise Exception("<kernel> dimension should be 4.") stride = int(stride) m, hi, wi, ci = var.shape f1, f2, ci, co = kernel.shape # interleave 0s var2 = interLeave(var, stride-1, stride-1) # pad boundaries nout, pad_left, pad_right = compSize(hi, f1, stride) var2 = padArray(var2, pad_left, pad_right) # transpose kernel kernel = np.transpose(kernel, [0, 1, 3, 2]) # convolve conv = conv3D3(var2, kernel, stride=1, pad=0) return conv
Note that this looks quite similar to the serial version. But the array shapes are different.
Also note that the conv3D3()
function used here is the vectorized
version in this post.
And a different interLeave()
implementation is needed, as given
below:
def interLeave(arr, sy, sx): '''Interleave array with rows/columns of 0s. Args: arr (ndarray): 4d array to interleave in the mid 2 dimensions. sy (int): number of rows to interleave. sx (int): number of columns to interleave. Returns: result (ndarray): input <arr> array interleaved with 0s. E.g. arr[0,:,:,0] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] interLeave(arr, 1, 2)[0,:,:,0] -> [[1, 0, 0, 2, 0, 0, 3], [0, 0, 0, 0, 0, 0, 0], [4, 0, 0, 5, 0, 0, 6], [0, 0, 0, 0, 0, 0, 0], [7, 0, 0, 8, 0, 0, 9]] ''' m, hi, wi, ci = arr.shape shape = (m, hi+sy*(hi-1), wi+sx*(wi-1), ci) result = np.zeros(shape) result[:, 0::(sy+1), 0::(sx+1), :] = arr return result
Compute weight gradients
Code first:
def computeGradients(self, delta, act): '''Compute gradients of cost wrt filter weights Args: delta (ndarray): errors in filter ouputs. act (ndarray): activations fed into filter. Returns: grads (ndarray): gradients of filter weights. grads_bias (ndarray): 1d array, gradients of biases. The theoretical equation of gradients of filter weights is: \partial J / \partial W^{(l)} = a^{(l-1)} \bigotimes \delta^{(l)} where: J : cost function of network. W^{(l)} : weights in filter. a^{(l-1)} : activations fed into filter. \bigotimes : convolution in valid mode. \delta^{(l)} : errors in the outputs from the filter. ''' grads = conv3Dgrad(act, delta) # gradient clip grads = np.clip(grads, -self.clipvalue, self.clipvalue) grads_bias = np.sum(delta, axis=(0, 1, 2)) # 1d return grads, grads_bias def conv3Dgrad(act, delta): '''Compute gradients of convolution layer filters Args: act (ndarray): activation array as input to the filters. With shape (m, hi, wi, ci). Where m: number of records. hi, wi: height and width of the input into the filters. ci: channels of input into the filters. delta (ndarray): error term as output from the filters. With shape (m, ho, wo, co): ho, wo: height and width of the output from the filters. co: number of filters in the convolution layer. Returns: conv (ndarray): gradients of filters, defined as: \partial J / \partial W^{(l)} = \sum[ a^{(l-1)} \bigotimes \delta^{(l)}] NOTE that the gradients are summed across the m records in <act> and <delta>. ''' m, hi, wi, ci = act.shape m, ho, wo, co = delta.shape view = asStride(act, (ho, wo), stride=1) #conv = np.einsum('myxfgz,mfgo->yxzo', view, delta) conv = np.tensordot(view, delta, axes=([0, 3, 4], [0, 1, 2])) return conv
A few points to note:
- The
delta
argument is assumed to have a shape of(m, ho, wo, co)
, and theact
argument is assumed to have a shape of(m, hi, wi, ci)
. - The core gradient computation is done inside the
conv3Dgrad()
function, where thenp.tensordot()
function is called to achieve the computations in Equation 7 for all filters, for all samples, in one go. This powerful vectorization is made possible by the strided-view trick (see this post for explanations). Thenp.einsum()
function can also be used to give the same result, however, it turns out thatnp.einsum()
is currently limited to single-thread, andtensordot()
can utilize multiple threads, thus the preference of latter here. - Similar as in the serial version, we clip the computed gradients, and then compute the bias gradients.
Compute error back propagation
Code first:
def backPropError(self, delta_in, z): '''Back-propagate errors Args: delta_in (ndarray): delta from the next layer in the network. z (ndarray): weighted sum of the current layer. Returns: result (ndarray): delta of the current layer. The theoretical equation for error back-propagation is: \delta^{(l)} = \delta^{(l+1)} \bigotimes_f Rot(W^{(l+1)}) \bigodot f'(z^{(l)}) where: \delta^{(l)} : error of layer l, defined as \partial J / \partial z^{(l)}. \bigotimes_f : convolution in full mode. Rot() : is rotating the filter by 180 degrees, i.e. a kernel flip. W^{(l+1)} : weights of layer l+1. \bigodot : Hadamard (elementwise) product. f() : activation function of layer l. z^{(l)} : weighted sum in layer l. ''' # filp kernel kernel_f = self.filters[::-1,::-1,:,:] result = fullConv3D(delta_in, kernel_f, stride=self.stride) result = result*dReLU(z) return result
A few points to note:
- The
delta_in
argument is assumed to have a shape of(m, ho, wo, co)
, and thez
argument is assumed to have a shape of(m, hi, wi, ci)
. - The filters are flipped horizontally and vertically before putting
into the full mode convolution, performed by the
fullConv3D()
function introduced above. - Again, the activation function is assumed to be ReLU, and the same
derivative function
dReLU()
as in the serial version is used.
Summary
In this post we covered the mathematical derivations of back-propagation across a convolution layer. Most importantly, the error at the input side of a convolution layer is given by:
δ(l−1)=n(l)c∑k=1[δ(l,k)⊗fRot180(w(l,k))⊙g′(z(l−1))]
With the above, we can propagate the error from the end of a convolutional neural network to all the hidden layers.
Then the weights of the convolution filters can be computed using
∂J∂w(l,k)=a(l−1)⊗δ(l,k),k=1,2,⋯,n(l)c
Then the convolution layer can be trained using, for instance, a gradient descent method.
Then numpy
implementations are given, first for a serial version and then a vectorized version. These will be used in a later post when a complete CNN implementation is introduced.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK