Implementing Your Own TensorFlow (Part 2): Gradient Computation and Backpropagation

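1. Implement, for each operation, the gradient of its output with respect to its inputs
2. Use the chain rule to compute the gradient of the loss function with respect to each node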

$$Loss(x, y, z) = z(x+y)$$

$$Loss(x, y, z) = g(z, f(x, y))$$

$$\frac{\partial Loss}{\partial x} = \frac{\partial Loss}{\partial g}\frac{\partial g}{\partial f} \frac{\partial f}{\partial x}$$

1. $\frac{\partial Loss}{\partial g} = 1$

2. $\frac{\partial g}{\partial f} = z = 6$ (we can likewise compute $\frac{\partial g}{\partial z} = x + y = 5$). From this we obtain $\frac{\partial Loss}{\partial f} = \frac{\partial Loss}{\partial g}\frac{\partial g}{\partial f} = 1 \times z = 6$

3. $\frac{\partial f}{\partial x} = 1$ (likewise $\frac{\partial f}{\partial y} = 1$). From this we obtain $\frac{\partial Loss}{\partial x} = \frac{\partial Loss}{\partial g}\frac{\partial g}{\partial f}\frac{\partial f}{\partial x} = 1 \times z \times 1 = 6$
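The three steps above can be sketched in plain Python. This is only an illustration, not part of the framework: the function names `loss`, `backprop` and `numeric_grad` are hypothetical, and the values $x = 2$, $y = 3$ are an assumption (the text only fixes $x + y = 5$ and $z = 6$). A central finite difference is used as an independent check of the chain-rule result.

```python
def loss(x, y, z):
    """Loss(x, y, z) = z * (x + y), i.e. g(z, f(x, y)) with f = x + y, g = z * f."""
    return z * (x + y)


def backprop(x, y, z):
    """Apply the chain rule node by node, from the output back to the inputs."""
    f = x + y
    dloss_dg = 1.0             # step 1: dLoss/dg = 1
    dloss_df = dloss_dg * z    # step 2: dg/df = z
    dloss_dz = dloss_dg * f    # step 2: dg/dz = f = x + y
    dloss_dx = dloss_df * 1.0  # step 3: df/dx = 1
    dloss_dy = dloss_df * 1.0  # step 3: df/dy = 1
    return dloss_dx, dloss_dy, dloss_dz


def numeric_grad(fn, args, i, eps=1e-6):
    """Central finite difference of fn wrt its i-th argument."""
    lo, hi = list(args), list(args)
    lo[i] -= eps
    hi[i] += eps
    return (fn(*hi) - fn(*lo)) / (2 * eps)


# Assumed inputs: only x + y = 5 and z = 6 are given in the text.
x, y, z = 2.0, 3.0, 6.0
print(backprop(x, y, z))                  # (6.0, 6.0, 5.0)
print(numeric_grad(loss, (x, y, z), 0))   # approximately 6
```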

$$\frac{\partial Loss}{\partial x} = \frac{\partial g}{\partial f}\frac{\partial f}{\partial h}\frac{\partial h}{\partial x} + \frac{\partial g}{\partial f}\frac{\partial f}{\partial l}\frac{\partial l}{\partial x}$$

The node $f$ can be viewed as a function $z = f(x, y)$; what we need to do is compute $\frac{\partial f(x, y)}{\partial x}$ and $\frac{\partial f(x, y)}{\partial y}$.

Gradient computation for the square operation

```python
class Square(Operation):
    ''' Square operation. '''
    # ...

    def compute_gradient(self, grad=None):
        ''' Compute the gradient for square operation wrt input value.

        :param grad: The gradient of other operation wrt the square output.
        :type grad: ndarray.
        '''
        input_value = self.input_nodes[0].output_value

        if grad is None:
            grad = np.ones_like(self.output_value)

        return grad*np.multiply(2.0, input_value)
```
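To check the rule $\frac{\partial x^2}{\partial x} = 2x$ outside the graph classes, here is a minimal standalone sketch. The function `square_grad` is a hypothetical stand-in for `Square.compute_gradient`, and the scalar loss $L = \sum_{ij} g_{ij} x_{ij}^2$ is an assumption chosen only so that the upstream gradient equals `upstream`.

```python
import numpy as np


def square_grad(input_value, grad=None):
    """Gradient of an elementwise square, scaled by the upstream gradient."""
    input_value = np.asarray(input_value, dtype=float)
    if grad is None:
        grad = np.ones_like(input_value)
    return grad * 2.0 * input_value


x = np.array([[1.0, -2.0], [3.0, 0.5]])
upstream = np.array([[1.0, 2.0], [0.5, -1.0]])  # assumed dL/d(x**2)
analytic = square_grad(x, upstream)

# Central finite differences on L = sum(upstream * x**2).
eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    step = np.zeros_like(x)
    step[idx] = eps
    plus = np.sum(upstream * (x + step) ** 2)
    minus = np.sum(upstream * (x - step) ** 2)
    numeric[idx] = (plus - minus) / (2 * eps)

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-5))
```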

Matrix gradients in neural-network backpropagation

$$\frac{\mathrm{d}y}{\mathrm{d}X} = \left[ \begin{matrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} \\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} \end{matrix} \right]$$

Gradient computation for the addition operation

1. First, for $C = A + B$, compute the gradient $\frac{\partial L}{\partial B}$, where $B = \left[ \begin{matrix} b_0 & b_0 \\ b_0 & b_0 \end{matrix} \right]$ is obtained by broadcasting $b$:
$$\frac{\partial L}{\partial B} = \left[ \begin{matrix} \frac{\partial L}{\partial c_{11}} \frac{\partial c_{11}}{\partial b_0} & \frac{\partial L}{\partial c_{12}} \frac{\partial c_{12}}{\partial b_0} \\ \frac{\partial L}{\partial c_{21}} \frac{\partial c_{21}}{\partial b_0} & \frac{\partial L}{\partial c_{22}} \frac{\partial c_{22}}{\partial b_0} \\ \end{matrix} \right] = \left[ \begin{matrix} \frac{\partial L}{\partial c_{11}} \times 1 & \frac{\partial L}{\partial c_{12}} \times 1 \\ \frac{\partial L}{\partial c_{21}} \times 1 & \frac{\partial L}{\partial c_{22}} \times 1 \\ \end{matrix} \right] = \frac{\partial L}{\partial C} = G$$

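2. Compute the gradient of $L$ with respect to $b$, $\frac{\partial L}{\partial b}$. $B$ is a broadcast of $b$: although it is written as a matrix, it is really four copies of $b$, each of which takes part in the addition. Summing all the elements of $\frac{\partial L}{\partial B}$ therefore gives $\frac{\partial L}{\partial b}$.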

The gradient is then:
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\partial L}{\partial c_{ij}}$$
In this example $b$ is a scalar, so in matrix form the gradient can be written as:
$$\frac{\partial L}{\partial b} = \left[ \begin{matrix} 1 & 1 \end{matrix} \right] G \left[ \begin{matrix} 1 \\ 1 \end{matrix} \right]$$

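If $b$ is instead a column vector of length 2, of the form $\left[ \begin{matrix} b_0 \\ b_1 \end{matrix} \right]$, then the columns of $G$ must be added together (that is, each row of $G$ is summed) to obtain a gradient vector with the same shape as $b$: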
$$\frac{\partial L}{\partial b} = \left[ \begin{matrix} \frac{\partial L}{\partial c_{11}} + \frac{\partial L}{\partial c_{12}} \\ \frac{\partial L}{\partial c_{21}} + \frac{\partial L}{\partial c_{22}} \end{matrix} \right]$$

```python
class Add(object):
    # ...

    def compute_gradient(self, grad=None):
        ''' Compute the gradients for this operation wrt input values.

        :param grad: The gradient of other operation wrt the addition output.
        :type grad: number or a ndarray, default value is 1.0.
        '''
        x, y = [node.output_value for node in self.input_nodes]

        if grad is None:
            grad = np.ones_like(self.output_value)

        # Undo broadcasting on x: sum away the leading axes that were added,
        # then sum over the axes where x has size 1.
        grad_wrt_x = grad
        while np.ndim(grad_wrt_x) > len(np.shape(x)):
            grad_wrt_x = np.sum(grad_wrt_x, axis=0)
        for axis, size in enumerate(np.shape(x)):
            if size == 1:
                grad_wrt_x = np.sum(grad_wrt_x, axis=axis, keepdims=True)

        # Same reduction for y.
        grad_wrt_y = grad
        while np.ndim(grad_wrt_y) > len(np.shape(y)):
            grad_wrt_y = np.sum(grad_wrt_y, axis=0)
        for axis, size in enumerate(np.shape(y)):
            if size == 1:
                grad_wrt_y = np.sum(grad_wrt_y, axis=axis, keepdims=True)

        return [grad_wrt_x, grad_wrt_y]
```
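The reduction applied to each input of the addition can be tried on its own. The sketch below factors it into a helper named `unbroadcast`, which is a hypothetical name and not part of the code above, and applies it to an assumed upstream gradient $G$ for the scalar case and the column-vector case discussed earlier.

```python
import numpy as np


def unbroadcast(grad, shape):
    """Sum grad down to `shape`, reversing NumPy broadcasting."""
    grad = np.asarray(grad)
    # Remove leading axes that broadcasting prepended.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Collapse axes along which the input had size 1.
    for axis, size in enumerate(shape):
        if size == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad


G = np.array([[1.0, 2.0], [3.0, 4.0]])  # assumed values for dL/dC

grad_scalar = unbroadcast(G, ())        # b is a scalar: sum of all entries
grad_column = unbroadcast(G, (2, 1))    # b is a 2x1 column: sum of each row

# Matrix form of the scalar case: [1 1] G [1 1]^T
matrix_form = np.ones((1, 2)) @ G @ np.ones((2, 1))

print(grad_scalar)   # 10.0
print(grad_column)   # [[3.] [7.]]
print(matrix_form)   # [[10.]]
```

Note that a one-dimensional NumPy array of shape `(2,)` broadcasts along the last axis, so in that case the helper sums over axis 0 rather than over each row.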

Gradient computation for matrix multiplication

$$C = AB$$

$$G = \frac{\partial L}{\partial C}$$

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \frac{\partial C}{\partial A} = GB^{T}$$

$$\frac{\partial L}{\partial B} = \frac{\partial C}{\partial B} \frac{\partial L}{\partial C}$$

$$\frac{\partial L}{\partial B} = \frac{\partial C}{\partial B} \frac{\partial L}{\partial C} = A^{T}G$$

```python
class MatMul(Operation):
    # ...

    def compute_gradient(self, grad=None):
        ''' Compute and return the gradient for matrix multiplication.

        :param grad: The gradient of other operation wrt the matmul output.
        :type grad: number or a ndarray, default value is 1.0.
        '''
        # Get input values.
        x, y = [node.output_value for node in self.input_nodes]

        # Default gradient wrt the matmul output.
        if grad is None:
            grad = np.ones_like(self.output_value)

        # Gradients wrt inputs.
        dfdx = np.dot(grad, np.transpose(y))
        dfdy = np.dot(np.transpose(x), grad)

        return [dfdx, dfdy]
```
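The two formulas, $GB^{T}$ for the left operand and $A^{T}G$ for the right operand, can be checked numerically. The sketch below is an illustration under stated assumptions: `matmul_grads`, `scalar_loss` and `numeric_grad` are hypothetical helpers, the matrices are random, and the scalar loss $L = \sum_{ij} G_{ij} C_{ij}$ is chosen only so that $\frac{\partial L}{\partial C} = G$.

```python
import numpy as np


def matmul_grads(a, b, grad):
    """Gradients of C = a @ b wrt a and b, given grad = dL/dC."""
    return grad @ b.T, a.T @ grad


rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
G = rng.normal(size=(2, 4))  # assumed upstream gradient dL/dC


def scalar_loss(a, b):
    """L = sum(G * (a @ b)), so that dL/dC equals G."""
    return np.sum(G * (a @ b))


def numeric_grad(fn, m, eps=1e-6):
    """Central finite differences of a scalar function fn wrt matrix m."""
    out = np.zeros_like(m)
    for idx in np.ndindex(m.shape):
        step = np.zeros_like(m)
        step[idx] = eps
        out[idx] = (fn(m + step) - fn(m - step)) / (2 * eps)
    return out


dA, dB = matmul_grads(A, B, G)
num_dA = numeric_grad(lambda a: scalar_loss(a, B), A)
num_dB = numeric_grad(lambda b: scalar_loss(A, b), B)

print(dA.shape, dB.shape)  # (2, 3) (3, 4)
print(np.allclose(dA, num_dA, atol=1e-6), np.allclose(dB, num_dB, atol=1e-6))
```

Each gradient has the same shape as the operand it belongs to, which is a quick way to remember on which side the transpose goes.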