Implementing Your Own TensorFlow (2) - Gradient Computation and Backpropagation


1. Implement, for each operation, the gradient of its output with respect to its inputs.
2. Implement, via the chain rule, the gradient of the loss function with respect to every node in the graph.

$$Loss(x, y, z) = z(x+y)$$

$$Loss(x, y, z) = g(z, f(x, y))$$

$$\frac{\partial Loss}{\partial x} = \frac{\partial Loss}{\partial g}\frac{\partial g}{\partial f} \frac{\partial f}{\partial x}$$

1. $\frac{\partial Loss}{\partial g} = 1$

2. $\frac{\partial g}{\partial f} = z = 6$ (likewise, we can compute $\frac{\partial g}{\partial z} = x + y = 5$). From this, $\frac{\partial Loss}{\partial f} = \frac{\partial Loss}{\partial g}\frac{\partial g}{\partial f} = 1 \times z = 6$

3. $\frac{\partial f}{\partial x} = 1$ (and likewise $\frac{\partial f}{\partial y} = 1$). From this, $\frac{\partial Loss}{\partial x} = \frac{\partial Loss}{\partial g}\frac{\partial g}{\partial f}\frac{\partial f}{\partial x} = 1 \times z \times 1 = 6$
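The chain-rule bookkeeping above can be checked numerically. A minimal sketch: the text only fixes $x + y = 5$ and $z = 6$, so the concrete values $x = 2$, $y = 3$ below are an assumption consistent with those sums.

```python
import numpy as np

# Forward pass for Loss(x, y, z) = z * (x + y), with f = x + y and g = z * f.
# x = 2, y = 3 are assumed values satisfying x + y = 5; z = 6 as in the text.
x, y, z = 2.0, 3.0, 6.0
f = x + y          # 5.0
loss = z * f       # 30.0

# Backward pass via the chain rule.
dloss_dg = 1.0
dloss_df = dloss_dg * z        # dg/df = z      -> 6.0
dloss_dx = dloss_df * 1.0      # df/dx = 1      -> 6.0
dloss_dz = dloss_dg * f        # dg/dz = x + y  -> 5.0

# Sanity check against a finite difference on x.
eps = 1e-6
numeric = (z * ((x + eps) + y) - z * (x + y)) / eps
print(dloss_dx, round(numeric, 3))  # 6.0 6.0
```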

When a variable reaches the loss through multiple paths (here $x$ feeds into both $h$ and $l$, which both feed into $f$), the chain rule sums the contribution of every path:
$$\frac{\partial Loss}{\partial x} = \frac{\partial g}{\partial f}\frac{\partial f}{\partial h}\frac{\partial h}{\partial x} + \frac{\partial g}{\partial f}\frac{\partial f}{\partial l}\frac{\partial l}{\partial x}$$

The $f$ node can be viewed as a function $z = f(x, y)$; all we need to do is compute $\frac{\partial f(x, y)}{\partial x}$ and $\frac{\partial f(x, y)}{\partial y}$.

Gradient of the Square Operation

```python
class Square(Operation):
    ''' Square operation. '''
    # ...
    def compute_gradient(self, grad=None):
        ''' Compute the gradient for square operation wrt input value.

        :param grad: The gradient of other operation wrt the square output.
        :type grad: ndarray.
        '''
        input_value = self.input_nodes[0].output_value
        if grad is None:
            grad = np.ones_like(self.output_value)
        return grad*np.multiply(2.0, input_value)
```
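The local gradient rule used above can be checked in isolation, without the rest of the framework. A small sketch: for $y = x^2$, the local gradient is $2x$, and the upstream gradient is multiplied in.

```python
import numpy as np

# Standalone check of the square-op gradient rule.
x = np.array([1.0, -2.0, 3.0])
upstream = np.ones_like(x)                # default grad, as in compute_gradient
grad = upstream * np.multiply(2.0, x)     # the rule returned by Square

# Finite-difference comparison.
eps = 1e-6
numeric = ((x + eps) ** 2 - x ** 2) / eps
print(np.allclose(grad, numeric, atol=1e-4))  # True
```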

Matrix Gradient Computation in Neural Network Backpropagation

$$\frac{\mathrm{d}y}{\mathrm{d}X} = \left[ \begin{matrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} \\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} \end{matrix} \right]$$

Gradient of the Addition Operation

1. First, for $C = A + B$, compute the gradient $\frac{\partial L}{\partial B}$, where $B = \left[ \begin{matrix} b_0 & b_0 \\ b_0 & b_0 \end{matrix} \right]$ is obtained by broadcasting $b$:
$$\frac{\partial L}{\partial B} = \left[ \begin{matrix} \frac{\partial L}{\partial c_{11}} \frac{\partial c_{11}}{\partial b_0} & \frac{\partial L}{\partial c_{12}} \frac{\partial c_{12}}{\partial b_0} \\ \frac{\partial L}{\partial c_{21}} \frac{\partial c_{21}}{\partial b_0} & \frac{\partial L}{\partial c_{22}} \frac{\partial c_{22}}{\partial b_0} \\ \end{matrix} \right] = \left[ \begin{matrix} \frac{\partial L}{\partial c_{11}} \times 1 & \frac{\partial L}{\partial c_{12}} \times 1 \\ \frac{\partial L}{\partial c_{21}} \times 1 & \frac{\partial L}{\partial c_{22}} \times 1 \\ \end{matrix} \right] = \frac{\partial L}{\partial C} = G$$

2. Then compute the gradient of $L$ with respect to $b$, $\frac{\partial L}{\partial b}$. Since $B$ is a broadcast of $b$, the matrix form is really just $b$ copied four times before the operation, so summing all the entries of $\frac{\partial L}{\partial B}$ gives the value of $\frac{\partial L}{\partial b}$.

The gradient is then:
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\partial L}{\partial c_{ij}}$$
Since $b$ is a scalar in this example, this can be written in matrix form as:
$$\frac{\partial L}{\partial b} = \left[ \begin{matrix} 1 & 1 \end{matrix} \right] G \left[ \begin{matrix} 1 \\ 1 \end{matrix} \right]$$

If $b$ is a column vector of length 2, of the form $\left[ \begin{matrix} b_0 \\ b_0\end{matrix} \right]$, then the columns of $G$ must be added together (a sum along each row) to obtain a gradient vector with the same shape as $b$:
$$\frac{\partial L}{\partial b} = \left[ \begin{matrix} \frac{\partial L}{\partial c_{11}} + \frac{\partial L}{\partial c_{12}} \\ \frac{\partial L}{\partial c_{21}} + \frac{\partial L}{\partial c_{22}} \end{matrix} \right]$$
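Both reductions above can be sketched in NumPy. The entries of $G$ below are illustrative values standing in for $\frac{\partial L}{\partial c_{ij}}$.

```python
import numpy as np

# G plays the role of dL/dC; the entries are illustrative.
G = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Scalar b: sum every entry of G, written as the [1 1] G [1; 1] trick
# from the text.
db_scalar = np.ones((1, 2)) @ G @ np.ones((2, 1))
print(float(db_scalar))               # 10.0, equal to G.sum()

# Column-vector b of length 2: add the columns of G together
# (sum along each row), matching the shape of b.
db_vector = G.sum(axis=1, keepdims=True)
print(db_vector.ravel())              # [3. 7.]
```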

```python
class Add(Operation):
    # ...
    def compute_gradient(self, grad=None):
        ''' Compute the gradients for this operation wrt input values.

        :param grad: The gradient of other operation wrt the addition output.
        :type grad: number or a ndarray, default value is 1.0.
        '''
        x, y = [node.output_value for node in self.input_nodes]

        if grad is None:
            grad = np.ones_like(self.output_value)

        # Reduce the gradient to the shape of x, undoing any broadcasting.
        grad_wrt_x = grad
        while np.ndim(grad_wrt_x) > len(np.shape(x)):
            grad_wrt_x = np.sum(grad_wrt_x, axis=0)
        for axis, size in enumerate(np.shape(x)):
            if size == 1:
                grad_wrt_x = np.sum(grad_wrt_x, axis=axis, keepdims=True)

        # Likewise for y.
        grad_wrt_y = grad
        while np.ndim(grad_wrt_y) > len(np.shape(y)):
            grad_wrt_y = np.sum(grad_wrt_y, axis=0)
        for axis, size in enumerate(np.shape(y)):
            if size == 1:
                grad_wrt_y = np.sum(grad_wrt_y, axis=axis, keepdims=True)

        return [grad_wrt_x, grad_wrt_y]
```
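The reduction logic in `compute_gradient` can be extracted into a standalone helper (a hypothetical name, `reduce_grad`, for illustration) and exercised on a few shapes to see how broadcasting is undone:

```python
import numpy as np

def reduce_grad(grad, shape):
    ''' Sum-reduce an upstream gradient back to the given input shape. '''
    # Collapse extra leading dims introduced by broadcasting.
    while np.ndim(grad) > len(shape):
        grad = np.sum(grad, axis=0)
    # Collapse size-1 dims that were stretched by broadcasting.
    for axis, size in enumerate(shape):
        if size == 1:
            grad = np.sum(grad, axis=axis, keepdims=True)
    return grad

G = np.ones((2, 2))                  # upstream gradient dL/dC
print(reduce_grad(G, (2, 2)))        # unchanged, shape (2, 2)
print(reduce_grad(G, (2,)))          # rows summed -> [2. 2.]
print(reduce_grad(G, (1, 2)))        # keepdims    -> [[2. 2.]]
```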

Gradient of Matrix Multiplication

$$C = AB$$

$$G = \frac{\partial L}{\partial C}$$

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \frac{\partial C}{\partial A} = GB^{T}$$

$$\frac{\partial L}{\partial B} = \frac{\partial C}{\partial B} \frac{\partial L}{\partial C}$$

$$\frac{\partial L}{\partial B} = \frac{\partial C}{\partial B} \frac{\partial L}{\partial C} = A^{T}G$$

```python
class MatMul(Operation):
    # ...
    def compute_gradient(self, grad=None):
        ''' Compute and return the gradient for matrix multiplication.

        :param grad: The gradient of other operation wrt the matmul output.
        :type grad: number or a ndarray, default value is 1.0.
        '''
        # Get input values.
        x, y = [node.output_value for node in self.input_nodes]

        # Default gradient wrt the matmul output.
        if grad is None:
            grad = np.ones_like(self.output_value)

        # Gradients wrt inputs: dL/dA = G B^T, dL/dB = A^T G.
        dfdx = np.dot(grad, np.transpose(y))
        dfdy = np.dot(np.transpose(x), grad)

        return [dfdx, dfdy]
```
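The two formulas $\frac{\partial L}{\partial A} = GB^{T}$ and $\frac{\partial L}{\partial B} = A^{T}G$ can be verified numerically. A sketch, taking $L = \sum_{ij} c_{ij}$ so that $G$ is a matrix of ones:

```python
import numpy as np

# Random A and B of compatible shapes; G = dL/dC for L = C.sum().
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
G = np.ones((2, 4))

dA = G @ B.T          # same computation as dfdx in compute_gradient
dB = A.T @ G          # same computation as dfdy

# Finite difference on one entry of A.
eps = 1e-6
A2 = A.copy()
A2[0, 0] += eps
numeric = ((A2 @ B).sum() - (A @ B).sum()) / eps
print(np.isclose(dA[0, 0], numeric, atol=1e-3))  # True
```

Note that the shapes come out right by construction: $GB^{T}$ has the shape of $A$ and $A^{T}G$ has the shape of $B$.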