
TensorFlow2: Ten Must-Knows for Deep Learning - cpuimage

source link: https://www.cnblogs.com/cpuimage/p/16427268.html


Based on my years of hands-on experience developing deep learning algorithms, I have put together the following ten must-knows.

Reference links are included, and some items come with code implementations.

Joy shared is joy doubled; I hope this helps.

When I find the time, I will come back and dig into the finer details of deep learning.

If you have technical problems you would like solved on a paid basis, feel free to contact me by email or QQ.

Email (QQ uses the same ID): [email protected]

Beyond these ten there are of course other "must-knows"; feel free to share more in the comments. This is only a tentative list of ten, so don't take it too literally.

The point is to learn the ideas behind them; keep in mind that some of these tricks do not apply in every scenario.

1. Data Echoing

[1907.05550] Faster Neural Network Training with Data Echoing

import tensorflow as tf

def data_echoing(factor):
    return lambda image, label: tf.data.Dataset.from_tensors((image, label)).repeat(factor)

After the dataset is loaded, each example is repeated several times (before or after augmentation) on its way into the model, which cuts down data-loading time.

This is equivalent to letting the model see the same data n times, or n augmented variants of the same sample.
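
A minimal usage sketch: `dataset` and `augment` below are placeholders for your own tf.data pipeline and augmentation function, not names from the original post. Each example is echoed after augmentation with flat_map, then batched as usual.

factor = 2  # each loaded example is fed to the model twice
dataset = (dataset
           .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
           .flat_map(data_echoing(factor))
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))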

2. AMP (Automatic Mixed Precision)

Using mixed precision and XLA to accelerate training in bert4keras - 科学空间|Scientific Spaces

tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

Lowers GPU memory usage and speeds up training by converting part of the network's computation into equivalent lower-precision operations, reducing the compute cost.
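
An alternative route in TF2 is the Keras mixed-precision API; a minimal sketch (not from the original post), noting that custom training loops additionally need loss scaling:

import tensorflow as tf

# Keep variables in float32 but run most ops in float16.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Only needed for custom training loops: wrap the optimizer for loss scaling.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())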

3. Memory-Saving Optimizers

3.1  [1804.04235]Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

mesh/optimize.py at master · tensorflow/mesh · GitHub

3.2 [1901.11150] Memory-Efficient Adaptive Optimization

google-research/sm3 at master · google-research/google-research (github.com)

Saves GPU memory and speeds up training, mainly by decomposing the second-moment (adaptive) statistics into a factored form so that far less optimizer state has to be stored.
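
A minimal sketch of the core idea behind Adafactor's memory saving for a 2-D weight matrix; the function and variable names are illustrative, not taken from the referenced repositories. Instead of keeping a full (n, m) accumulator of squared gradients, only a row vector and a column vector are stored, and their outer product approximates the second moment.

import tensorflow as tf

def factored_second_moment_step(w, grad, r, c, lr=1e-3, beta2=0.999, eps=1e-30):
    # r has shape (n,), c has shape (m,): O(n + m) state instead of O(n * m).
    sq = tf.square(grad) + eps
    r.assign(beta2 * r + (1.0 - beta2) * tf.reduce_mean(sq, axis=1))
    c.assign(beta2 * c + (1.0 - beta2) * tf.reduce_mean(sq, axis=0))
    # Rank-1 reconstruction of the second moment: outer(r, c) / mean(r).
    v_hat = tf.einsum('i,j->ij', r, c) / tf.reduce_mean(r)
    w.assign_sub(lr * grad / tf.sqrt(v_hat))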

4. Weight Standardization (Normalization)

[2102.06171] High-Performance Large-Scale Image Recognition Without Normalization

deepmind-research/nfnets at master · deepmind/deepmind-research · GitHub

import numpy as np
import tensorflow as tf

class WSConv2D(tf.keras.layers.Conv2D):
    """Conv2D with scaled weight standardization and a learnable gain (NFNets-style)."""

    def __init__(self, *args, **kwargs):
        super(WSConv2D, self).__init__(
            kernel_initializer=tf.keras.initializers.VarianceScaling(
                scale=1.0, mode='fan_in', distribution='untruncated_normal',
            ),
            use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l2(1e-4), *args, **kwargs
        )
        # Per-output-channel gain applied to the standardized kernel.
        self.gain = self.add_weight(
            name='gain',
            shape=(self.filters,),
            initializer="ones",
            trainable=True,
            dtype=self.dtype
        )

    def standardize_weight(self, eps):
        mean, var = tf.nn.moments(self.kernel, axes=[0, 1, 2], keepdims=True)
        fan_in = np.prod(self.kernel.shape[:-1])
        # Manually fused normalization, eq. to (w - mean) * gain / sqrt(N * var)
        scale = tf.math.rsqrt(
            tf.math.maximum(
                var * fan_in,
                tf.convert_to_tensor(eps, dtype=self.dtype)
            )
        ) * self.gain
        shift = mean * scale
        return self.kernel * scale - shift

    def call(self, inputs):
        eps = 1e-4
        weight = self.standardize_weight(eps)
        return tf.nn.conv2d(
            inputs, weight, strides=self.strides,
            padding=self.padding.upper(), dilations=self.dilation_rate
        ) if self.bias is None else tf.nn.bias_add(
            tf.nn.conv2d(
                inputs, weight, strides=self.strides,
                padding=self.padding.upper(), dilations=self.dilation_rate
            ), self.bias)

Standardizing (or normalizing) the kernel acts as a prior constraint on the weights, which speeds up training convergence.
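
A hypothetical usage sketch of the layer above, as a drop-in replacement for Conv2D (typically in a normalizer-free network where BatchNormalization is removed):

layer = WSConv2D(filters=64, kernel_size=3, strides=1, padding='same')
y = layer(tf.random.normal([8, 32, 32, 3]))  # -> shape (8, 32, 32, 64)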

5. Adaptive Gradient Clipping (AGC)

deepmind-research/agc_optax.py at master · deepmind/deepmind-research · GitHub

import tensorflow as tf

def unitwise_norm(x):
    # Norm computed per output unit, matching the NFNets/AGC reference code.
    if len(tf.squeeze(x).shape) <= 1:  # Scalars and vectors
        axis = None
        keepdims = False
    elif len(x.shape) in [2, 3]:  # Linear layers of shape IO
        axis = 0
        keepdims = True
    elif len(x.shape) == 4:  # Conv kernels of shape HWIO
        axis = [0, 1, 2, ]
        keepdims = True
    else:
        raise ValueError(f'Got a parameter with shape not in [1, 2, 3, 4]! {x}')
    square_sum = tf.reduce_sum(tf.square(x), axis, keepdims=keepdims)
    return tf.sqrt(square_sum)


def gradient_clipping(grad, var):
    # Rescale the gradient wherever its unit-wise norm exceeds
    # `clipping` times the (floored) unit-wise norm of the parameter.
    clipping = 0.01
    max_norm = tf.maximum(unitwise_norm(var), 1e-3) * clipping
    grad_norm = unitwise_norm(grad)
    trigger = (grad_norm > max_norm)
    clipped_grad = (max_norm / tf.maximum(grad_norm, 1e-6))
    return grad * tf.where(trigger, clipped_grad, tf.ones_like(clipped_grad))

Prevents gradient explosion and stabilizes training: gradients are clipped based on their size relative to the corresponding parameters, which effectively constrains the per-unit learning rate.
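
A usage sketch inside a custom training step; `model`, `optimizer`, `loss_fn`, `x`, and `y` are assumed to be defined elsewhere:

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip each gradient relative to the norm of the matching parameter (AGC).
    grads = [gradient_clipping(g, v)
             for g, v in zip(grads, model.trainable_variables)]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss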

6. recompute_grad

[1604.06174] Training Deep Nets with Sublinear Memory Cost

google-research/recompute_grad.py at master · google-research/google-research (github.com)

bojone/keras_recompute: saving memory by recomputing for keras (github.com)

Saves GPU memory by recomputing activations during the backward pass (gradient checkpointing) instead of storing them.
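
TF2 also ships tf.recompute_grad, which gives the same effect without extra dependencies; a minimal sketch (illustrative, not taken from the repositories above):

import tensorflow as tf

# Activations inside `block` are not stored during the forward pass;
# they are recomputed during backprop, trading compute for memory.
inner = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(4096, activation='relu'),
])
inner.build((None, 4096))  # create the variables before wrapping

@tf.recompute_grad
def block(x):
    return inner(x)

y = block(tf.random.normal([32, 4096]))  # use `block` wherever `inner` was used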

7. Normalization

[2003.05569] Extended Batch Normalization (arxiv.org)

import tensorflow as tf
# Note: this internal import path may differ across Keras/TF versions.
from keras.layers.normalization.batch_normalization import BatchNormalizationBase


class ExtendedBatchNormalization(BatchNormalizationBase):
    def __init__(self,
                 axis=-1,
                 momentum=0.99,
                 epsilon=1e-3,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 moving_mean_initializer='zeros',
                 moving_variance_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 renorm=False,
                 renorm_clipping=None,
                 renorm_momentum=0.99,
                 trainable=True,
                 name=None,
                 **kwargs):
        # Currently we only support aggregating over the global batch size.
        super(ExtendedBatchNormalization, self).__init__(
            axis=axis,
            momentum=momentum,
            epsilon=epsilon,
            center=center,
            scale=scale,
            beta_initializer=beta_initializer,
            gamma_initializer=gamma_initializer,
            moving_mean_initializer=moving_mean_initializer,
            moving_variance_initializer=moving_variance_initializer,
            beta_regularizer=beta_regularizer,
            gamma_regularizer=gamma_regularizer,
            beta_constraint=beta_constraint,
            gamma_constraint=gamma_constraint,
            renorm=renorm,
            renorm_clipping=renorm_clipping,
            renorm_momentum=renorm_momentum,
            fused=False,
            trainable=trainable,
            virtual_batch_size=None,
            name=name,
            **kwargs)

    def _calculate_mean_and_var(self, x, axes, keep_dims):
        with tf.keras.backend.name_scope('moments'):
            y = tf.cast(x, tf.float32) if x.dtype == tf.float16 else x
            replica_ctx = tf.distribute.get_replica_context()
            if replica_ctx:
                # Aggregate sums across replicas to get global batch statistics.
                local_sum = tf.math.reduce_sum(y, axis=axes, keepdims=True)
                local_squared_sum = tf.math.reduce_sum(tf.math.square(y), axis=axes,
                                                       keepdims=True)
                batch_size = tf.cast(tf.shape(y)[0], tf.float32)
                y_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, local_sum)
                y_squared_sum = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM,
                                                       local_squared_sum)
                global_batch_size = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM,
                                                           batch_size)
                axes_vals = [(tf.shape(y))[i] for i in range(1, len(axes))]
                multiplier = tf.cast(tf.reduce_prod(axes_vals), tf.float32)
                multiplier = multiplier * global_batch_size
                mean = y_sum / multiplier
                y_squared_mean = y_squared_sum / multiplier
                # var = E(x^2) - E(x)^2
                variance = y_squared_mean - tf.math.square(mean)
            else:
                # Compute true mean while keeping the dims for proper broadcasting.
                mean = tf.math.reduce_mean(y, axes, keepdims=True, name='mean')
                variance = tf.math.reduce_mean(
                    tf.math.squared_difference(y, tf.stop_gradient(mean)),
                    axes,
                    keepdims=True,
                    name='variance')
            if not keep_dims:
                mean = tf.squeeze(mean, axes)
                variance = tf.squeeze(variance, axes)
            # Extended BN: replace the per-channel variance with its mean over
            # channels, i.e. normalize with a single shared standard deviation.
            variance = tf.math.reduce_mean(variance)
            if x.dtype == tf.float16:
                return (tf.cast(mean, tf.float16),
                        tf.cast(variance, tf.float16))
            else:
                return mean, variance

A simple improved variant of Batch Normalization: it keeps the per-channel mean but normalizes with a standard deviation shared across all channels. The idea is simple and effective.
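
A hypothetical drop-in usage of the class above:

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding='same', input_shape=(32, 32, 3)),
    ExtendedBatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])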

8. Learning-Rate Schedules

[1506.01186] Cyclical Learning Rates for Training Neural Networks (arxiv.org)

A recommended learning-rate strategy: in certain settings, cyclical learning rates achieve better generalization.
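
A minimal sketch of a triangular cyclical schedule following the paper; the class and parameter names (base_lr, max_lr, step_size) are illustrative, not from the original post:

import tensorflow as tf

class TriangularCLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=1e-4, max_lr=1e-3, step_size=2000):
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size

    def __call__(self, step):
        # Linearly ramp the LR up and back down over each 2 * step_size window.
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
        x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
        return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1.0 - x)

optimizer = tf.keras.optimizers.Adam(learning_rate=TriangularCLR())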

9. Re-parameterization

[1908.03930] ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks

https://zhuanlan.zhihu.com/p/361090497

Improves generalization by training several sets of parameters in parallel branches and merging their weights into a single branch for inference.
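
A minimal sketch of the merging step for ACNet-style branches (HWIO kernel layout; folding each branch's BatchNormalization is omitted for brevity; the function name is illustrative):

import numpy as np

def fuse_acnet_branches(k3x3, k1x3, k3x1):
    # With SAME padding, summing the outputs of the 3x3, 1x3 and 3x1 branches
    # equals a single 3x3 conv whose kernel adds the asymmetric kernels into
    # the centre row/column of the square kernel.
    fused = k3x3.copy()
    fused[1:2, :, :, :] += k1x3  # 1x3 kernel lands on the middle row
    fused[:, 1:2, :, :] += k3x1  # 3x1 kernel lands on the middle column
    return fused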

10. Long-Tailed Learning

[2110.04596] Deep Long-Tailed Learning: A Survey (arxiv.org)

Jorwnpay/A-Long-Tailed-Survey: a Chinese translation of the paper Deep Long-Tailed Learning: A Survey (github.com)

Tackling the long-tail problem can speed up convergence, improve generalization, and stabilize training.
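
One common baseline covered by the survey is loss re-weighting; below is a minimal sketch of class-balanced weights based on the "effective number of samples" idea (Cui et al., 2019), with illustrative names and class counts:

import numpy as np
import tensorflow as tf

def class_balanced_weights(samples_per_class, beta=0.999):
    # Rare classes get larger weights; weights are normalized to average 1.
    effective_num = 1.0 - np.power(beta, samples_per_class)
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * len(samples_per_class)
    return tf.constant(weights, dtype=tf.float32)

weights = class_balanced_weights(np.array([5000, 500, 50]))  # head -> tail

def weighted_ce(y_true, y_pred):
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    return ce * tf.gather(weights, tf.cast(y_true, tf.int32))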

