Introducing a parameter to control the smoothness of softmax

3 years ago

source link: https://allenwind.github.io/blog/15205/

Mr.Feng Blog

NLP, deep learning, machine learning, Python, Go


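How can we introduce a parameter to control how smooth softmax is?

The previous post, "Why do multi-class models use softmax for their output?", analyzed why the output layer of a multi-class model uses softmax, including a derivation of softmax itself. With a small modification, that derivation naturally yields a parameterized version that controls how smooth softmax is.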

The derivation relies on a smooth approximation of max,

$$
\max(x_1,\dots,x_n) \approx \frac{1}{\alpha}\log\left(\sum_{i=1}^{n} e^{\alpha x_i}\right)
$$
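As a quick numerical check of this approximation, here is a minimal sketch in plain Python. The function name `smooth_max` and the sample values are illustrative choices rather than anything from the original post, and the code assumes $\alpha > 0$.

```python
import math

def smooth_max(xs, alpha):
    """Return (1/alpha) * log(sum(exp(alpha * x))) for alpha > 0 (hypothetical helper)."""
    m = max(xs)
    # Factor exp(alpha * m) out of the sum so every exponent is <= 0
    # and the exponentials cannot overflow.
    s = sum(math.exp(alpha * (x - m)) for x in xs)
    return m + math.log(s) / alpha

xs = [1.0, 2.0, 3.0]
for alpha in (1.0, 10.0, 100.0):
    # The value is always >= max(xs) and moves toward max(xs) as alpha grows.
    print(alpha, smooth_max(xs, alpha))
```

For $\alpha > 0$ the smooth value lies between $\max(x)$ and $\max(x) + \frac{\log n}{\alpha}$, so the gap shrinks as $\alpha$ increases.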

The full derivation is as follows,

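$$
\begin{aligned}
\operatorname{one-hot}(C_k) &= [0,\dots,1,\dots,0] \\
&= \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} x_i\right) \\
&= \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} \left[x_i - \max(x)\right]\right) \\
&= \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} \exp\left[x_i - \max(x)\right]\right) \\
&\approx \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} \exp\left[x_i - \frac{1}{\alpha}\log\left(\sum_{i=1}^{n} e^{\alpha x_i}\right)\right]\right) \\
&= \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} \exp\frac{1}{\alpha}\left[\alpha x_i - \log\left(\sum_{i=1}^{n} e^{\alpha x_i}\right)\right]\right) \\
&= \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} \exp\left[\alpha x_i - \log\left(\sum_{i=1}^{n} e^{\alpha x_i}\right)\right]\right) \\
&= \operatorname{one-hot}\left(\arg\max_{i=1,\cdots,n} \frac{e^{\alpha x_i}}{\sum_{i=1}^{n} e^{\alpha x_i}}\right) \\
&\approx \left[\frac{e^{\alpha x_1}}{\sum_{i=1}^{n} e^{\alpha x_i}},\dots,\frac{e^{\alpha x_n}}{\sum_{i=1}^{n} e^{\alpha x_i}}\right] \\
&= \operatorname{softmax}(\alpha x) \\
&= \operatorname{one-hot}_s(C_k)
\end{aligned}
$$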

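The key step in the derivation above is parameterizing logsumexp, that is, replacing $\max(x)$ with $\frac{1}{\alpha}\log\sum_{i=1}^{n} e^{\alpha x_i}$. The factor $\frac{1}{\alpha}$ can then be dropped inside the argmax because, for $\alpha > 0$, multiplying by a positive constant does not change which index attains the maximum. The result matches intuition: $\alpha$ scales $x$ to $\alpha x$, which in turn controls how smooth $\operatorname{softmax}(x)$ is. A larger $\alpha$ pushes the output toward a one-hot vector, and a smaller $\alpha$ pushes it toward the uniform distribution.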
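To see the effect of $\alpha$ concretely, the following is a minimal sketch of $\operatorname{softmax}(\alpha x)$ in plain Python. The name `softmax_alpha` and the sample scores are assumptions made for illustration, and the max-subtraction trick is only safe as written for $\alpha \ge 0$.

```python
import math

def softmax_alpha(xs, alpha=1.0):
    """softmax(alpha * x) for alpha >= 0 (hypothetical helper name)."""
    m = max(xs)
    # Subtracting the max leaves the result unchanged but keeps
    # every exponent <= 0, which avoids overflow.
    exps = [math.exp(alpha * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]
for alpha in (0.0, 0.5, 1.0, 5.0, 50.0):
    # alpha = 0 gives the uniform distribution; large alpha approaches one-hot.
    print(alpha, [round(p, 4) for p in softmax_alpha(scores, alpha)])
```

In this sketch $\alpha$ plays the role of an inverse temperature: the ordering of the probabilities never changes, only how sharply they concentrate on the largest score.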

In fact, activation functions use a similar trick,

$$
\sigma(\alpha x) = \frac{1}{1+e^{-\alpha x}} = \frac{e^{\alpha x}}{1+e^{\alpha x}}
$$

A special case of this appears in the GELU activation function, with $\alpha = 1.702$,

$$
x\Phi(x) \approx x\sigma(1.702x)
$$
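The quality of this approximation can be checked numerically. The sketch below compares $x\Phi(x)$, with $\Phi$ written through `math.erf`, against $x\sigma(1.702x)$. The function names and the evaluation grid are illustrative assumptions, not taken from the original post.

```python
import math

def sigmoid(x, alpha=1.0):
    """sigma(alpha * x), using whichever of the two equivalent forms avoids overflow."""
    z = alpha * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # 1 / (1 + e^{-z})
    ez = math.exp(z)
    return ez / (1.0 + ez)                   # e^{z} / (1 + e^{z})

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard normal CDF expressed via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x):
    # The sigmoid-based approximation with alpha = 1.702.
    return x * sigmoid(x, alpha=1.702)

grid = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in grid)
print("max abs error on [-5, 5]:", max_err)
```

The two branches in `sigmoid` correspond to the two forms of $\sigma(\alpha x)$ given above; choosing between them by the sign of the argument is a common way to keep the exponential from overflowing.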

This result has an important application in Attention,

$$
\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $\alpha$ is

$$
\alpha = \frac{1}{\sqrt{d_k}}
$$
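To make the role of $\alpha = 1/\sqrt{d_k}$ explicit, here is a minimal sketch of scaled dot-product attention in plain Python, with matrices represented as lists of rows. The function names and the toy inputs are assumptions made for illustration; a real implementation would use an array library and batching.

```python
import math

def softmax_alpha(xs, alpha=1.0):
    """softmax(alpha * x) for alpha >= 0 (hypothetical helper name)."""
    m = max(xs)
    exps = [math.exp(alpha * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-rows matrices."""
    d_k = len(K[0])
    alpha = 1.0 / math.sqrt(d_k)  # the smoothing parameter from the text
    out = []
    for q in Q:
        # One dot product per key: a row of Q K^T.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax_alpha(scores, alpha)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```

Written this way, the scaling in attention is exactly the parameterized softmax from the derivation, with $\alpha$ fixed by the key dimension instead of being tuned by hand.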

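This post presents one way to derive a parameter that controls how smooth softmax is. Mathematical intuition alone is enough to see how to introduce such a parameter, but each additional derivation offers another way of understanding it.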

When reposting, please include the URL of this article: https://allenwind.github.io/blog/15205
For more articles, see: https://allenwind.github.io/blog/archives/
