Softmax Temperature and Prediction Diversity

Harshit Sharma

ML Engineer @ Juniper Networks

Temperature is a hyperparameter of LSTMs (and neural networks generally) that controls the randomness of predictions by scaling the logits before softmax is applied. Temperature scaling is widely used to improve performance on NLP tasks whose decision layer is a softmax.

To explain its utility, we will consider the case of Natural Language Generation, where we generate text by sampling novel sequences from the language model (using the decoder part of the seq-to-seq architecture). At each time step in the decoding phase, we need to predict a token, which is done by sampling from a softmax distribution (over the vocabulary) using one of the sampling techniques. In short, once the logits are obtained, the quality and the diversity of the predictions are controlled by two things: the softmax distribution and the sampling technique applied to it.
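As a rough illustration of a single decoding step, the sketch below turns logits into a softmax distribution and samples one token from it. The vocabulary, the logit values, and the helper name `softmax` are made up for this example; a real decoder would produce logits over a much larger vocabulary.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits for one decoding step.
vocab = ["the", "cat", "sat", "on", "mat"]
logits = [2.0, 1.0, 0.5, 0.2, 0.1]

probs = softmax(logits)

# Draw the next token in proportion to its softmax probability.
rng = random.Random(0)  # seeded only so the sketch is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print(probs, next_token)
```

Sampling in proportion to the probabilities is only one possible sampling technique; it is used here because it is the simplest one to write down.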

This article is about tweaking the softmax distribution to control how diverse and novel the predictions are. Sampling techniques will be covered in a future article.

Fig 1 is a snapshot of how the prediction is made at one of the intermediate timesteps in the decoding phase.

Fig 1: Logits transformation by Softmax

But what is the issue here?

The generated sequence will have a predictable and generic structure. The reason is the low entropy (randomness) of the softmax distribution: the likelihood of one particular word (the one at index 9 in Fig 1) being chosen is far higher than that of the other words. A predictable sequence is not a problem as long as the aim is to get realistic sequences. But if the goal is to generate novel text or an image that has never been seen before, randomness is the holy grail.
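To make "less entropy" concrete, the following sketch compares the Shannon entropy of a peaked distribution with that of a uniform one. Both probability vectors are invented for illustration and are not taken from Fig 1.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

peaked = [0.90, 0.04, 0.03, 0.02, 0.01]  # one word dominates
uniform = [0.2] * 5                       # every word is equally likely

print(entropy(peaked))   # low: the next word is easy to guess
print(entropy(uniform))  # the maximum possible for five outcomes
```

The lower the entropy, the more often sampling returns the same dominant word, which is what makes the generated text feel generic.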

The Solution?

Increase the randomness. And that’s precisely what temperature scaling does when the temperature is raised. The temperature sets the entropy of the probability distribution used for sampling; in other words, it controls how surprising or predictable the next word will be. The scaling is done by dividing the logit vector by a value T, which denotes the temperature, and then applying softmax. With T = 1 the distribution is unchanged, T > 1 flattens it, and T < 1 sharpens it.
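Written out, the scaling described above might look as follows. The symbols are notation assumed here for illustration: z_i is the logit of the i-th vocabulary entry, V is the vocabulary size, and p_i is the resulting probability.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}, \qquad T > 0
```

Setting T = 1 recovers the ordinary softmax.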

Fig 2: Temperature Scaling

The effect of this scaling can be visualized in Fig 3:

Fig 3: Visualizing the Effects of Temperature Scaling. The word probabilities move closer to each other as the Temperature increases

As the temperature increases, the distribution above approaches a uniform distribution, in which each word has an equal probability of being sampled, and the generated sequence looks more creative. Too much creativity isn’t good either. In the extreme case, the generated text might not make sense at all. Hence, like all other hyperparameters, the temperature needs to be tuned as well.
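The flattening effect can be reproduced with a few lines of Python. This is a minimal sketch: the function name `softmax_with_temperature` and the four logit values are assumptions chosen for the example, not part of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four words

# Print the distribution at increasing temperatures.
for t in (0.5, 1.0, 2.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running the loop shows the largest probability shrinking and the smaller ones growing as T goes up, so the four values drift toward 0.25 each.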

Conclusion:

The temperature controls the smoothness of the output distribution. A higher temperature therefore makes sampling more sensitive to low-probability candidates. As T → ∞, the distribution approaches uniform, which increases the uncertainty. Conversely, as T → 0, the distribution collapses to a point mass on the highest-scoring token.
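The two limits can be approximated by sampling at a very low and a very high temperature and counting which tokens come out. The vocabulary, logits, temperatures, and the helper `sample_tokens` below are all hypothetical choices for this sketch.

```python
import math
import random
from collections import Counter

def sample_tokens(vocab, logits, temperature, n, rng):
    """Sample n tokens from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized probabilities
    return rng.choices(vocab, weights=weights, k=n)

vocab = ["the", "cat", "sat", "on", "mat"]  # hypothetical vocabulary
logits = [3.0, 2.0, 1.0, 0.5, 0.1]          # hypothetical logits
rng = random.Random(0)                      # seeded for repeatability

cold = Counter(sample_tokens(vocab, logits, 0.01, 1000, rng))
hot = Counter(sample_tokens(vocab, logits, 100.0, 1000, rng))

print(cold)  # draws concentrate on the top-scoring token
print(hot)   # draws are spread across the whole vocabulary
```

In practice a useful temperature usually lies somewhere between these extremes, which is why it is treated as a hyperparameter to tune.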

The scope of temperature scaling is not limited to NLG. It is also used to calibrate the confidence of deep learning models, typically as a post-processing step after training, and in Reinforcement Learning as well. It also plays a part in a broader concept, Knowledge Distillation. Below are the links on these topics for further exploration.

References:
