
KL Divergence, cross-entropy and neural network losses

When training simple neural networks, we often take for granted the loss function one should use to solve a specific problem.

E.g. doing detection? Binary cross-entropy. Classification? Categorical cross-entropy. Regression? MSE.

But those come from a very specific way of interpreting neural network outputs, which I will develop below.

All you need is entropy

It's hard to say where information theory stops and machine learning starts. The two fields seem so intertwined to me, the best example being how nearly every loss is based on cross-entropy minimization.

Entropy

Entropy, as a quick reminder, is the expected amount of information of a message drawn from a given distribution. Information is measured in nats or bits depending on which log base ($e$ or $2$) you choose.

$$H(P) = -\mathop{\mathbb{E}}_{x\sim P}[\log P(x)]$$
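As a quick numerical sketch (the three-event distribution below is made up purely for illustration), the entropy of a discrete $P$ is just the average of $-\log P(x)$ weighted by $P$ itself:

```python
import numpy as np

# Made-up discrete distribution over three events.
p = np.array([0.5, 0.25, 0.25])

# H(P) = -E_{x~P}[log P(x)], in nats (use np.log2 for bits).
entropy = -np.sum(p * np.log(p))
print(entropy)  # ~1.04 nats
```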

Cross-Entropy

Cross-entropy is a bit harder to grasp. It measures the expected amount of information contained in a message drawn from the true distribution $P$ but encoded using another distribution $Q$.

$$H(P, Q) = -\mathop{\mathbb{E}}_{x\sim P}[\log Q(x)]$$
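Sticking with the same toy setup, here is a small sketch where events drawn from a made-up true distribution $P$ are scored with the probabilities of another made-up distribution $Q$; note that $H(P, Q) \geq H(P)$:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])  # true distribution P
q = np.array([0.7, 0.2, 0.1])    # encoding distribution Q

# H(P, Q) = -E_{x~P}[log Q(x)], in nats.
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
print(cross_entropy, entropy)  # cross-entropy is always >= entropy
```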

One nice thing about cross-entropy is that it can be used to quantify a difference between two probability distributions, but this only becomes really clear once we look at the KL (Kullback-Leibler) divergence.

Kullback-Leibler Divergence

KL Divergence is used a lot to compare distributions and has a really clear meaning in terms of information: it is the extra amount of information needed (and it is not symmetric) when encoding a message that follows distribution $P$ using the probabilities of distribution $Q$.

$$D_{KL}(P||Q) = \mathop{\mathbb{E}}_{x\sim P}\left[\log \frac{P(x)}{Q(x)} \right]$$

$$= \mathop{\mathbb{E}}_{x\sim P}\left[\log P(x) - \log Q(x) \right]$$

Now, this is great because we can rewrite it using entropy definitions.

$$D_{KL}(P||Q) = - H(P) + H(P, Q)$$

The KL Divergence is the difference between the cross-entropy and the entropy. This makes sense: both entropies are expected message lengths, with $Q$ or $P$ as the encoding distribution, and their difference is the number of bits (or nats) we lose by using the wrong code.

Of course this is not symmetric, because swapping the variables changes which distribution we treat as the true one.

$$D_{KL}(P||Q) \neq D_{KL}(Q||P)$$

As a last observation, note that when we want to minimize the difference between two distributions, minimizing the KL Divergence or the cross-entropy yields the same optimum, since the entropy of the true distribution is a constant here.
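To make the decomposition and the asymmetry concrete, here is a minimal sketch (same made-up $P$ and $Q$ as before) checking that $D_{KL}(P||Q) = H(P, Q) - H(P)$ and that swapping the arguments changes the value:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])  # made-up true distribution P
q = np.array([0.7, 0.2, 0.1])    # made-up model distribution Q

def kl(a, b):
    """D_KL(a || b) for discrete distributions, in nats."""
    return np.sum(a * np.log(a / b))

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))

print(np.isclose(kl(p, q), cross_entropy - entropy))  # True
print(kl(p, q), kl(q, p))  # two different values: not symmetric
```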

But what about loss functions?

Let's derive three common loss functions using various distributions and cross-entropy.

For each application, the method is the same:

  • Consider the neural network output as the parameter of a simple probability distribution.
  • Compute the information (negative log) of the corresponding probability density or mass function.

Consider a neural network with parameters $\theta$ which maps each input $x$ to $f(x; \theta)$.

Binary detection with Bernoulli

Suppose that $f(x; \theta)$ is the $p$ parameter of a Bernoulli distribution. We ensure that $f(x; \theta)$ is bounded between $0$ and $1$ by applying the sigmoid function on the last layer.

The Bernoulli distribution is a discrete distribution that can be written as follows.

$$B(k;p)=p^k(1-p)^{1-k}$$

With $k$ taking the values $0$ and $1$. Let's compute the cross-entropy between a Bernoulli distribution parametrized by the neural network output and the true distribution $\hat{p}_{data}$.

$$\hat{p}_{model}(y|x) = B(y;f(x; \theta))$$

$$-\Bbb E_{x\sim\hat p_{data}}[\log\hat p_{model}]=-\Bbb E_{x\sim \hat p_{data}}[y\log(f(x; \theta)) + (1-y)\log(1-f(x; \theta))]$$

This is what is often called the Binary Cross-Entropy.
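As a minimal sketch of this in code (the logits and labels below are made up, and the network is reduced to its raw outputs), the binary cross-entropy is just the Bernoulli negative log-likelihood averaged over the data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])  # stand-in for the raw outputs f(x; theta)
y = np.array([1.0, 0.0, 1.0])        # true binary labels

p = sigmoid(logits)  # Bernoulli parameter predicted by the network
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)
```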

Classification with Multinoulli

In classification, we constrain the last layer to output the parameters of a Multinoulli distribution by applying the softmax function. By doing this, the neural network outputs $n$ values $f_i(x; \theta)$ which sum to $1$.

The Multinoulli distribution, or Categorical distribution (or even Generalized Bernoulli), is a discrete distribution that assigns probability $p_i$ to each event $x_i$ in a given set.

$$M(k_i;\bm{p}) = p_i$$

As before, we compute the cross-entropy, considering that the outputs $f_i(x; \theta)$ parameterize the Multinoulli distribution.

$$-\Bbb E_{x\sim \hat p_{data}}[\log \hat p_{model}] = -\Bbb E_{x\sim \hat p_{data}}\left[\sum_i y_i \log(f_i(x; \theta))\right]$$

In fact, if $n=2$, this is the same computation as the Bernoulli case: having a single sigmoid output that parametrizes a Bernoulli distribution or two softmax outputs that parameterize a Multinoulli are actually the same thing.
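Here is the same idea for the Multinoulli case, again as a sketch with made-up logits and one-hot labels standing in for a real network and dataset:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],   # stand-in for raw outputs, one row per sample
                   [0.1, 1.2, 0.3]])
y = np.array([[1, 0, 0],               # one-hot labels y_i
              [0, 0, 1]])

probs = softmax(logits)                             # f_i(x; theta), summing to 1
cce = -np.mean(np.sum(y * np.log(probs), axis=-1))  # categorical cross-entropy
print(cce)
```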

Regression with Normal

At last, let's tackle simple regression (single scalar prediction). The activation of the final layer is not constrained, and we consider the output to be the mean parameter of a Normal distribution with unit variance.

$$\mathcal N(y; f(x; \theta), 1) = \frac{1}{\sqrt{2\pi}} \exp\left({-\frac{\left(y-f(x; \theta)\right)^2}{2}}\right)$$

Again, what does the cross-entropy computation give us?

$$-\Bbb E_{x\sim \hat p_{data}}[\log \hat p_{model}] = -\Bbb E_{x\sim \hat p_{data}}\left[-\frac{\log(2\pi)}{2}-\frac{\left(y-f(x; \theta)\right)^2}{2}\right]$$

Minimizing this is the same as minimizing the MSE (Mean Squared Error) between labels and predictions.
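A last sketch, with made-up targets and predictions, shows that the unit-variance Gaussian negative log-likelihood and the MSE differ only by a factor $\frac{1}{2}$ and an additive constant, so they share the same minimizer:

```python
import numpy as np

y = np.array([1.5, -0.3, 2.0])  # true targets
f = np.array([1.2,  0.1, 2.4])  # stand-in for the predictions f(x; theta)

nll = np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - f) ** 2)
mse = np.mean((y - f) ** 2)
print(np.isclose(nll, 0.5 * np.log(2 * np.pi) + 0.5 * mse))  # True
```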

Conclusion

The most commonly used loss functions in machine learning are derived from minimizing the cross-entropy between the real and predicted distributions, with parameters estimated by neural networks. It is not often taught this way, but I found it the most natural way to think about it a while ago when reading the Deep Learning Book by Ian Goodfellow, and I took the time to share it here :)
