softmax_loss derivative -1 subtraction #10

Open
mmuneeburahman opened this issue Dec 1, 2023 · 2 comments
Labels: question (Further information is requested)

Comments


mmuneeburahman commented Dec 1, 2023

Here is the Jacobian matrix of partial derivatives for the softmax:
[image: softmax Jacobian]
This simplifies to:
[image: simplified gradient expression]

softmax[y[i]] -= 1                  # update for gradient

I didn't get this step. Can someone explain it?
For reference, see the blog.
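
For context, a minimal NumPy sketch of where this subtraction usually sits in a `softmax_loss`-style function (the names below are my own placeholders, not necessarily this repo's exact code):

```python
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss and gradient for a single sample.

    scores: (C,) raw class scores; y: integer index of the correct class.
    Placeholder sketch, not the repo's actual implementation.
    """
    shifted = scores - scores.max()                  # subtract max for numerical stability
    softmax = np.exp(shifted) / np.exp(shifted).sum()
    loss = -np.log(softmax[y])                       # -log p_y
    dscores = softmax.copy()                         # dL/dscore_j = p_j for j != y
    dscores[y] -= 1                                  # dL/dscore_y = p_y - 1  <- the "-1" in question
    return loss, dscores
```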

mmuneeburahman changed the title from "softmax_loss -1 subtraction" to "softmax_loss derivative -1 subtraction" on Dec 1, 2023
mmuneeburahman (Author) commented:

According to my understanding, this is because of the derivative of the loss w.r.t. the softmax output.
I think the definition of cross-entropy in the cs231n notes is not the form that is generally written, and it is somewhat incomplete:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

Categorical Cross-Entropy Loss:

$$L = -\sum_j y_j \log(p_j)$$

However, the latter simplifies to the equation above, since $y_j$ is 1 for the correct class and 0 for every other class.
The derivative of $L$ w.r.t. the score of class $j$ is then $p_j - y_j$, which is $p_j - 1$ for the correct class.

Ref:
https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function
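
A quick numerical sanity check of that $p_j - y_j$ claim (a standalone sketch, not this repo's code):

```python
import numpy as np

# Numerically check that dL/df_j = p_j - y_j for L = -sum_j y_j * log(p_j),
# where p = softmax(f) and y is one-hot at index `correct`.
rng = np.random.default_rng(0)
f = rng.normal(size=5)                     # raw scores for 5 classes
correct = 2                                # index of the correct class

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def loss(f):
    return -np.log(softmax(f)[correct])    # cross-entropy with one-hot y

analytic = softmax(f)
analytic[correct] -= 1                     # p_j - y_j, i.e. p_j - 1 at the correct class

eps = 1e-6
numeric = np.array([
    (loss(f + eps * np.eye(5)[j]) - loss(f - eps * np.eye(5)[j])) / (2 * eps)
    for j in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # -> True
```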

mantasu (Owner) commented Dec 4, 2023

Just to clarify, for any sample $n$ when computing the loss, we only consider the predicted score of the class $c$ of interest, i.e., the class $c$ that is actually correct for that sample $n$:

$$L_n=\begin{cases}-\log\left(\frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}\right), & \text{when } i=c \qquad \\ 0, & \text{otherwise}\end{cases}$$

$$\nabla_{\hat y_i}L_n=\begin{cases}\left(\frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}\right)-1, & \text{when } i=c \\ \left(\frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}\right), & \text{otherwise}\end{cases}$$

The actual steps for arriving at the final derivative shown above are a bit more involved, but you can easily work them out with pen and paper using the standard differentiation rules (chain rule, logarithm rule, quotient rule, etc.), e.g., $\nabla_x \log(x)= \frac{1}{x}$ (we assume $\log=\text{ln}$).
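
For example, writing $p_i = \frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}$ as shorthand, one way the pen-and-paper steps go is:

$$L_n = -\log\left(\frac{e^{\hat y_c}}{\sum_{i'} e^{\hat y_{i'}}}\right) = -\hat y_c + \log\sum_{i'} e^{\hat y_{i'}}$$

$$\nabla_{\hat y_i}L_n = -\nabla_{\hat y_i}\hat y_c + \frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}} = p_i - \begin{cases}1, & \text{when } i=c \\ 0, & \text{otherwise}\end{cases}$$

which matches the two cases above: $p_c - 1$ for the correct class and $p_i$ for every other class.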

Intuitively, that $-1$ accounts for $\hat{y}_c$ appearing in the numerator of the original calculation of $L_n$, which does not happen for any other $\hat{y}_i$ with $i \ne c$.

Extra note: the full cross-entropy loss is calculated as the average negative log probability of the correct class across all your data points. For a single data point $n$, its negative log probability is just the "surprisal" of that event according to information theory.
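
In symbols, over $N$ data points with correct classes $c_n$:

$$L = \frac{1}{N}\sum_{n=1}^{N} L_n = -\frac{1}{N}\sum_{n=1}^{N} \log\left(\frac{e^{\hat y_{c_n}}}{\sum_{i'} e^{\hat y_{i'}}}\right)$$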

Edit: for clarity, I should add that I use a different notation from the one in cs231n notes:

  • In cs231n notes, $f_{y_i}$ means the predicted score for the class $y_i$ that is actually correct for sample $i$
  • In my answer here, $\hat{y}_i$ means the predicted score for any class $i$, regardless of which sample it is (I index samples with $n$)

mantasu added the question (Further information is requested) label on Dec 4, 2023