softmax_loss derivative -1 subtraction #10

Open
mmuneeburahman opened this issue Dec 1, 2023 · 2 comments
Labels: question (Further information is requested)

Comments


mmuneeburahman commented Dec 1, 2023

Here is the Jacobian matrix of partial derivatives for the softmax:
[image: softmax Jacobian]
This simplifies to:
[image: simplified gradient expression]

softmax[y[i]] -= 1                  # update for gradient

I didn't get this step. Can someone explain it?
For reference, see the blog.
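
For context, a minimal NumPy sketch of where this subtraction usually sits in a `softmax_loss`-style function (the names below are my own placeholders, not necessarily this repo's exact code):

```python
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss and gradient for a single sample.

    scores: (C,) raw class scores; y: integer index of the correct class.
    Placeholder sketch, not the repo's actual implementation.
    """
    shifted = scores - scores.max()                  # subtract max for numerical stability
    softmax = np.exp(shifted) / np.exp(shifted).sum()
    loss = -np.log(softmax[y])                       # -log p_y
    dscores = softmax.copy()                         # dL/dscore_j = p_j for j != y
    dscores[y] -= 1                                  # dL/dscore_y = p_y - 1  <- the "-1" in question
    return loss, dscores
```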

mmuneeburahman changed the title from "softmax_loss -1 subtraction" to "softmax_loss derivative -1 subtraction" on Dec 1, 2023
mmuneeburahman (Author) commented:

According to my understanding, this is because of the derivative of the loss w.r.t. the softmax output.
I think the definition of cross-entropy in the cs231n notes is not the form that is generally written, and it is somewhat incomplete:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

Categorical Cross-Entropy Loss:

$$L = -\sum_j y_j \log(p_j)$$

However, the latter simplifies to the equation above, since $y_j$ is 1 for the correct class and 0 for every other class.
The derivative of $L$ w.r.t. the score of class $j$ is then $p_j - y_j$, which is $p_j - 1$ for the correct class.

Ref:
https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function
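
A quick numerical sanity check of that $p_j - y_j$ claim (a standalone sketch, not this repo's code):

```python
import numpy as np

# Numerically check that dL/df_j = p_j - y_j for L = -sum_j y_j * log(p_j),
# where p = softmax(f) and y is one-hot at index `correct`.
rng = np.random.default_rng(0)
f = rng.normal(size=5)                     # raw scores for 5 classes
correct = 2                                # index of the correct class

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def loss(f):
    return -np.log(softmax(f)[correct])    # cross-entropy with one-hot y

analytic = softmax(f)
analytic[correct] -= 1                     # p_j - y_j, i.e. p_j - 1 at the correct class

eps = 1e-6
numeric = np.array([
    (loss(f + eps * np.eye(5)[j]) - loss(f - eps * np.eye(5)[j])) / (2 * eps)
    for j in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # -> True
```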

mantasu (Owner) commented Dec 4, 2023

Just to clarify, for any sample $n$ when computing the loss, we only consider the predicted score of the class $c$ of interest, i.e., the class $c$ that is actually correct for that sample $n$:

$$L_n=\begin{cases}-\log\left(\frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}\right), & \text{when } i=c \qquad \\ 0, & \text{otherwise}\end{cases}$$

$$\nabla_{\hat y_i}L_n=\begin{cases}\left(\frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}\right)-1, & \text{when } i=c \\ \left(\frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}\right), & \text{otherwise}\end{cases}$$

The actual steps for arriving at the final derivative shown above are a bit more involved, but you can easily work them out with pen and paper using the standard differentiation rules (chain rule, logarithm rule, quotient rule, etc.), e.g., $\nabla_x \log(x)= \frac{1}{x}$ (we assume $\log=\text{ln}$).
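
For example, writing $p_i = \frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}}$ as shorthand, one way the pen-and-paper steps go is:

$$L_n = -\log\left(\frac{e^{\hat y_c}}{\sum_{i'} e^{\hat y_{i'}}}\right) = -\hat y_c + \log\sum_{i'} e^{\hat y_{i'}}$$

$$\nabla_{\hat y_i}L_n = -\nabla_{\hat y_i}\hat y_c + \frac{e^{\hat y_i}}{\sum_{i'} e^{\hat y_{i'}}} = p_i - \begin{cases}1, & \text{when } i=c \\ 0, & \text{otherwise}\end{cases}$$

which matches the two cases above: $p_c - 1$ for the correct class and $p_i$ for every other class.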

Intuitively, that $-1$ accounts for $\hat{y}_c$ appearing in the numerator of the original calculation of $L_n$, which does not happen for any other $\hat{y}_i$ with $i \ne c$.

Extra note: the full cross-entropy loss is calculated as the average negative log probability of the correct class across all your data points. For a single data point $n$, its negative log probability is just the "surprisal" of that event according to information theory.
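
In symbols, over $N$ data points with correct classes $c_n$:

$$L = \frac{1}{N}\sum_{n=1}^{N} L_n = -\frac{1}{N}\sum_{n=1}^{N} \log\left(\frac{e^{\hat y_{c_n}}}{\sum_{i'} e^{\hat y_{i'}}}\right)$$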

Edit: for clarity, I should add that I use a different notation from the one in cs231n notes:

  • In cs231n notes, $f_{y_i}$ means the predicted score for the class $y_i$ that is actually correct for sample $i$
  • In my answer here, $\hat{y}_i$ means the predicted score for any class $i$, regardless of which sample it is (I index samples with $n$)

mantasu added the question (Further information is requested) label on Dec 4, 2023