Derivative of Softmax without cross entropy
There are several resources that show how to find the derivatives of the softmax + cross_entropy loss together. However, I want to derive the derivatives separately.
For the purposes of this question, I will use a fixed input vector containing 4 values.
Input vector
$$\left[\, x_{0}, \quad x_{1}, \quad x_{2}, \quad x_{3} \,\right]$$
Softmax Function and Derivative
My softmax function is defined as:
$$\left[\ \frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\ \right]$$
Since each element in the vector depends on all the values of the input vector, it makes sense that the gradients for each output element will contain some expression that contains all the input values.
My Jacobian (writing $S = e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}$ for the denominator) is:
$$
\left[\begin{matrix}
\frac{e^{x_{0}}}{S} - \frac{e^{2 x_{0}}}{S^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{S^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{S^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{S^{2}}\\
- \frac{e^{x_{0}} e^{x_{1}}}{S^{2}} & \frac{e^{x_{1}}}{S} - \frac{e^{2 x_{1}}}{S^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{S^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{S^{2}}\\
- \frac{e^{x_{0}} e^{x_{2}}}{S^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{S^{2}} & \frac{e^{x_{2}}}{S} - \frac{e^{2 x_{2}}}{S^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{S^{2}}\\
- \frac{e^{x_{0}} e^{x_{3}}}{S^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{S^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{S^{2}} & \frac{e^{x_{3}}}{S} - \frac{e^{2 x_{3}}}{S^{2}}
\end{matrix}\right]
$$
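(As a quick numerical sanity check, here is a minimal sketch that builds this same matrix as $\operatorname{diag}(s) - s s^{\top}$ and compares it against central finite differences; the helper names, test vector and step size below are just illustrative choices.)
import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

def softmax_jacobian(v):
    # same entries as the matrix above: diag(s) - s s^T
    s = softmax(v)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, v, eps=1e-6):
    # central finite differences, one input component at a time
    v = np.asarray(v, dtype=float)
    J = np.zeros((f(v).size, v.size))
    for j in range(v.size):
        dv = np.zeros_like(v)
        dv[j] = eps
        J[:, j] = (f(v + dv) - f(v - dv)) / (2 * eps)
    return J

x_test = np.array([0.5, -1.0, 2.0, 0.0])
print(np.allclose(softmax_jacobian(x_test), numerical_jacobian(softmax, x_test)))  # True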
Each row contains the contribution from each output element. To calculate the 'final' derivative for each node, I sum up all the elements in each row, which gives a vector the same size as my input vector.
Summing the values numerically gives unstable results due to floating-point issues.
However, it is quite easy to reduce the sum of each row to a closed form. Notice that, apart from the first term (the only positive one) in each row, summing all the negative terms is equivalent to computing:
$$\sum_{i} \mathrm{softmax}_{x_0} \cdot \mathrm{softmax}_{x_i}$$
and the first term is just $$\mathrm{softmax}_{x_0}$$
Which means the derivative of softmax is:
$$\mathrm{softmax} - \mathrm{softmax}^2$$
or
$$\mathrm{softmax}(1-\mathrm{softmax})$$
This seems correct, and Geoff Hinton's video (at time 4:07) shows the same result. This answer also seems to arrive at the same equation as I do.
Cross Entropy Loss and its derivative
The cross entropy takes as input the softmax vector and a 'target' probability distribution:
$$\left[\, t_{0}, \quad t_{1}, \quad t_{2}, \quad t_{3} \,\right]$$
Let the softmax output at index $i$ be denoted $s_i$, so the full softmax vector is:
$$\left[\, s_{0}, \quad s_{1}, \quad s_{2}, \quad s_{3} \,\right]$$
Cross entropy function
$$
- \sum_{i}^{\text{classes}} t_i \log(s_i)
$$
For our case it is (with $S$ as above):
$$
- t_{0} \log{\left(\frac{e^{x_{0}}}{S}\right)} - t_{1} \log{\left(\frac{e^{x_{1}}}{S}\right)} - t_{2} \log{\left(\frac{e^{x_{2}}}{S}\right)} - t_{3} \log{\left(\frac{e^{x_{3}}}{S}\right)}
$$
Derivative of cross entropy
Using the constant-multiple rule along with the log rule, the derivative of the cross entropy with respect to each $s_i$ is:
$$
-\frac{t_i}{s_i}
$$
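(A quick numerical check of this formula, as a minimal sketch: `s` and `t` below are arbitrary example distributions and `eps` is an arbitrary step size; the finite-difference gradient of the cross entropy with respect to $s$ should match $-t_i/s_i$.)
import numpy as np

def cross_entropy(s, t):
    return np.sum(-t * np.log(s))

s = np.array([0.1, 0.2, 0.3, 0.4])   # example softmax output
t = np.array([0.0, 1.0, 0.0, 0.0])   # example target distribution
eps = 1e-6

grad_analytic = -t / s
grad_numeric = np.array([
    (cross_entropy(s + eps * np.eye(4)[i], t) -
     cross_entropy(s - eps * np.eye(4)[i], t)) / (2 * eps)
    for i in range(4)
])
print(np.allclose(grad_analytic, grad_numeric))  # True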
Using chain rule to get derivative of softmax with cross entropy
We can just multiply the cross-entropy derivative (the derivative of the loss with respect to the softmax output) by the softmax derivative (the derivative of the softmax with respect to its input) to get:
$$
-\frac{t_i}{s_i} \cdot s_i(1-s_i)
$$
Simplifying, this gives:
$$
-t_i\,(1-s_i)
$$
Analytically computing derivative of softmax with cross entropy
This document derives the derivative of softmax with cross entropy and it gets:
$$
s_i - t_i
$$
which is different from the one I derived using the chain rule.
Implementation using numpy
I thought perhaps both derivatives would evaluate to the same result, and that I had missed some simplification that could be applied using assumptions (e.g. that probability distributions sum to 1).
This is the code to evaluate:
import numpy as np

x = np.array([-1.0, -1.0, 1.0])   # unscaled logits, my x vector
t = np.array([0.0, 1.0, 0.0])     # target probability distribution

## Function definitions
def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

def cross_entropy(inps, targets):
    return np.sum(-targets * np.log(inps))

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

def softmax_derivatives(softmax):
    # per-element softmax derivative as derived above
    return softmax * (1 - softmax)

soft = softmax(x)                              # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft, t)                         # 2.2395447662218846
cross_der = cross_entropy_derivatives(soft, t) # [-0.       , -9.3890561, -0.       ]
soft_der = softmax_derivatives(soft)           # [0.09516324, 0.09516324, 0.16763901]

## Derivative using chain rule
cross_der * soft_der                           # [-0.        , -0.89349302, -0.        ]

## Derivative using analytical derivation
soft - t                                       # [ 0.10650698, -0.89349302,  0.78698604]
Notice the difference in values.
To clarify, my question is: what mistake am I making? These two results should agree.
calculus multivariable-calculus neural-networks
I have posted my own answer. If there are no better answers in a few days, I will accept it as the correct answer.
– harveyslash, Jul 10 '18 at 6:45
1 Answer
There are two glaring errors in my derivation, which somewhat void the entire question. However, there are still key things I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the rows of the Jacobian (equivalently, its columns, since it is symmetric)
is equivalent to computing
$$
\color{red}{\mathrm{softmax}(1-\mathrm{softmax})}
$$
The sum of the Jacobian entries for $s_0$ actually goes like this:
$$
s_0 - \sum_{i} s_0 \, s_i
$$
Factoring out $s_0$:
$$
s_0 - s_0 \sum_{i} s_i
$$
The $s_i$ terms sum to $1$ (since the softmax outputs sum to $1$). Therefore we get:
$$
s_0 - s_0 \cdot 1
$$
which is $0$.
So, if the partials are summed up, we get $0$. I will get back to why this makes sense later.
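(A minimal numeric sketch of this: the row sums of the softmax Jacobian, built here as $\operatorname{diag}(s) - s s^{\top}$, come out as zero up to floating-point error; the input vector is just the one used later in this answer.)
import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

s = softmax(np.array([-1.0, -1.0, 1.0]))
J = np.diag(s) - np.outer(s, s)   # softmax Jacobian
print(J.sum(axis=1))              # ~[0. 0. 0.]  (axis=0 gives the same, by symmetry)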
2. Jacobians shouldn't be summed
The Jacobian matrix should not be summed and then element-wise multiplied with the derivative of the previous error. Instead, a matrix product should be taken with the derivative coming from the previous layer.
This means that the equation
$$
\color{red}{-\frac{t_i}{s_i} \cdot s_i(1-s_i)}
$$
which calculates the derivative using the chain rule, is INCORRECT.
It should actually be:
$$
-\frac{\mathbf{t}}{\mathbf{s}} \times J_{\mathrm{softmax}}
$$
where $\mathbf{t}$ and $\mathbf{s}$ are vectors, the fraction denotes element-wise division between them, $\times$ denotes matrix multiplication, and $J_{\mathrm{softmax}}$ is the softmax Jacobian.
Why summing up the partials results in 0
To understand that, we first need to understand what the Jacobian matrix signifies.
Element $(0,0)$ reads: how does $s_0$ change when I change $x_0$?
Element $(1,0)$ reads: how does $s_1$ change when I change $x_0$?
Element $(2,0)$ reads: how does $s_2$ change when I change $x_0$?
To get the total amount of change caused by $x_0$, the above elements can be summed up (i.e. we sum down that column).
The same can be said about the other inputs $x_1$, $x_2$ and $x_3$.
Summing the columns in this way is equivalent to a matrix multiplication between a vector of $1$s and the softmax Jacobian.
That product tells you how much the softmax output would change if you changed all of the inputs, i.e. all of the $x_i$, by the same amount.
Since softmax is a normalising function, changing all inputs by the same amount is equivalent to doing nothing!
In fact, the common 'normalising trick' used to stabilise softmax adds a constant to every $x_i$ without changing the output in any way.
Since the change is 0, the gradient is 0.
When the Jacobian is instead matrix-multiplied with the derivative from the previous layer, each element of the Jacobian gets a different 'weight', so the terms no longer cancel out.
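(To illustrate this shift invariance concretely, a minimal sketch; the shift of 100 is an arbitrary constant.)
import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

v = np.array([-1.0, -1.0, 1.0])
print(np.allclose(softmax(v), softmax(v + 100.0)))  # True: shifting every logit leaves the output unchanged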
Implementation in numpy
import numpy as np

v = np.array([-1.0, -1.0, 1.0])   # unscaled logits
t = np.array([0.0, 1.0, 0.0])     # target probability distribution

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

def cross_entropy(inps, targets):
    return np.sum(-targets * np.log(inps))

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

# Fixed softmax derivative which returns the full jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
def softmax_derivatives(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

soft = softmax(v)                              # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft, t)                         # 2.2395447662218846

cross_der = cross_entropy_derivatives(soft, t)
# [-0.       , -9.3890561, -0.       ]

soft_der = softmax_derivatives(soft)
# [[ 0.09516324, -0.01134374, -0.08381951],
#  [-0.01134374,  0.09516324, -0.08381951],
#  [-0.08381951, -0.08381951,  0.16763901]]

# Derivative using chain rule (vector-Jacobian product)
cross_der @ soft_der                           # [ 0.10650698, -0.89349302,  0.78698604]

# Derivative using analytical derivation
soft - t                                       # [ 0.10650698, -0.89349302,  0.78698604]

Now the derivative using the chain rule and the analytical derivative agree (well within floating-point error).
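As one more sanity check (a minimal sketch reusing softmax, cross_entropy, v, t and soft from the block above; the step size eps is an arbitrary choice), a central finite-difference approximation of the full loss $L(v) = \mathrm{CE}(\mathrm{softmax}(v), t)$ also agrees with soft - t:
def loss(v):
    return cross_entropy(softmax(v), t)

eps = 1e-6
grad_numeric = np.array([
    (loss(v + eps * np.eye(3)[i]) - loss(v - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_numeric, soft - t))  # True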
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "69"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2843505%2fderivative-of-softmax-without-cross-entropy%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the columns of the jacobian
is equivalent to doing
$$
color{red}{softmax(1-softmax)}
$$
The sum of the columns of the jacobian for $s_0$ actually goes like this:
$$
s_0 - sum_{i}{} s_0 *s_i
$$
Taking $s_0$ common:
$$
s_0 - s_0 sum_{i}{} s_i
$$
Summation of all $s_i$ terms will equal 1 (since sum of softmax outputs is 1).
Therefore we get:
$$
s_0 - s_0*1
$$
which is $0$
So , if the partials are summed up , we get a 0. I will get back to why this makes sense later.
2. Jacobians shouldn't be summed
The jacobian matrix should not be summed and element-wise multiplied with the derivative of the previous error. Instead, a Matrix product should be done with the jacobian of the previous layer.
This means that the equation
$$
color{red}{-frac{t_i}{s_i} * s_i(1-s_i)}
$$
which calculates the derivative using chain rule.
is INCORRECT
It should actually be :
$$
-frac{mathbf{t}}{mathbf{s}} times Softmax_Jacobian
$$
where $mathbf{t}$ and $mathbf{s}$ are vectors , and the _ symbol is the element wise division between them.
and the $times$ symbol denotes matrix multiplication.
Why Summing up the partials result in 0
To understand that, we need to first understand what the jacobian matrix signifies.
For element 0,0 , it reads :
How does $s_0$ change when I change $x_0$
For element 1,0 , it reads like this:
How does $s_1$ change when I change $x_0$
For element 2,0 , it reads like this:
How does $s_2$ change when I change $x_0$
To get the total amount of change on $x_0$ , the above elements can be summed up (meaning we do a sum across rows ).
The same can be said about $x_2$ and $x_3$.
Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax jacobian .
This is means the jacobian would tell how much softmax would change if you changed all input values (i.e. all $x_i$) if you changed all of the $x_i$ by the same value.
Since softmax is a normalising function, changing the values of all inputs by the same is equivalent to doing nothing!
In fact , the common "Normalising trick" done to stabilise softmax adds a constant to x_i without changing the values in any way.
Since the change is 0, the gradient is 0
In case of jacobian matrix multiply with the previous layer, there is different 'weights' assigned to each element in the jacobian, which will result in them not cancelling out.
Implementation in numpy
v = np.array([-1.0, -1.0, 1.0]) # unscaled logits
t = np.array([0.0,1.0,0.0]) # target probability distribution
def softmax(v):
exps = np.exp(v)
sum = np.sum(exps)
return exps/sum
def cross_entropy(inps,targets):
return np.sum(-targets*np.log(inps))
def cross_entropy_derivatives(inps,targets):
return -targets/inps
# Fixed softmax derivative which returns the jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
def softmax_derivatives(softmax):
s = softmax.reshape(-1,1)
return np.diagflat(s) - np.dot(s, s.T)
soft = softmax(v) # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft,t) # 2.2395447662218846
cross_der = cross_entropy_derivatives(soft,t)
# [-0. , -9.3890561, -0. ]
soft_der = softmax_derivatives(soft)
#[[ 0.09516324, -0.01134374, -0.08381951],
#[-0.01134374, 0.09516324, -0.08381951],
#[-0.08381951, -0.08381951, 0.16763901]]
# derivative using chain rule
cross_der @ soft_der # [[ 0.10650698, -0.89349302, 0.78698604]]
# Derivative using analytical derivation
soft - t # [ 0.10650698, -0.89349302, 0.78698604]
Now the derivative using chain rule and the analytical derivative are similar (well within margin of floating point error)
$endgroup$
add a comment |
$begingroup$
There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the columns of the jacobian
is equivalent to doing
$$
color{red}{softmax(1-softmax)}
$$
The sum of the columns of the jacobian for $s_0$ actually goes like this:
$$
s_0 - sum_{i}{} s_0 *s_i
$$
Taking $s_0$ common:
$$
s_0 - s_0 sum_{i}{} s_i
$$
Summation of all $s_i$ terms will equal 1 (since sum of softmax outputs is 1).
Therefore we get:
$$
s_0 - s_0*1
$$
which is $0$
So , if the partials are summed up , we get a 0. I will get back to why this makes sense later.
2. Jacobians shouldn't be summed
The jacobian matrix should not be summed and element-wise multiplied with the derivative of the previous error. Instead, a Matrix product should be done with the jacobian of the previous layer.
This means that the equation
$$
color{red}{-frac{t_i}{s_i} * s_i(1-s_i)}
$$
which calculates the derivative using chain rule.
is INCORRECT
It should actually be :
$$
-frac{mathbf{t}}{mathbf{s}} times Softmax_Jacobian
$$
where $mathbf{t}$ and $mathbf{s}$ are vectors , and the _ symbol is the element wise division between them.
and the $times$ symbol denotes matrix multiplication.
Why Summing up the partials result in 0
To understand that, we need to first understand what the jacobian matrix signifies.
For element 0,0 , it reads :
How does $s_0$ change when I change $x_0$
For element 1,0 , it reads like this:
How does $s_1$ change when I change $x_0$
For element 2,0 , it reads like this:
How does $s_2$ change when I change $x_0$
To get the total amount of change on $x_0$ , the above elements can be summed up (meaning we do a sum across rows ).
The same can be said about $x_2$ and $x_3$.
Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax jacobian .
This is means the jacobian would tell how much softmax would change if you changed all input values (i.e. all $x_i$) if you changed all of the $x_i$ by the same value.
Since softmax is a normalising function, changing the values of all inputs by the same is equivalent to doing nothing!
In fact , the common "Normalising trick" done to stabilise softmax adds a constant to x_i without changing the values in any way.
Since the change is 0, the gradient is 0
In case of jacobian matrix multiply with the previous layer, there is different 'weights' assigned to each element in the jacobian, which will result in them not cancelling out.
Implementation in numpy
v = np.array([-1.0, -1.0, 1.0]) # unscaled logits
t = np.array([0.0,1.0,0.0]) # target probability distribution
def softmax(v):
exps = np.exp(v)
sum = np.sum(exps)
return exps/sum
def cross_entropy(inps,targets):
return np.sum(-targets*np.log(inps))
def cross_entropy_derivatives(inps,targets):
return -targets/inps
# Fixed softmax derivative which returns the jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
def softmax_derivatives(softmax):
s = softmax.reshape(-1,1)
return np.diagflat(s) - np.dot(s, s.T)
soft = softmax(v) # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft,t) # 2.2395447662218846
cross_der = cross_entropy_derivatives(soft,t)
# [-0. , -9.3890561, -0. ]
soft_der = softmax_derivatives(soft)
#[[ 0.09516324, -0.01134374, -0.08381951],
#[-0.01134374, 0.09516324, -0.08381951],
#[-0.08381951, -0.08381951, 0.16763901]]
# derivative using chain rule
cross_der @ soft_der # [[ 0.10650698, -0.89349302, 0.78698604]]
# Derivative using analytical derivation
soft - t # [ 0.10650698, -0.89349302, 0.78698604]
Now the derivative using chain rule and the analytical derivative are similar (well within margin of floating point error)
$endgroup$
add a comment |
$begingroup$
There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the columns of the jacobian
is equivalent to doing
$$
color{red}{softmax(1-softmax)}
$$
The sum of the columns of the jacobian for $s_0$ actually goes like this:
$$
s_0 - sum_{i}{} s_0 *s_i
$$
Taking $s_0$ common:
$$
s_0 - s_0 sum_{i}{} s_i
$$
Summation of all $s_i$ terms will equal 1 (since sum of softmax outputs is 1).
Therefore we get:
$$
s_0 - s_0*1
$$
which is $0$
So , if the partials are summed up , we get a 0. I will get back to why this makes sense later.
2. Jacobians shouldn't be summed
The jacobian matrix should not be summed and element-wise multiplied with the derivative of the previous error. Instead, a Matrix product should be done with the jacobian of the previous layer.
This means that the equation
$$
color{red}{-frac{t_i}{s_i} * s_i(1-s_i)}
$$
which calculates the derivative using chain rule.
is INCORRECT
It should actually be :
$$
-frac{mathbf{t}}{mathbf{s}} times Softmax_Jacobian
$$
where $mathbf{t}$ and $mathbf{s}$ are vectors , and the _ symbol is the element wise division between them.
and the $times$ symbol denotes matrix multiplication.
Why Summing up the partials result in 0
To understand that, we need to first understand what the jacobian matrix signifies.
For element 0,0 , it reads :
How does $s_0$ change when I change $x_0$
For element 1,0 , it reads like this:
How does $s_1$ change when I change $x_0$
For element 2,0 , it reads like this:
How does $s_2$ change when I change $x_0$
To get the total amount of change on $x_0$ , the above elements can be summed up (meaning we do a sum across rows ).
The same can be said about $x_2$ and $x_3$.
Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax jacobian .
This is means the jacobian would tell how much softmax would change if you changed all input values (i.e. all $x_i$) if you changed all of the $x_i$ by the same value.
Since softmax is a normalising function, changing the values of all inputs by the same is equivalent to doing nothing!
In fact , the common "Normalising trick" done to stabilise softmax adds a constant to x_i without changing the values in any way.
Since the change is 0, the gradient is 0
In case of jacobian matrix multiply with the previous layer, there is different 'weights' assigned to each element in the jacobian, which will result in them not cancelling out.
Implementation in numpy
v = np.array([-1.0, -1.0, 1.0]) # unscaled logits
t = np.array([0.0,1.0,0.0]) # target probability distribution
def softmax(v):
exps = np.exp(v)
sum = np.sum(exps)
return exps/sum
def cross_entropy(inps,targets):
return np.sum(-targets*np.log(inps))
def cross_entropy_derivatives(inps,targets):
return -targets/inps
# Fixed softmax derivative which returns the jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
def softmax_derivatives(softmax):
s = softmax.reshape(-1,1)
return np.diagflat(s) - np.dot(s, s.T)
soft = softmax(v) # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft,t) # 2.2395447662218846
cross_der = cross_entropy_derivatives(soft,t)
# [-0. , -9.3890561, -0. ]
soft_der = softmax_derivatives(soft)
#[[ 0.09516324, -0.01134374, -0.08381951],
#[-0.01134374, 0.09516324, -0.08381951],
#[-0.08381951, -0.08381951, 0.16763901]]
# derivative using chain rule
cross_der @ soft_der # [[ 0.10650698, -0.89349302, 0.78698604]]
# Derivative using analytical derivation
soft - t # [ 0.10650698, -0.89349302, 0.78698604]
Now the derivative using chain rule and the analytical derivative are similar (well within margin of floating point error)
$endgroup$
There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the columns of the jacobian
is equivalent to doing
$$
color{red}{softmax(1-softmax)}
$$
The sum of the columns of the jacobian for $s_0$ actually goes like this:
$$
s_0 - sum_{i}{} s_0 *s_i
$$
Taking $s_0$ common:
$$
s_0 - s_0 sum_{i}{} s_i
$$
Summation of all $s_i$ terms will equal 1 (since sum of softmax outputs is 1).
Therefore we get:
$$
s_0 - s_0*1
$$
which is $0$
So , if the partials are summed up , we get a 0. I will get back to why this makes sense later.
2. Jacobians shouldn't be summed
The jacobian matrix should not be summed and element-wise multiplied with the derivative of the previous error. Instead, a Matrix product should be done with the jacobian of the previous layer.
This means that the equation
$$
color{red}{-frac{t_i}{s_i} * s_i(1-s_i)}
$$
which calculates the derivative using chain rule.
is INCORRECT
It should actually be :
$$
-frac{mathbf{t}}{mathbf{s}} times Softmax_Jacobian
$$
where $mathbf{t}$ and $mathbf{s}$ are vectors , and the _ symbol is the element wise division between them.
and the $times$ symbol denotes matrix multiplication.
Why Summing up the partials result in 0
To understand that, we need to first understand what the jacobian matrix signifies.
For element 0,0 , it reads :
How does $s_0$ change when I change $x_0$
For element 1,0 , it reads like this:
How does $s_1$ change when I change $x_0$
For element 2,0 , it reads like this:
How does $s_2$ change when I change $x_0$
To get the total amount of change on $x_0$ , the above elements can be summed up (meaning we do a sum across rows ).
The same can be said about $x_2$ and $x_3$.
Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax jacobian .
This is means the jacobian would tell how much softmax would change if you changed all input values (i.e. all $x_i$) if you changed all of the $x_i$ by the same value.
Since softmax is a normalising function, changing the values of all inputs by the same is equivalent to doing nothing!
In fact , the common "Normalising trick" done to stabilise softmax adds a constant to x_i without changing the values in any way.
Since the change is 0, the gradient is 0
In case of jacobian matrix multiply with the previous layer, there is different 'weights' assigned to each element in the jacobian, which will result in them not cancelling out.
Implementation in numpy
v = np.array([-1.0, -1.0, 1.0]) # unscaled logits
t = np.array([0.0,1.0,0.0]) # target probability distribution
def softmax(v):
exps = np.exp(v)
sum = np.sum(exps)
return exps/sum
def cross_entropy(inps,targets):
return np.sum(-targets*np.log(inps))
def cross_entropy_derivatives(inps,targets):
return -targets/inps
# Fixed softmax derivative which returns the jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function
def softmax_derivatives(softmax):
s = softmax.reshape(-1,1)
return np.diagflat(s) - np.dot(s, s.T)
soft = softmax(v) # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft,t) # 2.2395447662218846
cross_der = cross_entropy_derivatives(soft,t)
# [-0. , -9.3890561, -0. ]
soft_der = softmax_derivatives(soft)
#[[ 0.09516324, -0.01134374, -0.08381951],
#[-0.01134374, 0.09516324, -0.08381951],
#[-0.08381951, -0.08381951, 0.16763901]]
# derivative using chain rule
cross_der @ soft_der # [[ 0.10650698, -0.89349302, 0.78698604]]
# Derivative using analytical derivation
soft - t # [ 0.10650698, -0.89349302, 0.78698604]
Now the derivative using chain rule and the analytical derivative are similar (well within margin of floating point error)
edited Jul 9 '18 at 18:37
answered Jul 7 '18 at 15:21
harveyslashharveyslash
1286
1286
add a comment |
add a comment |
Thanks for contributing an answer to Mathematics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2843505%2fderivative-of-softmax-without-cross-entropy%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
I have posted my own answer. If there are no better answers in a few days, I will accept it as the correct answer
$endgroup$
– harveyslash
Jul 10 '18 at 6:45