Derivative of Softmax without cross entropy












There are several resources that show how to find the derivatives of the softmax + cross_entropy loss together. However, I want to derive the derivatives separately.



For the purposes of this question, I will use a fixed input vector containing 4 values.



Input vector



$$\left[ x_{0}, \quad x_{1}, \quad x_{2}, \quad x_{3}\right]$$



Softmax Function and Derivative



My softmax function is defined as:



$$\left[ \frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right]$$
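As a quick numerical illustration of this definition (the numbers below are arbitrary, purely for illustration), the four outputs are positive and sum to 1:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # an arbitrary 4-element input, not part of my network

exps = np.exp(x)                     # e^{x_0}, ..., e^{x_3}
s = exps / np.sum(exps)              # the softmax vector defined above

print(s)                             # [0.0320586 , 0.08714432, 0.23688282, 0.64391426]
print(np.sum(s))                     # 1.0 (up to floating point error)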



Since each element in the vector depends on all the values of the input vector, it makes sense that the gradients for each output element will contain some expression that contains all the input values.



My Jacobian is this:



$$
\left[\begin{matrix}
\frac{e^{x_{0}}}{\Sigma} - \frac{e^{2 x_{0}}}{\Sigma^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{\Sigma^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{\Sigma^{2}}\\
- \frac{e^{x_{0}} e^{x_{1}}}{\Sigma^{2}} & \frac{e^{x_{1}}}{\Sigma} - \frac{e^{2 x_{1}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\Sigma^{2}}\\
- \frac{e^{x_{0}} e^{x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\Sigma^{2}} & \frac{e^{x_{2}}}{\Sigma} - \frac{e^{2 x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\Sigma^{2}}\\
- \frac{e^{x_{0}} e^{x_{3}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\Sigma^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\Sigma^{2}} & \frac{e^{x_{3}}}{\Sigma} - \frac{e^{2 x_{3}}}{\Sigma^{2}}
\end{matrix}\right]
$$

where $\Sigma = e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}$ is the shared denominator.
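As a quick sanity check of this matrix (the helper below is only an illustration, not part of my actual code), the entries can be compared against central finite differences:

import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

def softmax_jacobian_fd(v, eps=1e-6):
    # approximate d softmax(v)_i / d v_j with central differences
    n = len(v)
    J = np.zeros((n, n))
    for j in range(n):
        bump = np.zeros(n)
        bump[j] = eps
        J[:, j] = (softmax(v + bump) - softmax(v - bump)) / (2 * eps)
    return J

x = np.array([0.5, -1.0, 2.0, 0.0])        # arbitrary 4-element input
s = softmax(x)
J_analytic = np.diag(s) - np.outer(s, s)   # entry (i, j) is s_i * (delta_ij - s_j), i.e. the matrix above
J_numeric = softmax_jacobian_fd(x)

print(np.max(np.abs(J_analytic - J_numeric)))   # ~1e-10, so the matrix above checks out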



Each row contains the contribution from each output element. To calculate the 'final' derivative of each node, I sum up all the elements in each row, to get a vector which is the same size as my input vector.



Due to numerical stability issues, summing up the values gives unstable results.
However, it is quite easy to reduce the sum of each row to this expression:



Notice that, except for the first term in each row (the only positive term), summing all the negative terms is equivalent to computing:



$$\sum_{i} \text{softmax}_{x_0} \cdot \text{softmax}_{x_i}$$



and the first term is just $\text{softmax}_{x_0}$.



Which means the derivative of softmax is:

$$\text{softmax} - \text{softmax}^2$$

or

$$\text{softmax}(1-\text{softmax})$$



This seems correct, and Geoff Hinton's video (at time 4:07) has this same solution. This answer also seems to arrive at the same equation as mine.



Cross Entropy Loss and its derivative



The cross entropy takes in as input the softmax vector and a 'target' probability distribution.



$$\left[ t_{0}, \quad t_{1}, \quad t_{2}, \quad t_{3}\right]$$



Let the softmax output at index $i$ be denoted $s_i$, so the full softmax vector is:



$$\left[ s_{0}, \quad s_{1}, \quad s_{2}, \quad s_{3}\right]$$



Cross entropy function



$$
- \sum_{i}^{\text{classes}} t_i \log(s_i)
$$



For our case it is



$$
- t_{0} \log{\left(\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right)} - t_{1} \log{\left(\frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right)} - t_{2} \log{\left(\frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right)} - t_{3} \log{\left(\frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right)}
$$



Derivative of cross entropy



Using the constant-multiple rule along with the log rule, the derivative of the cross entropy with respect to $s_i$ is:



$$
-\frac{t_i}{s_i}
$$



Using chain rule to get derivative of softmax with cross entropy



We can just multiply the cross entropy derivative (the loss with respect to the softmax output) by the softmax derivative (the softmax output with respect to the input) to get:



$$
-\frac{t_i}{s_i} \cdot s_i(1-s_i)
$$



Simplifying, this gives

$$
-t_i \cdot (1-s_i)
$$



Analytically computing derivative of softmax with cross entropy



This document derives the derivative of softmax with cross entropy and it gets:



$$
s_i - t_i
$$



which is different from the one derived using the chain rule.



Implementation using numpy



I thought perhaps both derivatives would evaluate to the same result, and that I had missed some simplification that could be applied using assumptions (e.g. that probability distributions sum to 1).



This is the code to evaluate:



import numpy as np

x = np.array([-1.0, -1.0, 1.0])   # unscaled logits, my x vector
t = np.array([0.0, 1.0, 0.0])     # target probability distribution


## Function definitions

def softmax(v):
    exps = np.exp(v)
    total = np.sum(exps)
    return exps / total

def cross_entropy(inps, targets):
    return np.sum(-targets * np.log(inps))

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

def softmax_derivatives(softmax):
    return softmax * (1 - softmax)


soft = softmax(x)                     # [0.10650698, 0.10650698, 0.78698604]

cross_entropy(soft, t)                # 2.2395447662218846

cross_der = cross_entropy_derivatives(soft, t)   # [-0. , -9.3890561, -0. ]

soft_der = softmax_derivatives(soft)  # [0.09516324, 0.09516324, 0.16763901]

## Derivative using chain rule
cross_der * soft_der                  # [-0. , -0.89349302, -0. ]


## Derivative using analytical derivation

soft - t                              # [ 0.10650698, -0.89349302, 0.78698604]


Notice the difference in values.



My question, to clarify, is: what is the mistake that I am making? These two values should be quite similar.










calculus multivariable-calculus neural-networks

asked Jul 7 '18 at 7:00, edited Jul 7 '18 at 16:09 by harveyslash

• I have posted my own answer. If there are no better answers in a few days, I will accept it as the correct answer. – harveyslash, Jul 10 '18 at 6:45
1 Answer

There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.



Mistakes



1. Softmax Function and its derivative



I incorrectly stated that summing up the columns of the Jacobian is equivalent to computing

$$
\color{red}{\text{softmax}(1-\text{softmax})}
$$



The sum of the columns of the Jacobian for $s_0$ actually goes like this:



$$
s_0 - \sum_{i} s_0 \cdot s_i
$$



Factoring out $s_0$:



$$
s_0 - s_0 \sum_{i} s_i
$$



The sum of all the $s_i$ terms equals 1 (since the softmax outputs sum to 1).



Therefore we get:



$$
s_0 - s_0*1
$$



which is $0$



So, if the partials are summed up, we get 0. I will get back to why this makes sense later.
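This is easy to confirm numerically (a throwaway check, using the entry formula $s_i(\delta_{ij} - s_j)$ for the Jacobian from the question):

import numpy as np

v = np.array([-1.0, -1.0, 1.0])       # the same logits as in the code below
exps = np.exp(v)
s = exps / np.sum(exps)

J = np.diag(s) - np.outer(s, s)       # softmax Jacobian, entry (i, j) = s_i * (delta_ij - s_j)

print(J.sum(axis=1))                  # row sums:    ~[0, 0, 0]
print(J.sum(axis=0))                  # column sums: ~[0, 0, 0] (J is symmetric here)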



2. Jacobians shouldn't be summed



The Jacobian matrix should not be summed and then element-wise multiplied with the derivative coming from the layer above (the loss). Instead, that derivative should be matrix-multiplied with the Jacobian.



This means that the equation

$$
\color{red}{-\frac{t_i}{s_i} \cdot s_i(1-s_i)}
$$

which calculates the derivative using the chain rule, is INCORRECT.



It should actually be:

$$
-\frac{\mathbf{t}}{\mathbf{s}} \times \text{Softmax Jacobian}
$$

where $\mathbf{t}$ and $\mathbf{s}$ are vectors, the fraction bar denotes element-wise division between them, and the $\times$ symbol denotes matrix multiplication.
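Written out entry by entry with $\frac{\partial s_i}{\partial x_j} = s_i(\delta_{ij} - s_j)$, this matrix product collapses to the analytic result (using only the fact that $\sum_i t_i = 1$):

$$
\frac{\partial L}{\partial x_j}
= \sum_i \left(-\frac{t_i}{s_i}\right)\frac{\partial s_i}{\partial x_j}
= \sum_i \left(-\frac{t_i}{s_i}\right) s_i\left(\delta_{ij} - s_j\right)
= -\sum_i t_i\left(\delta_{ij} - s_j\right)
= -t_j + s_j\sum_i t_i
= s_j - t_j
$$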



Why summing up the partials results in 0



To understand that, we first need to understand what the Jacobian matrix signifies.



For element (0,0), it reads: "How does $s_0$ change when I change $x_0$?"

For element (1,0), it reads: "How does $s_1$ change when I change $x_0$?"

For element (2,0), it reads: "How does $s_2$ change when I change $x_0$?"




To get the total amount of change with respect to $x_0$, the above elements can be summed up (meaning we sum down that column, across the rows).



The same can be said about $x_1$ and $x_2$.



Just summing the columns up is equivalent to doing a matrix multiplication between a vector of $1$s and the softmax Jacobian.



This means the Jacobian would then tell how much the softmax would change if you changed all input values (i.e. all $x_i$) by the same amount.
Since softmax is a normalising function, changing all of the inputs by the same amount is equivalent to doing nothing!



In fact, the common "normalising trick" used to stabilise softmax adds a constant to every $x_i$ without changing the output values in any way.
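A tiny demonstration of that invariance (illustrative numbers only):

import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

v = np.array([-1.0, -1.0, 1.0])
c = 123.4                                       # any constant shift

print(softmax(v))                               # [0.10650698, 0.10650698, 0.78698604]
print(softmax(v - np.max(v)))                   # same values: the usual stability trick
print(np.allclose(softmax(v), softmax(v + c)))  # True: shifting every input leaves the output unchanged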



Since the change is 0, the gradient is 0.



In the case of a Jacobian matrix multiplication with the previous layer's derivative, a different 'weight' is assigned to each element of the Jacobian, which results in them not cancelling out.



Implementation in numpy



import numpy as np

v = np.array([-1.0, -1.0, 1.0])   # unscaled logits
t = np.array([0.0, 1.0, 0.0])     # target probability distribution


def softmax(v):
    exps = np.exp(v)
    total = np.sum(exps)
    return exps / total

def cross_entropy(inps, targets):
    return np.sum(-targets * np.log(inps))

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

# Fixed softmax derivative which returns the full Jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function

def softmax_derivatives(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)


soft = softmax(v)        # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft, t)   # 2.2395447662218846

cross_der = cross_entropy_derivatives(soft, t)
# [-0. , -9.3890561, -0. ]

soft_der = softmax_derivatives(soft)
# [[ 0.09516324, -0.01134374, -0.08381951],
#  [-0.01134374,  0.09516324, -0.08381951],
#  [-0.08381951, -0.08381951,  0.16763901]]


# derivative using chain rule
cross_der @ soft_der     # [ 0.10650698, -0.89349302, 0.78698604]


# Derivative using analytical derivation
soft - t                 # [ 0.10650698, -0.89349302, 0.78698604]


Now the derivative using the chain rule and the analytical derivative agree (well within the margin of floating point error).
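As one further sanity check (not part of the original code), a central finite-difference gradient of the composite loss cross_entropy(softmax(v), t) lands on the same vector:

import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

def loss(v, t):
    return np.sum(-t * np.log(softmax(v)))

v = np.array([-1.0, -1.0, 1.0])
t = np.array([0.0, 1.0, 0.0])
eps = 1e-6

grad_fd = np.array([
    (loss(v + eps * np.eye(3)[i], t) - loss(v - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)
])

print(grad_fd)           # ~[ 0.10650698, -0.89349302, 0.78698604]
print(softmax(v) - t)    # the analytic s - t, matching to within ~1e-9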






answered Jul 7 '18 at 15:21, edited Jul 9 '18 at 18:37 by harveyslash