Derivative of Softmax without cross entropy












There are several resources that show how to find the derivatives of the softmax + cross_entropy loss together. However, I want to derive the derivatives separately.



For the purposes of this question, I will use a fixed input vector containing 4 values.



Input vector



$$left [ x_{0}, quad x_{1}, quad x_{2}, quad x_{3}right ]$$



Softmax Function and Derivative



My softmax function is defined as:



$$\left[ \frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right]$$



Since each element of the output vector depends on all the values of the input vector, it makes sense that the gradient of each output element will involve all the input values.



My jacobian is this (writing $\Sigma = e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}$ for the shared denominator):



$$
\left[\begin{matrix}
\frac{e^{x_{0}}}{\Sigma} - \frac{e^{2 x_{0}}}{\Sigma^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{\Sigma^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{\Sigma^{2}} \\
- \frac{e^{x_{0}} e^{x_{1}}}{\Sigma^{2}} & \frac{e^{x_{1}}}{\Sigma} - \frac{e^{2 x_{1}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\Sigma^{2}} \\
- \frac{e^{x_{0}} e^{x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\Sigma^{2}} & \frac{e^{x_{2}}}{\Sigma} - \frac{e^{2 x_{2}}}{\Sigma^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\Sigma^{2}} \\
- \frac{e^{x_{0}} e^{x_{3}}}{\Sigma^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\Sigma^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\Sigma^{2}} & \frac{e^{x_{3}}}{\Sigma} - \frac{e^{2 x_{3}}}{\Sigma^{2}}
\end{matrix}\right]
$$
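As a sanity check on the matrix above, each entry has the closed form $\frac{\partial s_{i}}{\partial x_{j}} = s_{i}(\delta_{ij} - s_{j})$ (where $s_i$ denotes the $i$-th softmax output), which can be compared against finite differences. A rough numpy sketch (the input values and helper names here are arbitrary, just for the check):

import numpy as np

def softmax(v):
    exps = np.exp(v)
    return exps / np.sum(exps)

x = np.array([0.5, -1.2, 0.3, 2.0])       # an arbitrary 4-element input
s = softmax(x)

# closed form of the matrix above: J[i, j] = s_i * (delta_ij - s_j)
J_closed_form = np.diagflat(s) - np.outer(s, s)

# central finite differences, one input (column) at a time
eps = 1e-6
J_numeric = np.zeros((4, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    J_numeric[:, j] = (softmax(x + dx) - softmax(x - dx)) / (2 * eps)

print(np.allclose(J_closed_form, J_numeric, atol=1e-8))   # True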



Each row contains the contribution from each output element. To calculate the 'final' derivative of each node, I sum up all the elements in each row, to get a vector which is the same size as my input vector.



Summing the values directly runs into numerical stability issues and gives unstable results.
However, it is quite easy to reduce the sum of each row to a simpler expression.



Notice that, apart from the first term (the only positive term) in each row, summing all the negative terms is equivalent to computing:



$$\sum_{i} \mathrm{softmax}_{x_0} * \mathrm{softmax}_{x_i}$$



and the first term is just $\mathrm{softmax}_{x_0}$.



Which means the derivative of softmax is:



$$\mathrm{softmax} - \mathrm{softmax}^2$$



or



$$\mathrm{softmax}(1-\mathrm{softmax})$$



This seems correct, and Geoff Hinton's video (at time 4:07) has this same solution. This answer also seems to arrive at the same equation as mine.



Cross Entropy Loss and its derivative



The cross entropy takes as input the softmax vector and a 'target' probability distribution.



$$\left[ t_{0}, \quad t_{1}, \quad t_{2}, \quad t_{3}\right]$$



Let the softmax output at index $i$ be denoted as $s_i$, so the full softmax vector is:



$$\left[ s_{0}, \quad s_{1}, \quad s_{2}, \quad s_{3}\right]$$



Cross entropy function



$$
- \sum_{i}^{\text{classes}} t_i \log(s_i)
$$
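For example, with a hypothetical one-hot target such as $t = \left[ 0, \quad 1, \quad 0, \quad 0 \right]$, the sum collapses to a single term:

$$
- \sum_{i}^{\text{classes}} t_i \log(s_i) = -\log(s_1)
$$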



For our case it is



$$
- t_{0} \log{\left(\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right)} - t_{1} \log{\left(\frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right)} - t_{2} \log{\left(\frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right)} - t_{3} \log{\left(\frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right)}
$$



Derivative of cross entropy



Using the simple multiplication rule along with the log rule, the derivative of cross entropy with respect to each softmax output $s_i$ is:



$$
-\frac{t_i}{s_i}
$$
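As a quick numerical check of this expression (a small sketch with made-up values for $\mathbf{s}$ and $\mathbf{t}$, treating each $s_i$ as a free variable), central finite differences of the loss reproduce $-\frac{t_i}{s_i}$:

import numpy as np

s = np.array([0.2, 0.5, 0.3])   # hypothetical softmax output
t = np.array([0.0, 1.0, 0.0])   # hypothetical one-hot target
eps = 1e-6

loss = lambda s: np.sum(-t * np.log(s))

# perturb each s_i independently and take central differences
numeric = np.array([
    (loss(s + eps * np.eye(3)[i]) - loss(s - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

print(numeric)   # approximately [ 0., -2.,  0.]
print(-t / s)    # [-0., -2., -0.]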



Using chain rule to get derivative of softmax with cross entropy



We can just multiply the cross entropy derivative (the derivative of the loss with respect to the softmax output) with the softmax derivative (the derivative of the softmax output with respect to the input) to get:



$$
-\frac{t_i}{s_i} * s_i(1-s_i)
$$



Simplifying, it gives



$$
-t_i * (1-s_i)
$$



Analytically computing derivative of softmax with cross entropy



This document derives the derivative of softmax with cross entropy and it gets:



$$
s_i - t_i
$$



Which is different from the one derived using chain rule.



Implementation using numpy



I thought perhaps both the derivatives would evaluate to the same result, and that I had missed some simplification that could be applied using assumptions (e.g. that probability distributions sum up to 1).



This is the code to evaluate:



import numpy as np

x = np.array([-1.0, -1.0, 1.0])    # unscaled logits, my x vector
t = np.array([0.0, 1.0, 0.0])      # target probability distribution


## Function definitions

def softmax(v):
    exps = np.exp(v)
    total = np.sum(exps)
    return exps / total

def cross_entropy(inps, targets):
    return np.sum(-targets * np.log(inps))

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

def softmax_derivatives(softmax):
    return softmax * (1 - softmax)


soft = softmax(x)          # [0.10650698, 0.10650698, 0.78698604]

cross_entropy(soft, t)     # 2.2395447662218846

cross_der = cross_entropy_derivatives(soft, t)   # [-0. , -9.3890561, -0. ]

soft_der = softmax_derivatives(soft)             # [0.09516324, 0.09516324, 0.16763901]

## Derivative using chain rule
cross_der * soft_der       # [-0. , -0.89349302, -0. ]


## Derivative using analytical derivation

soft - t                   # [ 0.10650698, -0.89349302,  0.78698604]


Notice the difference in values.



My question, to clarify, is: what is the mistake that I am making? These two values should be quite similar.










calculus multivariable-calculus neural-networks

asked Jul 7 '18 at 7:00, edited Jul 7 '18 at 16:09 – harveyslash

  • I have posted my own answer. If there are no better answers in a few days, I will accept it as the correct answer. – harveyslash, Jul 10 '18 at 6:45

1 Answer

There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.



Mistakes



1. Softmax Function and its derivative



I incorrectly stated that summing up the columns of the jacobian



is equivalent to doing



$$
\color{red}{\mathrm{softmax}(1-\mathrm{softmax})}
$$



The sum of the columns of the jacobian for $s_0$ actually goes like this:



$$
s_0 - \sum_{i} s_0 * s_i
$$



Taking $s_0$ common:



$$
s_0 - s_0 \sum_{i} s_i
$$



Summation of all $s_i$ terms will equal 1 (since sum of softmax outputs is 1).



Therefore we get:



$$
s_0 - s_0*1
$$



which is $0$



So, if the partials are summed up, we get 0. I will get back to why this makes sense later.
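A quick numerical illustration of this, using the closed-form jacobian $\operatorname{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^T$ (the same formula used in the code further below) and the softmax values from the question:

import numpy as np

s = np.array([0.10650698, 0.10650698, 0.78698604])   # softmax of [-1, -1, 1]
J = np.diagflat(s) - np.outer(s, s)                  # softmax jacobian

print(J.sum(axis=1))    # each row sums to ~0
print(np.ones(3) @ J)   # a vector of 1s times the jacobian is also ~0 (J is symmetric here)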



2. Jacobians shouldn't be summed



The jacobian matrix should not be summed and then element-wise multiplied with the derivative of the previous error. Instead, a matrix product should be done with the jacobian of the previous layer.



This means that the equation



$$
\color{red}{-\frac{t_i}{s_i} * s_i(1-s_i)}
$$



which calculates the derivative using the chain rule,



is INCORRECT.



It should actually be:



$$
-\frac{\mathbf{t}}{\mathbf{s}} \times \text{Softmax Jacobian}
$$



where $\mathbf{t}$ and $\mathbf{s}$ are vectors, the fraction bar denotes element-wise division between them,



and the $\times$ symbol denotes matrix multiplication.
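For completeness, carrying this product out by hand shows why it reproduces the analytical result. Using the jacobian entries $\frac{\partial s_i}{\partial x_j} = s_i(\delta_{ij} - s_j)$ and assuming the targets sum to 1, the $j$-th component of the product is:

$$
\sum_i \left(-\frac{t_i}{s_i}\right)\frac{\partial s_i}{\partial x_j}
= \sum_i -t_i\left(\delta_{ij} - s_j\right)
= -t_j + s_j \sum_i t_i
= s_j - t_j
$$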



Why summing up the partials results in 0



To understand that, we need to first understand what the jacobian matrix signifies.



For element (0,0), it reads:




How does $s_0$ change when I change $x_0$




For element (1,0), it reads like this:




How does $s_1$ change when I change $x_0$




For element (2,0), it reads like this:




How does $s_2$ change when I change $x_0$




To get the total amount of change on $x_0$, the above elements can be summed up (meaning we do a sum across rows).



The same can be said about $x_1$ and $x_2$.



Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax jacobian.



This means the jacobian would tell how much the softmax output would change if you changed all of the inputs (i.e. all $x_i$) by the same value.
Since softmax is a normalising function, changing the values of all inputs by the same amount is equivalent to doing nothing!



In fact, the common "normalising trick" done to stabilise softmax adds a constant to every $x_i$ without changing the output values in any way.
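A minimal sketch of that trick (the function name here is arbitrary):

import numpy as np

def softmax_stable(v):
    # subtracting max(v) shifts every x_i by the same constant, which leaves
    # the softmax output unchanged but prevents overflow in exp
    shifted = v - np.max(v)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax_stable(np.array([-1.0, -1.0, 1.0])))       # [0.10650698, 0.10650698, 0.78698604]
print(softmax_stable(np.array([999.0, 999.0, 1001.0])))  # same output, no overflow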



Since the change is 0, the gradient is 0.



In the case of the jacobian matrix multiplied with the previous layer, different 'weights' are assigned to each element in the jacobian, which results in them not cancelling out.



Implementation in numpy



import numpy as np

v = np.array([-1.0, -1.0, 1.0])    # unscaled logits
t = np.array([0.0, 1.0, 0.0])      # target probability distribution


def softmax(v):
    exps = np.exp(v)
    total = np.sum(exps)
    return exps / total

def cross_entropy(inps, targets):
    return np.sum(-targets * np.log(inps))

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

# Fixed softmax derivative which returns the jacobian instead
# see https://stackoverflow.com/questions/33541930/how-to-implement-the-softmax-derivative-independently-from-any-loss-function

def softmax_derivatives(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)


soft = softmax(v)          # [0.10650698, 0.10650698, 0.78698604]
cross_entropy(soft, t)     # 2.2395447662218846

cross_der = cross_entropy_derivatives(soft, t)
# [-0. , -9.3890561, -0. ]

soft_der = softmax_derivatives(soft)
# [[ 0.09516324, -0.01134374, -0.08381951],
#  [-0.01134374,  0.09516324, -0.08381951],
#  [-0.08381951, -0.08381951,  0.16763901]]


# derivative using chain rule
cross_der @ soft_der       # [ 0.10650698, -0.89349302,  0.78698604]


# Derivative using analytical derivation
soft - t                   # [ 0.10650698, -0.89349302,  0.78698604]


Now the derivative using the chain rule and the analytical derivative agree (well within the margin of floating point error).
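As an extra end-to-end check (reusing the definitions from the code above), a finite-difference gradient of the composed loss cross_entropy(softmax(v), t) with respect to v should give approximately the same vector:

# central finite differences of the composed loss with respect to v
eps = 1e-6
numeric_grad = np.array([
    (cross_entropy(softmax(v + eps * np.eye(3)[i]), t) -
     cross_entropy(softmax(v - eps * np.eye(3)[i]), t)) / (2 * eps)
    for i in range(3)
])

print(numeric_grad)   # approximately [ 0.10650698, -0.89349302,  0.78698604]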






edited Jul 9 '18 at 18:37, answered Jul 7 '18 at 15:21 – harveyslash