Why divide by $2m$












22














I'm taking a machine learning course. The professor has a model for linear regression, where $h_\theta$ is the hypothesis (the proposed model; linear regression, in this case), $J(\theta_1)$ is the cost function, $m$ is the number of elements in the training set, and $x^{(i)}$ and $y^{(i)}$ are the variables of the $i$-th training set element.

$$h_\theta(x) = \theta_1 x$$

$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

What I don't understand is why he is dividing the sum by $2m$.

















      regression machine-learning
















      asked Aug 1 '14 at 17:42









      Daniel Pendergast























          4 Answers


















          24














          The $\frac{1}{m}$ is to "average" the squared error over the number of training examples so that the number of examples doesn't affect the scale of the function (see John's answer).

          So now the question is why there is an extra $\frac{1}{2}$. In short, it doesn't matter. The solution that minimizes $J$ as you have written it will also minimize $2J=\frac{1}{m} \sum_i (h(x_i)-y_i)^2$. The latter function, $2J$, may seem more "natural," but the factor of $2$ does not matter when optimizing.

          The only reason some authors like to include it is that when you take the derivative with respect to the parameter $\theta_1$, the $2$ that comes down from the square cancels the $\frac{1}{2}$.
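To make the cancellation concrete, here is the derivative worked out for the one-parameter model from the question, $h_\theta(x) = \theta_1 x$. This is only a sketch for that specific model, differentiating with respect to $\theta_1$:

$$
\frac{dJ}{d\theta_1}
= \frac{1}{2m}\sum_{i=1}^{m} 2\left(\theta_1 x^{(i)} - y^{(i)}\right)x^{(i)}
= \frac{1}{m}\sum_{i=1}^{m} \left(\theta_1 x^{(i)} - y^{(i)}\right)x^{(i)}
$$

Setting this to zero gives $\theta_1 = \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i \left(x^{(i)}\right)^2}$ (assuming at least one $x^{(i)} \neq 0$). Multiplying $J$ by any positive constant multiplies the derivative by the same constant, so the minimizer is unchanged.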



























          • Then why include it at all?
            – Daniel Pendergast
            Aug 1 '14 at 17:59










          • @DantheMan See the last sentence of my answer. After taking the derivative, the $2$ won't appear anymore, and since most of the computation is with the derivative, it saves some clutter.
            – angryavian
            Aug 1 '14 at 18:01












          • Ah, I understand now. I understood something else. Thank you.
            – Daniel Pendergast
            Aug 1 '14 at 18:04










          • I can't prove this, but I believe you, it makes derivative calculation easier, which we do in gradient descent for example.
            – mskw
            Sep 29 '17 at 14:48



















          7














          Dividing by $2m$ ensures that the cost function doesn't depend on the number of elements in the training set. This allows a better comparison across models.
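A small numerical sketch may help here. It assumes the one-parameter model $h_\theta(x) = \theta_1 x$ from the question; the helper names (`make_data`, `sum_squared_error`, `cost`) and the toy data are made up for illustration. The raw sum of squared errors grows with the training set size, while the cost scaled by $\frac{1}{2m}$ stays the same. Note that it is the $\frac{1}{m}$ that does this normalization; the $\frac{1}{2}$ is only a constant factor.

```python
def sum_squared_error(theta1, xs, ys):
    """Raw sum of squared errors for the model h(x) = theta1 * x."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys))


def cost(theta1, xs, ys):
    """Cost J(theta1) = (1 / (2m)) * sum of squared errors."""
    m = len(xs)
    return sum_squared_error(theta1, xs, ys) / (2 * m)


def make_data(m):
    """Toy data (hypothetical): y = 2x with an alternating error of +/- 0.5."""
    xs = [float(i) for i in range(1, m + 1)]
    ys = [2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
    return xs, ys


small = make_data(4)
large = make_data(400)

# The unscaled sum grows with the number of examples ...
print(sum_squared_error(2.0, *small), sum_squared_error(2.0, *large))  # 1.0 100.0
# ... while the scaled cost is the same for both data sets.
print(cost(2.0, *small), cost(2.0, *large))  # 0.125 0.125
```

Every residual in this toy data is exactly $\pm 0.5$, which is why the scaled cost comes out identical; with real data the two values would be close rather than equal.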



























          • How does that make it non-dependent? It looks like the 2m is just dividing the average up even more. Like a pie divided by 6, now we divide the pie by 2x6=12. Practically, the average is halved.
            – mskw
            Sep 29 '17 at 14:47








          • @mskw The accepted answer explains better where the $2$ comes from. Actually, the entire answer is better than mine was!
            – John
            Sep 29 '17 at 15:32










          • "This allows a better comparison across models" - Thanks, this is the answer I was looking for.
            – Amnon
            Mar 21 at 17:46



















          3














          I assume that the $\frac{1}{m}$ component is obvious and therefore I will focus on the $\frac{1}{2}$ part. I personally doubt that so many authors would decide to include this confusing term just to achieve slightly simpler gradient formulas. Note that there are ways of finding the solution to the linear regression equations that don't involve gradients. I will provide another explanation.



          When we try to evaluate the machine learning models we assume that our observations are not fully accurate but rather contain some kind of error. For example, imagine measuring a length using some low quality ruler. One of the simplest assumptions would be that we introduce some Gaussian error:



          $$
          \epsilon \sim \mathcal{N}(0, 1)
          $$



          Those parameters are usually safe because we perform some kind of data normalization anyway. We can now compute a probability that our prediction $\hat{y}$ equals our target value $y$ up to this measurement error:

          $$
          \hat{y} + \epsilon = y
          $$

          We can treat $\hat{y} + \epsilon$ as a new random variable $\widetilde{y} \sim \mathcal{N}(\hat{y}, 1)$. We have just added a constant $\hat{y}$ to our zero-centered random variable $\epsilon$. This random variable $\widetilde{y}$ is our probabilistic estimation of the observation. Instead of stating that for given input $x$ we will observe the output $\hat{y}$ (which would not be true due to the errors) we state that we will most probably observe something around $\hat{y}$. We can compute the likelihood of actually observing the $\hat{y}$ or $y$ as well as any other number using the Gaussian PDF:



          $$
          p(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp\left(\frac{-\left(x - \mu\right)^2}{2\sigma^2}\right)
          $$

          In our case $\mu = \hat{y}$ and $\sigma = 1$:

          $$
          p(y) = \frac{1}{\sqrt{2\pi}}\exp\left(\frac{-\left(y - \hat{y}\right)^2}{2}\right)
          $$



          Note that this is the function that we would actually like to maximize - the probability of observing the true value $y$ given our model. Since our main goal is maximization we can apply a monotone function like logarithm and ignore the constants.



          $$
          \log p(y) = \frac{-\left(y - \hat{y}\right)^2}{2} + \text{const}
          $$



          Once we get rid of the constant and the minus sign we obtain the squared error term for a single example in our dataset. We can average over all of the examples to get the MSE formula.



          $$
          \mathrm{MSE}(y, \hat{y}) = \frac{1}{2m}\sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2
          $$



          Note that we can similarly derive the formula for the logistic regression loss, i.e. cross-entropy or log-loss.
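The link between the Gaussian likelihood and the halved squared error can be checked numerically. The sketch below assumes unit variance ($\sigma = 1$) as in the derivation above; the function names and the sample values are made up for illustration. It confirms that the negative log-likelihood equals $\frac{1}{2}\sum_i (y^{(i)} - \hat{y}^{(i)})^2$ plus a constant that does not depend on the predictions.

```python
import math


def gaussian_pdf(y, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def neg_log_likelihood(ys, y_hats):
    """Negative log-likelihood of the targets under N(y_hat, 1) for each example."""
    return -sum(math.log(gaussian_pdf(y, y_hat)) for y, y_hat in zip(ys, y_hats))


def half_squared_error(ys, y_hats):
    """(1/2) * sum of squared errors, i.e. the cost before dividing by m."""
    return 0.5 * sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))


# Hypothetical targets and predictions, chosen only for illustration.
ys = [1.0, 2.5, 3.0]
y_hats = [1.2, 2.0, 3.5]

m = len(ys)
const = 0.5 * m * math.log(2 * math.pi)  # independent of the predictions
nll = neg_log_likelihood(ys, y_hats)

# The two quantities differ only by the constant.
print(math.isclose(nll, half_squared_error(ys, y_hats) + const))
```

Dividing by $m$ afterwards only rescales the objective, so maximizing the likelihood and minimizing the $\frac{1}{2m}$-scaled cost pick the same predictions.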



































            1














            I wondered about the exact same thing when taking this course, and ended up researching this a bit. I'll give a short answer here, but you can read a more detailed overview in a blog post I wrote about it.



            I believe that at least part of the reason for those scaling coefficients is that L² regularization probably entered the field of deep learning through the introduction of the related, but not identical, concept of weight decay.



            The 0.5 factor is then there to get a nice λ-only coefficient for the weight decay in the gradient, and the scaling by m... well, there are at least 5 different motivations that I have found or come up with:





            1. A side-effect of batch gradient descent: When a single iteration of gradient descent is instead formalized over the entire training set, resulting in the algorithm sometimes called batch gradient descent, the scaling factor of 1/m, introduced to make the cost function comparable across different size datasets, gets automatically applied to the weight decay term.


            2. Rescale to the weight of a single example: See grez's interesting intuition.


            3. Training set representativeness: It makes sense to scale down regularization as the size of the training set grows, as statistically, its representativeness of the overall distribution also grows. Basically, the more data we have, the less regularization is needed.


            4. Making λ comparable: By hopefully mitigating the need to change λ when m changes, this scaling makes λ itself comparable across datasets of different sizes. This makes λ a more representative estimator of the actual degree of regularization required by a specific model on a specific learning problem.


            5. Empirical value: The great notebook by grez demonstrates that this improves performance in practice.
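To illustrate the point about the 0.5 factor, here is a minimal sketch. It assumes a regularized cost of the form $J(\theta_1) = \frac{1}{2m}\sum_i (\theta_1 x^{(i)} - y^{(i)})^2 + \frac{\lambda}{2m}\theta_1^2$ for the one-parameter model from the question; that exact form is an assumption, since the answer does not write it out, and the function names and data are hypothetical. With the $\frac{1}{2}$ in place, the penalty contributes a clean $\frac{\lambda}{m}\theta_1$ to the gradient, which the code checks against a finite-difference estimate.

```python
def regularized_cost(theta1, xs, ys, lam):
    """Assumed form: (1/(2m)) * sum of squared errors + (lam/(2m)) * theta1^2."""
    m = len(xs)
    data_term = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    penalty = lam * theta1 ** 2 / (2 * m)
    return data_term + penalty


def regularized_gradient(theta1, xs, ys, lam):
    """Analytic derivative: the 1/2 cancels, leaving (lam/m) * theta1 for the penalty."""
    m = len(xs)
    data_grad = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    return data_grad + (lam / m) * theta1


# Hypothetical data and settings, chosen only for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
theta1, lam, eps = 1.5, 0.3, 1e-6

# Central finite difference as an independent check of the analytic gradient.
numeric = (
    regularized_cost(theta1 + eps, xs, ys, lam)
    - regularized_cost(theta1 - eps, xs, ys, lam)
) / (2 * eps)
analytic = regularized_gradient(theta1, xs, ys, lam)

print(abs(numeric - analytic) < 1e-6)
```

Without the $\frac{1}{2}$, the same derivative would carry a factor of $2$ on both terms, which is harmless for optimization but adds clutter to the update rule.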
















              answered Aug 1 '14 at 17:50 by angryavian (the 24-vote answer above)






















              answered Aug 1 '14 at 17:48 by John (the 7-vote answer above)












              • How does that make it non-dependent? It looks like the 2m is just dividing the average up even more. like a pie dividing by 6, now we divide the pie by 2x6=12. Pratically, the average halved.
                – mskw
                Sep 29 '17 at 14:47








              • 1




                @mskw The accepted answer explains better where the $2$ comes from. Actually, the entire answer is better than mine was!
                – John
                Sep 29 '17 at 15:32










              • "This allows a better comparison across models" - Thanks, this is the answer I was looking for.
                – Amnon
                Mar 21 at 17:46


















              • How does that make it non-dependent? It looks like the 2m is just dividing the average up even more. like a pie dividing by 6, now we divide the pie by 2x6=12. Pratically, the average halved.
                – mskw
                Sep 29 '17 at 14:47








              • 1




                @mskw The accepted answer explains better where the $2$ comes from. Actually, the entire answer is better than mine was!
                – John
                Sep 29 '17 at 15:32










              • "This allows a better comparison across models" - Thanks, this is the answer I was looking for.
                – Amnon
                Mar 21 at 17:46
















              How does that make it non-dependent? It looks like the 2m is just dividing the average up even more. like a pie dividing by 6, now we divide the pie by 2x6=12. Pratically, the average halved.
              – mskw
              Sep 29 '17 at 14:47






              How does that make it non-dependent? It looks like the 2m is just dividing the average up even more. like a pie dividing by 6, now we divide the pie by 2x6=12. Pratically, the average halved.
              – mskw
              Sep 29 '17 at 14:47






              1




              1




              @mskw The accepted answer explains better where the $2$ comes from. Actually, the entire answer is better than mine was!
              – John
              Sep 29 '17 at 15:32




              @mskw The accepted answer explains better where the $2$ comes from. Actually, the entire answer is better than mine was!
              – John
              Sep 29 '17 at 15:32












              "This allows a better comparison across models" - Thanks, this is the answer I was looking for.
              – Amnon
              Mar 21 at 17:46




              "This allows a better comparison across models" - Thanks, this is the answer I was looking for.
              – Amnon
              Mar 21 at 17:46











              3














              I assume that the $frac{1}{m}$ component is obvious and therefore I will focus on the $frac{1}{2}$ part. I personally doubt that so many authors would decide to include this confusing term just achieve a little bit simpler gradient formulas. Note that there are ways of finding the solution to the linear regression equations that doesn't involve gradients. I will provide another explanation.



              When we try to evaluate the machine learning models we assume that our observations are not fully accurate but rather contain some kind of error. For example, imagine measuring a length using some low quality ruler. One of the simplest assumptions would be that we introduce some Gaussian error:



              $$
              epsilon thicksim mathcal{N}(0, 1)
              $$



              Those parameters are usually safe because we perform some kind of data normalization anyway. We can now compute a probability that our prediction $hat{y}$ equals our target value $y$ up to this measurement error:



              $$
              hat{y} + epsilon = y
              $$



              We can treat $hat{y} + epsilon$ as a new random variable $widetilde{y} sim mathcal{N}(hat{y}, 1)$. We have just added a constant $hat{y}$ to our zero-centered random variable $epsilon$. This random variable $widetilde{y}$ is our probabilistic estimation of the observation. Instead of stating that for given input $x$ we will observe the output $hat{y}$ (which would not be true due to the errors) we state that we will most probably observe something around $hat{y}$. We can compute the likelihood of actually observing the $hat{y}$ or $y$ as well as any other number using the Gaussian PDF:



              $$
              p(x) = frac{1}{{sigma sqrt {2pi } }}expleft({{frac{ - left( {x - mu } right)^2 }{2sigma^2}}}right) \
              $$



              In our case $mu = hat{y}$ and $sigma = 1$:



              $$
              p(y) = frac{1}{{sqrt {2pi } }}expleft({{frac{ - left( {y - hat{y} } right)^2 }{2}}}right) \
              $$



              Note that this is the function that we would actually like to maximize - the probability of observing the true value $y$ given our model. Since our main goal is maximization we can apply a monotone function like logarithm and ignore the constants.



              $$
              log~p(y) = frac{ - left( {y - hat{y} } right)^2 }{2} + const
              $$



              Once we get rid of the constant and the minus sign we obtain the squared error term for a single example in our dataset. We can average over all of the examples to get the MSE formula.



              $$
              MSE(y, hat{y}) = frac{1}{2m}sum_i^m (y - hat{y})^2
              $$



              Note that we can similarly derive the formula for the logistic regression loss, i.e. cross-entropy or log-loss.






              share|cite|improve this answer


























                3














                I assume that the $frac{1}{m}$ component is obvious and therefore I will focus on the $frac{1}{2}$ part. I personally doubt that so many authors would decide to include this confusing term just achieve a little bit simpler gradient formulas. Note that there are ways of finding the solution to the linear regression equations that doesn't involve gradients. I will provide another explanation.



                When we try to evaluate the machine learning models we assume that our observations are not fully accurate but rather contain some kind of error. For example, imagine measuring a length using some low quality ruler. One of the simplest assumptions would be that we introduce some Gaussian error:



                $$
                epsilon thicksim mathcal{N}(0, 1)
                $$



                Those parameters are usually safe because we perform some kind of data normalization anyway. We can now compute a probability that our prediction $hat{y}$ equals our target value $y$ up to this measurement error:



                $$
                hat{y} + epsilon = y
                $$



                We can treat $hat{y} + epsilon$ as a new random variable $widetilde{y} sim mathcal{N}(hat{y}, 1)$. We have just added a constant $hat{y}$ to our zero-centered random variable $epsilon$. This random variable $widetilde{y}$ is our probabilistic estimation of the observation. Instead of stating that for given input $x$ we will observe the output $hat{y}$ (which would not be true due to the errors) we state that we will most probably observe something around $hat{y}$. We can compute the likelihood of actually observing the $hat{y}$ or $y$ as well as any other number using the Gaussian PDF:



                $$
                p(x) = frac{1}{{sigma sqrt {2pi } }}expleft({{frac{ - left( {x - mu } right)^2 }{2sigma^2}}}right) \
                $$



                In our case $mu = hat{y}$ and $sigma = 1$:



                $$
                p(y) = frac{1}{{sqrt {2pi } }}expleft({{frac{ - left( {y - hat{y} } right)^2 }{2}}}right) \
                $$



                Note that this is the function that we would actually like to maximize - the probability of observing the true value $y$ given our model. Since our main goal is maximization we can apply a monotone function like logarithm and ignore the constants.



                $$
                log~p(y) = frac{ - left( {y - hat{y} } right)^2 }{2} + const
                $$



                Once we get rid of the constant and the minus sign we obtain the squared error term for a single example in our dataset. We can average over all of the examples to get the MSE formula.



                $$
                \mathrm{MSE}(y, \hat{y}) = \frac{1}{2m}\sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2
                $$
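To illustrate the equivalence, here is a small sketch for the one-parameter hypothesis $h_\theta(x) = \theta x$ from the question. The toy data, the grid search, and the function names are assumptions made for this example, and the examples are treated as independent so that their log-likelihoods add up.

```python
import math

# Toy data (assumed for illustration), roughly y = 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
m = len(xs)

CONST = -0.5 * math.log(2.0 * math.pi)  # per-example constant for sigma = 1

def cost(theta):
    """J(theta) = 1/(2m) * sum of squared errors for h(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def log_likelihood(theta):
    """Sum of log N(y | theta * x, 1) over all examples."""
    return sum(CONST - 0.5 * (y - theta * x) ** 2 for x, y in zip(xs, ys))

# Coarse grid search over theta in [0, 4]; enough to compare the optimizers.
grid = [i / 100 for i in range(0, 401)]
best_by_cost = min(grid, key=cost)
best_by_likelihood = max(grid, key=log_likelihood)

print(best_by_cost == best_by_likelihood)
```

For every $\theta$ the two quantities are related by $\log L(\theta) = -m\,J(\theta) + m \cdot \text{const}$, so the minimizer of $J$ and the maximizer of the likelihood coincide. Any positive rescaling of $J$, including dropping the $\frac{1}{2}$ or the $\frac{1}{m}$, would leave that optimal $\theta$ unchanged.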



                Note that we can similarly derive the formula for the logistic regression loss, i.e. cross-entropy or log-loss.






                  answered Jul 1 at 17:28









                  pkubik

                      I wondered about the exact same thing when taking this course, and ended up researching this a bit. I'll give a short answer here, but you can read a more detailed overview in a blog post I wrote about it.



                      I believe that at least part of the reason for those scaling coefficients is that L² regularization probably entered the field of deep learning through the introduction of the related, but not identical, concept of weight decay.



                      The 0.5 factor is then there to get a nice λ-only coefficient for the weight decay in the gradient. As for the scaling by m, there are at least five different motivations that I have found or come up with:





                      1. A side effect of batch gradient descent: When a single iteration of gradient descent is formalized over the entire training set (the algorithm sometimes called batch gradient descent), the scaling factor of 1/m, introduced to make the cost function comparable across datasets of different sizes, is automatically applied to the weight decay term as well.


                      2. Rescale to the weight of a single example: See grez's interesting intuition.


                      3. Training set representativeness: It makes sense to scale down regularization as the size of the training set grows, as statistically, its representativeness of the overall distribution also grows. Basically, the more data we have, the less regularization is needed.


                      4. Making λ comparable: By hopefully mitigating the need to change λ when m changes, this scaling makes λ itself comparable across datasets of different sizes. This makes λ a more representative estimator of the actual degree of regularization required by a specific model on a specific learning problem.


                      5. Empirical value: The great notebook by grez demonstrates that this improves performance in practice.
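The role of the 0.5 factor and the 1/m scaling in the regularization term can be sketched as follows. This assumes the penalty has the common form $\frac{\lambda}{2m}\sum_j w_j^2$, which the answer implies but does not write out; the function names and numeric values are illustrative only.

```python
def l2_penalty(w, lam, m):
    """Assumed penalty form: lam / (2m) * sum of squared weights."""
    return lam / (2.0 * m) * sum(wi * wi for wi in w)

def l2_penalty_grad(w, lam, m):
    """Analytic gradient: the 2 from the power rule cancels the 1/2,
    leaving a (lam / m) * w_j weight decay term per weight."""
    return [lam / m * wi for wi in w]

def numeric_grad(w, lam, m, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for j in range(len(w)):
        up = list(w)
        down = list(w)
        up[j] += eps
        down[j] -= eps
        grads.append((l2_penalty(up, lam, m) - l2_penalty(down, lam, m)) / (2 * eps))
    return grads

# Illustrative values (assumed): three weights, lambda = 0.1, m = 50 examples.
w, lam, m = [0.5, -1.5, 2.0], 0.1, 50

analytic = l2_penalty_grad(w, lam, m)
numeric = numeric_grad(w, lam, m)
print(all(abs(a - n) < 1e-8 for a, n in zip(analytic, numeric)))
```

Without the 0.5 the gradient would carry an extra factor of 2, and without the 1/m the decay applied to each weight would not shrink relative to the data term as the training set grows.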






                          answered Nov 28 at 13:16









                          ShayPal5
