Why log-transform to normal distribution for decision trees?

On page 304 of chapter 8 of An Introduction to Statistical Learning with Applications in R (James et al.), the authors say:

> We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

No additional motivation for the log-transform is given. Since the data are being fed into a decision tree algorithm, why was it important to push Salary toward a normal distribution? I thought most, if not all, decision tree algorithms were invariant to scale changes.

machine-learning cart

edited Jan 2 at 3:38 by Sycorax
asked Jan 2 at 2:50 by jss367

2 Answers

In this case, Salary is the target (dependent variable/outcome) of the decision tree, not one of the features (independent variables/predictors). You are correct that decision trees are insensitive to the scale of the predictors. However, I suspect there are a small number of extremely large salaries, so transforming the target might improve predictions: a loss function that minimizes squared error will then be less strongly influenced by these large values.
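
To make the distinction between transforming a predictor and transforming the target concrete, here is a minimal sketch in plain Python of the squared-error split search that a regression tree performs at a single node. It is not the code of any particular tree library. The six Years/Salary values are made up for illustration (they are not taken from Hitters), and `sse` and `best_split` are hypothetical helper names.

```python
import math


def sse(values):
    """Sum of squared deviations from the mean: the squared-error loss of one leaf."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, n_left) for the single split on x that minimizes total SSE of y."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # cannot split between identical predictor values
        loss = sse(ys[:k]) + sse(ys[k:])
        if best is None or loss < best[0]:
            best = (loss, (xs[k - 1] + xs[k]) / 2, k)
    return best[1], best[2]


# Toy data (assumed, not from Hitters): salary in thousands, one large value.
years = [1, 2, 3, 4, 5, 6]
salary = [100, 110, 120, 500, 520, 1500]

raw_split = best_split(years, salary)
log_target_split = best_split(years, [math.log(s) for s in salary])
log_feature_split = best_split([math.log(v) for v in years], salary)

print("raw target:     ", raw_split)
print("log target:     ", log_target_split)
print("log predictor:  ", log_feature_split)
```

On this toy data the raw target puts the split at 5.5, which isolates the single large salary; the log target moves the split to 3.5; and taking the log of the predictor leaves the same five observations on the left, so the partition is unchanged.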







edited Jan 2 at 16:59
answered Jan 2 at 3:36 by Sycorax

I downloaded last year's salaries. It is very likely they follow a Pareto distribution. The histogram is shown below.

[Figure: histogram of salaries]



The pdf of the Pareto distribution is $$\frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}.$$
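
As a worked step that is not part of the original answer: integrating $x^k$ against this density over $x \ge x_m$ shows which moments exist.

$$E[X^k] = \int_{x_m}^{\infty} x^{k}\,\frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}\,dx = \alpha x_m^{\alpha}\int_{x_m}^{\infty} x^{k-\alpha-1}\,dx = \frac{\alpha x_m^{k}}{\alpha-k} \quad \text{for } k<\alpha,$$

and the integral diverges for $k \ge \alpha$. So the mean ($k=1$) is finite only when $\alpha>1$, and the second moment ($k=2$), hence the variance, only when $\alpha>2$.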



The scale parameter $x_m$ is \$545,000, the lowest salary last year. I estimated the shape parameter, $\alpha$, as 0.7848238 using MLE. This matters because when $\alpha<2$, then the distribution has no variance. More properly, its variance is undefined. If any of your variables lacks a mean or a variance, then you cannot use anything that minimizes squared loss.
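
The answer does not show how the estimate was computed, so the following is only a sketch of one standard route and may not be what the author ran. With the scale fixed at $x_m$, the maximum-likelihood estimator of the shape is $\hat\alpha = n / \sum_i \ln(x_i / x_m)$. The code applies it to simulated draws, because the actual salary file is not available here; `pareto_alpha_mle`, the seed, the sample size, and `true_alpha` are all assumptions.

```python
import math
import random


def pareto_alpha_mle(samples, x_m):
    """MLE of the Pareto shape for a known scale x_m: n / sum(log(x / x_m))."""
    return len(samples) / sum(math.log(x / x_m) for x in samples)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_alpha = 0.78        # assumed shape, close to the value reported above
x_m = 545_000            # scale reported above (lowest salary)

# random.paretovariate draws from a Pareto with scale 1, so rescale by x_m.
simulated = [x_m * rng.paretovariate(true_alpha) for _ in range(20_000)]

alpha_hat = pareto_alpha_mle(simulated, x_m)
print("estimated shape:", alpha_hat)
```

With real data, $x_m$ would itself be estimated by the sample minimum, as the answer does.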



The distribution of the log of the variable does have a variance, and so you can use least-squares-style methodologies on it. This is actually a serious omission from your textbook. Some things, like stock market returns, which have neither a mean nor a variance, or baseball salaries, which lack a variance, will make OLS models meaningless. The log is not, inherently, the best treatment, but it does work.
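
A small simulation can illustrate the contrast, under the assumption that salaries really are Pareto with the parameters reported in this answer. The draws are simulated, not the downloaded salaries, and the seed and sample sizes are arbitrary. The sample variance of the raw values has no finite target to settle on, whereas the sample variance of the logs should sit near $1/\alpha^2$.

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
alpha = 0.7848238        # shape reported in the answer
x_m = 545_000            # scale reported in the answer

# Simulated Pareto(x_m, alpha) draws; paretovariate uses scale 1, so rescale.
draws = [x_m * rng.paretovariate(alpha) for _ in range(20_000)]
logs = [math.log(x) for x in draws]

# Sample variances on growing prefixes of the same simulated sample.
for n in (200, 2_000, 20_000):
    raw_var = statistics.variance(draws[:n])
    log_var_n = statistics.variance(logs[:n])
    print(f"n={n:>6}  var(raw)={raw_var:.3e}  var(log)={log_var_n:.3f}")

log_var = statistics.variance(logs)
print("theoretical var(log):", 1 / alpha ** 2)
```

The raw-scale numbers depend heavily on whichever extreme draws happen to fall in each prefix, which is the practical face of an undefined variance.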



Taking the log does not give you a bell shape.

[Figure: histogram of log salaries]



This is entirely about being certain that all of your data have a variance. If all of the assumptions for OLS are met, then the underlying distributions do not matter. They can be insane looking, but the variance has to be defined everywhere.



EDIT: As Therkel pointed out in the comments, when $\alpha<1$ no mean exists either. There is a comment by Cliff AB that I should take up as well. He argues that the distribution is doubly bounded and so a finite variance and mean exist. As an economist, I would disagree with that. It is true that there is only so much wealth in the world, but it is also true that we have no idea what it is. Furthermore, that wealth is changing every second of every day as people make individual choices.



The worker who does not pick that one apple reduces wealth if that apple is never picked and reduces available wealth regardless. An apple on a tree has no income value until it is picked and processed. This makes the right-hand side constraint stochastic. For the purposes of baseball, the stochastic effect should be considered to be zero.



Baseball, as a percentage of world output, is so minuscule that you could ignore it. The same is true for American football, North American hockey, or, for that matter, live stage theater for the whole United States.



The fact that you can model these data with a Pareto distribution means you have no mean or variance, if the estimates are valid. If you take the log, you end up with finite variance. If you divide the data by their minimum value and take the logs, you end up with the exponential distribution, which is well enough behaved, but then you get interpretation problems.
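
The exponential claim can be checked in one line. This derivation is added as a sketch and assumes the data are exactly Pareto with scale $x_m$ and shape $\alpha$, so that $P(X > x) = (x_m/x)^{\alpha}$ for $x \ge x_m$. For $Y = \ln(X/x_m)$ and $t \ge 0$,

$$P(Y > t) = P\!\left(X > x_m e^{t}\right) = \left(\frac{x_m}{x_m e^{t}}\right)^{\alpha} = e^{-\alpha t},$$

which is the survival function of an exponential distribution with rate $\alpha$. Hence $E[Y] = 1/\alpha$ and $\operatorname{Var}(Y) = 1/\alpha^{2}$, both finite for any $\alpha > 0$. With the estimate $\alpha \approx 0.785$ quoted in this answer, these would be about 1.27 and 1.62.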







edited Jan 2 at 20:04
answered Jan 2 at 6:25 by Dave Harris

                    • $begingroup$
                      Note that for a shape parameter $alpha < 1$ then the distribution does not even have a mean.
                      $endgroup$
                      – Therkel
                      Jan 2 at 7:22












                    • $begingroup$
                      "This matters because when α<2, then the distribution has no variance." - No, because we know the variance is bounded for salaries, as there is both an upper and lower bound on possible salaries (0 and all the money in the world, for example), regardless of whether the data appears to fit well with a Pareto distribution with unbounded variance.
                      $endgroup$
                      – Cliff AB
                      Jan 2 at 17:34










                    • $begingroup$
                      @Cliff AB: I'm not sure the boundedness of salaries is really relevant. The mean/variance will be so badly defined that it is, for practical/inferential purposes, better to assume they do not exist. I wrote about that: stats.stackexchange.com/questions/94402/…
                      $endgroup$
                      – kjetil b halvorsen
                      Jan 2 at 17:51










                    • $begingroup$
                      @CliffAB it isn't bounded on the right. The right side boundary, although it does exist, is stochastic. Not only does no one know the planetary budget constraint, but it is also in constant motion.
                      $endgroup$
                      – Dave Harris
                      Jan 2 at 19:54










                    • $begingroup$
                      @Therkel thanks for pointing that out. I was working from memory.
                      $endgroup$
                      – Dave Harris
                      Jan 2 at 19:54



































