Why log-transform to normal distribution for decision trees?

On page 304 of chapter 8 of An Introduction to Statistical Learning with Applications in R (James et al.), the authors say:

We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

No additional motivation for the log-transform is given. Since the data are being fed into decision tree algorithms, why was it important to push the data toward a normal distribution? I thought most, if not all, decision tree algorithms were invariant to scale changes.

edited Jan 2 at 3:38

Sycorax

42.1k12111207

asked Jan 2 at 2:50

jss367

1255

machine-learning cart

2 Answers

In this case, the salary is the target (dependent variable/outcome) of the decision tree, not one of the features (independent variables/predictors). You are correct that decision trees are insensitive to the scale of the predictors. However, I suspect there are a small number of extremely large salaries, so transforming the salaries might improve predictions: a loss function that minimizes squared error will not be so strongly influenced by these large values.
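The distinction between transforming a predictor and transforming the target can be sketched in plain Python. This is only an illustration under stated assumptions: the six data points are made up (they are not the Hitters data), and `sse` and `best_split` are hypothetical helpers written for this example, not functions from any tree library. The sketch searches for the single split that minimizes the summed squared error of the two resulting leaves.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean (squared-error loss of one leaf)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, n_left) for the one split on x minimizing total SSE of y."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for i in range(1, len(ys)):
        loss = sse(ys[:i]) + sse(ys[i:])
        if best is None or loss < best[0]:
            # Threshold is the midpoint between neighbouring x values.
            best = (loss, (xs[i - 1] + xs[i]) / 2, i)
    return best[1], best[2]

# Made-up toy data: six players, one unusually large salary.
years = [1, 2, 3, 4, 5, 6]
salary = [100, 110, 120, 400, 420, 1500]

raw = best_split(years, salary)                             # untransformed
log_x = best_split([math.log(v) for v in years], salary)    # log of the predictor
log_y = best_split(years, [math.log(v) for v in salary])    # log of the target

print(raw)
print(log_x)
print(log_y)
```

On these numbers the raw-salary split isolates the single largest salary, taking the log of the predictor leaves the same five-versus-one partition, and taking the log of the target moves the split so that it separates the three low salaries from the three high ones.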

edited Jan 2 at 16:59

answered Jan 2 at 3:36

Sycorax

42.1k12111207

I downloaded last year's salaries. It is very likely they follow a Pareto distribution. The histogram is shown below.

The pdf of the Pareto distribution is $$\frac{\alpha x_m^\alpha}{x^{\alpha+1}}.$$

The scale parameter $x_m$ is \$545,000, the lowest salary last year. I estimated the shape parameter, $\alpha$, as 0.7848238 using MLE. This matters because when $\alpha<2,$ then the distribution has no variance. More properly, its variance is undefined. If any of your variables lacks a mean or a variance, then you cannot use anything that minimizes squared loss.
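For reference, the maximum-likelihood estimate of the Pareto shape has a closed form, $\hat\alpha = n / \sum_i \ln(x_i / x_m)$. The sketch below applies it to simulated data, since the downloaded salary file is not part of this post. The function name `pareto_shape_mle`, the seed, the sample size, and the value 0.78 are all assumptions chosen for illustration.

```python
import math
import random

def pareto_shape_mle(sample, x_m=None):
    """MLE of the Pareto shape: alpha_hat = n / sum(log(x_i / x_m))."""
    if x_m is None:
        x_m = min(sample)  # the MLE of the scale is the sample minimum
    return len(sample) / sum(math.log(x / x_m) for x in sample)

rng = random.Random(0)
true_alpha, x_m = 0.78, 545_000.0  # assumed values, close to those quoted above

# random.paretovariate draws from a Pareto with scale 1, so rescale by x_m.
salaries = [x_m * rng.paretovariate(true_alpha) for _ in range(20_000)]

alpha_hat = pareto_shape_mle(salaries, x_m=x_m)
print(round(alpha_hat, 3))
```

With a sample this large the estimate should land close to the shape used to generate the data; on a real salary file the same function would be called with the observed salaries instead of the simulated list.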

The distribution of the log of the variables does have a variance, and so you can use least-squares-style methodologies on them. This is actually a serious omission from your textbook. Some things, like stock market returns, which have neither a mean nor a variance, or baseball salaries, which lack a variance, will make OLS models meaningless. The log is not, inherently, the best treatment, but it does work.

Taking the log does not give you a bell shape. [Figure: histogram of log salaries]

This is entirely about being certain that all of your data has a variance. If all of the assumptions for OLS are met, then the underlying distributions do not matter. They can be insane looking, but variance has to be defined everywhere.

EDIT As Therkel pointed out in the comments, when $\alpha<1$ then no mean exists either. There is a comment by Cliff AB that I should take up as well. He argues that the distribution is doubly bounded and so a finite variance and mean exist. I would disagree with that as an economist. It is true that there is only so much wealth in the world, but it is also true that we have no idea what it is. Furthermore, that wealth is changing every second of every day as people make individual choices.

The worker who does not pick that one apple reduces wealth if that apple is never picked and reduces available wealth regardless. An apple on a tree has no income value until it is picked and processed. This makes the right-hand side constraint stochastic. For the purposes of baseball, the stochastic effect should be considered to be zero.

Baseball, as a percentage of world output, is so minuscule that you could ignore it. The same is true for American football, North American hockey, or for that matter, live stage theater for the whole United States.

The fact that you can model this data with a Pareto distribution means you have no mean or variance if the estimates are valid. If you take the log, you end up with finite variance. If you divide the data by its minimum value and take the logs, you end up with the exponential distribution, which is well enough behaved, but then you get interpretation problems.
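The claim about dividing by the minimum and taking logs can be checked numerically: if $X$ is Pareto with scale $x_m$ and shape $\alpha$, then $\ln(X/x_m)$ is exponential with rate $\alpha$, so its mean is $1/\alpha$ and its variance is $1/\alpha^2$, both finite. The sketch below uses simulated data only; the seed, the sample size, and $\alpha = 0.78$ are assumptions for illustration.

```python
import math
import random

rng = random.Random(1)
alpha, x_m = 0.78, 545_000.0  # assumed values, close to those quoted above

# Simulated Pareto draws (random.paretovariate has scale 1, so rescale by x_m).
salaries = [x_m * rng.paretovariate(alpha) for _ in range(50_000)]

# Divide by the minimum possible value and take logs.
logs = [math.log(s / x_m) for s in salaries]
n = len(logs)
log_mean = sum(logs) / n
log_var = sum((v - log_mean) ** 2 for v in logs) / (n - 1)

print(log_mean, 1 / alpha)        # sample mean vs. theoretical 1/alpha
print(log_var, 1 / alpha ** 2)    # sample variance vs. theoretical 1/alpha^2
```

The sample moments of the logged values settle near their theoretical counterparts, which is what makes squared-error methods usable on the log scale.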

edited Jan 2 at 20:04

answered Jan 2 at 6:25

Dave Harris

3,767515

Note that for a shape parameter $\alpha < 1$ the distribution does not even have a mean.
– Therkel Jan 2 at 7:22

"This matters because when α<2, then the distribution has no variance." - No, because we know the variance is bounded for salaries, as there is both an upper and lower bound on possible salaries (0 and all the money in the world, for example), regardless of whether the data appears to fit well with a Pareto distribution with unbounded variance.
– Cliff AB Jan 2 at 17:34

@Cliff AB: I'm not sure the boundedness of salaries is really relevant. The mean/variance will be so badly defined that it is, for practical/inferential purposes, better to assume they do not exist. I wrote about that: stats.stackexchange.com/questions/94402/…
– kjetil b halvorsen Jan 2 at 17:51

@CliffAB it isn't bounded on the right. The right side boundary, although it does exist, is stochastic. Not only does no one know the planetary budget constraint, but it is also in constant motion.
– Dave Harris Jan 2 at 19:54

@Therkel thanks for pointing that out. I was working from memory.
– Dave Harris Jan 2 at 19:54
