Training Error is Lower than Testing error in a Random Forest Model
I have been working on a machine learning model and I am unsure which model to choose, or whether there is another technique I should try. I am using a Random Forest to predict the propensity to convert on a highly imbalanced data set. The class balance for the target variable is given below.
   label      count
0    0.0  1,021,095
1    1.0      4,459
I trained two models, one using upsampling and one using undersampling. Below is the code I am using for each.
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 2018)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()
# Sampling techniques: apply only one of the two below
# Upsampling: sample the minority class with replacement (fraction 100.0)
df_class_0 = train_initial[train_initial['label'] == 0]
df_class_1 = train_initial[train_initial['label'] == 1]
df_class_1_over = df_class_1.sample(True, 100.0, seed=99)
train_up = df_class_0.union(df_class_1_over)
train_up.groupby('label').count().toPandas()
# Downsampling: keep every positive row and a small fraction of the negatives
stratified_train = train_initial.sampleBy('label', fractions={0: 3091./714840, 1: 1.0}).cache()
stratified_train.groupby('label').count().toPandas()
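The hard-coded fraction `3091./714840` in the downsampling call could instead be derived from the class counts. A minimal sketch in plain Python, assuming the counts have already been collected into a dict; the helper name `balancing_fractions` is hypothetical and is not a PySpark function. The returned dict has the shape that `sampleBy` takes for its `fractions` argument.

```python
def balancing_fractions(class_counts):
    """Return per-class sampling fractions that shrink every class
    to roughly the size of the rarest one."""
    if not class_counts:
        raise ValueError("class_counts must not be empty")
    smallest = min(class_counts.values())
    if smallest <= 0:
        raise ValueError("every class needs at least one row")
    return {label: smallest / count for label, count in class_counts.items()}


# Training-set counts implied by the fraction used in the question.
fractions = balancing_fractions({0: 714840, 1: 3091})
print(fractions)
```

The dict could then be passed along the lines of `train_initial.sampleBy('label', fractions=fractions)`.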
Below is how I am training my model
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Index the label and feature columns (both fitted on the full data set)
labelIndexer = StringIndexer(inputCol='label',
                             outputCol='indexedLabel').fit(new_data)
featureIndexer = VectorIndexer(inputCol='features',
                               outputCol='indexedFeatures',
                               maxCategories=2).fit(new_data)
rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Convert indexed predictions back to the original labels
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])
# Search over numTrees, impurity and maxDepth for the best model
paramGrid = (ParamGridBuilder()
             .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000])
             .addGrid(rf_model.impurity, ['entropy', 'gini'])
             .addGrid(rf_model.maxDepth, [2, 3, 4, 5])
             .build())
# Set up 5-fold cross validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)
# Fit on one of the resampled training sets: train_up or stratified_train
train_model = crossval.fit(train_up)
Below are the results from both the methods
#UpSampling - Training
Train Error = 0.184633
precision: 0.8565508112679312
recall: 0.6597217024736883
auroc: 0.9062348758176568
f1 : 0.7453609484359377
#Upsampling - Test
Test Error = 0.0781619
precision: 0.054455645977569946
recall: 0.6503868471953579
auroc: 0.8982212236597943
f1 : 0.10049688048716704
#UnderSampling - Training
Train Error = 0.179293
precision: 0.8468290542023261
recall: 0.781807131280389
f1 : 0.8130201200884863
auroc: 0.9129391668636556
#UnderSampling - Test
Test Error = 0.147874
precision: 0.034453223699706645
recall: 0.778046421663443
f1 : 0.06598453935901905
auroc: 0.8989720777537427
Referring to various articles on StackOverflow, I understand that if the test error is lower than the train error, there is likely to be an error in the implementation. However, I am not sure where I am going wrong in training my models. Also, which sampling method is better for such a highly imbalanced class? If I undersample, I am worried that information will be lost.
I was hoping someone could help me with this model and clear up my doubts.
Thanks a lot in advance!
machine-learning random-forest sampling
asked Nov 21 '18 at 21:27
Tushar Mehta
1 Answer
A testing error lower than the training error does not necessarily mean there is an error in the implementation. You can train the model for longer, and depending on your dataset the training error may drop below the testing error; however, you may then end up overfitting. The goal should therefore also be to check other performance metrics on the test set, such as accuracy, precision, and recall.
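One way to see why a lower test error need not point to a bug: the training set in the question was resampled, while the test set keeps the original class mix. The sketch below uses made-up per-class rates (a recall of 0.78 and a false-positive rate of 0.14 are assumptions for illustration, not values recovered from the question) and shows that identical classifier behaviour yields a different overall error rate and precision depending on the share of positives. The function name is hypothetical.

```python
def error_and_precision(recall, false_positive_rate, positive_share):
    """Overall error rate and precision for fixed per-class rates
    at a given share of positive examples."""
    negative_share = 1.0 - positive_share
    true_pos = positive_share * recall
    false_neg = positive_share * (1.0 - recall)
    false_pos = negative_share * false_positive_rate
    error = false_neg + false_pos
    precision = true_pos / (true_pos + false_pos)
    return error, precision


# Balanced mix, as after resampling the training data.
balanced = error_and_precision(0.78, 0.14, 0.5)
# Original class mix from the question's label counts.
imbalanced = error_and_precision(0.78, 0.14, 4459 / (4459 + 1021095))

print("balanced   (error, precision):", balanced)
print("imbalanced (error, precision):", imbalanced)
```

With these assumed rates, the error is lower on the imbalanced mix because almost all rows are negatives, which this example classifies correctly more often than positives, while precision collapses because false positives far outnumber the few true positives.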
Oversampling and undersampling are opposite but roughly equivalent techniques. If you have lots of data points, it is better to undersample; otherwise, go for oversampling. SMOTE is a great oversampling technique: it creates synthetic data points instead of repeating the same data points multiple times.
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
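To illustrate the idea of synthetic points, here is a simplified pure-Python sketch of SMOTE-style interpolation. It is not the imbalanced-learn implementation linked above; the function name `smote_like_samples` and its parameters are hypothetical. Each new point lies on the line segment between a randomly chosen minority point and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority points, nearest to `base` first.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority_points, n_new=6, k=2)
print(new_points)
```

Because every synthetic point is an interpolation, it stays inside the region spanned by the existing minority points rather than being an exact copy of one of them.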
Another tip: shuffle the data with different seeds and see whether the training error stays greater than the testing error. I suspect the variance in your data is high. Read about the bias-variance trade-off.
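The seed check could be sketched as follows. This is plain Python with a toy majority-class baseline standing in for the real model; `error_gaps` and `majority_baseline` are hypothetical names. In the question's PySpark code, the equivalent would be varying the `seed` passed to `randomSplit` and refitting each time.

```python
import random
from statistics import mean, stdev


def error_gaps(rows, fit_and_score, seeds, test_share=0.3):
    """For each seed, shuffle and split the rows, then record
    train_error - test_error as returned by fit_and_score."""
    gaps = []
    for seed in seeds:
        shuffled = rows[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_share))
        train, test = shuffled[:cut], shuffled[cut:]
        train_error, test_error = fit_and_score(train, test)
        gaps.append(train_error - test_error)
    return gaps


def majority_baseline(train, test):
    """Toy stand-in for a real model: always predict the majority training label."""
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)

    def error(rows):
        return sum(label != guess for _, label in rows) / len(rows)

    return error(train), error(test)


# Toy data: one positive in every ten rows.
toy_rows = [(i, 1 if i % 10 == 0 else 0) for i in range(200)]
gaps = error_gaps(toy_rows, majority_baseline, seeds=[1, 2, 3, 4, 5])
print("mean gap:", mean(gaps), "spread:", stdev(gaps))
```

If the sign of the gap flips from seed to seed, the difference between training and testing error is probably within the noise of the split.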
Judging by the results, it seems you have built a pretty decent model. Try using XGBoost as well and compare the result with Random Forest.
Thanks a lot for responding. As suggested, I shuffled the data with different seed values and performed downsampling using the same code I mentioned in the question. Almost every time, my test error rate was greater than the train error by a margin of 0.05%. What are your thoughts on this change in error rates? I used the same trained model and shuffled the data around.
– Tushar Mehta
Nov 22 '18 at 0:49
answered Nov 21 '18 at 22:30
sjishan