Training Error is Lower than Testing error in a Random Forest Model

I have been working on a machine learning model and I am unsure which model to choose, or whether there is another technique I should try. I am using a Random Forest to predict the propensity to convert on a highly imbalanced data set. The class balance of the target variable is given below.

   label      count
0    0.0  1,021,095
1    1.0      4,459

I trained two models, one using upsampling and one using undersampling. Below is the code I am using for each.

train_initial, test = new_data.randomSplit([0.7, 0.3], seed=2018)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()

# Sampling techniques --- apply only one of these

# Upsampling: replicate the minority class by sampling with replacement
df_class_0 = train_initial[train_initial['label'] == 0]
df_class_1 = train_initial[train_initial['label'] == 1]
df_class_1_over = df_class_1.sample(True, 100.0, seed=99)
train_up = df_class_0.union(df_class_1_over)
train_up.groupby('label').count().toPandas()

# Downsampling: keep all of class 1 and a fraction of class 0
stratified_train = train_initial.sampleBy('label', fractions={0: 3091./714840, 1: 1.0}).cache()
stratified_train.groupby('label').count().toPandas()

Below is how I am training my model

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

labelIndexer = StringIndexer(inputCol='label',
                             outputCol='indexedLabel').fit(new_data)

featureIndexer = VectorIndexer(inputCol='features',
                               outputCol='indexedFeatures',
                               maxCategories=2).fit(new_data)

rf_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])

# Search over numTrees, impurity and maxDepth for the best model
paramGrid = (ParamGridBuilder()
             .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000])
             .addGrid(rf_model.impurity, ['entropy', 'gini'])
             .addGrid(rf_model.maxDepth, [2, 3, 4, 5])
             .build())

# Set up 5-fold cross validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)

# Fit on train_up for the upsampled run, or on stratified_train for the undersampled run
train_model = crossval.fit(train_up)

Below are the results from both the methods

# Upsampling - Training
Train Error = 0.184633
precision: 0.8565508112679312
recall: 0.6597217024736883
auroc: 0.9062348758176568
f1 : 0.7453609484359377

# Upsampling - Test
Test Error = 0.0781619
precision: 0.054455645977569946
recall: 0.6503868471953579
auroc: 0.8982212236597943
f1 : 0.10049688048716704

# Undersampling - Training
Train Error = 0.179293
precision: 0.8468290542023261
recall: 0.781807131280389
f1 : 0.8130201200884863
auroc: 0.9129391668636556

# Undersampling - Test
Test Error = 0.147874
precision: 0.034453223699706645
recall: 0.778046421663443
f1 : 0.06598453935901905
auroc: 0.8989720777537427

From various posts on StackOverflow I understand that if the test error is lower than the training error, there is likely to be an error in the implementation. However, I am not sure where I am going wrong in training my models. Also, which sampling method is better for such a highly imbalanced class? If I undersample, I am worried about losing information.

I was hoping someone could help me with this model and clear up my doubts.

Thanks a lot in advance !!

asked Nov 21 '18 at 21:27

Tushar Mehta

387

machine-learning random-forest sampling

1 Answer

A test error that is lower than the training error does not necessarily mean there is an error in the implementation. You can train the model for more iterations and, depending on your dataset, the training error may drop below the test error. However, you may end up overfitting. So the goal should also be to check other performance metrics on the test set, such as accuracy, precision, and recall.
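One reason the two error rates may not be directly comparable: in the question, only the training set was resampled, so the share of positives in the training data differs from the share in the test data. The sketch below is a rough illustration, not a reconstruction of your model. The recall of 0.65, false-positive rate of 0.05, and the two prevalence values are assumptions chosen for the example. It shows how a classifier with the same recall and false-positive rate can produce a higher error rate and a much higher precision on balanced data than on highly imbalanced data.

```python
def error_and_precision(recall, fpr, prevalence):
    """Return (error_rate, precision) for a classifier with a fixed recall and
    false-positive rate, evaluated on data with the given share of positives."""
    true_pos = recall * prevalence            # positives correctly flagged
    false_neg = (1.0 - recall) * prevalence   # positives missed
    false_pos = fpr * (1.0 - prevalence)      # negatives wrongly flagged
    error_rate = false_neg + false_pos
    precision = true_pos / (true_pos + false_pos)
    return error_rate, precision


# Hypothetical operating point, identical in both settings.
balanced = error_and_precision(recall=0.65, fpr=0.05, prevalence=0.5)
imbalanced = error_and_precision(recall=0.65, fpr=0.05, prevalence=0.004)

print("balanced   (error, precision):", balanced)
print("imbalanced (error, precision):", imbalanced)
```

Because the error rate is a prevalence-weighted mix of missed positives and false alarms, it moves with the class balance even when the model itself does not change. Metrics such as recall and AUROC do not depend on the class balance in this way, so they are the safer ones to compare between a resampled training set and an untouched test set.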

Oversampling and undersampling are opposite but roughly equivalent techniques. If you have lots of data points, it is better to undersample; otherwise go for oversampling. SMOTE is a great oversampling technique that creates synthetic data points instead of repeating the same data points multiple times.

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
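To make the idea of synthetic points concrete, here is a minimal, simplified sketch of SMOTE-style interpolation in plain Python. It is not the imbalanced-learn implementation linked above. The function name `smote_like`, its parameters, and the toy points are all hypothetical. Each new point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority-class samples with two features each (made up for illustration).
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority_points, n_new=6)
print(new_points)
```

Unlike sampling with replacement, every generated point is a new feature vector, although it always lies between two existing minority samples.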

Another tip: shuffle the data with different seeds and see whether the training error stays greater than the test error. I suspect the variance in your data is high. Read about the bias-variance trade-off.
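The seed check can be sketched as a small loop. This is only an outline in plain Python, with a toy threshold classifier and synthetic data standing in for your Spark pipeline. The helper names `train_test_gap`, `fit_threshold`, and `error_rate` are hypothetical, as are the data and the 70/30 split fraction, which mirrors the split in the question.

```python
import random
import statistics


def train_test_gap(rows, fit, error, seeds, train_frac=0.7):
    """For each seed: reshuffle, split, fit on the training part, and record
    train_error - test_error."""
    gaps = []
    for seed in seeds:
        shuffled = rows[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        gaps.append(error(model, train) - error(model, test))
    return gaps


def fit_threshold(train):
    """Toy model: threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def error_rate(threshold, rows):
    """Share of rows where the thresholded prediction disagrees with the label."""
    wrong = sum((x > threshold) != (y == 1) for x, y in rows)
    return wrong / len(rows)


# Synthetic imbalanced data: one positive in every ten rows.
data_rng = random.Random(42)
labels = [1 if i % 10 == 0 else 0 for i in range(2000)]
rows = [(data_rng.gauss(float(y), 1.0), y) for y in labels]

gaps = train_test_gap(rows, fit_threshold, error_rate, seeds=range(5))
print("mean gap:", statistics.mean(gaps), "stdev:", statistics.stdev(gaps))
```

If the sign of the gap flips from seed to seed, the difference is probably split noise. If it keeps the same sign for every seed, something systematic is more likely, such as a training set whose class balance differs from the test set.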

Judging by the results, it seems you have built a pretty decent model. Try using XGBoost as well and compare the result with Random Forest.

answered Nov 21 '18 at 22:30

sjishan

5802629

Thanks a lot for responding. As suggested, I shuffled the data with different seed values and performed downsampling using the same code I posted in the question. Almost every time, my test error rate was greater than the train error by a margin of 0.05%. What are your thoughts on this change in error rates? I used the same trained model and shuffled the data around.

– Tushar Mehta
Nov 22 '18 at 0:49
