How to use all features for training, but only 2 features for testing, with scikit-learn?



























I am building a machine learning model for predicting Premier League (football/soccer) results using this dataset, which has features such as Home Goals, Away Goals, Shots on target etc. This is my code currently after I have loaded the dataset:



from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features = list(data.columns.values)
X, y = data[features], data.FTR  # FTR stands for Full Time Result
print(X.shape)  # -> (4940, 20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)
nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)


This gives a very good accuracy (72%), but only because, when I ask the model to predict a result, I am giving it access to the statistics (including goals scored) from the very match I am trying to predict. Is there a way to "hide" all of the features apart from Home team and Away team and predict the results this way?



I have tried doing this:



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)
X_test = X_test.iloc[:, [0, 1]]  # this only keeps the columns with home team name and away team name
nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)


However, this gives the following error:



ValueError: operands could not be broadcast together with shapes (988,2) (20,) 



























  • There is absolutely no reason (in fact, it doesn't even make sense) to train your classifier using features that will not be available at prediction time; keep only the features that will be available and use only them for your training...

    – desertnaut
    Nov 22 '18 at 23:32













  • @desertnaut I understand, but how would I use the statistics to create a model, since the in-game statistics are obviously not available before games? I was thinking of either using them to quantify how good a team is relative to the other team, or using only the features available before the game and essentially using linear regression based on past games between the teams to predict each statistic individually...

    – Ivan Novikov
    Nov 23 '18 at 1:19


















python machine-learning scikit-learn training-data
















asked Nov 22 '18 at 20:39









Ivan Novikov






















1 Answer
































If you want to keep all the information your features give you, consider replacing the in-game statistics with averages, or some other historical measure, computed from earlier matches before training your model. For example, if team A scored 2, 3, and 1 goals in its last three matches before scoring 5 in the game you're training on, use the average of those three games (2 goals) instead of the actual goal total. Your training error might be higher, but when you go to predict a new game you can still use as much data as possible.
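A minimal sketch of the historical-average idea, assuming the data has been reshaped to one row per team per match in date order. The frame `matches` and the column names `team`, `goals` and `goals_avg_last3` are hypothetical and not taken from the asker's dataset:

```python
import pandas as pd

# Hypothetical long-format data: one row per team per match, in date order.
matches = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "goals": [2, 3, 1, 5, 0, 1, 1, 2],
})

# shift(1) removes the current match from its own window, so each row
# only sees goals from matches that had already been played.
matches["goals_avg_last3"] = (
    matches.groupby("team")["goals"]
    .transform(lambda s: s.shift(1).rolling(window=3).mean())
)

print(matches)
```

Rows with fewer than three earlier matches come out as NaN here; whether to drop them or use a shorter window is a modelling choice this sketch does not settle.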



When you're training a model as a predictor, as @desertnaut said, only use the variables that will be available to you at the time you run the prediction.
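As a rough illustration of that advice, the sketch below selects the pre-match columns before splitting, so the classifier is fitted and queried on the same set of features and the shape mismatch from the question cannot occur. The tiny `data` frame and its column names (`HomeTeam`, `AwayTeam`, `FTHG`) are made up for the example, and one-hot encoding via `pd.get_dummies` is just one possible way to turn team names into numbers:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in for the real dataset; column names are assumptions.
data = pd.DataFrame({
    "HomeTeam": ["A", "B", "C", "A", "B", "C", "A", "C", "B", "A"],
    "AwayTeam": ["B", "C", "A", "C", "A", "B", "B", "A", "C", "C"],
    "FTHG": [2, 0, 1, 3, 1, 1, 2, 0, 2, 1],  # in-game stat, unknown before kick-off
    "FTR": ["H", "A", "D", "H", "D", "A", "H", "A", "H", "D"],
})

pre_match = ["HomeTeam", "AwayTeam"]  # only what is known before the match

# One-hot encode the team names so the classifier gets numeric input.
X = pd.get_dummies(data[pre_match], dtype=float)
y = data["FTR"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=18
)

nb = GaussianNB()
nb.fit(X_train, y_train)   # fitted on the pre-match columns only
y_nb = nb.predict(X_test)  # same columns, so the shapes agree
print(X_train.shape[1], X_test.shape[1])
```

Accuracy with team names alone will likely be well below the 72% reported in the question, since that figure relied on statistics from the match being predicted.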






























  • @drbabaghanoush Thanks, I will try doing that. Instead of using the averages, would it be feasible to predict in-game statistics using linear regression, and then base the match result on predicted statistics?

    – Ivan Novikov
    Nov 24 '18 at 13:45











  • @IvanNovikov Yeah that would definitely be an option. You'd have to make a linear regression model for each different in-game variable, and again you'd have to restrict yourself to only variables that are available prior to gametime.

    – drbabaghanoush
    Nov 24 '18 at 20:22
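The two-stage idea from these comments could look roughly like the sketch below. Everything in it is synthetic: `X_pre` stands in for whatever numeric pre-match features are available, `stats` for two in-game statistics, and the thresholds that define the result are invented for the demonstration. Passing a 2-D target to `LinearRegression` fits each column separately, which amounts to one linear model per in-game variable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 200

# Hypothetical pre-match features (e.g. rolling averages for each side).
X_pre = rng.normal(size=(n, 2))

# Synthetic in-game statistics loosely driven by the pre-match features.
weights = np.array([[1.0, 0.2], [0.1, 1.0]])
stats = X_pre @ weights + rng.normal(scale=0.5, size=(n, 2))

# Synthetic result: home win, draw or away win from the stat difference.
diff = stats[:, 0] - stats[:, 1]
result = np.where(diff > 0.3, "H", np.where(diff < -0.3, "A", "D"))

train, test = slice(0, 150), slice(150, n)

# Stage 1: regress every in-game statistic on the pre-match features.
stage1 = LinearRegression().fit(X_pre[train], stats[train])

# Stage 2: classify the result from the *predicted* statistics.
stage2 = GaussianNB().fit(stage1.predict(X_pre[train]), result[train])

y_pred = stage2.predict(stage1.predict(X_pre[test]))
print((y_pred == result[test]).mean())
```

Because the second stage only ever sees predicted statistics, both at training and at prediction time, nothing from the match being predicted leaks into the model. Whether this beats feeding the pre-match features to the classifier directly is something to check on the real data.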











answered Nov 24 '18 at 12:48









drbabaghanoush












