How to use all features for training, but only 2 features for testing, with scikit-learn?
I am building a machine learning model for predicting Premier League (football/soccer) results using this dataset, which has features such as Home Goals, Away Goals, Shots on target etc. This is my code currently after I have loaded the dataset:
features = list(data.columns.values)
X, y = data[features], data.FTR #FTR stands for Full Time Result
print(X.shape)
-> (4940, 20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)
nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)
This gives a very good accuracy (72%), but only because, when I ask the model to predict the result, I am giving it access to the statistics (including goals scored) from the very match I am trying to predict. Is there a way to "hide" all of the features apart from Home team and Away team and predict the results this way?
I have tried doing this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)
X_test = X_test.iloc[:, [0, 1]] #this only keeps the column with home team name and away team name
nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)
However, this gives the following error:
ValueError: operands could not be broadcast together with shapes (988,2) (20,)
python machine-learning scikit-learn training-data
There is absolutely no reason (in fact, it doesn't even make sense) to train your classifier using features that will not be available at prediction time; keep only the features that will be available and use only them for your training...
– desertnaut
Nov 22 '18 at 23:32
@desertnaut I understand, but how would I use the statistics to create a model, since obviously the in-game statistics are not available before games. I was thinking either use them to quantify how good a team is relative to the other team, or to use only the features available before the game, and essentially use linear regression based on past games between the teams to predict each statistic individually...
– Ivan Novikov
Nov 23 '18 at 1:19
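A minimal sketch of desertnaut's suggestion above, on synthetic data: the model is fitted and evaluated on the same pre-match columns, so the shape mismatch from the question cannot occur. The column names (HomeTeam, AwayTeam, FTHG, FTAG) and the four teams are assumptions made for illustration, not taken from the asker's dataset. Team names are one-hot encoded with pd.get_dummies because GaussianNB needs numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real dataset (all names and values are assumptions).
rng = np.random.default_rng(18)
teams = ["Arsenal", "Chelsea", "Everton", "Leeds"]
n = 200
home = rng.choice(teams, size=n)
away = np.array([rng.choice([t for t in teams if t != h]) for h in home])
data = pd.DataFrame({
    "HomeTeam": home,
    "AwayTeam": away,
    "FTHG": rng.poisson(1.5, size=n),  # full-time home goals (in-game statistic)
    "FTAG": rng.poisson(1.1, size=n),  # full-time away goals (in-game statistic)
})
data["FTR"] = np.where(
    data["FTHG"] > data["FTAG"], "H",
    np.where(data["FTHG"] < data["FTAG"], "A", "D"),
)

# Keep only what is known before kick-off, for training AND testing.
pre_match = ["HomeTeam", "AwayTeam"]
X = pd.get_dummies(data[pre_match], dtype=float)
y = data["FTR"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=18
)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)  # same columns as in fit, so no broadcast error
print(nb.score(X_test, y_test))
```

With random synthetic results the accuracy is not meaningful; the point is only that fit and predict see the same set of columns.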
asked Nov 22 '18 at 20:39
Ivan Novikov
1 Answer
If you want to keep all the information your features give you, consider replacing the in-game statistics with averages, or some other historical measure, computed from earlier matches before you train your model. For example, if team A scored 2, 3, and 1 goals in its last three matches before scoring 5 in the game you're training on, use the average of those three games (2 goals) instead of the actual total of 5. Your training error might be higher, but when you go to predict a new game you can still use as much data as possible.
When you're trying to train a model as a predictor, as @desertnaut said, only use the variables that will be available to you when you're going to run the prediction.
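One way to compute such a historical average with pandas is sketched below. The frame layout and the names team, goals and goals_avg_last3 are hypothetical, and the rows are assumed to be in chronological order within each team. The shift(1) is what keeps the current match's own goals out of its feature.

```python
import pandas as pd

# Toy data: one row per team per match, in chronological order (assumption).
matches = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B"],
    "goals": [2, 3, 1, 5, 0, 4],
})

# For each team, average the goals of the three matches BEFORE the current one.
# shift(1) drops the current match; rolling(3) needs three earlier matches,
# so rows with less history get NaN.
matches["goals_avg_last3"] = (
    matches.groupby("team")["goals"]
    .transform(lambda s: s.shift(1).rolling(window=3).mean())
)

print(matches)
```

For team A's fourth match (5 goals) the feature is 2.0, the mean of 2, 3 and 1; every other row here has fewer than three earlier matches and is left as NaN, so such rows would need to be dropped or filled before training.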
@drbabaghanoush Thanks, I will try doing that. Instead of using the averages, would it be feasible to predict in-game statistics using linear regression, and then base the match result on the predicted statistics?
– Ivan Novikov
Nov 24 '18 at 13:45
@IvanNovikov Yeah, that would definitely be an option. You'd have to build a linear regression model for each in-game variable, and again you'd have to restrict yourself to variables that are available before game time.
– drbabaghanoush
Nov 24 '18 at 20:22
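The two-stage idea from these comments could look roughly like the sketch below: one LinearRegression per in-game statistic, each fed only pre-match features, followed by a classifier trained on the predicted statistics. Everything here is an assumption for illustration: the data is synthetic, and home_form, away_form, home_goals and away_goals are made-up column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 300

# Pre-match features (hypothetical), e.g. recent scoring form of each side.
pre = pd.DataFrame({
    "home_form": rng.normal(1.5, 0.5, n),
    "away_form": rng.normal(1.2, 0.5, n),
})
# In-game statistics, only known after the match.
stats = pd.DataFrame({
    "home_goals": rng.poisson(np.clip(pre["home_form"].to_numpy(), 0.1, None)),
    "away_goals": rng.poisson(np.clip(pre["away_form"].to_numpy(), 0.1, None)),
})
result = np.where(
    stats["home_goals"] > stats["away_goals"], "H",
    np.where(stats["home_goals"] < stats["away_goals"], "A", "D"),
)

# Earlier matches for training, later ones for testing.
train, test = slice(0, 240), slice(240, n)

# Stage 1: one regression per in-game statistic, using pre-match inputs only.
regressors = {
    col: LinearRegression().fit(pre.iloc[train], stats[col].iloc[train])
    for col in stats.columns
}

def predicted_stats(pre_rows):
    """Predict every in-game statistic from pre-match features."""
    return pd.DataFrame(
        {col: reg.predict(pre_rows) for col, reg in regressors.items()},
        index=pre_rows.index,
    )

# Stage 2: classify the result from the PREDICTED statistics, so training
# and prediction see the same kind of input.
clf = GaussianNB().fit(predicted_stats(pre.iloc[train]), result[train])
pred = clf.predict(predicted_stats(pre.iloc[test]))
print(pred[:10])
```

Training the classifier on predicted rather than actual statistics is a design choice made here so that stage 2 never depends on information unavailable before the match.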
answered Nov 24 '18 at 12:48
drbabaghanoush