How to use all features for training, but only 2 features for testing, with scikit-learn?

-1

I am building a machine learning model for predicting Premier League (football/soccer) results using this dataset, which has features such as home goals, away goals, shots on target, etc. This is my current code, after loading the dataset:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features = list(data.columns.values)
X, y = data[features], data.FTR  # FTR stands for Full Time Result
print(X.shape)  # -> (4940, 20)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)

This gives very good accuracy (72%), but only because, when I ask the model to predict a result, I am giving it access to the statistics (including goals scored) from the very match I am trying to predict. Is there a way to "hide" all of the features apart from home team and away team, and predict the results that way?

I have tried doing this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

X_test = X_test.iloc[:, [0, 1]]  # keep only the home team name and away team name columns

nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)

However, this gives the following error:

ValueError: operands could not be broadcast together with shapes (988,2) (20,)

asked Nov 22 '18 at 20:39

Ivan Novikov

315

2

There is absolutely no reason (in fact, it doesn't even make sense) to train your classifier using features that will not be available at prediction time; keep only the features that will be available and use only them for your training...

– desertnaut
Nov 22 '18 at 23:32

@desertnaut I understand, but how would I use the statistics to create a model, since obviously the in-game statistics are not available before games. I was thinking either use them to quantify how good a team is relative to the other team, or to use only the features available before the game, and essentially use linear regression based on past games between the teams to predict each statistic individually...

– Ivan Novikov
Nov 23 '18 at 1:19

python machine-learning scikit-learn training-data


1 Answer

If you want to keep all the information your features give you, consider using averages, or some other historical measure of the in-game statistics, when training your model. For example, if team A scored 2, 3, and 1 goals in its last three matches before scoring 5 in the game you're training on, use the average of the last three games (2 goals) instead of the actual goal total. Your training error might be higher, but when you go to predict a new game you can still use as much data as possible.
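A minimal sketch of that idea with pandas, using the 2, 3, 1, 5 example from above. The column names `HomeTeam` and `FTHG` (full-time home goals) are assumptions about the dataset, and for brevity the sketch only averages a team's home games; a fuller version would combine home and away appearances.

```python
import pandas as pd

# Hypothetical match log: one row per match, in chronological order.
# HomeTeam and FTHG are assumed column names, not taken from the question.
matches = pd.DataFrame({
    "HomeTeam": ["A", "A", "A", "A"],
    "FTHG": [2, 3, 1, 5],
})

# shift(1) removes the current match from its own window, so each row only
# sees goals from earlier games; rolling(3) then averages the last three.
matches["home_goals_avg3"] = (
    matches.groupby("HomeTeam")["FTHG"]
    .transform(lambda s: s.shift(1).rolling(3).mean())
)

print(matches)
```

The first three rows come out as NaN because fewer than three earlier games exist; those rows would need to be dropped or filled before training.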

When you're trying to train a model as a predictor, as @desertnaut said, only use the variables that will be available to you when you're going to run the prediction.
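As a rough illustration, the sketch below selects the pre-match columns before splitting, so training and prediction see the same features and the shape mismatch from the question cannot occur. The tiny DataFrame and the column names `HomeTeam`, `AwayTeam` and `FTHG` are made up for the example; one-hot encoding via `pd.get_dummies` is one possible way to turn team names into numbers, and `GaussianNB` is kept only to match the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical miniature dataset; HomeTeam, AwayTeam and FTHG are assumed
# column names, FTR is the full-time result from the question.
data = pd.DataFrame({
    "HomeTeam": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
    "AwayTeam": ["B", "C", "A", "C", "A", "B", "B", "C", "A", "C"],
    "FTHG":     [2, 0, 1, 3, 1, 1, 2, 0, 0, 4],
    "FTR":      ["H", "A", "D", "H", "D", "A", "H", "A", "A", "H"],
})

# Keep only what is known before kickoff; FTHG is deliberately left out.
pre_match = ["HomeTeam", "AwayTeam"]
X = pd.get_dummies(data[pre_match]).astype(float)  # one-hot encode team names
y = data["FTR"]

# Split after selecting columns, so train and test share the same features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=18
)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)
print(X_train.shape, X_test.shape, y_nb)
```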

answered Nov 24 '18 at 12:48

drbabaghanoush

563

@drbabaghanoush Thanks, I will try doing that. Instead of using the averages, would it be feasible to predict in-game statistics using linear regression, and then base the match result on the predicted statistics?

– Ivan Novikov
Nov 24 '18 at 13:45

@IvanNovikov Yeah that would definitely be an option. You'd have to make a linear regression model for each different in-game variable, and again you'd have to restrict yourself to only variables that are available prior to gametime.

– drbabaghanoush
Nov 24 '18 at 20:22
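The two-stage idea discussed in these comments could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, the two pre-match features and two in-game statistics are placeholders, and `LogisticRegression` is swapped in as the second-stage classifier simply because it is convenient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: two pre-match features per game (for example,
# each side's rolling goal average) and two in-game statistics.
X_pre = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.2], [0.3, 1.0]])
in_game = X_pre @ mixing + rng.normal(scale=0.5, size=(200, 2))
result = np.where(in_game[:, 0] > in_game[:, 1], "H", "A")

# Stage 1: one linear regression per in-game statistic,
# fitted on pre-match features only.
stat_models = [
    LinearRegression().fit(X_pre, in_game[:, j])
    for j in range(in_game.shape[1])
]

def predict_stats(X):
    """Stack each model's prediction into one column per statistic."""
    return np.column_stack([m.predict(X) for m in stat_models])

# Stage 2: classify from the predicted statistics, since the real ones
# are not available before the game.
clf = LogisticRegression().fit(predict_stats(X_pre), result)

X_new = rng.normal(size=(5, 2))  # pre-match features for upcoming games
preds = clf.predict(predict_stats(X_new))
print(preds)
```

Fitting the second stage on in-sample predictions is a simplification; in practice the first-stage predictions would more safely come from held-out or cross-validated folds.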
