Spark SVD is not reproducible
I am using the computeSVD method of Spark's IndexedRowMatrix class (in Scala). I have noticed it has no setSeed() method. I am getting slightly different results across multiple runs on the same input matrix, possibly due to the internal algorithm used by Spark. Although Spark also implements an approximate scalable SVD algorithm, judging from the source code, computeSVD() in IndexedRowMatrix applies the exact version, not the approximate one.
Since I am doing recommendations with the SVD results, and the user and item latent factor matrices differ between runs, I am actually getting different recommendation lists. In some runs I get roughly the same items in a different order; in others a few new items enter the list and some go missing. This happens because the predicted ratings are often almost tied after imputing the missing entries of the input ratings matrix that is passed to computeSVD().
Has anyone else had this problem? Is there a way to make this fully deterministic, or am I missing something?
Thanks
apache-spark apache-spark-mllib apache-spark-ml svd non-deterministic
Since FP arithmetic is not associative, and merge order (computeSVD uses treeAggregate to computeGramianMatrix) in Spark is non-deterministic, some fluctuations in the results are expected. Since RNG is not involved, setting seed wouldn't make any difference.
– user6910411
Nov 25 '18 at 14:11
Wow, that is a deep answer. If you turn your comment into a post I will be glad to accept it, thank you!
– Pablo
Nov 25 '18 at 19:14
asked Nov 25 '18 at 13:55
Pablo
1 Answer
Whenever you work with numeric computations in Apache Spark you have to keep in mind two things:
FP arithmetic is not associative.
scala> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
res0: Boolean = false
Every exchange in Spark is a potential source of non-determinism. To achieve optimal performance, Spark can merge partial results of the upstream tasks in an arbitrary order.
This could be addressed with some defensive programming, but the run-time overhead is typically too high to be useful in practice.
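To make the two points above concrete, here is a minimal sketch in plain Scala (no Spark involved). The values in partials are hypothetical stand-ins for partial results produced by upstream tasks, and mergeCanonical is only an illustration of the kind of defensive programming mentioned above, not something Spark does internally.

```scala
object MergeOrderDemo {
  // Hypothetical partial results, one per upstream task (assumed values).
  val partials: Seq[Double] = Seq(1e16, 1.0, -1e16, 1.0)

  // Merge partial results in the order they happen to arrive.
  def mergeInOrder(xs: Seq[Double]): Double = xs.foldLeft(0.0)(_ + _)

  // Defensive variant: sort first so every arrival order is merged identically.
  // This buys reproducibility, not accuracy, and costs an extra sort.
  def mergeCanonical(xs: Seq[Double]): Double = mergeInOrder(xs.sortWith(_ < _))

  def main(args: Array[String]): Unit = {
    val arrivalA = partials
    val arrivalB = partials.reverse // same values, different arrival order

    // Same inputs, different merge order, different sum.
    println(mergeInOrder(arrivalA) == mergeInOrder(arrivalB))
    // With a canonical order the two runs agree.
    println(mergeCanonical(arrivalA) == mergeCanonical(arrivalB))
  }
}
```

The canonical merge is reproducible but not more accurate: it only fixes the order of the additions. In a distributed job it would also mean collecting and ordering all partial results before combining them, which is where the overhead comes from.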
Because of that, the final results can fluctuate even if the procedure doesn't depend on a random number generator (like computeSVD), or if the generator seed is set.
In practice there is really not much you can do about it, short of rewriting the internals. If you suspect that the problem is ill-conditioned, you can try building multiple models with some random noise to see how sensitive the final predictions are, and take this into account when the prediction is generated.
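One rough way to sketch such a sensitivity check in plain Scala is shown below. It is a simplification under stated assumptions: instead of rebuilding the model on perturbed input, it adds Gaussian noise directly to a vector of predicted ratings and counts how often each item stays in the top k. The names TopKStability, topK and stability, the example scores, and the noise scale are all hypothetical.

```scala
import scala.util.Random

object TopKStability {
  // Indices of the k highest scores; ties are broken by index.
  def topK(scores: Vector[Double], k: Int): Set[Int] =
    scores.zipWithIndex
      .sortWith((a, b) => a._1 > b._1 || (a._1 == b._1 && a._2 < b._2))
      .take(k)
      .map(_._2)
      .toSet

  // Fraction of noisy runs in which each item appears in the top k.
  def stability(scores: Vector[Double], k: Int, noise: Double,
                runs: Int, seed: Long): Map[Int, Double] = {
    val rng = new Random(seed)
    val counts = Array.fill(scores.length)(0)
    for (_ <- 0 until runs) {
      // Perturb every score with zero-mean Gaussian noise of scale `noise`.
      val perturbed = scores.map(s => s + noise * rng.nextGaussian())
      topK(perturbed, k).foreach(i => counts(i) += 1)
    }
    counts.zipWithIndex.map { case (c, i) => i -> c.toDouble / runs }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical predicted ratings: items 1 and 2 are almost tied.
    val scores = Vector(4.9, 4.5000001, 4.5, 3.0)
    val freq = stability(scores, k = 2, noise = 1e-3, runs = 1000, seed = 42L)
    freq.toSeq.sortBy(_._1).foreach { case (i, f) =>
      println(s"item $i appears in the top 2 in ${f * 100} percent of runs")
    }
  }
}
```

Items whose frequency is far from both 0 and 1 are the almost-tied ones. For those, run-to-run differences in the recommendation list say little about the model, and a deterministic tie-breaking rule or a wider candidate list may be a reasonable way to take this into account.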
answered Nov 27 '18 at 21:42
user6910411