Spark SVD is not reproducible

I am using the computeSVD method of Spark's IndexedRowMatrix class (in Scala). I have noticed it has no setSeed() method. I am getting slightly different results across multiple runs on the same input matrix, possibly due to the internal algorithm used by Spark. Although Spark also implements an approximate, scalable SVD algorithm, judging from the source code I would say that computeSVD() on IndexedRowMatrix applies the exact version, not the approximate one.

Since I am using the SVD results for recommendations, and the user and item latent-factor matrices differ between runs, I actually get different recommendation lists. In some runs roughly the same items appear in a different order; in others a few new items enter the list and some drop out. This happens because the predicted ratings are often almost tied after imputing the missing entries of the input ratings matrix that is passed to computeSVD().

Has anyone else had this problem? Is there a way to make this fully deterministic, or am I missing something?

Thanks

asked Nov 25 '18 at 13:55

Pablo

6712

Since FP arithmetic is not associative, and the merge order in Spark is non-deterministic (computeSVD uses treeAggregate to compute the Gramian matrix in computeGramianMatrix), some fluctuations in the results are expected. Since no RNG is involved, setting a seed wouldn't make any difference.

– user6910411
Nov 25 '18 at 14:11

Wow, that is a deep answer. If you turn your comment into a post I will be glad to accept it, thank you!

– Pablo
Nov 25 '18 at 19:14

apache-spark apache-spark-mllib apache-spark-ml svd non-deterministic

1 Answer

Whenever you work with numeric computations in Apache Spark you have to keep in mind two things:

FP arithmetic is not associative.

scala> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)

res0: Boolean = false

Every exchange in Spark is a potential source of non-determinism. To achieve optimal performance Spark can merge partial results of the upstream tasks in an arbitrary order.

This could be addressed with some defensive programming, but the run-time overhead is typically too high to be useful in practice.

Because of that, the final results can fluctuate even if the procedure doesn't depend on a random number generator (like computeSVD), or if the generator seed is set.
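The effect of merge order can be reproduced without a cluster. The following is a minimal sketch in plain Scala, not Spark code: the object name MergeOrderDemo, the partition indices, and the partial values are all made up for illustration. It sums the same three partial results in two arrival orders, and then shows the kind of defensive fix mentioned above, which is to impose a fixed order (here by sorting on a hypothetical partition index) before merging.

```scala
object MergeOrderDemo {
  // (partitionIndex, partialSum) pairs as they might arrive from upstream tasks.
  // The values are hypothetical and chosen to expose the rounding difference.
  val partials: Seq[(Int, Double)] = Seq(0 -> 0.1, 1 -> 0.2, 2 -> 0.3)

  // Merge strictly in arrival order: the result depends on task completion order.
  def mergeInArrivalOrder(xs: Seq[(Int, Double)]): Double =
    xs.foldLeft(0.0) { case (acc, (_, v)) => acc + v }

  // Defensive variant: sort by partition index first, so the addition order is fixed.
  def mergeInFixedOrder(xs: Seq[(Int, Double)]): Double =
    mergeInArrivalOrder(xs.sortBy(_._1))

  val arrivalA: Double = mergeInArrivalOrder(partials)         // (0.1 + 0.2) + 0.3
  val arrivalB: Double = mergeInArrivalOrder(partials.reverse) // (0.3 + 0.2) + 0.1

  def main(args: Array[String]): Unit = {
    println(s"arrival order A: $arrivalA")
    println(s"arrival order B: $arrivalB")
    println(s"fixed order agrees: ${mergeInFixedOrder(partials) == mergeInFixedOrder(partials.reverse)}")
  }
}
```

The fixed-order variant has to collect and sort all partial results before combining them, which hints at why this approach is expensive when the partials are large matrices rather than single doubles.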

In practice there is really not much you can do about it, short of rewriting the internals. If you suspect that the problem is ill-conditioned, you can try building multiple models with some random noise added, to see how sensitive the final predictions are, and take this into account when the prediction is generated.
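A rough sketch of such a sensitivity check, in plain Scala, might look as follows. It is a simplification: instead of rebuilding the SVD model on perturbed input, it adds Gaussian noise directly to a vector of predicted ratings and records how often each item lands in the top k. The object name TopKStability, the scores, the noise level, and the seed are all assumptions chosen for illustration.

```scala
import scala.util.Random

object TopKStability {
  // Indices of the k highest scores. Ties are broken by item index so that
  // the ranking step itself is deterministic.
  def topK(scores: Vector[Double], k: Int): Set[Int] =
    scores.zipWithIndex
      .sortWith((a, b) => a._1 > b._1 || (a._1 == b._1 && a._2 < b._2))
      .take(k)
      .map(_._2)
      .toSet

  // Fraction of noisy trials in which each item appears in the top k.
  def inclusionRate(scores: Vector[Double], k: Int, trials: Int,
                    noise: Double, seed: Long): Vector[Double] = {
    val rng = new Random(seed)
    val counts = Array.fill(scores.length)(0)
    for (_ <- 0 until trials) {
      val perturbed = scores.map(s => s + noise * rng.nextGaussian())
      topK(perturbed, k).foreach(i => counts(i) += 1)
    }
    counts.map(_.toDouble / trials).toVector
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical predicted ratings: item 0 is a clear winner, items 1 and 2
    // are almost tied, item 3 is clearly out.
    val scores = Vector(4.9, 4.2000001, 4.2, 1.0)
    val rates = inclusionRate(scores, k = 2, trials = 1000, noise = 1e-3, seed = 42L)
    rates.zipWithIndex.foreach { case (r, i) => println(f"item $i: $r%.2f") }
  }
}
```

Items whose inclusion rate is close to 1 are stable recommendations; items with a rate well below 1 are effectively tied with their neighbours, so their presence or order in the list should not be treated as meaningful.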

answered Nov 27 '18 at 21:42

user6910411

35.2k1088108
