Spark SVD is not reproducible



























I am using the computeSVD method of Spark's IndexedRowMatrix class (in Scala). I have noticed that it has no setSeed() method. I get slightly different results across multiple runs on the same input matrix, possibly because of the internal algorithm Spark uses. Although Spark also implements an approximate scalable SVD algorithm, the source code suggests that computeSVD() from IndexedRowMatrix applies the exact version, not the approximate one.



Since I am building recommendations from the SVD results, and the user and item latent-factor matrices differ between runs, I actually get different recommendation lists. In some runs roughly the same items appear in a different order; in others a few new items enter the list and some drop out. This happens because the predicted ratings are often almost tied after imputing the missing entries of the input ratings matrix that is passed to computeSVD().



Has anyone else had this problem? Is there a way to make this fully deterministic, or am I missing something?



Thanks

































  • Since FP arithmetic is not associative, and the merge order in Spark is non-deterministic (computeSVD uses treeAggregate in computeGramianMatrix), some fluctuations in the results are expected. Since no RNG is involved, setting a seed wouldn't make any difference.

    – user6910411
    Nov 25 '18 at 14:11













  • Wow, that is a deep answer. If you turn your comment into a post I will be glad to accept it, thank you!

    – Pablo
    Nov 25 '18 at 19:14





































apache-spark apache-spark-mllib apache-spark-ml svd non-deterministic
















asked Nov 25 '18 at 13:55









Pablo

























1 Answer














Whenever you work with numeric computations in Apache Spark you have to keep in mind two things:





  • FP arithmetic is not associative.



    scala> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
    res0: Boolean = false



  • Every exchange in Spark is a potential source of non-determinism. To achieve optimal performance Spark can merge partial results of the upstream tasks in an arbitrary order.



    This could be addressed with some defensive programming, but the run-time overhead is typically too high to be useful in practice.
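To make the two points above concrete, here is a minimal plain-Scala sketch (no Spark involved). The object name `MergeOrderDemo`, the three partial values, and the `mergeCanonical` helper are illustrative assumptions, not Spark internals. It merges the same partial results in different arrival orders, and then shows one possible form of "defensive programming": imposing a canonical order before merging.

```scala
object MergeOrderDemo {
  // Hypothetical partial sums, as three upstream tasks might report them.
  val partials: Seq[Double] = Seq(1e16, 1.0, -1e16)

  // Merge partial results left to right, in the order they arrive.
  def merge(xs: Seq[Double]): Double = xs.foldLeft(0.0)(_ + _)

  // Defensive variant (an assumption, not what Spark does): sort the
  // partials into a fixed order first, so arrival order no longer matters.
  def mergeCanonical(xs: Seq[Double]): Double = merge(xs.sortWith(_ < _))

  def main(args: Array[String]): Unit = {
    partials.permutations.foreach { p =>
      println(s"${p.mkString(" + ")} -> naive ${merge(p)}, canonical ${mergeCanonical(p)}")
    }
  }
}
```

With these values the naive merge yields 0.0 for some arrival orders and 1.0 for others, because 1.0 is lost when it is added to a value of magnitude 1e16. The canonical variant always yields the same result, but it needs all partial results collected and sorted in one place, which is the kind of overhead that makes such an approach unattractive in a distributed aggregation.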




Because of that, the final results can fluctuate even if the procedure doesn't depend on a random number generator (as with computeSVD), or if the generator seed is set.



In practice there is really not much you can do about it, short of rewriting the internals. If you suspect that the problem is ill-conditioned, you can try building multiple models with some random noise to see how sensitive the final predictions are, and take this into account when the prediction is generated.
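As a rough illustration of such a sensitivity check, the sketch below uses a cheaper stand-in for rebuilding the model: it perturbs a vector of predicted scores directly with small seeded Gaussian noise and counts how often each item reaches the top N. Everything here (`RankingStability`, `topN`, `topNFrequency`, the noise scale) is a hypothetical helper written for this example; it does not call Spark, and perturbing scores is only an approximation of perturbing the input matrix and recomputing the SVD.

```scala
import scala.util.Random

object RankingStability {
  // Indices of the n highest scores. Exact ties are broken by item index,
  // so the ordering itself is deterministic for a given score vector.
  def topN(scores: Vector[Double], n: Int): Vector[Int] =
    scores.indices.toVector
      .sortWith((a, b) => scores(a) > scores(b) || (scores(a) == scores(b) && a < b))
      .take(n)

  // Fraction of noisy trials in which each item lands in the top n.
  // Items that never make it are simply absent from the map.
  def topNFrequency(scores: Vector[Double], n: Int, noise: Double,
                    trials: Int, seed: Long): Map[Int, Double] = {
    val rng = new Random(seed)
    val hits = Vector.fill(trials) {
      val perturbed = scores.map(s => s + rng.nextGaussian() * noise)
      topN(perturbed, n)
    }.flatten
    hits.groupBy(identity).map { case (item, occ) => item -> occ.size.toDouble / trials }
  }

  def main(args: Array[String]): Unit = {
    // Items 0 and 2 are almost tied; items 1 and 3 are clearly worse.
    val scores = Vector(5.0, 1.0, 4.999999, 0.5)
    println(topNFrequency(scores, n = 1, noise = 1e-3, trials = 200, seed = 42L))
  }
}
```

Items whose frequency is far from both 0 and 1 are the near-ties that are likely to swap between runs. One way to take this into account is to treat such items as interchangeable, or to apply a fixed secondary tie-break such as the item id.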







































        answered Nov 27 '18 at 21:42









        user6910411































