Get unique rows of dask array without using dask dataframe
Is there a way of getting unique rows of a dask array that is larger than the available memory? Ideally, without converting it to a dask DataFrame?



I currently use this approach:



import dask.array as da
import dask.dataframe as dd

dx = da.random.random((10000, 10000), chunks=(1000, 1000))
ddf = dd.from_dask_array(dx)
ddf = ddf.drop_duplicates()
dx = ddf.to_dask_array(lengths=True)


This works for larger datasets than np.unique(dx, axis=0) does, but it eventually runs out of memory as well.



I'm using Python 3.6 (but can upgrade), Dask 0.20, and Ubuntu 18.04 LTS.
Tags: python, numpy, dask
asked Nov 20 at 9:05 by Edgar H (edited Nov 20 at 10:05)
1 Answer
          You can always just use numpy.unique:



import dask.array as da
import numpy as np

dx = da.random.random((10000, 10000), chunks=(1000, 1000))
# np.unique materializes the dask array as a single in-memory NumPy
# array before deduplicating, so this is not an out-of-core operation.
dx = np.unique(dx, axis=0)


This may still leave you with memory problems for datasets larger than your RAM, since NumPy will run the whole calculation on a single node, in memory. There is a dask.array.unique function, but it doesn't support the axis keyword yet, which means it flattens the array and returns the unique scalar values, not the unique rows. The sorting functions that would allow for any kind of hand-rolled parallelized version don't seem to be implemented in dask.array either.
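
That said, if your data contains many duplicate rows, a two-stage scheme can at least reduce peak memory: deduplicate each block of rows in parallel, then run a final np.unique over the (hopefully much smaller) concatenation. This is only a sketch under that assumption, not a true out-of-core solution; the deduplicated pieces must still fit in memory together, and it buys nothing for data with few duplicates (such as the random array above):

import dask
import dask.array as da
import numpy as np

dx = da.random.random((10000, 10000), chunks=(1000, 1000))

# Rechunk so every block spans whole rows, then deduplicate the
# rows of each block independently.
row_blocks = dx.rechunk({0: 1000, 1: -1}).to_delayed().ravel()
deduped = [dask.delayed(np.unique)(block, axis=0) for block in row_blocks]

# The partial results (and their concatenation) must still fit in RAM.
partials = dask.compute(*deduped)
unique_rows = np.unique(np.vstack(partials), axis=0)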



My recommendation would be to just suck it up for now and convert to dask.dataframe. That approach ensures you get the correct output, even if it's not the fastest conceivable implementation.



          Edit



I initially thought there might be a simple hack to implement the axis parameter for dask.array.unique. However, the void-dtype ("blob") trick that numpy.unique uses to implement its own axis keyword, viewing each row as a single opaque element, doesn't carry over easily to Dask arrays, owing to the presence of chunks.
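
For reference, here is roughly what that trick looks like in plain NumPy (a sketch; numpy.unique's real implementation differs in its details, and the byte-wise comparison has quirks, e.g. it treats -0.0 and 0.0 as distinct):

import numpy as np

def unique_rows(a):
    # Reinterpret each row of a contiguous 2-D array as a single opaque
    # binary blob, so np.unique can compare whole rows as scalars.
    a = np.ascontiguousarray(a)
    void_view = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, first_idx = np.unique(void_view.ravel(), return_index=True)
    return a[np.sort(first_idx)]  # unique rows, in their original order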



So, no clever workaround for now. Just use dask.dataframe.
answered Nov 20 at 9:19 by tel, edited Nov 20 at 10:39 (accepted)
• The problem with this approach is that I get a MemoryError. – Edgar H, Nov 20 at 10:04