Get unique rows of dask array without using dask dataframe
Is there a way of getting unique rows of a dask array that is larger than the available memory? Ideally, without converting it to a dask DataFrame?
I currently use this approach:
import dask.array as da
import dask.dataframe as dd
dx = da.random.random((10000, 10000), chunks=(1000, 1000))
ddf = dd.from_dask_array(dx)
ddf = ddf.drop_duplicates()
dx = ddf.to_dask_array(lengths=True)
This works for bigger datasets than np.unique(dx, axis=0), but eventually it also runs out of memory.
I'm using Python 3.6 (but can upgrade), Dask 0.20 and Ubuntu 18.04 LTS.
python numpy dask
edited Nov 20 at 10:05
asked Nov 20 at 9:05
Edgar H
1 Answer
accepted
You can always just use numpy.unique:
import dask.array as da
import numpy as np
dx = da.random.random((10000, 10000), chunks=(1000, 1000))
dx = np.unique(dx, axis=0)
This may still leave you with memory issues when you try to use it with "data sets larger than my RAM", since it will run the calculation on a single node. There is a dask.array.unique function, but it doesn't support the axis keyword yet. This means that it will flatten the array and return the unique single values, not the unique rows. The sorting functions that would allow for any kind of a hand-rolled parallelized version don't seem to be implemented in dask.array either.
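To make the difference between unique values and unique rows concrete, here is a small sketch. It uses plain NumPy as a stand-in: calling numpy.unique without axis flattens its input, which is the same behaviour described above for dask.array.unique. Dask itself is not called here, and the array `rows` is just an illustrative example.

```python
import numpy as np

# Illustrative data: the first and third rows are identical.
rows = np.array([[1, 2],
                 [3, 4],
                 [1, 2]])

# Without `axis`, the input is flattened and unique scalar values come back.
flat_unique = np.unique(rows)
print(flat_unique)   # [1 2 3 4]

# With axis=0, whole rows are compared and unique rows come back.
row_unique = np.unique(rows, axis=0)
print(row_unique)    # [[1 2]
                     #  [3 4]]
```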
My recommendation would be to just suck it up for now and convert to dask.dataframe. This approach ensures that you get the correct output, even if it's not the fastest conceivable implementation.
Edit
I initially thought there might be a simple hack that could be used to implement the axis parameter for dask.array.unique. However, the blob type trick that numpy.unique uses to implement its own axis keyword turns out to not carry over easily to Dask arrays, owing to the presence of chunks.
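For readers unfamiliar with the trick, the sketch below shows the general idea in plain NumPy: each row is viewed as a single structured ("blob") element, so a 1-D unique can compare whole rows at once. This is an illustration of the technique, not NumPy's actual source, and the helper name `unique_rows_structured` is made up for this example. It relies on every row being contiguous in one in-memory buffer, which is presumably what breaks down when a row is split across several column chunks.

```python
import numpy as np

def unique_rows_structured(a):
    """Return the sorted unique rows of a 2-D array via a structured view.

    Hypothetical helper for illustration only.
    """
    a = np.ascontiguousarray(a)  # rows must be contiguous to be viewed as one item
    n_cols = a.shape[1]
    # One field per column, so a single element covers an entire row.
    row_dtype = np.dtype([(f"f{i}", a.dtype) for i in range(n_cols)])
    packed = a.view(row_dtype).ravel()  # shape (n_rows,)
    uniq = np.unique(packed)            # ordinary 1-D unique on "row blobs"
    # Unpack the structured elements back into a plain 2-D array.
    return uniq.view(a.dtype).reshape(-1, n_cols)

a = np.array([[1, 2], [3, 4], [1, 2], [0, 5]])
result = unique_rows_structured(a)
```

On this input the result should match np.unique(a, axis=0).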
So no clever workaround for now. Just use dask.dataframe.
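For completeness, a plain (not clever) fallback is sketched below under a strong assumption that is not part of the thread: the set of distinct rows must fit in memory even though the full array does not. The idea is to deduplicate one block of complete rows at a time and merge the partial results. The function name `unique_rows_blockwise` is hypothetical, and the code uses only NumPy; with a Dask array, each block would be one computed row chunk, after rechunking so that no row is split across chunks. It will not help when nearly all rows are distinct, as with the random data in the question.

```python
import numpy as np

def unique_rows_blockwise(row_blocks):
    """Merge per-block unique rows from an iterable of 2-D arrays.

    Hypothetical helper: each block must contain complete rows, and the
    running set of unique rows is assumed to fit in memory.
    """
    merged = None
    for block in row_blocks:
        block_unique = np.unique(block, axis=0)
        if merged is None:
            merged = block_unique
        else:
            # Combine with what has been seen so far and deduplicate again.
            merged = np.unique(np.concatenate([merged, block_unique]), axis=0)
    return merged

data = np.array([[1, 2], [3, 4], [1, 2], [5, 6], [3, 4]])
# Simulate chunked access by yielding two rows at a time.
blocks = (data[i:i + 2] for i in range(0, len(data), 2))
result_blockwise = unique_rows_blockwise(blocks)
```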
The problem with this approach is that I get a MemoryError
– Edgar H
Nov 20 at 10:04
edited Nov 20 at 10:39
answered Nov 20 at 9:19
tel