Group GPS points with Pandas
I have a Pandas DataFrame of towers, like:
site      lat      lon
18ALOP01  11.1278  14.3578
18ALOP02  11.1278  14.3578
18ALOP12  11.1288  14.3575
18PENO01  11.1580  14.2898
I need to group sites that are too close to each other (within 50 m). I wrote a script that performs a "self cross join", calculates the distance between every pair of sites, and assigns the same id to sites whose distance is below a threshold. With n sites it computes n^2 - n combinations, so it scales poorly. Is there a better way to do this?
python pandas geo geopandas
asked Nov 23 '18 at 17:42 by Krogiar (edited Nov 23 '18 at 19:02)
What does your expected output look like? – Ken Dekalb, Nov 23 '18 at 18:57
What happens if point a is 40m from point b, 80m from point c, and point b is 45m from point c? Are they all in the same group? – andersource, Nov 23 '18 at 19:07
Yes, no problem. – Krogiar, Nov 23 '18 at 19:27
1 Answer
Assuming the number and the "true" locations of the sites are unknown, you could try the MeanShift clustering algorithm. It is a general-purpose algorithm and not highly scalable, but it will likely be faster than implementing your own clustering algorithm in Python. You could also experiment with bin_seeding=True as an optimization, if binning data points into a grid is an acceptable shortcut for pruning the starting seeds. (Note: if binning data points into a grid, rather than computing the Euclidean distance between points, is an acceptable "full" solution, that would probably be the fastest approach to your problem.)
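To make the grid idea in the note concrete, here is a minimal sketch, not a definitive implementation. It assumes the question's column names (site, lat, lon), a 50 m cell size, and a rough constant of about 111,320 m per degree of latitude; the names CELL_M, M_PER_DEG_LAT, cell_x, cell_y and group are made up for illustration. It avoids all pairwise distances, at the cost that two sites less than 50 m apart can still land in different groups when a cell boundary falls between them.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "site": ["18ALOP01", "18ALOP02", "18ALOP12", "18PENO01"],
    "lat": [11.1278, 11.1278, 11.1288, 11.1580],
    "lon": [14.3578, 14.3578, 14.3575, 14.2898],
})

CELL_M = 50.0               # grid cell size in metres (assumed threshold)
M_PER_DEG_LAT = 111_320.0   # rough metres per degree of latitude (assumption)

# Approximate metres east/north; longitude is scaled by cos(latitude)
lat0 = np.radians(df["lat"].mean())
x_m = df["lon"] * M_PER_DEG_LAT * np.cos(lat0)
y_m = df["lat"] * M_PER_DEG_LAT

# Integer cell indices: sites sharing a cell get the same pair of indices
df["cell_x"] = np.floor(x_m / CELL_M).astype(int)
df["cell_y"] = np.floor(y_m / CELL_M).astype(int)

# One group id per occupied cell, computed in a single pass over the rows
df["group"] = df.groupby(["cell_x", "cell_y"]).ngroup()

print(df[["site", "group"]])
```

The work grows roughly linearly with the number of sites, which is the appeal compared with the n^2 - n pairwise comparison, but the boundary effect means it is only a "full" solution if approximate grouping is acceptable.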
Here's an example using scikit-learn's implementation of MeanShift, where the x/y coordinates are in meters and the bandwidth of 50 gives clusters with a radius of roughly 50 m.
In [2]: from sklearn.cluster import MeanShift
In [3]: import numpy as np
In [4]: X = np.array([
   ...:     [0, 1], [51, 1], [100, 1], [151, 1],
   ...: ])
In [5]: clustering = MeanShift(bandwidth=50).fit(X) # OR speed up with bin_seeding=True
In [6]: print(clustering.labels_)
[1 0 0 2]
In [7]: print(clustering.cluster_centers_)
[[ 75.5 1. ]
[ 0. 1. ]
[151. 1. ]]
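The session above works on coordinates already expressed in meters, whereas the question's DataFrame holds latitude and longitude in degrees. One possible way to bridge that gap, sketched under assumptions, is an equirectangular projection around the centroid of the data. The helper name latlon_to_local_xy and the constant EARTH_RADIUS_M are hypothetical, and the approximation is only reasonable when all sites lie within a few tens of kilometers of each other.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def latlon_to_local_xy(lat_deg, lon_deg):
    """Project lat/lon in degrees to x/y metres around the data's centroid.

    Equirectangular approximation: adequate for small areas only.
    """
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat0, lon0 = lat.mean(), lon.mean()
    x = EARTH_RADIUS_M * (lon - lon0) * np.cos(lat0)  # metres east
    y = EARTH_RADIUS_M * (lat - lat0)                 # metres north
    return np.column_stack([x, y])


# Coordinates copied from the question's sample rows
lats = [11.1278, 11.1278, 11.1288, 11.1580]
lons = [14.3578, 14.3578, 14.3575, 14.2898]
X = latlon_to_local_xy(lats, lons)

# Pairwise straight-line distances in metres, as a quick sanity check
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.round(dist))
```

The resulting array X has the same shape as the X in the session above, so it could be passed to MeanShift(bandwidth=50).fit(X) in the same way. The pairwise matrix is there only to check the projection on a handful of points; building it for every site would reintroduce the quadratic cost the question is trying to avoid.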
answered Nov 24 '18 at 19:47 by Garrett (edited Jan 16 at 17:37)