Group GPS points with Pandas












I have a Pandas dataframe of towers, like:



site      lat      lon
18ALOP01  11.1278  14.3578
18ALOP02  11.1278  14.3578
18ALOP12  11.1288  14.3575
18PENO01  11.1580  14.2898


I need to group towers that are too close to each other (within 50 m). I wrote a script that performs a "self cross join", calculates the distance between every combination of sites, and sets the same id for those whose distance is less than a threshold. With n sites it calculates n^2 - n combinations, so it scales poorly. Is there a better way of doing that?










  • What does your expected output look like?

    – Ken Dekalb
    Nov 23 '18 at 18:57











  • What happens if point a is 40m from point b, 80m from point c, and point b is 45m from point c? Are they all in the same group?

    – andersource
    Nov 23 '18 at 19:07











  • Yes, no problem.

    – Krogiar
    Nov 23 '18 at 19:27
















python pandas geo geopandas






asked Nov 23 '18 at 17:42 by Krogiar
edited Nov 23 '18 at 19:02 by Krogiar













1 Answer
Assuming the number and the "true" locations of the sites are unknown, you could try the MeanShift clustering algorithm. It is a general-purpose algorithm and not highly scalable, but it will likely be faster than implementing your own clustering algorithm in pure Python. You could also experiment with bin_seeding=True as an optimization, if binning data points into a grid is an acceptable shortcut for pruning the starting seeds. (Note: if binning data points into a grid, rather than computing the Euclidean distance between points, is an acceptable "full" solution, that would probably be the fastest approach to your problem.)
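To make the grid-binning note concrete, here is a minimal sketch, not a definitive implementation. It assumes an equirectangular projection around the mean latitude is accurate enough at this scale, and the names EARTH_RADIUS_M, CELL_M, cell_x, cell_y and group are hypothetical, chosen only for illustration. The sample rows are the ones from the question.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (approximation)
CELL_M = 50                 # grid cell size, matching the 50 m threshold

df = pd.DataFrame({
    "site": ["18ALOP01", "18ALOP02", "18ALOP12", "18PENO01"],
    "lat": [11.1278, 11.1278, 11.1288, 11.1580],
    "lon": [14.3578, 14.3578, 14.3575, 14.2898],
})

# Assumption: project lat/lon to local metres with an equirectangular
# approximation centred on the mean latitude of the data.
lat0 = np.radians(df["lat"].mean())
x = EARTH_RADIUS_M * np.radians(df["lon"]) * np.cos(lat0)
y = EARTH_RADIUS_M * np.radians(df["lat"])

# Integer cell indices; sites that fall in the same cell share a group id.
df["cell_x"] = np.floor(x / CELL_M).astype(int)
df["cell_y"] = np.floor(y / CELL_M).astype(int)
df["group"] = df.groupby(["cell_x", "cell_y"]).ngroup()

print(df[["site", "group"]])
```

This is a single pass over the rows plus a groupby, so no pairwise distances are computed. The trade-off is that two sites a few meters apart can straddle a cell boundary and end up in different groups, while two sites in opposite corners of one cell can be about 71 m apart and still share a group.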



Here's an example using scikit-learn's implementation of MeanShift, where the x/y coordinates are in meters and bandwidth=50 gives clusters with a radius of roughly 50 m.



In [2]: from sklearn.cluster import MeanShift

In [3]: import numpy as np

In [4]: X = np.array([
   ...:     [0, 1], [51, 1], [100, 1], [151, 1],
   ...: ])

In [5]: clustering = MeanShift(bandwidth=50).fit(X)  # or speed up with bin_seeding=True

In [6]: print(clustering.labels_)
[1 0 0 2]

In [7]: print(clustering.cluster_centers_)
[[ 75.5   1. ]
 [  0.    1. ]
 [151.    1. ]]
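The comments on the question accept chained grouping (a near b and b near c puts all three in one group). If that is the goal, a different technique from MeanShift may fit more directly: bucket the points into a grid and merge neighbors with union-find. The following is a minimal sketch under that assumption, not the answer's method. It expects x/y coordinates already in meters, as in the example above, and the function name group_close_points is hypothetical.

```python
import math
from collections import defaultdict


def group_close_points(points, threshold_m=50.0):
    """Label points so that any two closer than threshold_m, directly or
    through a chain of neighbours, share a label.

    points is a list of (x, y) tuples in metres.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Bucket point indices by grid cell with side length threshold_m.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        key = (math.floor(x / threshold_m), math.floor(y / threshold_m))
        cells[key].append(idx)

    # Only points in the same or one of the 8 adjacent cells can be
    # closer than threshold_m, so only those pairs are compared.
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) < threshold_m:
                            parent[find(i)] = find(j)

    # Relabel roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


print(group_close_points([(0, 1), (51, 1), (100, 1), (151, 1)]))  # [0, 1, 1, 2]
print(group_close_points([(0, 0), (40, 0), (85, 0)]))             # [0, 0, 0]
```

The second call mirrors the chained case raised in the comments: the first and last points are 85 m apart, but each is within 50 m of the middle one, so all three share a label. When sites are spread out, each point is compared only with a handful of neighbors instead of all n - 1 others.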





answered Nov 24 '18 at 19:47 by Garrett
edited Jan 16 at 17:37































