Timestamp difference between rows for each user - PySpark DataFrame























I have a CSV file with the following structure:



USER_ID   location   timestamp
1         1001       19:11:39 5-2-2010
1         6022       17:51:19 6-6-2010
1         1041       11:11:39 5-2-2010
2         9483       10:51:23 3-2-2012
2         4532       11:11:11 4-5-2012
3         4374       03:21:23 6-9-2013
3         4334       04:53:13 4-5-2013


Basically, what I would like to do, using PySpark or plain Python, is to calculate the timestamp difference between different locations for the same USER_ID. An example of the expected result would be:



USER_ID   location    timestamp difference
1         1001-1041   08:00:00


Any idea how to reach a solution?
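Since plain Python is also an option, a minimal standard-library sketch of this pairwise computation over the sample rows; abs() makes the difference unsigned, matching the 08:00:00 in the expected output:

    from datetime import datetime
    from itertools import combinations, groupby

    # Sample rows from the question; in practice they would come from csv.reader
    rows = [
        (1, 1001, '19:11:39 5-2-2010'),
        (1, 6022, '17:51:19 6-6-2010'),
        (1, 1041, '11:11:39 5-2-2010'),
        (2, 9483, '10:51:23 3-2-2012'),
        (2, 4532, '11:11:11 4-5-2012'),
        (3, 4374, '03:21:23 6-9-2013'),
        (3, 4334, '04:53:13 4-5-2013'),
    ]

    def parse(ts):
        # %d-%m-%Y also accepts single-digit days/months such as '5-2-2010'
        return datetime.strptime(ts, '%H:%M:%S %d-%m-%Y')

    rows.sort(key=lambda r: r[0])
    for user_id, group in groupby(rows, key=lambda r: r[0]):
        # Every unordered pair of locations for this user
        for (_, loc_a, ts_a), (_, loc_b, ts_b) in combinations(group, 2):
            diff = abs(parse(ts_a) - parse(ts_b))
            print(user_id, f'{loc_a}-{loc_b}', diff)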










      python apache-spark pyspark pyspark-sql






asked Nov 19 at 17:11 by imed eddines
























1 Answer



























Assuming you want every possible combination of locations for a user, you just need to do a self-join on USER_ID and then subtract the timestamp columns. The one trick here is to use unix_timestamp to parse the datetime strings into seconds-since-epoch integers that support subtraction.

Example code:



    from pyspark.sql.functions import col, unix_timestamp

    data = [
        (1, 1001, '19:11:39 5-2-2010'),
        (1, 6022, '17:51:19 6-6-2010'),
        (1, 1041, '11:11:39 5-2-2010'),
        (2, 9483, '10:51:23 3-2-2012'),
        (2, 4532, '11:11:11 4-5-2012'),
        (3, 4374, '03:21:23 6-9-2013'),
        (3, 4334, '04:53:13 4-5-2013')
    ]

    df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
    # 'd-M-yyyy' also matches single-digit days/months such as '5-2-2010'
    df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss d-M-yyyy'))

    # Rename columns to avoid conflicts after the self-join
    df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
    cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

    # Filter out reversed duplicates and rows where the location is the same
    # on both sides, then compute the difference in seconds
    pairs = (cartesian.filter("location < location2")
             .drop("USER_ID2")
             .withColumn("diff", col("timestamp2") - col("timestamp")))
    pairs.show()
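To reshape that into the expected output (a location pair plus an HH:mm:ss difference), the seconds can be formatted by hand; a minimal sketch, where abs makes the difference unsigned as in the question's example (gaps longer than a day simply show more than 23 hours):

    from pyspark.sql.functions import abs as abs_, col, concat_ws, floor, format_string

    secs = abs_(col('diff'))  # unsigned, matching the 08:00:00 in the question
    result = (pairs
              .withColumn('location', concat_ws('-', 'location', 'location2'))
              .withColumn('difference',
                          format_string('%02d:%02d:%02d',
                                        floor(secs / 3600),
                                        floor((secs % 3600) / 60),
                                        secs % 60))
              .select('USER_ID', 'location', 'difference'))
    result.show()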





answered Nov 19 at 17:45 by Ryan Widmaier (accepted)
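If the reading in the title, the gap between consecutive rows for each user, is what's wanted instead of all pairs, a window function avoids the join entirely; a sketch over the same parsed df:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag

    w = Window.partitionBy('USER_ID').orderBy('timestamp')
    # Seconds since the user's previous event; null on each user's first row
    gaps = df.withColumn('gap', col('timestamp') - lag('timestamp').over(w))
    gaps.show()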






























                     
