timestamp difference between rows for each user - Pyspark Dataframe

I have a CSV file with the following structure:

USER_ID    location    timestamp
1          1001        19:11:39 5-2-2010
1          6022        17:51:19 6-6-2010
1          1041        11:11:39 5-2-2010
2          9483        10:51:23 3-2-2012
2          4532        11:11:11 4-5-2012
3          4374        03:21:23 6-9-2013
3          4334        04:53:13 4-5-2013

What I would like to do, using PySpark or plain Python, is calculate the timestamp difference between different locations for the same USER_ID. An example of the expected result:

USER_ID    location     timestamp difference
1          1001-1041    08:00:00

Any idea how to reach this solution?

asked Nov 19 at 17:11

imed eddines

python apache-spark pyspark pyspark-sql

1 Answer
accepted

Assuming you want every possible combination of locations for a user, you just need to self-join on USER_ID and then subtract the timestamp columns. The one trick here is to use unix_timestamp to parse the datetime strings into integers (seconds since the epoch), which support subtraction.

Example code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp, col

# Reuse the active session (the pyspark shell already provides `spark`)
spark = SparkSession.builder.getOrCreate()

data = [
    (1, 1001, '19:11:39 5-2-2010'),
    (1, 6022, '17:51:19 6-6-2010'),
    (1, 1041, '11:11:39 5-2-2010'),
    (2, 9483, '10:51:23 3-2-2012'),
    (2, 4532, '11:11:11 4-5-2012'),
    (3, 4374, '03:21:23 6-9-2013'),
    (3, 4334, '04:53:13 4-5-2013')
]

df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

# Parse the strings to epoch seconds. The pattern assumes day-month-year order;
# single-letter d and M are used because the sample values are not zero-padded,
# which the stricter parser in newer Spark versions may reject with dd-MM.
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss d-M-yyyy'))

# Rename columns to avoid conflicts after the join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

# Filter out reversed duplicates and rows where the location is the same on both sides
pairs = (
    cartesian.filter("location < location2")
             .drop("USER_ID2")
             .withColumn("diff", col("timestamp2") - col("timestamp"))  # signed, in seconds
)
pairs.show()
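
Since the question also allows plain Python, here is a minimal sketch of the same pairing without Spark. It is only an illustration under a few assumptions: the timestamps are day-month-year (as in the pattern above), the data fits in memory, and the difference is shown as an absolute duration in HH:MM:SS to match the expected output in the question. The helper names parse, format_seconds and location_pairs are made up for this example.

```python
from datetime import datetime
from itertools import combinations, groupby
from operator import itemgetter

# Sample rows from the question: (USER_ID, location, timestamp string)
rows = [
    (1, 1001, '19:11:39 5-2-2010'),
    (1, 6022, '17:51:19 6-6-2010'),
    (1, 1041, '11:11:39 5-2-2010'),
    (2, 9483, '10:51:23 3-2-2012'),
    (2, 4532, '11:11:11 4-5-2012'),
    (3, 4374, '03:21:23 6-9-2013'),
    (3, 4334, '04:53:13 4-5-2013'),
]


def parse(ts):
    # Assumption: day-month-year order, same as the Spark pattern above
    return datetime.strptime(ts, '%H:%M:%S %d-%m-%Y')


def format_seconds(seconds):
    # Render an absolute duration as HH:MM:SS; hours can exceed 24
    seconds = abs(int(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'


def location_pairs(rows):
    result = []
    # Sort by user, then location, so each pair comes out as smaller-larger
    ordered = sorted(rows, key=itemgetter(0, 1))
    for user_id, group in groupby(ordered, key=itemgetter(0)):
        for (_, loc_a, ts_a), (_, loc_b, ts_b) in combinations(group, 2):
            if loc_a == loc_b:
                continue  # mirror the "location < location2" filter
            diff = (parse(ts_b) - parse(ts_a)).total_seconds()
            result.append((user_id, f'{loc_a}-{loc_b}', format_seconds(diff)))
    return result


pairs = location_pairs(rows)
print(pairs[0])  # (1, '1001-1041', '08:00:00')
```

Note that the Spark version keeps the signed difference in seconds, while this sketch drops the sign for display. If the direction matters, keep the raw value of diff instead of formatting it.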

answered Nov 19 at 17:45

Ryan Widmaier
