Timestamp difference between rows for each user - PySpark DataFrame
I have a CSV file with the following structure:
USER_ID  location  timestamp
1        1001      19:11:39 5-2-2010
1        6022      17:51:19 6-6-2010
1        1041      11:11:39 5-2-2010
2        9483      10:51:23 3-2-2012
2        4532      11:11:11 4-5-2012
3        4374      03:21:23 6-9-2013
3        4334      04:53:13 4-5-2013
Basically, what I would like to do, using PySpark or plain Python, is calculate the timestamp difference between different locations for the same user_id. An example of the expected result would be:
USER_ID  location   timestamp difference
1        1001-1041  08:00:00
Any idea how to reach the solution?
python apache-spark pyspark pyspark-sql
asked Nov 19 at 17:11
imed eddines
33
1 Answer
Assuming you want every possible combination of locations for each user, you just need to do a self-join on USER_ID and then subtract the timestamp columns. The one trick here is to use unix_timestamp to parse your datetime strings into integers (seconds since the epoch) that support subtraction.
Example Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp, col

# In the pyspark shell a SparkSession named `spark` already exists;
# otherwise create one.
spark = SparkSession.builder.getOrCreate()

data = [
    (1, 1001, '19:11:39 5-2-2010'),
    (1, 6022, '17:51:19 6-6-2010'),
    (1, 1041, '11:11:39 5-2-2010'),
    (2, 9483, '10:51:23 3-2-2012'),
    (2, 4532, '11:11:11 4-5-2012'),
    (3, 4374, '03:21:23 6-9-2013'),
    (3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

# Parse the datetime strings into Unix timestamps (seconds since the epoch).
# Single-letter d and M accept the one-digit days and months in the data.
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss d-M-yyyy'))

# Rename columns to avoid conflicts after the self-join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

# Filter out reversed duplicates and rows where the location is the same
# on both sides; the resulting diff is in seconds
pairs = (cartesian.filter("location < location2")
                  .drop("USER_ID2")
                  .withColumn("diff", col("timestamp2") - col("timestamp")))
pairs.show()
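The diff column above is raw seconds. To match the shape of the expected output in the question (a hyphenated location pair and an HH:mm:ss difference), one possible follow-up is the sketch below; the column names are illustrative, and it assumes you want the absolute difference regardless of row order.
from pyspark.sql.functions import abs as sql_abs, concat_ws, format_string, floor

# Absolute difference in seconds, so the pair order does not flip the sign
secs = sql_abs(col("diff"))

result = (pairs
          .withColumn("location", concat_ws("-", "location", "location2"))
          # Split the seconds into hours, minutes, seconds; hours are not
          # wrapped at 24, so multi-day gaps show up as large hour counts
          .withColumn("difference", format_string("%02d:%02d:%02d",
                      floor(secs / 3600), floor(secs % 3600 / 60), secs % 60))
          .select("USER_ID", "location", "difference"))
result.show()
For user 1's 1001/1041 pair this prints 08:00:00, matching the expected result in the question.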
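Note that the question starts from a CSV file rather than an inline list. A minimal loading sketch, assuming a space-separated file with a header row laid out exactly as shown in the question (the path users.csv is a placeholder): because the timestamp itself contains a space, each line is split on the first two spaces only.
lines = spark.sparkContext.textFile('users.csv')  # hypothetical path
header = lines.first()
rows = (lines.filter(lambda line: line != header)
             .map(lambda line: line.split(' ', 2))   # USER_ID, location, rest
             .map(lambda p: (int(p[0]), int(p[1]), p[2])))
df = spark.createDataFrame(rows, ['USER_ID', 'location', 'timestamp'])
From there the unix_timestamp parsing above applies unchanged.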
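Since the question also allows plain Python, here is a minimal sketch without Spark, using itertools.combinations to pair up each user's rows (the rows list mirrors the sample data):
from collections import defaultdict
from datetime import datetime
from itertools import combinations

rows = [
    (1, 1001, '19:11:39 5-2-2010'),
    (1, 6022, '17:51:19 6-6-2010'),
    (1, 1041, '11:11:39 5-2-2010'),
    (2, 9483, '10:51:23 3-2-2012'),
    (2, 4532, '11:11:11 4-5-2012'),
    (3, 4374, '03:21:23 6-9-2013'),
    (3, 4334, '04:53:13 4-5-2013'),
]

# Group (location, parsed timestamp) pairs by user
by_user = defaultdict(list)
for user_id, location, ts in rows:
    by_user[user_id].append((location, datetime.strptime(ts, '%H:%M:%S %d-%m-%Y')))

# Every unordered pair of locations per user, with the absolute time difference
for user_id, entries in by_user.items():
    for (loc_a, t_a), (loc_b, t_b) in combinations(entries, 2):
        print(user_id, '%s-%s' % (loc_a, loc_b), abs(t_a - t_b))
For user 1 this prints, among others, 1 1001-1041 8:00:00.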
answered Nov 19 at 17:45
Ryan Widmaier
2,756