Pivot String column on PySpark DataFrame
I have a simple dataframe like this:
rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])
df_data.show()
+---+----+----+------+----+
| id|type|cost| date|ship|
+---+----+----+------+----+
| 0| A| 223|201603|PORT|
| 0| A| 22|201602|PORT|
| 0| A| 422|201601|DOCK|
| 1| B|3213|201602|DOCK|
| 1| B|3213|201601|PORT|
| 2| C|2321|201601|DOCK|
+---+----+----+------+----+
and I need to pivot it by date:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
| 2| C|2321.0| null| null|
| 0| A| 422.0| 22.0| 223.0|
| 1| B|3213.0|3213.0| null|
+---+----+------+------+------+
Everything works as expected. But now I need to pivot it and get a non-numeric column:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()
and of course I would get an exception:
AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'
I would like to generate something along the lines of:
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|  DOCK|  null|  null|
|  0|   A|  DOCK|  PORT|  PORT|
|  1|   B|  PORT|  DOCK|  null|
+---+----+------+------+------+
Is that possible with pivot?
python apache-spark dataframe pyspark apache-spark-sql
1 Answer
Assuming that (id, type, date) combinations are unique and your only goal is pivoting and not aggregation, you can use first (or any other function not restricted to numeric values):
from pyspark.sql.functions import first

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date")
    .agg(first("ship"))
    .show())
## +---+----+------+------+------+
## | id|type|201601|201602|201603|
## +---+----+------+------+------+
## | 2| C| DOCK| null| null|
## | 0| A| DOCK| PORT| PORT|
## | 1| B| PORT| DOCK| null|
## +---+----+------+------+------+
If these assumptions are not correct, you'll have to pre-aggregate your data, for example by taking the most common ship value:
from pyspark.sql.functions import max, struct

(df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship")))
    .show())
## +---+----+--------+--------+--------+
## | id|type| 201601| 201602| 201603|
## +---+----+--------+--------+--------+
## | 2| C|[1,DOCK]| null| null|
## | 0| A|[1,DOCK]|[1,PORT]|[1,PORT]|
## | 1| B|[1,PORT]|[1,DOCK]| null|
## +---+----+--------+--------+--------+
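Note that the pivoted cells now hold (count, ship) structs. If you want plain strings, one option is to extract the ship field afterwards. A minimal sketch, assuming the three date column names are hardcoded from the output above:

from pyspark.sql.functions import col, max, struct

# Pivot as before, but keep the result instead of showing it.
pivoted = (df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship"))))

# Each date column is a struct; keep only its "ship" field.
(pivoted
    .select("id", "type",
            *[col(c)["ship"].alias(c) for c in ["201601", "201602", "201603"]])
    .show())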
Another solution would be to use collect_set to keep all the ship values. – Jacek Laskowski, Nov 15 '18 at 10:30
@Jacek, can you give that solution here? – stack0114106, Nov 15 '18 at 15:44
@stack0114106 Replace max(struct(...)) in the above with collect_set and you're done. Looking for an opportunity to use it as a full-fledged answer though. Do you know any questions that beg for such an answer? ;-) – Jacek Laskowski, Nov 16 '18 at 6:05
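For reference, a minimal sketch of the collect_set variant Jacek describes, assuming the same df_data as above:

from pyspark.sql.functions import collect_set

(df_data
    .groupby("id", "type")
    .pivot("date")
    .agg(collect_set("ship"))
    .show())

# Each pivoted cell now holds an array of the distinct ship values
# for that (id, type, date). With this sample data every cell has a
# single element, e.g. [DOCK]; with duplicate (id, type, date) rows
# it would collect all distinct values instead of picking one.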