Pivot String column on PySpark DataFrame

I have a simple dataframe like this:



rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])

df_data.show()
+---+----+----+------+----+
| id|type|cost|  date|ship|
+---+----+----+------+----+
|  0|   A| 223|201603|PORT|
|  0|   A|  22|201602|PORT|
|  0|   A| 422|201601|DOCK|
|  1|   B|3213|201602|DOCK|
|  1|   B|3213|201601|PORT|
|  2|   C|2321|201601|DOCK|
+---+----+----+------+----+


and I need to pivot it by date:



df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|2321.0|  null|  null|
|  0|   A| 422.0|  22.0| 223.0|
|  1|   B|3213.0|3213.0|  null|
+---+----+------+------+------+


Everything works as expected. But now I need to pivot it and get a non-numeric column:



df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()


and of course I would get an exception:



AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'


I would like to generate something along the lines of



+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|  DOCK|  null|  null|
|  0|   A|  DOCK|  PORT|  DOCK|
|  1|   B|  DOCK|  PORT|  null|
+---+----+------+------+------+


Is that possible with pivot?










python apache-spark dataframe pyspark apache-spark-sql

asked May 27 '16 at 15:11 by Ivan, edited Jan 5 at 15:34 by Community


1 Answer






Assuming that (id, type, date) combinations are unique and your only goal is pivoting and not aggregation, you can use first (or any other function not restricted to numeric values):



from pyspark.sql.functions import first

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date")
    .agg(first("ship"))
    .show())

## +---+----+------+------+------+
## | id|type|201601|201602|201603|
## +---+----+------+------+------+
## |  2|   C|  DOCK|  null|  null|
## |  0|   A|  DOCK|  PORT|  PORT|
## |  1|   B|  PORT|  DOCK|  null|
## +---+----+------+------+------+

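If the set of dates is known up front, pivot can also take an explicit list of values, which saves Spark the extra job it would otherwise run to discover the distinct dates. A minimal sketch, assuming the three dates from the sample data:

from pyspark.sql.functions import first

(df_data
    .groupby("id", "type")
    .pivot("date", ["201601", "201602", "201603"])  # explicit pivot values
    .agg(first("ship"))
    .show())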

If these assumptions are not correct, you'll have to pre-aggregate your data. For example, for the most common ship value:



from pyspark.sql.functions import max, struct

(df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship")))
    .show())

## +---+----+--------+--------+--------+
## | id|type|  201601|  201602|  201603|
## +---+----+--------+--------+--------+
## |  2|   C|[1,DOCK]|    null|    null|
## |  0|   A|[1,DOCK]|[1,PORT]|[1,PORT]|
## |  1|   B|[1,PORT]|[1,DOCK]|    null|
## +---+----+--------+--------+--------+





answered May 27 '16 at 15:20 by zero323, edited May 27 '16 at 20:51

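Each pivoted cell in the pre-aggregated version above is a struct of (count, ship). If only the ship string is wanted, one option is to extract that field afterwards; a sketch, assuming the three date columns produced from the sample data:

from pyspark.sql.functions import col, max, struct

pivoted = (df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship"))))

# Keep only the ship field of each struct; the column names below come from
# the sample data and would differ for other inputs.
(pivoted
    .select("id", "type",
            *[col(c).getField("ship").alias(c) for c in ["201601", "201602", "201603"]])
    .show())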

• Another solution would be to use collect_set to keep all the ship values. – Jacek Laskowski, Nov 15 '18 at 10:30











• @Jacek, can you give that solution here? – stack0114106, Nov 15 '18 at 15:44











• @stack0114106 Replace max(struct in the above with collect_set and you're done. Looking for an opportunity to use it as a full-fledged answer, though. You know any questions that beg for such an answer? ;-) – Jacek Laskowski, Nov 16 '18 at 6:05

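A minimal sketch of the collect_set variant mentioned in the comments, assuming the same df_data as in the question; each pivoted cell then holds the set of ship values seen for that id/type/date combination:

from pyspark.sql.functions import collect_set

(df_data
    .groupby("id", "type")
    .pivot("date")
    .agg(collect_set("ship"))
    .show())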









