Pivot String column on PySpark DataFrame

I have a simple dataframe like this:



rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])

df_data.show()
+---+----+----+------+----+
| id|type|cost|  date|ship|
+---+----+----+------+----+
|  0|   A| 223|201603|PORT|
|  0|   A|  22|201602|PORT|
|  0|   A| 422|201601|DOCK|
|  1|   B|3213|201602|DOCK|
|  1|   B|3213|201601|PORT|
|  2|   C|2321|201601|DOCK|
+---+----+----+------+----+


and I need to pivot it by date:



df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|2321.0|  null|  null|
|  0|   A| 422.0|  22.0| 223.0|
|  1|   B|3213.0|3213.0|  null|
+---+----+------+------+------+


Everything works as expected. But now I need to pivot it and get a non-numeric column:



df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()


and of course I would get an exception:



AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'


I would like to generate something along the lines of



+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|  DOCK|  null|  null|
|  0|   A|  DOCK|  PORT|  DOCK|
|  1|   B|  DOCK|  PORT|  null|
+---+----+------+------+------+


Is that possible with pivot?










python apache-spark dataframe pyspark apache-spark-sql

asked May 27 '16 at 15:11 by Ivan, edited Jan 5 at 15:34 by Community


1 Answer






Assuming that (id, type, date) combinations are unique and your only goal is pivoting and not aggregation, you can use first (or any other function not restricted to numeric values):



from pyspark.sql.functions import first

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date")
    .agg(first("ship"))
    .show())

## +---+----+------+------+------+
## | id|type|201601|201602|201603|
## +---+----+------+------+------+
## |  2|   C|  DOCK|  null|  null|
## |  0|   A|  DOCK|  PORT|  PORT|
## |  1|   B|  PORT|  DOCK|  null|
## +---+----+------+------+------+

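If the set of dates is known up front, pivot can also take an explicit list of values, which saves Spark the extra job it would otherwise run to discover the distinct dates. A minimal sketch, assuming the three dates from the sample data:

from pyspark.sql.functions import first

(df_data
    .groupby("id", "type")
    .pivot("date", ["201601", "201602", "201603"])  # explicit pivot values
    .agg(first("ship"))
    .show())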

If these assumptions are not correct, you'll have to pre-aggregate your data. For example, for the most common ship value:



from pyspark.sql.functions import max, struct

(df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship")))
    .show())

## +---+----+--------+--------+--------+
## | id|type|  201601|  201602|  201603|
## +---+----+--------+--------+--------+
## |  2|   C|[1,DOCK]|    null|    null|
## |  0|   A|[1,DOCK]|[1,PORT]|[1,PORT]|
## |  1|   B|[1,PORT]|[1,DOCK]|    null|
## +---+----+--------+--------+--------+





answered May 27 '16 at 15:20 by zero323, edited May 27 '16 at 20:51

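Each pivoted cell in the pre-aggregated version above is a struct of (count, ship). If only the ship string is wanted, one option is to extract that field afterwards; a sketch, assuming the three date columns produced from the sample data:

from pyspark.sql.functions import col, max, struct

pivoted = (df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship"))))

# Keep only the ship field of each struct; the column names below come from
# the sample data and would differ for other inputs.
(pivoted
    .select("id", "type",
            *[col(c).getField("ship").alias(c) for c in ["201601", "201602", "201603"]])
    .show())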

• Another solution would be to use collect_set to keep all the ship values. – Jacek Laskowski, Nov 15 '18 at 10:30











• @Jacek, can you give that solution here? – stack0114106, Nov 15 '18 at 15:44











• @stack0114106 Replace max(struct in the above with collect_set and you're done. Looking for an opportunity to use it as a full-fledged answer, though. You know any questions that beg for such an answer? ;-) – Jacek Laskowski, Nov 16 '18 at 6:05

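A minimal sketch of the collect_set variant mentioned in the comments, assuming the same df_data as in the question; each pivoted cell then holds the set of ship values seen for that id/type/date combination:

from pyspark.sql.functions import collect_set

(df_data
    .groupby("id", "type")
    .pivot("date")
    .agg(collect_set("ship"))
    .show())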









