Filtering a Spark partitioned table is not working in PySpark
I am using Spark 2.3 and have written a DataFrame to create a Hive partitioned table using the DataFrameWriter class in PySpark:



newdf.coalesce(1).write.format('orc').partitionBy('veh_country').mode("overwrite").saveAsTable('emp.partition_Load_table')


Here are my table structure and partition information:



hive> desc emp.partition_Load_table;
OK
veh_code varchar(17)
veh_flag varchar(1)
veh_model smallint
veh_country varchar(3)

# Partition Information
# col_name data_type comment

veh_country varchar(3)

hive> show partitions partition_Load_table;
OK
veh_country=CHN
veh_country=USA
veh_country=RUS


Now I am reading this table back into a DataFrame in PySpark:



df2_data = spark.sql("""
SELECT *
FROM udb.partition_Load_table
""")

df2_data.show()  # this works


But I am not able to filter it using the partition key column:



from pyspark.sql.functions import col
newdf = df2_data.where(col("veh_country")=='CHN')


I am getting the error message below:



: java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. 
You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem,
however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
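
For reference, the workaround named in the error message is applied when the SparkSession is created. A minimal sketch (the application name is illustrative; as the message itself warns, this disables metastore partition pruning and can degrade performance):

from pyspark.sql import SparkSession

# Disable metastore-managed file source partitions, as the error message
# suggests. Spark will then list partition directories itself instead of
# asking the Hive metastore to filter them.
spark = (SparkSession.builder
         .appName("partition-filter-workaround")  # illustrative name
         .config("spark.sql.hive.manageFilesourcePartitions", "false")
         .enableHiveSupport()
         .getOrCreate())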


However, when I create the DataFrame by specifying the absolute HDFS path of the table, the filter and where clauses work as expected:



newdataframe = spark.read.format("orc").option("header","false").load("hdfs/path/emp.db/partition_load_table")


The following works:



newdataframe.where(col("veh_country")=='CHN').show()
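
This most likely works because the path-based read bypasses the Hive metastore entirely: Spark's own partition discovery infers veh_country from the veh_country=CHN style directory names and types it as a plain string, so the filter is evaluated by Spark itself rather than pushed down to the metastore. A quick check of the inferred type (not from the original post):

newdataframe.printSchema()  # veh_country should appear as string, not varchar(3)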


My question is: why was Spark not able to filter the DataFrame in the first place? And why does it throw the error "Filtering is supported only on partition keys of type string" even though my veh_country column is defined as varchar, i.e. a string-like data type?
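
If the root cause is the varchar(3) partition key (the Hive metastore's partition-filter API accepts direct filter pushdown only for string partition keys, and varchar apparently does not count as string there), one possible fix is to store the partition column as a plain string when writing the table. A sketch under that assumption, reusing the original write:

from pyspark.sql.functions import col

# cast the partition key to plain string so the recreated table's partition
# column is string rather than varchar(3); overwrite drops and recreates the table
(newdf.withColumn("veh_country", col("veh_country").cast("string"))
      .coalesce(1)
      .write.format("orc")
      .partitionBy("veh_country")
      .mode("overwrite")
      .saveAsTable("emp.partition_Load_table"))

Alternatively, setting spark.sql.hive.manageFilesourcePartitions to false (as sketched above) trades partition pruning for a working filter.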
Tags: hive, pyspark, partitioning
asked 2 days ago by vikrant rana (edited 2 days ago)