Filtering a Spark partitioned table is not working in PySpark
I am using Spark 2.3 and have written one DataFrame to create a Hive partitioned table, using the DataFrameWriter class method in PySpark.
newdf.coalesce(1).write.format('orc').partitionBy('veh_country').mode("overwrite").saveAsTable('emp.partition_Load_table')
Here is my table structure and partition information.
hive> desc emp.partition_Load_table;
OK
veh_code varchar(17)
veh_flag varchar(1)
veh_model smallint
veh_country varchar(3)
# Partition Information
# col_name data_type comment
veh_country varchar(3)
hive> show partitions partition_Load_table;
OK
veh_country=CHN
veh_country=USA
veh_country=RUS
Now I am reading this table back into a DataFrame in PySpark.
df2_data = spark.sql("""
SELECT *
from udb.partition_Load_table
""");
df2_data.show() --> is working
But I am not able to filter it using the partition key column:
from pyspark.sql.functions import col
newdf = df2_data.where(col("veh_country")=='CHN')
I am getting the below error message:
: java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem,
however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
Whereas, when I create the DataFrame by specifying the absolute HDFS path of the table, the filter and where clauses work as expected.
newdataframe = spark.read.format("orc").option("header","false").load("hdfs/path/emp.db/partition_load_table")
The below works:
newdataframe.where(col("veh_country")=='CHN').show()
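For reference, here is a minimal sketch of the workaround suggested in the error text, assuming the setting has to be applied when the SparkSession is built (I have not verified the performance impact the error warns about):
from pyspark.sql import SparkSession

# Workaround named in the error message: stop Spark from asking the Hive
# metastore to filter partition metadata; Spark then lists partitions itself.
# The error text warns this can degrade performance on large tables.
spark = (SparkSession.builder
         .appName("partition-filter-workaround")   # hypothetical app name
         .config("spark.sql.hive.manageFilesourcePartitions", "false")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT * FROM udb.partition_Load_table").where("veh_country = 'CHN'").show()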
My question is: why was Spark not able to filter the DataFrame in the first place, and why does it throw the error "Filtering is supported only on partition keys of type string" even though my veh_country column is defined as a string/varchar data type?
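One thing I am considering, based on the error text, is casting the partition column to string before writing so the metastore stores it as string rather than varchar(3). A sketch of that idea (untested for this exact setup, names reused from the code above):
from pyspark.sql.functions import col

# Untested idea: if the metastore can only filter string partition keys,
# store veh_country as string (not varchar(3)) when (re)creating the table.
newdf_str = newdf.withColumn("veh_country", col("veh_country").cast("string"))
(newdf_str.coalesce(1)
    .write.format("orc")
    .partitionBy("veh_country")
    .mode("overwrite")
    .saveAsTable("emp.partition_Load_table"))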
hive pyspark partitioning
asked 2 days ago by vikrant rana (edited 2 days ago)