Hive: how to show all partitions of a table?
HadoopHiveHadoop Problem Overview
I have a table with 1000+ partitions.
"Show partitions
" command only lists a small number of partitions.
How can i show all partitions?
Update:
-
I found "
show partitions
" command only lists exactly 500 partitions. -
"
select ... where ...
" only processes the 500 partitions!
Hadoop Solutions
Solution 1 - Hadoop
CLI has some limit when ouput is displayed. I suggest to export output into local file:
$hive -e 'show partitions table;' > partitions
Solution 2 - Hadoop
hive> show partitions table_name;
Solution 3 - Hadoop
Okay, I'm writing this answer by extending wmky's answer above & also, assuming that you've configured mysql for your metastore instead of derby.
select PART_NAME FROM PARTITIONS WHERE TBL_ID=(SELECT TBL_ID FROM TBLS WHERE TBL_NAME='<table_name>');
The above query gives you all possible values of the partition columns.
Example:
hive> desc clicks_fact;
OK
time timestamp
..
day date
file_date varchar(8)
# Partition Information
# col_name data_type comment
day date
file_date varchar(8)
Time taken: 1.075 seconds, Fetched: 28 row(s)
I'm going to fetch the values of partition columns.
mysql> select PART_NAME FROM PARTITIONS WHERE TBL_ID=(SELECT TBL_ID FROM TBLS WHERE TBL_NAME='clicks_fact');
+-----------------------------------+
| PART_NAME |
+-----------------------------------+
| day=2016-08-16/file_date=20160816 |
| day=2016-08-17/file_date=20160816 |
....
....
| day=2017-09-09/file_date=20170909 |
| day=2017-09-08/file_date=20170909 |
| day=2017-09-09/file_date=20170910 |
| day=2017-09-10/file_date=20170910 |
+-----------------------------------+
1216 rows in set (0.00 sec)
Returns all partition columns.
Note: JOIN
table DBS
ON DB_ID
when there is a DB involved (i.e, when, multiple DB's have same table_name)
Solution 4 - Hadoop
You can see Hive MetaStore tables,Partitions information in table of "PARTITIONS". You could use "TBLS" join "Partition" to query special table partitions.
Solution 5 - Hadoop
Yet another option is to communicate with Hive Metastore via Thrift protocol.
If you write code in python, you may benefit from hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='hive.metastore.location', port=9083)
>>> with client as c:
... all_partitions = c.get_partitions(db_name='default',
... tbl_name='test_table_with_partitions',
... max_parts=24 * 365 * 3)
...
>>> print([{'dt': part.values[0]} for part in all_partitions])
[{'dt': '20210504'}, {'dt': '20210505'}]
NB: max_parts
is a parameter than cannot be greater than 32767 (java short max value).
If you have Airflow installed together with apache.hive
extra, you create hmsclient
quite easy:
hive_hook = HiveMetastoreHook()
with hive_hook.metastore as hive_client:
... your code goes here ...
This seems a more valid way of communication with Hive Metastore than accessing DB directly (and database engine agnostic BTW).