How to skip CSV header in Hive External Table?
HiveHive Problem Overview
I am using Cloudera's version of Hive and trying to create an external table over a csv file that contains the column names in the first column. Here is the code that I am using to do that.
CREATE EXTERNAL TABLE Test (
RecordId int,
FirstName string,
LastName string
)
ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde'
WITH SerDeProperties (
"separatorChar" = ","
)
STORED AS TEXTFILE
LOCATION '/user/File.csv'
Sample Data
RecordId,FirstName,LastName
1,"John","Doe"
2,"Jane","Doe"
Can anyone help me with how to skip the first row or do I need to add an intermediate step?
Hive Solutions
Solution 1 - Hive
As of Hive v0.13.0, you can use skip.header.line.count
table property:
create external table testtable (name string, message string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location '/testtable'
TBLPROPERTIES ("skip.header.line.count"="1");
Use ALTER TABLE
for an existing table:
ALTER TABLE tablename
SET TBLPROPERTIES ("skip.header.line.count"="1");
Please note that while it works it comes with its own issues. When there is more than one output file generated i.e. reducers are greater than 1, it skips the first record for each and every file which might not necessarily be the desired behaviour.
Solution 2 - Hive
While you have your answer from Daniel, here are some customizations possible using OpenCSVSerde
:
CREATE EXTERNAL TABLE `mydb`.`mytable`(
`product_name` string,
`brand_id` string,
`brand` string,
`color` string,
`description` string,
`sale_price` string)
PARTITIONED BY (
`seller_id` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\t',
'quoteChar' = '"',
'escapeChar' = '\\')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
'serialization.null.format' = '',
'skip.header.line.count' = '1')
With this, you have total control over the separator, quote character, escape character, null handling and header handling.
Solution 3 - Hive
Just append below property in your query and the first header or line int the record will not load or it will be skipped.
Try this
tblproperties ("skip.header.line.count"="1");
Solution 4 - Hive
skip.header.line.count
will skip the header line.
However, if you have some external tool accessing accessing the table, it will still see that actual data without skipping those lines
Solution 5 - Hive
create external table table_name(
Year int,
Month int,
column_name data_type )
row format delimited fields terminated by ','
location '/user/user_name/example_data' TBLPROPERTIES('serialization.null.format'='', 'skip.header.line.count'='1');
Solution 6 - Hive
I am not quite sure if it works with ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' but I guess that it should be similar to ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
In your case first row will be treated like normal row. But first field fails to be INT so all fields, for first row, will be set as NULL. You need only one intermediate step to fix it:
INSERT OVERWRITE TABLE Test
SELECT * from Test WHERE RecordId IS NOT NULL
Only one drawback is that your original csv file will be modified. I hope it helps. GL!
Solution 7 - Hive
Just for those who have already created the table with the header. Here is the alter command for the same. This is useful in case you already have the table and want the first row to be ignored without dropping and recreating. It also helps with people to familiarize with ALTER as a option with TBLPROPERTIES.
ALTER TABLE tablename SET TBLPROPERTIES ("skip.header.line.count"="1");
Solution 8 - Hive
I also struggled with this and found no way to tell hive to skip first row, like there is e.g. in Greenplum. So finally I had to remove it from the files. e.g. "cat File.csv | grep -v RecordId > File_no_header.csv"