How to skip CSV header in Hive External Table?

Hive

Hive Problem Overview


I am using Cloudera's version of Hive and trying to create an external table over a CSV file that contains the column names in the first row. Here is the code that I am using to do that.

CREATE EXTERNAL TABLE Test ( 
  RecordId int, 
  FirstName string, 
  LastName string 
) 
ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' 
WITH SerDeProperties ( 	
  "separatorChar" = ","
) 
STORED AS TEXTFILE 
LOCATION '/user/File.csv'

Sample Data

RecordId,FirstName,LastName
1,"John","Doe"
2,"Jane","Doe"

Can anyone help me with how to skip the first row, or do I need to add an intermediate step?

Hive Solutions


Solution 1 - Hive

As of Hive v0.13.0, you can use the skip.header.line.count table property:

create external table testtable (name string, message string)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
location '/testtable'
TBLPROPERTIES ("skip.header.line.count"="1");

Use ALTER TABLE for an existing table:

ALTER TABLE tablename
SET TBLPROPERTIES ("skip.header.line.count"="1");

Please note that while this works, it comes with its own issues. When more than one output file is generated (i.e. more than one reducer), the first record of each and every file is skipped, which might not be the desired behaviour.

Solution 2 - Hive

While you have your answer from Daniel, here are some customizations possible using OpenCSVSerde:

CREATE EXTERNAL TABLE `mydb`.`mytable`(
    `product_name` string,
    `brand_id` string,
    `brand` string,
    `color` string,
    `description` string,
    `sale_price` string)
PARTITIONED BY (
    `seller_id` string)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = '\t',
    'quoteChar' = '"',
    'escapeChar' = '\\')
STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
    'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
    'serialization.null.format' = '',
    'skip.header.line.count' = '1')

With this, you have total control over the separator, quote character, escape character, null handling and header handling.
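As a quick usage sketch for the table above (the seller_id value '42' and the partition path below are made-up examples, not from the original answer), you can register a partition and then query it; the header line of each file is dropped at read time, not rewritten on disk:

ALTER TABLE `mydb`.`mytable` ADD IF NOT EXISTS
    PARTITION (seller_id = '42')   -- hypothetical partition value
    LOCATION 'hdfs://namenode.com:port/data/mydb/mytable/seller_id=42';

-- skip.header.line.count applies when Hive reads the files,
-- so the underlying files still contain their header line.
SELECT product_name, brand, sale_price
FROM `mydb`.`mytable`
WHERE seller_id = '42'
LIMIT 10;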


Solution 3 - Hive

Just append the property below to your query and the header (the first line of the file) will be skipped rather than loaded.

Try this

tblproperties ("skip.header.line.count"="1");

Solution 4 - Hive

skip.header.line.count will skip the header line.

However, if some external tool accesses the table's files directly, it will still see the actual data without those lines being skipped.

Solution 5 - Hive

create external table table_name (
  Year int,
  Month int,
  column_name data_type
)
row format delimited fields terminated by ','
location '/user/user_name/example_data'
TBLPROPERTIES ('serialization.null.format'='', 'skip.header.line.count'='1');

Solution 6 - Hive

I am not quite sure whether it works with ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde', but I guess it should behave similarly to ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
In your case, the first row will be treated like a normal row, but the first field cannot be parsed as INT, so RecordId will be NULL for that row. You need only one intermediate step to fix it:

INSERT OVERWRITE TABLE Test
SELECT * from Test WHERE RecordId IS NOT NULL

The only drawback is that your original CSV file will be modified. I hope it helps. GL!
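If you would rather not rewrite the data under the external table's location, a minimal alternative sketch (the table name CleanTest is hypothetical) is to materialise the filtered rows into a separate managed table and query that instead:

-- CleanTest is a made-up name; CTAS creates a managed table holding only the data rows.
CREATE TABLE CleanTest AS
SELECT * FROM Test
WHERE RecordId IS NOT NULL;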

Solution 7 - Hive

This is for those who have already created the table with the header. Here is the ALTER command for that case: it is useful when you already have the table and want the first row to be ignored without dropping and recreating it. It also helps people familiarize themselves with ALTER as an option for setting TBLPROPERTIES.

ALTER TABLE tablename SET TBLPROPERTIES ("skip.header.line.count"="1");

Solution 8 - Hive

I also struggled with this and found no way to tell Hive to skip the first row, like there is in e.g. Greenplum, so finally I had to remove it from the files, e.g.:

cat File.csv | grep -v RecordId > File_no_header.csv

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Rick Gittins | View Question on Stackoverflow
Solution 1 - Hive | Daniel Koverman | View Answer on Stackoverflow
Solution 2 - Hive | Nirmal | View Answer on Stackoverflow
Solution 3 - Hive | Manish | View Answer on Stackoverflow
Solution 4 - Hive | itsavy | View Answer on Stackoverflow
Solution 5 - Hive | Aman Mundra | View Answer on Stackoverflow
Solution 6 - Hive | www | View Answer on Stackoverflow
Solution 7 - Hive | Sunil K-Standard Chartered | View Answer on Stackoverflow
Solution 8 - Hive | Atanas | View Answer on Stackoverflow