Avro vs. Parquet

Tags: Hadoop, Avro, Parquet

Hadoop Problem Overview


I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand that Parquet is efficient for column-based queries and Avro for full scans, or when we need all of the columns' data!

Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain them to me in simple terms?

Hadoop Solutions


Solution 1 - Hadoop

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());

The Parquet format does seem to be a bit more computationally intensive on the write side (e.g., it requires RAM for buffering and CPU for ordering the data), but it should reduce I/O, storage, and transfer costs, as well as make for efficient reads, especially with SQL-like queries (e.g., Hive or Spark SQL) that only address a portion of the columns.
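
To make the column-pruning benefit concrete, here is a small illustrative sketch (plain Python, no real Parquet/Avro I/O; all names and layouts are made up) comparing how many bytes a single-column query must touch under a row-wise versus a column-wise layout:

```python
# Illustrative sketch only (not real Parquet/Avro encoding): contrast how
# many bytes must be read to answer a single-column query when records are
# stored row-wise versus column-wise.

records = [{"id": i, "name": f"user{i}", "score": i * 2} for i in range(1000)]
columns = ["id", "name", "score"]

# Row-oriented layout (Avro-like): one contiguous blob per record.
row_store = [str([rec[c] for c in columns]).encode() for rec in records]

# Column-oriented layout (Parquet-like): one contiguous blob per column.
col_store = {c: str([rec[c] for rec in records]).encode() for c in columns}

# Query: sum of "score". The row layout must read every record in full...
row_bytes_read = sum(len(blob) for blob in row_store)
# ...while the columnar layout reads only the "score" column's blob.
col_bytes_read = len(col_store["score"])

print(row_bytes_read, col_bytes_read)  # columnar touches far fewer bytes
```

Real Parquet adds encoding, compression, and row-group statistics on top of this idea, but the I/O asymmetry for column-subset queries is the same.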

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (it was derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process even a small number of rows in the last column of each group.
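
The wide-and-shallow effect can be sketched with some back-of-the-envelope arithmetic. Assuming a fixed row-group byte budget and an average encoded value size (the 128 MB budget and 8-byte value size below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope sketch: with a fixed row-group size budget, the
# number of rows that fit in one row group shrinks as the column count
# grows. The budget and per-value size are illustrative assumptions.

def rows_per_group(num_columns, bytes_per_value=8, group_budget=128 * 1024 * 1024):
    """Approximate rows that fit in one row group under the byte budget."""
    return group_budget // (num_columns * bytes_per_value)

narrow = rows_per_group(10)    # a normalized schema with few columns
wide = rows_per_group(5000)    # thousands of columns from a nested schema

print(narrow, wide)  # the wide schema holds far fewer rows per group
```

With thousands of columns, each row group holds only a few thousand rows, so scanning any one column means seeking past a great deal of unrelated column data in every group.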

I haven't had much chance to use Parquet for more normalized/sane data yet, but I understand that, used well, it allows for significant performance improvements.

Solution 2 - Hadoop

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of many columns but you are only interested in a subset of them, you can use Parquet.

HBase is useful when frequent updating of data is involved. Avro is fast at retrieval; Parquet is much faster when only a subset of the columns is scanned.

Solution 3 - Hadoop

Avro

  • Widely used as a serialization platform
  • Row-based, offers a compact and fast binary format
  • Schema is stored in the file, so the data itself can be untagged
  • Files support block compression and are splittable
  • Supports schema evolution
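
As a rough illustration of the last bullet, here is a toy model (plain Python, not the real avro library) of Avro-style reader-schema resolution: a field that the writer didn't record is filled in from the reader's declared default, and writer-only fields are dropped:

```python
# Toy model of Avro-style schema resolution (not the real avro library):
# the reader's schema maps each field name to a default value; missing
# fields take the default, and fields unknown to the reader are ignored.

reader_schema = {"id": None, "name": None, "email": "unknown"}  # field -> default

def resolve(record, schema):
    """Project a written record onto the reader schema, applying defaults."""
    out = {}
    for field, default in schema.items():
        out[field] = record.get(field, default)
    return out

old_record = {"id": 1, "name": "alice", "legacy_flag": True}  # written pre-evolution
print(resolve(old_record, reader_schema))
# → {'id': 1, 'name': 'alice', 'email': 'unknown'}
```

Real Avro resolves writer and reader schemas per the specification (including type promotion and aliases); this sketch only shows the add-a-field-with-a-default case.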

Parquet

  • Column-oriented binary file format
  • Uses the record shredding and assembly algorithm described in the Dremel paper
  • Each data file contains the values for a set of rows
  • Efficient in terms of disk I/O when specific columns need to be queried

From Choosing an HDFS data storage format- Avro vs. Parquet and more

Solution 4 - Hadoop

Both Avro and Parquet are "self-describing" storage formats, meaning that both embed the data, metadata, and schema when storing data in a file. Which format to use depends on the use case. Three aspects form the basis on which you may choose the format that is optimal for your case:

  1. Read/write operations: Parquet is a column-based file format and supports indexing. Because of that, it is suitable for write-once, read-intensive workloads: complex or analytical, low-latency queries. It is generally used by end users/data scientists.
    Avro, meanwhile, being a row-based file format, is best used for write-intensive operations. It is generally used by data engineers. Both support serialization and compression, although they do so in different ways.

  2. Tools: Parquet is a good fit for Impala. (Impala is a Massively Parallel Processing (MPP) SQL query engine that knows how to operate on data residing in one or more external storage engines.) Again, Parquet lends itself well to complex/interactive querying and fast (low-latency) output over data in HDFS. It is supported by CDH (Cloudera Distribution including Hadoop). Hadoop also supports Apache's Optimized Row Columnar (ORC) format (availability depends on the Hadoop distribution), whereas Avro is well suited to Spark processing.

  3. Schema evolution: Evolving a DB schema means changing the DB's structure, and therefore its data and its query processing.
    Both Parquet and Avro support schema evolution, but to varying degrees.
    Parquet is good for 'append' operations, e.g., adding columns, but not for renaming columns unless the 'read' is done by index.
    Avro is better suited than Parquet to appending, deleting, and generally mutating columns. Historically, Avro has provided a richer set of schema-evolution possibilities than Parquet, and although their capabilities have tended to converge, Avro still shines in this area.
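
The rename caveat in point 3 can be sketched as follows (a toy model in plain Python, not real Parquet metadata): once a column is renamed in the table definition, a name-based read no longer finds the stored data, while a position-based read still does:

```python
# Sketch of the rename pitfall: the stored column data is keyed by the
# name it was written with. Names and values here are hypothetical.

stored = {"cust_id": [1, 2, 3], "amount": [10, 20, 30]}
ordered_names = ["cust_id", "amount"]        # the on-disk column order

# After renaming cust_id -> customer_id in the table definition:
# a name-based read no longer finds the column...
name_read = stored.get("customer_id")        # None: the data looks missing
# ...but an index-based read still works, since position 0 is unchanged.
index_read = stored[ordered_names[0]]

print(name_read, index_read)
```

This is why engines that match Parquet columns by name treat a rename as drop-plus-add, while index-based access is unaffected.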

Solution 5 - Hadoop

Your understanding is right. In fact, we ran into a similar situation during a data migration in our DWH. We chose Parquet over Avro as the disk savings we got were almost double what we got with Avro. The query processing time was also much better than with Avro. But yes, our queries were based on aggregations, column-based operations, etc., so Parquet was predictably the clear winner.

We are using Hive 0.12 from the CDH distribution. You mentioned you are running into issues with Hive + Parquet; what are they? We did not encounter any.

Solution 6 - Hadoop

Silver Blaze described it nicely with an example use case and explained how Parquet was the best choice for him. It makes sense to consider one over the other depending on your requirements. I am also putting up a brief description of some other file formats, along with a time/space comparison. Hope that helps.

There are a bunch of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile, and ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Here are some useful links that will get you going.

This Blog Post

This link from MapR [They don't discuss Parquet though]

This link from Inquidia

The links above will get you going. I hope this answers your query.

Thanks!

Solution 7 - Hadoop

Just for a description on Parquet, you can refer here: http://bigdata.devcodenote.com/2015/04/parquet-file-format.html

I intend to write very soon on Avro and a comparison between the 2 as well. Will post it here once done.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Abhishek | View Question on Stackoverflow
Solution 1 - Hadoop | steamer25 | View Answer on Stackoverflow
Solution 2 - Hadoop | Aravind Krishnakumar | View Answer on Stackoverflow
Solution 3 - Hadoop | secfree | View Answer on Stackoverflow
Solution 4 - Hadoop | Aakash Aggarwal | View Answer on Stackoverflow
Solution 5 - Hadoop | Silver Blaze | View Answer on Stackoverflow
Solution 6 - Hadoop | Rahul | View Answer on Stackoverflow
Solution 7 - Hadoop | Abhishek Jain | View Answer on Stackoverflow