Parquet is a great file format for use with higher-level tools like Impala, Hive, Pig, and Spark. But what if you want to use it in MapReduce? Cloudera provides an easy-to-follow example of how to do this, and it is a good guide to basic usage of the Parquet MapReduce API. You can gain considerably more speed, though, by using a different object model.
The Parquet SimpleGroup toString() method, which the sample code relies on when using the default Parquet object model in MapReduce, is extremely slow. I recently had a client whose job was taking over an hour to process only 1.9 GB of data (18 GB uncompressed) because they were following the sample code.
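To make the cost concrete, here is a toy sketch in plain Java. It does not use the Parquet API: `Row` is a hypothetical stand-in for a decoded record, and the "name: value" one-field-per-line layout of its `toString()` is an assumption for illustration. It contrasts pulling one field out of a stringified record with reading the same field through a typed accessor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordAccessSketch {

    // Hypothetical stand-in for a decoded row; this is NOT a Parquet class.
    static final class Row {
        private final Map<String, Object> fields = new LinkedHashMap<>();

        Row put(String name, Object value) {
            fields.put(name, value);
            return this;
        }

        // Typed access: hand back the stored value directly.
        long getLong(String name) {
            return (Long) fields.get(name);
        }

        // Renders every field as "name: value", one per line (assumed layout).
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Object> e : fields.entrySet()) {
                sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
            }
            return sb.toString();
        }
    }

    // Made-up sample data for the demo.
    static Row sampleRow() {
        return new Row().put("id", 42L).put("city", "Austin").put("clicks", 1234L);
    }

    // String path: stringify the whole row, split it, then re-parse one field.
    static long viaToString(Row row, String name) {
        for (String line : row.toString().split("\n")) {
            String[] parts = line.split(": ", 2);
            if (parts.length == 2 && parts[0].equals(name)) {
                return Long.parseLong(parts[1]);
            }
        }
        throw new IllegalArgumentException("no such field: " + name);
    }

    // Typed path: a single lookup, with no string building or parsing.
    static long viaTypedAccess(Row row, String name) {
        return row.getLong(name);
    }

    public static void main(String[] args) {
        Row row = sampleRow();
        System.out.println(viaToString(row, "clicks"));
        System.out.println(viaTypedAccess(row, "clicks"));
    }
}
```

Both paths return the same value, but the string path formats every field of the row and then parses text again just to recover one number. Repeating that for every record in a job adds up.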
The solution to this problem is to use a different in-memory object model. There are two pieces to using Parquet: the object model and the storage format. Parquet provides its famous binary columnar storage format and excellent compression. It also provides an object model in the form of the “example” Group class. Other object models exist, though, including Avro, Google Protocol Buffers, Thrift, Hive, and Pig. With any of them you keep the benefits of Parquet’s efficient storage, and you gain a more robust and versatile in-memory object model for manipulating the data once it is loaded.
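As a rough sketch of what swapping the object model looks like in a map-only MapReduce job, the driver below reads Parquet files through `AvroParquetInputFormat` so the mapper receives Avro `GenericRecord` values instead of Group objects. This is not the code from my GitHub example. The class name, the job name, and the `city` and `clicks` fields are hypothetical, and the `org.apache.parquet.avro` package name assumes a reasonably recent parquet-avro release (older releases used `parquet.avro`).

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ParquetAvroReadSketch {

    // Parquet input formats use a Void key; the value is the decoded record.
    public static class ClickMapper
            extends Mapper<Void, GenericRecord, Text, LongWritable> {

        @Override
        protected void map(Void key, GenericRecord record, Context context)
                throws IOException, InterruptedException {
            // "city" and "clicks" are hypothetical field names; use your schema's.
            Object city = record.get("city");
            Object clicks = record.get("clicks");
            if (city != null && clicks != null) {
                // Fields are read by name from the record; no toString() parsing.
                context.write(new Text(city.toString()),
                              new LongWritable(((Number) clicks).longValue()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parquet-avro-read-sketch");
        job.setJarByClass(ParquetAvroReadSketch.class);

        // Same Parquet files on disk, but decoded into Avro records in memory.
        job.setInputFormatClass(AvroParquetInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(ClickMapper.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The storage side does not change at all here. Only the input format class and the mapper's value type differ from the Group-based sample, which is why the switch tends to be a small code change.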
After moving the object model for this particular client to Avro, the job duration dropped to under 7 minutes. For a full example of using Parquet with an Avro object model, check out my GitHub, and let me know if you have any questions.