Hive file format comparison
Apache Hive 0.11+
In this post I’d like to compare the different file formats for Hive, as well as the execution times for queries depending on the file format, compression and execution engine. As for the data, I’m using the Uber data set that I also used in my last post. I’m aware that the following query stats are not exhaustive and that you may get different results in your environment or with other table formats. You could also try different SerDes for Hive and experiment with compression settings. Still, the numbers give you some idea that both the file format and the execution engine play an important role in Hive’s query performance.
However, when choosing a file format, you should also consider data management aspects. For example, if you receive your source files in CSV format, then you will likely process the files in this format at least during the first processing step.
As test queries I used the query to measure the total trip time (query 1) and the query to find all trips ending at San Francisco airport (query 2) from my last post. Here is the result for the file formats I tested:
Here is some information about the different file formats being used here:
|textfile||separated text file (for example tab separated fields)|
|rcfile||Record Columnar File, Hive’s own binary format|
|orc||columnar storage format (highly compressed, binary, introduced with Hive 0.11)|
|parquet||columnar storage format (compressed, binary); supported by a plugin in Hive since version 0.10 and natively in Hive 0.13 and later|
|avro||serialization file format from Apache Avro (contains schema and data, tools available for processing); supported in Hive since version 0.9|
For the table create/write time I measured a “create table as select” (CTAS) into the specific format. As you can see, the resulting size of the table depends a lot on the file format. The columnar ORC file format compresses the data very efficiently:
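To make the CTAS step more concrete, here is a minimal sketch of what such statements could look like. The table names (trips_text as the existing source table, and the trips_* targets) are hypothetical and not the ones used for the measurements; the tab delimiter is also just an assumption.

```sql
-- Hypothetical names: trips_text is assumed to be the existing source table.

-- Tab-separated text file
CREATE TABLE trips_textfile
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
AS SELECT * FROM trips_text;

-- RCFile and ORC (ORC requires Hive 0.11 or later)
CREATE TABLE trips_rcfile STORED AS RCFILE AS SELECT * FROM trips_text;
CREATE TABLE trips_orc    STORED AS ORC    AS SELECT * FROM trips_text;

-- The STORED AS PARQUET shorthand requires Hive 0.13 or later
CREATE TABLE trips_parquet STORED AS PARQUET AS SELECT * FROM trips_text;

-- The STORED AS AVRO shorthand requires Hive 0.14 or later
CREATE TABLE trips_avro STORED AS AVRO AS SELECT * FROM trips_text;

-- The table parameters include totalSize once statistics have been gathered
DESCRIBE FORMATTED trips_orc;
```

On older Hive versions, Parquet and Avro tables have to be declared with their SerDe and input/output format classes instead of the STORED AS shorthand.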
Using Tez as the execution engine (set hive.execution.engine=tez) results in much better performance than MapReduce (set hive.execution.engine=mr). The total time for the two queries is shown in this table:
In MapReduce mode, query time does not seem to depend much on the file format being used:
However, when running the queries in Tez, you’ll see a significant difference between file formats like Parquet and ORC (with ORC being about 30% faster).
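The execution engine is a per-session setting, so the same query can be timed under both engines back to back. The following is only a sketch: trips_orc is a hypothetical table name, and the COUNT(*) query is a stand-in for the two test queries, which are not reproduced here.

```sql
-- Hypothetical table name; the query is a placeholder, not one of the test queries.

-- Run on classic MapReduce
SET hive.execution.engine=mr;
SELECT COUNT(*) FROM trips_orc;

-- Run the same query on Tez (Tez must be installed on the cluster)
SET hive.execution.engine=tez;
SELECT COUNT(*) FROM trips_orc;
```

The Hive CLI prints the elapsed time after each query, which is enough for a rough comparison like the one above.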