Hive file format comparison


Apache Hive 0.11+

In this post I’d like to compare the different file formats for Hive, as well as the query execution times depending on file format, compression and execution engine. As for the data, I’m using the Uber data set that I also used in my last post. I’m aware that the following query stats are not exhaustive and you may get different results in your environment or with other table formats. You could also try different SerDes for Hive as well as consider compression. Still, it gives you some idea that both the file format and the execution engine play an important role in Hive’s query performance. When choosing a file format, however, you may also want to consider data management aspects. For example, if you receive your source files in CSV format, then you will likely process the files in this format at least during the first processing step.

As test queries I used the query to measure the total trip time (query 1) and the query to find all trips ending at San Francisco airport (query 2) from my last post. Here are the results for the file formats I tested:


Here is some information about the different file formats being used here:

File format  Description
textfile     Delimited text file (for example, tab-separated fields)
rcfile       Internal Hive format (binary)
orc          Columnar storage format (highly compressed, binary, introduced with Hive 0.11)
parquet      Columnar storage format (compressed, binary).
             Link: http://parquet.incubator.apache.org/
             Parquet is supported via a plugin in Hive since version 0.10 and natively in Hive 0.13 and later.
avro         Serialization file format from Apache Avro (contains schema and data; tools available for processing).
             Link: http://avro.apache.org/
             Avro is supported in Hive since version 0.9.
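In Hive, the storage format is chosen per table via the STORED AS clause of the table DDL. Here is a minimal sketch; the table and column names are illustrative placeholders, not the actual Uber data set schema:

```sql
-- Hypothetical table definitions; column names are illustrative only.
CREATE TABLE trips_text (
  trip_id  STRING,
  start_ts TIMESTAMP,
  end_ts   TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The same table stored in one of the binary columnar formats:
CREATE TABLE trips_orc (
  trip_id  STRING,
  start_ts TIMESTAMP,
  end_ts   TIMESTAMP
)
STORED AS ORC;
```

Swapping TEXTFILE for RCFILE, ORC, PARQUET or AVRO is all it takes to change the on-disk format; queries against the table stay the same.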

For the table create/write time I measured a “create table as select” (CTAS) into the specific format. As you can see, the resulting size of the table depends a lot on the file format. The columnar Orc file format compresses the data very efficiently:
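Such a CTAS statement copies an existing table into a new one while converting it to the target format in a single step. A sketch of the pattern, with placeholder table names rather than the ones from my test setup:

```sql
-- Convert the text-based source table into Parquet in one statement.
-- Table names are hypothetical placeholders.
CREATE TABLE trips_parquet
STORED AS PARQUET
AS SELECT * FROM trips_text;
```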


Using Tez as the execution engine (set hive.execution.engine=tez) results in much better performance than map reduce (set hive.execution.engine=mr). The total time for the two queries is shown in this table:
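The execution engine can be switched per session before running a query; no change to the query itself is needed. A sketch (the table name is a placeholder, and the actual test queries are in my last post):

```sql
-- Switch the engine for this session, then run the query as usual.
SET hive.execution.engine=tez;   -- or: SET hive.execution.engine=mr;
SELECT COUNT(*) FROM trips_orc;  -- placeholder query, not one of the two test queries
```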


In map reduce mode, query time does not seem to depend much on the file format being used:


However, when running the queries on Tez, you’ll see a significant difference between file formats like Parquet and Orc (with Orc being about 30% faster).

