Comparing Apache Hadoop Data Storage Formats
Apache Hadoop stores data in the Hadoop Distributed File System (HDFS). The data can be stored in any of several supported file formats.
To compare them, ask some questions about their properties:
- How easy is the file format to query? Queries should not incur much latency.
- How easy is the file format to serialize? Serialization should not incur much latency.
- Is the file format splittable? Data is split several times during processing.
- Does the file format support compression?
A text file is broken into lines, with a line end indicated by a carriage return (\r) or line feed (\n). When Hadoop reads a text file, each key is the byte offset of a line within the file and each value is the line of text itself. Use a text file if the data stored in HDFS needs to be readily accessible in text format, such as email messages.
Text files are splittable, but they can be relatively bulky because they do not support block compression, so queries may not be as efficient as with the other formats. For compression, use a file-level compression codec that supports splitting, such as BZip2.
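The keys-and-values behavior above can be sketched in a few lines of Python. This is a simplified illustration of the idea behind Hadoop's TextInputFormat, not Hadoop code: each line becomes a record keyed by its byte offset in the file.

```python
import io

def line_records(data: bytes):
    """Yield (byte_offset, line) pairs, mirroring how a line-oriented
    reader such as Hadoop's TextInputFormat keys each line by its
    position in the file."""
    offset = 0
    for line in io.BytesIO(data):
        yield offset, line.rstrip(b"\r\n")
        offset += len(line)  # advance by the raw line length, newline included

records = list(line_records(b"alpha\nbeta\ngamma\n"))
# records[0] == (0, b"alpha"); records[1] == (6, b"beta")
```

Because every record boundary is a newline, any worker can seek into the middle of the file, scan forward to the next newline, and start producing records from there, which is what makes the format splittable.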
A sequence file stores data in rows as binary key/value pairs. The binary format makes it smaller than a text file. Sequence files are splittable.
Three different sequence file formats are supported:
- Uncompressed—neither keys nor values are compressed
- Record compressed—only values are compressed
- Block compressed—both keys and values are compressed, with multiple records compressed together as a block
One advantage over the text format is that sequence files support block compression: multiple records are grouped and compressed together as a block, with a sync marker written between blocks. Because each block can be decompressed independently of the others, a block-compressed sequence file remains splittable, which suits MapReduce processing.
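The block-compression idea can be sketched as follows. This is a deliberately simplified format, not the real SequenceFile wire format (whose sync marker, headers, and encodings differ): key/value pairs are packed into blocks, each block is compressed on its own, and a marker precedes each block so a reader can resume at any block boundary.

```python
import struct
import zlib

SYNC = b"\x00SYNC\x00"  # stand-in for SequenceFile's 16-byte sync marker

def write_blocks(records, block_size=2):
    """Pack (key, value) byte pairs into independently compressed blocks.
    Each block decompresses without the others, so a reader can start at
    any sync marker -- that independence is what makes the file splittable."""
    out = bytearray()
    for i in range(0, len(records), block_size):
        block = b"".join(
            struct.pack(">I", len(k)) + k + struct.pack(">I", len(v)) + v
            for k, v in records[i:i + block_size]
        )
        payload = zlib.compress(block)
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)

def read_blocks(data):
    """Walk the blocks, decompress each one, and unpack its records."""
    records = []
    pos = 0
    while pos < len(data):
        pos += len(SYNC)  # skip the marker that opens each block
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        block = zlib.decompress(data[pos:pos + n])
        pos += n
        bpos = 0
        while bpos < len(block):
            (klen,) = struct.unpack_from(">I", block, bpos); bpos += 4
            k = block[bpos:bpos + klen]; bpos += klen
            (vlen,) = struct.unpack_from(">I", block, bpos); bpos += 4
            v = block[bpos:bpos + vlen]; bpos += vlen
            records.append((k, v))
    return records
```

A split that begins mid-file only needs to scan forward to the next sync marker before it can hand complete blocks to a mapper.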
Avro is a compact, fast, row-based binary data format. Avro is suitable for storing complex data, as it is a schema-based data serialization system. The schema is encoded along with the data, so Avro data is self-describing.
A sequence file could also be used to store complex data, but it is not self-describing: the structure of the data must be tagged along with each record, which makes sequence files slower to serialize and deserialize than Avro for complex data structures. A separate, up-front schema is easier to encode and decode.
Avro also supports block compression, so Avro files remain splittable.
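The self-describing property can be sketched like this. The sketch uses JSON as a stand-in for Avro's compact binary encoding, and the `User` schema is a made-up example; the point it illustrates is that the schema is written once at the head of the file, so any reader can decode the records using only what the file itself contains.

```python
import json
import struct

# Hypothetical record schema, written in an Avro-like JSON form.
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "int"},
               {"name": "email", "type": "string"}],
}

def write_container(schema, records):
    """Embed the schema once in a header; the data that follows is
    decodable by any reader using only that embedded schema."""
    header = json.dumps(schema).encode()
    body = json.dumps(records).encode()  # stand-in for Avro's binary encoding
    return struct.pack(">I", len(header)) + header + body

def read_container(blob):
    """Recover both schema and records from the file alone."""
    (n,) = struct.unpack_from(">I", blob)
    schema = json.loads(blob[4:4 + n])
    records = json.loads(blob[4 + n:])
    return schema, records
```

Contrast this with a sequence file, where the reader must already know how the opaque value bytes are structured before it can interpret them.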
Parquet is a binary columnar storage format. Instead of storing adjacent rows together, it stores adjacent columns together, which is especially useful when most queries touch only a subset of the columns: only the needed column data is fetched, rather than scanning whole rows to find it. Columnar storage also compresses well, because the values within a column are similar to one another, and compressing similar data is more efficient and yields smaller files.
Parquet may be less suitable for MapReduce processing, which typically scans the complete data set and so gains little from a columnar layout. Parquet is a good fit for Apache Hive queries that read only a subset of the columns.
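The row-versus-column contrast can be shown with plain Python data structures. This is a conceptual sketch, not Parquet's on-disk layout, and the table contents are invented for illustration: pivoting rows into per-column lists lets a query over one column touch only that column's values.

```python
# Row layout: whole records stored together, as in text, sequence, or Avro files.
rows = [
    {"user": "ana",  "country": "BR", "clicks": 10},
    {"user": "ben",  "country": "US", "clicks": 3},
    {"user": "carl", "country": "US", "clicks": 7},
]

# Column layout: each column's values stored contiguously, as in Parquet.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query over one column reads just that column's contiguous values,
# instead of scanning every field of every row.
total_clicks = sum(columns["clicks"])  # touches only the "clicks" data
```

Note too that each per-column list holds values of a single type and similar shape, which is why columnar data compresses better than interleaved rows.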
When to Use Which File Format
All of the commonly used Hadoop storage formats are binary except for text files. Use the text file format for simple storage, such as CSV and email messages. Use a sequence file for data processed by MapReduce. Use Avro for complex data structures. Use Parquet for data queried over a subset, such as data warehousing in Hive.