Comparing Apache Hadoop Data Storage Formats

By Deepak Vohra - May 18, 2020

Apache Hadoop stores data in the Hadoop Distributed File System (HDFS). Data in HDFS can be stored in any of several supported file formats.

To compare them, ask some questions about their properties:

How easy is the file format to query? Queries should not incur much latency.
How easy is the file format to serialize? Data serialization should not incur much latency.
Is the file format splittable? Hadoop divides a file into input splits so that tasks can process it in parallel.
Does the file format support compression?

Text file

A text file is broken into lines, with each line ending in a line feed (\n), a carriage return (\r), or both (\r\n). When Hadoop reads a text file, the key is the byte offset of the line within the file and the value is the line of text. Use a text file if the data stored in HDFS needs to be readily accessible in text format, such as email messages.
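The key/value view of a text file can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's reader: the function name `text_records` and the sample data are made up for the example, and the sketch handles \n and \r\n line endings only.

```python
import io

def text_records(data: bytes):
    """Yield (byte_offset, line) pairs: the key is the offset at which
    the line starts, the value is the line without its terminator."""
    offset = 0
    for raw in io.BytesIO(data).readlines():  # splits on \n
        # Strip the terminator from the value, but count it in the offset.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

# Hypothetical sample: three lines, the second one ending in \r\n.
sample = b"alice,42\nbob,17\r\ncarol,99\n"
records = list(text_records(sample))
for key, value in records:
    print(key, value)
```

Each key is the position of the first byte of its line, so the keys are unique within a file but carry no meaning beyond that.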

Uncompressed text files are splittable. They can be relatively bulky because they don’t support block compression, so queries may not be as efficient as with some of the other formats. For compression, use a file-level compression codec that supports splitting, such as bzip2.
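As a rough illustration of file-level compression, the following Python sketch compresses the same made-up CSV text with bzip2 and with gzip using the standard library. It only compares sizes on synthetic data; it does not use Hadoop's codecs and says nothing about splitting itself. In Hadoop, a bzip2-compressed file can be split, while a gzip-compressed file cannot.

```python
import bz2
import gzip

# Hypothetical CSV-like text, invented for this example.
lines = [f"{i},user{i % 50},click\n" for i in range(5000)]
raw = "".join(lines).encode("utf-8")

bz = bz2.compress(raw)    # whole-file bzip2 compression
gz = gzip.compress(raw)   # whole-file gzip compression

print("raw bytes:  ", len(raw))
print("bzip2 bytes:", len(bz))
print("gzip bytes: ", len(gz))

# Both codecs are lossless: decompressing returns the original text.
assert bz2.decompress(bz) == raw
assert gzip.decompress(gz) == raw
```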

Sequence file

A sequence file stores data in rows as binary key/value pairs. The binary format makes it smaller than a text file. Sequence files are splittable.

Three different sequence file formats are supported:

Uncompressed: neither keys nor values are compressed
Record compressed: only the value of each record is compressed
Block compressed: both keys and values are collected into blocks and compressed

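One advantage over the text format is that the sequence file format supports block compression, in which a batch of records (a block) is compressed as a unit. These blocks are not HDFS blocks; their size is configurable. Because the blocks are compressed independently and separated by sync markers, a block-compressed sequence file can still be split for MapReduce processing.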
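The difference between record compression and block compression can be sketched with a toy key/value layout in Python. This is not the real sequence file format: the length-prefixed layout, the function names, and the fixed number of records per block are all assumptions made for the example (Hadoop sizes blocks in bytes and writes sync markers between them).

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    # Two 4-byte big-endian lengths, then the key bytes and value bytes.
    return struct.pack(">II", len(key), len(value)) + key + value

def record_compressed(pairs):
    # Compress each value on its own; keys stay uncompressed.
    return b"".join(encode_record(k, zlib.compress(v)) for k, v in pairs)

def block_compressed(pairs, records_per_block=100):
    # Batch records, then compress the keys and values of each batch.
    out = []
    for start in range(0, len(pairs), records_per_block):
        chunk = pairs[start:start + records_per_block]
        batch = b"".join(encode_record(k, v) for k, v in chunk)
        packed = zlib.compress(batch)
        out.append(struct.pack(">I", len(packed)) + packed)
    return b"".join(out)

def read_blocks(data: bytes):
    # Decode the block-compressed layout back into (key, value) pairs.
    pairs, pos = [], 0
    while pos < len(data):
        (size,) = struct.unpack_from(">I", data, pos)
        batch = zlib.decompress(data[pos + 4:pos + 4 + size])
        pos += 4 + size
        i = 0
        while i < len(batch):
            klen, vlen = struct.unpack_from(">II", batch, i)
            i += 8
            pairs.append((batch[i:i + klen], batch[i + klen:i + klen + vlen]))
            i += klen + vlen
    return pairs

# Hypothetical data: many short, similar records.
pairs = [(f"key{i:05d}".encode(), f"user{i % 10},click,page{i % 5}".encode())
         for i in range(1000)]
uncompressed = b"".join(encode_record(k, v) for k, v in pairs)
rec = record_compressed(pairs)
blk = block_compressed(pairs)
print(len(uncompressed), len(rec), len(blk))
```

With short values like these, compressing each value separately gains little, because every compressed value carries its own overhead. Compressing a batch lets the codec exploit repetition across records. Since each block is decoded independently, a reader can start at any block boundary.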

Avro file

Avro is a compact, fast, row-based binary data format. Avro is suitable for storing complex data because it is a schema-based data serialization system. The schema is stored in the file alongside the data, so Avro data is self-describing.

A sequence file can also store complex data, but it is not self-describing: the structure of the data has to be tagged along with the data itself, which makes sequence files slower to serialize and deserialize than Avro for complex data structures. A schema that is kept separate from the records is easier to encode and decode.

Avro also supports block compression, so compressed Avro files can still be split.
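The idea of a self-describing file can be sketched in Python as a toy container that writes a JSON schema once, up front, followed by the records. The schema follows Avro's JSON record syntax, but everything else is an assumption for illustration: the function names, the header layout, and the fixed-width binary encoding are not Avro's actual file format, and only the "string" and "int" types are handled.

```python
import json
import struct

# Record schema in Avro's JSON syntax; the "User" record is hypothetical.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

def write_container(schema, rows) -> bytes:
    # Header: 4-byte length + JSON schema. The schema is written only once.
    header = json.dumps(schema).encode("utf-8")
    out = [struct.pack(">I", len(header)), header]
    for row in rows:
        for field in schema["fields"]:
            value = row[field["name"]]
            if field["type"] == "string":
                raw = value.encode("utf-8")
                out.append(struct.pack(">I", len(raw)) + raw)
            elif field["type"] == "int":
                out.append(struct.pack(">i", value))
            else:
                raise ValueError(f"unsupported type: {field['type']}")
    return b"".join(out)

def read_container(data: bytes):
    # The reader needs nothing but the bytes: the schema comes from the header.
    (hlen,) = struct.unpack_from(">I", data, 0)
    schema = json.loads(data[4:4 + hlen].decode("utf-8"))
    pos, rows = 4 + hlen, []
    while pos < len(data):
        row = {}
        for field in schema["fields"]:
            if field["type"] == "string":
                (n,) = struct.unpack_from(">I", data, pos)
                pos += 4
                row[field["name"]] = data[pos:pos + n].decode("utf-8")
                pos += n
            elif field["type"] == "int":
                row[field["name"]] = struct.unpack_from(">i", data, pos)[0]
                pos += 4
            else:
                raise ValueError(f"unsupported type: {field['type']}")
        rows.append(row)
    return schema, rows

rows_in = [{"name": "alice", "age": 42}, {"name": "bob", "age": 17}]
blob = write_container(SCHEMA, rows_in)
schema_out, rows_out = read_container(blob)
print(schema_out["name"], rows_out)
```

The records carry no field names or type tags, which keeps them compact. The reader recovers both from the schema in the header.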

Parquet file

Parquet is a binary columnar storage format. Instead of storing the values of a row next to each other, Parquet stores the values of each column together, which is especially useful when most queries read only a subset of the columns. Only the needed column data is fetched, which is more efficient than scanning entire rows to find the subset. The columnar layout also helps compression: the values within a column are similar to one another, and similar data compresses more efficiently and results in a smaller storage size.
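The difference between the two layouts can be sketched with plain Python data structures. This is a conceptual illustration only, not Parquet's on-disk format; the table, the column names, and the two helper functions are made up for the example.

```python
# Hypothetical table with three columns.
rows = [
    {"id": i, "country": ["US", "DE", "IN"][i % 3], "amount": i * 10}
    for i in range(6)
]

# Row layout: the fields of one record sit next to each other.
row_layout = [(r["id"], r["country"], r["amount"]) for r in rows]

# Columnar layout: all values of one column sit next to each other.
column_layout = {
    name: [r[name] for r in rows] for name in ("id", "country", "amount")
}

def total_amount_rows(layout):
    # Every field of every record is read just to reach "amount".
    total, touched = 0, 0
    for record in layout:
        touched += len(record)
        total += record[2]
    return total, touched

def total_amount_columns(layout):
    # Only the "amount" column is read; the other columns are never touched.
    column = layout["amount"]
    return sum(column), len(column)

print(total_amount_rows(row_layout))        # (150, 18)
print(total_amount_columns(column_layout))  # (150, 6)
```

Both functions return the same total, but the columnar version touches one third of the values, because the query needs one of the three columns.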

Parquet may not be suitable for MapReduce jobs that process the complete data set, because reading every column gives up the advantage of the columnar layout. Parquet is appropriate for Apache Hive queries over a subset of the data.

When to Use Which File Format

All of the commonly used Hadoop storage formats are binary except for text files. Use the text file format for simple storage, such as CSV and email messages. Use a sequence file for data processed by MapReduce. Use Avro for complex data structures. Use Parquet when queries read only a subset of the data, such as data warehousing in Hive.


Deepak is a Sun Certified Java Programmer and Web Component Developer, and has worked in the fields of XML, Java programming and Java EE for ten years. Deepak is the co-author of the Apress book Pro XML Development with Java Technology and was the technical reviewer for the O'Reilly book WebLogic: The Definitive Guide. Deepak was also the technical reviewer for the Course Technology PTR book Ruby Programming for the Absolute Beginner. Deepak is also the author of the Packt Publishing books JDBC 4.0 and Oracle JDeveloper for J2EE Development, Processing XML Documents with Oracle JDeveloper 11g, EJB 3.0 Database Persistence with Oracle Fusion Middleware 11g, and Java EE Development in Eclipse IDE. Deepak is a Docker Mentor and has published 5 books on Docker and Kubernetes.