Breaking Down Apache’s Hadoop Distributed File System
Apache Hadoop is a framework for big data. It has two main components: MapReduce, which processes large quantities of data in parallel, and HDFS, the Hadoop Distributed File System, which stores that data.
Let’s examine HDFS. You might expect that a storage framework holding such large quantities of data would require special, state-of-the-art infrastructure to keep the file system from failing, but quite the contrary is true.
Distributed file system
As the name says, HDFS is a distributed file system: it spans a multi-node cluster, and data is replicated for durability and high availability.
Durability means that data is not lost: even if one replica disappears because a machine fails, the other replicas are still available. High availability means that data remains accessible to users, who may be reading it from different nodes, machines, or regions, even when some of the replicated copies become unavailable.
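The durability guarantee can be sketched with a toy model (the names and helper below are illustrative, not Hadoop’s actual API): each block maps to the set of nodes holding a replica, and a read succeeds as long as any one of those nodes is alive.

```python
# Toy sketch of HDFS-style replication (illustrative; not Hadoop code).
# A block maps to the set of nodes holding a replica of it.

def readable(replica_nodes, live_nodes):
    """Return True if at least one replica of the block is on a live node."""
    return any(node in live_nodes for node in replica_nodes)

# Block "b1" is replicated on three nodes (the default replication factor).
locations = {"b1": {"node-a", "node-b", "node-c"}}

# Even after node-a fails, b1 can still be served from node-b or node-c.
assert readable(locations["b1"], live_nodes={"node-b", "node-c"})
```

Only when every node holding a replica is down does the block become unreadable, which is what makes a replication factor of three a reasonable durability default.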
HDFS uses off-the-shelf commodity hardware—machines that could be used for any other purpose—so it does not require any special infrastructure. HDFS is portable across heterogeneous hardware and software platforms.
HDFS is designed to be fault-tolerant, expecting that some of the machines used to store data will fail routinely. With thousands of machines in a cluster, HDFS is designed to detect failure of a machine and recover.
Because data is replicated three times by default, data on a failed machine does not become unavailable; it is simply served from another machine. Failover directs a user request to another node if one fails. Meanwhile, the lost blocks are re-replicated to bring the data back to the configured replication level.
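The re-replication step can be sketched as follows. This is a simplified toy (the helper and node names are made up, not Hadoop internals): when a node dies, any block that drops below the target replication factor gets a new replica placed on a surviving node.

```python
# Sketch of re-replication after a node failure (hypothetical helper,
# not Hadoop code). Blocks under-replicated after the failure are
# topped back up to the target replication factor.

def re_replicate(block_map, failed_node, live_nodes, target=3):
    """Drop the failed node's replicas, then restore each block to target."""
    for block, nodes in block_map.items():
        nodes.discard(failed_node)
        # Candidate homes: live nodes that don't already hold a replica.
        candidates = sorted(live_nodes - nodes)
        while len(nodes) < target and candidates:
            nodes.add(candidates.pop(0))
    return block_map

blocks = {"b1": {"n1", "n2", "n3"}, "b2": {"n2", "n3", "n4"}}
re_replicate(blocks, failed_node="n2", live_nodes={"n1", "n3", "n4", "n5"})

# Every block is back at three replicas, none of them on the dead node.
assert all(len(nodes) == 3 and "n2" not in nodes for nodes in blocks.values())
```

In real HDFS this bookkeeping is done by the NameNode, which notices missed DataNode heartbeats and schedules new replicas; the sketch only captures the invariant being restored.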
Abstract file system
HDFS is not a native file system running directly on a machine’s hardware, but rather an abstract file system layered over the machine’s underlying file system. Files stored in HDFS are split into blocks, the smallest unit of storage; an HDFS block is 128 MB by default.
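The block arithmetic is straightforward; a short sketch (the 128 MB constant matches the HDFS default, the helper itself is illustrative) shows how a file maps onto blocks:

```python
# How a file is split into 128 MB blocks (illustrative arithmetic;
# the constant matches HDFS's default block size).
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of this size occupies."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 1 GB file occupies exactly 8 full 128 MB blocks.
assert num_blocks(1024 * 1024 * 1024) == 8
# A 300 MB file occupies 3 blocks; the last block holds only 44 MB.
assert num_blocks(300 * 1024 * 1024) == 3
```

Note that a partially filled final block does not consume a full 128 MB on disk; the large block size mainly keeps per-block metadata and seek overhead low for big files.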
Locality of data
Locality of data is another benefit of replicating data in a distributed cluster. If replicas are spread across multiple regions, a user is more likely to be served from a local machine or one in relatively close proximity. Not having to fetch data from a distant machine reduces network load.
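Replica selection by proximity can be sketched like this (the distance values and helper are hypothetical, not Hadoop’s actual rack-awareness logic): given several replicas, prefer the one closest to the reader.

```python
# Toy replica selection by network distance (illustrative, not
# Hadoop's rack-awareness implementation).

def closest_replica(replica_nodes, distance):
    """Pick the replica node with the smallest distance from the reader."""
    return min(replica_nodes, key=lambda node: distance[node])

# Hypothetical distances: 0 = same node, 1 = same rack, 2 = remote region.
distance = {"local-node": 0, "same-rack": 1, "remote-region": 2}

assert closest_replica({"remote-region", "same-rack", "local-node"},
                       distance) == "local-node"
```

Real HDFS uses a rack-aware network topology for the same purpose: a reader is steered to the replica on its own node if one exists, then its own rack, and only then across the network.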
Data and compute code
Because of the large quantities of data involved with HDFS, data is not moved to the compute code; instead, compute code is moved to the location of the data, reducing network overhead in terms of load and latency.
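The idea of moving compute to data can be sketched as a scheduling decision (a simplified toy, not how YARN actually schedules): prefer running a task on a node that already holds the block, so the block’s 128 MB never cross the network.

```python
# Sketch of "move compute to data" scheduling (simplified, illustrative).

def schedule_task(block, block_locations, free_nodes):
    """Prefer a free node holding the block; otherwise pick any free node."""
    local = block_locations[block] & free_nodes
    if local:
        return sorted(local)[0], "data-local"
    return sorted(free_nodes)[0], "remote-read"

locations = {"b1": {"n1", "n2"}}

# n2 holds a replica of b1 and is free, so the task runs there.
assert schedule_task("b1", locations, {"n2", "n3"}) == ("n2", "data-local")
# If no replica holder is free, the task must read the block remotely.
assert schedule_task("b1", locations, {"n3"}) == ("n3", "remote-read")
```

Shipping a few kilobytes of code to the data is far cheaper than shipping gigabytes of data to the code, which is why data-local tasks are the preferred case.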
HDFS is not designed to be your regular, interactive file system for general-purpose applications. It’s meant for batch processing, high throughput, and streaming access to large data sets. HDFS is a write-once-read-many file system. File systems that serve general-purpose, interactive applications generally have low read latency, but with HDFS, some latency in accessing data is to be expected.