Exploring Big Data Options in the Apache Hadoop Ecosystem

By Deepak Vohra - November 4, 2019

With the emergence of the World Wide Web came the need to manage large, web-scale quantities of data, which got termed “big data.”

The most notable tool to manage big data has been Apache Hadoop. Let’s explore some of the open source Apache projects in the Hadoop ecosystem.

Hadoop

Hadoop is a highly scalable big data framework for distributed processing of large data sets across a cluster of machines. Hadoop is based on HDFS (Hadoop Distributed File System), which gives high throughput access to application data during data processing and for storage.

Hadoop uses YARN for job scheduling and cluster resource management. For parallel processing of large data sets, Hadoop uses the YARN-based system MapReduce.

Hadoop runs on commodity hardware and provides fault tolerance and high availability.

HBase

HBase is a distributed, scalable big data store that runs on Hadoop and HDFS. HBase is a column-based NoSQL database that stores versioned big data in tables and columns. Each column has a name, value, and timestamp that distinguishes the latest data from an older version.

HBase features include scalability, strictly consistent reads and writes, and automatic failover.

Hive

Hive is a distributed data warehouse built on top of Hadoop to provide SQL querying to large data sets. Hive stores data in either HDFS or some other big data storage, such as HBase. Hive is not designed for online transaction processing (OLTP) workloads, but instead for extract-transform-load (ETL) workloads for data analysis and reporting.

Hive provides structure to stored data for SQL querying and supports a wide variety of data formats.

Sqoop

Sqoop is a framework for bulk transferring of data between Hadoop and structured data stores, including relational databases such as MySQL and PostgreSQL, using separate tools for import and export of data.

To import bulk data into HDFS, the import tool supports storing data as text files or binary format SequenceFiles and Avro data files. Sqoop import is designed for HDFS but could be used with other tables, while the export tool supports only HDFS as the source.

Flume

Flume collects, aggregates, and moves large quantities of log data to a centralized data store. It’s a distributed framework, like all Hadoop frameworks, and is based on a streaming flows architecture consisting of sources, channels, and sinks managed by a Flume agent.

Flume supports data sources other than log data and could be used to stream from just about any data source, such as event data, network traffic data, social media generated data, and email data, and also supports different types of sinks and channel types.

Kafka

Kafka is a distributed streaming framework to build real-time streaming data pipelines to transfer data between systems or applications. It’s a multi-purpose tool that could be used for messaging, stream processing and storage.

The streaming platform has three main functionalities:

Publish and subscribe to a stream of records similar to a messaging system
Process streams of records
Store streams of records

A stream of records is called a “topic,” with Kafka “producers” publishing to topics and Kafka “consumers” subscribing to topics.

Tags

big data

0 comments

Deepak is a Sun Certified Java Programmer and Web Component Developer, and has worked in the fields of XML, Java programming and Java EE for ten years. Deepak is the co-author of the Apress book Pro XML Development with Java Technology and was the technical reviewer for the O'Reilly book WebLogic: The Definitive Guide. Deepak was also the technical reviewer for the Course Technology PTR book Ruby Programming for the Absolute Beginner. Deepak is also the author of the Packt Publishing books JDBC 4.0 and Oracle JDeveloper for J2EE Development, Processing XML Documents with Oracle JDeveloper 11g, EJB 3.0 Database Persistence with Oracle Fusion Middleware 11g, and Java EE Development in Eclipse IDE. Deepak is a Docker Mentor and has published 5 books on Docker and Kubernetes.