Exploring Big Data Options in the Apache Hadoop Ecosystem
With the emergence of the World Wide Web came the need to manage large, web-scale quantities of data, which got termed “big data.”
The most notable tool to manage big data has been Apache Hadoop. Let’s explore some of the open source Apache projects in the Hadoop ecosystem.
Hadoop is a highly scalable big data framework for distributed processing of large data sets across a cluster of machines. Hadoop is based on HDFS (Hadoop Distributed File System), which gives high throughput access to application data during data processing and for storage.
Hadoop uses YARN for job scheduling and cluster resource management. For parallel processing of large data sets, Hadoop uses the YARN-based system MapReduce.
Hadoop runs on commodity hardware and provides fault tolerance and high availability.
HBase is a distributed, scalable big data store that runs on Hadoop and HDFS. HBase is a column-based NoSQL database that stores versioned big data in tables and columns. Each column has a name, value, and timestamp that distinguishes the latest data from an older version.
HBase features include scalability, strictly consistent reads and writes, and automatic failover.
Hive is a distributed data warehouse built on top of Hadoop to provide SQL querying to large data sets. Hive stores data in either HDFS or some other big data storage, such as HBase. Hive is not designed for online transaction processing (OLTP) workloads, but instead for extract-transform-load (ETL) workloads for data analysis and reporting.
Hive provides structure to stored data for SQL querying and supports a wide variety of data formats.
Sqoop is a framework for bulk transferring of data between Hadoop and structured data stores, including relational databases such as MySQL and PostgreSQL, using separate tools for import and export of data.
To import bulk data into HDFS, the import tool supports storing data as text files or binary format SequenceFiles and Avro data files. Sqoop import is designed for HDFS but could be used with other tables, while the export tool supports only HDFS as the source.
Flume collects, aggregates, and moves large quantities of log data to a centralized data store. It’s a distributed framework, like all Hadoop frameworks, and is based on a streaming flows architecture consisting of sources, channels, and sinks managed by a Flume agent.
Flume supports data sources other than log data and could be used to stream from just about any data source, such as event data, network traffic data, social media generated data, and email data, and also supports different types of sinks and channel types.
Kafka is a distributed streaming framework to build real-time streaming data pipelines to transfer data between systems or applications. It’s a multi-purpose tool that could be used for messaging, stream processing and storage.
The streaming platform has three main functionalities:
- Publish and subscribe to a stream of records similar to a messaging system
- Process streams of records
- Store streams of records
A stream of records is called a “topic,” with Kafka “producers” publishing to topics and Kafka “consumers” subscribing to topics.