Choosing a Data Processing Framework
With so many open source data processing frameworks available, developers may find it hard to decide which is the most suitable. More often than not, multiple frameworks are used in the same application. In this article, we will explore some of these data processing frameworks. Most data processing frameworks run in a distributed environment, that is, a cluster of machines, and typically support Kubernetes, the cluster resource manager for Docker containers.
Apache Beam is for developing data pipelines that read data from one or more data sources, apply business logic for batch or streaming data processing, and output the resulting data to data sinks. Beam supports input/output to several data sources/sinks including File, XML, Avro, Parquet, Apache Kafka, Apache Solr, Apache Hadoop, HDFS, Apache HBase, Apache Cassandra, MongoDB, Apache Tika, Elasticsearch, Redis, and PostgreSQL. A data processing system (also called a “Runner”) is needed to run a Beam pipeline; supported runners include Apache Spark and Apache Flink.
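The source, transform, and sink stages described above can be sketched in plain Python. This is a conceptual illustration only and does not use the Beam SDK; the names run_pipeline, split_words, and count_words are hypothetical, and the in-memory lists stand in for real data sources and sinks.

```python
from collections import Counter
from typing import Callable, Iterable, List


def split_words(lines: Iterable[str]) -> Iterable[str]:
    """Transform: break each input line into lowercase words."""
    for line in lines:
        yield from line.lower().split()


def count_words(words: Iterable[str]) -> Iterable[tuple]:
    """Transform: count occurrences of each word, sorted by word."""
    return sorted(Counter(words).items())


def run_pipeline(source: Iterable, transforms: List[Callable], sink: list) -> list:
    """Hypothetical runner: feed the source through each transform, then write to the sink."""
    data = source
    for transform in transforms:
        data = transform(data)
    sink.extend(data)
    return sink


# An in-memory list stands in for a data source; another list is the sink.
source = ["beam reads data", "beam writes data"]
sink: list = []
run_pipeline(source, [split_words, count_words], sink)
print(sink)
```

In a real Beam pipeline the same three roles exist, but the runner (for example Spark or Flink) decides how and where each stage executes.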
Apache Kafka is an event streaming platform to read, write, store, and process events, which are simply records or messages. Events are stored in a topic. A producer generates events and publishes, or writes, them to a topic. A consumer that subscribes to the same topic is able to read those events. A topic may have multiple producers and consumers associated with it. Kafka is most commonly used for real-time data streaming pipelines.
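The topic, producer, and consumer relationship can be illustrated with a small in-memory model. This is a simplified sketch, not the Kafka client API: the Topic class and its publish, subscribe, and poll methods are hypothetical names, and the model assumes a single append-only log with one read position per consumer, leaving out partitions, consumer groups, and persistence.

```python
from typing import Dict, List


class Topic:
    """Toy append-only event log with an independent read offset per consumer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._log: List[str] = []            # stored events, in arrival order
        self._offsets: Dict[str, int] = {}   # next position to read, per consumer

    def publish(self, event: str) -> int:
        """Producer side: append an event and return its position in the log."""
        self._log.append(event)
        return len(self._log) - 1

    def subscribe(self, consumer_id: str) -> None:
        """Register a consumer that starts reading from the beginning of the log."""
        self._offsets.setdefault(consumer_id, 0)

    def poll(self, consumer_id: str) -> List[str]:
        """Consumer side: return unread events and advance that consumer's offset."""
        start = self._offsets[consumer_id]
        events = self._log[start:]
        self._offsets[consumer_id] = len(self._log)
        return events


topic = Topic("orders")
topic.subscribe("billing")
topic.subscribe("shipping")

topic.publish("order-1")
topic.publish("order-2")
billing_first = topic.poll("billing")    # billing reads the first two events

topic.publish("order-3")
billing_second = topic.poll("billing")   # billing only sees the new event
shipping_all = topic.poll("shipping")    # shipping reads everything at its own pace
```

Because events stay in the log after being read, each consumer keeps its own position and can read the same events independently of the others.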
Apache Flink is a data processing engine for stateful computations over data streams. Stateful means that some functions, or operations, retain state across multiple events. The data streams may be bounded, i.e. they have a start and an end, or they may be unbounded, i.e. they have a start but no definite end. Unbounded data streams are typically generated continuously in real time. Apache Flink is highly scalable and provides in-memory performance. Flink may be run as a standalone cluster or on a cluster manager such as Hadoop YARN.
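To make the ideas of state and bounded versus unbounded streams concrete, here is a minimal plain-Python sketch. It does not use the Flink API; running_totals and unbounded_stream are hypothetical names, and a dictionary stands in for the per-key state that a stream processor would manage.

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_totals(events: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Stateful operation: keep a running sum per key across all events seen so far."""
    state: Dict[str, int] = defaultdict(int)   # state that survives between events
    for key, amount in events:
        state[key] += amount
        yield key, state[key]


def unbounded_stream() -> Iterator[Tuple[str, int]]:
    """A stream with a start but no definite end."""
    for i in itertools.count(1):
        yield ("sensor", i)


# Bounded stream: a finite list, so the computation finishes on its own.
bounded = [("a", 1), ("b", 5), ("a", 2)]
bounded_results = list(running_totals(bounded))

# Unbounded stream: results are produced incrementally; only the first four are taken here.
first_four = list(itertools.islice(running_totals(unbounded_stream()), 4))
```

The same function handles both cases because it emits one result per event rather than waiting for the input to end.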
Apache Spark is an engine for scalable data analytics, data science, and machine learning applications. Spark integrates with several frameworks and supports both batch and streaming data. Spark is built on an advanced SQL engine for fast, large-scale processing of structured and unstructured data.
Apache Hop is a data integration and orchestration platform for data workflows and pipelines. Hop is completely metadata-driven, and its data pipelines may be run on the Hop-native engine or on another engine such as Apache Flink or Apache Spark. Hop also provides a GUI for designing workflows and pipelines visually.
Apache Samza is for developing stateful applications that process streaming data in real time. The applications may be built using one of the supported APIs, which include Apache Beam and Streams DSL. Samza is a high-throughput, low-latency, and highly scalable framework with deployment support for Hadoop YARN. Samza supports multiple input/output data sources/sinks including Apache Kafka and Apache Hadoop’s HDFS.
Apache Solr is a search engine built on Apache Lucene, which provides search, indexing, spell checking, and analysis/tokenization. Apache Solr provides full-text search capabilities through a REST-like API, in which documents may be put/indexed and queried over HTTP in JSON, XML, CSV, or binary format. Solr is a high-volume, highly scalable, fault-tolerant search engine with near real-time indexing.
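The tokenization, indexing, and querying steps mentioned above can be illustrated with a toy inverted index. This is a conceptual sketch in plain Python, not how Solr or Lucene is implemented or called; the tokenize function and the InvertedIndex class are hypothetical, and the sketch assumes simple lowercase tokens with AND semantics for multi-word queries.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Analysis step: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy index mapping each token to the set of document ids containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        """Index a document under every token it contains."""
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> Set[int]:
        """Return ids of documents that contain every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = InvertedIndex()
index.add(1, "Solr is built on Lucene")
index.add(2, "Lucene provides indexing")
index.add(3, "Solr provides full-text search")

lucene_docs = index.search("lucene")
solr_provides_docs = index.search("Solr provides")
missing_docs = index.search("kafka")
```

Looking up a token is a dictionary access rather than a scan of every document, which is the basic reason an index makes full-text search fast.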
These are only some of the data processing frameworks. Other commonly used frameworks include Apache Tika, a content analysis toolkit; Apache Zeppelin, for interactive, data-driven analytics; Apache Flume, for collecting, aggregating, and moving large quantities of log data; and Apache Sqoop, for bulk transfer of data between Apache Hadoop and structured data sources such as relational databases.