Data Processing Performance of Apache Spark on Beowulf Clusters. An Overview

Abstract:

Despite the advent of cloud computing and the democratisation of data processing through parallel and distributed computing, Big Data systems may incur considerable costs that make them inaccessible to small and medium-sized companies, as well as to financially stretched organisations (as is the case with many universities). This paper presents preliminary results for data processing tasks (queries) run on the Apache Spark framework deployed on a commodity Beowulf cluster. The association between query duration (the outcome) and several predictors, namely the number of cluster nodes, the cluster manager, the available cluster RAM, and the database size, was examined.