Abstract:
Big Data is an umbrella term covering a broad range of technologies for data ingestion, persistence, processing, and analysis. Its increasing adoption by companies and institutions has been supported by open-source projects such as Apache Spark and by the Big Data services offered by most Cloud providers. Apache Spark is currently the gravitational center of Big Data processing. Still, data processing performance and tuning remain open topics. This paper examines Big Data formats in terms of data processing performance when running SQL queries in SparkSQL, and investigates which memory optimizations further enhance that performance. We compared the most popular open-source data formats, namely Apache Parquet, Apache ORC, and Apache Avro, both in isolation and in combination with the Spark RDD memory settings, to determine whether their distinct features boost performance and under what circumstances. Initial results show that Apache Avro performed best, making it a suitable solution for Big Data processing thanks to its strong compression and other features. Also, the RDD memory setting must be taken into account, since performance increases when the MEMORY_ONLY setting is used.
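For illustration only, the sketch below shows the general shape of such an experiment: loading the same dataset in Parquet, ORC, and Avro with SparkSQL, caching it with the MEMORY_ONLY storage level, and timing a SQL query over the cached view. This is not the authors' benchmark code; the file paths, table name, and query are hypothetical, and reading Avro assumes the external spark-avro package is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object FormatBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("format-benchmark-sketch")
      .getOrCreate()

    // The same dataset stored in each format (paths are hypothetical).
    val parquetDf = spark.read.parquet("hdfs:///data/lineitem.parquet")
    val orcDf     = spark.read.orc("hdfs:///data/lineitem.orc")
    val avroDf    = spark.read.format("avro").load("hdfs:///data/lineitem.avro") // needs spark-avro

    // Cache with the MEMORY_ONLY storage level before running SQL queries,
    // so the measured time reflects in-memory processing rather than disk I/O.
    parquetDf.persist(StorageLevel.MEMORY_ONLY)
    parquetDf.createOrReplaceTempView("lineitem")

    // Example SparkSQL query timed over the cached data (query is illustrative).
    val start = System.nanoTime()
    spark.sql("SELECT l_returnflag, COUNT(*) FROM lineitem GROUP BY l_returnflag").collect()
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"Query time over Parquet (MEMORY_ONLY): $elapsedMs ms")

    spark.stop()
  }
}
```

The same persist/query/time steps would be repeated for the ORC and Avro DataFrames, and with other storage levels, to compare formats under identical conditions.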