Distributed data processing for ELT workloads. The case of open source systems

Abstract:

ETL (Extract, Transform, Load) has been an established concept for many years, as stated by Vassiliadis (2009). The paradigm of extracting, transforming, and loading data into a data warehouse has found numerous use cases and implementations, especially for analytical purposes. ETL systems have been the foundation of reporting tools, since data is loaded only after the clean-up step is complete. Recently, a related concept has regained traction: ELT (Extract, Load, Transform). It is closely associated with big data processing, where data is loaded in raw format and the transformations are then performed on the unprocessed data. Fotache et al. (2018) note that this has become possible through advances in hardware and the development of distributed databases and storage on low-cost commodity workstations. This paper analyses a typical ELT workload, built from generic steps (primitives), and tests its performance on various database and processing engines. Relational (Oracle, SQL Server, Postgres), NoSQL (Apache Spark), and distributed (Exasol) systems, both commercial and open source, are considered for read/write workloads in a process graph, where each node represents a generic data operation that consumes an input and produces an output. The purpose is to verify the claim that NoSQL and distributed database systems can be a viable choice, compared with established commercial systems, in terms of performance and cost. The paper also proposes an approach in which ELT functionalities are implemented on a cluster of computers using Spark and HDFS.
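
As a minimal illustration of the Spark-and-HDFS approach named above, the sketch below follows the ELT pattern: raw data is loaded into HDFS first, and the clean-up transformation runs afterwards on the already-loaded, unprocessed data. The file paths, column names, and schema are assumptions made for this example, not details taken from the paper's workload.

    # Minimal ELT sketch with PySpark and HDFS; paths and columns are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

    # Extract + Load: ingest the raw CSV into HDFS with no clean-up step.
    raw = spark.read.option("header", "true").csv("hdfs:///raw/sales.csv")
    raw.write.mode("overwrite").parquet("hdfs:///staging/sales")

    # Transform: clean and aggregate the already-loaded raw data with Spark.
    staged = spark.read.parquet("hdfs:///staging/sales")
    cleaned = (staged
               .dropna(subset=["amount"])
               .withColumn("amount", F.col("amount").cast("double")))
    summary = cleaned.groupBy("region").agg(F.sum("amount").alias("total_amount"))
    summary.write.mode("overwrite").parquet("hdfs:///warehouse/sales_by_region")

    spark.stop()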
