Evaluating cluster configurations for big data processing: an exploratory study

R Sandel, M Shtern, M Fokaefs… - 2015 IEEE 9th …, 2015 - ieeexplore.ieee.org
2015 IEEE 9th International Symposium on the Maintenance and …, 2015
As data continues to grow rapidly, NoSQL clusters have been increasingly adopted to address the storage and processing demands of these large amounts of data. In parallel, cloud computing is also increasingly being adopted due to its flexibility, cost efficiency and scalability. However, evaluating and modelling NoSQL clusters presents many challenges. In this work, we explore these challenges by performing a series of experiments with various configurations. Our intuition is that this process is laborious and expensive; the goal of our experiments is to confirm this intuition and to identify the factors that impact the performance of a Big Data cluster. Our experiments focus mostly on three factors: data compression, data schema and cluster topology. We performed a number of experiments based on these factors, and measured and compared the response times of the resulting configurations. Finally, the outcomes of our study are encapsulated in a performance model that predicts the cluster's response time as a function of the incoming workload, so that the cluster's performance can be evaluated faster and at lower cost. This systematic and effortless evaluation method will facilitate the selection of, and migration to, a better cluster as performance and budget goals change. We use HBase as the large data processing cluster, and we conduct our experiments on traffic data from a large city and on a distributed community cloud infrastructure.