Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2211.02286 (cs)

[Submitted on 4 Nov 2022]

Title: Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Authors: Ubaid Ullah Hafeez, Martin Maas, Mustafa Uysal, Richard McDougall

Abstract: Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from log analysis to data preparation for DNN training. It is thus unsurprising that a large body of work has focused on optimizing these frameworks, including their storage management. The shift to cloud computing requires optimizing across all pipelines that run concurrently on a cluster. In this paper, we look at one specific instance of this problem: placement of I/O-intensive temporary intermediate data on SSD and HDD. Efficient data placement is challenging because I/O density is usually unknown at the time data needs to be placed. Additionally, external factors such as load variability, job preemption, or job priorities can impact job completion times, which ultimately affect the I/O density of the temporary files in the workload. We envision that machine learning can be used to solve this problem. We analyze production logs from Google's data centers for a range of data processing pipelines. Our analysis shows that I/O density may be predictable. This suggests that learning-based strategies, if crafted carefully, could extract predictive features for the I/O density of temporary files involved in various transformations, and that these features could be used to improve the efficiency of storage management in data processing pipelines.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2211.02286 [cs.DC]
	(or arXiv:2211.02286v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2211.02286

Submission history

From: Martin Maas
[v1] Fri, 4 Nov 2022 06:57:04 UTC (871 KB)
