Abstract
Frequent disk failures undermine the reliability of storage systems: they can cause performance jitter or even data loss, and thus seriously threaten quality of service. Although a host of machine learning and deep learning-based disk failure prediction approaches have been proposed to prevent system breakdowns caused by unexpected disk failures, they achieve high performance only under the assumption that the disk model has plenty of samples (especially failure samples). However, new disk models continuously appear in data centers as disk manufacturing technology evolves and storage system capacity expands. Because they have been deployed only for a short time, these disk models have few failure samples and are called minority disks. Minority disks are widespread in large-scale data centers and account for a large number of disks, yet existing approaches cannot reach satisfactory performance on them due to the lack of failure samples. Worse still, failure prediction models trained on other disk models cannot be directly applied to minority disks either, because of the distribution shift that commonly exists among disk models. In this work, we propose DiskDA, a novel multi-source domain adaption-based solution that can fully utilize knowledge from other disk models to predict failures for minority disks having no failure samples. Our experimental results on real-world datasets show the superiority of DiskDA over previous approaches on minority disks with a few failure samples. Moreover, DiskDA also adapts well to minority disks having no failure samples, where previous works are unusable.
References
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223. PMLR (2017)
Botezatu, M.M., Giurgiu, I., Bogojeska, J., Wiesmann, D.: Predicting disk replacement towards reliable data centers. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39–48 (2016)
Chakraborttii, C., Litz, H.: Improving the accuracy, adaptability, and interpretability of SSD failure prediction models. In: Proceedings of the 11th ACM Symposium on Cloud Computing, pp. 120–133 (2020)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 539–546. IEEE (2005)
Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003)
Jiang, T., Zeng, J., Zhou, K., Huang, P., Yang, T.: Lifelong disk failure prediction via GAN-based anomaly detection. In: 2019 IEEE 37th International Conference on Computer Design (ICCD), pp. 199–207. IEEE (2019)
Jiang, W., Hu, C., Zhou, Y., Kanevsky, A.: Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. ACM Trans. Storage (TOS) 4(3), 1–25 (2008)
Johnson, R., Zhang, T.: Learning nonlinear functions using regularized greedy forest. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 942–954 (2013)
Lan, X., et al.: Adversarial domain adaptation with correlation-based association networks for longitudinal disk fault prediction. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2021)
Li, J., et al.: Hard drive failure prediction using classification and regression trees. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 383–394. IEEE (2014)
Lu, S., Luo, B., Patel, T., Yao, Y., Tiwari, D., Shi, W.: Making disk failure predictions smarter! In: FAST, pp. 151–167 (2020)
Mikolov, T., Kombrink, S., Burget, L., Černocký, J., Khudanpur, S.: Extensions of recurrent neural network language model. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. IEEE (2011)
Schroeder, B., Gibson, G.A.: Understanding disk failure rates: what does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage (TOS) 3(3), 8-es (2007)
Shen, J., Qu, Y., Zhang, W., Yu, Y.: Wasserstein distance guided representation learning for domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Sun, X., et al.: System-level hardware failure prediction using deep learning. In: Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6 (2019)
Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 193–204 (2010)
Wang, Y., Miao, Q., Ma, E.W., Tsui, K.L., Pecht, M.G.: Online anomaly detection for hard disk drives based on Mahalanobis distance. IEEE Trans. Reliab. 62(1), 136–145 (2013)
Wilson, G., Cook, D.J.: A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. 11(5), 1–46 (2020)
Xie, Y., Feng, D., Wang, F., Zhang, X., Han, J., Tang, X.: OME: an optimized modeling engine for disk failure prediction in heterogeneous datacenter. In: 2018 IEEE 36th International Conference on Computer Design (ICCD), pp. 561–564. IEEE (2018)
Xu, C., Wang, G., Liu, X., Guo, D., Liu, T.Y.: Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 65(11), 3502–3508 (2016)
Xu, F., Han, S., Lee, P.P., Liu, Y., He, C., Liu, J.: General feature selection for failure prediction in large-scale SSD deployment. In: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 263–270. IEEE (2021)
Yang, W., Hu, D., Liu, Y., Wang, S., Jiang, T.: Hard drive failure prediction using big data. In: 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW), pp. 13–18. IEEE (2015)
Zhang, J., Huang, P., Zhou, K., Xie, M., Schelter, S.: HDDSE: enabling high-dimensional disk state embedding for generic failure detection system of heterogeneous disks in large data centers. In: Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, pp. 111–126 (2020)
Zhang, J., et al.: Minority disk failure prediction based on transfer learning in large data centers of heterogeneous disk systems. IEEE Trans. Parallel Distrib. Syst. 31(9), 2155–2169 (2020)
Zhou, H., et al.: A proactive failure tolerant mechanism for SSDs storage systems based on unsupervised learning. In: 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQoS), pp. 1–10. IEEE (2021)
7 Appendix
Proof. DiskDA measures the discrepancy between the source and target domains using the Wasserstein distance. Specifically, the \(p\)-th Wasserstein distance between two Borel probability measures \(\mathbb {P}\) and \(\mathbb {Q}\) is defined as:
\[ W_{p}(\mathbb {P},\mathbb {Q}) = \left( \inf _{\mu \in \Gamma (\mathbb {P},\mathbb {Q})} \int \rho (x,y)^{p}\, d\mu (x,y) \right) ^{1/p} \]
where \(\Gamma (\mathbb {P},\mathbb {Q})\) is the set of all joint distributions \(\mu (x,y)\) whose marginals are \(\mathbb {P}\) and \(\mathbb {Q}\). Here \(\mu (x,y)\) can be viewed as a policy for transporting a unit quantity of material from \(x\) to \(y\), and \(\rho (x,y)\) is the corresponding cost, so the Wasserstein distance between \(\mathbb {P}\) and \(\mathbb {Q}\) represents the minimum expected transport cost. As the Wasserstein distance satisfies the triangle inequality, the following inequality holds:
\[ W_{1}(\mathbb {P}_{s},\mathbb {P}_{t}) \le W_{1}(\mathbb {P}_{s},\mathbb {P}_{t_{H}}) + W_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t}) \tag{10} \]
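Shen et al. [14] prove the following generalization error bound for a classification function \(h\) in the target domain, for unsupervised domain adaption based on the Wasserstein distance:
\[ \epsilon _{t}(h) \le \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t}) + \lambda \tag{11} \]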
where \(K\) is the Lipschitz constant (all hypotheses \(h\) are assumed to be \(K\)-Lipschitz continuous), \(\lambda \) is the combined error of the optimal hypothesis \(h^{*}\), which minimizes the combined error \(\epsilon _{s}(h)+\epsilon _{t}(h)\), and \(\mathbb {P}_s\) and \(\mathbb {P}_t\) are the distributions of the source and target domains, respectively. Let \(C\) denote \(2KW_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t})\). Substituting inequality (10) into (11) bounds the term \(2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t})\) by \(2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t_{H}})+C\), from which Theorem 3.1 is derived.
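The triangle-inequality step used in the proof can be checked numerically. The following is a minimal sketch rather than part of DiskDA: it assumes one-dimensional samples of equal size, for which the empirical 1-Wasserstein distance reduces to the mean absolute difference between sorted samples. The function `wasserstein_1`, the three synthetic Gaussian samples standing in for \(\mathbb {P}_s\), \(\mathbb {P}_{t_{H}}\) and \(\mathbb {P}_t\), and the choice \(K=1\) are all hypothetical and chosen only for illustration.

```python
import random


def wasserstein_1(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    With uniform weights on n points each, an optimal transport plan pairs
    the i-th smallest x with the i-th smallest y, so the distance is the
    mean absolute difference between the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical stand-ins for the three distributions in the proof.
rng = random.Random(0)
n = 1000
p_s = [rng.gauss(0.0, 1.0) for _ in range(n)]   # plays the role of P_s
p_th = [rng.gauss(0.5, 1.0) for _ in range(n)]  # plays the role of P_{t_H}
p_t = [rng.gauss(0.8, 1.2) for _ in range(n)]   # plays the role of P_t

d_st = wasserstein_1(p_s, p_t)
d_sh = wasserstein_1(p_s, p_th)
d_ht = wasserstein_1(p_th, p_t)

# Triangle inequality: W1(P_s, P_t) <= W1(P_s, P_tH) + W1(P_tH, P_t).
assert d_st <= d_sh + d_ht + 1e-9

# With an assumed Lipschitz constant K, C = 2 * K * W1(P_tH, P_t) absorbs
# the part of the discrepancy term that does not involve the source domain.
K = 1.0
C = 2 * K * d_ht
print("2K*W1(P_s, P_t)      =", 2 * K * d_st)
print("2K*W1(P_s, P_tH) + C =", 2 * K * d_sh + C)
```

The second printed value is never smaller than the first, which is exactly the relaxation the proof applies to the discrepancy term of the bound. For samples of unequal size or in higher dimensions the sorted-pairing shortcut no longer applies, and a general optimal transport solver would be needed.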
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, W. et al. (2024). A Multi-source Domain Adaption Approach to Minority Disk Failure Prediction. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14488. Springer, Singapore. https://doi.org/10.1007/978-981-97-0801-7_4
DOI: https://doi.org/10.1007/978-981-97-0801-7_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0800-0
Online ISBN: 978-981-97-0801-7
eBook Packages: Computer Science (R0)