A Multi-source Domain Adaption Approach to Minority Disk Failure Prediction

Wang Wang, Xuehai Tang, Biyu Zhou, Yangchen Dong, Yuanhang Feng, Jizhong Han & Songlin Hu

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14488)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing


Abstract

Frequent disk failures undermine the reliability of storage systems: they can cause performance jitter or even data loss, and thus seriously threaten quality of service. Many machine learning and deep learning-based approaches to disk failure prediction have been proposed to prevent system breakdowns caused by unexpected disk failures, but they achieve high performance only under the assumption that the disk model has plenty of samples, especially failure samples. However, new disk models continuously appear in data centers as disk manufacturing technology evolves and storage system capacity expands. Because they have been deployed only for a short time, these disk models have few failure samples and are called minority disks. Minority disks are widespread in large-scale data centers and account for a large number of disks, yet existing approaches cannot reach satisfactory performance on them because failure samples are lacking. Worse, failure prediction models trained on other disk models cannot be applied directly to minority disks either, because of the distribution shift that commonly exists among disk models. In this work, we propose DiskDA, a novel multi-source domain adaption-based solution that can fully utilize knowledge from other disk models to predict failures for minority disks having no failure samples. Our experimental results on real-world datasets show the superiority of DiskDA over previous approaches on minority disks with a few failure samples. Moreover, DiskDA adapts well to minority disks having no failure samples, on which previous works are unusable.


Notes

1. https://www.backblaze.com/b2/hard-drive-test-data.html

References

1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223. PMLR (2017)
2. Botezatu, M.M., Giurgiu, I., Bogojeska, J., Wiesmann, D.: Predicting disk replacement towards reliable data centers. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39–48 (2016)
3. Chakraborttii, C., Litz, H.: Improving the accuracy, adaptability, and interpretability of SSD failure prediction models. In: Proceedings of the 11th ACM Symposium on Cloud Computing, pp. 120–133 (2020)
4. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 539–546. IEEE (2005)
5. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003)
6. Jiang, T., Zeng, J., Zhou, K., Huang, P., Yang, T.: Lifelong disk failure prediction via GAN-based anomaly detection. In: 2019 IEEE 37th International Conference on Computer Design (ICCD), pp. 199–207. IEEE (2019)
7. Jiang, W., Hu, C., Zhou, Y., Kanevsky, A.: Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. ACM Trans. Storage (TOS) 4(3), 1–25 (2008)
8. Johnson, R., Zhang, T.: Learning nonlinear functions using regularized greedy forest. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 942–954 (2013)
9. Lan, X., et al.: Adversarial domain adaptation with correlation-based association networks for longitudinal disk fault prediction. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2021)
10. Li, J., et al.: Hard drive failure prediction using classification and regression trees. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 383–394. IEEE (2014)
11. Lu, S., Luo, B., Patel, T., Yao, Y., Tiwari, D., Shi, W.: Making disk failure predictions smarter! In: FAST, pp. 151–167 (2020)
12. Mikolov, T., Kombrink, S., Burget, L., Černocký, J., Khudanpur, S.: Extensions of recurrent neural network language model. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. IEEE (2011)
13. Schroeder, B., Gibson, G.A.: Understanding disk failure rates: what does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage (TOS) 3(3), 8-es (2007)
14. Shen, J., Qu, Y., Zhang, W., Yu, Y.: Wasserstein distance guided representation learning for domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
15. Sun, X., et al.: System-level hardware failure prediction using deep learning. In: Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6 (2019)
16. Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 193–204 (2010)
17. Wang, Y., Miao, Q., Ma, E.W., Tsui, K.L., Pecht, M.G.: Online anomaly detection for hard disk drives based on Mahalanobis distance. IEEE Trans. Reliab. 62(1), 136–145 (2013)
18. Wilson, G., Cook, D.J.: A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. 11(5), 1–46 (2020)
19. Xie, Y., Feng, D., Wang, F., Zhang, X., Han, J., Tang, X.: OME: an optimized modeling engine for disk failure prediction in heterogeneous datacenter. In: 2018 IEEE 36th International Conference on Computer Design (ICCD), pp. 561–564. IEEE (2018)
20. Xu, C., Wang, G., Liu, X., Guo, D., Liu, T.Y.: Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 65(11), 3502–3508 (2016)
21. Xu, F., Han, S., Lee, P.P., Liu, Y., He, C., Liu, J.: General feature selection for failure prediction in large-scale SSD deployment. In: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 263–270. IEEE (2021)
22. Yang, W., Hu, D., Liu, Y., Wang, S., Jiang, T.: Hard drive failure prediction using big data. In: 2015 IEEE 34th Symposium on Reliable Distributed Systems Workshop (SRDSW), pp. 13–18. IEEE (2015)
23. Zhang, J., Huang, P., Zhou, K., Xie, M., Schelter, S.: HDDSE: enabling high-dimensional disk state embedding for generic failure detection system of heterogeneous disks in large data centers. In: Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, pp. 111–126 (2020)
24. Zhang, J., et al.: Minority disk failure prediction based on transfer learning in large data centers of heterogeneous disk systems. IEEE Trans. Parallel Distrib. Syst. 31(9), 2155–2169 (2020)
25. Zhou, H., et al.: A proactive failure tolerant mechanism for SSDs storage systems based on unsupervised learning. In: 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), pp. 1–10. IEEE (2021)


Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Wang Wang, Xuehai Tang, Biyu Zhou, Yangchen Dong, Yuanhang Feng, Jizhong Han & Songlin Hu
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Wang Wang


Corresponding author

Correspondence to Xuehai Tang.

Editor information

Editors and Affiliations

Royal Melbourne Institute of Technology, Melbourne, VIC, Australia
Zahir Tari
Tianjin University, Tianjin, China
Keqiu Li
University of Arizona, Tucson, AZ, USA
Hongyi Wu

7 Appendix

Proof. The discrepancy between the source and target domains is measured using the Wasserstein distance in DiskDA. Specifically, the $p$-th Wasserstein distance between two Borel probability measures $\mathbb {P}$ and $\mathbb {Q}$ is defined as:

$$\begin{aligned} W_{p}(\mathbb {P},\mathbb {Q}) = \left( \inf \limits _{\mu \in \Gamma (\mathbb {P},\mathbb {Q})}\int \rho (x,y)^{p}\,d\mu (x,y)\right) ^{1/p} \end{aligned}$$

(9)

where $\Gamma (\mathbb {P},\mathbb {Q})$ is the set of all joint distributions $\mu (x,y)$ whose marginal distributions are $\mathbb {P}$ and $\mathbb {Q}$. The coupling $\mu (x,y)$ can be viewed as a policy for transporting a unit quantity of material from $x$ to $y$, and $\rho (x,y)$ is the corresponding cost. The Wasserstein distance between $\mathbb {P}$ and $\mathbb {Q}$ therefore represents the minimum expected transport cost. As the Wasserstein distance satisfies the triangle inequality, the following inequality holds

$$\begin{aligned} W_p(\mathbb {P}_{s},\mathbb {P}_{t})\le W_{p}(\mathbb {P}_{s},\mathbb {P}_{t_{H}})+W_{p}(\mathbb {P}_{t_{H}},\mathbb {P}_{t}) \end{aligned}$$

(10)
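As a rough numerical illustration of definition (9) and inequality (10), the sketch below estimates the Wasserstein distance between one-dimensional empirical samples and checks the triangle inequality. It is not DiskDA's implementation. The function name `wasserstein_p`, the cost $\rho (x,y)=|x-y|$, the equal sample sizes, and the synthetic Gaussian samples standing in for $\mathbb {P}_{s}$, $\mathbb {P}_{t_{H}}$ and $\mathbb {P}_{t}$ are all assumptions made for illustration.

```python
import random


def wasserstein_p(xs, ys, p=1):
    """Empirical p-th Wasserstein distance between two equal-size 1-D samples.

    Assumes the cost rho(x, y) = |x - y|. In one dimension the optimal
    coupling pairs the i-th smallest x with the i-th smallest y, so the
    infimum in (9) reduces to a sum over sorted pairs.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return mean_cost ** (1.0 / p)


# Synthetic stand-ins for the three distributions (hypothetical data).
rng = random.Random(0)
n = 1000
source = [rng.gauss(0.0, 1.0) for _ in range(n)]        # plays the role of P_s
intermediate = [rng.gauss(1.0, 1.5) for _ in range(n)]  # plays the role of P_{t_H}
target = [rng.gauss(2.0, 1.0) for _ in range(n)]        # plays the role of P_t

w_st = wasserstein_p(source, target)
w_sh = wasserstein_p(source, intermediate)
w_ht = wasserstein_p(intermediate, target)

# Inequality (10): the direct distance never exceeds the two-hop distance.
print("W(s, t)            =", w_st)
print("W(s, tH) + W(tH, t) =", w_sh + w_ht)
```

The sorted-pair shortcut only applies to one-dimensional samples; for the multi-dimensional features used in practice, the distance would have to be computed or approximated by other means.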

Shen et al. [14] prove the generalization error bound of a classification function h in the target domain for unsupervised domain adaption based on Wasserstein distance as

$$\begin{aligned} \epsilon _{t}(h) \le \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_s,\mathbb {P}_t) + \lambda \end{aligned}$$

(11)

where $K$ is a constant such that all hypotheses $h$ are $K$-Lipschitz continuous, $\lambda $ is the combined error of the optimal hypothesis $h^{*}$ that minimizes the combined error $\epsilon _{s}(h)+\epsilon _{t}(h)$, and $\mathbb {P}_s$ and $\mathbb {P}_t$ are the distributions of the source and target domains, respectively. Let $C$ denote $2KW_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t})$. Substituting inequality (10) with $p=1$ into (11) yields Theorem 3.1.
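The substitution can be written out step by step. The statement of Theorem 3.1 is not reproduced in this section, so the last line below is only the form implied by (10), (11) and the definition of $C$; it is a sketch of the chain of inequalities, not a quotation of the theorem.

$$\begin{aligned} \epsilon _{t}(h)&\le \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t}) + \lambda &&\text {by (11)} \\ &\le \epsilon _{s}(h) + 2K\left( W_{1}(\mathbb {P}_{s},\mathbb {P}_{t_{H}}) + W_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t})\right) + \lambda &&\text {by (10) with } p=1 \\ &= \epsilon _{s}(h) + 2KW_{1}(\mathbb {P}_{s},\mathbb {P}_{t_{H}}) + C + \lambda &&\text {since } C = 2KW_{1}(\mathbb {P}_{t_{H}},\mathbb {P}_{t}) \end{aligned}$$

The second step uses $K\ge 0$, which holds for any Lipschitz constant, so multiplying inequality (10) by $2K$ preserves its direction.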


Cite this paper

Wang, W. et al. (2024). A Multi-source Domain Adaption Approach to Minority Disk Failure Prediction. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14488. Springer, Singapore. https://doi.org/10.1007/978-981-97-0801-7_4


DOI: https://doi.org/10.1007/978-981-97-0801-7_4
Published: 01 March 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0800-0
Online ISBN: 978-981-97-0801-7
eBook Packages: Computer Science, Computer Science (R0)
