An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems

Published: 13 August 2019

Abstract

Data centers equipped with large-scale storage systems are critical infrastructure in the era of big data. The enormous number of hard drives in these storage systems magnifies the failure probability, which may cause tremendous losses for both data service users and providers. Despite reactive fault-tolerance measures such as RAID, enhancing the reliability of large-scale storage systems remains difficult. Proactive prediction is an effective way to anticipate hard-drive failures before they occur. A series of models based on SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, serious challenges remain unsolved, such as the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, can not only monitor the status of hard drives but also assist in diagnosing failure causes. Experiments on real-world datasets show that the proposed architecture assesses hard-drive status and predicts impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented architecture can automatically reveal the degradation progression of hard drives and help administrators trace the causes of hard-drive failures.
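
To make the four-stage pipeline named in the abstract concrete, the following is a minimal forward-pass sketch in plain Python. It is not the authors' model: the abstract does not specify layer types or sizes, so the ReLU projection, the plain tanh recurrence, the dot-product attention scoring, the random weights, and every name here (`forward`, `init_params`, `window`) are assumptions chosen only for illustration.

```python
import math
import random

def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    # Numerically stable softmax over a list of floats.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_params(n_smart, n_feat, n_hidden, n_classes, seed=0):
    # Untrained random weights; a real model would learn these.
    rng = random.Random(seed)
    return (rand_matrix(rng, n_feat, n_smart),      # feature integration
            rand_matrix(rng, n_hidden, n_feat),     # recurrent input weights
            rand_matrix(rng, n_hidden, n_hidden),   # recurrent state weights
            [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],  # attention vector
            rand_matrix(rng, n_classes, n_hidden))  # classifier

def forward(seq, params):
    """Run one drive's window of SMART records through the four stages."""
    W_in, W_x, W_h, v_att, W_out = params
    # 1. Feature integration: project each raw record, then apply ReLU.
    feats = [[max(0.0, a) for a in matvec(W_in, x)] for x in seq]
    # 2. Temporal dependency extraction: plain tanh recurrence (assumption).
    h = [0.0] * len(W_h)
    states = []
    for f in feats:
        pre = [a + b for a, b in zip(matvec(W_x, f), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    # 3. Attention: one score per time step, normalised with softmax,
    #    then a weighted sum of the hidden states.
    alpha = softmax([sum(a * b for a, b in zip(v_att, s)) for s in states])
    context = [sum(a * s[i] for a, s in zip(alpha, states))
               for i in range(len(h))]
    # 4. Classification: probabilities over health-status classes.
    probs = softmax(matvec(W_out, context))
    return probs, alpha

# Toy input: 7 time steps of 5 made-up SMART attributes (not real data).
data_rng = random.Random(1)
window = [[data_rng.random() for _ in range(5)] for _ in range(7)]
params = init_params(n_smart=5, n_feat=4, n_hidden=3, n_classes=2)
probs, alpha = forward(window, params)
print("class probabilities:", probs)
print("attention weights per time step:", alpha)
```

The attention weights `alpha` form a distribution over time steps. In a sketch like this, they are the quantity one would inspect to see which records in the window most influenced the prediction, which is the kind of explainability the abstract describes.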

Information

Published In

ACM Transactions on Storage, Volume 15, Issue 3
August 2019
173 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3336116
Editor: Sam H. Noh
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2019
Accepted: 01 June 2019
Revised: 01 March 2019
Received: 01 June 2018
Published in TOS Volume 15, Issue 3

Author Tags

  1. Hard drive failure
  2. SMART
  3. attention mechanism
  4. deep neural network
  5. recurrent neural network

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Science Fund for Distinguished Young Scholars in Hunan Province
  • National Natural Science Foundation of China

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 4

Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) SiaDFP: A Disk Failure Prediction Framework Based on Siamese Neural Network in Large-Scale Data Center. IEEE Transactions on Services Computing 17:5 (2890-2903). DOI: 10.1109/TSC.2024.3394692. Online publication date: Sep-2024
  • (2024) ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios. IEEE Access 12 (4631-4641). DOI: 10.1109/ACCESS.2023.3346881. Online publication date: 2024
  • (2023) YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation. EURASIP Journal on Audio, Speech, and Music Processing 2023:1. DOI: 10.1186/s13636-023-00306-6. Online publication date: 19-Oct-2023
  • (2023) Weakly Supervised Hashing with Reconstructive Cross-modal Attention. ACM Transactions on Multimedia Computing, Communications, and Applications 19:6 (1-19). DOI: 10.1145/3589185. Online publication date: 8-Apr-2023
  • (2023) Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models. IEEE Transactions on Dependable and Secure Computing 20:1 (256-272). DOI: 10.1109/TDSC.2021.3131571. Online publication date: 1-Jan-2023
  • (2023) Construction of Power Equipment Running Status Monitoring System Based on Infrared Temperature Measurement Technology and Big Data Algorithm. 2023 International Conference on Power, Electrical Engineering, Electronics and Control (PEEEC) (250-254). DOI: 10.1109/PEEEC60561.2023.00054. Online publication date: 25-Sep-2023
  • (2023) Predicting Hard Disk Drive Faults, Failures and Associated Misbehavior’s. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (484-493). DOI: 10.1109/IPDPSW59300.2023.00082. Online publication date: May-2023
  • (2022) Abnormality Detection and Failure Prediction Using Explainable Bayesian Deep Learning: Methodology and Case Study with Industrial Data. Mathematics 10:4 (554). DOI: 10.3390/math10040554. Online publication date: 11-Feb-2022
  • (2022) Egocentric Early Action Prediction via Adversarial Knowledge Distillation. ACM Transactions on Multimedia Computing, Communications, and Applications 19:2 (1-21). DOI: 10.1145/3544493. Online publication date: 16-Jun-2022
  • (2022) Attention mechanism in intelligent fault diagnosis of machinery: A review of technique and application. Measurement 199 (111594). DOI: 10.1016/j.measurement.2022.111594. Online publication date: Aug-2022