6.1 Evaluations of SSDFD
We use Precision and Recall as the metrics to measure the performance of the different ML algorithms. Precision indicates the proportion of true positives (TPs) among all predicted failures. Recall represents the proportion of TPs within all actually failed disks. These metrics are defined as
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},
\]
where TP is “true positive”, FP is “false positive”, and FN is “false negative”.
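As a quick illustration, both metrics can be computed with scikit-learn, which our experiments also rely on; the labels below are made up purely for the example (1 = failed disk, 0 = good disk):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 1 = failed disk, 0 = good disk.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
```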
The batch size in our experiments is set to 100. The generators for HDD A and HDD B share the same architecture, which is composed of four layers. The input layer has 12 units and receives randomly sampled signals of shape (100, 12) drawn from a normal distribution with a mean of 0 and a standard deviation of 1. The next two layers are hidden layers, each an LSTM with a cell size of 50. The output layer is a linear connection of shape (50, 12), with tanh as the activation function. The generator is trained with the Adam optimization algorithm at a learning rate of 0.0003.
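For concreteness, a minimal PyTorch sketch of such a generator is given below. The class name, the length-1 sequence handling, and the use of `nn.LSTM` with two stacked layers are our assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the described generator: 12-unit input, two LSTM hidden
    layers with 50 cells each, and a (50, 12) linear output with tanh."""
    def __init__(self, n_features=12, hidden_size=50):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, n_features)  # weight shape (50, 12)

    def forward(self, z):
        # z: (batch, 12) noise sampled from N(0, 1), treated here as a
        # length-1 sequence for the LSTM (an assumption on our part).
        h, _ = self.lstm(z.unsqueeze(1))
        return torch.tanh(self.out(h)).squeeze(1)

g = Generator()
opt_g = torch.optim.Adam(g.parameters(), lr=3e-4)
fake = g(torch.randn(100, 12))  # a batch of 100 generated samples
```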
The discriminator architecture for HDD A and HDD B is as follows. The generated data of shape (100, 12) is delivered to the input layer. The next layer is a linear mapping of shape (12, 50). The output layer is also a linear connection, of shape (50, 1), and sigmoid is used as the activation function. The discriminator is likewise trained with the Adam optimization algorithm at a learning rate of 0.0003.
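A matching sketch of the discriminator, under the same caveats (we read the (12, 50) and (50, 1) shapes as weight shapes):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the described discriminator: two linear layers with
    weight shapes (12, 50) and (50, 1), followed by a sigmoid."""
    def __init__(self, n_features=12, hidden_size=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_size),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, 12) real or generated samples -> (batch, 1) score
        return self.net(x)

d = Discriminator()
opt_d = torch.optim.Adam(d.parameters(), lr=3e-4)
```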
The generator and discriminator for HDD A and HDD B are trained separately. Figure 7 shows the costs of the generator and discriminator for the two models during the training process. It can be observed that, after some training time, a balance is reached between the generator and the discriminator. The gradients are zeroed before each backpropagation, which not only helps the model avoid overfitting and mode collapse but also drives the GAN to convergence.
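A minimal sketch of one adversarial training step is shown below, assuming the `Generator` and `Discriminator` sketches above, binary cross-entropy as the adversarial loss, and a `loader` yielding batches of real failed-disk samples; none of these specifics are confirmed by the text:

```python
import torch

bce = torch.nn.BCELoss()

for real in loader:  # real: (batch, 12) failed-disk samples
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator step: zero gradients before each backpropagation.
    opt_d.zero_grad()
    fake = g(torch.randn(batch, 12)).detach()
    loss_d = bce(d(real), ones) + bce(d(fake), zeros)
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output ~1 on fakes.
    opt_g.zero_grad()
    fake = g(torch.randn(batch, 12))
    loss_g = bce(d(fake), ones)
    loss_g.backward()
    opt_g.step()
    # Training stops once d(real) and d(fake) both approach 0.5.
```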
We implement the proposed method based on PyTorch 1.7.1 and scikit-learn 0.24.1. The experiments are run on an Intel Xeon E5-2620 v3 2.4 GHz CPU with 32 GB RAM under Ubuntu 20.04. The training is accelerated by two RTX 2080 Ti GPUs, and it is considered complete when the discriminator outputs for both real and fake samples are close to 0.5.
To demonstrate the effectiveness of SSDFD, five commonly used classification algorithms are compared: multilayer perceptron (MLP), random forest (RF), logistic regression (LR), decision tree (DT), and SVM. Each dataset is randomly divided into a training set and a test set at a ratio of 7:3. We then expand the failed disk data in the training set: generated failed disk data of different folds are added to the training set, and the above classifiers are trained on the augmented data. Finally, the models are validated on the test set, as sketched below.
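A minimal sketch of this evaluation pipeline with scikit-learn, assuming `X` and `y` are the SMART feature matrix and labels (1 = failed, 0 = good), and `generate_failed` is a hypothetical wrapper that samples from the trained generator:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# 7:3 random split of the dataset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Expand the failed-disk class by k-fold generated data (k = 1..30).
k = 20
n_fake = int((y_tr == 1).sum()) * k
X_fake = generate_failed(n_fake)  # hypothetical wrapper around the GAN

X_aug = np.vstack([X_tr, X_fake])
y_aug = np.concatenate([y_tr, np.ones(n_fake)])

clf = RandomForestClassifier().fit(X_aug, y_aug)  # one of the five classifiers
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```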
Before the large-scale experiments, we first conduct a faithfulness test of SSDFD. We randomly select 100, 300, 500, 700, and 900 failed disk data samples from HDD A and HDD B, respectively, generate virtual data based on each data volume, and then combine the original data with the generated virtual data to form a mixed dataset. The volume of failed disk data in each mixed dataset equals that of the original dataset. For example, since HDD A contains 1,156 failed disk records, we generate 1,056 pieces of virtual data based on 100 original failed disk records and combine the two to form the mixed dataset; the other data volumes are handled similarly. As shown in Figure 8, for HDD A, due to the insufficient failed disk data in the original dataset, the mixed datasets based on 100 and 300 original records are far inferior to the original dataset in terms of both precision and recall. When the virtual data is generated from 500, 700, or 900 original records, the precision and recall of the mixed dataset improve significantly; they remain slightly below those of the original dataset but come close to them. The results of HDD B are shown in Figure 9. Similarly, the precision and recall of the mixed datasets based on 100 and 300 records are much lower than those of the original dataset. However, because HDD B contains considerably more failed disk data than HDD A, the precision of the mixed datasets based on 500, 700, and 900 records rises faster than for HDD A and almost reaches the level of the original dataset. Although the recall is slightly lower than that of the original dataset, it is very close. This shows that, from 500 original records, SSDFD is able to generate virtual data whose characteristics are very close to those of real data under an equal-quantity comparison.
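The construction of a mixed dataset can be summarized in a few lines; `X_failed` and `generate_failed` follow the assumptions of the previous sketch, and the 1,156-record total for HDD A is taken from the text:

```python
import numpy as np

# Mixed-dataset construction for HDD A (1,156 failed records in total).
TOTAL = 1156
for n_real in [100, 300, 500, 700, 900]:
    idx = np.random.choice(len(X_failed), n_real, replace=False)
    seed = X_failed[idx]                             # original failed records
    virtual = generate_failed(TOTAL - n_real)        # hypothetical: GAN trained
                                                     # on `seed`, e.g. 1,056
    mixed = np.vstack([seed, virtual])               # same volume as original
```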
To verify the effect of adding different multiples of generated failed disk data under different volumes of original failed disk data, we randomly select 50,000 pieces of good disk data and 900 pieces of failed disk data from HDD A and HDD B, so that the ratio of good disk data to failed disk data is 50:1, simulating the small-sample environment. For real failed disk data volumes of 100, 300, 500, 700, and 900 in HDD A and HDD B, generated failed disk data are added in amounts of 1–30 times each data volume (100/300/500/700/900 \(\times\) 1–30 fold) to train the classifiers, respectively. These settings, which combine multiple expansion folds with different volumes of original failed disk data, verify the effect of SSDFD from multiple angles; the sweep is sketched below.
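A compact sketch of this sweep, reusing the hypothetical helpers above; `evaluate` stands in for the augment-train-test step:

```python
# Sweep over original volumes and expansion folds (values from the text).
folds = [1, 5, 10, 15, 20, 25, 30]
for n_real in [100, 300, 500, 700, 900]:
    for k in folds:
        n_fake = n_real * k          # e.g., 100 x 30 = 3,000 generated records
        # evaluate() is a hypothetical wrapper: augment the training set with
        # n_fake generated records, train the five classifiers, and report
        # the mean precision/recall over ten runs.
        evaluate(n_real, n_fake)
```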
The experiments are repeated ten times, and the mean values are taken as the experimental results. The precision and recall for HDD A and HDD B are shown in Figures 10 and 11, respectively. In these figures, “Raw” denotes the amount of original failed disk data, and “x-fold” means that x times that amount of generated failed disk data is added to the original data. For example, “100” in the caption of Figure 10(a) means that the amount of original failed disk data is 100 and that 1–30-fold (100/500/1,000/1,500/2,000/2,500/3,000) generated failed disk data are added, respectively.
For HDD A, it can be seen from Figure 10(a)–(d) that when the original failed disk data volume is 100 or 300, the precision values of the different methods are approximately 30%–40% and the recall is approximately 40%. When 1–15-fold generated failed disk data are added, both precision and recall improve slightly. When 20–30-fold generated failed disk data are added, the improvement is particularly obvious: the increase for 100 original failed disk records is approximately 15%, and that for 300 records is even greater (approximately 25%). The enhancement is nearly saturated at 25-fold or 30-fold. The precision and recall for 500 original failed disk records are shown in Figure 10(e) and (f). When 1–15-fold data are added, the precision of the different methods improves significantly, with the average exceeding 95% and the average improvement being approximately 15%. Since the training process is based only on failed disk data, the recall fluctuates slightly between 1-fold and 5-fold, similar to the fluctuations observed for 100 and 300 records. The precision and recall for 700 and 900 original failed disk records are shown in Figure 10(g)–(j). With 700 original records, the average precision of the five methods is below 70%, and the average recall is just above 60%. The average precision for 900 records is slightly higher than that for 700, and the average recall is basically the same. When multiples of generated failed disk data are added, precision and recall improve markedly, and the trends of the other experimental groups are similar. The improvement becomes obvious after 20-fold generated failed disk data are added and is then close to saturation; with 25–30-fold data, precision and recall remain relatively stable, with averages above 99% and 85%, respectively.
The experimental results for HDD B are shown in Figure 11; the precision and recall for 100 and 300 original failed disk records are displayed in Figure 11(a)–(d). Due to the small amount of data, the precision and recall of the different methods are relatively low, and the performance of the GAN model is not good enough. When less than 10-fold generated failed disk data are added, precision and recall improve only slightly, and with 1-fold data they are even lower than with the original failed disk data alone. After 15-fold generated failed disk data are added, the precision of the different methods increases by approximately 5%–10%, and the average recall increases by less than 5%. A significant improvement can be observed when 20–30-fold generated failed disk data are added: the average precision for 100 and 300 original failed disk records rises to nearly 60% and 80%, respectively. In terms of recall, the average for 300 original records exceeds 60%, approximately 10% higher than the average of the 100-record groups. The precision for 500, 700, and 900 original failed disk records is shown in Figure 11(e), (g), and (i), and the recall is shown in Figure 11(f), (h), and (j). As the volume of original failed disk data increases, the average precision of the different methods rises from 60% to 75%, and the average recall also increases by approximately 10%. After different multiples of generated failed disk data are added, the overall trend of HDD B is similar to that of HDD A, with significant improvement after 15-fold data are added. Once 20–30-fold generated failed disk data are added, further gains are barely observable, and the average precision and recall exceed 99% and 85%, respectively.
When the generated failed disk data added is 20–30 times the volume of the original data, the improvements in precision and recall achieved by the different ML algorithms are close to saturation.
Moreover, to better observe the usage conditions of the proposed method, Figures 12 and 13 depict the precision and recall of models trained on 100, 300, 500, 700, and 900 original failed disk records with 30-fold generated failed disk data added. When 100 original records are used to train the GAN model, for HDD A and HDD B the precision is approximately 60% and the recall only 40%–50%. When the data volume reaches 300, the average precision and recall of HDD A rise to 80% and 70%, respectively, and those of HDD B rise to 75% and 60%; compared with the 100-record case, the increase is more obvious. With 500 original records, the precision of the different algorithms on HDD A and HDD B improves significantly, almost exceeding 95%, and the average recall is also over 70%. As the volume of original failed disk data continues to increase, the precision on the two HDD datasets essentially reaches its limit and is difficult to improve further; the average recall on both datasets also reaches 80%.
Looking at the precision and recall of the HDDs overall, the trained model has a remarkable effect on the accuracy of HDD fault detection when it is trained with 300 original failed disk records.
The authors of [40] used the FDR and FAR to evaluate their method, where FDR has the same meaning as the “recall” used in this article. Figures 10–13 in this article exhibit a phenomenon similar to that reported in [40]: in the initial usage stage of the disks, the detection effects of the ML algorithms are not ideal because of the lack of relevant reliability data. Furthermore, our method trains the model with less failed disk data than the amount used in [40] and achieves the same experimental results. Likewise, [23] adopted precision and recall as evaluation metrics to compare the proposed LSTM model against DT, RF, and SVM on a dataset from Backblaze. Across the different algorithms, our experimental results outperform those of [23] in both precision and recall. This is because we improve the structure of the proposed model according to the characteristics of disk reliability data by considering its temporal features, which enables us to capture the characteristics of failed disk data more reasonably than [40] and [23], while expanding the datasets with less raw data. For the other algorithms used in the comparative experiments of [40] and [23], our method also achieves equivalent or better results based on less failed disk data, fully illustrating the effectiveness of the method proposed in this article.