6.1 Evaluations of SSDFD
We use Precision and Recall as the metrics to measure the performance of the different ML algorithms. Precision indicates the proportion of true positives (TPs) among all predicted failures. Recall represents the proportion of TPs within all actually failed disks. These metrics are defined as
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},
\]
where TP is “true positive”, FP is “false positive”, and FN is “false negative”.
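As a quick illustration, both metrics can be computed with scikit-learn, which our experiments also rely on; the labels below are made up purely for the example (1 = failed disk, 0 = good disk):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 1 = failed disk, 0 = good disk.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3
```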
The batch size in our experiments is set to 100. The generators for HDD A and HDD B share the same architecture, which is composed of four layers. The input layer has 12 units and receives randomly sampled signals of shape (100, 12) drawn from a normal distribution with a mean of 0 and a standard deviation of 1. The next two layers are hidden layers, each an LSTM with a cell size of 50. The output layer is a linear connection of shape (50, 12), with tanh as the activation function. The generator is trained with the Adam optimization algorithm at a learning rate of 0.0003.
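For concreteness, a minimal PyTorch sketch of such a generator is given below. The class name, the length-1 sequence handling, and the use of `nn.LSTM` with two stacked layers are our assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the described generator: 12-unit input, two LSTM hidden
    layers with 50 cells each, and a (50, 12) linear output with tanh."""
    def __init__(self, n_features=12, hidden_size=50):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, n_features)  # weight shape (50, 12)

    def forward(self, z):
        # z: (batch, 12) noise sampled from N(0, 1), treated here as a
        # length-1 sequence for the LSTM (an assumption on our part).
        h, _ = self.lstm(z.unsqueeze(1))
        return torch.tanh(self.out(h)).squeeze(1)

g = Generator()
opt_g = torch.optim.Adam(g.parameters(), lr=3e-4)
fake = g(torch.randn(100, 12))  # a batch of 100 generated samples
```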
The discriminator architecture for HDD A and HDD B is as follows. The generated data of shape (100, 12) is delivered to the input layer. The next layer is a linear mapping of shape (12, 50). The output layer is also a linear connection, of shape (50, 1), and sigmoid is used as the activation function. The discriminator is likewise trained with the Adam optimization algorithm at a learning rate of 0.0003.
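A matching sketch of the discriminator, under the same caveats (we read the (12, 50) and (50, 1) shapes as weight shapes):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the described discriminator: two linear layers with
    weight shapes (12, 50) and (50, 1), followed by a sigmoid."""
    def __init__(self, n_features=12, hidden_size=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_size),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, 12) real or generated samples -> (batch, 1) score
        return self.net(x)

d = Discriminator()
opt_d = torch.optim.Adam(d.parameters(), lr=3e-4)
```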
The generator and discriminator for HDD A and HDD B are trained separately. Figure 7 shows the costs of the generator and discriminator for the two models during the training process. It can be observed that, after some training time, a balance is reached between the generator and the discriminator. The gradients are zeroed before each backpropagation, which not only helps the model avoid overfitting and mode collapse but also drives the GAN to convergence.
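A minimal sketch of one adversarial training step is shown below, assuming the `Generator` and `Discriminator` sketches above, binary cross-entropy as the adversarial loss, and a `loader` yielding batches of real failed-disk samples; none of these specifics are confirmed by the text:

```python
import torch

bce = torch.nn.BCELoss()

for real in loader:  # real: (batch, 12) failed-disk samples
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator step: zero gradients before each backpropagation.
    opt_d.zero_grad()
    fake = g(torch.randn(batch, 12)).detach()
    loss_d = bce(d(real), ones) + bce(d(fake), zeros)
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output ~1 on fakes.
    opt_g.zero_grad()
    fake = g(torch.randn(batch, 12))
    loss_g = bce(d(fake), ones)
    loss_g.backward()
    opt_g.step()
    # Training stops once d(real) and d(fake) both approach 0.5.
```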
We implement the proposed method based on PyTorch 1.7.1 and scikit-learn 0.24.1. The experiments are run on an Intel Xeon E5-2620 v3 2.4 GHz CPU with 32 GB RAM under Ubuntu 20.04. The training is accelerated by two RTX 2080 Ti GPUs, and it is considered complete when the discriminator outputs for both real and fake samples are close to 0.5.
To demonstrate the effectiveness of SSDFD, five commonly used classification algorithms are compared: multilayer perceptron (MLP), random forest (RF), logistic regression (LR), decision tree (DT), and SVM. Each dataset is randomly divided into a training set and a test set at a ratio of 7:3. We then expand the failed disk data in the training set: generated failed disk data of different folds are added to the training set, and the above classifiers are trained on the augmented data. Finally, the models are validated on the test set, as sketched below.
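A minimal sketch of this evaluation pipeline with scikit-learn, assuming `X` and `y` are the SMART feature matrix and labels (1 = failed, 0 = good), and `generate_failed` is a hypothetical wrapper that samples from the trained generator:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# 7:3 random split of the dataset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Expand the failed-disk class by k-fold generated data (k = 1..30).
k = 20
n_fake = int((y_tr == 1).sum()) * k
X_fake = generate_failed(n_fake)  # hypothetical wrapper around the GAN

X_aug = np.vstack([X_tr, X_fake])
y_aug = np.concatenate([y_tr, np.ones(n_fake)])

clf = RandomForestClassifier().fit(X_aug, y_aug)  # one of the five classifiers
pred = clf.predict(X_te)
print(precision_score(y_te, pred), recall_score(y_te, pred))
```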
Before the large-scale experiments, we first conduct a faithfulness test of SSDFD. We randomly select 100, 300, 500, 700, and 900 failed disk data samples from HDD A and HDD B, respectively, generate virtual data based on each data volume, and then combine the original data with the generated virtual data to form a mixed dataset. The volume of failed disk data in each mixed dataset equals that of the original dataset. For example, since HDD A contains 1,156 failed disk records, we generate 1,056 pieces of virtual data based on 100 original failed disk records and combine the two to form the mixed dataset; the other data volumes are handled similarly. As shown in Figure 8, for HDD A, due to the insufficient failed disk data in the original dataset, the mixed datasets based on 100 and 300 original records are far inferior to the original dataset in terms of both precision and recall. When the virtual data is generated from 500, 700, or 900 original records, the precision and recall of the mixed dataset improve significantly; they remain slightly below those of the original dataset but come close to them. The results of HDD B are shown in Figure 9. Similarly, the precision and recall of the mixed datasets based on 100 and 300 records are much lower than those of the original dataset. However, because HDD B contains considerably more failed disk data than HDD A, the precision of the mixed datasets based on 500, 700, and 900 records rises faster than for HDD A and almost reaches the level of the original dataset. Although the recall is slightly lower than that of the original dataset, it is very close. This shows that, from 500 original records, SSDFD is able to generate virtual data whose characteristics are very close to those of real data under an equal-quantity comparison.
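The construction of a mixed dataset can be summarized in a few lines; `X_failed` and `generate_failed` follow the assumptions of the previous sketch, and the 1,156-record total for HDD A is taken from the text:

```python
import numpy as np

# Mixed-dataset construction for HDD A (1,156 failed records in total).
TOTAL = 1156
for n_real in [100, 300, 500, 700, 900]:
    idx = np.random.choice(len(X_failed), n_real, replace=False)
    seed = X_failed[idx]                             # original failed records
    virtual = generate_failed(TOTAL - n_real)        # hypothetical: GAN trained
                                                     # on `seed`, e.g. 1,056
    mixed = np.vstack([seed, virtual])               # same volume as original
```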
To verify the effect of adding different multiples of generated failed disk data under different volumes of original failed disk data, we randomly select 50,000 pieces of good disk data and 900 pieces of failed disk data from HDD A and HDD B, so that the ratio of good disk data to failed disk data is 50:1, simulating the small-sample environment. For real failed disk data volumes of 100, 300, 500, 700, and 900 in HDD A and HDD B, generated failed disk data are added in amounts of 1–30 times each data volume (100/300/500/700/900 \(\times\) 1–30 fold) to train the classifiers, respectively. These settings, which combine multiple expansion folds with different volumes of original failed disk data, verify the effect of SSDFD from multiple angles; the sweep is sketched below.
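A compact sketch of this sweep, reusing the hypothetical helpers above; `evaluate` stands in for the augment-train-test step:

```python
# Sweep over original volumes and expansion folds (values from the text).
folds = [1, 5, 10, 15, 20, 25, 30]
for n_real in [100, 300, 500, 700, 900]:
    for k in folds:
        n_fake = n_real * k          # e.g., 100 x 30 = 3,000 generated records
        # evaluate() is a hypothetical wrapper: augment the training set with
        # n_fake generated records, train the five classifiers, and report
        # the mean precision/recall over ten runs.
        evaluate(n_real, n_fake)
```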
The experiments are repeated ten times, and the mean values are taken as the experimental results. The precision and recall for HDD A and HDD B are shown in Figures 10 and 11, respectively. In these figures, “Raw” denotes the amount of original failed disk data, and “x-fold” means that x times that amount of generated failed disk data is added to the original data. For example, “100” in the caption of Figure 10(a) means that the amount of original failed disk data is 100 and that 1–30-fold (100/500/1,000/1,500/2,000/2,500/3,000) generated failed disk data are added, respectively.
For HDD A, it can be seen from Figure 10(a)–(d) that when the original failed disk data volume is 100 or 300, the precision values of the different methods are approximately 30%–40% and the recall is approximately 40%. When 1–15-fold generated failed disk data are added, both precision and recall improve slightly. When 20–30-fold generated failed disk data are added, the improvement is particularly obvious: the increase for 100 original failed disk records is approximately 15%, and that for 300 records is even greater (approximately 25%). The enhancement is nearly saturated at 25-fold or 30-fold. The precision and recall for 500 original failed disk records are shown in Figure 10(e) and (f). When 1–15-fold data are added, the precision of the different methods improves significantly, with the average exceeding 95% and the average improvement being approximately 15%. Since the training process is based only on failed disk data, the recall fluctuates slightly between 1-fold and 5-fold, similar to the fluctuations observed for 100 and 300 records. The precision and recall for 700 and 900 original failed disk records are shown in Figure 10(g)–(j). With 700 original records, the average precision of the five methods is below 70%, and the average recall is just above 60%. The average precision for 900 records is slightly higher than that for 700, and the average recall is basically the same. When multiples of generated failed disk data are added, precision and recall improve markedly, and the trends of the other experimental groups are similar. The improvement becomes obvious after 20-fold generated failed disk data are added and is then close to saturation; with 25–30-fold data, precision and recall remain relatively stable, with averages above 99% and 85%, respectively.
The experimental results for HDD B are shown in Figure 11; the precision and recall for 100 and 300 original failed disk records are displayed in Figure 11(a)–(d). Due to the small amount of data, the precision and recall of the different methods are relatively low, and the performance of the GAN model is not good enough. When less than 10-fold generated failed disk data are added, precision and recall improve only slightly, and with 1-fold data they are even lower than with the original failed disk data alone. After 15-fold generated failed disk data are added, the precision of the different methods increases by approximately 5%–10%, and the average recall increases by less than 5%. A significant improvement can be observed when 20–30-fold generated failed disk data are added: the average precision for 100 and 300 original failed disk records rises to nearly 60% and 80%, respectively. In terms of recall, the average for 300 original records exceeds 60%, approximately 10% higher than the average of the 100-record groups. The precision for 500, 700, and 900 original failed disk records is shown in Figure 11(e), (g), and (i), and the recall is shown in Figure 11(f), (h), and (j). As the volume of original failed disk data increases, the average precision of the different methods rises from 60% to 75%, and the average recall also increases by approximately 10%. After different multiples of generated failed disk data are added, the overall trend of HDD B is similar to that of HDD A, with significant improvement after 15-fold data are added. Once 20–30-fold generated failed disk data are added, further gains are barely observable, and the average precision and recall exceed 99% and 85%, respectively.
When the generated failed disk data added is 20–30 times the volume of the original data, the improvements in precision and recall achieved by the different ML algorithms are close to saturation.
Moreover, to better observe the usage conditions of the proposed method, Figures 12 and 13 depict the precision and recall of models trained on 100, 300, 500, 700, and 900 original failed disk records with 30-fold generated failed disk data added. When 100 original records are used to train the GAN model, for HDD A and HDD B the precision is approximately 60% and the recall only 40%–50%. When the data volume reaches 300, the average precision and recall of HDD A rise to 80% and 70%, respectively, and those of HDD B rise to 75% and 60%; compared with the 100-record case, the increase is more obvious. With 500 original records, the precision of the different algorithms on HDD A and HDD B improves significantly, almost exceeding 95%, and the average recall is also over 70%. As the volume of original failed disk data continues to increase, the precision on the two HDD datasets essentially reaches its limit and is difficult to improve further; the average recall on both datasets also reaches 80%.
Looking at the precision and recall of the HDDs overall, the trained model has a remarkable effect on the accuracy of HDD fault detection when it is trained with 300 original failed disk records.
The authors of [40] used the FDR and FAR to evaluate their method, where FDR has the same meaning as the “recall” used in this article. Figures 10–13 in this article exhibit a phenomenon similar to that reported in [40]: in the initial usage stage of the disks, the detection effects of the ML algorithms are not ideal because of the lack of relevant reliability data. Furthermore, our method trains the model with less failed disk data than the amount used in [40] and achieves the same experimental results. Likewise, [23] adopted precision and recall as evaluation metrics to compare the proposed LSTM model against DT, RF, and SVM on a dataset from Backblaze. Across the different algorithms, our experimental results outperform those of [23] in both precision and recall. This is because we improve the structure of the proposed model according to the characteristics of disk reliability data by considering its temporal features, which enables us to capture the characteristics of failed disk data more reasonably than [40] and [23], while expanding the datasets with less raw data. For the other algorithms used in the comparative experiments of [40] and [23], our method also achieves equivalent or better results based on less failed disk data, fully illustrating the effectiveness of the method proposed in this article.