
Original Article

Proc IMechE Part O: J Risk and Reliability
2020, Vol. 234(1) 168–182
© IMechE 2019
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/1748006X19867776
journals.sagepub.com/home/pio

A domain adaptation model for early gear pitting fault diagnosis based on deep transfer learning network

Jialin Li1, Xueyi Li1, David He1,2 and Yongzhi Qu3

Abstract
In recent years, research on gear pitting fault diagnosis has been conducted. Most of the research has focused on the feature extraction and feature selection process, and the diagnostic models are only suitable for one working condition. To diagnose early gear pitting faults under multiple working conditions, this article proposes a domain adaptation diagnostic model based on an improved deep neural network and transfer learning with raw vibration signals. A particle swarm optimization algorithm and L2 regularization are used to optimize the improved deep neural network to improve the stability and accuracy of the diagnosis. When using the domain adaptation diagnostic model for fault diagnosis, it is necessary to discriminate whether the target domain (test data) is the same as the source domain (training data). If the target domain and the source domain are consistent, the trained improved deep neural network can be used directly for diagnosis. Otherwise, transfer learning is combined with the improved deep neural network to develop a deep transfer learning network that improves the domain adaptability of the diagnostic model. Vibration signals for seven gear types with early pitting faults under 25 working conditions, collected from a gear test rig, are used to validate the proposed method. The validation results confirm that the developed domain adaptation diagnostic model significantly improves adaptability across multiple working conditions.

Keywords
Early gear pitting, multiple working conditions, transfer learning, improved deep neural network

Date received: 16 December 2018; accepted: 20 June 2019

1 School of Mechanical Engineering and Automation, Northeastern University, Shenyang, China
2 Department of Mechanical and Industrial Engineering, The University of Illinois at Chicago, Chicago, IL, USA
3 School of Mechanical and Electronic Engineering, Wuhan University of Technology, Wuhan, China

Corresponding author:
David He, Department of Mechanical and Industrial Engineering, The University of Illinois at Chicago, 842 West Taylor Street, Chicago, IL 60607, USA.
Email: davidhe@uic.edu

Introduction

Gears are common transmission devices in machinery and are widely used in aircraft, automobiles, machine tools, and so on. In addition, due to harsh working conditions, gears have a relatively high fault rate. Gear faults include broken teeth, cracked teeth, and tooth pitting; gear pitting is responsible for 31% of all faults.1

In recent years, numerous research projects have been conducted on the diagnosis of gear pitting, and they can be summarized into two types: model-based methods and data-driven methods.2 In model-based methods, experts usually establish a dynamic model to simulate the system operation and then modify it based on the error between the actual outputs and the ideal outputs.3 Applying a model-based method requires not only a thorough understanding of the system but also multiple parameter adjustments to optimize the model, and the accuracy of the model directly affects the diagnosis result. For example, Park et al.4 used finite element models of two gear faults to simulate the gear operation, obtain the transmission error, and use it to identify the different characteristics. Shi et al.5 established a double-motor torque and rotational speed coupling model to carry out a detailed simulation analysis of situations that are difficult to realize or test on an experimental platform. In contrast, data-driven methods do not require much experience with the system, and a model established from the data can be used to diagnose the gear faults. Traditional data-driven methods typically involve three necessary processes: (1) feature
extraction, (2) feature selection, and (3) pattern recognition.6 Saravanan et al.7 used wavelet analysis to extract features from vibration signals and used two pattern recognition methods, artificial neural network (ANN) and proximal support vector machine (PSVM), to diagnose gearbox faults. Wu and Chan8 used acoustic emission signals instead of vibration signals for gear fault diagnosis, and a continuous wavelet transform technique combined with a feature selection of the energy spectrum was used to generate the inputs of the ANN. In the study by Samanta et al.,9 statistical features extracted from time domain signals were applied as the inputs of ANN and SVM, and the genetic algorithm (GA) was applied for optimization. Traditional pattern recognition methods such as ANN and SVM can only achieve shallow learning tasks, and the diagnosis performance is directly affected by the feature selection process.10,11 Moreover, the feature selection process is done manually, largely depending on prior diagnostic knowledge, and the feature selection method for one fault diagnosis problem may not be applicable to another.

In recent years, enthusiasm for deep learning has been triggered by Hinton et al.12 Deep learning can overcome the shortcomings of shallow models. When it is applied to fault diagnosis, the feature selection process can be omitted, which saves time and labor. There are many different methods for deep learning; according to the training method, they can be divided into two types: supervised training and unsupervised training.13 Methods for supervised training include the deep neural network (DNN)14 and the convolutional neural network (CNN).15,16 Methods for unsupervised training include the deep belief network (DBN)17,18 and the autoencoder (AE).19,20 Heydarzadeh et al.21 applied the discrete wavelet transform (DWT) results of three common monitoring signals (vibration, acoustic, and torque) as the inputs of a DNN to diagnose five classes of gear faults. Sun et al.22 applied a dual-tree complex wavelet transform (DTCWT) to extract multi-scale features of signals, and the CNN was applied for gear fault diagnosis. Shao et al.23 also applied DTCWT for feature extraction and used an adaptive deep belief network (ADBN) for fault diagnosis. Jia et al.24 used AE technology to pre-train the parameters of a DNN to diagnose rotating machinery faults. Several of the references presented above used different deep learning methods to diagnose mechanical faults, but all include a feature extraction process such as DWT. Manual feature extraction is time-consuming and labor-intensive, and unsuitable extraction methods will also affect the diagnosis results. Jing et al.25 proposed an adaptive gearbox fault diagnosis method based on a deep convolutional neural network (DCNN); there is no feature extraction process in that work, and the raw data collected from the experiment were directly applied as the inputs of the DCNN. Wang et al.26 proposed the adaptive deep convolutional neural network (ADCNN) method to diagnose bearing faults. Qu et al.27 used the deep sparse autoencoder (SAE) method to diagnose gear pitting: the authors combined dictionary learning with sparse coding, stacked it into the AE network, and diagnosed two types of gear conditions (healthy, pitting) with raw data as the inputs of the deep SAE.

The domain adaptability of the diagnostic model is also a key evaluation criterion. Ren et al.28 proposed a new feature extraction method for diagnosing rolling bearing faults under varying speed conditions: considering the increase in energy when the ball passes through the fault, the frequency values are divided by the instantaneous speed and the corresponding amplitude to form a new fault feature array, and a Euclidean distance classifier was used for recognition. Tong et al.29 proposed domain adaptation using transferable features (DATF) to solve the diagnosis of different working conditions; they used maximum mean discrepancy (MMD) to reduce the marginal and conditional distributions simultaneously across domains. Cheng et al.30 first transformed the vibration signal into a two-dimensional recurrence plot (RP) and then utilized speeded-up robust features to extract fault features, considering the visual invariance characteristic of the human visual system (HVS). Liu et al.31 applied the Hilbert–Huang transform (HHT), singular value decomposition (SVD), and an Elman neural network to solve the bearing fault diagnosis under variable working conditions; this method mainly applies SVD to reduce the dimension of the instantaneous amplitude matrix and obtain insensitive fault features. Zhang et al.32 applied transfer learning (TL) to make diagnostic methods quickly adaptable to other working conditions.

Most of the aforementioned gear pitting fault diagnosis methods include feature extraction and feature selection processes. Moreover, a conventional diagnostic model is only suitable for fault diagnosis under one working condition. This article proposes a newly developed DNN methodology for the diagnosis of early gear pitting faults. Meanwhile, the particle swarm optimization (PSO) algorithm and L2 regularization are used to optimize the traditional DNN. In addition, TL is combined to develop a deep transfer learning network (DTLN) to improve the domain adaptability of the diagnostic model. The innovation of the proposed method is that the feature extraction and selection processes are omitted and the domain adaptability of the network is improved. The rest of the article is organized as follows: in 'The proposed method' section, the methodology of the proposed method is introduced. In 'Experiment setup and data segmentation' section, the data collected from the experimental test rig and the preprocessing of the collected vibration data are explained. In 'Results and discussions' section, the validation of the proposed method using the collected vibration data is reported. Finally, 'Conclusions' section concludes the article.
The proposed method

The improved deep neural network

Conventional DNN. DNN has a fully connected network structure: neurons in adjacent layers are connected to each other, and neurons in the same layer are not connected to each other. The forward propagation process of DNN is similar to that of ANN. The calculation principles of data passing through layer m in DNN are shown in equations (1)–(3)33

$$u_k^m = \sum_{i=1}^{n} w_{ki}^m x_i^m \quad (1)$$

$$z_k^m = u_k^m - b_k^m \quad (2)$$

$$y_k^m = f\left(z_k^m\right) \quad (3)$$

where $x_i^m$ is the $i$th input value of layer $m$, $w_{ki}^m$ is the weight of layer $m$, $u_k^m$ is the weighted sum of all inputs, $b_k^m$ is the bias vector, $f(\cdot)$ is the activation function, and $y_k^m$ is the output of layer $m$.

There are many activation functions available. The following introduces two commonly used in DNN, the sigmoid function and the ReLU function, as shown in equations (4) and (5). Equation (6) is the derivative function of ReLU34

$$f_{\mathrm{sig}} = \frac{1}{e^{-z_k} + 1} \quad (4)$$

$$f_{\mathrm{ReLU}} = \max(0, z_k) \quad (5)$$

$$\frac{d}{dz} f_{\mathrm{ReLU}} = \begin{cases} 1, & z > 0 \\ 0, & z \le 0 \end{cases} \quad (6)$$

Both activation functions have their own advantages and disadvantages. The output of sigmoid ranges from 0 to 1, so it can control the amplitude change in deep learning, but it contains an exponential calculation, so the amount of computation is large. Moreover, when sigmoid is used as the activation function, the gradient and sparsity problems worsen as the number of layers and neurons increases. The advantage of ReLU over sigmoid is that it has better sparseness and can recognize the fault feature from the multi-scale signal features in deep learning. The derivative of ReLU is 1 or 0, so the network can better avoid the problems of gradient vanishing and gradient explosion. However, the forced sparsity of ReLU also leads to neuron "necrosis," resulting in a model that cannot extract valid features. Moreover, it cannot limit the amplitude like the sigmoid activation function.

The last layer of the network has a softmax classifier, as shown in equation (7). We obtain the vector y after inputting the vector x to the DNN. There must be an error between the actual output y and the desired output o, and we can use this error to modify the network weights and biases. There are two commonly used loss functions: equation (8) is the mean square error loss function and equation (9) is the cross-entropy loss function35

$$f_{\mathrm{softmax}}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \quad (7)$$

$$E_{\mathrm{MSE}} = \frac{1}{2}(o - y)^2 \quad (8)$$

$$E_{\mathrm{cross\text{-}entropy}} = -\left[o \ln y + (1 - o)\ln(1 - y)\right] \quad (9)$$

where o is the ideal output vector, y is the actual output vector, and $E_0$ is the error of the output vector.

Equations (10) and (11) modify the network weights and biases: the partial derivative of the loss function with respect to the weights and biases is multiplied by the learning rate. Equations (12)–(14) demonstrate that the cross-entropy loss function can train the network faster than the traditional mean square error loss function

$$\Delta w = -\eta \frac{\partial E_{\mathrm{MSE}}}{\partial w} \quad (10)$$

$$\Delta b = -\eta \frac{\partial E_{\mathrm{MSE}}}{\partial b} \quad (11)$$

$$\frac{\partial E_{\mathrm{MSE}}}{\partial w} = -(o - y)\, f'_{\mathrm{sig}}(wx + b)\, x \quad (12)$$

$$\frac{\partial E_{\mathrm{cross\text{-}entropy}}}{\partial w} = (y - o)\, x \quad (13)$$

$$\frac{\Delta_{\mathrm{MSE}}}{\Delta_{\mathrm{cross\text{-}entropy}}} = \frac{|o - y| \cdot f'_{\mathrm{sig}}(wx + b) \cdot |x|}{|y - o| \cdot |x|} = f'_{\mathrm{sig}}(wx + b) \le 0.25 \quad (14)$$

where $\Delta w$ is the correction of the weights, $\Delta b$ is the correction of the biases, and $\eta$ is the learning rate.

When the cross-entropy function is chosen as the loss function and sigmoid as the activation function, the weights of the last layer are corrected as shown in equation (15). Equation (16) expands the partial derivative in equation (15), and equation (18) is obtained after equation (17) is applied to equation (16). Similarly, the correction of the biases $\Delta b$ can be obtained, as shown in equation (19)

$$\Delta w = -\eta \frac{\partial E_{\mathrm{cross\text{-}entropy}}}{\partial w} \quad (15)$$

$$\frac{\partial E_{\mathrm{cross\text{-}entropy}}}{\partial w} = -\left(\frac{o}{y} - \frac{1 - o}{1 - y}\right)\frac{\partial y}{\partial w} = -\frac{o - y}{y(1 - y)}\,\frac{\partial y}{\partial w} \quad (16)$$

$$\frac{\partial y}{\partial w} = y(1 - y)\, x \quad (17)$$

$$\Delta w = -\eta\,(y - o)\, x \quad (18)$$

$$\Delta b = -\eta\,(y - o) \quad (19)$$

Improved DNN. We use the ELU function36 instead of the ReLU activation function. The ELU function is shown in equation (20), and equation (21) is its derivative function. Figure 1 shows the graphs of the two activation functions; in Figure 1, X and Y are the input and output of the activation function, respectively.
Comparing equations (6) and (20), we can see that the ELU function is unchanged for z greater than 0 and changed for z less than 0. Therefore, it retains the advantage of ReLU in preventing the gradient from disappearing, while preserving part of the information for inputs less than 0. Thus, the average activation of the neurons is closer to 0, which reduces the bias shift of the active units. Since the soft saturation characteristic of the function is activated when the input value is small, the robustness to noise is improved

$$f_{\mathrm{ELU}} = \begin{cases} z_k, & z > 0 \\ \alpha\left(e^{z_k} - 1\right), & z \le 0 \end{cases} \quad (20)$$

$$\frac{d}{dz} f_{\mathrm{ELU}} = \begin{cases} 1, & z > 0 \\ \alpha e^{z}, & z \le 0 \end{cases} = \begin{cases} 1, & z > 0 \\ f_{\mathrm{ELU}}(z_k) + \alpha, & z \le 0 \end{cases} \quad (21)$$

Figure 1. Comparison of two activation functions: (a) ReLU and (b) ELU.

In order to avoid overfitting of the DNN, L2 regularization37 is used to correct the loss function, as shown in equation (22). Equations (23) and (24) reveal the nature of the L2 regularization optimization: the L2 regularization term is added to the error function and directly affects the network parameter correction. As shown in equations (23) and (24), the correction of the bias does not change, but the weight correction does. Equation (26) is the final weight update function. It can be seen that with L2 regularization the effect of weight decay is achieved, because the weight is multiplied by a coefficient less than 1

$$E_{\mathrm{new}} = -\left[o \ln y + (1 - o)\ln(1 - y)\right] + \frac{\lambda}{2n}\sum_{w_{ki}} w_{ki}^2 \quad (22)$$

$$\frac{\partial E}{\partial w} = \frac{\partial E_0}{\partial w} + \frac{\lambda}{n} w \quad (23)$$

$$\frac{\partial E}{\partial b} = \frac{\partial E_0}{\partial b} \quad (24)$$

$$w_{\mathrm{new}} = w - \eta\left(\frac{\partial E_0}{\partial w} + \frac{\lambda}{n} w\right) \quad (25)$$

$$w_{\mathrm{new}} = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta\,\frac{\partial E_0}{\partial w}, \quad \left(1 - \frac{\eta\lambda}{n}\right) < 1 \quad (26)$$

where E is the output error corrected by L2 regularization, $E_0$ is the unregularized output error, λ is the coefficient of L2 regularization, and n is the sample size.

TL and fine-tuning strategy

The improved deep neural network (IDNN) can perform fault diagnosis well, but the trained network can only diagnose faults under one working condition. When data from other working conditions are applied as inputs, a new diagnostic model needs to be trained to adapt to that working condition. In practical applications, the working conditions of the equipment change over time. Training a network for each working condition requires not only a large amount of time but also a large number of training samples, so a diagnostic model adapted to multiple working conditions is highly desirable.

TL38–40 is a popular approach in machine learning. It can be applied between two related domains to reduce training time and save training samples. Combining IDNN with TL to develop a DTLN can make the diagnostic model more adaptable to different working conditions. Figure 2 shows a comparison between traditional machine learning and TL. In traditional machine learning, each task or domain requires separate training of its corresponding diagnostic model. This not only requires a large number of training samples for each working condition, but also takes a lot of time.

The training process of TL shown in Figure 2 includes the source domain $D_s$ and the target domain $D_t$. The source domain in TL is the same as domain A in traditional machine learning. When the target domain is used for training, the diagnostic model pre-trained in the source domain is transferred to the target domain, and then the pre-trained model is fine-tuned with a small number of target domain samples. In this way, one diagnostic model can be used under multiple working conditions with only a small number of samples and less training time.

Particle swarm optimization

PSO is a method inspired by the behavior of birds searching for food, which was proposed by Kennedy and Eberhart. In previous papers,41,42 the PSO algorithm was analyzed in detail. Similar algorithms include ant colony optimization (ACO)43 and GA,44 all of which are inspired by the behavior or laws of biology.

PSO is widely used, since it has great adaptability, easy implementation, and few parameters to be set. Its basic principle can be described as n particles in a P-dimensional space whose speed and location change over time. The speed and position of particle i can be expressed by $v_i = (v_{i1}, v_{i2}, \ldots, v_{ip})$ and $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$; $P_f$ is the fitness of a particle, and the size of the fitness corresponds to the distance between each bird and the food. The individual extremum $P_b$ and the population extremum $g_b$ can be updated according to the particle fitness, and then we can use the individual extremum and the population extremum to calculate the particle velocity and position, as shown in equations (27) and (28)
$$v_{ij}(t + 1) = r\, v_{ij}(t) + c_1 e_1\left(P_{bj}(t) - x_{ij}(t)\right) + c_2 e_2\left(g_{bj} - x_{ij}(t)\right) \quad (27)$$

$$x_{ij}(t + 1) = x_{ij}(t) + v_{ij}(t + 1) \quad (28)$$

where i is the ith particle; j is the jth dimension of the P-dimensional space; $c_1$ and $c_2$ are the learning factors: $c_1$ is the particle's own part, expressing its own understanding of and influence on the optimization, and $c_2$ is the social part, indicating that the particles are affected by the population; t is the number of iterations; $e_1$ and $e_2$ are random numbers evenly distributed between 0 and 1; and r is the inertia weight of the particle, indicating that it is affected by its last speed.

Figure 2. Different learning processes between traditional machine learning and transfer learning.

PSO is used to optimize the parameters in the DNN. If the DNN contains a total of k parameters, the dimension j of the space in equations (27) and (28) is equal to k. The number of particles is set empirically, and each particle contains j parameters. The best performing particle after t iterations is selected, and its j parameters are attached to the DNN as the initial parameters. The inertia weight determines the influence of the previous speed of the particle on the current speed, which plays a role in balancing the global search and the local search. As shown in equation (29), the weight decays linearly with the iterations. This gives the particle swarm algorithm strong search ability at the beginning of the iterations and good local search ability in the later stage45

$$r = r_{\max} - (r_{\max} - r_{\min})\,\frac{t}{t_{\max}} \quad (29)$$

where $r_{\max}$ is the set maximum weight, $r_{\min}$ is the set minimum weight, and $t_{\max}$ is the maximum number of iterations.

The position and velocity of the particles both have a range. When the velocity or position value is out of range, the processes shown in equation (30) will be performed

$$v_{ij} = \begin{cases} v_{\max}, & v_{ij} > v_{\max} \\ v_{\min}, & v_{ij} < v_{\min} \end{cases}; \qquad x_{ij} = \begin{cases} x_{\max}, & x_{ij} > x_{\max} \\ x_{\min}, & x_{ij} < x_{\min} \end{cases} \quad (30)$$

The range of the particle velocity cannot be too large; otherwise the system will be unstable and it is easy to "skip" the optimal solution during the particle iterations. The particle activity range is set similarly to the speed range, and limiting the particles' position helps find the optimal solution.

The initial position of the particles is randomly assigned within a certain range, and the optimal solution found after several iterations may not be the global optimal solution. Therefore, the position of the particles should be mutated with a certain probability, which can increase the diversity of the particles and find the optimal solution in a new area. After repeating the above-mentioned operations several times, the global optimal solution can be found.

The framework and diagnostic process of DTLN

Figure 3 shows the framework of DTLN. It can be seen that the overall framework of DTLN is divided into two parts: (1) when the training and test data are from the same working condition, perform ①–② (purple circles marked in Figure 3) and (2) when the test data (target domain) are different from the training data (source domain), perform ①–③–④–⑤.

The detailed diagnostic process of the DTLN is defined as follows:

Step 1: Select the data of one working condition from all the collected data. Then, cut the raw data into n segmentations with the same number of points. Finally, divide all segmentations into two groups: 80% for training and the remaining 20% for testing.
Step 2: Set the structure of the IDNN, set the minimum training error and the maximum training epoch, and use the PSO algorithm to generate the initial weights and biases of the IDNN.
Figure 3. The framework of the deep transfer learning network.

Step 3: Randomly select a batch of segmentations as the inputs of the IDNN.
Step 4: Get the actual output through the IDNN, and use the cost function corrected by L2 regularization to calculate the error between the actual output and the ideal output.
Step 5: Compute the gradients of the weights and biases in each layer with the back propagation algorithm, and update the weights and biases with the learning rate.
Step 6: Change to another batch of segmentations and repeat Steps 3–5 until all the training data are used up.
Step 7: Repeat Steps 3–6 until the training epochs reach the maximum epoch or the output error reaches the minimum set value.
Step 8: Test the trained network with the testing data. When the test working condition is the same as the working condition selected in Step 1, the trained network will be used directly for fault diagnosis. Otherwise, Steps 9–10 will be performed.
Step 9: Transfer the parameters of the trained IDNN to the new diagnostic model.
Step 10: Fine-tune the new diagnostic network with a small amount of data from the target domain (fine-tuning with 1% of all data yields a significant improvement). Finally, the fine-tuned model is used to diagnose the fault.

Experiment setup and data segmentation

Experiment setup

The experimental test rig and the gear pitting types are shown in Figure 4. The gearbox is driven by two 45 kW Siemens servo motors: motor 1 is the drive motor and motor 2 is the load motor. The gearbox contains a pair of spur gears. The driving gear connected to motor 1 has 40 teeth, the driven gear connected to motor 2 has 72 teeth, and the gear module is 3 mm. The gearbox is also equipped with a lubrication and cooling system, and the vibration sensor is mounted on the bearing housing of the driven gear.

Table 1 describes the gear pitting conditions in Figure 4. Six different early pitting faults were produced manually with a drill on the driven gear, and the degree of gear pitting gradually increases, as shown in Table 1. This setting of gear pitting faults simulates the growth of gear pitting from small to large and also allows analysis of the relationship between pitting type and fault diagnostic accuracy.

This article proposes to establish a gear pitting diagnosis model suitable for various working conditions, so vibration data under various working conditions are collected to construct and test the model. In the experiment, vibration signals under five speed conditions and five torque conditions are collected, a total of 25 working conditions, as shown in Table 2.
Figure 4. (a) Experimental test rig and (b) gear pitting type.

Note that the circles in Table 2 represent the six conditions used in the mixed working condition diagnosis in the 'Diagnosis results of IDNN under multiple working conditions' and 'Diagnostic results with DTLN' sections.

Data segmentation

The tri-axial accelerometer was mounted on the bearing housing of the driven gear and collected vibration signals in all three directions, with a sampling rate of 10,240 Hz. In this article, the vibration signals of seven kinds of gears under 25 working conditions are collected. Comparing the vibration signals of the three directions, the amplitude of the Z-axis is the largest; therefore, the Z-axis vibration signal is used in the diagnosis of gear pitting faults. The Z-axis vibration signal under the 100 r/min-100 Nm working condition is shown in Figure 5(a).

Figure 5. The vibration signal of 100 r/min-100 Nm: (a) one second of signal and (b) one segmentation of signal.

We collected vibration signals five times for each gear fault type (C1–C7), so there are 35 files in each working condition and 60,000 data points per file. The number of data points in each file is too large to be used directly as input to the DNN, so we cut the raw signal into suitable segmentations. The advantage of data segmentation is that the number of neurons in the input layer is reduced, which in turn reduces the complexity of the DNN structure and makes the network fit more quickly. Moreover, the training sample size and sample diversity are increased, and the diagnostic accuracy of the network is improved.

The sampling rate is 10,240 Hz and the maximum rotation speed is 500 r/min, so approximately 1200 data points per gear rotation can be computed. We put 300 data points (a quarter of the data collected per gear rotation) in each segmentation.46 Each file is thus divided into 200 segmentations, giving a total of 7000 segmentations. About 80% of all data are used for training and the rest is used for testing.
Table 1. Driven gear pitting type.

Label   72nd tooth      First tooth     Second tooth
C1      Healthy         Healthy         Healthy
C2      Healthy         10% in middle   Healthy
C3      Healthy         30% in middle   Healthy
C4      Healthy         50% in middle   Healthy
C5      10% in middle   50% in middle   Healthy
C6      10% in middle   50% in middle   10% in middle
C7      30% in middle   50% in middle   10% in middle
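The seven labels C1–C7 in Table 1 correspond to the seven neurons of the network's output layer; a small sketch of the assumed one-hot target encoding used for such a classifier:

```python
import numpy as np

# The seven gear conditions (C1 healthy through C7 most pitted) map to the
# seven output neurons; each training target is a one-hot vector.
LABELS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]

def one_hot(label):
    vec = np.zeros(len(LABELS))
    vec[LABELS.index(label)] = 1.0
    return vec
```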

Table 2. Experimental working conditions.

Torque (Nm)   Speed 100 r/min   200 r/min   300 r/min   400 r/min
100           ●                 ○           ●           ○
200           ○                 ○           ○           ○
300           ●                 ○           ○           ○
400           ○                 ○           ○           ○
500           ●                 ○           ○           ○

The diagnostic model training matrix dimension for each working condition is 300 × 5600, and the testing matrix dimension for each working condition is 300 × 1400.

Results and discussions

Diagnosis results of IDNN under working condition 100 r/min-100 Nm

First, we decide the structure of the IDNN: the number of neurons in the input layer is equal to the number of data points in a segmentation (300 neurons), there are seven neurons in the output layer (corresponding to the seven gear types), and there are three hidden layers (with 300, 200, and 100 neurons). The minimum training error is set to 0.01 and the maximum number of training epochs is set to 150. All samples are randomly divided into batches, each batch is trained in turn, and one training epoch is completed when all batches have been trained.

Figure 6(a) shows the effect of PSO on training. By comparison, it is found that after PSO optimization the initial error is reduced from 25 to 2, and the number of training epochs is also greatly reduced, which means that PSO optimization can shorten the training process and make it more stable. Table 3 shows the effect of the PSO algorithm on training time and training accuracy. The term NAN in the table indicates that the network does not converge. The PSO algorithm allows the network to start with good initial parameters; in this case, it is possible to choose a larger learning rate and speed up the network convergence.

Figure 6(b) shows the influence of the magnitude of the L2 coefficient λ on the diagnostic accuracy. It can be seen from the figure that when λ is equal to 0, that is, when there is no L2 optimization, the accuracy is about 0.9. As the value of λ increases, the accuracy shows an upward trend and reaches its maximum value of 0.96386 when λ is equal to 0.35. As the L2 coefficient λ continues to increase, the fluctuation of the accuracy becomes larger, that is, the stability of the diagnostic model decreases.

The confusion matrixes of the standard DNN (SDNN) method and the IDNN are shown in Figure 7. The activation function ReLU is used in the standard DNN. It can be seen that the improved method has better diagnostic accuracy. The misdiagnoses of the two methods are consistent (case 1: C2 misjudged as C4; case 2: C2 misjudged as C6; case 3: C5 misjudged as C6; case 4: C6 misjudged as C4). The initial judgment is that the misdiagnoses are due to occasional single-tooth engagement of the gearbox resulting in a change in the type of fault.

The diagnostic accuracy of four methods for diagnosing gear pitting faults under 100 r/min-100 Nm is shown in Table 4. When the SVM and ANN methods were used, 12 statistical characteristics (mean, root mean square (RMS), variance, etc.) were extracted from the time domain and the frequency domain. In contrast, the standard DNN method and the proposed method used the raw vibration signal as the input.

The fault type of the gear is the type corresponding to the neuron with the maximum value.

Table 3. The effect of PSO on the training time and diagnostic accuracy.

              Learning rate   PSO time   Network computing time                   Total      Accuracy
With PSO      0.1             20.817 s   39.61 s                                  60.42 s    0.9364
Without PSO   0.1             –          NAN                                      NAN        0.1429
Without PSO   0.05            –          136.644 s (stopped at max epochs, 250)   136.64 s   0.8364
PSO: particle swarm optimization.
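The class decision described above (the fault type is the output neuron with the maximum value) can be sketched as follows; the gear-type names and output vectors are illustrative.

```python
import numpy as np

# The predicted gear type is the output neuron with the maximum value; the
# magnitude of that maximum (e.g. 0.5 vs 0.99) changes only the confidence,
# not the decision.
GEAR_TYPES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]

def predict(output_vector):
    idx = int(np.argmax(output_vector))
    return GEAR_TYPES[idx], float(output_vector[idx])

label_a, conf_a = predict(np.array([0.05, 0.05, 0.5, 0.1, 0.1, 0.1, 0.1]))
label_b, conf_b = predict(np.array([0.0, 0.0, 0.99, 0.0, 0.0, 0.01, 0.0]))
# Both vectors yield the same predicted type with very different confidence,
# which is why accuracy alone cannot fully represent diagnostic ability.
```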


Figure 6. Training error curves of the hybrid model: (a) influence of PSO and (b) influence of the L2 coefficient λ.

Figure 7. Confusion matrixes: (a) SDNN and (b) IDNN.

The diagnostic accuracy does not differ whether the maximum output of the winning neuron is 0.5 or 0.99; therefore, the diagnostic accuracy cannot fully represent the diagnostic ability of the network. We perform principal component analysis (PCA) on the output matrix of the network to further analyze the diagnostic ability of the three methods, and then use the first two principal components (PCs) of the PCA results to form a scatter plot, as shown in Figure 8. The diagnostic accuracies of Figure 8(a) and (b) are similar, but from the PCA results we can see that the diagnostic ability of the SDNN method is significantly better than that of the ANN method. Compared with the SDNN method, the diagnostic ability of the IDNN is also improved significantly.

The parameter settings during training also affect the diagnostic accuracy. Figure 9 shows the effect of the learning rate and the batch size (the number of samples in each batch). Figure 9(a) and (b) shows that as the learning rate and the batch size increase, the accuracy decreases.

Diagnosis results of IDNN under multiple working conditions

The 'Diagnosis results of IDNN under working condition 100 r/min-100 Nm' section showed the results of applying the IDNN to diagnose gear faults under the 100 r/min-100 Nm working condition. This section applies a variety of working conditions to verify the adaptability of the IDNN for diagnosing multiple working conditions. Figure 10(a), (c), and (e) shows the diagnostic accuracy of the three methods (SVM, ANN, and IDNN) under the 25 working conditions (as shown in Table 2). It can be found from Figure 10(e) that the IDNN method has high accuracy under each working condition, but it is necessary to retrain the network when the working conditions change. Figure 10(b), (d), and (f) shows the cross-diagnosis accuracy of six working conditions (labeled as circles in Table 2) without retraining the network. It can be seen from Figure 10(f) that the diagnostic
Li et al. 177

Table 4. Diagnostic accuracy of four methods.

Method   C1      C2      C3      C4      C5      C6      C7      Average
SVM      0.995   0.865   0.975   0.755   0.665   0.765   0.995   0.8594
ANN      0.995   0.895   0.995   0.765   0.845   0.99    1       0.9264
SDNN     1       0.985   1       0.735   0.785   0.945   1       0.9214
IDNN     1       0.955   1       0.89    0.99    0.99    1       0.975

SVM: support vector machine; ANN: artificial neural network; SDNN: standard deep neural network; IDNN: improved deep neural network.
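The entries in Table 4 are per-fault-type classification rates obtained with the argmax decision rule described in the text (the fault type is the output neuron with the maximum value). A sketch of that bookkeeping, with one-hot toy data standing in for real network outputs:

```python
import numpy as np

def per_class_accuracy(outputs, labels, n_classes=7):
    """Argmax decision rule plus per-class accuracy (C1..C7) and its average."""
    preds = outputs.argmax(axis=1)   # fault type = neuron with maximum value
    acc = np.array([(preds[labels == c] == c).mean() for c in range(n_classes)])
    return acc, acc.mean()

# Toy check with perfectly separable outputs (argmax recovers every label).
labels = np.repeat(np.arange(7), 10)      # 7 gear types, 10 samples each
outputs = np.eye(7)[labels]               # one-hot "network outputs"
acc, avg = per_class_accuracy(outputs, labels)
```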

Figure 8. The PCA result of three kinds of network outputs: (a) ANN, (b) SDNN, and (c) improved DNN.
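The PCA projection plotted in Figure 8 reduces each sample's 7-dimensional network output to its first two principal components. A self-contained numpy sketch of that step, with a random matrix standing in for the real softmax outputs:

```python
import numpy as np

def first_two_pcs(outputs):
    """Project an (n_samples, n_classes) network output matrix
    onto its first two principal components."""
    centered = outputs - outputs.mean(axis=0)
    # SVD of the centered matrix yields the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T   # (n_samples, 2) scatter-plot coordinates

rng = np.random.default_rng(0)
fake_outputs = rng.random((200, 7))   # 7 gear classes, as in the paper
pcs = first_two_pcs(fake_outputs)
```

Well-separated class clusters in this 2-D projection indicate confident, distinct outputs even when the raw accuracy numbers are similar, which is the point made about Figure 8(a) and (b).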

Figure 9. Influence of the network parameter on diagnostic accuracy: (a) learning rate and (b) batch size.

accuracy is good only when the training and testing data come from the same working condition. In other words, a trained IDNN developed under one working condition is applicable only to that working condition and cannot be used under other working conditions.

Diagnostic results with DTLN

As can be seen from Figure 10, the IDNN has good diagnostic accuracy under each working condition. However, an IDNN well trained under one working condition can only diagnose data from that condition. To improve the working-condition adaptability of the diagnostic model, this article proposes a DTLN based on TL. This section applies six working conditions (labeled as circles in Table 2) to test the adaptability of the DTLN. The six working conditions are as follows: A: 100 r/min-100 Nm, B: 100 r/min-300 Nm, C: 100 r/min-500 Nm, D: 300 r/min-100 Nm, E: 500 r/min-100 Nm, and F: 500 r/min-500 Nm.
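The DTLN's central move is to reuse a network pre-trained on the source condition and retrain only part of it on a handful of target-domain samples. A toy numpy sketch of that idea follows; the frozen random "pre-trained" layer, the layer split, and the learning rate are illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a network pre-trained on the source condition: a frozen
# hidden layer followed by a trainable 7-class output layer.
W_hidden = rng.normal(size=(20, 16))   # frozen during fine-tuning
W_out = np.zeros((16, 7))              # only this layer is updated

def hidden(x):
    return np.tanh(x @ W_hidden)

def fine_tune(W_out, x_t, y_t, lr=0.3, steps=800):
    """Softmax-regression update of the output layer only, using a few
    target-domain samples; the pre-trained feature layer stays fixed."""
    h = hidden(x_t)
    onehot = np.eye(7)[y_t]
    for _ in range(steps):
        logits = h @ W_out
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
        W_out = W_out - lr * h.T @ (p - onehot) / len(x_t)
    return W_out

# A tiny "target domain": 70 samples, i.e. 1% of the 7000 samples per
# condition quoted in the paper. Labels are made linearly recoverable
# from the frozen features so the toy problem is actually learnable.
x_t = rng.normal(size=(70, 20))
y_t = (hidden(x_t) @ rng.normal(size=(16, 7))).argmax(axis=1)
W_out = fine_tune(W_out, x_t, y_t)
acc = float(((hidden(x_t) @ W_out).argmax(axis=1) == y_t).mean())
```

Only `W_out` changes during fine-tuning, mirroring the idea that a few dozen target samples suffice to adapt the final layer while the pre-trained feature extractor is kept.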

Figure 10. Diagnostic accuracy of different methods: (a), (b) SVM method; (c), (d) ANN method; (e), (f) IDNN method. The
training data and test data used in (a), (c), and (e) are from the same working condition; the training data and test data used in (b),
(d), and (f) are from different working conditions.
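The cross-diagnosis grids in Figure 10(b), (d), and (f) amount to a train-on-one-condition, test-on-another loop. The following self-contained toy (1-D synthetic features and a nearest-centroid classifier, not the paper's IDNN) reproduces the qualitative pattern of a strong diagonal and weak off-diagonal under a condition-dependent shift:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_condition(shift, n_per_class=50, n_classes=7, noise=0.3):
    """Synthetic 1-D features: class k centered at 2*k, plus a
    condition-dependent shift standing in for a speed/load change."""
    X = np.concatenate([2 * k + shift + noise * rng.standard_normal(n_per_class)
                        for k in range(n_classes)])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

def nearest_centroid_acc(X_train, y_train, X_test, y_test):
    centroids = np.array([X_train[y_train == k].mean() for k in range(7)])
    preds = np.abs(X_test[:, None] - centroids[None, :]).argmin(axis=1)
    return float((preds == y_test).mean())

conditions = {"A": 0.0, "B": 2.0, "C": 4.0}       # condition-specific shifts
data = {c: make_condition(s) for c, s in conditions.items()}

# Cross-diagnosis matrix: train on the row condition, test on the column one.
acc = {(tr, te): nearest_centroid_acc(*data[tr], *data[te])
       for tr in conditions for te in conditions}
```

Same-condition entries stay near 1 while shifted-condition entries collapse, which is exactly the failure mode that motivates the DTLN.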

Figure 11 shows how the diagnostic accuracy of both the DTLN and the IDNN changes as the training sample size increases. The horizontal axis is the target-domain sample size used to fine-tune the pre-trained network. The data from working conditions A and B were used as training data for the results in Figure 11(a) and (b), respectively. The four curves in Figure 11(a) correspond to four cases. (1) Case 1 (A-A with IDNN, all samples used): trains the network with 80% of the data from working condition A and tests it with the remaining data from working condition A. (2) Case 2 (A-B with DTLN, with varying training sample size): uses the DTLN to diagnose faults, with the data from working condition A as the source domain Ds and the data from working condition B as the target domain Dt. As discussed in the ''Data segmentation'' section, the number of samples in each working condition was 7000. Setting the percentage of the data used for fine-tuning from 0.1% to 2%, the fine-tuning sample size was changed from seven (7000 × 0.1%) to 140

Figure 11. The accuracy changes corresponding to the changes in training sample size for DTLN and IDNN: (a) source domain:
working condition A, target domain: working condition B; (b) source domain: working condition B, target domain: working condition
A.
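The fine-tuning sample sizes on Figure 11's horizontal axis follow directly from the 7000-sample pool per working condition; only the 0.1% and 2% endpoints are stated in the text, so the intermediate percentages below are added purely for illustration:

```python
# Fine-tuning sample sizes implied by the 7000-sample pool per condition.
pool = 7000
sizes = [round(pool * pct / 100) for pct in (0.1, 0.5, 1.0, 2.0)]
# endpoints: 7000 * 0.1% = 7 samples, 7000 * 2% = 140 samples
```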

(7000 × 2%). (3) Case 3 (A-B with IDNN, all samples used): trains the network with 80% of the data from working condition A and then tests the trained network with 20% of the data from working condition B. (4) Case 4 (A-A with IDNN, accuracy fluctuating with training sample size): trains the network with sample sizes from seven to 140 in working condition A and then tests the trained network with 20% of the data from working condition A. In Figure 11(b), the data from working condition B were used as the source domain and the data from working condition A as the target domain. As can be seen from the figure, the DTLN can achieve high diagnostic accuracy with only 1% of the data used for fine-tuning.

Figure 12 compares the diagnostic accuracy of the DTLN and the IDNN with different target domains. Taking Figure 12(a) as an example, the data from working condition A were used as the source domain to train the model, and the data from the other five working conditions (B to F) were used as the target domains to test it. When using the DTLN method, 5% of the target-domain data were used to fine-tune the pre-trained model. Comparing the diagnostic accuracy of the two methods shows that the DTLN is significantly more adaptable to different working conditions than the IDNN. The DTLN method not only improves the diagnostic accuracy under multiple working conditions but also requires fewer training samples and less training time. The IDNN required 67 s to train the model with 80% of the data from working condition A. When the data from working condition B were used as the target domain, it took 72 s to develop the model with 80% of the data from working condition B. However, using the DTLN method to fine-tune the model required only 6 s, which reduced the training time by more than a factor of 10. In summary, the DTLN not only makes the model adaptable to multiple working conditions but also saves training time and samples.

Conclusions

In this article, a domain adaptation model for early gear pitting fault diagnosis based on deep TL was presented. By combining an IDNN with TL, a DTLN was developed to give the diagnostic model good diagnostic accuracy under multiple working conditions. The vibration signals of seven types of gears with early pitting faults under 25 working conditions, collected from a gear test rig, were used to validate the DTLN. Based on the validation results, the following conclusions can be drawn:

1. Using PSO to initialize the model parameters speeds up the training process. L2 regularization improves the diagnostic ability of the model through weight decay during training.
2. The IDNN has high diagnostic accuracy when the target domain (testing data) and the source domain (training data) are from the same working condition, and the maximum accuracy can reach 99.93%. However, a diagnostic model developed with the IDNN is suitable only for fault diagnosis under that same working condition.
3. The DTLN overcomes this shortcoming of the IDNN and greatly improves the adaptability of the diagnostic model to multiple working conditions. Moreover, fine-tuning the pre-trained model requires only a small number of target samples and little training time.

Figure 12. Comparison of diagnostic accuracy between DTLN and IDNN: (a) source domain: working condition A, (b) source
domain: working condition B, (c) source domain: working condition C, (d) source domain: working condition D, (e) source domain:
working condition E, and (f) source domain: working condition F.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the National Natural Science Foundation of China (No. 51675089 and No. 51505353).

ORCID iDs

Jialin Li https://orcid.org/0000-0002-9940-179X
Xueyi Li https://orcid.org/0000-0002-1751-2809
David He https://orcid.org/0000-0002-5703-6616

Yongzhi Qu https://orcid.org/0000-0002-5314-023X
