Nothing Special   »   [go: up one dir, main page]

A Survey On Deep Learning For Data-Driven Soft Sensors

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO.

9, SEPTEMBER 2021 5853

A Survey on Deep Learning for Data-Driven


Soft Sensors
Qingqiang Sun and Zhiqiang Ge , Senior Member, IEEE

Abstract—Soft sensors are widely constructed in pro- variables as input and hard-to-measured variables as output,
cess industry to realize process monitoring, quality predic- has been developed to estimate or predict important variables
tion, and many other important applications. With the de- expediently during the past decades [4].
velopment of hardware and software, industrial processes
have embraced new characteristics, which lead to the poor There are three main types of approaches to establish soft
performance of traditional soft sensor modeling methods. sensing models, namely mechanism-based, knowledge-based,
Deep learning, as a kind of data-driven approach, shows and data-driven methods [5]. The first two kinds of approaches
its great potential in many fields, as well as in soft sens- can work well if detailed and accurate mechanism of process is
ing scenarios. After a period of development, especially in known or a wealth of experience and knowledge about process
the last five years, many new issues have emerged that
need to be investigated. Therefore, in this article, the ne- is available. However, the increasing complexity of industrial
cessity and significance of deep learning for soft sensor process makes these preconditions difficult to meet. As a result,
applications are demonstrated first by analyzing the merits data-driven modeling has become the mainstream soft sensing
of deep learning and the trends of industrial processes. modeling methods [6], [7].
Next, mainstream deep learning models, tricks, and frame- Conventional data-driven soft sensor modeling methods
works/toolkits are summarized and discussed to help de-
signers propel the developing progress of soft sensors. mainly include a wide variety of statistical inference techniques
Then, existing works are reviewed and analyzed to discuss and machine learning techniques, such as principal component
the demands and problems occurred in practical applica- regression, which combines principal component analysis with a
tions. Finally, outlook and conclusions are given. regression model, partial least squares (PLS) regression, support
Index Terms—Data-driven modeling, deep learning (DL), vector machine (SVM), and artificial neural network (ANN)
industrial big data, neural networks (NNs), soft sensor. [8]–[12]. In last two decades, with technical breakthroughs on
some key issues, networks with enough number of hidden layers
I. INTRODUCTION or with complex enough structures are available, which are
known as deep learning (DL) techniques [13], [14]. Due to DL
OWADAYS, the process industry is becoming more and
N more complicated, due to the development of information
technologies and the increase of customer demands. As a result,
techniques, computational models that are composed of multiple
processing layers are allowed to learn representations of data
with multiple levels of abstraction. These methods have dramat-
the cost and difficulty of direct measurement and analysis of
ically improved the state of the art in speech recognition, object
key quality variables are increasing [1]–[3]. However, in order
detection, and many other domains, such as drug discovery and
to monitor the operation status of systems, realize the smooth
genomics [15].
control of processes and improve the quality of products, those
In recent years, there has been a proliferation of research
key variables or quality indices have to be obtained as fast and
that applies DL approaches to soft sensors. From conventional
accurately as possible. Therefore, soft sensing technique, which
artificial intelligence field to soft sensing field, many differences
is a kind of mathematical model with easy-to-measured auxiliary
exist objectively. There are many questions need to be investi-
gated and discussed (including but not limited to the following
Manuscript received September 26, 2020; revised December 2, 2020,
December 20, 2020, and January 10, 2021; accepted January 17, 2021. issues): Is it necessary and suitable to use DL techniques in
Date of publication January 20, 2021; date of current version June 16, soft sensing scenario? What DL models can be utilized for
2021. This work was supported in part by the National Key Research practical application? How to apply them to solving problems
and Development Program of China under Grant 2018YFC0808600,
in part by the National Natural Science Foundation of China under in real processes? What are the potential research points for the
Grant 61722310, in part by the Natural Science Foundation of Zhejiang future? Therefore, the motivation of this article is to answer these
Province under Grant LR18F030001, and in part by the Open Research questions as reasonably as possible.
Project of the State Key Laboratory of Industrial Control Technology,
Zhejiang University (ICT20098). Paper no. TII-20-4482. (Corresponding The rest of this article is organized as follows. Section II dis-
author: Zhiqiang Ge.) cusses the distinct merit of DL and demonstrates its necessity for
The authors are with the State Key Laboratory of Industrial Control soft sensor modeling. Section III provides an overview of several
Technology, Institute of Industrial Process Control, College of Control
Science and Engineering, Zhejiang University, Hangzhou 310027, China typical DL models and core training techniques. Then, the state
(e-mail: sunqingqiang@zju.edu.cn; gezhiqiang@zju.edu.cn). of the art of soft sensor applications using DL approaches is
Color versions of one or more figures in this article are available at investigated in Section IV. Discussions and outlook are given in
https://doi.org/10.1109/TII.2021.3053128.
Digital Object Identifier 10.1109/TII.2021.3053128 Section V. Finally, Section VI concludes this article.

1551-3203 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5854 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

Fig. 2. Comparison of four kinds of theories.


Fig. 1. Structure of a network with single hidden layer.

II. SIGNIFICANCE OF DL FOR SOFT SENSOR According to universal approximation theory, if there are
enough nodes in the hidden layer, the function represented by the
Detailed review about conventional methods can be found in
network shown in Fig. 1 can approximate any continuous func-
existing work, such as [7], [16], etc. Although those methods
tion [17]–[19]. Furthermore, using multiple layers of neurons to
already have many applications, they may suffer from some
represent some functions is much simpler.
drawbacks, such as heavy workload brought by handcrafted
Since Hinton et al. proposed a faster learning algorithm,
feature engineering or inefficiency when dealing with large
which was applied to deep belief network (DBN), the maximum
amount of data, etc. To demonstrate the significance of DL for
depth of network can be tens of layers [13]. Later on, He et
soft sensor modeling, the distinct merits of DL and the trends or
al. [20] proposed the deep residual network, which solved the
characteristics of industrial processes should be discussed.
performance degradation problem caused by increasing network
depth. From then on, the depth of neural network (NN) can reach
A. Merits of DL Techniques a level of hundreds of layers. However, “deep” in DL theory
To begin with, the structure of a simple network with single is not absolutely defined. In speech recognition domain, four
hidden layer is shown in Fig. 1. There are three layers, namely layers of network can be considered as “deep,” whereas in image
an input layer, a hidden layer, and an output layer. Input layer recognition, networks with more than 20 layers are common.
contains variables x1 , · · · , xm and a constant node “1.” The DL has its own advantage compared with conventional soft
hidden layer has many nodes, and each node has an activation sensor modeling methods. Here, we classify them into four
function ϕ. The feature in each node is extracted through affine categories at a greater granularity: rule-based system, classical
transformation and activation function transformation from orig- machine learning, shallow representation learning, and deep
inal input layer, which are defined as the following formula: learning. The differences between them are shown in Fig. 2,
in which the green blocks indicate components that are able to
Hi = ϕ (Mi (x1 , · · · , xm )) learn information from data [21].
 m   Rule-based system, also known as production system or ex-

=ϕ wik ∗ xk + bi .
0 0
(1) pert system, is the simplest form of artificial intelligence. Rules
k=1
are coded into the programs as the representation of knowledge,
which tell the system what to do or what to conclude in different
Then, the final output is the combination of those composite situations [22]–[24]. In this way, the performance of rule-based
functions system depends almost entirely on expert knowledge, which is
n
hard to obtain and hard to update, especially in complicated
 cases. A rule-based system could be considered as having “fixed”
y (x) = wk1 Hk (x). (2)
k=1
intelligence, in contrast, a machine learning system is more
adaptive and closer to human intelligence. Instead of outputting
The weight and bias parameters (wij 0
, b0i ) need to be learned a result directly from a fixed set of rules wrote by human,
by minimizing the loss function, which is defined according to classical machine learning first extracts features from raw in-
specific task and target. This process is called as “training” or put data and then the final output is obtained by mapping the
“learning.” features. However, the forms of features are still handcrafted

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
SUN AND GE: SURVEY ON DEEP LEARNING FOR DATA-DRIVEN SOFT SENSORS 5855

defined. In this case, the performance of the model depends


mainly on the skill of feature engineering and other algorithm
details, so it is quite possible that traditional algorithms could do
better. By the way, even if only small amount of data is available,
the transferable character of DL algorithms can also ensure the
performance of modeling since the underlying networks are
relatively general as long as the data distribution is as consistent
as possible [29], [30]. In contrast, in big data regions where there
are very large training sets, it can be more consistently seen that
large NN dominates the other approaches. Thus, the relatively
more reliable way to improve the performance of an algorithm
Fig. 3. Scale drives algorithm performance. today is to train a bigger network and get more data.
In conclusion, the merits of DL techniques compared with tra-
ditional algorithms mainly lie in learning representation without
the requirement of knowledge or experience and taking full ad-
based on knowledge and experience, which is called as feature vantage of huge amount of data for performance improvements.
engineering [25], [26]. In order to extract features that better
represent the underlying problem, the process of feature en-
gineering is usually complicated, including feature selection, B. Trends of Industrial Process
feature construction, and feature extraction. Because the upper The industrial processes are more and more complicated and
bound of the performance of conventional machine learning ever changing. The ever-increasing demands for profits and
is mainly determined by data and features, the effect of those environmental factors have added the complexity of industrial
approaches relies heavily on the ability of the engineer to extract processes. For example, the demands of different product grades
good features. Therefore, representation learning approaches lead to many chemical processes working with multiple condi-
were proposed so as to automatically learn the implicit useful tions [31], [32]. Besides, the complicated process mechanisms
representations or features from raw data [27]. In this way, data also increase the difficulty of process modeling, such as peni-
representation is often trained in conjunction with subsequent cillin fermentation process, in which the microorganisms have
predictive tasks. Representation learning does not rely on expert to experience multiple growth phases [33]. Due to such causes,
experience, but requires a large training dataset. Compared with process industry may possess many characteristics, such as non-
shallow representation learning, DL is a kind of deep represen- linear, multimodal properties, etc. Therefore, it is increasingly
tation learning, which tries to learn more hierarchical and more difficult to construct monitoring or predictive models for those
abstract representations using deep networks. As an end-to-end complex processes. In addition, changes in process character-
approach, what DL needs is enough and quality data, rather than istics or operating conditions are almost ubiquitous [34]. In
complicated feature engineering. chemical processes, for instance, equipment characteristics are
However, is DL always better than conventional machine changed due to catalyst deactivation, scale adhesion, preventive
learning or is deep representation learning always better than maintenance, and others. The changes of loads and feedstocks
shallower one? The key factor is the amount of data that are also result in process variations and deteriorate the performance
available for modeling, especially labeled ones [28]. Visually, of process modeling, such as in the pharmaceutical industry
the performance of algorithm is plotted as the function of the [35]. Therefore, soft sensors have to be updated as the process
amount of data used for a task in Fig. 3. characteristics change, but manual and frequent construction of
Improvements in data availability and computational scale them should be avoided due to their heavy workload, especially
have been two of the biggest drivers of recent progress in in feature engineering. This trend and corresponding issues are
machine learning, which means large enough training sets are shown in the left part of Fig. 4.
available and large enough NNs are trainable. As for traditional Looking at the development of the process industry in recent
learning algorithms, such as SVM or logistic regression, the per- years, industrial big data is another trend that cannot be ignored
formance improves for a while as more data are added. However, [36], [37]. More and more process monitoring sensors are in-
even as more data are accumulated after that, usually the perfor- stalled to measure real time process status (e.g., temperature,
mance of those algorithms goes into plateaus. This means their flow rate, pressure, etc.) and a lot of data storage devices (e.g.,
learning curves flattens out and the algorithms stop improving distributed control system) are utilized in plants and factories
even as more data are given since they do not know what to do [38]. All of these developments make it possible to obtain large
with huge amounts of data. Nevertheless, if a small NN, which amount of data for process modeling. At the same time, the data
contains only a small number of hidden units/layers/parameters, form also evolves a lot [39]. For instance, from univariate to
is trained on the same supervised learning task, slightly better multivariate, to high-dimensional [40]–[42]; from homogeneous
performance might be attainable. Analogously, if larger and data to heterogeneous datasets [43], [44]; from static to dynamic
larger NNs are trained, even better performance can be obtained. [45], [46]. Therefore, enough and various data are available,
Besides, it is notable that in the regime of small training sets, which need to be utilized efficiently to train monitoring or
the relative ordering of the algorithms is actually not very well predictive models. This trend is shown in the right part of Fig. 4.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5856 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

Fig. 5. Learning strategy of SAE.

x̃ = decode (h) = fd (W d  h + bd ) (4)

where x is the original input vector, h is the feature vector


after encoding, x̃ is the vector of reconstructed input, {W e , be }
and {W d , bd } are weights and biases of encoder and decoder,
respectively, and fe () and fd () are corresponding nonlinear
activation function, such as sigmoid, Tanh, ReLU, etc.
Besides, AEs can be stacked to construct deeper network,
namely stacked AE (SAE). The learning strategy of SAE is
represented as Fig. 5. The whole process is actually a pro-
Fig. 4. DL matches the trends of industrial processes. cess of unsupervised layerwise training. SAE possesses more
encoding layers so that it can extract more abstract represen-
tations. Besides, AE has many extensions, such as denoising
In a nutshell, based on sufficient literature research and to our AE (DAE) [49], sparse AE [50], [51], contractive AE, and
best knowledge, two main trends in the development of industrial etc. [52].
processes are concluded: they are more and more complicated
and ever changing and a huge amount of process data are gener- B. Restricted Boltzmann Machine
ated and stored. Under such a circumstance, the characteristics RBM is an undirected probability graph model with one visi-
of DL technique, discussed in Section II, exactly match these ble layer and one hidden layer. There is no connection between
two trends well. First of all, DL can avoid complicated feature neurons in the same layer, which is the meaning of “restricted.”
engineering and learn abstract representation automatically (see The goal of RBM is to make the output of the visible layer as
Fig. 2). Second, DL can make full use of large amounts of close to the original input as possible so that the hidden layers
data to effectively improve modeling performance (see Fig. 3). are regarded as different representations of the visible layer.
These are why DL techniques are of great significance and The joint probability distribution and conditional distribution are
are going to be more and more significant for soft sensor related to an energy function, and detail derivation process can be
applications. found in [21]. RBMs can be trained by approximate maximum
likelihood stochastic gradient descent, often involving a Monte
III. DL MODELS AND GENERAL TRICKS Carlo Markov chain to obtain those model samples. A much
In this section, typical models and general tricks in DL field more complete tutorial and other tips or tricks can be seen in
are reviewed and summarized, including autoencoder (AE), re- [53] and [54].
stricted Boltzmann machine (RBM), convolutional NN (CNN), RBM has various extensions, among which are DBN and deep
and recurrent NN (RNN). Boltzmann machine (DBM). DBN is a hybrid graphical model
involving both directed and undirected connections. Except that
the top two layers are undirected (pure RBM), the connections
A. Autoencoder
of all the other layers are directed (Bayesian network). DBN has
An AE is actually a system that attempts to reproduce its multiple hidden layers and hidden units in adjacent layers are
original input. To achieve this goal, AE must capture the most connected. All of the local conditional probability distributions
important information that can represent the input data [47], in DBN are copied directly from that in its constituent RBMs.
[48]. Therefore, the code dimension is constrained to be less DBN is layerwise pretrained by a fast and greedy algorithm and
than the input dimension, which is also called as undercomplete then is fine-tuned using contractive wake–sleep algorithm [13].
AE. While a DBM is an undirected graphical model with several
Technically, the full encoding and decoding process can be layers, and it is constructed to learn high-level representations
represented as the following formula: of the input [55]. Generally speaking, DBM is more robust than
DBN, but the cost is greater computational complexity since
h = encode(x) = fe (W e  x + be ) (3)
DBM needs to be jointly trained.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
SUN AND GE: SURVEY ON DEEP LEARNING FOR DATA-DRIVEN SOFT SENSORS 5857

Fig. 6. 2-D convolution case.

C. Convolution NN
CNN is a specialized kind of NN for processing data that have Fig. 7. Typical components of RNN (x is the input data in sequence
a gridlike topology, such as time series data (1-D grid taking form, h is the hidden layer, o is the output layer, y is the target label, and
L is the loss. U , V , and W are corresponding weight matrixes.).
samples at regular time intervals) and image data (2-D grid of
pixels). It is notable that the “convolution” here actually refers
to cross-correlation function, which is the same as convolution
but without flipping the kernel parameter space. CNN also has a lot of variants, such as AlexNet
[56], LeNet [58], VggNet [59], etc.

S(i, j) = (I ∗ K) (i, j) = I (i + m, j + n) K (m, n)
D. Recurrent NN
m n
(5) RNN is developed for processing sequential data. The basic
where I and K denote 2-D input and 2-D kernel function, architecture and loss computation graph of RNN are shown in
respectively, the symbol “∗” denotes convolution operation and Fig. 7. The left network can be unfolded over time sequence to
i and j are the indexes in these two dimensions. get the right form. Every time step has an input, a hidden unit, and
The detailed computation process of a 2-D convolution case an output. Besides, recurrent connections exist between hidden
can be seen in Fig. 6. From the example, the merits of such units.
convolution operation can be concluded, which are as follows. Given a specific status h(0) , RNN can propagate forward.
1) Sparse interactions: The size of kernel is much smaller Suppose the activation of hidden layer is tanh() and the output
than that of input so the interaction between the input and layer is fed into a softmax function to generate normalized
output is a kind of sparse connectivity, which saves a lot of probabilities ŷ, the corresponding layers from t = 1 to t = τ
time complexity compared with common fully connected can be updated according to the following formulas:
networks.
2) Parameters sharing: Different from the entries of weight a(t) = b + W h(t−1) + U x(t) (6)
matrix of traditional networks, which are used only one  
time when computing the output of a layer, every ele- h(t) = tanh a(t) (7)
ment of a kernel is used at every position of the input
so the storage requirements for parameters are reduced o(t) = c + V h(t) (8)
significantly.  
ŷ (t) = softmax o(t) (9)
3) Equivariant representations: Due to the characteristic of
parameter sharing, the result of convolution operation where b and c denote the bias vectors.
before which the input is shifted is the same as that of The total loss is just the sum of the losses over all the
shifting the output of convolution of the input. time steps. For example, if L(t) is computed as the negative
It is because of these three features that CNN is particularly log-likelihood of y (t) given x(1) , · · · , x(t) , then
suited to processing gridlike data [56].    
Generally, after the convolution, there is a pooling operation L x(1) , · · · , x(τ ) , y (1) , · · · , y (τ )
to further adjust the output. The pooling function uses the overall
    
statistical characteristics of the adjacent outputs at a certain
= L(t) = − log pmodel y (t) x(1) , · · · , x(t)
location to replace the network output at that location, and no t t
parameters need to be learned. For instance, the max-pooling op- (10)
eration uses the maximum output to represent the corresponding
rectangular region [57]. Other common pooling functions, such where pmodel (y (t) |{x(1) , · · · , x(t) }) is given by reading the en-
as the average of a rectangular neighborhood, the L2 norm of a try for y (t) from the model’s output vector ŷ (t) . The parameters
rectangular neighborhood, or a weighted average based on the are updated using backpropagation through time (BPTT) [21],
distance from the central pixel, are also widely used to compress [60].

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5858 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

TABLE I
SUMMARY OF FOUR MAIN TYPES OF DL MODELS

The basic problem of RNN is that gradients propagated over common regularization, the penalty term based on the activation
many stages tend to either vanish or explode, which is called as state of hidden units is added into the cost function. To obtain a
the challenge of long-term dependencies [61], [62]. Therefore, relatively smaller cost, the probability of neuronal activation
long short-term memory (LSTM) and other gated RNNs, such should be as small as possible [73]. Other approaches, such
as gated recurrent units (GRUs), are proposed, which use several as KL divergence penalties or imposing a hard constraint on
gate units to control the memory and forgetting behaviors of the activation values, are also applied [74], [75].
hidden state [63]–[66]. 5) Dropout: Dropout is a kind of ensemblelike strategy [76].
The summary of four main commonly used DL techniques is The basic principle is to remove the nonoutput units (e.g., mul-
listed in Table I. tiply the output by zero) from base network to form several sub-
networks. Every input unit and hidden unit is included according
to a sampling probability so that the randomness and diversity of
E. General Tricks for Developing DL Models submodels can be guaranteed. The ensemble weights are often
Although DL has huge potential, it could be very challenging obtained according to the probability p(y|x)of submodels [77].
to train deep models with satisfactory generalization perfor- Another significant advantage is that there are few restrictions
mance efficiently. The reasons mainly lie in the overfitting and on the applicable model or training process. However, it does
gradient vanish problems caused by deep structure. To overcome not work well if there is only a small amount of data [76].
or mitigate these issues, several tricks should be helpful when 6) Batch Normalization: Batch normalization is a method of
training deep models. adaptive reparameterization that aims to better train extremely
1) Regularization: Regularization is an effective tool to over- deep network [78]. When training, the parameters of hidden
come high-variance problem, namely overfitting. A direct way layers in deep networks will consistently change, which leads
is to regularize the cost function with a parameter norm penalty, to the internal covariate shift problem. Generally, the global
such as L2 regularization. When minimizing the cost function, distribution gradually approaches the upper and lower limits of
the parameters are also constrained to not be too large [67]. the value interval of the nonlinear function. Thus, the gradients
2) Dataset Augmentation: Getting more data for training are easy to vanish when conducting backpropagation. With
machine learning models is the best way to improve their gen- batch normalization, the mean and the variance of each unit
eralization performance. Although, it may be not easy to collect are standardized so as to stabilize learning, but the relationships
large amount of data from real scenarios, creating new fake data between units and the nonlinear statistics of a single unit are
is meaningful for some specific tasks, such as object recognition allowed to change.
[68] and speech recognition [69]. Introducing noise into the input
layer can also be regarded as a kind of data augmentation [70],
F. Frameworks for Developing DL Algorithms
[71].
3) Early Stopping: The cost of training process usually runs To better realize the development of DL algorithm, several
down first and then may increase when too much further learning open-source frameworks are available, which may consist of
is conducted, which denotes the occur of overfitting. To avoid state-of-the-art algorithms or well-designed underlying network
this problem, each time a better validation error is achieved, the elements, such as TensorFlow [79], Caffe [80], Theano [81],
parameter setting should be saved so that returning to the point CNTK [82], Keras [83], Pytorch [84], and etc. The comparison
with best performance after all training steps is realizable [72]. of these platforms is shown in Table II.
Therefore, the early stop strategy can prevent overlearning of
parameters.
IV. DL APPLICATIONS FOR SOFT SENSOR MODELING
4) Sparse Representations: Another kind of parameter
penalty is to constrain activation unit, which will indirectly A successful development of DL algorithms is actually a
impose a penalty on the complexity of parameters. Similar with highly iterative process, which can be summarized as Fig. 8.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
SUN AND GE: SURVEY ON DEEP LEARNING FOR DATA-DRIVEN SOFT SENSORS 5859

TABLE II latent variable regression model was developed using variational


COMPARISON OF MAINSTREAM PLATFORMS
AE (VAE) in [85]. A common way is to introduce the supervi-
sion from label variables into the procedures of encoding and
decoding. In [86], a variablewise weighted SAE was proposed
to introduce the linear Pearson coefficient between the inputs of
each hidden layer and quality labels when pretraining so as to ex-
tract features in a semisupervised way. Furthermore, techniques
based on nonlinear relationships, such as mutual information
[87], were adopted to better extract feature representations.
However, both linear and nonlinear relationships are artificially
specified and may be inadequate or unsuitable. Thus, a relatively
more intelligent and automatic way is to add the predictive
loss of quality labels into the pretraining cost [88]. Besides,
other strategies also can be adopted to build the connections
between hidden layers and label values. Sun and Ge used gated
units to measure the contribution of the features in different
hidden layers and better control the information flows between
hidden layers and the output layer [89]. Moreover, focusing on
semisupervised scenarios when there are only a small number
of labeled samples and an excess of unlabeled samples, a kind
of double ensemble learning approach was proposed that takes
both data diversity and structural diversity into account [90].
Missing data is one of the most commonly encountered prob-
lems while designing industrial soft sensors. As a variant of AE,
VAE performs well in learning data distribution and dealing
Fig. 8. Iterative process for developing DL algorithms. with missing data problem. For example, a generative model
named VA-WGAN was proposed based on VAE and Wasserstein
GAN, and it can generate the same distributions of real data from
For soft sensing applications, the first step is to find the de- industrial processes, which is hard to achieve by conventional
mands or problems existing in real industrial processes (such regression models [91]. In [92], VAE was employed to extract the
as semisupervised learning, dynamic modeling, missing data, distribution of each feature variable for a just-in-time modeling
etc.) and try to come up with a new idea worth trying. The next approach, and the effectiveness of it was verified through a
thing that needs to be done is to code it up with open-source numerical example and an industrial process. Moreover, the
frameworks or toolkits. After that, the data are collected and fed authors enriched the theory by proposing an output-relevant
into the program to obtain a result that tells the designer how VAE for just-in-time soft sensor application, which aims to deal
well this particular algorithm or configuration works. Based on with missing data [93]. Different with the former, two kinds
the outcome, the designer should refine the ideas and change of VAEs were used in a new soft sensor framework, which also
the strategies to find a better NN. Then, the process is repeated focuses on the missing data [94]. The first one named supervised
and the scheme is improved iteratively until the ideal effect is deep VAE was designed to obtain the distribution of latent
achieved. features, which was used as a prior of the second one known as
To help the readers know about state-of-the-art progress and the modified unsupervised deep VAE. Then, the framework was
better develop high-performance soft sensors, the soft sensing constructed by combining the encoder of the first one with the
applications based on DL techniques are reviewed here. The decoder of the second one, which works well under the missing
existing work is introduced and discussed, and the factors, data situation.
such as motivation, strategy, and effectiveness, are mainly high- In some cases, AEs could work better by combining it with
lighted. The following contents are expanded according to the other methods or improving its learning strategy. For example,
mainstream model to which each work belongs. Yao and Ge implemented a deep network of AEs for unsuper-
vised feature extraction and then utilized extreme learning ma-
chine (ELM) for regression task [95]. Wang and Liu [96] adopted
A. AE-Based Applications the limited-memory Broyden–Fletcher–Goldfarb–Shanno algo-
AE and its variants are widely used to construct soft sensors rithm to optimize the weights parameters learned by SAE,
for semisupervised learning and dealing with missing data in in- and then the features extracted were fed into support vector
dustrial processes. Also, excellent performance can be achieved regression (SVR) model for estimating the rotor deformation
by combining with traditional machine learning algorithms. of air preheaters. Instead of using pure data-driven models
Since AE is an unsupervised-learning model, it is often mod- (DDMS), Wang and Liu combined a knowledge-based model
ified to a semisupervised or supervised form so as to complete (KDM) named the lab model with a DDM namely SAE, and
the predictive tasks. For example, a semisupervised probabilistic the experimental results verified that the hybrid method is prior

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5860 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

than using only KDM or DDM [97]. Using an improved gra- prediction [111]. Zhu and Zhang [112] investigated the selection
dient descent algorithm, Yan et al. [98] proposed a DAE-based of DBN structure for the soft sensor application in an industrial
method, which was demonstrated to be effective compared with polymerization process. By comparing with feedforward NNs,
conventional approaches, such as shallow learning methods. Be- the DBN-based method can give more accurate predictions of
sides, to adaptively model time-varying processes, a just-in-time the polymer melt index.
fine-tuning framework was proposed for SAE-based soft sensor
construction [99]. C. CNN-Based Applications
CNNs are mainly utilized for processing gridlike data, es-
B. RBM-Based Applications pecially image data. Besides, they can also be developed to
capture local dynamic characteristics of industrial process data
Nonlinearity is a widely existing characteristic in industrial
or process signals in frequency domain.
processes. Aiming at this, RBM and its variants, especially DBN,
By processing image data, CNN can be used to construct
are generally used as unsupervised nonlinear feature extractors
soft sensors. For example, Horn et al. [113] use CNN to extract
in industrial process modeling.
features in froth flotation sensing, which shows a good feature
Predictors can take advantage of features learned by RBM or
extraction speed and predictive performance. However, images
DBN, and SVR and BPNN are two common kinds of predic-
are still seldom utilized for soft sensor construction compared
tors. For example, to address the problem of high nonlinearity
to common data forms.
and strong correlation among multivariables in the process of
As for dynamic problems, Yuan et al. [114] proposed mul-
coal-fired boiler, a novel deep structure using continuous RBM
tichannel CNN for soft sensing application in the industrial
and SVR algorithms was proposed [100]. A related work was
debutanizer column and hydrocracking process, which can learn
proposed by Lian et al. [101], which uses DBN and SVR
dynamics and various local correlations of different variable
with the improved particle swarm optimization to complete the
combinations. Besides, Wang et al. [115] used two CNN-based
task of rotor thermal deformation prediction. In [102], a soft
soft sensor models to deal with abundant process data for the
sensor model based on the DBN and BPNN was proposed to
purpose of staying low complexity and embracing the process
predict the 4-carboxy-benzaldchydc concentration in the pu-
dynamics at the same time. In [116], a soft sensor was proposed
rified terephthalic acid (PTA) industrial production process.
using the CNN, which predicts the measurements at next time
Faced with the complexity and nonlinearity of nonlinear system
step by extracting time-dependent correlations from a moving
modeling, an improved BPNN based on RBM was proposed
window.
in [103]. In this article, the structure of BPNN is optimized by
In frequency domain, CNNs can acquire high invariance to
utilizing sensitivity analysis and mutual information theories
signal translation, scaling, and distortion. In [117], a pair of
and the initialization of parameters is done by RBM. While in
convolution layer and max-pooling layer was utilized at the
[104], DBN was used to learn hierarchical features for a BPNN,
lowest part of network to extract high-level abstraction from
which was constructed for modeling the relationships between
the vibration spectral features of the mill bearing. Then, ELM
extracted features and mill level in a ball mill production process.
learns a mapping from the extracted features to the mill level.
In addition to SVR and BPNN, ELM can also work as a predictor
In the field of aerospace engineering, a virtual sensor model
based on the features extracted by DBN. The idea was realized
with partial vibration measurements using a CNN was proposed
in the measurement of nutrient solution composition for soilless
for estimating the structural response, which is important for
culture [105].
structural health monitoring and damage detection but physical
To overcome the data-rich-but-information-poor problem,
sensors are limited in the corresponding operational conditions
RBMs can be utilized for ensemble learning. For instance,
[118].
Zheng et al. [106] proposed a soft sensing framework that
integrates the ensemble strategy, DBN, and correntropy kernel
regression into a unified soft sensing framework. Similarly, an D. RNN-Based Applications
ensemble deep kernel learning model was proposed in industrial RNNs are widely used for dynamic modeling, and various
polymerization process, which adopts DBN for unsupervised variants, such as LSTM, are also applied in real cases.
information extraction [107]. In the other case, lack of the RNN-based soft sensors were developed to estimate vari-
labeled sample also leads to poor information, which can be ables with strong dynamic characteristic, such as the curing
settled by semisupervised learning using DBN, such as the work of epoxy/graphite fiber composites [119], the contact area that
proposed in [108]. In [109], focusing on labeled data scarcity, tires of a car are making with the ground [120], the indoor
computational complexity reduction, and unsupervised feature air quality in the subway [121], the melt-flow-length in the
exploitation, a DBN-based soft sensor is designed. injection molding process [122], the biomass concentrations
RBMs have some other interesting applications as well. [123], and the product concentration of reactive distillation
Graziani and Xibilia [110] designed a soft sensor based on DBN columns [124].
for a plant process to estimate an unknown measurement delay Apart from methods based on ordinary RNN, LSTM is also a
rather than quality variables. Another DBN-based model was popular model in soft sensing applications, which can be deeper
applied to process flame images, rather than common structural and more powerful since long-term dependence is weakened.
data, in industrial combustion processes for oxygen content For example, an LSTM-based soft sensor model was proposed

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
SUN AND GE: SURVEY ON DEEP LEARNING FOR DATA-DRIVEN SOFT SENSORS 5861

to cope with strong nonlinearity and dynamics of the process in Therefore, a generative adversarial networks based method was
[125]. Besides, Yuan et al. [126] proposed a supervised LSTM proposed for data generation in [138].
network, which used both the input and quality variables to 4) Elimination of Redundancy: In [139], a double least abso-
learn dynamic hidden states, and the method was proved to be lute shrinkage and selection operator algorithm was integrated
effective on a penicillin fermentation process and an industrial into a multilayer perceptron network to solve two redundancy
debutanizer column. Besides, an LSTM network was used to problems: the input variable redundancy and the model structure
predict the content of nitrogen-derived components in wastew- redundancy.
ater treatment plants [127]. 5) Inference and Approximation: Due to the strong learning
There are other variants that are designed for specific in- ability, DNNs can be used for intelligent control purposes.
dustrial applications. As an example, a two-stream network For example, a soft sensor based on Levenberg–Marquart and
structure was designed, which adopts batch normalization and adaptive linear network was designed and applied in inferential
dropout tricks, to learn diverse features of the various process control of a multicomponent distillation process [140]. In addi-
data [128]. In [129], another type of RNN called time delayed tion, the adaptive fuzzy means algorithm was utilized to evolve
NN (TDNN) was implemented for inferential state estimation a radial basis function NN, which aimed at the approximation
for an ideal reactive distillation column. Besides, the echo state of an unknown system [141].
network as a kind of RNN was also used for soft sensing
application in the high-density polyethylene production process
and PTA production process [130]. By taking advantage of F. Summary of the Existing Applications
singular value decomposition, the collinearity and overfitting The purposes of developing DL-based novel soft sensors
problems were solved. Recently, an ensemble semisupervised include feature extraction, solving missing value issues, dy-
model, which combining SAE with bidirectional LSTM, was namic characteristics capture, semisupervised modeling, etc.
proposed in [131]. The new method can not only extract and (see Table I). It is worth noting that only existing applications
utilize the temporal behavior in labeled and unlabeled data but in soft sensor field are discussed in detail, which does not mean
also take the time dependence hidden in quality metric self that what has not yet appeared in the field of soft sensor is not
into consideration. Also, GRU-based method is proposed for possible. For example, although VAE is the mainstream method
automatic deep extraction of robust dynamic features in [132], to deal with missing value problems for soft sensor application
and achieves good performance in a debutanizer distillation using DL, methods based on RBM and GAN are also feasible
process. in other fields [142], [143]. To design feasible models, differ-
ent strategies were adopted, such as optimizing network struc-
ture, improving the training algorithm, and integrating different
E. Other DL-Based Applications
algorithms.
In addition to applications based on aforementioned main- From the applications discussed in aforementioned sections,
stream models, some other deep models are also used to solve some points can be further summarized. First, the statistics on
soft sensing problems. Some typical applications are discussed soft sensor applications using DL methods can be seen in Fig. 9,
as the following and the others will not be analyzed in detail which is based on a total of 57 references discussed and cited in
here. Section IV. From Fig. 9(a), the trend is clear that there are more
1) Semisupervised Modeling: In [133], a semisupervised and more algorithms based on DL theory during recent years,
framework was constructed by integrating manifold embedding which is a reflection of the increasing demand for DL models
into a deep NN (DNN), in which manifold embedding exploited in real industrial process modeling. Moreover, compared with
the local neighbor relationship among industrial data and im- three other main theories, CNN-based methods are applied less.
proved the utilization efficiency of unlabeled data in DNN. This is because gridlike data, such as images, are more used for
Besides, a just-in-time semisupervised soft sensor based on classification rather than regression tasks. Besides, although AE
ELM was proposed to online estimate the Mooney viscosity looks simpler than other main models, it is easier to be developed
with multiple recipes in [134]. and extended, so it is also of great potential.
2) Dynamic Modeling: Except CNNs and RNNs, there are As shown in Fig. 9(b), soft sensors based on DL theory were
some other NNs that are used for dynamic modeling. Graziani constructed in many scenarios, including chemical industry,
and Xibilia [135] proposed a dynamic DNN-based soft sensor power industry, machinery manufacturing, aerospace engineer-
to estimate the research octane number for a reformer unit ing, etc. Among them, chemical industry applications account
in a refinery and nonlinear finite input response models were for the largest proportion at about 66.7%. The effectiveness of
investigated. Wang et al. [136] proposed a dynamic network most of the work reviewed in this survey is verified by doing
called NARX-DNN, which can interpret the quality prediction numerical simulation experiments (e.g., [93], [114], etc.), or by
error of validation data from different aspects and automatically using public available benchmark datasets (e.g., [137]), or by
determine the most appropriate delay of historical data. Besides, modeling the datasets from real-world processes (e.g., [91], [92],
a dynamic strategy is adopted to improve the dynamic capture [93], [108], [114], [121], etc.). The most common case is the
performance of the ELM, which is combined with PLS in [137]. third type, which can reflect the characteristics of real processes
3) Data Generation: Due to the harsh environment of the as much as possible. For example, in chemical industry field,
industrial process, directly collecting data may be difficult. actual run data are collected from processes, such as debutanizer

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5862 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

in speech, images, text, and reinforcement learning fields [146].


However, a lot of investigation and exploration work remains to
be done for its soft sensing application.

B. Hyperparameter Optimization
For a long time, how to optimize hyperparameters and struc-
tures of networks is a difficult issue for researchers and engineers
[104], [112], [139]. Most of such work require manual trial.
To avoid heavy workload and great randomness, meta-learning
was proposed and investigated, which is also called as “learn to
learn” [146]. The motivation is to offer machine with humanlike
learning ability. Instead of learning a single function for a spe-
cific task, meta-learning learns a function to output functions for
several subtasks. At the same time, many subtasks are required
for meta-learning, and each subtask has its own training set and
test set. After effective training, machine can possess the ability
to optimize hyperparameters, including selecting network struc-
tures by itself. This is attractive for multimodal and changing
processes.

C. Model Reliability
DL methods learn features in an end-to-end way, which in-
creases the difficulty for engineers or designers to understand
Fig. 9. Statistics on existing relevant work. (a) Publications in different what and how they learned. Besides, the dependence of the
years. (b) Applications on different fields.
learning process on data increases the inaccuracy caused by
poor data quality. Both of these two factors pose a threat on
the reliability of DL models. To improve the model reliability,
process [94], polymerization processes [107], hydrocracking model visualization [147], [148] and combining DL models
process [114], to name a few. However, more detailed and with experience or knowledge [149] are two feasible ways.
specific factors need to be considered when applying those soft Model visualization helps researchers to understand what has
sensors to real scenarios. been learned, and introducing experience or knowledge helps to
reduce inaccuracy brought by just relying on data. Nevertheless,
V. DISCUSSIONS AND OUTLOOK these two points need more investigations for practical industrial
Although DL has made great progress in many fields, there application.
is still a lot of work to do to better apply the advanced methods
in the soft sensor domain, especially to meet the demands in D. Distributed Parallel Modeling
practical industrial processes. Data and structure are the two With the trend of industrial big data discussed in Section II,
most important issues required to be considered all the time. how to efficiently model the process from large amount of data is
Around these two topics, some hot research directions should an important and urgent issue. A feasible solution is to transform
be paid more attention in the future. original DL models into the distributed and parallel modeling.
By splitting a large dataset into several small distributed blocks,
A. Lack of Labeled Sample data processing can be carried out simultaneously, which is
conducive to large-scale data modeling [150], [151]. So far,
Although the data are easy to obtain under the trend of big
however, there is still a long distance to go.
data, the annotation cost is still very expensive. Therefore, we
always hope that using fewer labeled samples can train a model
with good generalization ability. Traditional solution of this VI. CONCLUSION
problem is using semisupervised learning methods, whereas DL techniques have shown their great potential in many fields,
the more and more serious imbalance problem between unla- as well as in soft sensor. In order to summarize the past, analyze
beled and labeled data makes it less satisfactory. Self-supervised the present, and look into the future, in this article, we made the
learning (SSL) is another feasible solution, which is a kind of following contributions to the application of DL theory in the
unsupervised strategy [144]. Different with transfer learning field of soft sensor.
[32], [33], the useful feature representations are learned from 1) The merits of DL compared with traditional algorithms
a pretext task designed from the unlabeled input data (not from and the trends of the industrial processes were discussed
other similar datasets). Contrastive way is one of the most in detail to demonstrate the necessity and significance of
popular types of SSL, and has made some great achievements DL algorithms for soft sensor modeling.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
SUN AND GE: SURVEY ON DEEP LEARNING FOR DATA-DRIVEN SOFT SENSORS 5863

2) Main DL models, tricks, and frameworks/toolkits were [22] C. Grosan and A. Abraham, “Rule-based expert systems,” Intell. Syst.,
discussed and summarized to help readers better develop vol. 17, pp. 149–185, 2011.
[23] A. Ligeza,
˛ Logical Foundations for Rule-Based Systems, 2nd ed., Hei-
DL-based soft sensors. delberg, Germany: Springer, 2006.
3) Practical application scenarios were analyzed by review- [24] J. Durkin, Expert Systems: Design and Development. New York, NY,
ing and discussing existing work or publications. USA: Prentice-Hall, 1994.
[25] C. R. Turner, A. Fuggetta, L. Lavazza, and A. L. Wolf, “A conceptual
4) Possible research hot points for future work were inves- basis for feature engineering,” J. Syst. Softw., vol. 49, no. 1, pp. 3–15,
tigated shortly. 1999.
It is our hope for this article to serve as a taxonomy and also [26] F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, and D. Turaga,
“Learning feature engineering for classification,” in Proc. 26th Int. Joint
a tutorial of advances elucidated from a multitude of works on Conf. Artif. Intell., 2017, pp. 2529–2535.
DL-based soft sensors, and to provide the community with a [27] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A
picture of the roadmap and matters for future endeavors. review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[28] A. Ng, “Machine learning yearning,” 2017. [Online]. Available: http:
REFERENCES //www.mlyearning.org/(96)
[29] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans.
[1] B. Huang and R. Kadali, Dynamic Modeling, Predictive Control and Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
Performance Monitoring. London, U.K.: Springer, 2008. [30] Y. Bengio, “Deep learning of representations for unsupervised and trans-
[2] X. Wang, B. Huang, and T. Chen, “Multirate minimum variance control fer learning,” in Proc. ICML Workshop Unsupervised Transfer Learn.,
design and control performance assessment: A data-driven subspace 2012, pp. 17–36.
approach,” IEEE Trans. Control Syst. Technol., vol. 15, no. 1, pp. 65–74, [31] W. Shao, Z. Song, and L. Yao, “Soft sensor development for multimode
Jan. 2007. processes based on semisupervised Gaussian mixture models,” IFAC-
[3] Z. Chen, S. X. Ding, T. Peng, C. Yang, and W. Gui, “Fault detection for PapersOnLine, vol. 51, no. 18, pp. 614–619, 2018.
non-Gaussian processes using generalized canonical correlation analysis [32] F. A. A. Souza and R. Araújo, “Mixture of partial least squares ex-
and randomized algorithms,” IEEE Trans. Ind. Electron., vol. 65, no. 2, perts and application in prediction settings with multiple operating
pp. 1559–1567, Feb. 2018. modes,” Chemometrics Intell. Lab. Syst., vol. 130, no. 15, pp. 192–202,
[4] Y. Jiang, S. Yin, J. Dong, and O. Kaynak, “A review on soft sensors 2014.
for monitoring, control and optimization of industrial processes,” IEEE [33] H. Jin, X. Chen, L. Wang, K. Yang, and L. Wu, “Dual learning-based
Sensors J., to be published, doi: 10.1109/JSEN.2020.3033153. online ensemble regression approach for adaptive soft sensor modeling
[5] V. Venkatasubramanian, R. Rengaswamy, and S. N. Kavuri, “A review of non-linear time-varying processes,” Chemometrics Intell. Lab. Syst.,
of process fault detection and diagnosis: Part II: Qualitative models and vol. 151, pp. 228–244, 2016.
search strategies,” Comput. Chem. Eng., vol. 27, no. 3, pp. 313–326, [34] M. Kano and K. Fujiwara, “Virtual sensing technology in process
2003. industries: Trends and challenges revealed by recent industrial appli-
[6] P. Kadlec, B. Gabrys, and S. Strandt, “Data-driven soft sensors in the cations,” J. Chem. Eng. Jpn., vol. 46, 2012, Art. no. 12we167, doi:
process industry,” Comput. Chem. Eng., vol. 33, pp. 795–814, 2009. 10.1252/jcej.12we167.
[7] M. Kano and M. Ogawa, “The state of the art in chemical process control [35] L. X. Yu, “Pharmaceutical quality by design: Product and process
in Japan: Good practice and questionnaire survey,” J. Process Control, development, understanding, and control,” Pharmaceut. Res., vol. 25,
vol. 20, pp. 969–982, 2010. pp. 781–791, 2008.
[8] K. Pearson, “LIII. On lines and planes of closest fit to systems of points [36] S. J. Qin, “Process data analytics in the era of big data,” AIChE J., vol. 60,
in space,” Philos. Mag., vol. 2, no. 11, pp. 559–572, 1901. no. 9, pp. 3092–3100, 2014.
[9] H. Wold, “Estimation of principal components and related models by [37] N. Stojanovic, M. Dinic, and L. Stojanovic, “Big data process analytics
iterative least squares,” J. Multivariate Anal., vol. 1, pp. 391–420, 1966. for continuous process improvement in manufacturing,” in Proc. IEEE
[10] Q. Jiang, X. Yan, H. Yi, and F. Gao, “Data-driven batch-end quality mod- Int. Conf. Big Data, 2015, pp. 1398–1407.
eling and monitoring based on optimized sparse partial least squares,” [38] L. Yao and Z. Ge, “Big data quality prediction in the process industry:
IEEE Trans. Ind. Electron., vol. 67, no. 5, pp. 4098–4107, May 2020. A distributed parallel modeling framework,” J. Process Control, vol. 68,
[11] W. Yan, H. Shao, and X. Wang, “Soft sensing modeling based on support pp. 1–13, 2018.
vector machine and Bayesian model selection,” Comput. Chem. Eng., [39] M. S. Reis and G. Gins, “Industrial process monitoring in the big
vol. 28, pp. 1489–1498, 2004. data/Industry 4.0 era: From detection, to diagnosis, to prognosis,” Pro-
[12] K. Desai, Y. Badhe, S. S. Tambe, and B. D. Kulkarni, “Soft-sensor cesses, vol. 5, no. 3, 2017, Art. no. 35.
development for fed-batch bioreactors using support vector regression,” [40] S. W. Roberts, “Control charts tests based on geometric moving aver-
Biochem. Eng. J., vol. 27, pp. 225–239, 2006. ages,” Technometrics, vol. 1, pp. 239–250, 1959.
[13] G. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for [41] C. A. Lowry, W. H. Woodall, C. W. Champ, and C. E. Rigdon, “A
deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006. multivariate exponentially weighted moving average control chart,” Tech-
[14] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep nometrics, vol. 34, pp. 46–53, 1992.
feedforward neural networks,” J. Mach. Learn. Res., vol. 9, pp. 249–256, [42] T. Kourti and J. F. MacGregor, “Multivariate SPC methods for
2010. process and product monitoring,” J. Qual. Technol., vol. 28,
[15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 409–428, 1996.
no. 7553, pp. 436–444, 2015. [43] M. S. Reis and P. M. Saraiva, “Prediction of profiles in the process
[16] F. A. A. Souza, R. Araújo, and J. Mendes, “Review of soft sensor methods industries,” Ind. Eng. Chem. Res., vol. 51, pp. 4254–4266, 2012.
for regression applications,” Chemometrics Intell. Lab. Syst., vol. 152, [44] C. Duchesne, J. J. Liu, and J. F. MacGregor, “Multivariate image analysis
pp. 69–79, 2016. in the process industries: A review,” Chemometrics Intell. Lab. Syst.,
[17] K. Hornik et al. “Multilayer feedforward networks are universal approx- vol. 117, pp. 116–128, 2012.
imations,” Neural Netw., vol. 2, pp. 359–366, 1989. [45] D. C. Montgomery and C. M. Mastrangelo, “Some statistical process
[18] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” control methods for autocorrelated data,” J. Qual. Technol., vol. 23,
Math. Control Signals Syst., vol. 2, pp. 303–314, 1989. pp. 179–193, 1991.
[19] K. Hornik, “Approximation capabilities of multilayer feedforward net- [46] T. J. Rato and M. S. Reis, “Advantage of using decorrelated resid-
works,” Neural Netw., vol. 4, pp. 251–257, 1991. uals in dynamic principal component analysis for monitoring large-
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image scale systems,” Ind. Eng. Chem. Res., vol. 52, pp. 13685–13698,
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, 2013.
pp. 770–778. [47] G. E. Hinton and J. L. McClelland, “Learning representations by
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Cambridge, recirculation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 1988,
MA, USA: MIT Press, 2016. pp. 358–366.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5864 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

[48] D. E. Rumelhar, G. E. Hinton, and R. J. Williams, “Learning rep- [74] H. Larochelle and Y. Bengio, “Classification using discriminative re-
resentations by back-propagating errors,” Nature, vol. 323, no. 6088, stricted Boltzmann machines,” in Proc. 25th Int. Conf. Mach. Learn.,
pp. 533–536, 1986. 2008, pp. 536–543.
[49] H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, “Stacked de- [75] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit:
noising autoencoders: Learning useful representations in a deep network Recursive function approximation with applications to wavelet decom-
with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, no. 12, position,” in Proc. 27th Annu. Asilomar Conf. Signals, Syst., Comput.,
pp. 3371–3408, 2010. 1993, pp. 40–44.
[50] B. Schölkopf, J. Platt, and T. Hofmann, “Efficient learning of sparse [76] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
representations with an energy-based model,” in Proc. Adv. Neural Inf. dinov, “Dropout: A simple way to prevent neural networks from overfit-
Process. Syst., 2006, pp. 1137–1144. ting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[51] M. A. Ranzato, Y. L. Boureau, and Y. Lecun, “Sparse feature learning [77] G. E. Hinton et al., “Improving neural networks by preventing co-
for deep belief networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., adaptation of feature detectors,” 2012, arXiv:1207.0580.
2007, pp. 1185–1192. [78] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep net-
[52] A. Hassanzadeh, A. Kaarna, and T. Kauranne, “Unsupervised multi- work training by reducing internal covariate shift,” in Proc. Int. Conf.
manifold classification of hyperspectral remote sensing images with Mach. Learn., 2015, pp. 448–456.
contractive autoencoder,” in Proc. Scandinavian Conf. Image Anal., 2017, [79] M. Abadi et al., “TensorFlow: A system for large-scale machine learn-
pp. 169–180. ing,” in Proc. 12th Symp. Oper. Syst. Des. Implementation, 2016,
[53] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. pp. 265–283.
Learn., vol. 2, no. 1, pp. 1–127, 2009. [80] Y. Jia et al., “Caffe: Convolutional architecture for fast feature
[54] G. E. Hinton, “A practical guide to training restricted Boltzmann ma- embedding,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014,
chines,” in Neural Networks: Tricks of the Trade. Berlin, Heidelberg: pp. 675–678.
Springer, 2012, pp. 599–619. [81] F. Bastien et al., “Theano: New features and speed improvements,” 2012,
[55] G. E. Hinton and R. R. Salakhutdinov, “Deep Boltzmann machines,” J. arXiv:1211.5590.
Mach. Learn. Res., vol. 5, no. 2, pp. 1967–2006, 2009. [82] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning
[56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification toolkit,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data
with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Mining, 2016, pp. 2135–2135.
Process. Syst., 2012, pp. 1097–1105. [83] A. Gulli and S. Pal, Deep Learning With Keras. Birmingham, U.K.: Packt
[57] Y. Zhou and R. Chellappa, “Computation of optical flow using a neural Publishing Ltd., 2017.
network,” in Proc. IEEE Int. Conf. Neural Netw., 1988, pp. 71–78. [84] A. Paszke et al. “PyTorch: An imperative style, high-performance
[58] Y. LeCun et al., “Gradient-based learning applied to docu- deep learning library,” in Proc. Adv. Neural Inf. Process. Syst., 2019,
ment recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, pp. 8026–8037.
Nov. 1998. [85] B. Shen, L. Yao, and Z. Ge, “Nonlinear probabilistic latent variable
[59] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification regression models for soft sensor application: From shallow to deep
with deep convolutional neural networks,” Adv. Neural Inf. Process. structure,” Control Eng. Pract., vol. 94, 2020, Art. no. 104198.
Syst., vol. 2012, no. 25, pp. 1097–1105, 2012. [86] X. Yuan, B. Huang, Y. Wang, C. Yang, and W. Gui, “Deep learning-based
[60] P. J. Werbos, “Backpropagation through time: What it does and how to feature representation and its application for soft sensor modeling with
do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990. variable-wise weighted SAE,” IEEE Trans. Ind. Informat., vol. 14, no. 7,
[61] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies pp. 3235–3243, Jul. 2018.
with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, [87] X. Yan, J. Wang, and Q. Jiang, “Deep relevant representation learning
pp. 157–166, Mar. 1994. for soft sensing,” Inf. Sci., vol. 514, pp. 263–274, 2020.
[62] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training [88] X. Yuan, J. Zhou, B. Huang, Y. Wang, C. Yang, and W. Gui, “Hierarchical
recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2013, quality-relevant feature representation for soft sensor modeling: A novel
pp. 1310–1318. deep learning strategy,” IEEE Trans. Ind. Informat., vol. 16, no. 6,
[63] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: pp. 3721–3730, Jun. 2019.
Continual prediction with LSTM,” Neural Comput., vol. 12, no. 10, [89] Q. Sun and Z. Ge, “Gated stacked target-related autoencoder: A novel
pp. 2451–2471, 2000. deep feature extraction and layerwise ensemble method for industrial
[64] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep soft sensor application,” IEEE Trans. Cybern., to be published, doi:
recurrent neural networks,” 2013, arXiv:1312.6026. 10.1109/TCYB.2020.3010331.
[65] K. Cho, B. V. Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. [90] Q. Sun and Z. Ge, “Deep learning for industrial KPI prediction: When
Bengio, “Learning phrase representations using RNN encoder-decoder ensemble learning meets semi-supervised data,” IEEE Trans. Ind. Infor-
for statistical machine translation,” in Proc. Conf. Empirical Methods mat., vol. 17, no. 1, pp. 260–269, Jan. 2021.
Nat. Lang. Process., 2014, pp. 1724–1734. [91] X. Wang and H. Liu, “Data supplement for a soft sensor using a new
[66] G. Chrupala, A. Kadar, and A. Alishahi, “Learning language through generative model based on a variational autoencoder and Wasserstein
pictures,” 2015, arXiv:1506.03694. GAN,” J. Process Control, vol. 85, pp. 91–99, 2020.
[67] F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural [92] F. Guo, R. Xie, and B. Huang, “A deep learning just-in-time modeling ap-
networks architectures,” Neural Comput., vol. 7, no. 2, pp. 219–269, proach for soft sensor based on variational autoencoder,” Chemometrics
1995. Intell. Lab. Syst., vol. 197, 2020, Art. no. 103922.
[68] D. M. Montserrat, Q. Lin, J. Allebach, and E. J. Delp, “Training ob- [93] F. Guo, W. Bai, and B. Huang, “Output-relevant variational autoencoder
ject detection and recognition CNN models using data augmentation,” for just-in-time soft sensor modeling with missing data,” J. Process
Electron. Imag., vol. 2017, no. 10, pp. 27–36, 2017. Control, vol. 92, pp. 90–97, 2020.
[69] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) [94] R. Xie, N. M. Jan, K. Hao, L. Chen, and B. Huang, “Supervised variational
improves speech recognition,” in Proc. ICML Workshop Deep Learn. autoencoders for soft sensor modeling with missing data,” IEEE Trans.
Audio, Speech, Lang. Process., 2013. Ind. Informat., vol. 16, no. 4, pp. 2820–2828, Apr. 2019.
[70] P. Vincent et al., “Extracting and composing robust features with de- [95] L. Yao and Z. Ge, “Deep learning of semisupervised process data
noising autoencoders,” in Proc. 25th Int. Conf. Mach. Learn., 2008, with hierarchical extreme learning machine and soft sensor appli-
pp. 1096–1103. cation,” IEEE Trans. Ind. Electron., vol. 65, no. 2, pp. 1490–1498,
[71] B. Poole, J. Sohl-Dickstein, and S. Ganguli, “Analyzing noise in autoen- Feb. 2017.
coders and deep networks,” 2014, arXiv:1406.1831. [96] X. Wang and H. Liu, “Soft sensor based on stacked auto-encoder deep
[72] R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: neural network for air preheater rotor deformation prediction,” Adv. Eng.
Backpropagation, conjugate gradient, and early stopping,” in Proc. Adv. Informat., vol. 36, pp. 112–119, 2018.
Neural Inf. Process. Syst., 2001, pp. 381–387. [97] X. Wang and H. Liu, “A knowledge- and data-driven soft sensor based
[73] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, “A survey of sparse on deep learning for predicting the deformation of an air preheater rotor,”
representation: Algorithms and applications,” IEEE Access, vol. 3, IEEE Access, vol. 7, pp. 159651–159660, 2019.
pp. 4910–4530, 2015.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
SUN AND GE: SURVEY ON DEEP LEARNING FOR DATA-DRIVEN SOFT SENSORS 5865

[98] W. Yan, D. Tang, and Y. Lin, “A data-driven soft sensor modeling method [122] X. Chen, F. Gao, and G. Chen, “A soft-sensor development for melt-flow-
based on deep learning and its application,” IEEE Trans. Ind. Electron., length measurement during injection mold filling,” Mater. Sci. Eng.: A,
vol. 64, no. 5, pp. 4237–4245, May 2017. vol. 384, no. 1/2, pp. 245–254, 2004.
[99] Y. Wu, D. Liu, X. Yuan, and Y. Wang, “A just-in-time fine-tuning [123] L. Z. Chen et al., “Soft sensors for on-line biomass measurements,”
framework for deep learning of SAE in adaptive data-driven modeling Bioprocess Biosyst. Eng., vol. 26, no. 3, pp. 191–195, 2004.
of time-varying industrial processes,” IEEE Sensors J., vol. 21, no. 3, [124] G. Kataria and K. Singh, “Recurrent neural network based soft sensor for
pp. 3497–3505, Feb. 2021. monitoring and controlling a reactive distillation column,” Chem. Product
[100] W. Fan et al., “Integration of continuous restricted Boltzmann machine Process Model., vol. 13, no. 3, 2017, doi: 10.1515/cppm-2017-0044.
and SVR in NOx emissions prediction of a tangential firing boiler,” [125] W. Ke, D. Huang, F. Yang, and Y. Jiang, “Soft sensor development and
Chemometrics Intell. Lab. Syst., vol. 195, 2019, Art. no. 103870. applications based on LSTM in deep neural networks,” in Proc. IEEE
[101] P. Lian et al., “Soft sensor based on DBN-IPSO-SVR approach for rotor Symp. Ser. Comput. Intell., 2017, pp. 1–6.
thermal deformation prediction of rotary air-preheater,” Measurement, [126] X. Yuan, L. Li, and Y. Wang, “Nonlinear dynamic soft sensor modeling
vol. 165, 2020, Art. no. 108109. with supervised long short-term memory network,” IEEE Trans. Ind.
[102] R. Liu, Z. Rong, B. Jiang, Z. Pang, and C. Tang, “Soft sensor of 4- Informat., vol. 16, no. 5, pp. 3168–3176, May 2020.
CBA concentration using deep belief networks with continuous restricted [127] I. Pisa et al., “ANN-based soft sensor to predict effluent violations in
Boltzmann machine,” in Proc. 5th IEEE Int. Conf. Cloud Comput. Intell. wastewater treatment plants,” Sensors, vol. 19, no. 6, 2019, Art. no. 1280.
Syst., Nanjing, China, 2018, pp. 421–424. [128] R. Xie, K. Hao, B. Huang, L. Chen, and X. Cai, “Data-driven modeling
[103] J. Qiao and L. Wang, “Nonlinear system modeling and application based based on two-stream λ gated recurrent unit network with soft sensor
on restricted Boltzmann machine and improved BP neural network,” application,” IEEE Trans. Ind. Electron., vol. 67, no. 8, pp. 7034–7043,
Appl. Intell., vol. 51, pp. 37–50, 2020. Aug. 2020.
[104] M. Lu, Y. Kang, X. Han, and G. Yan, “Soft sensor modeling of mill level [129] S. R. V. Raghavan, T. K. Radhakrishnan, and K. Srinivasan, “Soft sensor
based on deep belief network,” in Proc. 26th Chin. Control Decis. Conf., based composition estimation and controller design for an ideal reactive
2014, pp. 189–193. distillation column,” ISA Trans., vol. 50, no. 1, pp. 61–70, 2011.
[105] X. Wang, W. Hu, K. Li, L. Song, and L. Song, “Modeling of soft sensor [130] Y. L. He et al., “Novel soft sensor development using echo state network
based on DBN-ELM and its application in measurement of nutrient integrated with singular value decomposition: Application to complex
solution composition for soilless culture,” in Proc. IEEE Int. Conf. Saf. chemical processes,” Chemometrics Intell. Lab. Syst., vol. 200, 2020,
Produce Informatization, Chongqing, China, 2018, pp. 93–97. Art. no. 103981.
[106] S. Zheng et al., “Robust soft sensor with deep kernel learning for quality [131] X. Yin et al., “Ensemble deep learning based semi-supervised soft sensor
prediction in rubber mixing processes,” Sensors, vol. 20, no. 3, 2020, modeling method and its application on quality prediction for coal
Art. no. 695. preparation process,” Adv. Eng. Informat., vol. 46, 2020, Art. no. 101136.
[107] Y. Liu et al., “Ensemble deep kernel learning with application to quality [132] X. Zhang and Z. Ge, “Automatic deep extraction of robust dynamic
prediction in industrial polymerization processes,” Chemometrics Intell. features for industrial big data modeling and soft sensor application,”
Lab. Syst., vol. 174, pp. 15–21, 2018. IEEE Trans. Ind. Informat., vol. 16, no. 7, pp. 4456–4467, Jul. 2020.
[108] C. Shang et al., “Data-driven soft sensor development based on deep [133] W. Yan et al., “Soft sensor modeling method based on semisupervised
learning technique,” J. Process Control, vol. 24, no. 3, pp. 223–233, 2014. deep learning and its application to wastewater treatment plant,” Ind. Eng.
[109] S. Graziani and M. G. Xibilia, “Deep learning for soft sensor design,” in Chem. Res., vol. 59, no. 10, pp. 4589–4601, 2020.
Proc. Develop. Anal. Deep Learn. Architectures, 2020, pp. 31–59. [134] W. Zheng et al., “Just-in-time semi-supervised soft sensor for quality
[110] S. Graziani and M. G. Xibilia, “Design of a soft sensor for an industrial prediction in industrial rubber mixers,” Chemometrics Intell. Lab. Syst.,
plant with unknown delay by using deep learning,” in Proc. IEEE Int. vol. 180, pp. 36–41, 2018.
Instrum. Meas. Technol. Conf., Auckland, New Zealand, 2019, pp. 1–6. [135] S. Graziani and M. G. Xibilia, “Deep structures for a reformer unit soft
[111] Y. Liu, Y. Fan, and J. Chen, “Flame images for oxygen content predic- sensor,” in Proc. IEEE 16th Int. Conf. Ind. Informat., 2018, pp. 927–932.
tion of combustion systems using DBN,” Energy Fuels, vol. 31, no. 8, [136] K. Wang, C. Shang, F. Yang, Y. Jiang, and D. Huang, “Automatic
pp. 8776–8783, 2017. hyper-parameter tuning for soft sensor modeling based on dynamic deep
[112] C. H. Zhu and J. Zhang, “Developing soft sensors for polymer melt index neural network,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2017,
in an industrial polymerization process using deep belief networks,” Int. pp. 989–994.
J. Autom. Comput., vol. 17, no. 1, pp. 44–54, 2020. [137] Y. He, Y. Xu, and Q. Zhu, “Soft-sensing model development using PLSR-
[113] Z. C. Horn et al., “Performance of convolutional neural networks for fea- based dynamic extreme learning machine with an enhanced hidden layer,”
ture extraction in froth flotation sensing,” IFAC-PapersOnLine, vol. 50, Chemometrics Intell. Lab. Syst., vol. 154, pp. 101–111, 2016.
no. 2, pp. 13–18, 2017. [138] X. Wang, “Data preprocessing for soft sensor using generative adversarial
[114] X. Yuan et al., “Soft sensor model for dynamic processes based on networks,” in Proc. 15th Int. Conf. Control, Autom., Robot. Vis., 2018,
multichannel convolutional neural network,” Chemometrics Intell. Lab. pp. 1355–1360.
Syst., vol. 203, 2020, Art. no. 104050. [139] Y. Fan, B. Tao, Y. Zheng, and S. Jang, “A data-driven soft sensor based on
[115] K. Wang et al., “Dynamic soft sensor development based on convolutional multilayer perceptron neural network with a double LASSO approach,”
neural networks,” Ind. Eng. Chem. Res., vol. 58, no. 26, pp. 11521–11531, IEEE Trans. Instrum. Meas., vol. 69, no. 7, pp. 3972–3979, Jul. 2020.
2019. [140] A. Rani, V. Singh, and J. R. P. Gupta, “Development of soft sensor for
[116] W. Zhu et al., “Deep learning based soft sensor and its application on a neural network based control of distillation column,” ISA Trans., vol. 52,
pyrolysis reactor for compositions predictions of gas phase components,” no. 3, pp. 438–449, 2013.
Comput. Aided Chem. Eng., vol. 44, pp. 2245–2250, 2018. [141] A. Alexandridis, “Evolving RBF neural networks for adaptive soft-sensor
[117] J. Wei, L. Guo, X. Xu, and G. Yan, “Soft sensor modeling of mill level design,” Int. J. Neural Syst., vol. 23, no. 6, 2013, Art. no. 1350029.
based on convolutional neural network,” in Proc. 27th Chin. Control [142] M. D. Zeiler et al., “Modeling pigeon behavior using a conditional
Decis. Conf., 2015, pp. 4738–4743. restricted Boltzmann machine,” in Proc. Eur. Symp. Artif. Neural Netw.,
[118] S. Sun et al., “A data-driven response virtual sensor technique with partial 2009.
vibration measurements using convolutional neural network,” Sensors, [143] Y. Luo et al., “Multivariate time series imputation with generative
vol. 17, no. 12, 2017, Art. no. 2888. adversarial networks,” in Proc. Adv. Neural Inf. Process. Syst., 2018,
[119] H. B. Su, L. T. Fan, and J. R. Schlup, “Monitoring the process of curing pp. 1603–1614.
of epoxy/graphite fiber composites with a recurrent neural network as [144] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep
a soft sensor,” Eng. Appl. Artif. Intell., vol. 11, no. 2, pp. 293–306, neural networks: A survey,” in Proc. IEEE Trans. Pattern Anal. Mach.
1998. Intell., 2020, doi: 10.1109/TPAMI.2020.2992393.
[120] C. A. Duchanoy et al., “A novel recurrent neural network soft sensor [145] A. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive
via a differential evolution training algorithm for the tire contact patch,” predictive coding,” 2018, arXiv:1807.03748.
Neurocomputing, vol. 235, pp. 71–82, 2017. [146] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast
[121] J. Loy-Benitez, S. K. Heo, and C. K. Yoo, “Soft sensor validation for adaptation of deep networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp.
monitoring and resilient control of sequential subway indoor air quality 1126–1135.
through memory-gated recurrent neural networks-based autoencoders,” [147] L. Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn.
Control Eng. Pract., vol. 97, 2020, Art. no. 104330. Res., vol. 9, pp. 2579–2605, Nov. 2008.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.
5866 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 17, NO. 9, SEPTEMBER 2021

[148] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional Zhiqiang Ge (Senior Member, IEEE) received
networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833. the B.Eng. and Ph.D. degrees in automation
[149] S. Kabir et al., “An integrated approach of belief rule base and deep learn- from the Department of Control Science and
ing to predict air pollution,” Sensors, vol. 20, no. 7, 2020, Art. no. 1956. Engineering, Zhejiang University, Hangzhou,
[150] Q. Jiang, S. Yan, H. Cheng, and X. Yan, “Local-global modeling and China, in 2004 and 2009, respectively.
distributed computing framework for nonlinear plant-wide process mon- He was a Research Associate with the De-
itoring with industrial big data,” IEEE Trans. Neural Netw. Learn. Syst., partment of Chemical and Biomolecular Engi-
to be published, doi: 10.1109/TNNLS.2020.2985223. neering, Hong Kong University of Science and
[151] Z. Yang and Z. Ge, “Monitoring and prediction of big process data with Technology, Hong Kong, from 2010 to 2011
deep latent variable models and parallel computing,” J. Process Control, and a Visiting Professor with the Department of
vol. 92, pp. 19–34, 2020. Chemical and Materials Engineering, University
of Alberta, Edmonton, AB, Canada, in 2013. He is currently a Full
Professor with the College of Control Science and Engineering, Zhejiang
University. His research interests include industrial big data, process
monitoring, soft sensor, data-driven modeling, machine intelligence, and
knowledge automation.
Dr. Ge was an Alexander von Humboldt Research Fellow with the
Qingqiang Sun received the B.Eng. degree in University of Duisburg-Essen, Duisburg, Germany, from 2014 to 2017,
electrical engineering and automation from the and also a JSPS Invitation Fellow with Kyoto University, Kyoto, Japan, in
School of Aerospace Engineering, Xiamen Uni- 2018.
versity, Xiamen, China, in 2017, and the M.Eng.
degree in automation from the Department of
Control Science and Engineering, Zhejiang Uni-
versity, Hangzhou, China, in 2020.
His research interests include data-driven
modeling, deep learning, and soft sensor.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 24,2022 at 06:11:10 UTC from IEEE Xplore. Restrictions apply.

You might also like