1. Introduction
Interest in deep neural networks is increasing within the field of neural computation, with regard to both new training procedures and topologies and significant applications. In particular, greater attention is being paid to challenging applications that classical machine learning methods do not adequately address. The use of deep structures has significantly improved the state of the art in many of these fields, such as object and gesture recognition, speech and language processing, and the Internet of Things (IoT).
This Special Issue comprises discussions and analysis of relevant applications in the fields of speech and text analysis; object and gesture recognition; medical applications; IoT implementations; and sentiment analysis.
Successful solutions to complex problems, such as those examined in the contributions noted above, are closely linked to the identification of suitable network architectures. In this issue, long short-term memory (LSTM)- and convolutional neural network (CNN)-derived architectures are the most commonly used neural structures.
Furthermore, in many of the contributions in this issue, a deep interplay exists between the adopted neural structure and the investigated application, leading to the proposal of tailored architectures.
The papers of this Special Issue make significant contributions to the above-mentioned fields by merging theoretical aspects and relevant applications. Nevertheless, topics related to the choice of neural structure and learning algorithm, network sizing, and selection of hyperparameters require further examination and remain the focus of lively research interest.
2. Contributions
The papers included in this Special Issue of Future Internet provide interesting examinations of deep neural networks, considering both theoretical contributions and relevant applications. Case studies are reported for object recognition, text recognition and sentiment analysis, medical applications, and other emerging fields.
The first paper [1] investigates pedestrian attribute recognition within surveillance scenarios. This challenging task is approached as a form of multi-label classification. The authors propose a novel model based on a graph convolutional network (GCN), which uses a CNN to extract pedestrian features and a correlation matrix between labels to propagate information between nodes. Reported results show that the proposed approach outperforms existing state-of-the-art methods.
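The propagation step of a graph convolutional layer can be illustrated with a minimal sketch. It assumes the common formulation ReLU(A·H·W) with a row-normalized label correlation matrix A; the helper names, the toy matrix, and the identity weights are hypothetical and are not taken from [1].

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(matrix):
    """Scale each row to sum to 1 so that propagation averages over neighbours."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total for v in row] if total else list(row))
    return normalized


def gcn_layer(correlation, node_features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(row_normalize(correlation), node_features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


# Hypothetical toy data: 3 attribute labels, each with a 2-dimensional node feature.
# Labels 0 and 1 are correlated with each other; label 2 is independent.
correlation = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
node_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, so only propagation acts

updated = gcn_layer(correlation, node_features, weights)
print(updated)
```

In this toy case the two correlated labels end up sharing the average of their features, while the independent label keeps its own.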
The second paper [2] focuses on the rapidly growing field of IoT systems. Fog computing is used to process the huge amount of data produced by IoT applications. In this paper, Deep Neural Networks Partitioning for Constrained IoT Devices is proposed as a new algorithm to partition neural networks for efficient distributed execution. The authors show that the partitioning offered by popular machine learning frameworks, such as TensorFlow, or by the general-purpose framework METIS, may produce invalid partitionings for highly constrained systems, while the proposed algorithm can be more efficient.
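What makes a partitioning invalid on constrained devices can be sketched as a simple feasibility check. The sketch assumes that validity means no device exceeds its memory capacity; this criterion, the function name, and all numbers are hypothetical and do not reproduce the algorithm of [2].

```python
def is_valid_partitioning(layer_memory, assignment, device_capacity):
    """Return True if no device is assigned more memory than it has.

    layer_memory: memory needed by each layer (arbitrary units).
    assignment: mapping from layer name to device name.
    device_capacity: memory available on each device.
    """
    used = {device: 0 for device in device_capacity}
    for layer, device in assignment.items():
        used[device] += layer_memory[layer]
    return all(used[d] <= device_capacity[d] for d in device_capacity)


# Hypothetical network and devices.
layer_memory = {"conv1": 40, "conv2": 60, "fc": 20}
device_capacity = {"a": 64, "b": 64}

# Grouping both convolutional layers overloads device "a" (100 > 64).
naive = {"conv1": "a", "conv2": "a", "fc": "b"}
# Placing conv2 alone keeps both devices at 60 units.
feasible = {"conv1": "a", "fc": "a", "conv2": "b"}

print(is_valid_partitioning(layer_memory, naive, device_capacity))
print(is_valid_partitioning(layer_memory, feasible, device_capacity))
```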
In the third paper [3], text sentiment analysis is addressed as an important and challenging application. A sentiment-feature-enhanced deep neural network (SDNN) is proposed to integrate sentiment linguistic knowledge into a deep neural network via a sentiment attention mechanism. This helps to select the crucial sentiment-relevant context words by leveraging the sentiment lexicon in an attention mechanism, bridging the gap between traditional sentiment linguistic knowledge and deep learning methods. Experimental results are reported showing that the proposed structure achieved better performance than competitors for text sentiment classification tasks.
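How a sentiment lexicon can steer attention toward sentiment-relevant words may be illustrated as follows. This is a minimal sketch that assumes a fixed additive boost for lexicon words before a softmax; the boost value, the lexicon, and the function names are hypothetical, and the actual mechanism of the SDNN in [3] may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_attention(tokens, base_scores, lexicon, boost=2.0):
    """Add a boost to the score of every lexicon word, then normalize."""
    boosted = [
        score + (boost if token in lexicon else 0.0)
        for token, score in zip(tokens, base_scores)
    ]
    return softmax(boosted)


# Hypothetical sentence, uniform base scores, and a tiny lexicon.
tokens = ["the", "plot", "was", "wonderful"]
base_scores = [0.0, 0.0, 0.0, 0.0]
lexicon = {"wonderful", "terrible"}

attention = sentiment_attention(tokens, base_scores, lexicon)
for token, weight in zip(tokens, attention):
    print(f"{token}: {weight:.3f}")
```

With uniform base scores, the only word that receives more than an equal share of attention is the one found in the lexicon.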
The fourth paper [4] deals with another relevant application field, i.e., gesture recognition in video. A neural network comprising an alternate fusion of a 3D CNN and ConvLSTM, called the Multiple extraction and Multiple prediction (MEMP) network, is proposed. The main feature of the MEMP network is the repeated extraction and prediction of the temporal and spatial feature information of gesture video. The performance of the proposed method is tested on benchmark datasets, showing high accuracy.
A recognition problem is also the topic of [5]. Specifically, ship detection and recognition are addressed to better manage port resources. The authors propose an on-site processing approach, called Embedded Ship Detection and Recognition using Deep Learning (ESDR-DL). The approach processes the video stream on embedded devices and uses a two-stage neural network, composed of a DNet for ship detection and a CNet for ship recognition, which also runs on embedded devices. The ESDR-DL is deployed at the Dongying port in China, where it has been running for over a year.
A medical application is the subject of [6], in which the tooth-marked tongue, an important indicator in traditional Chinese medicine diagnosis, is considered. This paper is an example of a typical application in which a correct diagnosis relies on the experience and knowledge of the practitioner. In the study, a visual explanation method uses a CNN to extract features, and Gradient-weighted Class Activation Mapping (Grad-CAM) is used to produce a coarse localization map. Experimental results demonstrate the effectiveness of the proposed method.
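The way Gradient-weighted Class Activation Mapping produces a coarse localization map can be sketched in a few lines. The sketch follows the standard formulation (channel weights obtained by global average pooling of the gradients, followed by a ReLU over the weighted sum of feature maps); the feature maps and gradients below are hypothetical values rather than outputs of the network in [6].

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps into one coarse localization map.

    feature_maps, gradients: lists of K maps, each a list of rows (H x W).
    gradients[k] holds the gradient of the class score w.r.t. feature_maps[k].
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Channel weight: global average of that channel's gradient.
    alphas = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    return [[max(0.0, v) for v in row] for row in cam]


# Hypothetical 2-channel, 2x2 example.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires in the top-left cell
    [[0.0, 0.0], [0.0, 1.0]],  # channel 1 fires in the bottom-right cell
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],      # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 opposes the class
]

cam = grad_cam(feature_maps, gradients)
print(cam)
```

Only the region activated by the supporting channel survives the ReLU, which is why the resulting map highlights the evidence for the predicted class.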
Paper [7] concerns human activity recognition. The paper introduces a new framework combining 3D-CNN and LSTM networks. The framework integrates a motion map with the next video frame to obtain a new motion map by increasing the training video length iteratively. A linear weighted fusion scheme is then used to fuse the network feature maps into spatio-temporal features. Finally, an LSTM encoder-decoder is used for predictions. Public benchmark datasets are used to demonstrate the effectiveness of the proposed method.
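A linear weighted fusion of feature maps amounts to an element-wise weighted sum. The following minimal sketch assumes flattened feature vectors of equal length and fixed weights; the function name, the vectors, and the weights are hypothetical, and the way the weights are chosen in [7] is not reproduced here.

```python
def linear_weighted_fusion(feature_maps, weights):
    """Fuse equally sized feature vectors as sum_i weights[i] * feature_maps[i]."""
    if len(feature_maps) != len(weights):
        raise ValueError("one weight is required per feature map")
    length = len(feature_maps[0])
    if any(len(fmap) != length for fmap in feature_maps):
        raise ValueError("all feature maps must have the same length")
    return [
        sum(w * fmap[i] for w, fmap in zip(weights, feature_maps))
        for i in range(length)
    ]


# Hypothetical flattened features from two network branches.
spatial = [1.0, 2.0, 3.0]
temporal = [3.0, 2.0, 1.0]

fused = linear_weighted_fusion([spatial, temporal], [0.5, 0.5])
print(fused)
```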
An LSTM-conditional random field model (LSTM-CRF model) with an integrity algorithm is proposed in [8]. The method incorporates the advantages of the data-driven method and dependency syntax, and improves the precision rate of the elements without undermining the recall rate. Cross-domain experiments based on a multi-industry corpus in the financial field are reported.
Object detection is addressed in [9], where feature fusion is added to an object detection network to obtain a better CNN feature, thus improving detection performance on small objects. An attention mechanism is applied to the object detection network to enhance the impact of significant features and weaken background interference. Empirical evaluation on a public dataset demonstrates the effectiveness of the proposed approach.
Paper [10] uses a bidirectional LSTM model to address named entity recognition (NER) in Arabic text, a natural language processing task. The LSTM network can process sequences and relate them to each part of the text, making it useful for NER tasks. Pre-trained word embeddings are used to represent the inputs that are fed into the LSTM network. The proposed model is evaluated on a popular dataset.
A medical application is, again, the topic of [11], where facial nerve paralysis (FNP) is considered. The use of objective measurements can reduce the frequency of errors caused by subjective methods. A single CNN, trained directly from images classified by neurologists, is proposed. The proposed CNN successfully matched the neurologists' classification.
Text classification returns in [12], in which the case of Chinese text is considered. After comparing different methods, LSTM and CNN approaches are selected as deep learning methods to classify Chinese text. Two layers of LSTM and one layer of CNN are integrated into a new model, labelled the BLSTM-C model. The LSTM is responsible for obtaining a sequence output based on past and future contexts, which is then input to the convolutional layer for feature extraction. The model exhibited remarkable performance in classification of Chinese texts.