
CN112766165B - Falling pre-judging method based on deep neural network and panoramic segmentation - Google Patents

Falling pre-judging method based on deep neural network and panoramic segmentation

Info

Publication number
CN112766165B
CN112766165B (application CN202110076029.8A)
Authority
CN
China
Prior art keywords
image
segmentation
neural network
data
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110076029.8A
Other languages
Chinese (zh)
Other versions
CN112766165A (en)
Inventor
张立国
李枫
胡林
杨曼
刘博
孙胜春
张子豪
李义辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202110076029.8A priority Critical patent/CN112766165B/en
Publication of CN112766165A publication Critical patent/CN112766165A/en
Application granted granted Critical
Publication of CN112766165B publication Critical patent/CN112766165B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a fall prediction method based on the combination of a deep neural network and a panoramic (panoptic) segmentation method. The method can efficiently and quickly realize fall detection and prediction: it combines the deep neural network and the image panoramic segmentation method to give short-term, real-time assessment and notification of imminent fall risk, and it learns long-term behavior to predict future risk. The invention uses a deep neural network (DNN) to construct a panoramic segmentation network, and then performs pixel-level segmentation of the video images used in fall detection through an image panoramic segmentation algorithm, thereby achieving scene understanding of the cared-for person and the surrounding environment and fall prediction in dangerous environments.

Description

Falling pre-judging method based on deep neural network and panoramic segmentation
Technical Field
The invention relates to the field of intelligent communication, and in particular to a fall prediction method based on the combination of a deep neural network and an image panoramic segmentation algorithm.
Background
At present, fall prediction methods based on computer vision have been widely explored at home and abroad. They can be divided into three types according to the algorithm and implementation: (1) Pose estimation: data on various human postures are acquired by combining deep learning with a recurrent neural network to build a personal posture library, and an impending fall is predicted and an alarm raised from a sequence of posture actions. This approach achieves fall prediction to a certain extent, but its computation is enormous, its hardware requirements are high, real-time detection is difficult, and it lacks recognition and understanding of the person's environment, so its accuracy is low. (2) Behavior recognition: a convolutional neural network (CNN) is trained on behaviors such as walking, squatting, sitting, lying and falling to generate a fall model library; the trained behaviors are classified and recognized, and alarms of different levels are issued according to the fall-similarity grade, thereby achieving fall prediction. The model library greatly improves the accuracy of fall prediction, but training the CNN model requires heavy computation, so the algorithm is inefficient and cannot predict in real time. (3) Scene understanding: input video images are classified by a deep learning framework; the human body and its environment are classified and recognized against a trained image data set, the relationship between the human body and its surroundings is presented, the objects in the scene are extracted separately with candidate boxes, and a fall level is set according to the environmental danger level to raise an alarm, thereby achieving fall prediction. This approach achieves scene understanding between the human body and the surrounding environment and can effectively perform fall prediction, but its training procedure is complex, it lacks a fast and accurate image segmentation algorithm, and the human body and its surroundings are difficult to extract accurately. In addition, the large amount of image data needed to train the deep neural network makes the computation and energy consumption too high for real-time prediction.
From the analysis of the current state of research above, existing fall prediction methods face two problems: (1) the computation required is so large that the algorithms run slowly and cannot operate in real time; (2) a fast and accurate image segmentation algorithm is lacking.
Disclosure of Invention
The technical problem addressed by the invention is how to improve image segmentation quality and reduce the computation required to train on an image data set, so that fall prediction can be realized effectively.
In order to solve this technical problem, the invention provides a fall prediction method based on the combination of a deep neural network and a panoramic segmentation algorithm. In the convolutional layers, a data conversion method converts floating-point data into integer data, reducing the amount of floating-point computation. In the fully connected layers, a matrix compression method is adopted: matrix singular value decomposition (SVD) decomposes the original large fully connected layer matrix into two small fully connected layer matrices and an intermediate layer matrix; because the intermediate layer contains few neurons, the matrix is compressed, the number of connections and the weight scale are reduced, and the computation and storage requirements drop. This greatly reduces the computation needed for the deep neural network to train on the image data set, lowers power consumption, and makes the algorithm run in real time. Then, by organically combining an image feature fusion method based on feature pyramid fusion with a fully convolutional network structure, semantic segmentation and instance segmentation are fused into two stages of the same segmentation network, and the two originally parallel network structures are merged into one, yielding a brand-new panoramic segmentation algorithm. This panoramic segmentation algorithm distinguishes the human body from the environment it is in, achieving scene understanding and hence the fall prediction function.
Specifically, the invention provides a fall prediction method based on the combination of a deep neural network and a panoramic segmentation method, which comprises the following steps:
step 1, acquiring stable indoor video images by using a full-color camera;
step 2, carrying out image processing on the video image obtained in the step 1, eliminating noise interference factors and obtaining processed image information;
step 3, training a data set PASCAL VOC2012, activating a neural network through an activation layer, and inputting the processed image information obtained in the step 2 into a convolutional layer;
step 4, inputting the acquired image information into the convolution layer, extracting image characteristics, and converting the floating point data into integer data by adopting a data conversion method on the data acquired in the step 3 so as to reduce the amount of operation data;
step 5, performing batch normalization processing on the extracted features, and uniformly outputting the features;
step 6, sending the images subjected to batch normalization processing into a pooling layer, performing feature dimensionality reduction, and extracting key features as output results, wherein the key features comprise main body components, contours, shapes and texture features in the images;
step 7, transmitting the output result of step 6 into the fully connected layer and classifying the data set, wherein the classification of the data set specifically adopts a matrix compression method, specifically: decomposing the original large fully connected layer matrix into two small fully connected layer matrices and an intermediate layer matrix by a matrix singular value decomposition method, wherein the two small fully connected layer matrices contain most of the neurons and the intermediate layer matrix contains a small number of neurons;
the matrix singular value decomposition method is specifically shown in the following formula:
W ≈ U Σ V^T
wherein: W is the m × n weight matrix of the fully connected layer FC8, U and V are the factor matrices produced by the SVD, and Σ is the intermediate layer matrix; the original weight matrix thus becomes the product of smaller matrices:
W ≈ U (Σ V^T)
since matrix multiplication is associative, the mapping from the input N to the output y is expressed as:
y = W N + b ≈ U (Σ V^T N) + b
b represents the bias of the fully connected layer; if fine-tuning of the deep neural network is not needed, the value of b is 0, and N is the data volume output in step 6;
step 8, outputting through a full connection layer, classifying and identifying all images in the data set, marking the categories of all the images, finishing the training of the data set, matching the video image obtained in the step 2 with the trained data set, and classifying and identifying all things in the video image so as to construct a panoramic segmentation image network;
step 9, carrying out characteristic pyramid fusion on the image information output in the step 8, and extracting an image after characteristic fusion;
step 10, performing semantic segmentation on the image after the feature fusion obtained in the step 9, selecting an interested area through a candidate frame, analyzing each pixel through the interested area, applying the panoramic segmentation image network trained in the step 8, realizing semantic category prediction on each pixel by using a pixel category prediction formula, and distinguishing different types of objects;
step 11, performing example segmentation on the image output in the step 10, distinguishing different objects of the same type by setting example mask region segmentation,
an example segmentation formula is shown below:
[Formula reproduced only as an image in the original.]
wherein: L_ins(x_i) represents the image instance segmentation result, x_i is the i-th pixel point, and N_mask(i,j) represents the number of instance mask segmentation regions;
step 12, after step 11 the panoramic video image segmentation task is complete; an image segmentation model labeled with categories is then obtained through the deep neural network and the image panoramic segmentation algorithm, the scene is classified according to the placement of objects, each identified object is assigned a risk coefficient, and the risk level is defined solely by the risk coefficient; specifically, the conditions in the environment are identified from the image segmentation model as follows: if there is no water or obstacle in the environment, it is judged to be a safe environment; if standing water or obstacles exist in the environment, it is judged to be a generally dangerous environment; if fall-inducing hazards such as stairs, standing water or obstacles exist in the environment, it is determined to be a high-risk environment and an alarm is triggered to alert pedestrians and medical staff.
Preferably, the activation function selected in step 3 is a rectified linear unit (ReLU) function, with the specific expression:
f(X) = max(0, X)
where X denotes the image gradient and f(X) denotes the image gradient obtained from the data set.
Preferably, in the step 4, the 32-bit floating-point type data obtained in the step 3 is converted into 8-bit integer type data.
Preferably, in step 5, the batch normalization process formula applied is:
g(x) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])
wherein g(x) is the normalized image output information, x^(k) is the image information in the k-th dimension, E denotes expectation, and Var denotes variance.
Preferably, the pooling layer processing method in step 6 adopts a maximum pooling method to reduce the amount of calculation and increase the training speed, and the obtained image information data volume is:
N=(g(x)-F+2P)/S+1
where N is the amount of data after pooling, g(x) is the output of step 5, F is the filter size, P is the number of pixels added by padding, and S is the stride.
Preferably, in the step 9, using ResNet-50 as a basic network for image feature extraction, the implementation principle is as shown in the following formula:
L_i = f_{5×5}(g(x_i) + UP(L_{i+1}))
wherein L_i denotes the result after fusing the i-th layer features, g(x_i) is the i-th layer feature input, UP represents an upsampling operation, and f_{5×5} denotes convolution with a 5 × 5 feature kernel.
Preferably, in step 10, the pixel class prediction formula is as follows:
L(p_i, l_i) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(l_i, l_i*)
wherein: L(p_i, l_i) is the pixel class prediction result, i is the pixel index, p_i is the pixel probability, p_i* is the labeled probability, λ is the segmentation coefficient, l_i is a vector giving the four coordinates of the true candidate box boundary, and l_i* gives the predicted candidate box boundary coordinates; N_cls is the total number of pixels of the object classes, and L_cls is the log-loss function of the object class (including the background), calculated as:
L_cls(p_i, p_i*) = -[ p_i* log p_i + (1 - p_i*) log(1 - p_i) ]
N_reg is the number of pixels in the region of interest, and L_reg is the regression loss function, calculated as:
L_reg(l_i, l_i*) = smooth_L1(l_i - l_i*)
where smooth_L1 is a smoothing processing function; the obtained data is converted into 8-bit integer data, which reduces the amount of computation and saves data storage space.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention achieves real-time prediction, greatly reducing the risk of possible falls, and provides a new fall prediction method based on the combination of a deep neural network (DNN) and an image panoramic segmentation algorithm.
(2) The invention uses the PASCAL VOC2012 data set, whose categories are relatively complete. The data set is trained through the deep neural network and the different objects in its images are labeled, completing the image classification training and greatly improving detection accuracy. The trained data set is then used to construct the panoramic segmentation network, and the image is panoramically segmented according to pixel class and instance differences to complete scene understanding, thereby realizing the fall prediction function.
(3) The convolutional-layer data conversion method and the fully-connected-layer matrix compression method adopted by the invention greatly reduce the computation required for training and lower the power consumption of the algorithm, so that the training speed of the deep neural network on a large image data set is greatly improved while image classification accuracy is maintained, guaranteeing the real-time performance of the algorithm.
(4) The invention adopts a panoramic segmentation algorithm based on feature pyramid fusion and a fully convolutional network (FCN) to segment the acquired video images accurately. The feature pyramid fusion method reduces the computation of the segmentation network and increases segmentation speed, while the fully convolutional network improves segmentation accuracy. Combining semantic segmentation and instance segmentation in the same network structure guarantees pixel-level classification while also distinguishing individual instances, so the segmentation algorithm is more complete, the segmentation results are clearer, scene understanding of the video image is facilitated, and the fall prediction results are more realistic and reliable.
Drawings
FIG. 1 is a general block diagram of a deep neural network and panorama segmentation based algorithm according to the present invention;
FIG. 2 is a schematic diagram of deep neural network training data according to the present invention;
FIG. 3 is a comparison of the effect of the fully connected layer matrix of the present invention before and after compression;
FIG. 4a is a schematic diagram of a panoramic segmentation network model of the present invention before improvement;
FIG. 4b is a schematic diagram of the panorama segmentation network model of the present invention after improvement;
fig. 5 is a flow chart of the image panorama segmentation algorithm of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The fall prediction method disclosed by the invention comprises a deep neural network module and a panoramic segmentation module as shown in fig. 1.
The deep neural network module comprises an activation layer, a full connection layer, a convolution layer, a batch normalization layer and a pooling layer. The specific implementation steps of carrying out data set training and constructing a panoramic segmentation network based on the deep neural network are as follows:
Step 1: images are collected with a color camera fixed on the ceiling near the entrance so that the whole indoor environment can be observed. Room objects such as tables, chairs, obstacles, beds, books and stationery are all static, indoor lighting is good, and stable, clear scene images can be captured. The experimental subject simulates the behavior and posture of an elderly person; the movement is slow and can be approximated as uniform motion.
Step 2: the video image obtained in step 1 is processed with basic image processing algorithms such as Gaussian filtering, median filtering and morphological denoising, eliminating interference such as Gaussian noise and salt-and-pepper noise and facilitating further analysis and processing of the image.
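For illustration only, the preprocessing of step 2 can be sketched in Python with OpenCV; the function name and filter parameters below are assumptions chosen for the example rather than values specified by the invention.

import cv2

def preprocess_frame(frame):
    # Gaussian filtering suppresses Gaussian noise.
    smoothed = cv2.GaussianBlur(frame, (5, 5), 0)
    # Median filtering suppresses salt-and-pepper noise.
    smoothed = cv2.medianBlur(smoothed, 5)
    # Morphological opening removes small leftover speckle artifacts.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    return cv2.morphologyEx(smoothed, cv2.MORPH_OPEN, kernel)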
Step 3: the data set PASCAL VOC2012 is trained; the neural network is activated through the activation layer and the image information is input to the convolutional layer. The activation function is the rectified linear unit (ReLU), with the specific expression:
f(X) = max(0, X)
where X denotes the image gradient and f(X) denotes the image gradient obtained from the data set.
Step 4: the acquired image information is input into the convolutional layer and image features are extracted. Related research has shown that, in a relatively stable image acquisition scene, integer fixed-point computation can deliver results comparable to floating-point computation, and in a convolutional neural network the accuracy of reduced-precision fixed-point computation is almost the same as that of 32-bit floating-point computation. Therefore, the data obtained in step 3 is converted into integer data by a data conversion method, reducing the amount of computation.
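A minimal sketch of the step-4 data conversion idea, assuming a simple per-tensor symmetric linear quantization; the scale computation and function names are illustrative assumptions, not the patent's exact procedure.

import numpy as np

def float32_to_int8(x):
    # Per-tensor symmetric scale; 127 is the largest int8 magnitude used.
    scale = max(np.abs(x).max(), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_to_float32(q, scale):
    # Recover an approximate float32 tensor for the following layers.
    return q.astype(np.float32) * scale

features = np.random.randn(1, 64, 32, 32).astype(np.float32)
q, s = float32_to_int8(features)
restored = int8_to_float32(q, s)   # error bounded by one quantization step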
Step 5: batch normalization is applied to the extracted features, which are then output in a uniform form. The batch normalization formula is:
g(x) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])
wherein g(x) is the normalized image output information, x^(k) is the image information in the k-th dimension, E denotes expectation, and Var denotes variance.
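The normalization above corresponds to the following small numpy sketch; the eps term added for numerical stability is an implementation detail not stated in the patent.

import numpy as np

def batch_normalize(x, eps=1e-5):
    # x has shape (batch, features); each feature dimension x^(k) is
    # shifted by its expectation and scaled by its standard deviation.
    mean = x.mean(axis=0)          # E[x^(k)]
    var = x.var(axis=0)            # Var[x^(k)]
    return (x - mean) / np.sqrt(var + eps)

features = np.random.randn(16, 128)
normalized = batch_normalize(features)   # zero mean, unit variance per dimension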
Step 6: the batch-normalized images g(x) are sent to the pooling layer for feature dimensionality reduction; key features are extracted and the amount of image information is compressed, which reduces computation and increases training speed. Pooling can be done in two ways: average pooling and maximum pooling. Maximum pooling is adopted here, and the resulting amount of image information is:
N = (g(x) - F + 2P)/S + 1
where N is the amount of data after pooling, g(x) is the output of step 5, F is the filter size, P is the number of pixels added by padding, and S is the stride.
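As a worked example of the output-size formula, an input of 224 with filter F = 2, padding P = 0 and stride S = 2 gives N = (224 - 2 + 0)/2 + 1 = 112. A naive max-pooling sketch follows, with shapes and values chosen only for illustration.

import numpy as np

def max_pool2d(x, size=2, stride=2):
    h, w = x.shape
    out_h = (h - size) // stride + 1   # N = (g(x) - F + 2P)/S + 1 with P = 0
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
print(max_pool2d(fmap))   # 2 x 2 map of local maxima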
Step 7: the output of step 6 is passed into the fully connected layer and the data set is classified. The convolutional layers, pooling layers and activation functions can be understood as mapping the original data distribution into a hidden space, while the fully connected layer maps the learned features into the labeled class space. The subjects detected by the invention are elderly people, whose movement is slow and can be approximated as uniform motion, so nonlinear operations can be filtered out. The invention adopts a matrix compression method: matrix singular value decomposition (SVD) decomposes the original large fully connected layer matrix into two small fully connected layer matrices and an intermediate layer matrix. Because the intermediate layer contains few neurons, the matrix is compressed, the number of connections and the weight scale are reduced, and the computation and storage requirements are lowered. The decomposition is shown in the following formula:
W ≈ U Σ V^T
wherein W is the m × n weight matrix of the fully connected layer FC8, U and V are the factor matrices produced by the SVD, and Σ is the intermediate layer matrix. The original weight matrix thus becomes the product of smaller matrices:
W ≈ U (Σ V^T)
Since matrix multiplication is associative, the mapping from the input N to the output y can be expressed as:
y = W N + b ≈ U (Σ V^T N) + b
where b represents the bias of the fully connected layer; if no fine-tuning of the deep neural network is needed, the value of b is 0. N is the data output in step 6.
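The replacement of one large fully connected layer by two smaller ones via truncated SVD can be sketched as follows; the layer size and the retained rank k are illustrative assumptions, and how well the low-rank product approximates the original mapping depends on the singular value spectrum of the trained weights.

import numpy as np

m, n, k = 1024, 1024, 64           # illustrative layer size and retained rank
W = np.random.randn(m, n).astype(np.float32)   # stand-in for the FC8 weights
b = np.zeros(m, dtype=np.float32)  # bias; 0 when no fine-tuning is needed

# Truncated SVD: keep only the k largest singular values.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(S[:k]) @ Vt[:k, :]    # (k, n): first small fully connected matrix
W2 = U[:, :k]                      # (m, k): second small fully connected matrix

x = np.random.randn(n).astype(np.float32)
y_full = W @ x + b                 # original mapping y = W N + b
y_compressed = W2 @ (W1 @ x) + b   # compressed mapping: k*(m+n) weights instead of m*n
print(np.linalg.norm(y_full - y_compressed) / np.linalg.norm(y_full))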
Step 8: the result is output through the fully connected layer, all images in the data set are classified and recognized, and the category of each image is labeled; at this point the data set training is finished. The video image obtained in step 2 is then matched against the trained data set, and all objects in the video image are classified and recognized so as to construct the panoramic segmentation image network. During classification and recognition, the main task is to distinguish human bodies from objects; image labeling is mainly used, and a general image labeling method is adopted in practical applications.
Because data conversion and matrix compression are applied while training the data set, the amount of image data is greatly reduced, the computation time caused by a large data volume is saved, and training efficiency improves. By compressing the image data volume, the method greatly improves training efficiency without affecting classification and recognition accuracy, reduces computation time, and guarantees real-time fall prediction.
The image panorama segmentation module comprises feature pyramid fusion, plus semantic segmentation and instance segmentation based on a fully convolutional network (FCN). Image features are extracted with the feature pyramid fusion method; semantic segmentation then provides pixel-level analysis of the image, distinguishing objects of different types by pixel; finally, instance segmentation distinguishes individual differences within the same type. This achieves scene understanding and hence fall prediction.
The specific implementation mode comprises the following steps:
and 9, carrying out characteristic pyramid fusion on the image information output in the step 8, and extracting image characteristics. The present invention uses ResNet-50 as the underlying network for image feature extraction. ResNet is divided into 5 stages according to the size of feature maps, which are respectively called res1, res2, res3, res4 and res5, and the feature map sizes are respectively 1/2,1/4,1/8,1/16 and 1/32 of the original. For the visual task, the depth of the network corresponds to the receptive field, and the larger the receptive field of the pixel points on the deep characteristic diagram is, the stronger the classification capability is. The fused feature maps with different resolutions can be used for object detection with corresponding resolution sizes respectively. The method can ensure that each layer has proper resolution and strong semantic features, and meanwhile, the method only adds extra cross-layer connection on the original basic network and hardly adds extra time and calculation amount.
The implementation principle is shown in the following formula:
L_i = f_{5×5}(g(x_i) + UP(L_{i+1}))
wherein L_i denotes the result after fusing the i-th layer features, g(x_i) is the i-th layer feature input, UP represents an upsampling operation, and f_{5×5} denotes convolution with a 5 × 5 feature kernel.
Step 10: semantic segmentation is performed on the feature-fused image. Regions of interest are selected with candidate boxes, each pixel within a region of interest is analyzed, and the panoramic segmentation network trained in step 8 is applied to predict the semantic category of each pixel and distinguish different objects. The pixel class prediction formula is as follows:
L(p_i, l_i) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(l_i, l_i*)
wherein: L(p_i, l_i) is the pixel class prediction result, i is the pixel index, p_i is the pixel probability, p_i* is the labeled probability, λ is the segmentation coefficient, l_i is a vector giving the four coordinates of the true candidate box boundary, and l_i* gives the predicted candidate box boundary coordinates. N_cls is the total number of pixels of the object classes, and L_cls is the log-loss function of the object class (including the background), calculated as:
L_cls(p_i, p_i*) = -[ p_i* log p_i + (1 - p_i*) log(1 - p_i) ]
N_reg is the number of pixels in the region of interest, and L_reg is the regression loss function, calculated as:
L_reg(l_i, l_i*) = smooth_L1(l_i - l_i*)
where smooth_L1 is a smoothing processing function; the obtained data is converted into 8-bit integer data, which reduces the amount of computation and saves data storage space.
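The structure of this loss can be sketched in numpy; the binary log-loss and smooth-L1 forms below are standard choices assumed here for illustration.

import numpy as np

def log_loss(p, p_star):
    # Binary log loss between predicted probability p and label p*.
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))

def smooth_l1(x):
    # Quadratic near zero, linear elsewhere.
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def prediction_loss(p, p_star, boxes_true, boxes_pred, lam=1.0):
    # L = (1/N_cls) sum L_cls + lam * (1/N_reg) sum p* L_reg
    n_cls, n_reg = p.size, boxes_true.shape[0]
    cls_term = log_loss(p, p_star).sum() / n_cls
    reg_term = (p_star[:, None] * smooth_l1(boxes_true - boxes_pred)).sum() / n_reg
    return cls_term + lam * reg_term

p = np.array([0.9, 0.2, 0.7]); p_star = np.array([1.0, 0.0, 1.0])
l_true = np.random.randn(3, 4); l_pred = l_true + 0.1 * np.random.randn(3, 4)
print(prediction_loss(p, p_star, l_true, l_pred))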
Step 11: instance segmentation is performed on the image output in step 10. The instance segmentation task must not only predict the pixel-level class but also distinguish different individuals belonging to the same class, i.e. predict the instance identity. In the invention, different objects of the same type are distinguished by instance mask region segmentation, so that panoramic segmentation and scene understanding of the image are achieved and the person's specific situation in the surrounding environment can be judged accurately. An instance segmentation formula is shown below:
[Formula reproduced only as an image in the original.]
wherein: L_ins(x_i) represents the image instance segmentation result, x_i is the i-th pixel point, and N_mask(i,j) represents the number of instance mask segmentation regions.
Step 12: with step 11 the panoramic segmentation of the video image is complete. An image segmentation model labeled with categories is obtained through the deep neural network and the image panoramic segmentation algorithm, and the specific situation of the person in the environment can be recognized from this model. The invention specifies that if there is no water or obstacle in the person's path, the environment is judged to be safe; if standing water or an obstacle is present in the path, it is judged to be a generally dangerous environment; and if fall-inducing hazards such as stairs, standing water or obstacles are present in the path, it is judged to be a high-risk environment and an alarm is triggered to alert the person and the medical staff.
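The step-12 decision can be summarized as a simple rule over the labels produced by the segmentation model; the label names and the three risk levels below are illustrative assumptions rather than the patent's exact categories.

from enum import Enum

class Risk(Enum):
    SAFE = 0
    GENERAL = 1
    HIGH = 2

HAZARDS = {"water", "obstacle"}        # hypothetical label names
FALL_INDUCING = {"stairs"}

def assess_environment(labels):
    # labels: set of object categories segmented along the person's path.
    if labels & FALL_INDUCING:
        return Risk.HIGH               # trigger an alarm to the person and medical staff
    if labels & HAZARDS:
        return Risk.GENERAL
    return Risk.SAFE

print(assess_environment({"floor", "water", "stairs"}))   # Risk.HIGH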
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention shall fall within the protection scope defined by the claims of the present invention.

Claims (7)

1. A fall prediction method based on the combination of a deep neural network and a panoramic segmentation method, characterized in that it comprises the following steps:
step 1, acquiring a stable indoor video image by using a full-color camera;
step 2, carrying out image processing on the video image obtained in the step 1, eliminating noise interference factors and obtaining processed image information;
step 3, training a data set PASCAL VOC2012, activating a neural network through an activation layer, and inputting the processed image information obtained in the step 2 into a convolutional layer;
step 4, extracting image characteristics after inputting the acquired image information into the convolutional layer, converting the floating point data into integer data by filtering out decimal numbers by adopting a data conversion method for the data acquired in the step 3, and thus reducing the amount of operation data;
step 5, performing batch normalization processing on the extracted features, and uniformly outputting the features;
step 6, sending the images subjected to batch normalization processing into a pooling layer, performing feature dimensionality reduction, and extracting key features as output results, wherein the key features comprise main body components, contours, shapes and texture features in the images;
step 7, transmitting the output result of step 6 into the fully connected layer and classifying the data set, wherein the classification of the data set specifically adopts a matrix compression method, specifically: decomposing the original large fully connected layer matrix into two small fully connected layer matrices and an intermediate layer matrix by a matrix singular value decomposition method, wherein the two small fully connected layer matrices contain most of the neurons and the intermediate layer matrix contains a small number of neurons;
the matrix singular value decomposition method is specifically shown in the following formula:
W ≈ U Σ V^T
wherein: W is the m × n weight matrix of the fully connected layer FC8, U and V are the factor matrices produced by the SVD, and Σ is the intermediate layer matrix; the original weight matrix thus becomes the product of smaller matrices:
W ≈ U (Σ V^T)
since matrix multiplication is associative, the mapping from the input N to the output y is expressed as:
y = W N + b ≈ U (Σ V^T N) + b
b represents the bias of the fully connected layer; if fine-tuning of the deep neural network is not needed, the value of b is 0, and N is the data volume output in step 6;
step 8, outputting through a full connection layer, classifying and identifying all images in the data set, marking the categories of all the images, finishing the training of the data set, matching the video image obtained in the step 2 with the trained data set, and classifying and identifying all things in the video image so as to construct a panoramic segmentation image network;
step 9, carrying out characteristic pyramid fusion on the image information output in the step 8, and extracting an image after characteristic fusion;
step 10, performing semantic segmentation on the image after the feature fusion obtained in the step 9, selecting an interested area through a candidate frame, analyzing each pixel through the interested area, applying the panoramic segmentation image network trained in the step 8, realizing semantic category prediction on each pixel by using a pixel category prediction formula, and distinguishing different types of objects;
step 11, performing example segmentation on the image output in the step 10, distinguishing different objects of the same type by setting example mask region segmentation,
an example segmentation formula is shown below:
[Formula reproduced only as an image in the original.]
wherein: L_ins(x_i) represents the image instance segmentation result, x_i is the i-th pixel point, and N_mask(i,j) represents the number of instance mask segmentation regions;
step 12, after step 11 the panoramic video image segmentation task is complete; an image segmentation model labeled with categories is then obtained through the deep neural network and the image panoramic segmentation algorithm, the scene is classified according to the placement of objects, each identified object is assigned a risk coefficient, and the risk level is defined solely by the risk coefficient; specifically, the conditions in the environment are identified from the image segmentation model as follows: if there is no water or obstacle in the environment, it is judged to be a safe environment; if standing water or obstacles exist in the environment, it is judged to be a generally dangerous environment; if fall-inducing hazards such as stairs, standing water or obstacles exist in the environment, it is determined to be a high-risk environment and an alarm is triggered to alert pedestrians and medical staff.
2. The fall prediction method based on the combination of the deep neural network and the panorama segmentation method as claimed in claim 1, wherein: the activation function selected in step 3 is a rectified linear unit (ReLU) function, with the specific expression:
f(X) = max(0, X)
where X denotes the image gradient and f(X) denotes the image gradient obtained from the data set.
3. The fall prediction method based on the combination of the deep neural network and the panorama segmentation method as claimed in claim 1, wherein: in the step 4, the 32-bit floating-point data obtained in the step 3 is converted into 8-bit integer data.
4. The fall prediction method based on the combination of the deep neural network and the panorama segmentation method as claimed in claim 1, wherein: in step 5, the batch normalization processing formula applied is:
g(x) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])
wherein g(x) is the normalized image output information, x^(k) is the image information in the k-th dimension, E denotes expectation, and Var denotes variance.
5. The fall pre-judging method based on the combination of the deep neural network and the panorama segmentation method as claimed in claim 4, wherein: the pooling layer processing method in the step 6 adopts a maximum pooling method to reduce the calculated amount and improve the training speed, and the obtained image information data amount is as follows:
N=(g(x)-F+2P)/S+1
where N is the amount of data after pooling, g(x) is the output of step 5, F is the filter size, P is the number of pixels added by padding, and S is the stride.
6. The fall prediction method based on the combination of the deep neural network and the panorama segmentation method as claimed in claim 1, wherein: in the step 9, ResNet-50 is used as a basic network for image feature extraction, and the implementation principle is shown in the following formula:
L_i = f_{5×5}(g(x_i) + UP(L_{i+1}))
wherein L_i is the result of fusing the i-th layer features, g(x_i) is the i-th layer feature input, UP is the upsampling operation, and f_{5×5} denotes convolution with a 5 × 5 feature kernel.
7. The fall prediction method based on the combination of the deep neural network and the panorama segmentation method as claimed in claim 1, wherein: in step 10, the pixel class prediction formula is as follows:
L(p_i, l_i) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(l_i, l_i*)
wherein: L(p_i, l_i) is the pixel class prediction result, i is the pixel index, p_i is the pixel probability, p_i* is the labeled probability, λ is the segmentation coefficient, l_i is a vector giving the four coordinates of the true candidate box boundary, and l_i* gives the predicted candidate box boundary coordinates; N_cls is the total number of pixels of the object classes, L_cls is the log-loss function of the object class, and L_cls is calculated as:
L_cls(p_i, p_i*) = -[ p_i* log p_i + (1 - p_i*) log(1 - p_i) ]
N_reg is the number of pixels in the region of interest, L_reg is the regression loss function, and L_reg is calculated as:
L_reg(l_i, l_i*) = smooth_L1(l_i - l_i*)
where smooth_L1 is a smoothing function; the resulting data is converted into 8-bit integer data.
CN202110076029.8A 2021-01-20 2021-01-20 Falling pre-judging method based on deep neural network and panoramic segmentation Active CN112766165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110076029.8A CN112766165B (en) 2021-01-20 2021-01-20 Falling pre-judging method based on deep neural network and panoramic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110076029.8A CN112766165B (en) 2021-01-20 2021-01-20 Falling pre-judging method based on deep neural network and panoramic segmentation

Publications (2)

Publication Number Publication Date
CN112766165A (en) 2021-05-07
CN112766165B (en) 2022-03-22

Family

ID=75701752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110076029.8A Active CN112766165B (en) 2021-01-20 2021-01-20 Falling pre-judging method based on deep neural network and panoramic segmentation

Country Status (1)

Country Link
CN (1) CN112766165B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297991A (en) * 2021-05-28 2021-08-24 Hangzhou Ezviz Software Co., Ltd. Behavior identification method, device and equipment
CN113449611B (en) * 2021-06-15 2023-07-07 University of Electronic Science and Technology of China Helmet recognition intelligent monitoring system based on YOLO network compression algorithm
CN114595748B (en) * 2022-02-21 2024-02-13 Nanchang University Data segmentation method for fall protection system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11179064B2 (en) * 2018-12-30 2021-11-23 Altum View Systems Inc. Method and system for privacy-preserving fall detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967516A (en) * 2017-10-12 2018-04-27 SeetaTech (Beijing) Technology Co., Ltd. A neural network acceleration and compression method based on trace norm constraint
CN110276765A (en) * 2019-06-21 2019-09-24 Beijing Jiaotong University Image panorama dividing method based on multi-task learning deep neural network
CN111428726A (en) * 2020-06-10 2020-07-17 Sun Yat-sen University Panorama segmentation method, system, equipment and storage medium based on graph neural network
CN112163564A (en) * 2020-10-26 2021-01-01 Yanshan University Tumble prejudging method based on human body key point behavior identification and LSTM (long short-term memory)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ahmad Lotfi et al. Supporting Independent Living for Older Adults; Employing a Visual Based Fall Detection Through Analysing the Motion and Shape of the Human Body. IEEE Access, Volume 6, 2018, full text. *
Fall behavior recognition based on deep learning; Ma Lu; China Master's Theses Full-text Database, Information Science and Technology; 2020-06-15; full text *
Research on fall detection algorithm and its implementation on a mobile robot platform; Sun Pengfei; China Master's Theses Full-text Database, Information Science and Technology; 2019-08-15; full text *

Also Published As

Publication number Publication date
CN112766165A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN110363183B (en) Service robot visual image privacy protection method based on generating type countermeasure network
CN112766165B (en) Falling pre-judging method based on deep neural network and panoramic segmentation
CN106407889B (en) Method for recognizing human body interaction in video based on optical flow graph deep learning model
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN107085716A (en) Across the visual angle gait recognition method of confrontation network is generated based on multitask
CN112464730B (en) Pedestrian re-identification method based on domain-independent foreground feature learning
CN111639719A (en) Footprint image retrieval method based on space-time motion and feature fusion
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN110991340A (en) Human body action analysis method based on image compression
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN106056553A (en) Image inpainting method based on tight frame feature dictionary
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN112149613B (en) Action pre-estimation evaluation method based on improved LSTM model
CN109522865A (en) A kind of characteristic weighing fusion face identification method based on deep neural network
CN116664462A (en) Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN114299279B (en) Mark-free group rhesus monkey motion quantity estimation method based on face detection and recognition
CN115346272A (en) Real-time tumble detection method based on depth image sequence
Wang et al. Infrared and visible image fusion based on Laplacian pyramid and generative adversarial network.
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
CN116580450A (en) Method for recognizing gait at split viewing angles
CN112613405B (en) Method for recognizing actions at any visual angle
Mo et al. The image inpainting algorithm used on multi-scale generative adversarial networks and neighbourhood
CN113673303A (en) Human face action unit intensity regression method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant