Background
Feature extraction is a research hotspot in the field of image recognition. YOLO (You Only Look Once) is a real-time object detection model based on a convolutional neural network; it can learn from massive data, performs end-to-end feature extraction, and achieves good real-time recognition results, so the model has attracted great interest. In the prior art, a pedestrian detection algorithm based on a Gaussian mixture model and YOLO has been proposed, in which the Gaussian mixture model is used to simulate the background characteristics, and good results are obtained when detecting pedestrians in substation surveillance video. Another method extracts context information features of the gray-level image by the alternating direction method of multipliers, combines this information into a 2D input channel used as the input of the YOLO neural network model, and forms a real-time target detection algorithm based on YOLO. A text detection method for natural images has been designed with a mechanism for extracting text characters in images, in which YOLO is adopted for text detection and bounding box regression. These studies have done much work on improving the performance of YOLO and expanding its applications, but when the YOLO neural network is used to solve the problem of image feature extraction, the following defects exist:
1) In the recognition process, YOLO divides the image to be recognized into a 7 × 7 grid of cells, and the neurons of the cells that predict a target may belong to several sliding windows of the same category, which imposes a strong spatial constraint on the model. If a sliding window covers multiple objects of different classes, the system cannot detect all of the target objects simultaneously.
2) During training, when the data set features are extracted, each cell in the network is responsible for predicting at most one real target, so YOLO performs poorly when detecting targets that are close together or small.
3) In the image pre-processing stage, YOLO converts the high-resolution images of the training data set into low-resolution data for the final classification feature extraction. After many convolutions, the features of small targets in their distribution area of the original picture are difficult to preserve.
Disclosure of Invention
The invention aims to overcome the above defects and provide a feature extraction method based on a convolutional neural network target real-time detection model, which improves the recognition capability for smaller targets and is less prone to losing information during feature extraction.
The invention discloses a feature extraction method based on a convolutional neural network target real-time detection model, which comprises the following steps of:
(1) preprocessing picture data: acquiring a rectangular area coordinate of a real target for each picture, and generating a coordinate information file of the real target in each picture;
(2) constructing and loading an improved convolutional neural network target real-time detection model (YOLO): the model comprises 18 convolution layers for extracting image features, 6 pooling layers for reducing the picture resolution, 1 Softmax output layer and 1 fully-connected layer; a max pooling layer is added immediately after the image input (an illustrative sketch of this structure is given after the step list);
(3) generating an area matrix vector: generating a plurality of target candidate area matrix vectors of each picture according to the coordinate information file;
(4) taking the candidate region matrix vector as the input of a first layer, and taking the result as the input of a second layer;
(5) performing pooling operations;
(6) taking the result in the step (5) as input, adopting a sliding window to scan the grid, and performing convolution and pooling operation to calculate the feature vector of the unit cell in the sliding window;
(7) taking the feature vector obtained in step (6) as the input of the 18th convolution layer, and performing a convolution operation with a 2 × 2 stride;
(8) taking the output of step (7) as the input of the fully-connected layer, and performing a convolution operation with a 1 × 1 stride;
(9) taking the output of the step (8) as the input of a classification function Softmax, calculating a prediction probability estimation value of the picture data, and obtaining the characteristics of a target area corresponding to the maximum overlapping area of a sliding window and a real detection object area by adopting a sliding window merging method;
(10) storing the characteristics of the corresponding target area to the position corresponding to each category in the characteristic model;
(11) and outputting the characteristic model.
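For illustration only, the following sketch shows one plausible arrangement of the structure described in step (2). The disclosure specifies only the layer counts (18 convolution layers, 6 pooling layers, 1 fully-connected layer, 1 Softmax output, plus the max pooling layer after the input); the input resolution, filter widths and block grouping in the sketch are assumptions, and the modern tf.keras API is used rather than the TensorFlow 0.9 mentioned in the embodiment.

```python
# Illustrative sketch only: the exact filter sizes, channel widths and layer
# ordering are not specified in the disclosure, so this assumes one plausible
# arrangement (18 conv layers, 6 pooling layers, 1 FC layer, 1 Softmax output,
# and an extra 2x2 max-pooling layer right after the input). The resulting
# spatial sizes are schematic and do not necessarily reproduce the 14 x 14
# feature map of step (7).
from tensorflow.keras import layers, models

NUM_CLASSES = 6               # e.g. the six home contexts C1-C6 used in the embodiment
INPUT_SHAPE = (448, 448, 3)   # assumed input resolution

def build_improved_yolo_backbone():
    inputs = layers.Input(shape=INPUT_SHAPE)
    # max pooling layer added immediately after the image input (step (2))
    x = layers.MaxPooling2D(pool_size=(2, 2))(inputs)

    filters = 32
    conv_per_block = 3                 # 6 blocks x 3 conv layers = 18 conv layers
    for _ in range(6):                 # 6 pooling layers for reducing resolution
        for _ in range(conv_per_block):
            x = layers.Conv2D(filters, kernel_size=3, padding="same",
                              activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
        filters = min(filters * 2, 512)

    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)                   # fully-connected layer
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)   # Softmax output layer
    return models.Model(inputs, outputs)

model = build_improved_yolo_backbone()
model.summary()
```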
In the feature extraction method based on the convolutional neural network target real-time detection model described above, in step (7) a 2 × 2 max pooling layer is applied to reduce the picture size, and a 14 × 14 network feature map is output.
In the feature extraction method based on the convolutional neural network target real-time detection model described above, the sliding window merging method described in step (9) is based on the nearest neighbor target detection method (RPN) and comprises the following steps:
1) dividing the picture data into n unit cells by a grid division method to generate a set R = {S1, S2, ..., Sn};
2) initializing the similarity set mi of each cell Si, and initializing a sliding window of 14 × 14 specification;
3) for each pair of adjacent regions (Si, Sj) in the sliding window do
a) calculating the feature similarity F(Si, Sj) between Si and all adjacent cells Sj in the sliding window by the nearest neighbor target detection method (RPN);
b) finding the maximum similarity value Fmax(Si, Sj);
c) updating the similarity set of cell Si: mi = mi ∪ {Fmax(Si, Sj)};
d) while (the similarity set mi of each cell Si is non-empty)
e) finding all cells corresponding to the elements of the set mi, and removing the cells that do not contain the detection object;
f) merging the obtained cells with cell Si to form a new Si, which is added as an element of the set L;
g) outputting the target position detection sliding window set;
h) ending the while;
i) ending the for.
Compared with the prior art, the method has obvious beneficial effects. The scheme constructs and loads an improved convolutional neural network target real-time detection model (YOLO): the model comprises 18 convolution layers for extracting image features, 6 pooling layers for reducing the picture resolution, 1 Softmax output layer and 1 fully-connected layer; and a max pooling layer is added after the image input. In this structure, the fully-connected layer is adopted to reduce the loss of feature information; after an image is input, a 2 × 2 max pooling layer is added to reduce the size of the image while preserving as much information of the original image as possible, and the multi-layer grid after convolution and pooling is output at 14 × 14 to enlarge the network feature map, which improves the recognition accuracy of the system. The sliding window merging method based on the nearest neighbor target detection method (RPN) can determine the frame of the sliding window after convolution and pooling, and merging similar areas reduces redundancy and time overhead. In a word, the method improves the recognition capability for smaller targets, and information is not easily lost during feature extraction.
The advantageous effects of the present invention will be further described below by way of specific embodiments.
Detailed Description
The following detailed description will be made of specific embodiments, features and effects of the feature extraction method based on the convolutional neural network target real-time detection model according to the present invention with reference to the accompanying drawings and preferred embodiments.
The invention discloses a feature extraction method based on a convolutional neural network target real-time detection model, which comprises the following steps of:
(1) Preprocessing the picture data: obtaining the rectangular-region coordinates of the real target for each picture of the picture data set X, and generating the coordinate information file Fc of the real target in each picture;
(2) Loading the picture classification training model of YOLO, and simultaneously initializing the feature model Mweights of the picture data X and initializing the predicted rectangular-region coordinates of each picture as null; the model comprises 18 convolution layers for extracting image features, 6 pooling layers for reducing the picture resolution, 1 Softmax output layer and 1 fully-connected layer; and a max pooling layer is added after the image input (see fig. 1).
(3) According to the coordinate information file Fc, generating a plurality of target candidate area matrix vectors of each picture based on the nearest neighbor target detection method (RPN);
(4) taking the candidate region matrix vector as the input of a first layer, and taking the result as the input of a second layer;
(5) a pooling operation is performed.
(6) And (5) taking the result in the step (5) as input, scanning the grid by adopting a sliding window, and performing convolution and pooling operation to calculate the feature vector of the unit cell in the sliding window.
(7) Taking the feature vector obtained in step (6) as the input of the 18th convolution layer, and performing a convolution operation with a 2 × 2 stride;
(8) taking the output of step (7) as the input of the fully-connected layer, and performing a convolution operation with a 1 × 1 stride;
(9) taking the output of step (8) as the input of the classification function Softmax, calculating the prediction probability estimation values of the picture data Xpic, and storing the features of the target region corresponding to the largest overlap PIOU obtained by the RPN-based sliding window merging algorithm; wherein PIOU denotes the overlapping area (in pixels) of the sliding window and the real detection object region;
(10) Saving the features of the corresponding target region to the location corresponding to each category in the feature model Mweights;
(11) output feature model Mweights;
The LabelImg tool is used to obtain the coordinate information of the selected region in step (1) above. In step (7), a 2 × 2 max pooling layer is applied to reduce the size of the picture while preserving as much information of the original picture as possible, and a 14 × 14 network feature map is output. In step (8) above, the sliding window operates on the 17 convolutional layers used for extracting image features and the 6 pooling layers used for reducing the image size. In this process, each time the sliding window performs a convolution operation, the PIOU with the largest overlapping area is calculated by the RPN-based sliding window merging algorithm and substituted into the loss function formula of YOLO to calculate the minimum value of the loss function. In the application system, application decisions can be made according to the feature model Mweights output in step (11).
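PIOU is defined here only as the overlapping area, in pixels, of a sliding window and the real detection object region. A minimal sketch of that computation, assuming both regions are axis-aligned rectangles given as (x_min, y_min, x_max, y_max) in pixel coordinates, is:

```python
def overlap_area_pixels(window, truth):
    """Overlapping area (in pixels) of a sliding window and a real
    detection-object region; both are (x_min, y_min, x_max, y_max) rectangles.
    This corresponds to the quantity denoted PIOU in the text."""
    x_min = max(window[0], truth[0])
    y_min = max(window[1], truth[1])
    x_max = min(window[2], truth[2])
    y_max = min(window[3], truth[3])
    if x_max <= x_min or y_max <= y_min:
        return 0                      # no overlap
    return (x_max - x_min) * (y_max - y_min)

# Example: pick the window with the largest overlap against one ground-truth box.
# windows = [...]; truth = (x_min, y_min, x_max, y_max)
# best = max(windows, key=lambda w: overlap_area_pixels(w, truth))
```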
In step (8), for the RPN-based sliding window merging algorithm, when YOLO detects a target object, one cell relates to a plurality of sliding windows, and the number of finally output windows identifying the target object is less than or equal to the number of categories of the picture data. When applying YOLO to context detection, it is not necessary to identify all targets; the system only needs to feed back whether the object to be detected is present in the current view. Therefore, an RPN-based sliding window merging algorithm is designed:
the algorithm is as follows: RPN-based sliding window merging algorithm
Input: picture data Xpic
Output: target position detection sliding window set L
1) Divide Xpic into n cells by the grid division method to generate the set R = {S1, S2, ..., Sn};
2) Initialize the similarity set mi of each cell Si, and initialize a sliding window of 14 × 14 specification;
3) for each pair of adjacent regions (Si, Sj) in the sliding window do
a) Calculate the feature similarity F(Si, Sj) between Si and all adjacent cells Sj in the sliding window by the RPN method;
b) Find the maximum similarity value Fmax(Si, Sj);
c) Update the similarity set of cell Si: mi = mi ∪ {Fmax(Si, Sj)};
d) while (the similarity set mi of each cell Si is non-empty)
e) Find all cells corresponding to the elements of the set mi, and remove the cells that do not contain the detection object;
f) Merge the obtained cells with cell Si to form a new Si, which is added as an element of the set L;
g) Output the target position detection sliding window set;
h) end while;
i) end for.
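The disclosure does not define the feature similarity F(Si, Sj) or the test for whether a cell contains the detection object. Purely as one possible reading of the pseudocode above, the following sketch assumes cosine similarity of per-cell feature vectors and a caller-supplied predicate for object presence; both are illustrative placeholders, not part of the invention.

```python
# Illustrative sketch of the RPN-based sliding-window merging procedure above.
# Assumptions (not specified in the disclosure): F(Si, Sj) is cosine similarity
# of per-cell feature vectors, and contains_object() is a caller-supplied
# predicate deciding whether a cell covers the detection object.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def merge_sliding_windows(cell_features, adjacency, contains_object):
    """cell_features: dict cell_id -> feature vector (np.ndarray)
       adjacency:     dict cell_id -> list of adjacent cell_ids in the window
       contains_object: callable cell_id -> bool
       returns L, the set of merged target-position regions."""
    L = []
    for si, neighbours in adjacency.items():      # 3) adjacent regions (Si, Sj)
        # a)-c) build Si's similarity set mi from its adjacent cells Sj
        mi = {sj: cosine_similarity(cell_features[si], cell_features[sj])
              for sj in neighbours}
        merged = {si}
        while mi:                                  # d) while mi is non-empty
            best_sj = max(mi, key=mi.get)          # neighbour with Fmax(Si, Sj)
            mi.pop(best_sj)
            if contains_object(best_sj):           # e) discard cells without the object
                merged.add(best_sj)                # f) merge into a new Si
        if contains_object(si):
            L.append(frozenset(merged))            # new Si becomes an element of L
    return L
```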
The examples are as follows:
The method is applied to a service robot to carry out privacy context detection tests. First, a service robot context detection platform is built, and the overall workflow of the service robot context detection is given. Six types of contexts in a home environment are designed, and a training data set of 2580 pictures, a verification data set of 360 pictures and a test data set consisting of 4 types of data with 960 samples in total related to privacy content are established. The tests analyze the relationships between the training step and the prediction probability estimation value, and between the learning rate and the recognition accuracy, in order to find empirical values of the training step and the learning rate suitable for the proposed algorithm.
1 privacy detection service robot hardware platform
Fig. 2 shows the built service robot platform, which includes a mobile base, a data processor, a data acquisition device and a mechanical support, and fig. 3 shows the general workflow of the system. The touch display screen for inputting and displaying data is a 16-inch industrial touch screen supporting a Linux system; the vision system adopts an ORBBEC 3D somatosensory camera, which can collect RGB and depth images. The auditory system is expanded from an iFLYTEK-based voice module and can recognize speech and locate the direction of a voice in a noisy environment. The development board is an Nvidia Jetson TX1 development board with a 256-core GPU; the mobile base is an iRobot Create 2. The operating system is Ubuntu 16.04, with the Kinetic version of ROS (Robot Operating System) installed. The workstation used to reduce the computational load of the service robot is a ThinkPad T550 (with an NVIDIA GeForce 940M GPU), mainly used for data analysis. Meanwhile, both the service robot and the workstation are provided with OpenCV 3.1, TensorFlow 0.9 [22], YOLO and the ROS system. The service robot is provided with a wireless communication module, enabling end-to-end communication between the service robot and the workstation.
In fig. 3, after the training data set is collected, the workstation with the GPU trains on the data set using the RPN-based sliding window merging algorithm to obtain the feature model. The obtained feature model is then transmitted to the service robot; after receiving the model, the service robot starts its camera and reads pictures from the camera at a given interval (every 10 seconds) to perform context detection. Finally, the action of the robot is determined according to the detection result. If a privacy context is detected, the robot adjusts the angle of the camera, stores information summarizing the identified privacy content in a text file, and after 30 seconds asks by voice whether the camera may be used to observe the person's behavior again. If the reply is negative, the camera of the system remains in the non-working state, thereby protecting the privacy information. For example, when the system detects that the user is bathing, the camera is rotated 90 degrees and the text message "the user is bathing at 8:00 on 29 March 2017" is stored. Meanwhile, the system starts timing and, after 30 seconds, asks whether bathing is finished. If the person's response is affirmative, the camera returns to the observation angle of the previous moment to continue collecting data, and the action of the service robot is then determined according to the identified data.
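As a schematic illustration of this workflow, the following sketch shows the privacy-protection control loop (read a frame every 10 seconds, classify the context with the feature model, look away and log a one-line summary on detecting a private context, and ask the user again every 30 seconds). The functions read_frame, classify_context, rotate_camera and ask_user are placeholders for the robot's actual vision, motion and speech interfaces; they are not APIs from the disclosure.

```python
# Schematic control loop for the privacy-protection workflow described above.
import time
from datetime import datetime

PRIVACY_CLASSES = {"C1", "C2", "C3", "C4"}   # contexts treated as private
CAPTURE_INTERVAL = 10                        # seconds between camera readings
QUERY_INTERVAL = 30                          # seconds before asking the user again

def detection_loop(read_frame, classify_context, rotate_camera, ask_user,
                   log_path="privacy_log.txt"):
    while True:
        frame = read_frame()
        context = classify_context(frame)        # label predicted by the feature model
        if context in PRIVACY_CLASSES:
            rotate_camera(degrees=90)            # look away from the user
            with open(log_path, "a") as f:       # store a one-line textual summary
                f.write(f"{datetime.now():%Y-%m-%d %H:%M} detected context {context}\n")
            # camera stays in the non-working state until the user agrees
            while True:
                time.sleep(QUERY_INTERVAL)
                if ask_user("May I resume observation?"):
                    rotate_camera(degrees=-90)   # return to the previous viewing angle
                    break
        time.sleep(CAPTURE_INTERVAL)
```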
2 data set and experimental design
2.1 training dataset and validation dataset
The training data set is composed of picture data under different situations, and the feature model is obtained by applying the proposed algorithm to be used for an application system. And the verification data set is used for testing the recognition performance of the feature model under different parameters in the feature model extraction process so as to refine the feature model.
The household contexts considered include 6 categories: C1: bathing; C2: sleeping naked or semi-naked; C3: using the toilet; C4: changing clothes, resulting in nudity; C5: a person is present but not involved in any of the above private content; C6: no one is present in the home environment. The data come from 2 sources: 1) pictures automatically acquired in the constructed home environment by the ORBBEC 3D somatosensory camera on the constructed service robot platform, accounting for about 81% of the whole data set; 2) pictures of home environments collected from the network, screened and appropriately processed, which have different scenes, objects, brightness, angles and resolutions so as to enrich the data set.
The 6 classes of contexts of the training dataset comprise 2580 samples in total, each class comprising 430 samples.
The 6 classes of contexts of the validation data set comprise 360 samples in total, each class comprising 60 samples.
Fig. 4 is a sample example of a data set.
2.2 System Performance test design and test data set
To test the performance of the system, 3 experiments were designed:
experiment 1: the home environment includes privacy context detection in a training dataset. The test data a and b are obtained in the following way: pictures taken by subjects in the training set and subjects not in the training set, respectively, in the home environment. During testing, the system collects data at different angles in real time through the camera of the system. The experiment aims to test the detection robustness of the system to different detection objects.
Experiment 2: when the detection object (person) is the same, the detection environment does not include privacy detection in the training data set. After the situation in the training set is checked to change through the experiment, the accuracy of the system for the privacy detection content is checked. Test data c is: pictures of subjects in the training set in other home environments. During testing, the system collects data at different angles in real time through the camera of the system. The experiment examines the detection performance of the system to different detection environments.
Experiment 3: neither the detection object nor the home environment context includes privacy detection in the training dataset. In order to reflect the objectivity and diversity of the data, the test data d is collected and sorted from the network. During testing, data are provided for the detection system in a real-time acquisition mode through a simulation system camera. The performance of the experiment detection system is completely different from that of training data when both a detection object and an environment are different from each other.
The system performance test data set is used to test the performance of the proposed algorithm and the constructed platform in practical applications. For each of the four types of test data a, b, c and d, 40 pictures are tested in each context, so each type of data comprises 240 mutually different pictures across the 6 contexts. The four groups of tests involve 960 pictures in total. The test data set and the training set have no data in common.
3 training model parameter optimization results and analysis
Training the model takes a great deal of time, and different training scales affect the performance of the model. In order to obtain better performance from the proposed training model, the influence of the training step on the prediction probability estimation value is studied to find a better (or feasible) training-step scale. On the other hand, different learning rates also influence the recognition accuracy of the model, so the recognition accuracy of the model at different learning rates is investigated through experimental tests.
3.1 analysis of the relationship between the training step size and the predicted probability estimate
Ten different training-step scales are designed (see Table 1). For the 360 samples of the given verification data set, with the learning rate of the model set to 0.001 using YOLO, the statistics of the prediction probability estimation value, the recognition accuracy and the average single-picture recognition time of the model are shown in Table 1; the variation trend is shown in fig. 5, and the box plot of the class estimation values of the model at different training steps is shown in fig. 6.
As can be seen from fig. 5 and Table 1, when the training step is 1000, the average prediction probability estimation value is 0.588 and the recognition accuracy is 0.733. As the training step increases, the prediction probability estimation value and the privacy context recognition accuracy of the model show an upward trend; when the training-step scale is 9000, the average prediction probability estimation value of the model reaches its maximum of 0.830, and the average recognition accuracy also reaches its maximum of 0.967. When the training step is increased to 20000, the average prediction probability estimation value of the model drops to 0.568 and the average prediction accuracy is 0.417. Meanwhile, as can be seen from fig. 6, when the training steps are 1000 to 7000, although there are few outliers outside the rectangles, the rectangles of the box plot are long and the median lines are low. When the training steps are 8000 and 10000, although the median lines of the data are high, there are many outliers outside the rectangles, including prediction estimation values close to 0. When the training step is 9000, the rectangle of the box plot is the narrowest and its median line is the highest among all cases; although there are outliers outside the rectangle, the lowest of them are all higher than the lowest rectangle regions corresponding to training steps of 2000, 3000 and 4000. Further inspection of the corresponding data shows that there are only 2 outliers and both are greater than 0.45.
TABLE 1 model Performance at different steps
As can be seen from the time overhead statistics in Table 1, the average overhead time of the system is between 2.1 ms and 2.6 ms; the recognition time of the model is short and meets the requirements of real-time detection applications with relatively low real-time demands.
In summary, the analysis leads to the following conclusion: the proposed model achieves the best prediction estimates and recognition accuracy when the training step is set to 9000.
3.2 identification Performance test results and analysis at different learning rates
To obtain the learning rate setting at which the model performs best, and in combination with the conclusions of the previous section, the model performance at learning rates of 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶, 10⁻⁷, 10⁻⁸, 10⁻⁹ and 10⁻¹⁰ was examined with the training step set to 9000. For the 360 samples of the designed validation data set, the statistics of the prediction probability estimation value and the mean recognition accuracy of the model are shown in Table 2, fig. 7 and fig. 8.
The data of Table 2 and fig. 7 show that when the learning rate is greater than 0.1, the average probability prediction estimation value and the recognition accuracy of the model both tend to increase as the learning rate decreases. When the learning rate is 10⁻¹, the prediction probability estimation value reaches its maximum of 0.911 and the average recognition accuracy reaches 1. When the learning rate decreases from 10⁻¹ to 10⁻⁴, the prediction probability estimation value stays above 0.8, the average recognition accuracy is about 0.94, and the change of the learning rate has little influence on these two performance indexes. When the learning rate decreases from 10⁻⁴ to 10⁻¹⁰, the mean values of the prediction probability estimation value and the recognition accuracy drop significantly as the learning rate becomes smaller, and their lowest mean values are 0.315 and 0.417, respectively.
Observing fig. 8, it can further be found that when the learning rate is 1, the area of the corresponding rectangular box is the largest; although the corresponding mean value in Table 2 is only 0.67, the rectangular box in the box plot extends to the 0.9 mark on the vertical axis, which indicates that a certain number of prediction estimation values are greater than 0.9. When the learning rate is 0.1, although there are some outliers, the rectangular area is small, indicating that the system outputs large prediction estimation class values in most cases. When the learning rate is in the range of 10⁻¹⁰ to 10⁻¹, the corresponding box plots have more outliers and a large number of smaller prediction probability estimates are output.
TABLE 2 statistical results of model Performance at different learning rates
In summary, it can be concluded that the proposed model performs better when the learning rate is set to 0.1, and this value can be used in applications.
4 application system performance testing
4.1 System Performance test results and analysis
The designed algorithm is deployed on the built service robot platform, the learning rate and the training step are respectively set to be 0.01 and 9000, the four types of data in the test data set are tested, the system situation recognition accuracy, the category estimation value and the time overhead statistical result are respectively shown in tables 3 and 4, and a prediction probability estimation value statistical box diagram is shown in FIG. 9. From these data it can be seen that:
1) The data in Table 4 show that for the class a test data, the average values of the C1-C6 context category estimates are 0.82, 0.968, 0.971, 0.972, 0.920 and 0.972, with corresponding standard deviations of 0.275, 0.006, 0.168, 0.038, 0.141 and 0.152. These high class estimates and small standard deviations indicate that the tested data can be classified into the corresponding classes with very high probability; for data in which both the object and the background are included in the training set, the system has a strong recognition capability for new contexts composed of the object and the background at different viewing angles. The results for the class b test data are slightly worse than the class a results as a whole: the class estimation values for the six contexts are 0.789, 0.849, 0.922, 0.977, 0.918 and 0.869, and the recognition accuracy decreases by 0.05, 0.025, 0.05 and 0.025 for C1, C2, C4 and C6, respectively. This indicates that changes in the detection object have some effect on the recognition performance of the system.
2) The results of experiment 2 show that the system performs very well in the C4 and C5 contexts, where the recognition accuracy reaches 1; the recognition accuracies for the C1-C3 and C6 contexts are 0.850, 0.950 and 0.925. Compared with the results for the class a and b test data, the corresponding mean prediction probability estimates for the C1-C3, C5 and C6 contexts are reduced by 0.069, 0.194, 0.034, 0.066 and 0.108, respectively.
TABLE 3 privacy recognition accuracy of the System for different test data sets
TABLE 4 statistical table of privacy class estimation values of system for different test data
This indicates that, by means of the features obtained from the limited training set, new contexts formed by objects in the training set and home environments not in the training set can be predicted with high recognition accuracy, but the change of home environment reduces the context recognition performance of the system.
3) The data of experiment 3 show that, although the recognition accuracy of the system is at most 0.975 and at least 0.85, the means of the prediction estimation values lie in a relatively low interval [0.713, 0.89]. This indicates that when both the home environment and the detection object change, the recognition accuracy and the category estimation value of the system decrease. It is worth noting, however, that the class d data come from the network and differ considerably from the training data in background theme, object and acquisition angle, yet the system still achieves a recognition accuracy above 0.85, which indicates that the system is rather robust when recognizing new samples with large differences.
4) As can be seen from the box plot in fig. 9, although the system as a whole achieves a recognition accuracy of 94.48%, there are outliers outside the rectangles, especially points where the prediction estimation value is very small. This indicates that for certain contexts the system makes its decision with a very low prediction probability estimate, and the recognition robustness of the system for this kind of data needs to be improved.
4.2 System identification of erroneous data analysis
From the above analysis, the constructed system has a context recognition error rate of 5.52%; 53 misrecognized pictures were found among the 960 test pictures, and fig. 10 shows samples of such data. Analysis of these pictures reveals that:
1) The data collected by the system camera are characterized by dim lighting or overexposed bright areas; examination of the training data shows that no such samples are present in the training set.
2) Pictures from the network are characterized by low resolution or single color, which introduces strong noise.
Therefore, in order to improve the recognition performance of the system, the sample diversity of the training set should be expanded, and the misrecognized samples should be added to the corresponding training data set to obtain a more universal feature model.
In a word, the invention improves the structure and feature extraction process of the YOLO neural network, increases the image grid division size, and designs an RPN-based sliding window merging algorithm, forming a feature extraction method based on the improved YOLO, namely the method of the invention. Experimental analysis on the privacy context data set and the constructed service robot platform shows that the proposed feature extraction algorithm enables the service robot system to better recognize privacy-related contexts in the smart home environment, with an average recognition accuracy of 94.48% and a system recognition time in the range of 1.62 ms to 3.32 ms; the algorithm has good robustness and can detect privacy contexts in the home environment in real time.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are within the scope of the present invention without departing from the technical spirit of the present invention.