CN110334769A - Target identification method and device - Google Patents
Target identification method and device
- Publication number
- CN110334769A CN110334769A CN201910614107.8A CN201910614107A CN110334769A CN 110334769 A CN110334769 A CN 110334769A CN 201910614107 A CN201910614107 A CN 201910614107A CN 110334769 A CN110334769 A CN 110334769A
- Authority
- CN
- China
- Prior art keywords
- layer
- pixel
- depth image
- image
- rgb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 34
- 238000011176 pooling Methods 0.000 claims description 58
- 238000012545 processing Methods 0.000 claims description 32
- 238000013527 convolutional neural network Methods 0.000 claims description 14
- 238000012549 training Methods 0.000 claims description 9
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 238000010586 diagram Methods 0.000 description 5
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 230000007613 environmental effect Effects 0.000 description 3
- 238000005286 illumination Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000005215 recombination Methods 0.000 description 1
- 230000006798 recombination Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000008439 repair process Effects 0.000 description 1
- 230000001629 suppression Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/77—Retouching; Inpainting; Scratch removal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The embodiments of the present application disclose a target identification method and device. An RGB image and a depth image of a target area are acquired; hole filling is performed on the depth image to obtain a repaired depth image; the repaired depth image is encoded to obtain a three-channel depth image; and the RGB image and the three-channel depth image are input into a pre-trained recognition model to obtain a target recognition result in the RGB image. By performing target recognition with a pre-trained recognition model that combines the RGB image and the depth image, the application improves the accuracy of target recognition.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target identification method and apparatus.
Background
Current target recognition is based on RGB images: a target is recognized by extracting color features, texture features and contour features from the RGB image. However, because imaging is affected by environmental factors such as illumination, the features extracted in existing RGB-based target recognition cannot fully reflect the usable feature information of the target, and the recognition accuracy of the target is therefore low.
Disclosure of Invention
The application aims to provide a target identification method and a target identification device so as to improve the accuracy of target identification. The technical scheme is as follows:
an object recognition method, comprising:
collecting an RGB image and a depth image of a target area;
filling holes in the depth image to obtain a repaired depth image;
coding the restored depth image to obtain a three-channel depth image;
inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image; the identification model is obtained by training a plurality of labeled RGB images and depth images corresponding to the labeled RGB images serving as samples in advance.
In the above method, preferably, the filling holes in the depth image to obtain a repaired depth image includes:
carrying out binarization processing on the depth image to obtain a mask;
determining a hole point in the depth image according to the mask;
clustering pixel values in the grayed RGB images to obtain clustered images, wherein the clustered images identify pixel points with approximate pixel values in the grayed RGB images;
determining a first pixel corresponding to the void point and all second pixels of the same kind as the first pixel in the grayed RGB image, wherein the second pixels correspond to non-void points in the depth image;
calculating the distance between the first pixel and each second pixel;
and taking the depth value corresponding to the second pixel with the shortest distance to the first pixel as the filling value of the hole point.
In the above method, preferably, the filling holes in the depth image to obtain a repaired depth image includes:
carrying out binarization processing on the depth image to obtain a mask;
determining a hole point in the depth image according to the mask;
determining a first pixel corresponding to the void point and a second pixel in a preset neighborhood of the first pixel in the RGB image, wherein the second pixel is a pixel in the preset neighborhood corresponding to a non-void point;
calculating the distance between the first pixel and each second pixel;
and taking the depth value corresponding to the second pixel with the shortest distance to the first pixel as the filling value of the hole point.
The above method, preferably, the identification model includes:
a deep network unit and a convolutional neural network unit; wherein,
the depth network unit is used for processing the three-channel depth image so as to extract the characteristics of the three-channel depth image;
the convolution neural network unit is used for processing the RGB image, extracting the characteristics of the RGB image, processing the characteristics of the three-channel depth image and the characteristics of the RGB image, and obtaining a target identification result in the RGB image.
In the above method, preferably, the deep network unit includes: three layers of multilayer perceptron convolution layers;
the convolutional neural network unit includes: two convolution-pooling layers; a two-layer first Inception module connected to the two convolution-pooling layers; a first pooling layer connected to the two-layer first Inception module; a five-layer second Inception module connected to the first pooling layer; a second pooling layer connected to the five-layer second Inception module; a two-layer third Inception module connected to the second pooling layer; a third pooling layer connected to the two-layer third Inception module; a signal loss layer connected to the third pooling layer; a linear layer connected to the signal loss layer; a classification layer connected to the linear layer; a decision layer connected to the classification layer; and an output layer connected to the decision layer.
An object recognition apparatus comprising:
the acquisition module is used for acquiring the RGB image and the depth image of the target area;
the filling module is used for filling the hole in the depth image to obtain a repaired depth image;
the coding module is used for coding the repaired depth image to obtain a three-channel depth image;
the recognition module is used for inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image; the identification model is obtained by training a plurality of labeled RGB images and depth images corresponding to the labeled RGB images serving as samples in advance.
The above apparatus, preferably, the filling module includes:
a binarization unit, configured to perform binarization processing on the depth image to obtain a mask;
a first determining unit, configured to determine a hole point in the depth image according to the mask;
the clustering unit is used for clustering the pixel values in the grayed RGB images to obtain clustered images, and the clustered images identify pixels with approximate pixel values in the grayed RGB images;
a second determining unit, configured to determine, in the grayed RGB image, a first pixel corresponding to the hole point and all second pixels of the same kind as the first pixel, where the second pixels correspond to non-hole points in the depth image;
a calculating unit for calculating a distance between the first pixel and each of the second pixels;
and a filling unit configured to use the depth value corresponding to the second pixel having the shortest distance to the first pixel as the filling value of the hole point.
The above apparatus, preferably, the filling module includes:
a binarization unit, configured to perform binarization processing on the depth image to obtain a mask;
a first determining unit, configured to determine a hole point in the depth image according to the mask;
a third determining unit, configured to determine, in the RGB image, a first pixel corresponding to the void point and a second pixel in a preset neighborhood of the first pixel, where the second pixel is a pixel in the preset neighborhood corresponding to a non-void point;
a calculating unit for calculating a distance between the first pixel and each of the second pixels;
and a filling unit configured to use the depth value corresponding to the second pixel having the shortest distance to the first pixel as the filling value of the hole point.
The above apparatus, preferably, the identification model includes: a deep network unit and a convolutional neural network unit; wherein,
the depth network unit is used for processing the three-channel depth image so as to extract the characteristics of the three-channel depth image;
the convolution neural network unit is used for processing the RGB image, extracting the characteristics of the RGB image, processing the characteristics of the three-channel depth image and the characteristics of the RGB image, and obtaining a target identification result in the RGB image.
The above apparatus, preferably, the deep network unit includes: three layers of multilayer perceptron convolution layers;
the convolutional neural network unit includes: two convolution-pooling layers; a two-layer first Inception module connected to the two convolution-pooling layers; a first pooling layer connected to the two-layer first Inception module; a five-layer second Inception module connected to the first pooling layer; a second pooling layer connected to the five-layer second Inception module; a two-layer third Inception module connected to the second pooling layer; a third pooling layer connected to the two-layer third Inception module; a signal loss layer connected to the third pooling layer; a linear layer connected to the signal loss layer; a classification layer connected to the linear layer; a decision layer connected to the classification layer; and an output layer connected to the decision layer.
According to the scheme, the target identification method and the target identification device collect the RGB image and the depth image of the target area; filling holes in the depth image to obtain a repaired depth image; coding the restored depth image to obtain a three-channel depth image; and inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image. According to the method and the device, the target recognition is carried out by utilizing the pre-trained recognition model and combining the RGB image and the depth image, the accuracy of the target recognition is improved, and the problem that the recognition accuracy is low due to the influence of environmental factors such as illumination on the existing target recognition method is solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of an implementation of a target identification method according to an embodiment of the present application;
fig. 2 is a flowchart of an implementation of filling a hole in a depth image to obtain a restored depth image according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a recognition model provided in an embodiment of the present application;
fig. 4 is an exemplary diagram of an Inception module provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an object recognition apparatus according to an embodiment of the present application;
fig. 6 is a frame of image to be subjected to target recognition according to an embodiment of the present disclosure;
fig. 7 is a target recognition result obtained by processing the image shown in fig. 6 and the corresponding depth image based on the target recognition method provided in the embodiment of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced otherwise than as specifically illustrated.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of an implementation of a target identification method according to an embodiment of the present application, which may include:
step S101: the RGB image and the depth image of the target area are collected.
An RGB-D depth camera may be employed to capture the RGB image and the depth image of the target area. When the RGB-D depth camera collects images, one frame of depth image can be collected at the same time as each frame of RGB image. For display, only the RGB image needs to be shown.
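As an illustration only, the following is a minimal capture sketch in Python assuming an Intel RealSense sensor and the pyrealsense2 package; the embodiment does not name a specific camera model, and any RGB-D device that returns paired, aligned color and depth frames could be substituted.

```python
# Minimal RGB-D capture sketch (assumes an Intel RealSense sensor and pyrealsense2;
# the embodiment only requires a camera that returns paired RGB and depth frames).
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Align the depth frame to the color frame so pixels correspond one-to-one.
align = rs.align(rs.stream.color)
try:
    frames = align.process(pipeline.wait_for_frames())
    depth = np.asanyarray(frames.get_depth_frame().get_data())  # uint16 depth values
    rgb = np.asanyarray(frames.get_color_frame().get_data())    # uint8, H x W x 3 (BGR)
finally:
    pipeline.stop()
```

A depth array captured this way contains zero values at invalid pixels, which is exactly where the hole filling of step S102 applies.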
Step S102: and filling holes in the depth image to obtain a repaired depth image.
There are often holes in the depth image acquired with the depth camera that need to be repaired. In an alternative embodiment, the holes may be filled with depth values of pixels around the holes to repair the depth map.
Step S103: and coding the repaired depth image to obtain a three-channel depth image.
Optionally, the repaired depth image may be encoded using the HHA encoding method; the three channels of the resulting three-channel depth image are the horizontal disparity, the height above the ground, and the angle between the surface normal and the direction of gravity. HHA encoding emphasizes the complementary information between the channels.
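The following Python sketch illustrates a simplified HHA-style encoding under stated assumptions: the camera intrinsics (fx, fy, cx, cy) and the gravity direction are taken as known, the height channel is approximated by the coordinate along the assumed gravity axis, and the angle channel uses surface normals estimated from depth gradients. It is an approximation for illustration, not the exact encoder of the embodiment.

```python
# Simplified HHA-style encoding sketch (an approximation, not the exact encoder of the
# embodiment): channel 1 ~ horizontal disparity, channel 2 ~ height above the ground,
# channel 3 ~ angle between the surface normal and the (assumed known) gravity direction.
import numpy as np

def encode_hha(depth_m, fx, fy, cx, cy, gravity=np.array([0.0, -1.0, 0.0])):
    """depth_m: float depth map in meters, zeros/holes already repaired or masked."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = np.where(depth_m > 0, depth_m, np.nan)

    # Back-project to camera coordinates using the assumed intrinsics.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Channel 1: horizontal disparity (proportional to inverse depth).
    disparity = 1.0 / z

    # Channel 2: height above ground, approximated as the signed coordinate along the
    # assumed gravity direction, shifted so the lowest point is zero (the full HHA
    # encoding instead estimates a ground plane).
    height = -(x * gravity[0] + y * gravity[1] + z * gravity[2])
    height -= np.nanmin(height)

    # Channel 3: angle between the surface normal (from depth gradients) and gravity.
    dzdx = np.gradient(z, axis=1)
    dzdy = np.gradient(z, axis=0)
    normals = np.dstack((-dzdx, -dzdy, np.ones_like(z)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    angle = np.degrees(np.arccos(np.clip(normals @ gravity, -1.0, 1.0)))

    def to_uint8(c):
        c = np.nan_to_num(c, nan=0.0)
        lo, hi = c.min(), c.max()
        return np.uint8(255 * (c - lo) / (hi - lo + 1e-6))

    return np.dstack([to_uint8(disparity), to_uint8(height), to_uint8(angle)])
```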
Step S104: inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image; the recognition model is obtained by training a plurality of RGB images marked with targets and depth images corresponding to the marked RGB images as samples in advance.
In the embodiment of the application, a plurality of pairs of RGB images and depth images acquired by an RGB-D depth camera are used in advance as training samples, and the annotation information corresponding to each RGB image is used as the label, so that the recognition model is obtained by training. The annotation information corresponding to an RGB image may include a text identifier corresponding to a specified region in the RGB image. The text identifier need not be drawn into the RGB image; instead, it is stored in association with the RGB image together with the specified-region information (the specified region is marked by a graphic in the RGB image, for example a rectangular box), where the specified-region information indicates the position of the target in the RGB image.
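Purely for illustration, one training sample as described above might be organized as follows; the field names and the bounding-box format are assumptions, not part of the embodiment.

```python
# Hypothetical layout of one training sample: the RGB image, its paired depth image,
# and annotation information stored alongside the RGB image rather than drawn into it.
sample = {
    "rgb_path": "samples/000123_rgb.png",
    "depth_path": "samples/000123_depth.png",
    "annotations": [
        {"label": "chair",               # text identifier of the target
         "bbox": [120, 80, 260, 300]},   # rectangular region: x1, y1, x2, y2 (assumed format)
    ],
}
```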
According to the target recognition method, the pre-trained recognition model is used, the depth image and the RGB image are combined to perform target recognition, the target recognition precision is improved, and the problem that the recognition accuracy is low due to the fact that the existing target recognition method is influenced by environmental factors such as illumination is solved.
In an optional embodiment, an implementation flowchart of the above hole filling for a depth image to obtain a restored depth image is shown in fig. 2, and may include:
step S201: and carrying out binarization processing on the depth image to obtain a mask.
Optionally, in the depth image, a point with a depth value of zero may be binarized to 0, and a point with a non-zero depth value may be binarized to 255, which may be expressed as:
mask(i, j) = 0 if A(i, j) = 0, and mask(i, j) = 255 if A(i, j) ≠ 0,
where mask denotes the mask and A(i, j) denotes the depth value at position (i, j).
Step S202: determining a hole point in the depth image according to the mask.
The hole point is a point in the mask with a value of 0.
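A short sketch of steps S201-S202, assuming Python with numpy and following the 0/255 convention given above:

```python
# Sketch of steps S201-S202: build a binary mask from the depth image and locate
# the hole points (pixels whose depth value is zero).
import numpy as np

def depth_mask_and_holes(depth):
    mask = np.where(depth == 0, 0, 255).astype(np.uint8)  # mask(i, j) per the formula above
    hole_rows, hole_cols = np.where(mask == 0)            # hole points: mask value 0
    return mask, list(zip(hole_rows.tolist(), hole_cols.tolist()))
```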
Step S203: and clustering the pixel values in the grayed RGB image to obtain a cluster image, where the cluster image identifies pixels with similar pixel values in the grayed RGB image.
The grayed RGB image refers to a grayscale image converted from an RGB image. Optionally, a K-means algorithm may be used to cluster pixel values in the grayed RGB image. Alternatively, other clustering algorithms may be used to cluster the pixel values in the grayed RGB images, such as hierarchical clustering algorithms. The cluster image characterizes which pixels in the grayed RGB image have similar pixel values.
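A sketch of step S203, assuming Python with OpenCV; the number of clusters k is an assumed parameter that the embodiment does not fix, and BGR channel ordering is assumed for the input image.

```python
# Sketch of step S203: gray the RGB image and cluster its pixel values with K-means
# (K-means is one of the clustering options named in the embodiment).
import cv2
import numpy as np

def cluster_gray_values(bgr, k=8):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    samples = gray.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(samples, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    # Cluster image: pixels sharing a label have similar gray values.
    return gray, labels.reshape(gray.shape)
```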
Step S204: and determining, in the grayed RGB image, the first pixel corresponding to the hole point and all second pixels belonging to the same cluster as the first pixel, where the second pixels correspond to non-hole points in the depth image.
Pixels in the RGB image correspond to pixels in the depth map one-to-one. The pixels belonging to the same cluster include both a first pixel corresponding to a hole point and a second pixel corresponding to a non-hole point.
Step S205: and calculating the distance between the first pixel and each second pixel.
In the embodiment of the present application, for the first pixel corresponding to each hole point, the distance between the first pixel and each second pixel in the same cluster is calculated from the pixel values (i.e., the gray values); the distance may be a Euclidean distance or another distance, such as a cosine similarity distance.
In an alternative embodiment, the distance between the first pixel and the second pixel may be a combined distance calculated from the Euclidean distance between their pixel values and the image pixel distance between them. The image pixel distance is the distance between two pixels measured in pixel coordinates. For example,
assume pixel a lies at row 10, column 30 of the image and pixel b lies at row 13, column 34; then the distance between the two pixels is 3 in the row direction and 4 in the column direction, so the image pixel distance between pixel a and pixel b is 5. After the Euclidean distance and the image pixel distance between pixel a and pixel b are obtained, the combined distance between them may be taken as the sum of the two (i.e., the Euclidean distance plus the image pixel distance), as a weighted sum of the two, or as the sum of the two after each is squared (optionally with weights).
Step S206: and taking the depth value corresponding to the second pixel with the shortest distance to the first pixel as the filling value of the hole point. That is, the hole point is filled with the depth value of the second pixel that is closest to the first pixel.
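Steps S204-S206 can be sketched as follows, assuming Python with numpy; the equal weighting of the gray-value distance and the image pixel distance in the combined distance is an assumption, since the embodiment also allows weighted variants.

```python
# Sketch of steps S204-S206: for each hole point, search the non-hole pixels in the
# same gray-value cluster and copy the depth of the nearest one, where "nearest" uses
# the combined distance described above (gray-value difference + image pixel distance).
import numpy as np

def fill_holes_by_cluster(depth, gray, cluster_img):
    filled = depth.copy().astype(np.float32)
    holes = np.argwhere(depth == 0)
    for (i, j) in holes:
        same_cluster = (cluster_img == cluster_img[i, j]) & (depth > 0)
        cand = np.argwhere(same_cluster)          # second pixels: non-hole, same cluster
        if cand.size == 0:
            continue
        gray_dist = np.abs(gray[cand[:, 0], cand[:, 1]].astype(np.float32) - float(gray[i, j]))
        pixel_dist = np.hypot(cand[:, 0] - i, cand[:, 1] - j)
        combined = gray_dist + pixel_dist         # equal weights assumed
        nearest = cand[np.argmin(combined)]
        filled[i, j] = depth[nearest[0], nearest[1]]  # fill with the nearest second pixel's depth
    return filled
```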
In an optional embodiment, the above-mentioned hole filling on the depth image to obtain the repaired depth image may alternatively be implemented as follows:
and carrying out binarization processing on the depth image to obtain a mask.
Determining a hole point in the depth image according to the mask.
The implementation of the above two steps can refer to the foregoing embodiments, and is not described in detail here.
And determining a first pixel corresponding to the void point and a second pixel in a preset neighborhood of the first pixel in the RGB image, wherein the second pixel is a pixel in the preset neighborhood corresponding to a non-void point in the depth image.
In this embodiment, the second pixel is a pixel in the neighborhood of the first pixel.
And calculating the distance between the first pixel and each second pixel. The calculation process can be referred to the foregoing embodiments and will not be described in detail here.
And taking the depth value corresponding to the second pixel with the shortest distance to the first pixel as the filling value of the hole point.
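A sketch of this neighborhood-based variant, assuming Python with numpy; the window radius and the use of the gray-value difference as the distance are assumptions, since the embodiment leaves both the neighborhood size and the distance measure open.

```python
# Sketch of the neighborhood variant: candidate second pixels are restricted to a
# preset window around the hole pixel instead of the whole cluster.
import numpy as np

def fill_holes_by_neighborhood(depth, gray, radius=5):
    filled = depth.copy().astype(np.float32)
    h, w = depth.shape
    for (i, j) in np.argwhere(depth == 0):
        r0, r1 = max(0, i - radius), min(h, i + radius + 1)
        c0, c1 = max(0, j - radius), min(w, j + radius + 1)
        window_depth = depth[r0:r1, c0:c1]
        window_gray = gray[r0:r1, c0:c1].astype(np.float32)
        valid = window_depth > 0                  # second pixels: non-hole neighbors
        if not valid.any():
            continue
        dist = np.abs(window_gray - float(gray[i, j]))  # gray-value distance (assumed choice)
        dist[~valid] = np.inf
        ni, nj = np.unravel_index(np.argmin(dist), dist.shape)
        filled[i, j] = window_depth[ni, nj]
    return filled
```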
In an alternative embodiment, a schematic structural diagram of the recognition model is shown in fig. 3, and may include: a deep network unit and a convolutional neural network unit; wherein,
the depth network unit (NIN network unit for short) is configured to process the three-channel depth image to extract features of the three-channel depth image. In the example shown in fig. 3, the input three-channel depth image is HHA _ Img, with a size of 300 × 300.
The convolutional neural network unit (CNN network unit for short) is used for processing the RGB image, extracting the characteristics of the RGB image, and processing the characteristics of the three-channel depth image and the characteristics of the RGB image to obtain a target identification result in the RGB image. In the example shown in fig. 3, the input RGB image is RGB _ Img, and the size is 300 × 300.
Optionally, the NIN network unit includes three multilayer-perceptron convolutional layers (i.e., three mlpconv layers, identified as NIN1, NIN2 and NIN3 in fig. 3). An mlpconv layer performs a normal convolution followed by a conventional MLP (multilayer perceptron). The multilayer perceptron here is a two-layer perceptron (input layer plus one hidden layer): it applies a weighted linear recombination to the elements at the same position across the feature maps output by the ordinary convolution, which is equivalent to the result of a 1×1 convolution on that local block; applying this operation at every position of the feature map is therefore equivalent to a 1×1 convolution. Because ordinary convolution is linear while the MLP is non-linear, the latter allows a higher level of abstraction and thus greater generalization capability. In the cross-channel case, mlpconv is equivalent to a convolutional layer followed by 1×1 convolutional layers.
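A sketch of one such mlpconv layer, assuming PyTorch; the channel counts and kernel sizes in the example stack are illustrative assumptions rather than values taken from the embodiment.

```python
# Sketch of an mlpconv layer as described above: a normal convolution followed by
# two 1x1 convolutions (the cross-channel MLP), each with a non-linearity.
import torch.nn as nn

def mlpconv(in_ch, out_ch, kernel_size, stride=1, padding=0):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),  # 1x1 conv = per-position MLP
        nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),
    )

# A NIN-style depth unit could then be stacked from three such layers, e.g.:
# nin_unit = nn.Sequential(mlpconv(3, 96, 11, 4), mlpconv(96, 256, 5, 1, 2), mlpconv(256, 384, 3, 1, 1))
```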
The CNN network unit comprises: two convolution-pooling layers; a two-layer first Inception module connected to the two convolution-pooling layers; a first pooling layer connected to the two-layer first Inception module; a five-layer second Inception module connected to the first pooling layer; a second pooling layer connected to the five-layer second Inception module; a two-layer third Inception module connected to the second pooling layer; a third pooling layer connected to the two-layer third Inception module; a signal loss layer connected to the third pooling layer; a linear layer connected to the signal loss layer; a classification layer connected to the linear layer; a decision layer connected to the classification layer; and an output layer connected to the decision layer.
Taking fig. 3 as an example, the 7×7 convolution layer Conv_7×7, a max pooling layer maxpool, the 3×3 convolution layer Conv_3×3 and another max pooling layer maxpool, connected in sequence, constitute the two convolution-pooling layers; Inception(3a) and Inception(3b), connected in sequence, form the two-layer first Inception module; the max pooling layer maxpool connected to Inception(3b) constitutes the first pooling layer; the sequentially connected Inception(4a)-Inception(4e) form the five-layer second Inception module; the max pooling layer maxpool connected to Inception(4e) forms the second pooling layer; Inception(5a) and Inception(5b), connected in sequence, form the two-layer third Inception module; the average pooling layer avgpool connected to Inception(5b) forms the third pooling layer; dropout is the signal loss layer; linear is the linear layer; softmax is the classification layer; detection is the decision layer; and non-maximum suppression is the output layer.
The Inception module convolves the features output by the previous layer at multiple scales in parallel and then re-aggregates them. An example of the Inception module is shown in fig. 4. The convolution kernel sizes used in the module are 1×1, 3×3 and 5×5: kernels of different sizes provide receptive fields of different sizes, and the final concatenation aggregates features at different scales. The kernel sizes 1×1, 3×3 and 5×5 are chosen mainly for convenience of alignment: with the convolution stride set to 1 and the padding parameters set to 0, 1 and 2 respectively, the convolutions produce features of the same spatial dimensions, which can then be concatenated directly. A 3×3 pooling branch is also introduced into the module. The deeper the network goes, the more abstract the features become and the larger the receptive field each feature covers; therefore, as the number of layers increases, the proportion of 3×3 and 5×5 convolutions also increases. However, 5×5 convolution kernels still incur a large amount of computation, so 1×1 convolution kernels are used for dimensionality reduction.
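A sketch of an Inception block matching this description, assuming PyTorch; the branch widths are illustrative assumptions.

```python
# Sketch of an Inception block per Fig. 4: parallel 1x1, 3x3 and 5x5 convolutions
# (with 1x1 reductions before the larger kernels) plus a 3x3 pooling branch, padded
# so all branches keep the spatial size and can be concatenated along channels.
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))   # pad 1 keeps size
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))   # pad 2 keeps size
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))
    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Example: Inception(192, 64, 96, 128, 16, 32, 32) mirrors the widths of the GoogLeNet "3a" block.
```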
Corresponding to the embodiment of the method, the present application further provides a target identification device, and a schematic structural diagram of the target identification device provided by the present application is shown in fig. 5, and the target identification device may include:
an acquisition module 51, a filling module 52, an encoding module 53 and an identification module 54; wherein,
the acquisition module 51 is configured to acquire an RGB image and a depth image of the target area;
the filling module 52 is configured to perform hole filling on the depth image to obtain a repaired depth image;
the encoding module 53 is configured to encode the repaired depth image to obtain a three-channel depth image;
the recognition module 54 is configured to input the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image; the identification model is obtained by training a plurality of labeled RGB images and depth images corresponding to the labeled RGB images serving as samples in advance.
The target recognition device provided by the application acquires an RGB image and a depth image of a target area; filling holes in the depth image to obtain a repaired depth image; coding the restored depth image to obtain a three-channel depth image; and inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image. According to the method and the device, the target recognition is carried out by utilizing the recognition model trained in advance and combining the RGB image and the depth image, and the accuracy of the target recognition is improved.
In an alternative embodiment, the filling module 52 may include:
a binarization unit, configured to perform binarization processing on the depth image to obtain a mask;
a first determining unit, configured to determine a hole point in the depth image according to the mask;
the clustering unit is used for clustering the pixel values in the grayed RGB images to obtain clustered images, and the clustered images identify pixels with approximate pixel values in the grayed RGB images;
a second determining unit, configured to determine, in the grayed RGB image, a first pixel corresponding to the hole point and all second pixels of the same kind as the first pixel, where the second pixels correspond to non-hole points in the depth image;
a calculating unit for calculating a distance between the first pixel and each of the second pixels;
and a filling unit configured to use the depth value corresponding to the second pixel having the shortest distance to the first pixel as the filling value of the hole point.
In an alternative embodiment, the filling module 52 may include:
a binarization unit, configured to perform binarization processing on the depth image to obtain a mask;
a first determining unit, configured to determine a hole point in the depth image according to the mask;
a third determining unit, configured to determine, in the RGB image, a first pixel corresponding to the void point and a second pixel in a preset neighborhood of the first pixel, where the second pixel is a pixel in the preset neighborhood corresponding to a non-void point;
a calculating unit for calculating a distance between the first pixel and each of the second pixels;
and a filling unit configured to use the depth value corresponding to the second pixel having the shortest distance to the first pixel as the filling value of the hole point.
In an alternative embodiment, the encoding module 53 may specifically be configured to: and performing HHA coding on the repaired depth image to obtain a three-channel depth image.
In an alternative embodiment, the identifying the model may include: a deep network unit and a convolutional neural network unit; wherein,
the depth network unit is used for processing the three-channel depth image so as to extract the characteristics of the three-channel depth image;
the convolution neural network unit is used for processing the RGB image, extracting the characteristics of the RGB image, processing the characteristics of the three-channel depth image and the characteristics of the RGB image, and obtaining a target identification result in the RGB image.
In an optional embodiment, the deep network unit includes: three layers of multilayer perceptron convolution layers;
the convolutional neural network unit includes: two convolution-pooling layers; a two-layer first Inception module connected to the two convolution-pooling layers; a first pooling layer connected to the two-layer first Inception module; a five-layer second Inception module connected to the first pooling layer; a second pooling layer connected to the five-layer second Inception module; a two-layer third Inception module connected to the second pooling layer; a third pooling layer connected to the two-layer third Inception module; a signal loss layer connected to the third pooling layer; a linear layer connected to the signal loss layer; a classification layer connected to the linear layer; a decision layer connected to the classification layer; and an output layer connected to the decision layer.
As shown in figs. 6-7, fig. 6 is a frame of image to be subjected to target recognition; this frame and its corresponding depth image are processed with the target recognition method provided by the present application, and the resulting target recognition result is shown in fig. 7. In this example, the target is a chair, and the recognition model is trained with training samples that contain chairs, yielding a chair recognition model.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the technical problems can be solved by combining features of the embodiments and of the claims.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method of object recognition, comprising:
collecting an RGB image and a depth image of a target area;
filling holes in the depth image to obtain a repaired depth image;
coding the restored depth image to obtain a three-channel depth image;
inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image; the identification model is obtained by training a plurality of labeled RGB images and depth images corresponding to the labeled RGB images serving as samples in advance.
2. The method of claim 1, wherein the filling holes in the depth image to obtain a repaired depth image comprises:
carrying out binarization processing on the depth image to obtain a mask;
determining a hole point in the depth image according to the mask;
clustering pixel values in the grayed RGB images to obtain clustered images, wherein the clustered images identify pixel points with approximate pixel values in the grayed RGB images;
determining a first pixel corresponding to the void point and all second pixels of the same kind as the first pixel in the grayed RGB image, wherein the second pixels correspond to non-void points in the depth image;
calculating the distance between the first pixel and each second pixel;
and taking the depth value corresponding to the second pixel with the shortest distance to the first pixel as the filling value of the hole point.
3. The method of claim 1, wherein the filling holes in the depth image to obtain a repaired depth image comprises:
carrying out binarization processing on the depth image to obtain a mask;
determining a hole point in the depth image according to the mask;
determining a first pixel corresponding to the void point and a second pixel in a preset neighborhood of the first pixel in the RGB image, wherein the second pixel is a pixel in the preset neighborhood corresponding to a non-void point;
calculating the distance between the first pixel and each second pixel;
and taking the depth value corresponding to the second pixel with the shortest distance to the first pixel as the filling value of the hole point.
4. The method of claim 1, wherein the recognition model comprises:
a deep network unit and a convolutional neural network unit; wherein,
the depth network unit is used for processing the three-channel depth image so as to extract the characteristics of the three-channel depth image;
the convolution neural network unit is used for processing the RGB image, extracting the characteristics of the RGB image, processing the characteristics of the three-channel depth image and the characteristics of the RGB image, and obtaining a target identification result in the RGB image.
5. The method of claim 4, wherein the deep network unit comprises: three layers of multilayer perceptron convolution layers;
the convolutional neural network unit includes: two convolution-pooling layers; a two-layer first Inception module connected to the two convolution-pooling layers; a first pooling layer connected to the two-layer first Inception module; a five-layer second Inception module connected to the first pooling layer; a second pooling layer connected to the five-layer second Inception module; a two-layer third Inception module connected to the second pooling layer; a third pooling layer connected to the two-layer third Inception module; a signal loss layer connected to the third pooling layer; a linear layer connected to the signal loss layer; a classification layer connected to the linear layer; a decision layer connected to the classification layer; and an output layer connected to the decision layer.
6. An object recognition apparatus, comprising:
the acquisition module is used for acquiring the RGB image and the depth image of the target area;
the filling module is used for filling the hole in the depth image to obtain a repaired depth image;
the coding module is used for coding the repaired depth image to obtain a three-channel depth image;
the recognition module is used for inputting the RGB image and the three-channel depth image into a pre-trained recognition model to obtain a target recognition result in the RGB image; the identification model is obtained by training a plurality of labeled RGB images and depth images corresponding to the labeled RGB images serving as samples in advance.
7. The apparatus of claim 6, wherein the fill module comprises:
a binarization unit, configured to perform binarization processing on the depth image to obtain a mask;
a first determining unit, configured to determine a hole point in the depth image according to the mask;
the clustering unit is used for clustering the pixel values in the grayed RGB images to obtain clustered images, and the clustered images identify pixels with approximate pixel values in the grayed RGB images;
a second determining unit, configured to determine, in the grayed RGB image, a first pixel corresponding to the hole point and all second pixels of the same kind as the first pixel, where the second pixels correspond to non-hole points in the depth image;
a calculating unit for calculating a distance between the first pixel and each of the second pixels;
and a filling unit configured to use the depth value corresponding to the second pixel having the shortest distance to the first pixel as the filling value of the hole point.
8. The apparatus of claim 6, wherein the fill module comprises:
a binarization unit, configured to perform binarization processing on the depth image to obtain a mask;
a first determining unit, configured to determine a hole point in the depth image according to the mask;
a third determining unit, configured to determine, in the RGB image, a first pixel corresponding to the void point and a second pixel in a preset neighborhood of the first pixel, where the second pixel is a pixel in the preset neighborhood corresponding to a non-void point;
a calculating unit for calculating a distance between the first pixel and each of the second pixels;
and a filling unit configured to use the depth value corresponding to the second pixel having the shortest distance to the first pixel as the filling value of the hole point.
9. The apparatus of claim 6, wherein the recognition model comprises: a deep network unit and a convolutional neural network unit; wherein,
the depth network unit is used for processing the three-channel depth image so as to extract the characteristics of the three-channel depth image;
the convolution neural network unit is used for processing the RGB image, extracting the characteristics of the RGB image, processing the characteristics of the three-channel depth image and the characteristics of the RGB image, and obtaining a target identification result in the RGB image.
10. The apparatus of claim 9, wherein the deep network unit comprises: three layers of multilayer perceptron convolution layers;
the convolutional neural network unit includes: two convolution-pooling layers; a two-layer first Inception module connected to the two convolution-pooling layers; a first pooling layer connected to the two-layer first Inception module; a five-layer second Inception module connected to the first pooling layer; a second pooling layer connected to the five-layer second Inception module; a two-layer third Inception module connected to the second pooling layer; a third pooling layer connected to the two-layer third Inception module; a signal loss layer connected to the third pooling layer; a linear layer connected to the signal loss layer; a classification layer connected to the linear layer; a decision layer connected to the classification layer; and an output layer connected to the decision layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910614107.8A CN110334769A (en) | 2019-07-09 | 2019-07-09 | Target identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910614107.8A CN110334769A (en) | 2019-07-09 | 2019-07-09 | Target identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334769A true CN110334769A (en) | 2019-10-15 |
Family
ID=68143410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910614107.8A Pending CN110334769A (en) | 2019-07-09 | 2019-07-09 | Target identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334769A (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102447925A (en) * | 2011-09-09 | 2012-05-09 | 青岛海信数字多媒体技术国家重点实验室有限公司 | Virtual viewpoint image synthesis method and device |
CN102625127A (en) * | 2012-03-24 | 2012-08-01 | 山东大学 | Optimization method suitable for virtual viewpoint generation of 3D television |
CN103236082A (en) * | 2013-04-27 | 2013-08-07 | 南京邮电大学 | Quasi-three dimensional reconstruction method for acquiring two-dimensional videos of static scenes |
CN103248909A (en) * | 2013-05-21 | 2013-08-14 | 清华大学 | Method and system of converting monocular video into stereoscopic video |
US10062004B2 (en) * | 2015-08-20 | 2018-08-28 | Kabushiki Kaisha Toshiba | Arrangement detection apparatus and pickup apparatus |
US20170069071A1 (en) * | 2015-09-04 | 2017-03-09 | Electronics And Telecommunications Research Institute | Apparatus and method for extracting person region based on red/green/blue-depth image |
CN106651871A (en) * | 2016-11-18 | 2017-05-10 | 华东师范大学 | Automatic filling method for cavities in depth image |
CN108230380A (en) * | 2016-12-09 | 2018-06-29 | 广东技术师范学院 | Indoor entrance detection method based on the three-dimensional depth of field |
CN107977650A (en) * | 2017-12-21 | 2018-05-01 | 北京华捷艾米科技有限公司 | Method for detecting human face and device |
CN108734210A (en) * | 2018-05-17 | 2018-11-02 | 浙江工业大学 | A kind of method for checking object based on cross-module state multi-scale feature fusion |
CN109636732A (en) * | 2018-10-24 | 2019-04-16 | 深圳先进技术研究院 | A kind of empty restorative procedure and image processing apparatus of depth image |
Non-Patent Citations (1)
Title |
---|
LUKAS SCHNEIDER et al.: "Multimodal Neural Networks: RGB-D for Semantic Segmentation and Object Detection", Scandinavian Conference on Image Analysis *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102199A (en) * | 2020-09-18 | 2020-12-18 | 贝壳技术有限公司 | Method, device and system for filling hole area of depth image |
CN112102199B (en) * | 2020-09-18 | 2024-11-08 | 贝壳技术有限公司 | Depth image cavity region filling method, device and system |
CN113393421A (en) * | 2021-05-08 | 2021-09-14 | 深圳市识农智能科技有限公司 | Fruit evaluation method and device and inspection equipment |
CN113902786A (en) * | 2021-09-23 | 2022-01-07 | 珠海视熙科技有限公司 | Depth image preprocessing method, system and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191015 |