CN110852327A - Image processing method, image processing device, electronic equipment and storage medium - Google Patents
- Publication number
- CN110852327A CN110852327A CN201911084122.2A CN201911084122A CN110852327A CN 110852327 A CN110852327 A CN 110852327A CN 201911084122 A CN201911084122 A CN 201911084122A CN 110852327 A CN110852327 A CN 110852327A
- Authority
- CN
- China
- Prior art keywords
- image
- static
- processed
- local
- images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The embodiments of the disclosure disclose an image processing method, an image processing apparatus, an electronic device, and a storage medium, wherein the method comprises the following steps: filtering the dynamic objects out of the images to be processed in the image set to be processed to obtain static images; extracting local features from the static images with a CNN network and clustering the local features to obtain a plurality of cluster center points; and constructing a visual bag-of-words model of the static images from the plurality of cluster center points, mapping the local features of the static images onto the constructed visual bag-of-words model, representing the images in the set as feature vectors according to the visual bag-of-words model, and performing closed-loop detection on the results of the feature vector representation.
Description
Technical Field
The present disclosure relates to the field of image technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
In recent years, with the rapid development of robotics and autonomous driving technologies, Visual Simultaneous Localization and Mapping (vSLAM) plays an increasingly important role. Closed-loop detection is an important component of a vSLAM system: it is mainly used to determine whether the robot's current position lies in a region of space it has visited before, and it addresses problems such as failed map creation, lost localization, and redundant or repeated map structures caused by accumulated computational errors in large-scale complex environments (sensor errors, noise, and environmental changes), thereby improving the accuracy and robustness of vSLAM.
Existing closed-loop detection methods can be divided into two categories: traditional closed-loop detection methods based on hand-crafted features and closed-loop detection methods based on deep learning. Traditional closed-loop detection algorithms rely on hand-crafted feature descriptors to perform image similarity matching and can be divided into methods based on local feature descriptors (SIFT, SURF, and ORB) and methods based on global feature descriptors (GIST). The most common approach among those based on local feature descriptors is the Visual Bag of Words (BoVW): its main idea is to extract local features from the image and cluster the extracted feature descriptors to build a visual dictionary. The image is then represented as a K-dimensional numerical vector according to the visual dictionary tree, and whether a closed loop exists is judged by measuring the distance between image feature vectors. Because traditional hand-crafted closed-loop detection algorithms extract regional features such as edges, corners, lines, curves, and other special attributes from local regions of the image, they are accurate when the surrounding environment changes little. However, because hand-crafted feature algorithms mostly rely on human expertise and experience and cannot express images accurately, they struggle to provide accurate and robust image feature descriptions when the surrounding environment changes significantly (for example, in scenes with strong illumination changes) and have poor stability, so the mismatch rate increases in complex environments.
With the rapid development of deep learning in recent years, CNNs (convolutional neural networks), which can extract deep-level features of an image with richer data information and stronger feature expression capability, have been used to learn features from large amounts of data in place of traditionally hand-designed features to solve the closed-loop detection problem. When illumination changes significantly, CNN-based image descriptors outperform hand-designed descriptors; however, because CNN-based closed-loop detection algorithms extract global image features, their accuracy in a stable environment without illumination changes is lower than that of traditional closed-loop detection algorithms based on local features.
Disclosure of Invention
The embodiment of the disclosure provides an image processing method and device, electronic equipment and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
filtering dynamic objects in the images to be processed in the image set to be processed to obtain static images;
extracting local features in the static image by using a CNN network, and clustering the local features to obtain a plurality of clustering central points;
and constructing a visual bag-of-words model of the static image according to the plurality of clustering center points, mapping local features of the static image to the constructed visual bag-of-words model, expressing feature vectors of the image set to be processed according to the visual bag-of-words model, and performing closed-loop detection according to the result expressed by the feature vectors.
Further, filtering the dynamic object in the to-be-processed image set to obtain a static image, including:
detecting a dynamic object in the image to be processed, and extracting the region information of the dynamic object;
and filtering the dynamic object from the image to be processed according to the region information to obtain the static image.
Further, extracting local features in the static image by using a CNN network, comprising:
processing the static image by utilizing a multi-scale dense full convolution network, and returning key point information in the static image;
locally cutting the static image according to the key point information to obtain a plurality of local image blocks;
and respectively extracting the features of the local image blocks by using a local feature network to obtain the feature point positions and the feature descriptors of the local image blocks.
Further, mapping local features of the static image onto the constructed visual bag-of-words model, comprising:
determining a first similarity between the local features of the static image and the cluster center point in the visual bag of words model;
and mapping the local features of the static images to a bag of words category to which one of the cluster center points belongs according to the first similarity.
Further, performing feature vector representation on the image set to be processed according to the visual bag-of-words model, including:
counting the distribution condition of mapping the local features of the static images to each bag type in the visual bag-of-words model;
and obtaining the characteristic vector of the static image according to the distribution condition.
Further, performing closed-loop detection according to the result represented by the feature vector, including:
determining a second similarity between the feature vectors of the static images corresponding to the two images to be processed in the image set to be processed;
and determining the closed-loop detection result of the two images to be processed according to the second similarity.
In a second aspect, an embodiment of the present invention provides an image processing apparatus, including:
the filtering module is configured to filter dynamic objects in the images to be processed in the image set to be processed to obtain static images;
the characteristic extraction module is configured to extract local characteristics in the static image by using a CNN network, and cluster the local characteristics to obtain a plurality of cluster central points;
the closed-loop detection module is configured to construct a visual bag-of-words model of the static image according to the plurality of clustering center points, map local features of the static image to the constructed visual bag-of-words model, perform feature vector representation on the image set to be processed according to the visual bag-of-words model, and perform closed-loop detection according to a result of the feature vector representation.
These functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In one possible design, the image processing apparatus includes a memory and a processor, the memory is used for storing one or more computer instructions for supporting the image processing apparatus to execute the method in the first aspect, and the processor is configured to execute the computer instructions stored in the memory. The image processing apparatus may further include a communication interface for the image processing apparatus to communicate with other devices or a communication network.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor; wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer-readable storage medium storing computer instructions used by the image processing apparatus, which contains computer instructions for performing the method according to the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the method comprises the steps of firstly utilizing a convolutional neural network to carry out local region segmentation and feature extraction on an image, combining an extracted image local CNN feature descriptor with a traditional closed-loop detection bag-of-words model algorithm, clustering the obtained image local CNN feature descriptor in the bag-of-words model by utilizing a clustering center algorithm, constructing a visual dictionary tree according to a plurality of clustered central points, further obtaining a feature vector representing the image, carrying out similarity comparison on the image, and judging whether a loop is generated. The method can simultaneously improve the accuracy and stability of closed-loop detection in a complex environment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 shows a flow diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 2 shows a flow chart of step S101 according to the embodiment shown in FIG. 1;
FIG. 3 shows a flowchart of step S102 according to the embodiment shown in FIG. 1;
FIG. 4 shows a flow diagram of a method for combining a deep learning advantage with a conventional closed loop detection advantage in an embodiment of the present disclosure;
fig. 5 shows a block diagram of the structure of an image processing apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device suitable for implementing an image processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates a flowchart of an image processing method according to an embodiment of the present disclosure. As shown in fig. 1, the image processing method includes the steps of:
in step S101, filtering the dynamic objects in the to-be-processed image set to obtain a static image;
in step S102, a CNN network is used to extract local features in the static image, and the local features are clustered to obtain a plurality of cluster center points;
in step S103, a visual bag-of-words model of the static image is constructed according to the plurality of cluster center points, local features of the static image are mapped onto the constructed visual bag-of-words model, feature vector representation is performed on the image set to be processed according to the visual bag-of-words model, and closed-loop detection is performed according to a result of the feature vector representation.
In this embodiment, the image set to be processed may be a set of a plurality of images to be processed for closed-loop detection. A dynamic object is an object that changes dynamically in the image to be processed, such as a person, an animal, or a vehicle.
According to the embodiment of the disclosure, before closed-loop detection is performed on the images to be processed in the image set, the dynamic objects in the images are filtered out to obtain static images. In some embodiments, the pixels in the image region where a dynamic object is located may be set to a preset color, for example black; because an all-black region contains no feature points, interference from feature points on the dynamic object is eliminated.
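As an illustration (not part of the claimed method), a minimal Python sketch of this masking step, assuming the dynamic-object regions are given as bounding boxes; the function name and box format are illustrative:

```python
import numpy as np

def mask_dynamic_objects(image, boxes, fill=0):
    """Set every pixel inside the detected dynamic-object boxes to a preset
    color (black by default), so the region contributes no feature points.

    image: H x W x 3 uint8 array; boxes: iterable of (x1, y1, x2, y2)."""
    static = image.copy()
    for x1, y1, x2, y2 in boxes:
        static[y1:y2, x1:x2] = fill  # an all-black patch yields no keypoints
    return static
```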
For the plurality of static images derived from the image set to be processed, a CNN network may be used to extract local features. In the process of extracting local features from a static image with the CNN network, the first part of the CNN network first extracts keypoint information from the static image, and the static image is cropped into a plurality of local-region image blocks based on these keypoints, each local-region image block containing one keypoint extracted from the static image. Then, for each local-region image block, the second part of the CNN network extracts keypoint information from the block, and the keypoint information extracted from the plurality of local-region image blocks corresponding to the static image is used as the local features of the static image; for example, the feature vectors of the keypoints extracted from the local-region image blocks can be used as the local features of the static image.
After the local features of the static image corresponding to each image to be processed in the image set have been obtained, the local features can be clustered with a center clustering algorithm to obtain a plurality of cluster center points. The center clustering algorithm may adopt the K-means++ algorithm, which is well known, so the specific clustering process is not described again here.
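For illustration, a minimal Python sketch of this clustering step, assuming the descriptors of all static images have been stacked into one N x D array; the vocabulary size of 500 and the use of scikit-learn are illustrative choices, not part of the disclosure:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=500, seed=0):
    """Cluster local CNN descriptors (N x D) into n_words visual words using
    k-means++ initialization; the cluster centers become the vocabulary of
    the visual bag-of-words model."""
    kmeans = KMeans(n_clusters=n_words, init="k-means++", n_init=10, random_state=seed)
    kmeans.fit(descriptors)
    return kmeans.cluster_centers_  # shape: (n_words, D)
```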
After clustering is completed, a visual bag-of-words model of the static images corresponding to all the images to be processed in the image set can be constructed from the cluster center points. The constructed visual bag-of-words model comprises a plurality of bag-of-words categories; each category represents one visual word, and the center point of each bag-of-words category is one of the cluster center points. When a local feature is similar to a cluster center point, the local feature may be classified into the bag-of-words category in which that cluster center point is located.
After the visual bag-of-words model is constructed, the local features of the static images can be mapped onto the visual bag-of-words model according to the similarity between the local features and the cluster center points, i.e. the local features of the static image are classified into the various bag-of-words categories comprised by the visual bag-of-words model. After the mapping is completed, a feature vector representation of the static image is obtained from the mapping result, and this feature vector representation can be used as the feature vector representation of the image to be processed corresponding to the static image. In some embodiments, the feature vector representation of the static image may be derived from the distribution of its local features across the various bag-of-words categories after these local features are mapped onto the visual bag-of-words model.
After the feature vector representation of each to-be-processed image in the to-be-processed image set is obtained, a closed-loop detection result between the two to-be-processed images can be determined according to the feature vector representation.
In an optional implementation manner of this embodiment, as shown in fig. 2, the step S101, namely, the step of filtering the dynamic object in the to-be-processed image set to obtain the static image, further includes the following steps:
in step S201, detecting a dynamic object in the image to be processed, and extracting area information of the dynamic object;
in step S202, the dynamic object is filtered from the image to be processed according to the region information, so as to obtain the static image.
In the image preprocessing stage, scene recognition and dynamic/static scene separation are performed on the image. Before scene separation, a YOLOv3 model is pre-trained with the PASCAL VOC2007 and PASCAL VOC2012 data sets. The data sets contain a number of outdoor moving objects such as people, animals, and vehicles, together with the positions of the objects in the pictures and their class labels. PASCAL VOC2007 contains 20 classes, 9963 images, and 24640 annotated objects; PASCAL VOC2012 contains 20 classes, 11530 images, and 27450 annotated objects.
The object classes in the PASCAL data sets fully cover the moving objects an autonomous mobile robot may encounter while working outdoors, and most of the outdoor moving objects in the data sets used in the embodiment of the present disclosure are people, vehicles, and the like, so a convolutional neural network trained on this database can easily identify the dynamic objects, which are then filtered out of the image to be processed.
After the training of the convolutional neural network for identifying the dynamic objects in the images is completed, the dynamic and static scene identification and separation can be performed on the images to be processed in the image set to be processed by using a dynamic and static scene separation algorithm based on target detection.
In the embodiment of the present disclosure, the pre-trained YOLOv3 target-detection model is used to detect dynamic objects in the image to be processed and to extract the dynamic-object region information, which includes the rectangular region where the dynamic object is located, the region size, and the like. The positions corresponding to this region information are then separated out of the original image, eliminating dynamic-object information such as pedestrians and vehicles.
In some embodiments, the scene-information separation operates on the matrix of image pixels: the dynamic-object region is blacked out in the original image, and because an all-black region contains no feature points, interference from feature points on the dynamic object is eliminated. Therefore, when local features are subsequently extracted from the image, no feature extraction is performed on the separated dynamic-scene region. The remaining region is the desired filtered, purely static scene image with the dynamic objects removed.
In other embodiments of the present disclosure, in the image preprocessing stage, a dynamic and static scene separation algorithm based on image semantic segmentation is used to perform scene semantic segmentation and separation on the image. Before scene separation, a DeepLabv3+ semantic segmentation model is pre-trained with the PASCAL VOC2007 and PASCAL VOC2012 data sets.
Then, semantic segmentation is performed on the input image to be processed with the pre-trained DeepLabv3+ model, and the result yields semantic information about the dynamic and static scenes in the image. From this result, the semantic information of the dynamic objects in the image to be processed (including the image pixels of the regions where the dynamic objects are located, and the like) is extracted. The semantic information of the whole image is traversed on the basis of the pixel matrix, and where the pixel labels correspond to the object classes annotated during training, the semantic information of the dynamic objects is removed from the original image; that is, the dynamic-object semantics such as pedestrians and vehicles are simply blacked out at the corresponding positions in the original image, yielding a static image with only the dynamic-object information removed.
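As an illustration, a minimal Python sketch of this pixel-level removal, assuming the segmentation model returns an integer label map; the specific label ids for the dynamic classes are dataset-dependent and illustrative here:

```python
import numpy as np

# Illustrative label ids for dynamic classes (e.g. person, car); dataset-dependent.
DYNAMIC_CLASSES = {15, 7}

def remove_dynamic_semantics(image, label_map, dynamic_classes=DYNAMIC_CLASSES):
    """Black out every pixel whose semantic label belongs to a dynamic-object
    class, leaving a static image with only the dynamic-object information removed.

    image: H x W x 3 uint8; label_map: H x W int array from the segmentation model."""
    static = image.copy()
    mask = np.isin(label_map, list(dynamic_classes))
    static[mask] = 0
    return static
```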
In some embodiments, in order to avoid incomplete separation of the dynamic object's edge information and to ensure that the dynamic-object semantics are removed more cleanly and thoroughly, image dilation may further be applied to the separated dynamic-object region information (that is, after the image region where the dynamic object is located has been blacked out, dilation is applied to that black region), with 2 iterations.
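A minimal sketch of this dilation step with OpenCV, assuming a binary mask of the dynamic-object pixels; the 5 x 5 kernel is an illustrative assumption, only the 2 iterations come from the description above:

```python
import cv2
import numpy as np

def expand_dynamic_mask(mask, kernel_size=5, iterations=2):
    """Dilate the binary mask of dynamic-object pixels so that edge pixels
    left behind by the separation are also removed (2 iterations, as above)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=iterations)
```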
In an optional implementation manner of this embodiment, as shown in fig. 3, the step S102, namely the step of extracting the local feature in the static image by using the CNN network, further includes the following steps:
in step S301, the static image is processed by using a multi-scale dense full convolution network, and the key point information in the static image is returned;
in step S302, locally clipping the static image according to the key point information to obtain a plurality of local image blocks;
in step S303, a local feature network is used to perform feature extraction on the plurality of local image blocks respectively, so as to obtain feature point positions and feature descriptors of the local image blocks.
In this optional implementation, the CNN network may use the LF-Net network, a sparse matching method with a deep architecture, and can be divided into two parts. The first part may be a multi-scale dense fully convolutional network, which returns the keypoint positions, scales, orientations, and so on of the static image; the static image is then cropped into local regions according to the keypoint positions, yielding a plurality of local-region image blocks for each static image. The second part may be a local descriptor network used to extract features from the keypoint-based cropped image blocks produced by the first part. When extracting local features from the static image, a fully convolutional network first generates a rich feature map o from the image I to be processed. Scale-invariant keypoint detection is performed on the generated feature map o, the top M pixels are selected as feature points according to the scale-invariant map, and the feature point positions and feature descriptors of the local image blocks are finally obtained.
The multi-scale dense fully convolutional network is a fully convolutional network structure designed according to the MSDNet (multi-scale dense network) idea. The network has no fully connected layer and uses three simple ResNet-style blocks; each block comprises 5 x 5 convolution filters, followed by batch normalization, Leaky-ReLU activation, and another group of 5 x 5 convolutions. It adopts the MSDNet concept of using multiple scales to obtain abstract features of the image and generate a feature map, assembled from two kinds of connections in series: 1. convolution of the same-scale features of the previous level and down-sampling (diagonal connection) of the larger-scale feature map of the previous level; 2. each layer has connections to other convolutional layers, and during backpropagation the weights are updated in the direction that best improves the extraction result of each feature.
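For illustration, a minimal PyTorch sketch of one such block as described above; the channel count, the negative slope of the Leaky ReLU, and the exact placement of the residual connection are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DetectorBlock(nn.Module):
    """One block of the keypoint-detector backbone described above: a 5x5
    convolution, batch normalization, Leaky-ReLU activation, another 5x5
    convolution, and a residual (ResNet-style) connection."""

    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection keeps the block ResNet-style
```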
After the keypoints have been selected, image blocks around the keypoint positions are cropped from the normalized image (the image obtained by normalizing the static image) using bilinear sampling (to preserve differentiability). Bilinear sampling, i.e. the bilinear interpolation algorithm, is a good image scaling method: it uses the four real pixel values surrounding a virtual point in the image to jointly determine one pixel value in the target image, thereby accomplishing the cropping.
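A minimal Python sketch of bilinear sampling used to crop a patch around a keypoint on a grayscale image; the 32 x 32 patch size is an illustrative assumption:

```python
import numpy as np

def bilinear_sample(image, ys, xs):
    """Sample a grayscale image (H x W) at real-valued coordinates, using the
    four surrounding integer pixels weighted bilinearly."""
    h, w = image.shape[:2]
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)
    wx = np.clip(xs - x0, 0.0, 1.0)
    return ((1 - wy) * (1 - wx) * image[y0, x0] + (1 - wy) * wx * image[y0, x1]
            + wy * (1 - wx) * image[y1, x0] + wy * wx * image[y1, x1])

def crop_patch(image, center_y, center_x, size=32):
    """Crop a size x size patch centered on a keypoint with bilinear sampling."""
    offsets = np.arange(size) - (size - 1) / 2.0
    ys, xs = np.meshgrid(center_y + offsets, center_x + offsets, indexing="ij")
    return bilinear_sample(image, ys, xs)
```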
After the first part of the network yields a plurality of local-region image blocks of the static image, the second part of the network extracts features from the cropped blocks (including keypoint positions, scales, feature vectors, and so on), finally producing the set of local feature descriptors of the static image used for subsequent clustering and construction of the visual bag-of-words model. Compared with traditional hand-crafted feature extraction algorithms, the LF-Net network extracts richer features, is hardly affected by illumination and other environmental factors, and has stronger stability and higher accuracy.
When local features are extracted from the image by the above method in the embodiment of the present disclosure, no feature extraction is performed on the dynamic-scene regions separated from the original image: when a region is detected as a dynamic-scene region during feature extraction, it is skipped, so the dynamic-scene regions no longer participate in the feature representation of the image. The feature point positions and feature descriptors of the filtered image are finally determined.
In an optional implementation manner of this embodiment, the step of mapping the local features of the static image onto the constructed visual bag-of-words model in step S103 further includes the following steps:
determining a first similarity between the local features of the static image and the cluster center point in the visual bag of words model;
and mapping the local features of the static images to a bag of words category to which one of the cluster center points belongs according to the first similarity.
In this optional implementation, the obtained local features of the static images are combined with the closed-loop detection bag-of-words model algorithm; that is, the extracted set of local feature descriptors of the static images is clustered with the K-means++ center clustering algorithm, and a visual bag-of-words model is constructed from the resulting cluster center points.
In some embodiments, after the plurality of cluster center points are obtained by running the center clustering algorithm on the local features of the static images corresponding to the images to be processed, the local features of each static image can be mapped to the cluster center points. The mapping can be done by calculating a first similarity between a local feature and the cluster center points and mapping according to that similarity. For example, for a local feature B of static image A, the first similarity between the feature descriptor of B and the feature descriptor of each cluster center point Ci can be calculated (the first similarity may be determined from the Euclidean distance), and B is mapped to the bag-of-words category corresponding to the cluster center point Ci with the largest first similarity. In this way, all local features extracted from all static images can be mapped onto the visual bag-of-words model, finally yielding the visual dictionary tree.
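A minimal Python sketch of this nearest-word assignment, using the Euclidean distance as the (inverse) similarity measure; the array shapes are illustrative:

```python
import numpy as np

def assign_to_words(descriptors, centers):
    """Map each local descriptor (N x D) to the index of the nearest cluster
    center (K x D) under Euclidean distance, i.e. the most similar visual word."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)  # N x K
    return np.argmin(dists, axis=1)
```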
In an optional implementation manner of this embodiment, the step of performing, in step S103, feature vector representation on the image set to be processed according to the visual bag-of-words model further includes the following steps:
counting the distribution condition of mapping the local features of the static images to each bag type in the visual bag-of-words model;
and obtaining the characteristic vector of the static image according to the distribution condition.
In this optional implementation, after the local features of the static images have been mapped onto the visual bag-of-words model, the distribution of the local features of each static image over the bag-of-words categories (for example, how often the local features of one static image fall into the different categories) can be counted from the mapping result, and the feature vector of the static image is then obtained from this distribution. For example, if the constructed visual bag-of-words model includes 3 bag-of-words categories and static image A has 10 local features in total, and after mapping 1 local feature falls into the first category, 6 into the second, and 3 into the third, then the feature vector of static image A may be represented as (1, 6, 3); of course, this vector may also be normalized and used as the feature vector of static image A. It can be understood that the dimension of the static image's feature vector equals the number of bag-of-words categories in the visual bag-of-words model, which is also the number of cluster center points.
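A minimal Python sketch of this counting and normalization step; the L1 normalization is one reasonable choice and is not mandated by the description:

```python
import numpy as np

def bow_vector(word_ids, n_words):
    """Count how many local features fall into each bag-of-words category and
    L1-normalize the histogram to get the image's feature vector."""
    hist = np.bincount(word_ids, minlength=n_words).astype(float)
    return hist / hist.sum() if hist.sum() > 0 else hist

# For the example above, 10 features distributed as (1, 6, 3) over 3 categories
# give the normalized feature vector (0.1, 0.6, 0.3).
```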
In an optional implementation manner of this embodiment, the step of performing closed-loop detection according to the result represented by the feature vector in step S103 further includes the following steps:
determining a second similarity between the feature vectors of the static images corresponding to the two images to be processed in the image set to be processed;
and determining the closed-loop detection result of the two images to be processed according to the second similarity.
In this optional implementation, the feature vector of each image to be processed in the image set can be represented by the feature vector of its corresponding static image. Therefore, when determining the closed-loop detection result between two images to be processed, the second similarity between the feature vectors of the static images corresponding to the two images is calculated; when the second similarity is greater than or equal to a preset threshold, the closed-loop detection result of the two images is positive, that is, a closed loop is formed between them; otherwise, no closed loop is formed. The second similarity can be determined from the cosine distance between the feature vectors of the two images to be processed.
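A minimal Python sketch of this decision, using cosine similarity; the threshold value is an illustrative assumption:

```python
import numpy as np

def is_loop_closure(vec_a, vec_b, threshold=0.85):
    """Second similarity: cosine similarity between the two feature vectors.
    A value at or above the (illustrative) threshold is treated as a closed loop."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if denom == 0:
        return False
    return float(np.dot(vec_a, vec_b) / denom) >= threshold
```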
The traditional closed-loop detection algorithm based on local feature extraction has high accuracy in a stable environment, and the deep-learning closed-loop detection algorithm based on global feature extraction has better stability in a complex environment. In order to improve both the stability and the accuracy of closed-loop detection in complex environments, the embodiment of the present disclosure provides a closed-loop detection method based on dynamic/static scene separation and local CNN feature representation. First, in the image preprocessing stage, scene separation is performed on the input image with a dynamic/static scene separation algorithm based on target detection or semantic segmentation, eliminating interference from dynamic objects such as pedestrians and vehicles and producing a filtered, purely static scene image. Then, local feature extraction is performed on the filtered image with a CNN feature extraction algorithm; during this extraction, no features are extracted from the dynamic-scene regions separated from the original image. Finally, the feature point positions and feature descriptors of the filtered image are determined and combined with the traditional closed-loop detection bag-of-words model algorithm: the extracted local CNN feature descriptors are clustered in the bag-of-words model, a visual dictionary tree is constructed, the feature vector of each image is obtained from the constructed dictionary tree, the similarity of feature vectors is compared, and it is determined whether a closed loop is formed.
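Putting the pieces together, a sketch of the whole flow in Python that reuses the hypothetical helper functions from the earlier sketches (mask_dynamic_objects, build_vocabulary, assign_to_words, bow_vector, is_loop_closure); the two callables passed in stand for the YOLOv3 detector and the local CNN feature extractor, and all names are illustrative:

```python
import numpy as np

def closed_loop_detection(images, detect_dynamic_boxes, extract_local_descriptors,
                          n_words=500, threshold=0.85):
    """End-to-end sketch of the pipeline described above."""
    # 1. Image preprocessing: dynamic/static scene separation.
    static_images = [mask_dynamic_objects(img, detect_dynamic_boxes(img)) for img in images]
    # 2. Local CNN feature extraction on the static images.
    per_image_desc = [extract_local_descriptors(img) for img in static_images]
    # 3. Cluster all descriptors to build the visual bag-of-words vocabulary.
    centers = build_vocabulary(np.vstack(per_image_desc), n_words=n_words)
    # 4. Represent every image as a normalized bag-of-words feature vector.
    vectors = [bow_vector(assign_to_words(d, centers), n_words) for d in per_image_desc]
    # 5. Compare every pair of images and report the detected closed loops.
    return [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))
            if is_loop_closure(vectors[i], vectors[j], threshold)]
```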
The beneficial effects produced by the embodiments of the present disclosure mainly lie in:
1) The closed-loop detection algorithm based on dynamic/static scene separation solves the problems that, under interference from movable objects such as pedestrians and vehicles in the environment, the visual bag-of-words model's grasp of image semantic information is degraded and the global scene information in the image cannot be effectively extracted, thereby improving the accuracy of closed-loop detection.
2) A closed-loop detection method based on local CNN feature representation is provided. Because the traditional closed-loop detection algorithm extracts local features of the image while the deep-learning closed-loop detection algorithm extracts global features, the traditional algorithm based on local features has higher accuracy and precision in the same stable environment. However, in a complex environment (such as severe illumination change), the deep-learning closed-loop detection algorithm is clearly superior to the traditional algorithm.
The embodiment of the disclosure provides a method for combining the advantages of deep learning with those of traditional closed-loop detection. The image is first subjected to local region segmentation and feature extraction with a convolutional neural network; the extracted local CNN feature descriptors are combined with the traditional closed-loop detection bag-of-words model algorithm; the obtained local CNN feature descriptors are clustered in the bag-of-words model with the K-means++ algorithm; a visual dictionary tree is constructed from the K cluster center points; the feature vectors representing the images are then obtained, the similarity between images is compared, and it is determined whether a loop is formed. This method can simultaneously improve the accuracy and stability of closed-loop detection in complex environments.
Fig. 4 shows a flow chart of a method for combining the advantages of deep learning with those of traditional closed-loop detection in an embodiment of the present disclosure. As shown in fig. 4, in the image preprocessing stage, the original image is subjected to dynamic/static scene separation to obtain a static image. Local CNN feature detection is then performed on the static image to obtain the local feature set of each image, the feature points are merged and clustered, the frequency distribution of feature points over the visual words of each image is counted with a histogram statistic in the bag-of-words model, the similarity of any two images is compared, and it is determined whether a closed loop is formed.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 5 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 5, the image processing apparatus includes:
a filtering module 501, configured to filter dynamic objects in the to-be-processed images in the to-be-processed image set to obtain static images;
a feature extraction module 502, configured to extract local features in the static image by using a CNN network, and perform clustering on the local features to obtain a plurality of clustering center points;
a closed-loop detection module 503, configured to construct a visual bag-of-words model of the static image according to the plurality of cluster center points, map local features of the static image to the constructed visual bag-of-words model, perform feature vector representation on the image set to be processed according to the visual bag-of-words model, and perform closed-loop detection according to a result of the feature vector representation.
The image processing apparatus provided in the embodiment of the present disclosure corresponds to the image processing method described above, and specific details may be referred to the description of the image processing method, which is not described herein again.
Fig. 6 is a schematic structural diagram of an electronic device suitable for implementing an image processing method according to an embodiment of the present disclosure.
As shown in fig. 6, the electronic device 600 includes a Central Processing Unit (CPU)601, which can execute various processes in the embodiments of the above-described method of the present disclosure according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The CPU601, ROM602, and RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to embodiments of the present disclosure, the methods described in the embodiments above may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the methods of the embodiments of the present disclosure. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description covers only the preferred embodiments of the disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with (but not limited to) features disclosed in this disclosure that have similar functions.
Claims (9)
1. An image processing method, comprising:
filtering dynamic objects in the images to be processed in the image set to be processed to obtain static images;
extracting local features in the static image by using a CNN network, and clustering the local features to obtain a plurality of clustering central points;
and constructing a visual bag-of-words model of the static image according to the plurality of clustering center points, mapping local features of the static image to the constructed visual bag-of-words model, expressing feature vectors of the image set to be processed according to the visual bag-of-words model, and performing closed-loop detection according to the result expressed by the feature vectors.
2. The method of claim 1, wherein filtering dynamic objects in the to-be-processed images in the to-be-processed image set to obtain a static image comprises:
detecting a dynamic object in the image to be processed, and extracting the region information of the dynamic object;
and filtering the dynamic object from the image to be processed according to the region information to obtain the static image.
3. The method according to claim 1 or 2, wherein extracting local features in the static image using a CNN network comprises:
processing the static image by utilizing a multi-scale dense full convolution network, and returning key point information in the static image;
locally cutting the static image according to the key point information to obtain a plurality of local image blocks;
and respectively extracting the features of the local image blocks by using a local feature network to obtain the feature point positions and the feature descriptors of the local image blocks.
4. The method of claim 1 or 2, wherein mapping local features of the static image onto the constructed visual bag-of-words model comprises:
determining a first similarity between the local features of the static image and the cluster center point in the visual bag of words model;
and mapping the local features of the static images to a bag of words category to which one of the cluster center points belongs according to the first similarity.
5. The method of claim 4, wherein performing feature vector representation on the set of images to be processed according to the visual bag-of-words model comprises:
counting the distribution condition of mapping the local features of the static images to each bag type in the visual bag-of-words model;
and obtaining the characteristic vector of the static image according to the distribution condition.
6. The method of claim 5, wherein performing closed-loop detection based on the result of the eigenvector representation comprises:
determining a second similarity between the feature vectors of the static images corresponding to the two images to be processed in the image set to be processed;
and determining the closed-loop detection result of the two images to be processed according to the second similarity.
7. An image processing apparatus characterized by comprising:
the filtering module is configured to filter dynamic objects in the images to be processed in the image set to be processed to obtain static images;
the characteristic extraction module is configured to extract local characteristics in the static image by using a CNN network, and cluster the local characteristics to obtain a plurality of cluster central points;
the closed-loop detection module is configured to construct a visual bag-of-words model of the static image according to the plurality of clustering center points, map local features of the static image to the constructed visual bag-of-words model, perform feature vector representation on the image set to be processed according to the visual bag-of-words model, and perform closed-loop detection according to a result of the feature vector representation.
8. An electronic device comprising a memory and a processor; wherein,
the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any one of claims 1-6.
9. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911084122.2A CN110852327A (en) | 2019-11-07 | 2019-11-07 | Image processing method, image processing device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911084122.2A CN110852327A (en) | 2019-11-07 | 2019-11-07 | Image processing method, image processing device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110852327A true CN110852327A (en) | 2020-02-28 |
Family
ID=69598753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911084122.2A Pending CN110852327A (en) | 2019-11-07 | 2019-11-07 | Image processing method, image processing device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110852327A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582447A (en) * | 2020-04-30 | 2020-08-25 | 电子科技大学 | Closed loop detection method based on multiple network characteristics |
CN111738299A (en) * | 2020-05-27 | 2020-10-02 | 完美世界(北京)软件科技发展有限公司 | Scene static object merging method and device, storage medium and computing equipment |
CN111882002A (en) * | 2020-08-06 | 2020-11-03 | 桂林电子科技大学 | MSF-AM-based low-illumination target detection method |
CN112182275A (en) * | 2020-09-29 | 2021-01-05 | 神州数码信息系统有限公司 | Trademark approximate retrieval system and method based on multi-dimensional feature fusion |
CN112396596A (en) * | 2020-11-27 | 2021-02-23 | 广东电网有限责任公司肇庆供电局 | Closed loop detection method based on semantic segmentation and image feature description |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831446A (en) * | 2012-08-20 | 2012-12-19 | 南京邮电大学 | Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping) |
CN106919920A (en) * | 2017-03-06 | 2017-07-04 | 重庆邮电大学 | Scene recognition method based on convolution feature and spatial vision bag of words |
CN109902619A (en) * | 2019-02-26 | 2019-06-18 | 上海大学 | Image closed loop detection method and system |
- 2019-11-07 CN CN201911084122.2A patent/CN110852327A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831446A (en) * | 2012-08-20 | 2012-12-19 | 南京邮电大学 | Image appearance based loop closure detecting method in monocular vision SLAM (simultaneous localization and mapping) |
CN106919920A (en) * | 2017-03-06 | 2017-07-04 | 重庆邮电大学 | Scene recognition method based on convolution feature and spatial vision bag of words |
CN109902619A (en) * | 2019-02-26 | 2019-06-18 | 上海大学 | Image closed loop detection method and system |
Non-Patent Citations (3)
Title |
---|
YUKI ONO et al.: "LF-Net: Learning Local Features from Images", 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) *
LIU Qiang et al.: "A Survey of Loop Closure Detection Methods for Visual SLAM in Complex Environments", Robot *
LIN Hui: "Loop Closure Detection Based on the Fusion of CNN and VLAD", Research and Development *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582447A (en) * | 2020-04-30 | 2020-08-25 | 电子科技大学 | Closed loop detection method based on multiple network characteristics |
CN111738299A (en) * | 2020-05-27 | 2020-10-02 | 完美世界(北京)软件科技发展有限公司 | Scene static object merging method and device, storage medium and computing equipment |
CN111738299B (en) * | 2020-05-27 | 2023-10-27 | 完美世界(北京)软件科技发展有限公司 | Scene static object merging method and device, storage medium and computing equipment |
CN111882002A (en) * | 2020-08-06 | 2020-11-03 | 桂林电子科技大学 | MSF-AM-based low-illumination target detection method |
CN111882002B (en) * | 2020-08-06 | 2022-05-24 | 桂林电子科技大学 | MSF-AM-based low-illumination target detection method |
CN112182275A (en) * | 2020-09-29 | 2021-01-05 | 神州数码信息系统有限公司 | Trademark approximate retrieval system and method based on multi-dimensional feature fusion |
CN112396596A (en) * | 2020-11-27 | 2021-02-23 | 广东电网有限责任公司肇庆供电局 | Closed loop detection method based on semantic segmentation and image feature description |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xie et al. | Multilevel cloud detection in remote sensing images based on deep learning | |
Chen et al. | Vehicle detection in high-resolution aerial images via sparse representation and superpixels | |
CN112966691B (en) | Multi-scale text detection method and device based on semantic segmentation and electronic equipment | |
CN110334762B (en) | Feature matching method based on quad tree combined with ORB and SIFT | |
Chen et al. | Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature | |
CN111145209B (en) | Medical image segmentation method, device, equipment and storage medium | |
CN102968637B (en) | Complicated background image and character division method | |
CN110852327A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN111898432B (en) | Pedestrian detection system and method based on improved YOLOv3 algorithm | |
CN112116599B (en) | Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning | |
CN110175615B (en) | Model training method, domain-adaptive visual position identification method and device | |
CN114359851A (en) | Unmanned target detection method, device, equipment and medium | |
CN105528595A (en) | Method for identifying and positioning power transmission line insulators in unmanned aerial vehicle aerial images | |
Zhang et al. | Road recognition from remote sensing imagery using incremental learning | |
CN113361495A (en) | Face image similarity calculation method, device, equipment and storage medium | |
CN116704490B (en) | License plate recognition method, license plate recognition device and computer equipment | |
CN114332473A (en) | Object detection method, object detection device, computer equipment, storage medium and program product | |
CN113850136A (en) | Yolov5 and BCNN-based vehicle orientation identification method and system | |
CN113223037A (en) | Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data | |
CN113657196B (en) | SAR image target detection method, SAR image target detection device, electronic equipment and storage medium | |
CN112241736A (en) | Text detection method and device | |
CN113822134A (en) | Instance tracking method, device, equipment and storage medium based on video | |
CN113378837A (en) | License plate shielding identification method and device, electronic equipment and storage medium | |
CN116071625B (en) | Training method of deep learning model, target detection method and device | |
CN112668662A (en) | Outdoor mountain forest environment target detection method based on improved YOLOv3 network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200228 |