CN113052159B - Image recognition method, device, equipment and computer storage medium - Google Patents
Image recognition method, device, equipment and computer storage medium
- Publication number
- CN113052159B (application number CN202110400954.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- sample
- identified
- determining
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/40—Extraction of image or video features
Abstract
The embodiment of the application provides an image recognition method, apparatus, device and computer storage medium, relates to the field of image detection, and aims to improve the accuracy of image recognition. The method comprises the following steps: acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified; inputting the image to be identified into a first network in a pre-trained image recognition model, and determining text features of the image to be identified; inputting the image to be identified into a second network in the image recognition model, and determining a pooled feature map and spatial relationship features of the at least one object to be identified; performing feature fusion on the text features of the image to be identified, the pooled feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified; and inputting the shared feature image into a third network in the image recognition model, and determining recognition information of the image to be identified, wherein the recognition information comprises category information and position information of each object to be identified.
Description
Technical Field
The present application relates to the field of image detection, and in particular, to an image recognition method, apparatus, device, and computer storage medium.
Background
The identification of a target object in an image is one of the important research directions in the field of computer vision, and plays an important role in fields such as public safety, road traffic and video monitoring. In the prior art, the target object can be identified by utilizing the spatial relationship features of the target object in the image, and the recognition accuracy of a neural network for the target object can be improved by reasonably weighting the image features in the neural network.
However, due to the complexity and diversity of the scenes contained in an image and the uncertainty of the position of the target to be detected in the image, these prior-art methods cannot adapt to a wide range of scenes, and therefore the accuracy of image recognition remains limited.
Disclosure of Invention
The embodiment of the application provides an image recognition method, an image recognition device, image recognition equipment and a computer storage medium, which are used for improving the accuracy of image recognition.
In a first aspect, an embodiment of the present application provides an image recognition method, including:
acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
inputting the image to be identified into a first network in a pre-trained image recognition model, and determining text features of the image to be identified;
inputting the image to be identified into a second network in the image recognition model, and determining a pooled feature map and spatial relationship features of the at least one object to be identified;
performing feature fusion on the text features of the image to be identified, the pooled feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
and inputting the shared feature image into a third network in the image recognition model, and determining recognition information of the image to be identified, wherein the recognition information comprises category information and position information of each object to be identified.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including:
the first acquisition module is used for acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
the first determining module is used for inputting the image to be identified into a first network in the pre-trained image recognition model and determining the text features of the image to be identified;
the second determining module is used for inputting the image to be identified into a second network in the image recognition model and determining a pooled feature map and spatial relationship features of the at least one object to be identified;
the fusion module is used for performing feature fusion on the text features of the image to be identified, the pooled feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
and the identification module is used for inputting the shared feature image into the third network in the image recognition model and determining identification information of the image to be identified, wherein the identification information comprises category information and position information of each object to be identified.
In a third aspect, an embodiment of the present application provides an image recognition apparatus, including:
a processor and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method as provided in the first aspect of the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement an image recognition method as provided in the first aspect of the embodiment of the present application.
According to the image recognition method provided by the embodiment of the application, the text features of the image to be identified and the pooled feature map and spatial relationship features of at least one object to be identified in the image are extracted, the three kinds of features are fused, the fused shared feature map is input into a third network in the image recognition model, and the recognition information of the image to be identified is determined, the recognition information comprising the category information and position information of each object to be identified. Compared with the prior art, feature fusion achieves complementation of the image information, making up for the deficiencies of the image feature information regarding details and scene while avoiding redundant noise; at the same time, the extracted text features can reflect the differences and commonalities of images in different scenes, so the method can be applied to more complex scenes and the accuracy of image recognition is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below; a person skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a training method of an image recognition model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-modal feature fusion module according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an image recognition method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an image recognition device according to an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application will be described in detail below, and in order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings and the detailed embodiments. It should be understood that the particular embodiments described herein are meant to be illustrative of the application only and not limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the application by showing examples of the application.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Image recognition algorithms are one of the important research directions in the field of computer vision and play an important role in fields such as public safety, road traffic and video monitoring. In recent years, with the development of deep-learning-based image recognition algorithms, the accuracy of image recognition has improved continuously.
In the prior art, image recognition is performed in two ways:
1. Multi-view image target detection method based on visual saliency
Aiming at a scene with a foreground target not blocked, calculating saliency maps of a plurality of view angle images, projecting the saliency maps of the view angles at two sides to a middle target view angle by utilizing a spatial relation between the view angles, and fusing the projected saliency maps and the saliency maps of the middle view angle to obtain a fused saliency map. The region blocked by the foreground object cannot be truly mapped to the target viewing angle during projection, projection holes are generated around the foreground object in the projection saliency map, and the projection hole region is regarded as a background region in the fusion saliency map. The image area is divided by utilizing the multi-view projection holes, and the area between the projection holes and the image edge and the area between the projection holes of different foreground objects are all regarded as background areas. In the fusion saliency map, the saliency value of the background area obtained by the method is set to be zero, and the object with clear edges and no background interference can be obtained after binarization.
2. Small target detection algorithm under complex background
Drawing on the idea of the feature pyramid algorithm, the features of the Conv4-3 layer are fused with the features of the Conv7 layer and the Conv3-3 layer, and the number of default boxes corresponding to each position of the fused feature map is increased. A squeeze-and-excitation network (SENet) is added to the network structure to assign weights to the feature channels of each layer, boosting useful feature weights and suppressing invalid ones. At the same time, in order to enhance the generalization capability of the network, a series of augmentation operations is applied to the training data set.
Both algorithms are common techniques for detecting and identifying a target object in an image. However, due to the complexity and diversity of the scenes contained in an image and the uncertainty of the position of the target to be detected, conventional target detection methods show poor robustness across different application scenarios. The multi-view image target detection method based on visual saliency considers only the spatial relationship features of the target to be detected and does not make full use of the various kinds of feature information in the image to supplement one another and improve the accuracy of the final image recognition. The small target detection algorithm under a complex background does not consider the spatial relationship between the context information of the complex background and the target to be detected; its scope of application is narrow, it mainly improves the detection and identification accuracy of small targets in an image, and it overlooks the application of the algorithm in more complex scenes.
Based on the above, the embodiment of the application provides an image recognition method, which realizes complementation of image information through feature fusion, overcomes the defects of image feature information on details and scenes while avoiding redundant noise, and simultaneously extracts text features, which can reflect differences and commonalities of images in different scenes, is suitable for more complex scenes, and improves the accuracy of image recognition.
In the image recognition method provided by the embodiment of the application, the image is required to be recognized by using the pre-trained image recognition model, so that the image recognition model needs to be trained before the image recognition is performed by using the image recognition model. Accordingly, a specific implementation of the training method for an image recognition model according to the embodiment of the present application will be described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present application provides a training method for an image recognition model, which includes first obtaining a sample image, fusing information such as a pooled feature map, text features, and spatial relationship features extracted from the sample image to form a shared feature map with more abundant information, and performing iterative training on a preset image recognition model through classification and regression detection algorithms until a training stop condition is satisfied. The method can be realized by the following steps:
1. And obtaining a plurality of images to be marked.
In some embodiments, a plurality of images to be marked can be obtained through a vehicle-mounted camera or the obtained video is subjected to frame extraction processing to obtain the plurality of images to be marked.
2. And manually labeling the plurality of images to be labeled, wherein the content to be labeled is the label identification information of the target object, the label identification information comprising classification information and position information of the target object, and the position information being the coordinate values of a bounding box surrounding the target object.
In some embodiments, the images shot by the vehicle-mounted camera mainly take road traffic as the scene, so the labeling objects of the images to be labeled can comprise target objects such as pedestrians, riders, bicycles, motorcycles, automobiles, trucks, buses, trains, traffic signs and traffic lights, and the labeling result is the category of the target object and the coordinate values of a bounding box surrounding the target object; at the same time, each image to be labeled is given a text annotation from the three angles of time, place and weather.
Specifically, for each image to be annotated, from a temporal perspective, the selectable values include daytime, dusk/dawn, night; from a location perspective, optional values include highways, city streets, residences, parking lots, gas stations, tunnels; from a weather perspective, alternative values include snow, cloudiness, sunny, cloudy, rainy, foggy.
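For illustration only, the following is a minimal sketch of how a single labeled record combining the object labels and the three text annotations might be organized; the field names, file name and values are assumptions for demonstration and are not prescribed by the embodiment.

```python
# Illustrative annotation record for one image to be labeled. Field names and
# values are assumed for demonstration; they are not prescribed by the patent.
sample_annotation = {
    "image": "frame_000123.jpg",
    "objects": [
        # category plus bounding-box coordinates (x_min, y_min, x_max, y_max)
        {"category": "car",        "bbox": [412, 233, 655, 392]},
        {"category": "pedestrian", "bbox": [102, 210, 148, 330]},
    ],
    # text annotation from the three angles described above: time, place, weather
    "context": {"time": "night", "place": "city street", "weather": "rainy"},
}
```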
3. And integrating the artificially marked images and marking information corresponding to each image into a training sample set, wherein the training sample set comprises a plurality of sample image groups.
It should be noted that the image recognition model needs multiple rounds of iterative training so that the loss function value is adjusted until it meets the training stop condition and a trained image recognition model is obtained. In each training iteration, if only one sample image is input, the sample size is too small to benefit the training and adjustment of the image recognition model; therefore the training sample set is divided into multiple sample image groups, each sample image group contains multiple sample images, and the image recognition model is iteratively trained using the multiple sample image groups in the training sample set.
4. And training an image recognition model by using a sample image group in the training sample set until the training stopping condition is met, so as to obtain a trained image recognition model. The method specifically comprises the following steps:
And 4.1, extracting a sample pooling feature map and sample space relation features of the identifiable objects in the sample image by using a second network in the preset image identification model.
In some embodiments, the second network in the preset image recognition model may be a Faster R-CNN (faster region-based convolutional neural network), which is not limited in this application.
Specifically, the acquisition of the sample pooling feature map and the sample spatial relationship feature of the identifiable object in the sample image can be realized by the following steps:
And 4.1.1, uniformly resizing the sample images in the training set to a fixed size of 1000 × 600 pixels to obtain size-adjusted sample images.
And 4.1.2, inputting the size-adjusted sample image group into a deep residual network ResNet, a region proposal network RPN and a Fast R-CNN network to extract image features and obtain pooled feature maps.
1) First, the size-adjusted sample image is input into a 7 × 7 × 64 convolution layer conv1, and the original feature map of the sample image is then extracted sequentially through the convolution layers conv2_x, conv3_x, conv4_x, conv5_x and a fully connected layer fc;
2) The original feature map output by conv4_x in the ResNet network structure is input into the region proposal network RPN, and the top 300 anchor boxes (anchors) with the highest scores in the prediction results, together with their corresponding candidate boxes, are selected;
3) According to the original feature map output by conv4_x, the position mapping maps of the 300 candidate boxes are input into the region-of-interest pooling layer (ROI Pooling) of the Fast R-CNN network to obtain fixed-size pooled feature maps of the identifiable objects, as sketched below.
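A minimal sketch of this extraction pipeline is given below. It uses a torchvision ResNet-50 truncated after layer3 (corresponding to conv4_x) and torchvision's ROI pooling, and substitutes random boxes for the RPN proposals, so it only illustrates the data flow under assumed shapes rather than the exact network of the embodiment.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Backbone: ResNet layers conv1 through conv4_x (layer3), the stage whose
# output feeds the RPN and the ROI pooling layer in the description above.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,   # layer3 ~ conv4_x
).eval()

image = torch.randn(1, 3, 600, 1000)               # one resized sample image
with torch.no_grad():
    feature_map = backbone(image)                  # conv4_x feature map, stride 16

# Stand-in for the RPN output: 300 candidate boxes in image coordinates
# (a real RPN would predict and score these).
proposals = torch.rand(300, 4) * 500
proposals[:, 2:] += proposals[:, :2]               # ensure x2 > x1 and y2 > y1
batch_idx = torch.zeros(300, 1)                    # all boxes belong to image 0
rois = torch.cat([batch_idx, proposals], dim=1)    # (300, 5) as [batch, x1, y1, x2, y2]

# ROI pooling: map each candidate box onto the feature map and pool it to a
# fixed 7x7 size; spatial_scale accounts for the backbone's stride of 16.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                # torch.Size([300, 1024, 7, 7])
```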
4.1.3 Calculating the intersection ratio (Intersection over union, IOU) between the candidate frames using the coordinates of 300 anchors and the candidate frames corresponding thereto, and calculating the spatial relationship characteristics between the identifiable objects by the following equation 1,
$F_r = f(w, h, \mathrm{area}, d_x, d_y, \mathrm{IoU})$ (Equation 1)
where $w$ and $h$ represent the width and height of a candidate box, $\mathrm{area}$ represents the candidate box area, $d_x$ and $d_y$ are the lateral and longitudinal distances between the geometric centers of two candidate boxes, $\mathrm{IoU}$ is the intersection-over-union between the candidate boxes, $f(\cdot)$ represents the activation function, and $F_r$ represents the predicted spatial relationship features between the identifiable objects.
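As an illustration, the sketch below computes the pairwise descriptors listed in Equation 1 for a set of candidate boxes; taking f(·) to be a ReLU over the stacked raw terms is an assumption, since the embodiment does not spell out the exact form of the activation.

```python
import torch
from torchvision.ops import box_iou

def spatial_relation_features(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise spatial-relationship descriptors (w, h, area, dx, dy, IoU)
    for candidate boxes given as an (N, 4) tensor of [x1, y1, x2, y2]."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    area = w * h
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    dx = cx[:, None] - cx[None, :]              # lateral distance between centers
    dy = cy[:, None] - cy[None, :]              # longitudinal distance between centers
    iou = box_iou(boxes, boxes)                 # (N, N) intersection-over-union
    n = boxes.shape[0]
    raw = torch.stack([
        w[:, None].expand(n, n), h[:, None].expand(n, n),
        area[:, None].expand(n, n), dx, dy, iou,
    ], dim=-1)                                  # (N, N, 6) raw descriptors
    return torch.relu(raw)                      # F_r = f(w, h, area, dx, dy, IoU)

boxes = torch.tensor([[10., 10., 60., 90.], [30., 20., 80., 100.]])
print(spatial_relation_features(boxes).shape)   # torch.Size([2, 2, 6])
```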
And 4.2, inputting the sample image into the first network in the preset image recognition model, determining at least one text vector according to the context information of the sample image, splicing the at least one text vector, and determining the sample text feature $F_t$ corresponding to the sample image.
It should be noted that the first network in the image recognition model may be a pre-training model such as Word2vec, GloVe or BERT; the text vector determined according to the context information of the sample image may be a word vector converted from the text annotation information describing the time, place and weather of the sample image, which is not limited in this application.
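A minimal sketch of step 4.2 is shown below. A trainable embedding table stands in for the pre-trained Word2vec/GloVe/BERT lookup, and the vocabulary and 50-dimensional vector size are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Context annotation vocabulary (time / place / weather values listed earlier);
# in practice the vectors would come from a pre-trained Word2vec, GloVe or BERT
# model, and this nn.Embedding only stands in for that lookup.
vocab = ["daytime", "dusk/dawn", "night",
         "highway", "city street", "residential", "parking lot", "gas station", "tunnel",
         "snowy", "overcast", "sunny", "cloudy", "rainy", "foggy"]
word2idx = {w: i for i, w in enumerate(vocab)}
embedding = nn.Embedding(len(vocab), 50)          # 50-dimensional word vectors (assumed size)

def text_feature(time: str, place: str, weather: str) -> torch.Tensor:
    """Convert the three context annotations into word vectors and splice
    (concatenate) them into a single sample text feature F_t."""
    idx = torch.tensor([word2idx[time], word2idx[place], word2idx[weather]])
    vectors = embedding(idx)                      # (3, 50)
    return vectors.flatten()                      # F_t with shape (150,)

f_t = text_feature("night", "city street", "rainy")
print(f_t.shape)                                  # torch.Size([150])
```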
And 4.3, constructing a multi-modal feature fusion module, and complementarily fusing the sample text features extracted from the context information of the sample image with the sample spatial relationship features and the sample pooled feature map determined by the second network of the image recognition model, to obtain a sample shared feature image. The fusion calculation can be realized by Equation 2 and Equation 3:
$F_v = \mathrm{ReLU}(F_{roi}, F_r)$ (Equation 2)
$F_{out} = F_v * F_t$ (Equation 3)
where $F_{roi}$ represents the fixed-size feature map output by the ROI Pooling layer, $F_v$ represents the visual feature map, and $F_{out}$ represents the sample shared feature image obtained after fusing the sample text features, the sample spatial relationship features and the sample pooled feature map.
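The following sketch shows one way the fusion module of Equations 2 and 3 could be realized. Because the text does not spell out how F_roi and F_r are combined inside the ReLU, concatenation followed by a linear projection is assumed here, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of the multi-modal feature fusion of Equations 2 and 3; the
    concatenation + linear projection inside the ReLU is an assumption."""
    def __init__(self, roi_dim: int, rel_dim: int, text_dim: int, out_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(roi_dim + rel_dim, out_dim)
        self.text_proj = nn.Linear(text_dim, out_dim)

    def forward(self, f_roi: torch.Tensor, f_r: torch.Tensor, f_t: torch.Tensor):
        # Equation 2: F_v = ReLU(F_roi, F_r) -- combine pooled and spatial features
        f_v = torch.relu(self.visual_proj(torch.cat([f_roi, f_r], dim=-1)))
        # Equation 3: F_out = F_v * F_t -- element-wise modulation by the text feature
        return f_v * self.text_proj(f_t)

fusion = MultiModalFusion(roi_dim=1024 * 7 * 7, rel_dim=6, text_dim=150, out_dim=256)
f_out = fusion(torch.randn(300, 1024 * 7 * 7),   # flattened pooled features, one row per box
               torch.randn(300, 6),              # per-box spatial relationship features
               torch.randn(1, 150))              # text feature of the whole image
print(f_out.shape)                               # torch.Size([300, 256])
```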
And 4.4, inputting the sample shared feature image into the third network in the preset image recognition model, and determining the reference recognition information of each recognizable object, wherein the reference recognition information comprises the classification information and reference position information of the recognizable object.
And 4.5, performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all the identifiable objects.
In some embodiments, non-maximum suppression (NMS) is performed on the reference position information of each class of recognizable object. NMS obtains a prediction list sorted by score and iterates over the sorted list, discarding predictions whose IoU with a higher-scoring prediction exceeds a predefined threshold; with the threshold set to 0.7, candidate boxes with a large degree of overlap are filtered out, and the position information remaining after suppression is determined as the predicted position information.
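For illustration, torchvision's built-in NMS reproduces this filtering; the boxes and scores below are made-up values.

```python
import torch
from torchvision.ops import nms

# Per-class non-maximum suppression with the IoU threshold of 0.7 mentioned above.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 104., 102.],     # heavily overlaps the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.95, 0.80, 0.90])

keep = nms(boxes, scores, iou_threshold=0.7)      # indices of the boxes that survive
predicted_boxes = boxes[keep]                     # suppressed output -> predicted position info
print(keep)                                       # tensor([0, 2]): the overlapping box is discarded
```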
And 4.6, calculating a loss value between the predicted identification information and the labeled identification information, optimizing the image recognition model according to the target loss function shown in Equation 4, and back-propagating to update the network parameters with a gradient descent algorithm to obtain an updated image recognition model; the optimization training is stopped when the loss function value is smaller than a preset value, and the trained image recognition model is determined.
$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*)$ (Equation 4)
where $i$ denotes the index of an anchor, $p_i$ denotes the probability that the i-th anchor is predicted as a target, $p_i^*$ denotes the ground-truth label indicating whether the i-th anchor is a positive sample, $\lambda$ is a weighting parameter, $L_{cls}$ is the log loss over the two classes (target and non-target) and represents the classification loss, $t = \{t_x, t_y, t_w, t_h\}$ represents the predicted offsets of the anchor in the RPN training phase (of the RoIs in the Fast R-CNN phase), $t^*$ represents the offsets of the anchor relative to the ground-truth label in the RPN training phase (of the RoIs in the Fast R-CNN phase), and $L_{reg}$ represents the regression loss.
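A compact sketch of this two-term objective is shown below; cross-entropy stands in for the log loss and smooth-L1 for the regression loss, with normalization details simplified relative to Equation 4.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, bbox_pred, bbox_targets, lam=1.0):
    """Classification log loss over target/non-target plus a regression loss
    applied only to positive anchors, weighted by lambda (Equation 4, simplified)."""
    # L_cls: log loss between the predicted probability p_i and the label p_i*
    loss_cls = F.cross_entropy(cls_logits, labels)
    # L_reg: smooth-L1 on (t_x, t_y, t_w, t_h), counted only where p_i* = 1
    positive = labels == 1
    if positive.any():
        loss_reg = F.smooth_l1_loss(bbox_pred[positive], bbox_targets[positive])
    else:
        loss_reg = bbox_pred.sum() * 0.0
    return loss_cls + lam * loss_reg

cls_logits = torch.randn(8, 2)                 # target / non-target scores per anchor
labels = torch.randint(0, 2, (8,))             # p_i*: 1 for positive anchors
bbox_pred = torch.randn(8, 4)                  # t = (t_x, t_y, t_w, t_h)
bbox_targets = torch.randn(8, 4)               # t*: offsets to the ground truth
print(detection_loss(cls_logits, labels, bbox_pred, bbox_targets, lam=1.0))
```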
In order to improve the accuracy of the image recognition model, the image recognition model can be continuously trained by using new training samples in practical application, so that the image recognition model is continuously updated, the accuracy of the image recognition model is improved, and the accuracy of image recognition is further improved.
The above is a specific implementation manner of the image recognition model training method provided in the embodiment of the present application, and the image recognition model obtained through the training may be applied to the image recognition method provided in the following embodiment.
The following describes in detail a specific implementation manner of the image recognition method provided by the present application with reference to fig. 3.
As shown in fig. 3, an embodiment of the present application provides an image recognition method, which includes:
s301, acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified.
In some embodiments, the image to be identified may be acquired through an on-board camera, or a pre-acquired video may be subjected to frame extraction processing to determine the image to be identified.
Taking the road traffic scene as an example, the object to be identified in the image to be identified can be a pedestrian, a rider, a bicycle, a motorcycle, an automobile, a truck, a bus, a train, a traffic sign, a traffic light and the like.
S302, inputting the image to be identified into a first network in a pre-trained image identification model, and determining text characteristics of the image to be identified.
In some embodiments, the image to be identified is input to a first network in a pre-trained image identification model, and at least one text vector is determined according to the context information of the image to be identified; and splicing the at least one text vector to determine the text characteristics of the image to be recognized.
It should be noted that the text vectors are word vectors obtained by the first network from the context information of the image to be identified, converted from text annotation information describing its time, place and weather; therefore, the text features determined by splicing several text vectors can represent the environment information of the image to be identified, and can further reflect the differences and commonalities of the image to be identified under different scenes, so as to enhance the distinguishability of the objects to be identified.
S303, inputting the image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of at least one object to be identified.
When the object to be identified is recognized, a large amount of redundant information exists in the image to be identified, so convolution processing needs to be performed on the image. After the image features are determined through convolution, the extracted image features could be used to train the image recognition model, but the calculation cost would be relatively high. Therefore pooling needs to be performed to reduce the dimensionality of the image features, reduce the amount of calculation and the number of parameters, prevent overfitting and improve the fault tolerance of the model.
On the other hand, the spatial relationship refers to a relative spatial position and a relative direction relationship between a plurality of target objects segmented from an image, and these relationships may be also classified into a connection relationship, an overlapping/overlapping relationship, and an inclusion/containment relationship. Thus, the extraction of spatial relationship features may enhance the ability to distinguish image content.
In some embodiments, determining the pooled feature map and the spatial relationship feature of at least one of the objects to be identified may be performed by:
1. And adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group.
In this step, the sample images in the training set can be uniformly adjusted to a fixed size of 1000×600 pixels.
2. And inputting the adjusted sample image group into a depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one by one.
Specifically, the resized sample image may be input to the convolution layer conv1 of 7×7×64, and then the original feature map of the sample image may be extracted sequentially through the convolution layers conv2_x, conv3_x, conv4_x, conv5_x, and one full connection layer fc.
3. Inputting an original image set into a region extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround identifiable objects, and N is an integer greater than 1; and extracting M anchor frames with the confidence coefficient larger than a preset confidence coefficient threshold value from the N anchor frames based on the confidence coefficient of the N anchor frames, wherein M is a positive integer smaller than N.
As an example, the original feature map output by conv4_x in the ResNet network structure may be input into the region proposal network RPN to determine a plurality of anchor boxes and their corresponding candidate boxes, and, based on the confidence of each anchor box, the 300 anchor boxes with the highest confidence and their corresponding candidate boxes are selected.
4. And inputting the mapping region images of the M anchor frames to a region-of-interest pooling layer of the region convolution neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.
In this step, the position mapping maps of the 300 candidate boxes may be input into the region-of-interest pooling layer of the Fast R-CNN network according to the original feature map output by conv4_x, to obtain fixed-size pooled feature maps of the identifiable objects; a brief sketch of the confidence filtering described in step 3 follows below.
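The sketch below illustrates the confidence filtering of step 3 (with the cap of 300 boxes carried over from the training description); the score tensor, box tensor and 0.5 threshold are illustrative assumptions, and the surviving boxes would then be ROI-pooled as in the earlier pipeline sketch.

```python
import torch

scores = torch.rand(2000)                 # RPN objectness confidence for N anchors
boxes = torch.rand(2000, 4)               # candidate boxes corresponding to the anchors

conf_threshold = 0.5                      # preset confidence threshold (assumed value)
mask = scores > conf_threshold            # keep the M anchors above the threshold
kept_scores, order = scores[mask].sort(descending=True)
kept_boxes = boxes[mask][order][:300]     # cap at the 300 highest-scoring boxes
print(kept_boxes.shape)
```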
S304, carrying out feature fusion on the text features of the image to be identified, the pooled feature map of at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified.
In steps S302 and S303 above, the text features of the image to be identified, the pooled feature map of at least one object to be identified, and the spatial relationship features are extracted. Although the spatial relationship features are sensitive to rotation, flipping and size changes of the image or of the target objects in it, and the pooled feature map can reduce the amount of calculation in image recognition, in practical applications the spatial relationship and/or pooled features alone are insufficient and cannot express the scene information effectively and accurately. Feature fusion therefore needs to be performed on the text features of the image to be identified, the pooled feature map of at least one object to be identified and the spatial relationship features, making full use of the various kinds of feature information in the image to supplement one another and reflect the differences and commonalities of images in different scenes, so that the deficiencies of the image feature information regarding details and scene are made up for while redundant noise is avoided.
S305, inputting the shared characteristic image into a third network in the image recognition model, and determining recognition information of the image to be recognized, wherein the recognition information comprises category information and position information of each object to be recognized.
According to the image recognition method provided by the embodiment of the application, the text features of the image to be identified, the pooled feature map of at least one object to be identified and the spatial relationship features are determined through the image recognition model. Multiple kinds of feature information are complementarily fused, the distinguishability of the objects to be identified in the image is enhanced, and the final image recognition performance is optimized, so that the method is suitable for more complex scenes and the accuracy of image recognition is improved.
In order to verify that the image recognition method provided in the above embodiment can improve the accuracy of image recognition compared with the image recognition method in the prior art, the embodiment of the application also provides a test method of image recognition, which tests the image recognition model applied in the image recognition method of the application. Specifically, the method may include the steps of:
1. And inputting the sample image into a trained image recognition model for testing.
Specifically, the average detection precision of all kinds of target objects is calculated according to Equation 5 and Equation 6, and the classification and prediction precision of each prediction box are output:
$AP = \int_0^1 p(r)\,dr$ (Equation 5)
$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$ (Equation 6)
where n represents the number of target categories to be detected, AP represents the average precision of a single category, and mAP represents the mean of the average precision over all categories.
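For reference, the sketch below computes a per-class AP as the area under the precision-recall curve and averages the per-class values into mAP; the monotone-envelope interpolation and the example numbers are illustrative assumptions.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (Equation 5), using a
    monotone-envelope interpolation of the precision values."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Equation 6: mAP is the mean of the per-class APs over the n detected categories.
per_class_ap = {"car": 0.62, "pedestrian": 0.48, "traffic light": 0.41}   # made-up values
m_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(f"mAP = {m_ap:.3f}")
```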
2. According to the AP and mAP calculation formulas, the detection results are obtained, and the prior-art image recognition algorithm using the Faster R-CNN network is compared against the image recognition model provided by the embodiment of the application, leading to the following conclusion:
When the image recognition method provided by the embodiment of the application is used in a classical image recognition network, the image recognition effect is remarkably improved; even when the backgrounds of the images differ greatly, the recognition accuracy for the target objects in the images stays at a stable level, and the recognition effect is better than that of the original algorithm.
Specifically, an embodiment is used to further describe the method for testing the image recognition model provided by the embodiment of the application through the following simulation experiment.
The prior art adopted in the simulation experiment of the application is the faster region-based convolutional neural network Faster R-CNN; the image recognition model uses a ResNet structure to extract image features, the initial learning rate is set to 0.005, the learning rate decay coefficient is set to 0.1, the number of epochs is set to 15, and SGD is selected as the default optimizer.
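The stated settings translate into a training configuration along the following lines; the torchvision ResNet-50 FPN detector, the momentum/weight-decay values, the 11-class head (ten object categories plus background) and the scheduler step size are assumptions added for illustration.

```python
import torch
import torchvision

# Training setup matching the stated hyperparameters: SGD, initial learning
# rate 0.005, learning-rate decay factor 0.1, 15 epochs.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=11)     # 10 categories + background (assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=5e-4)  # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(15):
    # ... one pass over the training sample set would go here ...
    scheduler.step()        # decay the learning rate by a factor of 0.1 (step size assumed)
```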
1. Simulation conditions: the simulated hardware environment of the application is: intel Core i7-7700@
3.60GHz,8G memory; software environment: ubuntu.04, python3.7, pycharm2019.
2. Simulation content and result analysis:
First, the sample image set is used as input, and text feature extraction, spatial relationship feature extraction and pooled feature map acquisition are performed on the basis of the traditional Faster R-CNN algorithm; then, following the basic idea of fusing these three kinds of features for detection, the image recognition model is trained with the above method, the test sample set is input into the trained improved model, and the average precision of each category and the mean average precision over all categories are evaluated with the AP index.
The experiments of the application are based on the BDD100K driving dataset. The simulation results are shown in Table 1, which compares the classical Faster R-CNN algorithm with the multi-modal feature fusion detection method based on context information, tested on the same dataset.
Table 1 comparison of the performance of image recognition methods
As can be seen from the experimental results in Table 1, compared with the detection precision of the classical Faster R-CNN algorithm on the test dataset, the image recognition method provided by the embodiment of the application improves the average detection precision of five kinds of targets by approximately 4.3% across tasks in different scenes. Repeated experiments show that the multi-modal feature fusion technique uses the complementarity between different kinds of information to enhance the representation of the input features, can effectively improve the performance of a target detection algorithm, and clearly improves the average precision of most categories in different image recognition scenes. In real-life scenes, image/video data are difficult to acquire and are often incomplete, so traditional image- and video-based target detection methods are not always applicable; the image recognition method provided by the embodiment of the application can enhance the complementarity between information and is therefore of great significance for detection tasks in different scenes.
Based on the same inventive concept of the image recognition method, the embodiment of the application also provides an image recognition device.
As shown in fig. 4, an embodiment of the present application provides an image recognition apparatus, which may include:
A first obtaining module 401, configured to obtain an image to be identified, where at least one object to be identified is in the image to be identified;
A first determining module 402, configured to input an image to be identified into a first network in a pre-trained image identification model, and determine text features of the image to be identified;
a second determining module 403, configured to input the image to be identified into a second network in the image identification model, and determine a pooled feature map and a spatial relationship feature of at least one object to be identified;
The fusion module 404 is configured to perform feature fusion on the text feature of the image to be identified, the pooled feature map of at least one object to be identified, and the spatial relationship feature, and determine a shared feature image corresponding to the image to be identified;
the identifying module 405 is configured to input the shared feature image to the third network in the image identifying model, and determine identifying information of the image to be identified, where the identifying information includes category information and location information of each object to be identified.
In some embodiments, the apparatus may further comprise:
The second acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of sample image groups, each sample image group comprises a sample image and a corresponding label image, label identification information of a target identification object and scene information of the sample image are marked in the label image, and the label identification information comprises category information and position information of the target identification object;
The training module is used for training a preset image recognition model by using the sample image group in the training sample set until the training stopping condition is met, so as to obtain a trained image recognition model.
In some embodiments, the training module may be specifically configured to:
for each sample image group, the following steps are respectively executed:
Inputting a sample image group into a first network in a preset image recognition model, and determining sample text characteristics corresponding to each sample image;
Inputting a sample image group into a second network in a preset image recognition model, and determining a sample pooling feature map and sample space relation features of each recognizable object;
According to the sample text features corresponding to each sample image, the sample pooling feature images of each identifiable object and the sample space relation features, carrying out feature fusion on each sample image, and determining a sample sharing feature image corresponding to each sample image;
Inputting the sample sharing characteristic image into a third network in a preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object;
Performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all the identifiable objects;
determining a loss function value of a preset image recognition model according to the predicted recognition information of the target sample image and the tag recognition information of all target recognition objects on the target sample image, wherein the target sample image is any one of a sample image group;
And under the condition that the loss function value does not meet the training stop condition, adjusting the model parameters of the image recognition model, and training the image recognition model after parameter adjustment by using the sample image group until the loss function value meets the training stop condition, so as to obtain the trained image recognition model.
In some embodiments, the training module may be specifically configured to:
For each sample image, the following steps are performed:
Inputting a sample image into a first network in a preset image recognition model, and determining at least one text vector according to the context information of the sample image;
at least one text vector is stitched to determine sample text features corresponding to the sample images.
In some embodiments, the second network in the pre-set image recognition model comprises at least a depth residual network, a region extraction network and a region convolution neural network,
The training module may be specifically configured to:
adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group;
inputting the adjusted sample image group into a depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one by one;
Inputting an original image set into a region extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround identifiable objects, and N is an integer greater than 1;
extracting M anchor frames with confidence degrees larger than a preset confidence degree threshold value from the N anchor frames based on the confidence degrees of the N anchor frames, wherein M is a positive integer smaller than N;
Inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the region convolution neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame;
and determining the sample space relation characteristic of each identifiable object according to the intersection ratio and the relative position between at least one anchor frame corresponding to each identifiable object.
In some embodiments, the training module may be specifically configured to:
Dividing all the identifiable objects into a plurality of groups based on the classification information of each identifiable object, and determining the reference position information of the identifiable objects of different groups;
filtering the reference position information of each type of identifiable object;
The predicted identification information of each sample image is determined based on the reference position information of the identified object after filtering and the classification information of the identified object after filtering.
In some embodiments, the training module may be specifically configured to:
Calculating the intersection ratio between a target frame and other reference frames in sequence, wherein the target frame is any one of a plurality of reference frames, and the reference frames are boundary frames which are determined in the reference position information and surround the identifiable object;
filtering the reference frames with the cross-over ratio larger than the preset cross-over ratio threshold until the cross-over ratio between any two reference frames is smaller than the preset cross-over ratio threshold;
the reference frame after filtering is determined as predicted position information of the identifiable object.
Other details of the image recognition apparatus according to the embodiment of the present application are similar to those of the image recognition method according to the embodiment of the present application described above in connection with fig. 1, and are not described herein.
Fig. 5 shows a schematic hardware structure of image recognition according to an embodiment of the present application.
The image recognition method and apparatus provided according to the embodiments of the present application described in connection with fig. 1 and 4 may be implemented by an image recognition device. Fig. 5 is a schematic diagram showing a hardware configuration 500 of an image recognition apparatus according to an embodiment of the application.
A processor 501 and a memory 502 storing computer program instructions may be included in the image recognition device.
In particular, the processor 501 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present application.
Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may comprise a Hard Disk Drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. In one example, memory 502 may include removable or non-removable (or fixed) media, or memory 502 may be a non-volatile solid-state memory. Memory 502 may be internal or external to the integrated gateway disaster recovery device.
In one example, memory 502 may be Read Only Memory (ROM). In one example, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically Erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement the methods/steps S301 to S305 in the embodiment shown in fig. 3, and achieve the corresponding technical effects achieved by executing the methods/steps in the embodiment shown in fig. 3, which are not described herein for brevity.
In one example, the image recognition device may also include a communication interface 503 and a bus 510. As shown in fig. 5, the processor 501, the memory 502, and the communication interface 503 are connected to each other by a bus 510 and perform communication with each other.
The communication interface 503 is mainly used to implement communication between each module, apparatus, unit and/or device in the embodiments of the present application.
Bus 510 includes hardware, software, or both that couple the components of the online data flow billing device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, another suitable bus, or a combination of two or more of these. Bus 510 may include one or more buses, where appropriate. Although embodiments of the application have been described and illustrated with respect to a particular bus, the application contemplates any suitable bus or interconnect.
The image recognition device provided by the embodiment of the application realizes complementation of the image information through feature fusion, makes up for the deficiencies of the image feature information regarding details and scene while avoiding redundant noise, and makes full use of the various kinds of feature information in the image for mutual supplementation; at the same time, the extracted text features can reflect the differences and commonalities of images in different scenes, so the device can be applied to more complex scenes and the accuracy of image recognition is improved.
In addition, in combination with the image recognition method in the above embodiments, an embodiment of the present application further provides a computer storage medium. Computer program instructions are stored on the computer storage medium; when executed by a processor, the computer program instructions implement any of the image recognition methods of the above embodiments.
It should be understood that the application is not limited to the particular arrangements and instrumentalities described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. The method processes of the present application are not limited to the specific steps described and shown; those skilled in the art may make various changes, modifications, additions, or reorderings of the steps after appreciating the spirit of the present application.
The functional blocks shown in the above structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, and the like. When implemented in software, the elements of the application are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in this disclosure describe some methods or systems based on a series of steps or devices. The present application is not limited to the order of the steps described above; that is, the steps may be performed in the order mentioned in the embodiments, in an order different from that in the embodiments, or several steps may be performed simultaneously.
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to being, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware which performs the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing describes only specific embodiments of the present application. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here. It should be understood that the scope of the present application is not limited thereto; any equivalent modification or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed by the present application shall fall within the scope of the present application.
Claims (10)
1. An image recognition method, the method comprising:
acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
inputting the image to be identified into a first network in a pre-trained image identification model, and determining text features of the image to be identified;
inputting the image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of the at least one object to be identified;
performing feature fusion on the text features of the image to be identified, the pooling feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
inputting the shared characteristic image into a third network in the image recognition model, and determining recognition information of the image to be recognized, wherein the recognition information comprises category information and position information of each object to be recognized;
wherein said inputting the image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of the at least one object to be identified, includes:
adjusting the resolution of each sample image in the sample image group to a preset resolution, and determining the adjusted sample image group;
inputting the adjusted sample image group into the depth residual network, and determining an original image set, wherein images in the original image set correspond one-to-one to images in the adjusted sample image group;
inputting the original image set into the region extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround identifiable objects, and N is an integer greater than 1;
extracting, based on the confidence of the N anchor frames, M anchor frames whose confidence is greater than a preset confidence threshold from the N anchor frames, wherein M is a positive integer smaller than N;
and inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the region convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.
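For illustration only, the following is a minimal Python sketch of the anchor-frame handling recited in claim 1: keeping the M anchor frames whose confidence exceeds a preset threshold and warping each mapped region to a fixed resolution as a stand-in for the region-of-interest pooling layer. Names such as `confidence_threshold` and `pooled_size`, the box format, and the block-max pooling are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def filter_anchors(boxes, scores, confidence_threshold=0.7):
    """Keep only the anchor boxes whose predicted confidence exceeds the threshold."""
    keep = scores > confidence_threshold
    return boxes[keep], scores[keep]

def roi_pool(feature_map, box, pooled_size=(7, 7)):
    """Crudely pool a mapped anchor region to a fixed resolution by block max."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = feature_map[y1:y2, x1:x2]
    out = np.zeros(pooled_size, dtype=feature_map.dtype)
    h_step = max(region.shape[0] // pooled_size[0], 1)
    w_step = max(region.shape[1] // pooled_size[1], 1)
    for i in range(pooled_size[0]):
        for j in range(pooled_size[1]):
            block = region[i * h_step:(i + 1) * h_step, j * w_step:(j + 1) * w_step]
            out[i, j] = block.max() if block.size else 0
    return out

if __name__ == "__main__":
    feature_map = np.random.rand(64, 64)
    boxes = np.array([[4, 4, 30, 40], [10, 12, 50, 60], [0, 0, 8, 8]], dtype=float)
    scores = np.array([0.9, 0.8, 0.3])
    kept_boxes, kept_scores = filter_anchors(boxes, scores)
    pooled = [roi_pool(feature_map, b) for b in kept_boxes]
    print(len(pooled), pooled[0].shape)  # 2 pooled maps, each 7x7
```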
2. The method of claim 1, wherein prior to the acquiring the image to be identified, the method further comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample image groups, each sample image group comprises a sample image and a corresponding label image thereof, label identification information of a target identification object and scene information of the sample image are marked in the label image, and the label identification information comprises category information and position information of the target identification object;
and training a preset image recognition model by using the sample image group in the training sample set until the training stopping condition is met, so as to obtain a trained image recognition model.
3. The method according to claim 2, wherein training the image recognition model using the sample image group in the training sample set until a training stop condition is satisfied, and obtaining a trained image recognition model specifically includes:
For each sample image group, the following steps are respectively executed:
inputting the sample image group into a first network in the preset image recognition model, and determining sample text features corresponding to each sample image;
inputting the sample image group into a second network in the preset image recognition model, and determining a sample pooling feature map and sample spatial relationship features of each recognizable object;
performing feature fusion on each sample image according to the sample text features corresponding to each sample image, the sample pooling feature map of each recognizable object and the sample spatial relationship features, and determining a sample shared feature image corresponding to each sample image;
inputting the sample shared feature image into a third network in the preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object;
performing non-maximum suppression processing on the reference position information of each recognizable object, filtering out the reference position information that does not meet a preset requirement, and determining prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and predicted position information of all recognizable objects;
determining a loss function value of the preset image recognition model according to the prediction identification information of a target sample image and the label identification information of all target recognition objects on the target sample image, wherein the target sample image is any sample image in the sample image group;
and in a case that the loss function value does not meet the training stop condition, adjusting model parameters of the image recognition model and continuing to train the parameter-adjusted image recognition model with the sample image group, until the loss function value meets the training stop condition, so as to obtain the trained image recognition model.
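As a rough illustration of the control flow described in claim 3, the skeleton below passes the three sub-networks, the fusion step, the NMS filter, the loss, and the parameter update in as callables. The names (`first_net`, `second_net`, `third_net`, `fuse`, `nms_filter`, `compute_loss`, `update_params`) and the stop condition are illustrative placeholders, not terms used by the patent.

```python
def train(sample_groups, first_net, second_net, third_net,
          fuse, nms_filter, compute_loss, update_params,
          stop_threshold=0.05, max_epochs=100):
    for _ in range(max_epochs):
        for images, labels in sample_groups:
            text_feats = first_net(images)              # sample text features
            pooled, spatial = second_net(images)        # pooled maps and spatial relations
            shared = fuse(text_feats, pooled, spatial)  # sample shared feature image
            reference = third_net(shared)               # classification + reference boxes
            predictions = nms_filter(reference)         # non-maximum suppression filtering
            loss = compute_loss(predictions, labels)    # compare with label information
            if loss <= stop_threshold:                  # training stop condition met
                return
            update_params(loss)                         # adjust model parameters

if __name__ == "__main__":
    # Trivial stand-ins only to show that the control flow runs end to end.
    train([("images", "labels")],
          first_net=lambda x: "text",
          second_net=lambda x: ("pooled", "spatial"),
          third_net=lambda x: "reference",
          fuse=lambda t, p, s: "shared",
          nms_filter=lambda r: "predictions",
          compute_loss=lambda p, l: 0.01,
          update_params=lambda loss: None)
```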
4. The method according to claim 3, wherein said inputting the sample image group into a first network in a preset image recognition model and determining sample text features corresponding to each sample image comprises:
For each sample image, the following steps are respectively executed:
inputting the sample image into a first network in the preset image recognition model, and determining at least one text vector according to the context information of the sample image;
and concatenating the at least one text vector, and determining the sample text features corresponding to the sample image.
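A minimal sketch of the step in claim 4: derive one vector per piece of context of the sample image and splice (concatenate) the vectors into a single sample text feature. The toy embedding below is a stand-in for the first network and is an illustrative assumption, not the embedding used by the patent.

```python
import numpy as np

def embed_token(token, dim=16):
    """Map a token to a fixed-length vector, seeded deterministically from its characters."""
    rng = np.random.default_rng(sum(ord(c) for c in token))
    return rng.standard_normal(dim)

def sample_text_feature(context_tokens, dim=16):
    """Splice the per-token text vectors into one sample text feature."""
    vectors = [embed_token(t, dim) for t in context_tokens]
    return np.concatenate(vectors) if vectors else np.zeros(dim)

if __name__ == "__main__":
    feature = sample_text_feature(["street", "night", "vehicle"])
    print(feature.shape)  # (48,): three 16-dimensional text vectors spliced together
```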
5. The method of claim 3, wherein the second network in the preset image recognition model comprises at least a depth residual network, a region extraction network, and a region convolutional neural network,
and wherein inputting the sample image group into the second network in the preset image recognition model and determining the sample pooling feature map and sample spatial relationship features of each recognizable object comprises:
adjusting the resolution of each sample image in the sample image group to a preset resolution, and determining an adjusted sample image group;
inputting the adjusted sample image group into the depth residual network, and determining an original image set, wherein images in the original image set correspond one-to-one to images in the adjusted sample image group;
inputting the original image set into the region extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround the recognizable object, and N is an integer greater than 1;
extracting, based on the confidence of the N anchor frames, M anchor frames whose confidence is greater than a preset confidence threshold from the N anchor frames, wherein M is a positive integer smaller than N;
inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the region convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each recognizable object corresponds to at least one anchor frame;
and determining the sample spatial relationship feature of each recognizable object according to the intersection-over-union ratio and the relative position between the at least one anchor frame corresponding to each recognizable object.
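The following sketch illustrates the spatial relationship feature in claim 5 under the assumption that it is built from the intersection-over-union (IoU) and the relative center offset between anchor frames. The [x1, y1, x2, y2] box format and the exact feature layout are illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_relation(box_a, box_b):
    """IoU plus the center offset of box_b relative to box_a, normalized by box_a's size."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    return np.array([iou(box_a, box_b), (cbx - cax) / wa, (cby - cay) / ha])

if __name__ == "__main__":
    print(spatial_relation([0, 0, 10, 10], [5, 5, 15, 15]))  # approx [0.14, 0.5, 0.5]
```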
6. A method according to claim 3, wherein said performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information that does not meet a preset requirement, and determining the predicted identification information of each sample image includes:
dividing all the identifiable objects into a plurality of groups based on the classification information of each identifiable object, and determining the reference position information of the identifiable objects of different groups;
filtering the reference position information of each type of identifiable object;
and determining the prediction identification information of each sample image according to the reference position information of the identifiable objects after filtering and the classification information of the identifiable objects after filtering.
7. The method of claim 6, wherein filtering the reference location information for each type of identifiable object comprises:
sequentially calculating the intersection-over-union ratio between a target frame and the other reference frames, wherein the target frame is any one of a plurality of reference frames, and the reference frames are boundary frames which are determined from the reference position information and surround the identifiable object;
filtering out the reference frames whose intersection-over-union ratio is larger than a preset intersection-over-union threshold, until the intersection-over-union ratio between any two remaining reference frames is smaller than the preset threshold;
and determining the reference frames remaining after filtering as the predicted position information of the identifiable objects.
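A minimal sketch of the per-class filtering in claims 6 and 7: within each class, repeatedly keep the highest-scoring reference frame and drop any other frame whose intersection-over-union ratio with it exceeds the preset threshold. The IoU helper mirrors the one sketched under claim 5; variable names and the threshold value are illustrative.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the reference frames kept after non-maximum suppression."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = np.array([i for i in rest if iou(boxes[best], boxes[i]) <= iou_threshold],
                         dtype=int)
    return keep

if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
    scores = np.array([0.9, 0.8, 0.7])
    print(nms(boxes, scores))  # [0, 2]: the near-duplicate of box 0 is filtered out
```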
8. An image recognition apparatus, the apparatus comprising:
the first acquisition module is used for acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
the first determining module is used for inputting the image to be identified into a first network in a pre-trained image identification model and determining text features of the image to be identified;
the second determining module is used for inputting the image to be identified into a second network in the image identification model and determining a pooling feature map and spatial relationship features of the at least one object to be identified;
the fusion module is used for performing feature fusion on the text features of the image to be identified, the pooling feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
the identification module is used for inputting the shared feature image into a third network in the image identification model, and determining identification information of the image to be identified, wherein the identification information comprises category information and position information of each object to be identified;
wherein the second determining module includes:
the first adjusting unit is used for adjusting the resolution of each sample image in the sample image group to a preset resolution and determining the adjusted sample image group;
the first determining unit is used for inputting the adjusted sample image group into the depth residual network to determine an original image set, wherein the images in the original image set correspond one-to-one to the images in the adjusted sample image group;
the second determining unit is used for inputting the original image set into the region extraction network and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround the identifiable object, and N is an integer greater than 1;
the extraction unit is used for extracting, based on the confidence of the N anchor frames, M anchor frames whose confidence is greater than a preset confidence threshold from the N anchor frames, wherein M is a positive integer smaller than N;
and the second adjusting unit is used for inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the region convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.
9. An image recognition apparatus, characterized in that the apparatus comprises: a processor and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method according to any one of claims 1-7.
10. A computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement the image recognition method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110400954.1A CN113052159B (en) | 2021-04-14 | 2021-04-14 | Image recognition method, device, equipment and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110400954.1A CN113052159B (en) | 2021-04-14 | 2021-04-14 | Image recognition method, device, equipment and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113052159A CN113052159A (en) | 2021-06-29 |
CN113052159B (en) | 2024-06-07
Family
ID=76519713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110400954.1A Active CN113052159B (en) | 2021-04-14 | 2021-04-14 | Image recognition method, device, equipment and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113052159B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591967B (en) * | 2021-07-27 | 2024-06-11 | 南京旭锐软件科技有限公司 | Image processing method, device, equipment and computer storage medium |
CN114663871A (en) * | 2022-03-23 | 2022-06-24 | 北京京东乾石科技有限公司 | Image recognition method, training method, device, system and storage medium |
CN114648478A (en) * | 2022-03-29 | 2022-06-21 | 北京小米移动软件有限公司 | Image processing method, device, chip, electronic equipment and storage medium |
CN115861720B (en) * | 2023-02-28 | 2023-06-30 | 人工智能与数字经济广东省实验室(广州) | Small sample subclass image classification and identification method |
CN116993963B (en) * | 2023-09-21 | 2024-01-05 | 腾讯科技(深圳)有限公司 | Image processing method, device, equipment and storage medium |
CN117710234B (en) * | 2024-02-06 | 2024-05-24 | 青岛海尔科技有限公司 | Picture generation method, device, equipment and medium based on large model |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271967A (en) * | 2018-10-16 | 2019-01-25 | 腾讯科技(深圳)有限公司 | The recognition methods of text and device, electronic equipment, storage medium in image |
CN109299274A (en) * | 2018-11-07 | 2019-02-01 | 南京大学 | A kind of natural scene Method for text detection based on full convolutional neural networks |
US10198671B1 (en) * | 2016-11-10 | 2019-02-05 | Snap Inc. | Dense captioning with joint interference and visual context |
CN110458165A (en) * | 2019-08-14 | 2019-11-15 | 贵州大学 | A kind of natural scene Method for text detection introducing attention mechanism |
CN111028235A (en) * | 2019-11-11 | 2020-04-17 | 东北大学 | Image segmentation method for enhancing edge and detail information by utilizing feature fusion |
CN111368893A (en) * | 2020-02-27 | 2020-07-03 | Oppo广东移动通信有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN111598214A (en) * | 2020-04-02 | 2020-08-28 | 浙江工业大学 | Cross-modal retrieval method based on graph convolution neural network |
CN111985369A (en) * | 2020-08-07 | 2020-11-24 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | 平安科技(深圳)有限公司 | Lip-reading recognition method and apparatus, computer device, and storage medium |
CN112070069A (en) * | 2020-11-10 | 2020-12-11 | 支付宝(杭州)信息技术有限公司 | Method and device for identifying remote sensing image |
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108986891A (en) * | 2018-07-24 | 2018-12-11 | 北京市商汤科技开发有限公司 | Medical imaging processing method and processing device, electronic equipment and storage medium |
- 2021-04-14: CN application CN202110400954.1A granted as patent CN113052159B (en), legal status: Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10198671B1 (en) * | 2016-11-10 | 2019-02-05 | Snap Inc. | Dense captioning with joint interference and visual context |
CN109271967A (en) * | 2018-10-16 | 2019-01-25 | 腾讯科技(深圳)有限公司 | The recognition methods of text and device, electronic equipment, storage medium in image |
CN109299274A (en) * | 2018-11-07 | 2019-02-01 | 南京大学 | A kind of natural scene Method for text detection based on full convolutional neural networks |
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | 平安科技(深圳)有限公司 | Lip-reading recognition method and apparatus, computer device, and storage medium |
CN110458165A (en) * | 2019-08-14 | 2019-11-15 | 贵州大学 | A kind of natural scene Method for text detection introducing attention mechanism |
CN111028235A (en) * | 2019-11-11 | 2020-04-17 | 东北大学 | Image segmentation method for enhancing edge and detail information by utilizing feature fusion |
CN111368893A (en) * | 2020-02-27 | 2020-07-03 | Oppo广东移动通信有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN111598214A (en) * | 2020-04-02 | 2020-08-28 | 浙江工业大学 | Cross-modal retrieval method based on graph convolution neural network |
CN111985369A (en) * | 2020-08-07 | 2020-11-24 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112070069A (en) * | 2020-11-10 | 2020-12-11 | 支付宝(杭州)信息技术有限公司 | Method and device for identifying remote sensing image |
Non-Patent Citations (3)
Title |
---|
A Novel Water-Shore-Line Detection Method for USV Autonomous Navigation; Zou, X. et al.; Sensors; Vol. 20, No. 6; Article No. 1682 *
Image-semantics-based visual privacy behavior recognition and protection system for service robots; Li Zhongyi, Yang Guanci, Li Yang, He Ling; Journal of Computer-Aided Design & Computer Graphics (No. 10); pp. 146-154 *
Cross-modal multi-label biomedical image classification modeling and recognition; Yu Yuhai, Lin Hongfei, Meng Jiana, Guo Hai, Zhao Zhehuan; Journal of Image and Graphics (No. 06); pp. 143-153 *
Also Published As
Publication number | Publication date |
---|---|
CN113052159A (en) | 2021-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113052159B (en) | Image recognition method, device, equipment and computer storage medium | |
CN111368687B (en) | Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation | |
WO2022141910A1 (en) | Vehicle-road laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field | |
Negru et al. | Image based fog detection and visibility estimation for driving assistance systems | |
CN111666805B (en) | Class marking system for autopilot | |
CN110619279B (en) | Road traffic sign instance segmentation method based on tracking | |
CN111274942A (en) | Traffic cone identification method and device based on cascade network | |
CN112329623A (en) | Early warning method for visibility detection and visibility safety grade division in foggy days | |
Belaroussi et al. | Impact of reduced visibility from fog on traffic sign detection | |
Zhang et al. | A graded offline evaluation framework for intelligent vehicle’s cognitive ability | |
CN114677507A (en) | Street view image segmentation method and system based on bidirectional attention network | |
CN110599497A (en) | Drivable region segmentation method based on deep neural network | |
Yebes et al. | Learning to automatically catch potholes in worldwide road scene images | |
Arora et al. | Automatic vehicle detection system in Day and Night Mode: challenges, applications and panoramic review | |
CN114973199A (en) | Rail transit train obstacle detection method based on convolutional neural network | |
Wang | Vehicle image detection method using deep learning in UAV video | |
Kühnl et al. | Visual ego-vehicle lane assignment using spatial ray features | |
Huu et al. | Proposing Lane and Obstacle Detection Algorithm Using YOLO to Control Self‐Driving Cars on Advanced Networks | |
Coronado et al. | Detection and classification of road signs for automatic inventory systems using computer vision | |
Matsuda et al. | A Method for Detecting Street Parking Using Dashboard Camera Videos. | |
CN118570749A (en) | Multi-mode road sensing method, system, terminal equipment and storage medium | |
Imad et al. | Navigation system for autonomous vehicle: A survey | |
CN111353481A (en) | Road obstacle identification method based on laser point cloud and video image | |
CN110555425A (en) | Video stream real-time pedestrian detection method | |
CN111611942B (en) | Method for extracting and building database by perspective self-adaptive lane skeleton |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |