
CN118097521B - Object recognition method, device, apparatus, medium and program product - Google Patents

Object recognition method, device, apparatus, medium and program product

Info

Publication number
CN118097521B
Authority
CN
China
Prior art keywords
selection area
frame
target
video
frame selection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410518192.9A
Other languages
Chinese (zh)
Other versions
CN118097521A (en)
Inventor
付灿苗
孙冲
李琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410518192.9A
Publication of CN118097521A
Application granted
Publication of CN118097521B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an object recognition method, apparatus, device, medium, and program product, and relates to the field of computer vision. The method comprises the following steps: extracting a first video frame from a target video based on a first frame interval in the process of following and identifying a target object in the target video, wherein the first video frame comprises a first object frame selection area obtained based on track prediction information of the target object; performing object recognition on the first video frame to obtain a second object frame selection area; under the condition that the first object frame selection area and the second object frame selection area meet the coincidence degree condition, adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area; and continuing to follow and identify the target object in the target video based on the target frame selection area. Video frames in the following recognition process are extracted based on the frame interval for object recognition, and when the object recognition result and the following recognition result meet the coincidence degree condition, the following recognition result is corrected through the object recognition result, so that losing the target during following is avoided, thereby improving the accuracy of following recognition of the target object.

Description

Object recognition method, device, apparatus, medium and program product
Technical Field
Embodiments of the present application relate to the field of computer vision, and in particular, to an object recognition method, apparatus, device, medium, and program product.
Background
Object tracking is an important application in the field of computer vision. Its main task is to track a specific object in a video or image sequence on a computer device.
In the related art, when tracking a target, after detecting the target object to be tracked in a certain frame of a video, a motion prediction model may be used to predict a motion track of the target object to determine a position of the target object in a subsequent frame, so as to track the target object.
However, in the tracking process, when the target object suddenly disappears or is blocked, the motion prediction model cannot acquire the direct motion information of the target object, so that the new position of the target object cannot be accurately predicted, and the target object is lost.
Disclosure of Invention
The embodiment of the application provides an object recognition method, apparatus, device, medium, and program product, which can improve the accuracy of following recognition of a target object.
In one aspect, there is provided an object recognition method, the method comprising:
Extracting a first video frame from a target video based on a first frame interval when a target object in the target video is recognized, wherein the first video frame comprises a first object frame selection area, and the first object frame selection area is a frame selection area obtained based on track prediction information of the target object in the process of recognizing the target object in a following manner;
performing object recognition on the first video frame to obtain a second object frame selection area, wherein the second object frame selection area is a frame selection area obtained after the target object is recognized from the first video frame;
Under the condition that the first object frame selection area and the second object frame selection area meet the coincidence ratio condition, adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area;
And continuing to follow and identify the target object in the target video based on the target frame selection area.
In another aspect, there is provided an object recognition apparatus, the apparatus comprising:
The data acquisition module is used for extracting a first video frame from the target video based on a first frame interval when a target object in the target video is identified, wherein the first video frame comprises a first object frame selection area, and the first object frame selection area is a frame selection area obtained based on track prediction information of the target object in the process of identifying the target object in a following manner;
The first identification module is used for carrying out object identification on the first video frame to obtain a second object frame selection area, wherein the second object frame selection area is a frame selection area obtained after the target object is identified from the first video frame;
The data adjustment module is used for adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area under the condition that the first object frame selection area and the second object frame selection area meet the coincidence ratio condition;
And the second identification module is used for continuing to follow and identify the target object in the target video based on the target frame selection area.
In another aspect, a computer device is provided, the computer device including a processor and a memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement any of the object recognition methods described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set loaded and executed by a processor to implement any of the above-described object recognition methods is provided.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform any of the object recognition methods described above.
The technical scheme provided by the embodiment of the application has the following beneficial effects.
Extracting a first video frame from the target video according to a first frame interval when the target object is followed and identified, wherein the first video frame comprises a first object frame selection area determined based on track prediction information in the process of following and identifying the target object; performing independent object recognition on the first video frame to re-determine the position of the target object in the first video frame, so as to obtain a second object frame selection area; and when the first object frame selection area and the second object frame selection area meet the coincidence degree condition, adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area. On one hand, video frames in the following recognition process are extracted based on the frame interval for object recognition, and when the object recognition result and the following recognition result meet the coincidence degree condition, the following recognition result is corrected through the object recognition result, which avoids losing the target during following and thereby improves the accuracy of following recognition of the target object. On the other hand, since object recognition on a video frame consumes considerable computing resources, performing object recognition only on video frames selected at intervals, instead of on every video frame, reduces the consumption of computing resources while ensuring the accuracy of following recognition, which makes it feasible to port the object recognition method to a mobile terminal.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a computer system provided in accordance with an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method of object recognition provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an object recognition method provided by another exemplary embodiment of the present application;
FIG. 4 is a flow chart of an object recognition method provided by yet another exemplary embodiment of the present application;
FIG. 5 is an overall framework diagram of an object recognition method provided by an exemplary embodiment of the present application;
FIG. 6 is a graphical representation of the visual results of an object recognition method provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic view of a visual result of an object recognition method according to another exemplary embodiment of the present application;
FIG. 8 is a graphical representation of the visual results of an object recognition method according to yet another exemplary embodiment of the present application;
FIG. 9 is a block diagram of an object recognition apparatus provided in an exemplary embodiment of the present application;
Fig. 10 is a block diagram of an object recognition apparatus according to another exemplary embodiment of the present application;
Fig. 11 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of promoting an understanding of the principles and advantages of the application, reference will now be made in detail to the embodiments of the application, some but not all of which are illustrated in the accompanying drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and no limitation on the amount or order of execution.
First, a brief description will be given of terms involved in the embodiments of the present application.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking, and measurement on targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought an important revolution to the development of computer vision technology: pre-trained models in the vision field such as Swin-Transformer, ViT (Vision Transformer), V-MoE (Vision Mixture of Experts), and MAE (Masked Autoencoder) can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, and other techniques, as well as common biometric recognition techniques such as face recognition.
In the related art, when tracking a target, after detecting the target object to be tracked in a certain frame of a video, a motion prediction model may be used to predict a motion track of the target object to determine a position of the target object in a subsequent frame, so as to track the target object. However, in the tracking process, when the target object suddenly disappears or is blocked, the motion prediction model cannot acquire the direct motion information of the target object, so that the new position of the target object cannot be accurately predicted, and the target object is lost.
The embodiment of the application provides an object recognition method. On one hand, video frames in the following recognition process are extracted based on a frame interval for object recognition, and when the object recognition result and the following recognition result meet the coincidence degree condition, the following recognition result is corrected through the object recognition result, which avoids losing the target during following and thereby improves the accuracy of following recognition of the target object. On the other hand, since object recognition on a video frame consumes considerable computing resources, performing object recognition only on video frames selected at intervals, instead of on every video frame, reduces the consumption of computing resources while ensuring the accuracy of following recognition, which makes it feasible to port the object recognition method to a mobile terminal.
The object recognition method provided by the embodiment of the application can be applied to an automatic security scene, an automatic driving scene, a medical scene (such as tracking the position and change of lesions, organs or injected objects), a virtual reality scene and an augmented reality scene (such as actually tracking the head, hands or other body parts of a user to realize interactive control or environmental perception), and the like, and the embodiment of the application is not limited to the above.
Next, a computer system provided by an embodiment of the present application will be described.
FIG. 1 illustrates a block diagram of a computer system provided in accordance with an exemplary embodiment of the present application. The computer system may be implemented as the system architecture of the object recognition method. The computer system includes a terminal 110 and a server 120, and the terminal 110 and the server 120 are connected through a communication network 130.
The object recognition method provided in the embodiment of the present application may be implemented by the terminal 110 alone, or may be implemented by the server 120 alone, or may be implemented by the terminal 110 and the server 120 through data interaction, which is not limited in the embodiment of the present application.
Terminal 110 may be an electronic device such as a mobile phone, a tablet computer, a vehicle-mounted terminal, a wearable device, or a PC (Personal Computer). The terminal 110 may run a client of a target application, where the target application may be an application dedicated to object recognition or an application provided with an object recognition function, which is not limited in the present application. In addition, the form of the target application is not limited in the present application: it may be an App (Application) or an applet installed in the terminal 110, or a web page, etc.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network) services, and basic cloud computing services such as big data and artificial intelligence platforms. The server 120 may be a background server of the target application, configured to provide background services for clients of the target application.
Cloud technology is a hosting technology that unifies a series of resources such as hardware, software, and network in a wide area network or a local area network to realize computation, storage, processing, and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied based on the cloud computing business model; it can form a resource pool that is used on demand in a flexible and convenient manner. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources, for example video websites, picture websites, and portal websites. With the rapid development and application of the internet industry, each article may have its own identification mark in the future, and the identification mark needs to be transmitted to a background system for logic processing; data of different levels will be processed separately, and various industry data require strong backing support from the system, which can only be realized through cloud computing. Optionally, the server 120 may also be implemented as a node in a blockchain system.
Alternatively, an object recognition method provided by the present application is described as an example of the interactive execution of the terminal 110 and the server 120.
Illustratively, the terminal 110 is provided with a camera, and the terminal 110 collects the target video through the camera (e.g., continuously captures a real-time video stream through the camera), and the collected target video is transmitted to the server 120 for further analysis and processing.
The server 120 is provided with a first recognition model and a second recognition model, wherein the first recognition model is used for predicting the position of the target object in the current video frame by using the information of the previous video frame, and the second recognition model is used for performing independent object recognition on a single video frame to obtain the position of the target object in that video frame. Referring to fig. 1, a target object exists in the t-th video frame of the target video; the t-th video frame includes a t-th detection frame used for identifying the position of the target object in the t-th video frame. The (t+1)-th to (t+N)-th video frames (N is the first frame interval; t and N are positive integers) are sequentially input into the first recognition model, and the first recognition model can use the position information of the target object indicated by the t-th detection frame to follow and recognize the target object in the subsequent N video frames, where the following recognition result for the (t+N)-th video frame is the first detection frame 121.
After the first detection frame 121 corresponding to the (t+N)-th video frame is obtained, the (t+N)-th video frame is also input into the second recognition model for object recognition, so as to obtain a second detection frame 122. After the first detection frame 121 and the second detection frame 122 are obtained, the server 120 determines the degree of coincidence between the first detection frame 121 and the second detection frame 122, and if the first detection frame 121 and the second detection frame 122 meet the coincidence degree condition, the first detection frame 121 is adjusted based on the second detection frame 122. For example, if the coincidence degree between the first detection frame 121 and the second detection frame 122 is within the interval of 0.5 to 0.8, the first detection frame 121 and the second detection frame 122 are fused according to a preset proportion to obtain a (t+N)-th detection frame 123, where the (t+N)-th detection frame 123 is used for identifying the position of the target object in the (t+N)-th video frame. The server 120 then continues to follow and identify the target object in the target video based on the (t+N)-th detection frame 123.
Optionally, after the server 120 obtains the t-th detection frame, ..., and the (t+N)-th detection frame, the detection frame information (including detection frame coordinates, confidence, class labels indicating object classes, etc.) is fed back to the terminal 110. After receiving the detection frame information, the terminal 110 performs following applications on the target object in the target video according to the detection frame information. For example: using the detection frame information, the terminal 110 can accurately locate the position of the target object in the video; or the terminal 110 may superimpose virtual elements or information on the target object to achieve an AR effect, such as adding accessories like virtual glasses and hats to a face detection frame; or, in a security scene, the terminal 110 may determine whether an abnormal object enters a target area according to the detection frame information; or the terminal 110 may implement gesture recognition or posture recognition based on the target object through the detection frame information. The embodiment of the present application is not limited thereto.
In some embodiments, the first recognition model and the second recognition model may also be provided in the terminal 110, i.e., the object recognition process is performed by the terminal 110 alone.
Those skilled in the art will appreciate that the number of terminals 110 may be greater or lesser. For example, the number of the terminals 110 may be only one, or tens or hundreds, or more. The number and device type of the terminals 110 are not limited in the embodiment of the present application.
It should be noted that, before and during the process of collecting the relevant data of the user, the present application may display a prompt interface, a popup window or output voice prompt information, where the prompt interface, popup window or voice prompt information is used to prompt the user to collect the relevant data currently, so that the present application only starts to execute the relevant step of obtaining the relevant data of the user after obtaining the confirmation operation of the user to the prompt interface or popup window, otherwise (i.e. when the confirmation operation of the user to the prompt interface or popup window is not obtained), the relevant step of obtaining the relevant data of the user is finished, i.e. the relevant data of the user is not obtained. In other words, all user data collected by the present application is collected with the user agreeing and authorized, and the collection, use and processing of relevant user data requires compliance with relevant laws and regulations and standards.
Next, a flow of the object recognition method provided by the present application will be described.
Fig. 2 is a flowchart of an object recognition method according to an embodiment of the present application. The method is performed by a computer device, which may be the terminal 110 and/or the server 120 shown in fig. 1. The method includes the following steps 210 to 240.
Step 210, extracting a first video frame from the target video based on the first frame interval while following the identification of the target object in the target video.
Alternatively, the target video may be implemented as a real-time video stream. Illustratively, a real-time video stream is obtained through real-time shooting by a terminal camera and is uploaded to computer equipment in real time, and the computer equipment can carry out follow-up identification on a target object in the real-time video stream.
Or the target video is implemented as a pre-acquired video. Illustratively, a pre-recorded video file is stored in a computer device, and the computer device can perform follow-up identification on a target object in the video file.
Alternatively, the number of the target objects may be one or a plurality. The embodiment of the present application is not limited thereto.
In the embodiment of the application, the computer equipment analyzes and processes the target video in the form of video frames. Illustratively, the target video is decoded, converted into successive video frames, and the computer device follows the identification of the target object in the successive video frames.
In other embodiments, the computer device may also follow the identification of the target object in a continuous sequence of images. Illustratively, a plurality of image frames are continuously acquired by the terminal camera and sent to the computer device, and the computer device can perform follow-up identification on the target object in the plurality of continuous image frames.
Optionally, after the computer device acquires a plurality of continuous video frames (or a plurality of continuous image frames) corresponding to the target video, the video frames (or the image frames) may be preprocessed to improve accuracy of subsequent identification. The preprocessing mode includes at least one of image graying (converting a color image into a gray image), image binarizing (converting the gray image into a binary image, that is, each pixel in the image has only two possible values, usually 0 and 255), image filtering (denoising and smoothing the image by using a filter), and the application is not limited to this.
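As an illustrative sketch (not part of the claimed method), the preprocessing modes listed above can be realized with OpenCV in Python roughly as follows; the parameter values are assumptions for illustration only.

```python
import cv2

def preprocess_frame(frame_bgr):
    """Sketch of the optional preprocessing: graying, binarization, and filtering."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)              # image graying
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # pixels become 0 or 255
    smoothed = cv2.GaussianBlur(gray, (5, 5), 0)                    # denoise / smooth with a filter
    return gray, binary, smoothed
```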
Following identifying a target object in a target video refers to tracking the target object based on a motion pattern of the target object in a plurality of consecutive video frames (or a plurality of consecutive image frames).
Optionally, the target object in the plurality of consecutive video frames is identified by the first identification model following. The first recognition model is used for predicting the motion trail of the target object in the subsequent frame based on the position of the target object of the current frame.
The first recognition model may be implemented as a tracker, where the tracker includes at least one of a tracker based on the optical flow method, a tracker based on a Kalman filter, a tracker based on a particle filter, a tracker based on a correlation filter, and the like, which is not limited in the present application.
Illustratively, the target object to be tracked is manually selected or automatically detected by a machine learning model (e.g., a second recognition model, as described below) in an initial frame of a plurality of consecutive video frames, wherein the initial frame refers to a video frame of the plurality of consecutive video frames that contains the target object. Then, extracting the characteristics corresponding to the target object in the initial frame through the first recognition model; in a video frame (hereinafter referred to as a subsequent frame) after the initial frame, the first recognition model can analyze the motion pattern of the target object based on the features corresponding to the target object, predict the motion track of the target object in the subsequent frame, and identify the position of the target object in the subsequent frame through a frame selection area.
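The following is a minimal sketch of one possible following-recognition step, assuming the first recognition model is realized as a pyramidal Lucas-Kanade optical-flow tracker (one of the tracker types listed above); the window size and pyramid level are illustrative assumptions, not values from the patent.

```python
import cv2

def follow_one_step(prev_gray, next_gray, prev_points):
    """Propagate the target's feature points from the previous frame to the next frame
    and derive the predicted frame-selection area from the tracked points.
    prev_points: float32 array of shape (N, 1, 2) extracted from the initial/previous frame."""
    next_points, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_points, None, winSize=(21, 21), maxLevel=3)
    good = next_points[status.ravel() == 1]
    if len(good) == 0:
        return None, None                       # target may be lost in this frame
    xs, ys = good[:, 0, 0], good[:, 0, 1]
    box = (float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max()))
    return box, good                            # predicted frame-selection area and surviving points
```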
In the following recognition process, in order to avoid following the target and ensure the accuracy of following recognition, a first video frame can be acquired from a plurality of continuous video frames according to a first frame interval, and the position of the target object in the first video frame is redetermined.
The first video frame comprises a first object framing area, and the first object framing area is obtained based on track prediction information of the target object in the process of following and identifying the target object. That is, the first object selection area is a selection area predicted by the first recognition model, and is used to identify the position of the target object predicted by the first recognition model in the first video frame.
Alternatively, the first object-frame selection area may be implemented in the form of a bounding box (typically a rectangular box for closely surrounding the target object), a polygonal box, a 3D detection box (in a three-dimensional target detection task, the detection box needs to have three-dimensional information including length, width, height, and position and direction of the target object in three-dimensional space), and the like, which is not limited by the embodiment of the present application.
The first frame interval refers to the interval, measured in frames, at which video frames are extracted from the plurality of consecutive video frames.
In some embodiments, the first frame interval is a preset frame interval, for example: if the preset frame interval is 5, one frame is extracted every 5 frames for processing.
In other embodiments, the first frame interval is a dynamically determined frame interval.
Optionally, when following the target object in the identified target video, acquiring a picture scaling corresponding to a second video frame, where the second video frame is a video frame extracted from the target video; the first frame interval is determined according to the picture scaling.
Wherein, the picture scaling and the frame interval are in a negative correlation, i.e. the larger the picture scaling, the smaller the frame interval, and the smaller the picture scaling, the larger the frame interval.
Illustratively, in the process of following the identification target video, firstly, determining a1 st frame interval according to the picture scaling of an initial frame; extracting a1 st video frame needing object identification from a target video according to a1 st frame interval, simultaneously obtaining a picture scaling of the extracted video frame, and determining a2 nd frame interval according to the picture scaling; and extracting the 2 nd video frame needing object identification from the target video according to the 2 nd frame interval. And so on, the subsequent frame interval is determined.
Alternatively, a first preset proportion x and a second preset proportion y are set, where y is greater than or equal to x. If the picture scaling k of the initial frame is greater than or equal to y, the preset frame interval is reduced according to the proportion difference k-y to obtain the 1st frame interval, where the reduction amplitude of the preset frame interval is positively correlated with the proportion difference k-y; if k is smaller than x, the preset frame interval is increased according to the proportion difference x-k to obtain the 1st frame interval, where the increase amplitude of the preset frame interval is positively correlated with the proportion difference x-k; and if k is greater than or equal to x and smaller than y, the preset frame interval is taken as the 1st frame interval. The subsequent frame intervals are determined in the same manner.
In the embodiment of the application, the frame interval is determined according to the picture scaling of the last extracted video frame, and then the video frame extraction is continuously carried out according to the frame interval, so that the frame interval is dynamically adjusted. In the case of a large picture scale (large picture scale means that the target object has a significant variation between successive frames), the frames can be extracted more frequently to maintain accuracy of the following recognition; and in the case of smaller picture scaling, the frequency of frame extraction is reduced to improve efficiency.
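A minimal sketch of the dynamic frame-interval rule described above is given below; the thresholds x and y and the gain factor are hypothetical values, since the text only specifies the directions of the correlations.

```python
def next_frame_interval(scale_k, base_interval=5, x=0.8, y=1.2,
                        gain=10, min_interval=1):
    """Frame interval is negatively correlated with the picture scaling:
    large scaling -> extract frames more often, small scaling -> less often."""
    if scale_k >= y:
        interval = base_interval - int(round((scale_k - y) * gain))   # shrink according to k - y
    elif scale_k < x:
        interval = base_interval + int(round((x - scale_k) * gain))   # grow according to x - k
    else:
        interval = base_interval
    return max(min_interval, interval)
```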
And 220, performing object recognition on the first video frame to obtain a second object frame selection area.
Object identification of a first video frame refers to finding a target object in the first video frame. Optionally, object recognition is performed on the first video frame by the second recognition model.
It should be noted that the second recognition model usually searches for the target object independently in a certain video frame or image frame, and does not depend on the information of the previous frame, in other words, the second recognition model usually recognizes the position of the target object on a static image. In addition, the amount of computation of the first recognition model is generally smaller than that of the second recognition model, because the first recognition model can utilize the information of the previous frame or frames to assist the tracking of the current frame, which helps to reduce the computation complexity in the current frame; in contrast, the second recognition model needs to search for and recognize the target from scratch in each frame, which increases the calculation amount thereof. And, the first recognition model typically employs a lighter weight model to enable application in real-time systems; while the second recognition model requires the use of a more complex model in order to obtain higher accuracy, thereby increasing the amount of calculation.
The second recognition model may be implemented as a detector. Optionally, the detector includes at least one of a Convolutional Neural Network (CNN) based detector, a Recurrent Neural Network (RNN) based detector, a Long Short-Term Memory (LSTM) based detector, a Generative Adversarial Network (GAN) based detector, and the like, which is not limited in the present application.
The second object frame selection area is a frame selection area obtained after the target object is identified from the first video frame. That is, the second object frame selection area is a frame selection area predicted by the second recognition model, and is used to identify the position of the target object in the first video frame as predicted by the second recognition model. Alternatively, the second object frame selection area may be implemented in the form of a bounding box, a polygonal box, a 3D detection box, or the like.
The process of identifying the second object selection area is described below. Illustratively, prior to identifying the first video frame, the first video frame may be pre-processed, including at least one of resizing, normalizing, denoising, and the like. Extracting, by a detector, features in the preprocessed first video frame, for example: color, texture, shape, etc.; based on the extracted features, the detector may generate a plurality of candidate regions that may contain the target object, i.e., a plurality of possible locations of the target object; for each candidate region, the detector judges the possibility of the target object contained in the candidate region, and then the candidate region with the highest possibility is taken as a second object framing region corresponding to the target object.
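A sketch of the final selection step, assuming the detector has already produced candidate regions with likelihood scores (the data structure below is an assumption, not the detector's actual interface):

```python
def second_object_box(candidates):
    """candidates: list of (box, score) pairs, where box = (x1, y1, x2, y2).
    Returns the candidate most likely to contain the target object."""
    if not candidates:
        return None                      # the target object was not found in the first video frame
    box, _score = max(candidates, key=lambda c: c[1])
    return box                           # second object frame-selection area
```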
And step 230, adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area under the condition that the first object frame selection area and the second object frame selection area meet the coincidence degree condition.
In the embodiment of the application, when the coincidence ratio between the first object frame selection area and the second object frame selection area does not reach higher coincidence ratio, the position reliability of the target object indicated by the current first object frame selection area is lower, and the first object frame selection area can be adjusted through the more accurate second object frame selection area for correcting the position of the target object.
Optionally, when the contact ratio between the first object frame selection area and the second object frame selection area is smaller than or equal to the first contact ratio, fusing the first object frame selection area and the second object frame selection area based on a first weight corresponding to the first object frame selection area and a second weight corresponding to the second object frame selection area to obtain the target frame selection area.
Illustratively, the intersection ratio (Intersection over Union, IoU) between the first object frame selection area and the second object frame selection area may be calculated as the degree of coincidence between them. Intersection ratio = intersection area ÷ (area of the first object frame selection area + area of the second object frame selection area - intersection area), where the intersection area refers to the area of the overlapping region of the first object frame selection area and the second object frame selection area. The value of the intersection ratio ranges from 0 to 1: a larger value indicates a higher degree of coincidence between the first object frame selection area and the second object frame selection area, a value of 1 indicates complete coincidence, and a value of 0 indicates no coincidence at all.
After the coincidence degree between the two frame selection areas is calculated, if the coincidence degree is less than or equal to the first coincidence degree, the first object frame selection area and the second object frame selection area are fused. Schematically, assuming that the frame selection areas are implemented as rectangular frames, the vertices of the first object frame selection area are A1(m1,n1), B1(m2,n2), C1(m3,n3), D1(m4,n4), with the first weight being x; the vertices of the second object frame selection area are A2(p1,q1), B2(p2,q2), C2(p3,q3), D2(p4,q4), with the second weight being y. The vertices of the fused target frame selection area are A((x·m1+y·p1)/(x+y), (x·n1+y·q1)/(x+y)), B((x·m2+y·p2)/(x+y), (x·n2+y·q2)/(x+y)), C((x·m3+y·p3)/(x+y), (x·n3+y·q3)/(x+y)), D((x·m4+y·p4)/(x+y), (x·n4+y·q4)/(x+y)); that is, each vertex of the target frame selection area is the weighted average of the corresponding vertices of the two object frame selection areas.
Optionally, the first weight has a positive correlation with the contact ratio between the first object frame selection area and the second object frame selection area, that is, the higher the contact ratio between the first object frame selection area and the second object frame selection area is, the greater the first weight is; the second weight is in a negative correlation with the overlap ratio between the first object selection area and the second object selection area, that is, the higher the overlap ratio between the first object selection area and the second object selection area is, the smaller the second weight is.
In the above embodiment, when the contact ratio between the first object frame selection area and the second object frame selection area is smaller than or equal to the first contact ratio, the first object frame selection area and the second object frame selection area are fused according to the preset weight, on one hand, since the second object frame selection area is obtained from a more accurate object recognition result, the accuracy of the following recognition result can be improved by correcting the following recognition result through the object recognition result; on the other hand, the first object frame selection area and the second object frame selection area are selected and fused instead of directly using the second object frame selection area, so that an object track formed by the object frame selection areas corresponding to continuous video frames is smoother, and the following recognition stability is improved.
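The coincidence-degree computation and the weighted fusion described above can be sketched as follows (boxes are assumed to be axis-aligned rectangles given as (x1, y1, x2, y2); this is an illustration, not the claimed implementation):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes, used as the coincidence degree."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(box_first, box_second, w_first, w_second):
    """Weighted fusion of the first (tracker) and second (detector) frame-selection areas."""
    s = w_first + w_second
    return tuple((w_first * a + w_second * b) / s
                 for a, b in zip(box_first, box_second))
```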
In some embodiments, the second object selection region is taken as the target selection region when the degree of overlap between the first object selection region and the second object selection region is less than or equal to the second degree of overlap. Wherein the first overlap ratio is greater than the second overlap ratio.
Schematically, in the embodiment of the present application, if the overlap ratio between the first object frame selection area and the second object frame selection area is too small, the position of the target object indicated by the first object frame selection area is considered to be completely unreliable, and then the more accurate second object frame selection area obtained by the second recognition model is directly taken as the target frame selection area.
In the above embodiment, when the position of the target object indicated by the first object frame selection area is completely unreliable, the result of the second recognition model is directly adopted as the target frame selection area, so that the tracking of the target can be quickly responded and restored, the subsequent follow-up recognition by using error data is avoided, and the risk of tracking loss is reduced.
In some embodiments, the first object selection region is taken as the target selection region when the contact ratio between the first object selection region and the second object selection region is greater than the first contact ratio.
Schematically, in the embodiment of the present application, if the overlap ratio between the first object frame selection area and the second object frame selection area is higher, the position reliability of the target object indicated by the first object frame selection area is considered to be higher, and in this case, in order to make the track prediction result of the target object smooth, jump is avoided, and the first object frame selection area identified by the first identification model is used as the target frame selection area.
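Putting the three cases together, a sketch of the decision on the target frame selection area (reusing iou() and fuse_boxes() from the previous sketch) could look like this; the threshold values are the examples given in the text, and the weighting rule w_first = overlap, w_second = 1 - overlap is one illustrative choice consistent with the correlations described above, not the only possible one:

```python
def target_box(box_first, box_second, first_overlap=0.8, second_overlap=0.5):
    """Select or fuse the tracker box (first) and the detector box (second)."""
    overlap = iou(box_first, box_second)
    if overlap > first_overlap:
        return box_first                  # tracker box is reliable: keep it to avoid jumps
    if overlap <= second_overlap:
        return box_second                 # tracker box is unreliable: reset to the detector box
    # partial overlap: fuse, trusting the tracker more when the overlap is higher
    return fuse_boxes(box_first, box_second, overlap, 1.0 - overlap)
```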
In some embodiments, the first overlap ratio and the second overlap ratio are predetermined overlap ratios. For example: the first overlap ratio was 0.8, and the second overlap ratio was 0.5.
In other embodiments, the first contact ratio is a dynamically determined contact ratio. Optionally, the method of dynamically determining the first overlap ratio includes at least one of the following methods.
The method comprises the following steps: and determining the first contact ratio according to the picture rotation information.
Acquiring picture rotation information corresponding to a first video frame, wherein the picture rotation information is used for indicating picture rotation conversion of the first video frame relative to the previous video frame; and when the picture rotation information accords with the preset conversion condition, adjusting the preset contact ratio based on the picture rotation information to obtain a first contact ratio.
The preset coincidence degree is a coincidence degree threshold set in advance.
Illustratively, feature point matching is performed between the first video frame and the previous video frame, so as to determine matched feature points (the matched feature points may be extracted by using a first recognition model, which will be described in detail in the embodiment of fig. 3, and not described herein), and then a transformation matrix between the two frames is determined based on the matched feature points, where the transformation matrix describes a rotation transformation from the previous video frame to the first video frame, and a rotation angle is determined based on the transformation matrix, where the rotation angle is used to indicate a rotation transformation of the first video frame relative to a picture occurring in the previous video frame, that is, picture rotation information.
The picture rotation information meets the preset transformation condition when the rotation angle from the previous video frame to the first video frame falls within a preset rotation interval, such as [-270 degrees, -90 degrees] or [90 degrees, 270 degrees]. When the rotation angle is within this range, the picture of the first video frame can be considered to be flipped. The features of the target object in a flipped picture may be inconsistent with its features under normal conditions, so the accuracy of the features identified by the first recognition model decreases and the reliability of the obtained first object frame selection area is reduced. In this case, the preset coincidence degree can be raised, that is, the first coincidence degree is set to a higher value, for example 0.9; in other words, the first object frame selection area is considered trustworthy, and is selected as the target frame selection area, only when the first object frame selection area and the second object frame selection area substantially completely coincide, thereby reducing the risk of misrecognition or tracking loss caused by the flipped picture.
If the rotation angle from the previous video frame to the first video frame is outside the preset rotation interval, the picture of the first video frame can be considered not to be flipped, and the preset coincidence degree can be used directly as the first coincidence degree.
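Method one can be sketched as a simple threshold adjustment; the raised value 0.9 follows the example in the text, while treating the adjustment as a fixed jump (rather than a function of the rotation angle) is an assumption for illustration:

```python
def first_overlap_from_rotation(rotation_deg, preset_overlap=0.8, flipped_overlap=0.9):
    """Raise the first coincidence degree when the picture is judged to be flipped."""
    flipped = (-270.0 <= rotation_deg <= -90.0) or (90.0 <= rotation_deg <= 270.0)
    return flipped_overlap if flipped else preset_overlap
```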
The second method is as follows: the first overlap ratio is determined based on the illumination variation.
Acquiring illumination change information corresponding to a first video frame, wherein the illumination change information is used for indicating illumination intensity change of the first video frame relative to the previous video frame; and when the illumination change information accords with the preset change condition, adjusting the preset overlap ratio based on the illumination change information to obtain a first overlap ratio.
Illustratively, the illumination change information corresponding to the first video frame refers to the intensity difference between the illumination intensities of the first video frame and the previous video frame; the illumination intensity of a video frame can be determined through its brightness, contrast, color distribution, and the like.
If the intensity difference is greater than or equal to the preset intensity difference, that is, the illumination intensity is changed greatly, which means that the appearance or scene background of the target object is changed significantly, the reliability of the selected area of the first object frame obtained at this time is lower, and the preset overlap ratio can be improved, that is, the first overlap ratio can be set to a higher value, so that the accurate tracking of the target object is maintained.
If the intensity difference is smaller than the preset intensity difference, that is, the illumination intensity is changed less, meaning that the appearance or scene background of the target object is not changed significantly, the preset overlap ratio can be directly used as the first overlap ratio.
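Method two can be sketched analogously, approximating the illumination intensity of a frame by its mean gray level; the intensity-difference threshold and the raised value are assumptions for illustration only:

```python
def first_overlap_from_illumination(prev_gray, cur_gray,
                                    preset_overlap=0.8, raised_overlap=0.9,
                                    preset_intensity_diff=40.0):
    """Raise the first coincidence degree when the illumination changes strongly between frames.
    prev_gray / cur_gray: grayscale images of the previous and the first video frame."""
    diff = abs(float(cur_gray.mean()) - float(prev_gray.mean()))
    if diff >= preset_intensity_diff:
        return raised_overlap             # appearance or background changed a lot
    return preset_overlap
```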
It should be noted that the foregoing examples for dynamically determining the first overlap ratio are merely illustrative, and the embodiments of the present application are not limited thereto.
Alternatively, the second overlap may be a dynamically determined overlap, and the method for dynamically determining the second overlap may refer to the method for dynamically determining the first overlap, which is not described herein.
The target frame selection area is the finally determined frame selection area used for representing the position of the target object in the first video frame.
Step 240, continuing to follow the identified target object in the target video based on the target frame selection area.
Illustratively, after determining the target frame selection area, a first video frame containing the target frame selection area is taken as an initial frame. Based on the first video frame, analyzing the video frames after the first video frame through the first identification model, and predicting the motion trail of the target object in the video frames after the first video frame. This process may refer to the description in step 210, and will not be repeated here.
In summary, in the object recognition method provided by the embodiment of the application, on one hand, video frames in the following recognition process are extracted based on a frame interval for object recognition, and when the object recognition result and the following recognition result meet the coincidence degree condition, the following recognition result is corrected through the object recognition result, which avoids losing the target during following and thereby improves the accuracy of following recognition of the target object; on the other hand, since object recognition on a video frame consumes considerable computing resources, performing object recognition only on video frames selected at intervals, instead of on every video frame, reduces the consumption of computing resources while ensuring the accuracy of following recognition, which makes it feasible to port the object recognition method to a mobile terminal.
In some embodiments, the following recognition may be implemented using feature-point-matching-based optical flow estimation that extracts feature points from the video frame that are useful for optical flow estimation to predict the motion trajectory of the target object. In the embodiment of the application, in order to reduce the calculation amount required by feature extraction, feature extraction can be performed only on the object frame selection area corresponding to the video frame. Illustratively, the embodiment illustrated in fig. 2 described above may also be implemented as steps 311 through 340 as illustrated in fig. 3.
Step 311, obtaining the t-th video frame in the multiple continuous video frames corresponding to the target video.
The t-th video frame comprises a t-th object frame selection area, the t-th object frame selection area is used for marking the position of a target object in the t-th video frame, and t is a positive integer.
Illustratively, the t-th video frame is a start frame of a detection period, where the detection period includes N video frames, where N represents a first frame interval. The description will be given taking the t-th video frame as an example of a start frame from which the following identification starts among a plurality of consecutive video frames.
If the target object generally refers to any object in the target video, the t-th video frame may be implemented as the 1st video frame among the plurality of consecutive video frames; the 1st video frame is then input into the second recognition model for object recognition to obtain the t-th object frame selection area corresponding to the target object. The process of object recognition by the second recognition model may refer to the description in step 220 and is not repeated herein.
If the target object refers to a certain object in the target video, the t-th video frame can be implemented as a first video frame containing the target object in a plurality of continuous video frames, the plurality of continuous video frames are sequentially input into a second recognition model for object recognition, when the second recognition model recognizes the target object from the certain video frame, the video frame is taken as the t-th video frame, and a t-th object frame selection area corresponding to the target object is determined.
And step 312, performing feature recognition on the t-th object frame selection area in the t-th video frame to obtain a plurality of first feature points corresponding to the target object in the t-th object frame selection area.
Optionally, feature recognition is performed on the t-th object frame selection area in the t-th video frame through the first recognition model, so as to obtain a plurality of first feature points.
Optionally, the feature points include at least one of edges, corner points, spots, areas, and the like, which is not limited by the embodiment of the present application. Where edges often represent a significant change in brightness or color in the image, typically representing boundaries or demarcations between objects. Corner points are typically where two or more edges meet. Spots generally represent places in an image where the local area is significantly different in brightness or color from surrounding areas, which places may represent points of interest, objects, flaws, or other important local structures in the image. An area generally represents a collection of contiguous pixels in an image that have similar properties (e.g., color, texture).
Optionally, in the embodiment of the present application, the implementation of feature points is illustrated as corner points.
Illustratively, the first recognition model includes a feature recognition network, and the feature recognition network performs feature recognition on the t-th object frame selection area to obtain a plurality of first feature points. The goodFeaturesToTrack method can be called in the feature recognition network to extract the corner features in the t-th object frame selection area; the method comprises the following steps.
(1) And carrying out graying treatment on the t-th video frame.
Optionally, in the embodiment of the present application, feature recognition is performed only for the t-th object frame selection area in the t-th video frame, and then graying processing may also be performed only for the t-th object frame selection area.
(2) Image gradients are calculated from the gray values.
Illustratively, the image gradient is the rate of change of the gray value at each pixel in the image and is used to represent the edge intensity and direction information of each pixel; the gradient of the image is calculated by applying a Sobel, Prewitt or other edge detection operator.
(3) Based on the image gradient, corner response values are calculated by corner response functions.
The corner response function is used for giving a corner response value of each pixel point, wherein the larger the value is, the more likely the point is a corner, and the corner response function comprises any one of a Harris corner response function, a Shi-Tomasi corner response function and the like.
(4) And determining the corner based on the corner response value.
Illustratively, a response value threshold is set, and pixels with corner response values greater than the response value threshold are regarded as corner points. Optionally, redundant corner points in a local area can be eliminated by non-maximum suppression (NMS), so that the selected corner points are distributed uniformly and representatively.
In some embodiments, a first number threshold may be further set, and if the number of obtained corner points is greater than the first number threshold, the corner points may be screened according to the size of the corner response value. For example: if the first number threshold is 50 and 100 corner points are obtained, the 50 corner points with the highest corner response values are selected from the 100 corner points as the corner points finally used for following identification.
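As a non-limiting illustration, the corner extraction of steps (1) to (4) can be sketched in Python with the goodFeaturesToTrack interface provided by OpenCV, restricted to the t-th object frame selection area; the (x, y, w, h) box format, the maximum corner count of 50 and the quality parameters below are assumptions of this sketch rather than values prescribed by the embodiment.

    import cv2
    import numpy as np

    def extract_corners(frame_bgr, box, max_corners=50):
        # box = (x, y, w, h): the t-th object frame selection area (assumed format)
        x, y, w, h = box
        roi = frame_bgr[y:y + h, x:x + w]
        # (1) graying is limited to the frame selection area
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        # (2)-(4) image gradient, Shi-Tomasi corner response and screening by response
        # value are handled inside goodFeaturesToTrack; minDistance plays the role of a
        # simple non-maximum suppression so the kept corners are spread out
        corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                          qualityLevel=0.01, minDistance=7)
        if corners is None:
            return np.empty((0, 2), dtype=np.float32)
        # map ROI coordinates back to full-frame coordinates
        return corners.reshape(-1, 2) + np.array([x, y], dtype=np.float32)

When more corners than the first number threshold are available, maxCorners already keeps only those with the highest response values, which corresponds to the screening described above.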
In some embodiments, when feature recognition is performed on the t-th object box selection area, the t-th object box selection area is divided into a plurality of sub-areas, and the sub-areas needing feature recognition are screened from the plurality of sub-areas.
Optionally, the t-th object frame selection area in the t-th video frame is divided into a plurality of sub-areas; the plurality of sub-areas are screened according to the spatial relationship among them to obtain a target sub-area; and feature recognition is performed on the target sub-area to obtain a plurality of first feature points corresponding to the target object.
Illustratively, the screening according to the spatial relationship between the sub-regions may be performed according to their relative positions, such as: determining the sub-region located at the center of the object frame selection region as the sub-region requiring feature recognition, because the central sub-region is likely to contain the subject portion of the object and is thus more likely to contain important features; or determining the sub-regions close to the boundary of the object frame selection region as the sub-regions requiring feature recognition, because the boundary sub-regions may contain edge information of the object, which is very important for the recognition of feature points such as corner points.
Alternatively, the multiple sub-regions may also be screened by the recognition of the sub-regions in consecutive frames. Illustratively, if a sub-region does not detect a corner in consecutive frames, it may be considered that the sub-region does not currently require feature recognition.
In the above embodiment, the object frame selection area is further divided on the basis of feature extraction of the object frame selection area, and the sub-area most likely containing the key feature points is selected from the plurality of sub-areas, so as to reduce the calculation amount of subsequent feature recognition.
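A minimal sketch of the sub-region division and screening described above, assuming the frame selection area is split into a uniform grid and that either the central rule or the boundary rule is applied together with the consecutive-frame rule; the grid size, the choice of rule and the staleness count are illustrative assumptions.

    def screen_subregions(box, grid=4, mode="center", no_corner_counts=None):
        # box = (x, y, w, h): the t-th object frame selection area (assumed format)
        x, y, w, h = box
        cw, ch = w // grid, h // grid
        kept = []
        for r in range(grid):
            for c in range(grid):
                if mode == "center":   # central cells likely cover the subject of the object
                    keep = r not in (0, grid - 1) and c not in (0, grid - 1)
                else:                  # "border": boundary cells likely carry edge corners
                    keep = r in (0, grid - 1) or c in (0, grid - 1)
                # consecutive-frame rule: drop cells with no corner in several recent frames
                if no_corner_counts is not None and no_corner_counts.get((r, c), 0) >= 3:
                    keep = False
                if keep:
                    kept.append((x + c * cw, y + r * ch, cw, ch))
        return kept   # only these sub-areas are passed to feature recognition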
Step 313, based on the plurality of first feature points, the target object in the N video frames after the t-th video frame is identified in a following manner, so as to obtain a first object frame selection area corresponding to the (t+N)-th video frame.
N is a preset frame interval, the (t+N)-th video frame is the first video frame, and the first object frame selection area is a frame selection area obtained based on track prediction information of the target object in the process of following and identifying the target object.
Optionally, in the embodiment of the present application, if an optical flow estimation method is used for performing the following identification, the method for following identification of the target object in N video frames after the t video frame based on the plurality of first feature points further includes the following steps.
Step 1: and carrying out feature recognition on the (t+i) th video frame to obtain a plurality of second feature points corresponding to the target object, wherein i is less than or equal to N and is a positive integer.
Schematically, feature recognition is performed on the (t+i) th video frame through the first recognition model to obtain a plurality of second feature points corresponding to the target object, wherein the types of the first feature points and the second feature points are the same, such as: the first feature point is a corner point, and the second feature point is also a corner point.
Step 2: and matching the plurality of second characteristic points with a plurality of characteristic points corresponding to the t+i-1 video frames, and determining a target characteristic point from the plurality of second characteristic points.
The number of the target feature points may be plural or one.
Illustratively, taking a designated feature point among the plurality of first feature points as an example, the similarity between the designated feature point and the plurality of second feature points is calculated (in the embodiment of the present application, a feature point may be described by a vector representation, where the vector representation includes pixel information or statistical characteristics of the area around the feature point), for example a Euclidean distance, a Mahalanobis distance, or a cosine similarity. If the similarity between a certain second feature point and the designated feature point is greater than or equal to a preset similarity, the second feature point is considered to match the designated feature point and is determined to be a target feature point, where the target feature point can be considered to be a feature point existing in both the (t+i)-th video frame and the (t+i-1)-th video frame.
Step 3: and determining a motion vector corresponding to the t+i video frame based on the position of the target feature point in the t+i video frame and the position of the feature point matched with the target feature point in the t+i-1 video frame.
Wherein the motion vector is used to indicate a change in displacement of the target object between the t+i-1 th video frame and the t+i-th video frame. Illustratively, the motion vector is calculated based on the change in position of the target feature point for which matching is successful.
Step 4: and determining an object frame selection area corresponding to the (t+i) th video frame based on the motion vector.
Illustratively, based on the calculated motion vector, the frame selection area of the target object in the t+i-th video frame may be updated, for example: and translating the object frame selection area in the t+i-1 video frame according to the motion vector.
In the embodiment, the accurate tracking of the target object in the continuous video frames is realized through the extraction and matching of the characteristic points of the target object in the continuous video frames, and the accuracy of follow-up identification is improved.
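As a non-limiting illustration, steps 1 to 4 can be sketched with the pyramidal Lucas-Kanade optical flow routine calcOpticalFlowPyrLK provided by OpenCV; here the status flag returned by the routine stands in for the similarity matching described above, and the window size, pyramid depth, use of the mean displacement, and the box format are assumptions of this sketch.

    import cv2
    import numpy as np

    def track_one_step(prev_gray, cur_gray, prev_pts, box):
        # prev_pts: feature points of the (t+i-1)-th frame, float32 array of shape (N, 1, 2)
        # Steps 1-2: locate the corresponding points in the (t+i)-th frame and keep the
        # successfully matched (target) feature points indicated by the status flag
        cur_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, prev_pts, None,
                                                         winSize=(21, 21), maxLevel=3)
        ok = status.ravel() == 1
        good_prev, good_cur = prev_pts[ok], cur_pts[ok]
        if len(good_cur) == 0:
            return prev_pts, box, np.zeros(2, dtype=np.float32)
        # Step 3: motion vector from the position change of the matched feature points
        motion = np.mean(good_cur - good_prev, axis=(0, 1))
        # Step 4: translate the object frame selection area by the motion vector
        x, y, w, h = box
        new_box = (x + float(motion[0]), y + float(motion[1]), w, h)
        return good_cur.reshape(-1, 1, 2), new_box, motion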
In some embodiments, a median motion vector from the initial motion vector to the currently calculated motion vector may also be calculated, and an object box region corresponding to the video frame may be determined based on the median motion vector.
When i=1, i.e., for the t-th video frame and the t+1th video frame, since it belongs to the first pair of adjacent video frames, the calculated motion vector 1 is taken as the initial motion vector.
When i is more than 1 and less than or equal to N, namely, for video frames after the (t+1) th video frame, determining a median motion vector among a plurality of motion vectors calculated according to the (t+i) th video frame; and determining an object frame selection area corresponding to the (t+i) th video frame based on the median motion vector.
Illustratively, for the (t+1)-th video frame and the (t+2)-th video frame, after motion vector 2 is calculated, motion vector 1 and motion vector 2 are sorted, the median of the sorted motion vectors (i.e., the average of motion vector 1 and motion vector 2) is determined as the median motion vector of the (t+1)-th video frame and the (t+2)-th video frame, and the object frame selection area corresponding to the (t+2)-th video frame is determined based on the median motion vector; for the (t+2)-th video frame and the (t+3)-th video frame, after motion vector 3 is calculated, motion vector 1, motion vector 2 (or the median motion vector of the (t+1)-th and (t+2)-th video frames) and motion vector 3 are sorted, the median of the sorted motion vectors is determined as the median motion vector of the (t+2)-th video frame and the (t+3)-th video frame, and the object frame selection area corresponding to the (t+3)-th video frame is determined based on the median motion vector. The calculation of other median motion vectors is analogous and is not repeated here.
In the embodiment, the median value of the accumulated motion vectors is adopted to determine the object frame selection area, abnormal characteristic point motions can be eliminated, erroneous motion estimation caused by noise and shielding is effectively eliminated, and the accuracy of follow-up identification is improved.
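A minimal sketch of the median correction, assuming the per-frame motion vectors are accumulated in a list and the median is taken component-wise over all vectors computed so far (for i=1 the median is simply the initial motion vector).

    import numpy as np

    class MedianMotionFilter:
        def __init__(self):
            self.history = []            # motion vectors accumulated since the t-th frame

        def update(self, motion_vec):
            self.history.append(np.asarray(motion_vec, dtype=np.float32))
            # component-wise median over all motion vectors computed so far
            return np.median(np.stack(self.history), axis=0)

    def shift_box(box, median_vec):
        # translate the object frame selection area by the median motion vector
        x, y, w, h = box
        return (x + float(median_vec[0]), y + float(median_vec[1]), w, h)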
After the calculation is completed on the N video frames after the t video frame, a first object frame selection area corresponding to the t+N video frames can be finally obtained.
In some embodiments, a second number threshold is set, where the second number threshold is smaller than or equal to the first number threshold; when the number of feature points (such as corner points) in the object frame selection area in a video frame is smaller than the second number threshold, feature recognition is performed on the object frame selection area again and the feature points are supplemented. Illustratively, if the number of feature points in the object frame selection area corresponding to the (t+3)-th video frame is smaller than the second number threshold, the goodFeaturesToTrack method may be called to perform feature recognition on that object frame selection area and supplement the corner points therein, until the number of feature points in the object frame selection area is greater than or equal to the second number threshold.
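A sketch of this replenishment rule, reusing the extract_corners helper from the earlier corner-extraction sketch; the value of the second number threshold and the helper names are assumptions.

    import numpy as np

    def replenish_if_needed(frame_bgr, box, pts, second_threshold=30, first_threshold=50):
        # pts: currently tracked feature points in full-frame coordinates, shape (N, 2)
        if len(pts) >= second_threshold:
            return pts
        # re-run corner extraction on the object frame selection area and merge the results
        fresh = extract_corners(frame_bgr, box, max_corners=first_threshold)
        merged = np.vstack([pts.reshape(-1, 2), fresh]) if len(pts) else fresh
        return merged[:first_threshold]   # never exceed the first number threshold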
And 320, performing object recognition on the (t+N) th video frame to obtain a second object frame selection area.
The second object frame selection area is a frame selection area obtained after the target object is identified from the (t+N) th video frame. Optionally, object recognition is performed on the (t+n) th video frame through the second recognition model, so as to obtain a second object frame selection area.
And step 330, adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area under the condition that the first object frame selection area and the second object frame selection area meet the coincidence degree condition.
Optionally, when the contact ratio between the first object frame selection area and the second object frame selection area is smaller than or equal to the first contact ratio, fusing the first object frame selection area and the second object frame selection area based on a first weight corresponding to the first object frame selection area and a second weight corresponding to the second object frame selection area to obtain the target frame selection area.
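The coincidence ratio test and the weighted fusion of step 330 can be sketched as follows; the intersection-over-union measure, the threshold values and the equal weights are assumptions of this sketch, and the three branches correspond to the cases discussed for the first and second coincidence ratios.

    def iou(a, b):
        # frame selection areas as (x, y, w, h)
        ax2, ay2 = a[0] + a[2], a[1] + a[3]
        bx2, by2 = b[0] + b[2], b[1] + b[3]
        iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
        ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def merge_boxes(track_box, detect_box, first_ratio=0.5, second_ratio=0.2, w1=0.5, w2=0.5):
        ratio = iou(track_box, detect_box)
        if ratio > first_ratio:       # results agree: keep the following-recognition box
            return track_box
        if ratio <= second_ratio:     # results diverge badly: trust the object-recognition box
            return detect_box
        # otherwise fuse the two frame selection areas with the first and second weights
        return tuple(w1 * t + w2 * d for t, d in zip(track_box, detect_box))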
Step 340, based on the target framing region, continuing to follow the identification of the target object in the video frame following the t+Nth video frame.
Illustratively, the target frame selection area is used for marking the position of the target object in the (t+N)-th video frame; the (t+N)-th video frame is taken as the t-th video frame in step 311, the target frame selection area is taken as the t-th object frame selection area, and steps 311 to 340 are repeated, thereby realizing continuous following identification of the target object.
In summary, in the object recognition method provided in the embodiment of the present application, in the following recognition process, feature recognition is performed only on the object frame selection area, so that the waste of computational power resources caused by feature recognition on the complete image frame is avoided, and thus the object recognition speed in the scene with limited resources is improved, for example, the real-time performance of the target tracking task on the mobile device is improved.
In some embodiments, during the following recognition process, the object in the target video may be deformed, and the following recognition result may be corrected by using the displacement and scale transformation information of the target object in the first video frame. Illustratively, the embodiments illustrated in fig. 2 or 3 described above may also be implemented as steps 401 through 408 as illustrated in fig. 4.
Step 401, obtaining a t-th video frame in a plurality of continuous video frames corresponding to a target video.
The t-th video frame comprises a t-th object frame selection area, the t-th object frame selection area is used for marking the position of a target object in the t-th video frame, and t is a positive integer.
And step 402, performing feature recognition on the t object frame selection area in the t video frame through the first recognition model to obtain a plurality of first feature points.
Optionally, the first feature point includes at least one of an edge, a corner point, a spot, a region, and the like, which is not limited by the embodiment of the present application.
Referring to fig. 5, an overall framework diagram of the object recognition method is schematically shown. As shown in fig. 5, after a plurality of video frames are acquired, the plurality of video frames are input to a tracker 520 (i.e., the first recognition model) for following recognition.
Wherein the tracker 520 performs the following identification of the target object starting from an initial frame (e.g., frame 1 of the plurality of video frames). After the initial frame, the detector 510 is invoked every N video frames to perform object recognition on the current video frame (for example, the N+1-th frame), and when the coincidence ratio between the detection frame obtained by object recognition and the detection frame obtained by tracking recognition is low, the detection frame obtained by object recognition can be used to correct the detection frame obtained by tracking recognition, so as to correct the position of the target object in the current video frame.
Alternatively, the above-mentioned checking process may be synchronous or asynchronous, where synchronous checking refers to invoking the detector 510 every video frame, i.e. correcting the detection frame output by the tracker 520 by the detection frame output by the detector 510. Asynchronous checking refers to invoking detector 510 at intervals of several video frames. The embodiment of the application mainly takes asynchronous verification as an example for illustration.
As shown in fig. 5, the t-th video frame includes a detection frame 501, where the detection frame 501 is used to identify the position of the target object 502 in the t-th video frame, and the tracker 520 performs feature recognition on the area selected by the detection frame 501 to obtain a plurality of feature points corresponding to the target object 502.
Step 403, based on the plurality of first feature points, the first recognition model is used for following and recognizing the target object in the N video frames after the t video frame, so as to obtain the candidate object frame selection area corresponding to the t+n video frames.
Wherein N is a preset frame interval, and the (t+n) th video frame is a first video frame.
Referring to fig. 5, a t+1st video frame is taken as an example for explanation, and feature points matched with each feature point in the t video frame are determined from the t+1st video frame; and determining a detection frame 503 for identifying the position of the target object in the t+1st video frame according to the position of each feature point successfully matched in the t+1st video frame.
In this embodiment of the present application, each detection frame output by the tracker 520, for example the detection frame 503, may be corrected by a scale-displacement estimation model (i.e., the third recognition model described below). Illustratively, the (t+1)-th video frame is input into the scale-displacement estimation model 530, the coordinate point corresponding to the target object in the (t+1)-th video frame is output, the detection frame 503 is corrected by the coordinate point, and the following recognition is then performed with the corrected detection frame 503.
After N rounds of following identification, the (t+N)-th detection frame is finally obtained, where the (t+N)-th detection frame is an uncorrected detection frame, namely the candidate object frame selection area.
And 404, estimating coordinates of the (t+N) th video frame through a third recognition model to obtain a target coordinate point corresponding to the target object.
The third recognition model is a lightweight model obtained through training of a sample data set, and the sample data set comprises sample images subjected to matting processing.
Optionally, the sample dataset includes a plurality of sample images, each sample image being a matted image in which the matted-out parts are individual physical objects in the image, for example: people, vehicles, etc.; each sample image is annotated with reference coordinate points, where the reference coordinate points refer to the coordinate points of the contour corresponding to the matted-out part; the sample image is input into a sample recognition model, and predicted coordinate points corresponding to the matted-out part are predicted by the sample recognition model, where the predicted coordinate points are used to identify the position and size of the matted-out part; and the sample recognition model is trained based on the difference between the predicted coordinate points and the reference coordinate points, thereby obtaining the third recognition model.
In the training process, the difference between the predicted coordinate point and the reference coordinate point is compared, a loss function (such as a cross entropy loss function) can be calculated, parameters of the sample recognition model are updated based on the calculated loss, the process is repeated for a plurality of times until the sample recognition model reaches the preset precision, and the obtained model is the third recognition model.
A target coordinate point corresponding to the target object in the (t+N)-th video frame is estimated through the third recognition model, so as to correct the displacement and scale transformation of the target object indicated by the candidate object frame selection area. The overall computation of the third recognition model is about 2M FLOPs and the overall model size is about 50KB; that is, the third recognition model is a lightweight model whose computation amount is small and whose size is moderate, so that it can be processed in real time in a resource-limited environment.
It should be noted that, in order to improve the prediction accuracy of the third recognition model, the sample data set may cover various possible physical objects, backgrounds, illumination conditions, and the like as much as possible.
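A minimal training sketch for the third recognition model under the assumptions that the reference coordinate points are encoded as a normalized (cx, cy, w, h) vector of the matted-out object, that PyTorch is used, and that an L1 regression loss replaces the cross-entropy example mentioned above; the network layout is illustrative only and is not the lightweight architecture actually used.

    import torch
    import torch.nn as nn

    class TinyCoordNet(nn.Module):
        def __init__(self):
            super().__init__()
            # deliberately small backbone, in the spirit of a ~50KB model
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.head = nn.Linear(16, 4)    # predicted coordinate point / contour size

        def forward(self, x):
            return torch.sigmoid(self.head(self.backbone(x).flatten(1)))

    def train_step(model, optimizer, images, ref_coords):
        # the difference between the predicted and the reference coordinate points drives training
        pred = model(images)
        loss = nn.functional.l1_loss(pred, ref_coords)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()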
And step 405, adjusting the scale of the candidate object frame selection area and the first coordinate point according to the target coordinate point to obtain a first object frame selection area.
The first coordinate point is used for determining the position of the candidate object frame selection area in the (t+N) th video frame, and the first coordinate point can be implemented as a center point of the candidate object frame selection area and the like.
Illustratively, the size (such as length, width, etc.) of the candidate object frame selection area is adjusted by the contour size indicated by the target coordinate point predicted by the third recognition model, and the first coordinate point (such as a center point) corresponding to the candidate object frame selection area is adjusted by the position indicated by the target coordinate point predicted by the third recognition model, so that the finally obtained first object frame selection area can accurately cover the target object in the t+Nth video frame.
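A sketch of the correction itself, assuming the target coordinate points are decoded into a centre position and a contour size; the encoding and the optional blending factor are assumptions of this sketch.

    def correct_box(candidate_box, target_coord, alpha=1.0):
        # candidate_box = (x, y, w, h): candidate object frame selection area
        # target_coord = (cx, cy, tw, th): position and contour size indicated by the
        # target coordinate points predicted by the third recognition model (assumed encoding)
        x, y, w, h = candidate_box
        ccx, ccy = x + w / 2.0, y + h / 2.0
        cx, cy, tw, th = target_coord
        ncx = (1 - alpha) * ccx + alpha * cx    # adjust the first coordinate point (centre)
        ncy = (1 - alpha) * ccy + alpha * cy
        nw = (1 - alpha) * w + alpha * tw       # adjust the scale (length and width)
        nh = (1 - alpha) * h + alpha * th
        return (ncx - nw / 2.0, ncy - nh / 2.0, nw, nh)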
Referring schematically to fig. 6, which shows a scale change occurring in an actual application, a detection frame 601 in a video frame 610 is used to identify the position of the target object 602. If the third recognition model is not used, the position of the target object 602 in the video frame 620 is identified by a detection frame 603; if the third recognition model is used, it is identified by a detection frame 604. It is apparent that the detection frame 604 adapts to the scale and rotation transformation of the current target object 602, whereas the detection frame 603 identifies the position of the target object 602 less accurately.
And step 406, performing object recognition on the (t+N) th video frame through the second recognition model to obtain a second object frame selection area.
The second object frame selection area is a frame selection area obtained after the target object is identified from the (t+N) th video frame.
Step 407, adjusting the first object frame selection area based on the second object frame selection area to obtain the target frame selection area when the first object frame selection area and the second object frame selection area meet the overlap ratio condition.
Optionally, when the contact ratio between the first object frame selection area and the second object frame selection area is smaller than or equal to the first contact ratio, fusing the first object frame selection area and the second object frame selection area based on a first weight corresponding to the first object frame selection area and a second weight corresponding to the second object frame selection area to obtain the target frame selection area.
Step 408, based on the target frame selection area, continuing to follow the target object in the video frame after the t+Nth video frame through the first recognition model.
Illustratively, the target frame selection area is used for marking the position of the target object in the t+n video frames, the t+n video frames are taken as the t video frames in the step 401, the target frame selection area is taken as the t object frame selection area, and the steps 401 to 408 are repeated, so that continuous following identification of the target object is realized.
In summary, the object recognition method provided by the embodiment of the application extracts video frames at frame intervals during the following recognition process to perform object recognition, and corrects the following recognition result by the object recognition result when the object recognition result and the following recognition result meet the coincidence ratio condition, so that losing track of the target is avoided and the accuracy of following recognition of the target object is improved. In addition, since the object in the target video may deform during the following recognition process, the following recognition result can be corrected by the displacement and scale transformation information of the target object in the first video frame, thereby eliminating the interference caused by deformation and improving the accuracy of the obtained first object frame selection area.
Referring to fig. 7 and fig. 8, when the object recognition method provided by the embodiment of the application is applied to a mobile terminal, accurate and fast following recognition can be performed on a video captured by the mobile terminal (for example, a video captured in real time through a scan function) in a fast-moving scene 700 and an occluded scene 800, so that the phenomenon of losing track of the target object is avoided.
Referring to fig. 9, a block diagram of an object recognition apparatus according to an exemplary embodiment of the present application is shown, and the apparatus includes the following modules.
A data acquisition module 910, configured to acquire, when following a target object in a target video, a first video frame based on a first frame interval, where the first video frame includes a first object frame selection area, and the first object frame selection area is a frame selection area obtained based on track prediction information of the target object in a process of following the target object;
The first identifying module 920 is configured to identify an object of the first video frame, so as to obtain a second object frame selection area, where the second object frame selection area is a frame selection area obtained after the target object is identified from the first video frame;
The data adjustment module 930 is configured to adjust, based on the second object selection area, the first object selection area to obtain a target selection area when the first object selection area and the second object selection area meet a contact ratio condition;
a second identifying module 940 is configured to continue to follow and identify the target object in the target video based on the target frame selection area.
Referring to fig. 10, in some embodiments, the data adjustment module 930 is configured to fuse the first object frame selection area and the second object frame selection area based on a first weight corresponding to the first object frame selection area and a second weight corresponding to the second object frame selection area when the contact ratio between the first object frame selection area and the second object frame selection area is less than or equal to the first contact ratio, so as to obtain the target frame selection area.
In some embodiments, the data adjustment module 930 is configured to use the second object selection area as the target selection area when the overlap ratio between the first object selection area and the second object selection area is less than or equal to a second overlap ratio; wherein the first overlap ratio is greater than the second overlap ratio.
In some embodiments, the data adjustment module 930 is configured to use the first object selection area as the target selection area when the overlap ratio between the first object selection area and the second object selection area is greater than the first overlap ratio.
In some embodiments, the data adjustment module 930 includes:
A first obtaining unit 931, configured to obtain picture rotation information corresponding to the first video frame, where the picture rotation information is used to indicate picture rotation transformation that occurs in the first video frame relative to a previous video frame;
A first adjustment unit 932, configured to adjust a preset overlap ratio based on the picture rotation information when the picture rotation information meets a preset transformation condition, so as to obtain the first overlap ratio.
In some embodiments, the data acquisition module 910 includes:
A second obtaining unit 911, configured to obtain a t-th video frame in a plurality of continuous video frames corresponding to the target video; the t-th video frame comprises a t-th object frame selection area, wherein the t-th object frame selection area is used for marking the position of the target object in the t-th video frame, and t is a positive integer; the data acquisition module 910 includes:
The data identifying unit 912 is configured to perform feature identification on the t-th object frame selection area in the t-th video frame, so as to obtain a plurality of first feature points corresponding to the target object in the t-th object frame selection area;
The data identifying unit 912 is further configured to, based on the plurality of first feature points, identify a target object in N video frames after the t-th video frame, and obtain the first object frame selection area corresponding to the t+n-th video frame, where N is the first frame interval, and the t+n-th video frame is the first video frame.
In some embodiments, the data identifying unit 912 is configured to divide the t-th object frame selection area in the t-th video frame into a plurality of sub-areas; screening the plurality of subareas according to the spatial relationship among the plurality of subareas to obtain a target subarea; and carrying out feature recognition on the target subarea to obtain a plurality of first feature points corresponding to the target object.
In some embodiments, the data identifying unit 912 is configured to perform feature identification on the (t+i) th video frame to obtain a plurality of second feature points corresponding to the target object, where i is equal to or less than N and i is a positive integer; matching the plurality of second feature points with a plurality of feature points corresponding to the t+i-1 video frames, and determining target feature points from the plurality of second feature points; determining a motion vector corresponding to the (t+i) -th video frame based on the position of the target feature point in the (t+i) -th video frame and the position of the feature point matched with the target feature point in the (t+i) -1-th video frame, wherein the motion vector is used for indicating the displacement change of the target object between the (t+i) -1-th video frame and the (t+i) -th video frame; and determining an object frame selection area corresponding to the (t+i) th video frame based on the motion vector.
In some embodiments, a data identification unit 912 is configured to determine a median motion vector among a plurality of motion vectors calculated from t+i video frames, where 1 < i.ltoreq.N; and determining an object frame selection area corresponding to the (t+i) th video frame based on the median motion vector.
In some embodiments, the data identifying unit 912 is configured to identify, based on the plurality of first feature points, a target object in N video frames after the t-th video frame, to obtain a candidate object frame selection area corresponding to the t+n-th video frame; carrying out coordinate estimation on the (t+N) th video frame through a third identification model to obtain a target coordinate point corresponding to the target object, wherein the third identification model is a lightweight model obtained through sample data set training, and the sample data set comprises sample images subjected to matting processing; and adjusting the scale of the candidate object frame selection area and a first coordinate point according to the target coordinate point to obtain the first object frame selection area, wherein the first coordinate point is used for determining the position of the candidate object frame selection area in the t+N video frames.
In some embodiments, the data obtaining module 910 is further configured to obtain, when following the identification of the target object in the target video, a picture scaling corresponding to a second video frame, where the second video frame is a video frame extracted from the target video; and determining the first frame interval according to the picture scaling.
In summary, when the object recognition device provided by the embodiment of the present application follows and recognizes the target object, a first video frame is extracted from the video frames corresponding to the target video according to a first frame interval, where the first video frame includes a first object frame selection area determined based on track prediction information in the process of following and recognizing the target object; object recognition is performed separately on the first video frame, and the position of the target object in the first video frame is determined again to obtain a second object frame selection area; and when the first object frame selection area and the second object frame selection area meet the coincidence ratio condition, the first object frame selection area is adjusted based on the second object frame selection area to obtain a target frame selection area. On one hand, video frames are extracted at frame intervals during the following recognition process for object recognition, and when the object recognition result and the following recognition result meet the coincidence ratio condition, the following recognition result is corrected by the object recognition result, so that losing track of the target is avoided and the accuracy of following recognition of the target object is improved; on the other hand, because object recognition on video frames consumes considerable computing resources, performing object recognition on video frames selected at intervals, rather than on every video frame, reduces the consumption of computing resources while ensuring the accuracy of following recognition, which facilitates porting the object recognition method to a mobile terminal.
It should be noted that: the object recognition device provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the embodiments of the object recognition device and the object recognition method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the embodiments of the method are detailed in the method embodiments, which are not repeated herein.
Fig. 11 shows a block diagram of a computer device 1100 provided by an exemplary embodiment of the application. The computer device 1100 may be: a smart phone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a notebook computer, or a desktop computer. The computer device 1100 may also be referred to by other names such as user device, portable computer device, laptop computer device, desktop computer device, etc.
In general, the computer device 1100 includes: a processor 1101 and a memory 1102.
The processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1101 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a central processing unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1101 may also include an artificial intelligence (AI) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement the object recognition method provided by the method embodiments of the present application.
Illustratively, the computer device 1100 also includes other components, and those skilled in the art will appreciate that the structure illustrated in FIG. 11 is not limiting of the computer device 1100, and may include more or fewer components than illustrated, or may combine certain components, or employ a different arrangement of components.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing related hardware, and the program may be stored in a computer readable storage medium, which may be a computer readable storage medium included in the memory of the above embodiments; or may be a computer-readable storage medium, alone, that is not assembled into a computer device. The computer readable storage medium has stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which are loaded and executed by the processor to implement the object recognition method according to any of the above embodiments.
Alternatively, the computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid state drive (SSD), optical disk, etc. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM). The foregoing embodiment numbers of the present application are merely for the purpose of description and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the application; the protection scope of the application is defined by the appended claims.

Claims (15)

1. An object recognition method, the method comprising:
Extracting a first video frame from a target video based on a first frame interval when a target object in the target video is recognized, wherein the first video frame comprises a first object frame selection area, and the first object frame selection area is a frame selection area obtained based on track prediction information of the target object in the process of recognizing the target object in a following manner;
performing object recognition on the first video frame to obtain a second object frame selection area, wherein the second object frame selection area is a frame selection area obtained after the target object is recognized from the first video frame;
Under the condition that the first object frame selection area and the second object frame selection area meet the coincidence ratio condition, adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area;
And continuing to follow and identify the target object in the target video based on the target frame selection area.
2. The method according to claim 1, wherein, in the case that the first object selection area and the second object selection area meet a coincidence condition, adjusting the first object selection area based on the second object selection area to obtain a target selection area includes:
And when the contact ratio between the first object frame selection area and the second object frame selection area is smaller than or equal to the first contact ratio, fusing the first object frame selection area and the second object frame selection area based on a first weight corresponding to the first object frame selection area and a second weight corresponding to the second object frame selection area to obtain the target frame selection area.
3. The method according to claim 2, wherein the method further comprises:
When the coincidence degree between the first object frame selection area and the second object frame selection area is smaller than or equal to a second coincidence degree, the second object frame selection area is used as the target frame selection area; wherein the first overlap ratio is greater than the second overlap ratio.
4. The method according to claim 2, wherein the method further comprises:
And when the contact ratio between the first object frame selection area and the second object frame selection area is larger than the first contact ratio, taking the first object frame selection area as the target frame selection area.
5. The method according to claim 2, wherein the method further comprises:
acquiring picture rotation information corresponding to the first video frame, wherein the picture rotation information is used for indicating picture rotation conversion of the first video frame relative to the previous video frame;
and when the picture rotation information accords with a preset conversion condition, adjusting the preset overlap ratio based on the picture rotation information to obtain the first overlap ratio.
6. The method according to any one of claims 1 to 5, wherein extracting a first video frame from a target video based on a first frame interval while following a target object in the target video comprises:
acquiring a t-th video frame in a plurality of continuous video frames corresponding to the target video; the t-th video frame comprises a t-th object frame selection area, wherein the t-th object frame selection area is used for marking the position of the target object in the t-th video frame, and t is a positive integer;
performing feature recognition on the t-th object frame selection area in the t-th video frame to obtain a plurality of first feature points corresponding to the target object in the t-th object frame selection area;
and based on the plurality of first feature points, identifying target objects in N video frames after the t video frame, and obtaining the first object frame selection area corresponding to the t+N video frames, wherein N is the first frame interval, and the t+N video frames are the first video frames.
7. The method of claim 6, wherein the performing feature recognition on the t-th object frame selection area in the t-th video frame to obtain a plurality of first feature points corresponding to the target object in the t-th object frame selection area includes:
dividing the t object frame selection area in the t video frame into a plurality of subareas;
Screening the plurality of subareas according to the spatial relationship among the plurality of subareas to obtain a target subarea;
And carrying out feature recognition on the target subarea to obtain a plurality of first feature points corresponding to the target object.
8. The method according to claim 6, wherein the following identifying the target object in the N video frames after the t video frame based on the plurality of first feature points, to obtain the first object frame selection area corresponding to the t+n video frames includes:
performing feature recognition on the (t+i) th video frame to obtain a plurality of second feature points corresponding to the target object, wherein i is less than or equal to N and is a positive integer;
Matching the plurality of second feature points with a plurality of feature points corresponding to the t+i-1 video frames, and determining target feature points from the plurality of second feature points;
Determining a motion vector corresponding to the (t+i) -th video frame based on the position of the target feature point in the (t+i) -th video frame and the position of the feature point matched with the target feature point in the (t+i) -1-th video frame, wherein the motion vector is used for indicating the displacement change of the target object between the (t+i) -1-th video frame and the (t+i) -th video frame;
and determining an object frame selection area corresponding to the (t+i) th video frame based on the motion vector.
9. The method of claim 8, wherein determining an object box selection region corresponding to a t+i video frame based on the motion vector comprises:
Determining a median motion vector among a plurality of motion vectors calculated according to t+i video frames, wherein i is more than 1 and less than or equal to N;
And determining an object frame selection area corresponding to the (t+i) th video frame based on the median motion vector.
10. The method according to claim 6, wherein the following identifying the target object in the N video frames after the t video frame based on the plurality of first feature points, to obtain the first object frame selection area corresponding to the t+n video frames includes:
Based on the plurality of first feature points, target objects in N video frames after the t video frame are identified in a following mode, and candidate object frame selection areas corresponding to the t+N video frames are obtained;
Carrying out coordinate estimation on the (t+N) th video frame through a third identification model to obtain a target coordinate point corresponding to the target object, wherein the third identification model is a lightweight model obtained through sample data set training, and the sample data set comprises sample images subjected to matting processing;
And adjusting the scale of the candidate object frame selection area and a first coordinate point according to the target coordinate point to obtain the first object frame selection area, wherein the first coordinate point is used for determining the position of the candidate object frame selection area in the t+N video frames.
11. The method according to any one of claims 1 to 5, further comprising:
When the target object in the target video is identified in a following mode, acquiring a picture scaling corresponding to a second video frame, wherein the second video frame is a video frame extracted from the target video;
And determining the first frame interval according to the picture scaling.
12. An object recognition apparatus, the apparatus comprising:
The data acquisition module is used for extracting a first video frame from the target video based on a first frame interval when a target object in the target video is identified, wherein the first video frame comprises a first object frame selection area, and the first object frame selection area is a frame selection area obtained based on track prediction information of the target object in the process of identifying the target object in a following manner;
The first identification module is used for carrying out object identification on the first video frame to obtain a second object frame selection area, wherein the second object frame selection area is a frame selection area obtained after the target object is identified from the first video frame;
The data adjustment module is used for adjusting the first object frame selection area based on the second object frame selection area to obtain a target frame selection area under the condition that the first object frame selection area and the second object frame selection area meet the coincidence ratio condition;
And the second identification module is used for continuing to follow and identify the target object in the target video based on the target frame selection area.
13. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the object recognition method of any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that at least one program is stored in the storage medium, the at least one program being loaded and executed by a processor to implement the object recognition method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the object recognition method according to any one of claims 1 to 11.
CN202410518192.9A 2024-04-28 2024-04-28 Object recognition method, device, apparatus, medium and program product Active CN118097521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410518192.9A CN118097521B (en) 2024-04-28 2024-04-28 Object recognition method, device, apparatus, medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410518192.9A CN118097521B (en) 2024-04-28 2024-04-28 Object recognition method, device, apparatus, medium and program product

Publications (2)

Publication Number Publication Date
CN118097521A CN118097521A (en) 2024-05-28
CN118097521B true CN118097521B (en) 2024-07-23

Family

ID=91164064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410518192.9A Active CN118097521B (en) 2024-04-28 2024-04-28 Object recognition method, device, apparatus, medium and program product

Country Status (1)

Country Link
CN (1) CN118097521B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188525A (en) * 2022-12-13 2023-05-30 复旦大学 Target following method and system
CN116453017A (en) * 2023-04-10 2023-07-18 中国科学院半导体研究所 Target tracking method, device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8488839B2 (en) * 2006-11-20 2013-07-16 Videosurf, Inc. Computer program and apparatus for motion-based object extraction and tracking in video
CN112489090B (en) * 2020-12-16 2024-06-04 影石创新科技股份有限公司 Method for tracking target, computer readable storage medium and computer device

Also Published As

Publication number Publication date
CN118097521A (en) 2024-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant