CN112488071A

CN112488071A - Method, device, electronic equipment and storage medium for extracting pedestrian features

Info

Publication number: CN112488071A
Application number: CN202011517231.1A
Authority: CN
Inventors: 郑新想; 夏凤君; 周斌
Original assignee: Chongqing Unisinsight Technology Co Ltd
Current assignee: Chongqing Unisinsight Technology Co Ltd
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2021-03-12
Anticipated expiration: 2040-12-21
Also published as: CN112488071B

Abstract

The embodiment of the application relates to the technical field of artificial intelligence, in particular to a method and a device for extracting pedestrian features, electronic equipment and a storage medium. In the method, the image sequence of the target object to be monitored obtains the body posture category of the target object in each frame of image, such as the front, the back or the side. Then, the images of different body posture categories are subjected to feature fusion independently. For example, the front image of the target object is subjected to feature fusion to obtain the front feature of the target object as a kind of pedestrian feature, and the side image of the target object is subjected to feature fusion to obtain the side feature of the target object as a kind of pedestrian feature. Therefore, pedestrian features of different human body posture categories can be accurately aligned, and the features of the target object under different postures can be more accurately described by carrying out weighted summation relative to the front features and the side features, so that the quality of the extracted pedestrian features is more reliable.

Description

Method, device, electronic equipment and storage medium for extracting pedestrian features

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for extracting pedestrian features, an electronic device, and a storage medium.

Background

Video can record life, record the world. Video covers rich information. In the field of pedestrian re-recognition, the features of pedestrians are often analyzed from videos and subjected to subsequent processing.

In the related art, a scheme for extracting pedestrian features from a video includes: firstly, intercepting a picture of a monitoring target from each frame of image, extracting the characteristics of the monitoring target from the picture, and then carrying out weighting and averaging on the characteristics of different frames of the inter-frame picture to obtain the final characteristics of the monitoring target.

However, the inventor studies to find that the pedestrian feature quality extracted from the video in the related art is not good. Therefore, if the quality of the pedestrian feature is improved, the problem is solved.

Disclosure of Invention

The application aims to provide a method, a device, an electronic device and a storage medium for extracting pedestrian features, which are used for solving the problem that the quality of pedestrian features extracted from videos in the related art is poor.

In a first aspect, an embodiment of the present application provides a method for extracting pedestrian features, including:

acquiring an image sequence of a target object;

classifying each frame of image in the image sequence according to the human body posture category of the target object to obtain an image group corresponding to each human body posture category, and extracting the human body characteristics of the target object in each frame of image;

performing, respectively for each of the body pose categories: and grouping the images corresponding to the human body posture categories, and performing feature fusion processing on the human body features of each frame of image in the image groups to obtain first pedestrian features corresponding to the human body posture categories.

In some embodiments, after the acquiring the sequence of images of the target object, the method further comprises:

respectively evaluating the image quality of each frame of image in the image sequence to obtain image quality evaluation parameters of each frame of image in the image sequence;

the performing feature fusion processing on the human body features of each frame of image in the image group to obtain a first pedestrian feature corresponding to the human body posture category includes:

and performing feature fusion processing on the human body features of the frame images in the image grouping by adopting the image quality evaluation parameters of the frame images in the image grouping to obtain first pedestrian features corresponding to the human body posture categories.

In some embodiments, the performing feature fusion processing on the human body feature of each frame image in the image packet by using the image quality evaluation parameter of each frame image in the image packet includes:

and performing weighted operation processing on the human body features of each frame of image in the image grouping by taking the image quality evaluation parameters as weight factors.

In some embodiments, the human body features include human body region features of each of a plurality of designated human body regions;

the weighting operation processing of the human body features of each frame image in the image grouping by taking the image quality evaluation parameter as a weighting factor includes:

and performing weighted operation processing on the same specified human body regions in the image groups by taking the image quality evaluation parameters as weight factors to obtain first pedestrian characteristics corresponding to each preset human body region.

and sequentially inputting the human body characteristics of each frame of image in the image group and the image quality evaluation parameters of each frame of image in the image group into a pre-trained characteristic extraction model for characteristic fusion processing according to the sequence of each frame of image in the image group in the image sequence.

In some embodiments, before the grouping the images corresponding to the human body posture category and performing feature fusion processing on the human body features of each frame of image in the image grouping, the method further includes:

and filtering out the images which do not meet the specified image quality requirement in the image groups.

each frame of image of the image sequence is segmented to obtain a segmentation mask of the target object;

aiming at each frame of image in the image sequence, extracting the motion characteristics of the target object by adopting the segmentation mask of the target object;

and performing fusion processing on the extracted motion characteristics of each frame of image to obtain a second pedestrian characteristic of the target object.

respectively performing feature extraction and quality evaluation processing on each frame of image in the image sequence to obtain feature information of the target object and quality evaluation parameters of each frame of image;

and sequentially inputting the characteristic information and the quality evaluation parameters of each frame of image in the image sequence into a pre-trained incidence relation extraction model according to the sequence of each frame of image in the image sequence to obtain the third person characteristic of the target object output by the incidence relation extraction model.

In a second aspect, an embodiment of the present application further provides an apparatus for extracting features of a pedestrian, the apparatus including:

the sequence acquisition module is used for acquiring an image sequence of the target object;

the posture processing module is used for classifying each frame of image in the image sequence according to the human posture category of the target object to obtain an image group corresponding to each human posture category and extracting the human body characteristics of the target object in each frame of image;

a pedestrian feature determination module for performing, for each of the human posture categories, respectively: and grouping the images corresponding to the human body posture categories, and performing feature fusion processing on the human body features of each frame of image in the image groups to obtain first pedestrian features corresponding to the human body posture categories.

In some embodiments, after the sequence acquiring module performs acquiring the sequence of images of the target object, the apparatus further includes:

the evaluation module is used for respectively evaluating the image quality of each frame of image in the image sequence to obtain the image quality evaluation parameter of each frame of image in the image sequence;

and the pedestrian characteristic determining module is used for executing characteristic fusion processing on the human body characteristics of each frame of image in the image group to obtain a first pedestrian characteristic corresponding to the human body posture category:

In some embodiments, the pedestrian feature determination module, when performing feature fusion processing on the human body feature of each frame image in the image grouping by using the image quality evaluation parameter of each frame image in the image grouping, is configured to:

the pedestrian feature determination module is configured to, when performing the weighted operation processing on the human body features of each frame of image in the image group by using the image quality evaluation parameter as a weighting factor,:

In some embodiments, the pedestrian feature determination module, when performing the feature fusion processing on the human body feature of each frame image in the image grouping by using the image quality evaluation parameter of each frame image in the image grouping, is configured to:

In some embodiments, before the pedestrian feature determination module performs the image grouping corresponding to the human body posture category and performs the feature fusion processing on the human body features of each frame of image in the image grouping, the apparatus further includes:

and the filtering module is used for filtering out the images which do not meet the specified image quality requirement in the image groups.

In some embodiments, after the sequence acquiring module executes the acquiring of the sequence of images of the target object, the apparatus further includes:

the segmentation module is used for carrying out segmentation processing on each frame of image of the image sequence to obtain a segmentation mask of the target object;

the motion characteristic extraction module is used for extracting the motion characteristic of the target object by adopting the segmentation mask of the target object aiming at each frame of image in the image sequence;

and the motion characteristic fusion module is used for carrying out fusion processing on the motion characteristics of the extracted frames of images to obtain a second pedestrian characteristic of the target object.

the information acquisition module is used for respectively carrying out feature extraction and quality evaluation processing on each frame of image in the image sequence to obtain feature information of the target object and quality evaluation parameters of each frame of image;

and the additional characteristic determining module is used for sequentially inputting the characteristic information and the quality evaluation parameters of each frame of image in the image sequence into a pre-trained incidence relation extraction model according to the sequence of each frame of image in the image sequence to obtain the third person characteristic of the target object output by the incidence relation extraction model.

In a third aspect, another embodiment of the present application further provides an electronic device, including at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute any method for extracting pedestrian features provided by the embodiment of the application.

In a fourth aspect, another embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program is used to make a computer execute any method for extracting pedestrian features in the embodiments of the present application.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic illustration of an application environment according to one embodiment of the present application;

FIG. 2 is a schematic flow chart of pedestrian feature extraction according to one embodiment of the present application;

FIG. 3 is a schematic diagram illustrating classification of human body poses by a pose estimation network according to one embodiment of the present application;

FIG. 4 is a schematic illustration of labeling a side of a human body according to one embodiment of the present application;

FIG. 5 is a schematic illustration of a human body divided into a plurality of parts according to one embodiment of the present application;

FIG. 6 is a schematic illustration of a human body part in the absence of a human body part according to one embodiment of the present application;

FIG. 7 is a schematic flow chart of a method of extracting pedestrian features according to one embodiment of the present application;

FIG. 8 is a schematic diagram illustrating a feature fusion process according to one embodiment of the present application;

FIG. 9 is another flow chart diagram of a method of extracting pedestrian features according to one embodiment of the present application;

FIG. 10 is a schematic diagram of pedestrian feature extraction for a single frame image according to one embodiment of the present application;

FIG. 11 is a diagram illustrating an embodiment of the present disclosure for extracting multi-dimensional pedestrian features in an image sequence;

FIG. 12 is a schematic diagram of an apparatus for extracting pedestrian features according to one embodiment of the present application;

FIG. 13 is a schematic view of an electronic device according to one embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

It is noted that the terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

In order to facilitate understanding of technical solutions provided in the embodiments of the present application, some key terms related to the embodiments of the present application are described below.

And (4) re-identifying the pedestrian, namely identifying whether the pedestrian target is a known monitoring target from the video or the single-frame picture. The video pedestrian re-identification task is the same as that of a single-frame picture, namely whether pedestrian targets across different cameras or at different moments of the same camera are the same person is judged; compared with pedestrian re-identification of a single-frame picture, the video often contains richer appearance information and motion information of pedestrians, and is beneficial to learning more robust and high-quality video pedestrian features.

The image quality evaluation is performed, and the quality of the image is often required to be evaluated before the pedestrian re-identification, for example, a blurred image is not suitable for the subsequent pedestrian feature extraction and re-identification, so that the quality evaluation of the image can be given firstly through the image quality evaluation technology, so that the subsequent operation can be performed on the image with qualified image quality. In implementation, an image quality evaluation model can be adopted to perform quality scoring on the image, and the higher the scoring is, the better the image quality is, the better the accuracy rate of re-recognition is facilitated.

The human body posture categories in the embodiment of the application can mainly comprise a front posture, a side posture and a back posture. For example, a front face image of a pedestrian target is represented in the image and classified into a front face posture, a back face image of the pedestrian target is represented in the image and classified into a back face posture, and a side face image is represented and classified into a side face posture. In implementation, the neural network can be adopted to classify and identify the human posture category of the pedestrian target.

The embodiment of the application mainly relates to comprehensive features of pedestrians extracted from multi-frame images of videos. In the embodiment of the present application, the comprehensive characteristic of the pedestrian can be described from multiple dimensions, for example, the comprehensive characteristic can include a motion characteristic and a pedestrian appearance characteristic, wherein the pedestrian appearance characteristic can include a corresponding pedestrian characteristic obtained by classifying different body postures, and the comprehensive appearance characteristic of the pedestrian characteristic obtained from the multi-frame image can also be obtained by performing fusion processing on the pedestrian characteristics which are not classified.

In a real scene, the pedestrian video segment often has the problems of shielding, the self posture change of the pedestrian and the like. When the pedestrian features are extracted in the existing scheme, the picture of the monitoring target is intercepted from each frame of image, the features of the monitoring target are extracted from the picture, and then the features of different frames need to be weighted and averaged among frames.

However, the inventor researches and discovers that the quality of the finally extracted pedestrian feature is affected because the alignment weighting and averaging of the pedestrian features of a plurality of images cannot be effectively performed during pedestrian feature extraction due to the fact that the number of pictures in a pedestrian sequence of each video is different and the information difference of each frame of image in the image sequence is large.

In view of the above, the present application provides a method, an apparatus, an electronic device and a storage medium for extracting pedestrian features, so as to solve the above problems. The invention of the application is to extract an image sequence of a target object to be monitored, and obtain the body posture category of the target object in each frame of image, such as the front, the back or the side. Then, the images of different body posture categories are subjected to feature fusion independently. For example, the front image of the target object is subjected to feature fusion to obtain the front feature of the target object as a kind of pedestrian feature, and the side image of the target object is subjected to feature fusion to obtain the side feature of the target object as a kind of pedestrian feature. Therefore, pedestrian features of different human body posture categories can be accurately fused through posture alignment, and features of the target object under different postures can be more accurately described through weighting and summing relative to the front features and the side features, so that the extracted pedestrian feature quality is more reliable.

Furthermore, in order to comprehensively describe the characteristics of the target object from more dimensions, the embodiment of the application not only extracts pedestrian features of different human posture categories, but also can extract the overall feature of the target object from an image sequence of the target object by combining image quality as another pedestrian feature, so that the feature description of the target object is more comprehensive.

In addition, in the embodiment of the application, not only can the pedestrian feature extraction be carried out independently for each human posture category, but also the motion feature of the target object can be extracted as a pedestrian feature, the motion feature can express the motion feature of the target object, the motion features of different target objects have certain difference, and the pedestrian re-identification can be also assisted. Therefore, in the embodiment of the application, in order to extract the motion features more accurately, the motion features may be extracted in a manner of avoiding the influence of the appearance features on the motion features.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

FIG. 1 is a schematic diagram of an application environment according to one embodiment of the present application.

As shown in fig. 1, the application environment may include, for example, a network 10, a server 20, at least one monitoring device 30 (e.g., monitoring device 30_1, monitoring device 30_2 … …, monitoring device 30_ N), a terminal device 40, and a database 50. Wherein:

the monitoring device 30 is configured to collect images within a monitoring range, send the collected images to the server 20 through the network 10, and store the images in the database 50 by the server 20 to obtain a monitoring video.

The terminal device 40 may send a surveillance video acquisition request to the server 20, and the server 20 responds to the surveillance video acquisition request to acquire a corresponding surveillance video from the database 50 and return the corresponding surveillance video to the terminal device 40 for display.

In order to perform pedestrian re-identification, the pedestrian characteristics of the target object can be extracted from the monitoring video. In practice, the pedestrian feature of the target object may be extracted from the video by the server 20. Of course, the pedestrian feature may also be extracted from the video by the terminal device 40 when the processing power of the terminal is sufficient. The extracted pedestrian features may include features of multiple dimensions, such as pedestrian features, motion features, and pedestrian global features for each human pose category.

After the features of each pedestrian are extracted, the pedestrian re-identification can be carried out based on the features, and whether the target object extracted from the video is a target needing to be searched or not is judged.

The description in this application will be detailed in terms of only a single server or terminal device, but it will be understood by those skilled in the art that the monitoring device 30, the terminal device 40, the server 20 and the database 50 shown are intended to represent the operation of the monitoring device, the terminal device, the server and the storage system to which the solution of the present application relates. The discussion of a single server and storage system is at least for convenience of description and is not meant to imply limitations on the number, type, or location of end devices and servers. It should be noted that the underlying concepts of the example embodiments of the present application may not be altered if additional modules are added or removed from the illustrated environments. In addition, although fig. 1 shows a bidirectional arrow from the database 50 to the server 20 for convenience of explanation, those skilled in the art will understand that the above-mentioned data transmission and reception also need to be implemented through the network 10.

It should be noted that the storage system in the embodiment of the present application may be, for example, a cache system, or may also be a hard disk storage, a memory storage, and the like.

It should be further noted that the embodiment of the present application is not only applicable to monitoring scenes, but also applicable to extracting pedestrian features from videos collected by any image collecting device by the method provided by the embodiment of the present application.

The method for extracting the pedestrian features is not only suitable for the monitoring system shown in fig. 1, but also suitable for any scene capable of carrying out image acquisition and carrying out pedestrian re-identification.

As shown in fig. 2, a schematic flowchart of a method for extracting pedestrian features provided in an embodiment of the present application includes the following steps:

after the video is acquired, the video can be preprocessed to obtain an image sequence of the target object. The pre-processing may include image enhancement and/or pedestrian screening. For example, when a plurality of pedestrians are included in a video, an image sequence of a target object to be processed can be extracted. Only one complete pedestrian is included in the image sequence as much as possible.

In step 201, an image sequence of a target object is obtained, then in step 202, each frame image in the image sequence is classified according to a human body posture category of the target object, an image group corresponding to each human body posture category is obtained, and a human body feature of the target object in each frame image is extracted.

For example, the body pose categories may include a front side, a side, and a back side. Then, each frame of image in the image sequence can be classified through the training neural network model, and the human body posture category of the target object in each frame of image is obtained. Therefore, the images belonging to the same human body posture category are divided into the same group, the images belonging to different human body posture categories are divided into different groups, and image groups corresponding to different human body posture categories are obtained. Thus, each human pose category can be evaluated separately for pedestrian features.

In practice, as shown in FIG. 3, the classification recognition of the image may be implemented based on a pose estimation network. The posture evaluation network can be obtained by training based on the following method:

for example, in one embodiment, the sample image may be labeled, the classes of the objects to be classified in the sample image are respectively labeled, and then the pose evaluation network is trained based on the sample image and the labeled labels.

During training, the training of the posture evaluation network can be supervised by using cross entropy loss, and the cross entropy loss is shown as a formula (1).

In the formula (1), y_pAnd

respectively representing the true values of the front, back and side surfaces of the human body and the corresponding network estimated values, i represents the posture classification of the human body, namely the front, back and side surfaces, L_poseRepresenting the calculated loss value.

After the images are classified according to different body posture categories, in step 203, the following steps may be performed for each of the body posture categories: and grouping the images corresponding to the human body posture categories, and performing feature fusion processing on the human body features of each frame of image in the image groups to obtain first pedestrian features corresponding to the human body posture categories.

For example, a front image group consisting of front images of the target object can obtain a front feature of the target object as one first pedestrian feature, a back image group consisting of back images of the target object can obtain a back feature of the target object as one pedestrian feature, and so on, a side feature of the target object can also be obtained as another first pedestrian feature.

In another embodiment, the final result is affected during the feature fusion process due to the difference in resolution, blur, etc. of the target objects in the grouping of each human pose category. Therefore, in the embodiment of the application, in order to better perform the feature fusion processing and obtain the high-quality pedestrian features, quality evaluation can be performed on each frame of image of the target object to obtain quality evaluation parameters of each frame of image. The quality evaluation parameters can be used for screening images, for example, images with poor quality can be screened out according to the quality evaluation parameters for each image group, and then images with better quality are adopted for feature fusion processing. On the other hand, when the feature fusion processing is performed, the feature fusion processing may be performed in combination with the quality evaluation parameter of each frame of image. In practice, the quality evaluation network can be used for quality evaluation.

And during quality evaluation network training, the adopted label is used for marking image quality scores. The label may be manually marked. Of course, in order to reduce the time consumption and labor cost for labeling, an automatic labeling mode can be adopted for labeling.

In the present embodiment, the image quality evaluation is also evaluated according to the body posture type. And the labels of the training samples in the training quality evaluation network are labeled according to the classification.

Taking a sample with a human body posture category as a side surface as an example, as shown in fig. 4, a pre-collected training sample includes a plurality of images of the side surface of a human body, one side image can be screened out from the plurality of images as a standard image, the quality score of the standard image is the highest value, and the quality scores of the other images are determined according to the similarity between each side image and the standard image thereof. The higher the similarity with the standard image, the higher the quality score of the annotation, and the lower the similarity with the standard image, the lower the quality score of the annotation. In practice, the value interval of the quality score can be set between 0 and 1. Of course, other value intervals can be set, and the method and the device are all applicable to the embodiment of the application.

The automatic labeling of the front and back sides is similar to the side labeling shown in fig. 4 and will not be described herein.

For each type of image group of human body posture category, when the image quality evaluation parameter is adopted to perform the feature fusion processing on the human body features of each frame of image in the image group, the feature fusion processing can be performed in a weighted averaging or network learning mode.

The weighted averaging can be implemented as: and taking the image quality evaluation parameter as a weight factor, and carrying out weighted operation processing on the human body characteristics of each frame of image in the image grouping. For example, the human features of the image group may be subjected to a weighted operation process according to equation (2):

in formula (2), M is obtained for image groupingWeighting operation result, n is total number of images of image group, i is 1 to n, omega_iEvaluation parameter for image quality of ith image, p_iIs the human body characteristic of the ith image. When the value of the image quality evaluation parameter is more than 1, the human body characteristics can be determined by adopting a weighting and averaging mode.

Furthermore, in another embodiment, the result of the feature fusion process may also be affected due to occlusion of the target object, background, etc. In the embodiment of the present application, the whole human body may be further divided into a plurality of designated human body regions, for example, as shown in fig. 5, the human body may be divided into three designated human body regions. Of course, more or less areas may be divided according to actual requirements when implemented. For example, two designated body regions may be divided, one of which is the upper body and the other of which is the lower body. For another example, 4 designated body regions may be divided. The human body area can be divided equally or not, but divided according to the human body parts covered by the human body area, and the method is suitable for the embodiment of the application.

In practice, in order to identify each designated body region, the body region type included in the target object may be predicted using a body part detection model. The prediction result may include position information and confidence of each designated body region. The confidence level is used to express the likelihood of the presence of the predicted specified body region. A higher confidence level indicates a higher likelihood of the presence of the corresponding designated body region.

In one embodiment, a designated body region in the present application may include a global body and at least one body member (e.g., upper body, lower body, head-shoulder position member, etc.). In implementation, Mask R-CNN can be used as a human body component detection model to position the global position and the local position of a human body. The position detection is carried out on each appointed human body area, then the alignment is carried out based on different appointed human body areas, and the interference of shielding and background can be eliminated to a certain extent.

For labeling of local positions, in order to reduce the labeling workload, different designated human body regions can be obtained by equally dividing human bodies. If the complete human body mask can be adopted, the human body is evenly divided into four parts from top to bottom, and corresponding local part categories and corresponding mask marks are made. In order to enrich the sample better, during training, a self-supervision mode can be adopted to cut the complete pedestrian to generate the half-length picture and the corresponding marking information. And then training the human body part detection model by adopting the training samples and the labeled labels thereof.

The human body part detection model is used for carrying out position region detection and classification recognition on the image, the output result of the human body part detection model can comprise the position information of each specified human body region and the confidence coefficient of the position information, and the possibility that the specified human body region exists can be determined based on the confidence coefficient. The higher the confidence, the higher the likelihood of indicating the existence of the specified human body region, and the lower the confidence, the lower the likelihood of indicating the existence of the specified human body region. Therefore, during implementation, a super-parameter confidence threshold value can be set to identify whether each designated human body region exists, then the same human body region is found out from the image group, and the human body features of the same human body region are subjected to feature fusion processing.

Fig. 6 is a schematic diagram showing the absence of a portion of a human body region. The human body part detection model detects each designated human body region from the image shown in fig. 6, and can obtain the position and confidence of each designated human body region. Since the lower leg portion of the leg is missing in the image shown in fig. 6, the confidence in the lower leg portion of the leg is very low, and thus it can be determined that there is no lower leg portion region in the image.

Therefore, the human body regions can be aligned by detecting different designated human body regions, and the feature fusion processing of the same designated human body region is realized. When the feature fusion processing is carried out, the image quality evaluation parameters are used as weight factors, the human body features of the same specified human body region are subjected to weighting operation processing, and first pedestrian features corresponding to all preset human body regions are obtained.

For example, the human body part detection model outputs an evaluation of whether each specified human body region exists, which is denoted as V; for example, the result of the evaluation of the presence of four designated body regions V ═ V1, V2, V3, V4 ═ 1,1, 1, 0], where 1 denotes the presence of the corresponding designated body region and 0 denotes the absence of the corresponding designated body region. The body region characteristics of the four designated body regions are noted as: { featP1, featP2, featP3, featP4 }. Here, feature p1 indicates a body region feature of a first designated body region, feature p2 indicates a body region feature of a second designated body region, feature p3 indicates a body region feature of a third designated body region, and feature p4 indicates a body region feature of a fourth designated body region. Based on this assumption, the feature fusion process can be performed based on the steps shown in fig. 7:

in step 701, a mask process is performed on the human body region features of each designated human body region according to the evaluation result of whether each designated human body region exists. Among them, the mask processing aims to invalidate a non-existing feature to a network in a later stage, thereby suppressing an adverse effect of a non-existing specified human body region at the time of weighting.

The mask processing method may be implemented as: if the existence marks [ v1, v2, v3 and v4] of the 4 components are all 1, the human body is indicated as a complete human body characteristic; if at least one of the 4 component existence marks [ v1, v2, v3, v4] is 0, such as [1,1,0,0], a part of human body is absent; by taking the vector product vi × featPi of the evaluation result V of the presence or absence of each designated human body region and the corresponding human body region feature vector, that is, [ V1 × featP1, V2 × featP2, V3 × featP3, V4 × featP4], the values of the human body region feature vectors of the non-existing designated human body regions are all 0, and the influence of the human body region feature vectors on the feature fusion result obtained by the subsequent weighted average or the network learning is suppressed.

Due to the existence of factors such as occlusion, resolution, inaccurate pedestrian detection frames and the like, the image quality of each frame of image in each image group is different. As shown in fig. 8, the black boxes in fig. 8 represent detected global positions and the gray boxes represent detected local part positions. Due to the influence of shielding and inaccurate human body detection in the grouping of the back of the human body, the detection frame for the global position of the human body does not exist in the middle image in fig. 8, and only the dark gray local part detection frame exists. According to the detection result of whether each designated human body region exists or not, human body alignment among different images can be achieved, so that weighting or network learning is carried out on the aligned global and local human body characteristics in combination with image quality evaluation, and the human body appearance characteristics of the front, back and side surfaces are obtained. Therefore, in step 702, the weighting operation process may be performed on the same designated human body region to achieve the alignment of the human body regions.

As shown in fig. 8, a1, a2, and a3 respectively represent image quality evaluation parameters of the three images in fig. 8, and when the three images are subjected to weighting calculation processing, the human body local features of the local parts in the dark gray frames of the three images may be weighted and averaged, and for example, if P1, P2, and P3 are the human body local features in the dark gray frames in fig. 8, the first pedestrian feature in the dark gray frames may be calculated as M' ═ a1 ═ P1+ a2 ═ P2+ a3 × P3. For the global positions within the black box, let R1, R3 be the body region features of the global positions of the first and third images in fig. 8. When the feature fusion processing is performed on the global position, M ═ a1 × R1+ a3 × R3 may be calculated as the first pedestrian feature after the feature fusion processing is performed on the global position.

In addition, in another embodiment, in addition to the feature fusion processing by the weighted operation processing, the feature fusion processing may be performed by the network learning. The human body characteristics of each frame of image in the image group and the image quality evaluation parameters of each frame of image are sequentially input to a pre-trained characteristic extraction model according to the sequence of each frame of image in the image sequence to perform the characteristic fusion processing. The training process of the feature extraction model can be briefly described as follows:

1) splicing the characteristics (including global characteristics) of each component and the quality evaluation scores of the corresponding components, namely performing concatene (connection) operation to form a characteristic vector with the length of +1 of the original characteristic length; and then, generating required embedding (coding) characteristics after connecting two full-connection layers.

2) And sending the embedding characteristics into the cross entropy and the triple loss for supervision training.

During network training, the back propagation gradient of the quality assessment score is zero, because the quality assessment score is only a reference value of a following network and does not need to be adjusted, and the characteristic needs to be adjusted.

In some embodiments, in addition to the aforementioned first pedestrian feature of each human posture category, in order to be able to describe the feature of the target object from more dimensions, the motion feature of the target object may be extracted from the image sequence of the target object. For example, as shown in fig. 9, in order to avoid the influence of the appearance features on the motion features, each frame of image of the image sequence may be segmented in step 901 to obtain a segmentation mask of the target object; then, in step 902, for each frame of image in the image sequence, extracting the motion characteristics of the target object by using the segmentation mask of the target object; then, in step 903, the extracted motion features of each frame image are subjected to fusion processing, and the total motion feature of the target object is obtained as a second pedestrian feature.

For example, after the detection of the human body whole part detection model, a global human body position may be obtained, and then a mask image of the target object is extracted from an image region of the global human body position, and then a motion feature is extracted based on the mask image.

In another embodiment, in order to be able to further more fully describe the pedestrian characteristics of the target object. In the embodiment of the application, feature extraction and quality evaluation processing can be further performed on each frame of image in the image sequence respectively to obtain feature information of the target object and quality evaluation parameters of each frame of image; and sequentially inputting the characteristic information and the quality evaluation parameters of each frame of image in the image sequence into a pre-trained incidence relation extraction model according to the sequence of each frame of image in the image sequence to obtain the third person characteristic of the target object output by the incidence relation extraction model.

The incidence relation extraction model can be a Long Short-Term Memory artificial neural network (LSTM), and the input sequence is processed to obtain the integral video feature as the third person feature.

In summary, in the embodiment of the present application, a schematic structural diagram of a single-frame image processing module can be shown in fig. 10. After a single frame image is input to a basic network 1(BackBone1), extracting basic characteristics A through BackBone1, and respectively inputting the basic characteristics A to a quality evaluation network to obtain image quality parameters of the town image; the basic feature A is input into a human body segmentation model to perform human body segmentation processing, a human body segmentation mask is obtained to be used for extracting motion features, the basic feature A and the human body segmentation mask are input into a basic network 2(BackBone2) to further extract a basic feature 2, then the human body component detection model performs position detection on the basic feature 2, and position information of each designated human body region, human body region features of each designated human body region and a category prediction result (namely a detection result of whether each component exists) are obtained. And inputting the basic characteristics A into a posture evaluation network to classify and recognize the images to obtain the human posture category to which the images belong. Then, based on the respective information of the single frame image obtained in fig. 10, the model shown in fig. 11 is used to perform subsequent processing to obtain pedestrian features of various dimensions. As shown in fig. 11, each image in the image sequence is input to the single-frame image processing module and then processed to obtain image groups of each human body posture category, and preferably, the grouping module screens the images based on the image quality parameters of each frame of image to filter out low-quality images. And then, performing feature fusion processing on the features in each image group to obtain appearance features, and performing fusion processing on the motion features of the images remaining from the filtering in the image sequence, for example, performing weighting operation processing according to image quality parameters to obtain the motion features. The characteristics of each frame image and the image quality parameters can obtain the video characteristics through an LSTM model. That is, the pedestrian features extracted from the video in the embodiment of the present application may include motion features, appearance features, and video features.

In the implementation of the present application, the related major supervision loss may include:

1) the labeling value of the image quality can be between 0 and 1, so that the output of the image quality evaluation network can be processed by sigmoid processing, namely, a formula (3) and a formula (4) are adopted for processing, so that the image quality can be directly used as a weight, and the subsequent feature fusion processing is directly weighted and summed to realize weighted average of the features:

in formula (3), x represents the output of the image quality evaluation network, p and

and respectively representing the label labeling value and the image quality parameter of the quality evaluation network. L is_MSLIndicating a loss.

2) Cross entropy loss and triplet loss can be used for local and global features for supervised training, such as formula (5) -formula (7);

L_tri＝[D^ap-D^np+α]₊ (6)

D^ap＝||f^a-f^p||,D^an＝||f^a-fⁿ|| (7)

in the formula (5), i represents different body posture categories, y and

respectively representing the marked category and the estimated category.

The ternary loss function is represented in formula (6), wherein the ternary means Anchor, Negative and Positive. After learning, the distance between the Positive element and the Anchor element is minimum, and the distance between the Positive element and the Negative element is maximum. Wherein the Anchor is a sample randomly selected from the training data set, the Positive is a sample belonging to the same class as the Anchor, and the Negative is a sample different from the Anchor. D^apDenotes the distance between a Positive element and an Anchor element, D^npRepresenting the distance between a Positive element and a Negative element.

In the formula (7) f^a，f^p，fⁿRespectively, are Anchor samplesThe human body gesture feature analysis method comprises the following steps of features, sample features of human body gesture categories which are the same as that of the Anchor, and Negative sample features which are not the same as that of the Anchor.

The purpose of the training is such that D^apIs less than D^anA denotes margin, i.e. the minimum separation between ap and an distances. F in equation 7 is the desired feature, namely embedding.

3) And the supervision loss for detecting and segmenting the Mask R-CNN target is shown as the formula (8):

L＝L_cls+L_box+L_mask (8)

in the formula L_cls，L_box，L_maskThe human posture category loss is represented by local and global categories of a human body, pedestrian local and global position loss and mask binary cross entropy loss in the embodiment of the application.

As shown in fig. 12, based on the same inventive concept, there is provided an apparatus 1200 for extracting pedestrian features, comprising:

a sequence acquiring module 1201, configured to acquire an image sequence of a target object;

the pose processing module 1202 is configured to classify each frame image in the image sequence according to the human pose category of the target object, obtain an image group corresponding to each human pose category, and extract a human feature of the target object in each frame image;

a pedestrian feature determination module 1203, configured to perform, for each of the human posture categories: and grouping the images corresponding to the human body posture categories, and performing feature fusion processing on the human body features of each frame of image in the image groups to obtain first pedestrian features corresponding to the human body posture categories.

the pedestrian feature determination module is configured to, when performing the weighted operation processing on the human body feature of each frame of image in the image group by using the image quality evaluation parameter as a weighting factor:

For implementation and beneficial effects of the operations in the device for extracting pedestrian features, reference is made to the description of the foregoing method, and details are not repeated here.

Having described the method and apparatus for extracting pedestrian features according to an exemplary embodiment of the present application, an electronic device according to another exemplary embodiment of the present application is described next.

As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.

In some possible implementations, an electronic device according to the present application may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method for extracting pedestrian features according to various exemplary embodiments of the present application described above in the present specification.

The electronic apparatus 130 according to this embodiment of the present application is described below with reference to fig. 13. The electronic device 130 shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 13, the electronic device 130 is represented in the form of a general electronic device. The components of the electronic device 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132, and a bus 133 that connects the various system components (including the memory 132 and the processor 131).

Bus 133 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.

The memory 132 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.

Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 130, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur via input/output (I/O) interfaces 135. Also, the electronic device 130 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 136. As shown, network adapter 136 communicates with other modules for electronic device 130 over bus 133. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 130, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

In some possible embodiments, various aspects of a method for extracting pedestrian features provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of a method for extracting pedestrian features according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The program product for extracting pedestrian features of the embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic devices may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., through the internet using an internet service provider).

It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.

Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable image scaling apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable image scaling apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable image scaling apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable image scaling device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer implemented process such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of extracting features of a pedestrian, the method comprising:

acquiring an image sequence of a target object;

2. The method of claim 1, wherein after the acquiring the sequence of images of the target object, the method further comprises:

3. The method according to claim 2, wherein the performing feature fusion processing on the human body feature of each frame image in the image packet by using the image quality evaluation parameter of each frame image in the image packet comprises:

4. The method according to claim 3, wherein the human body features comprise human body region features of respective ones of a plurality of designated human body regions;

5. The method according to claim 2, wherein the performing feature fusion processing on the human body feature of each frame image in the image packet by using the image quality evaluation parameter of each frame image in the image packet comprises:

6. The method according to any one of claims 1 to 5, wherein before performing feature fusion processing on the human body features of each frame of image in the image group corresponding to the human body posture category, the method further comprises:

7. The method of any of claims 1-5, wherein after the obtaining the sequence of images of the target object, the method further comprises:

8. The method of any of claims 1-5, wherein after the obtaining the sequence of images of the target object, the method further comprises:

9. An apparatus for extracting a feature of a pedestrian, comprising:

10. An electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.

11. A computer storage medium, characterized in that the computer storage medium stores a computer program for causing a computer to perform the method of any one of claims 1-8.