
WO2020062493A1 - Image processing method and apparatus - Google Patents

Image processing method and apparatus

Info

Publication number
WO2020062493A1
Authority
WO
WIPO (PCT)
Prior art keywords
key point
pose
preset
candidate frame
target candidate
Application number
PCT/CN2018/115968
Other languages
French (fr)
Chinese (zh)
Inventor
胡耀全 (Hu Yaoquan)
Original Assignee
北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-09-29
Application filed by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Publication of WO2020062493A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Definitions

  • Embodiments of the present application relate to the field of computer technology, specifically to the field of Internet technology, and in particular to an image processing method and apparatus.
  • The embodiments of the present application provide an image processing method and apparatus.
  • In a first aspect, an embodiment of the present application provides an image processing method, including: obtaining an image in which object poses have been labeled, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points; and training a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network.
  • The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image.
  • In some embodiments, before the image is input into the convolutional neural network and candidate poses are determined from the network's previously set anchor poses, the method further includes: clustering multiple preset poses in a target image to obtain key point sets; and determining each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.
  • In some embodiments, clustering the multiple preset poses in the target image to obtain key point sets includes: clustering the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the vector corresponding to a preset pose equals the number of key points of that pose; and composing the key points of the pose corresponding to each cluster-center vector into a key point set.
  • In some embodiments, for each key point of the target candidate frames corresponding to each labeled frame, taking the average position of the key point across the candidate poses in the target candidate frames includes: for each such key point, in response to determining that the position of the key point is outside the labeled frame, using a preset first weight as the weight of the key point in that target candidate frame; in response to determining that the position of the key point is within the labeled frame, using a preset second weight as the weight of the key point in that target candidate frame, the first preset weight being smaller than the second preset weight; and determining the average position of the key point across the target candidate frames based on these weights.
  • In some embodiments, for each key point of the target candidate frames corresponding to each labeled frame, taking the average position of the key point across the candidate poses in the target candidate frames includes: for each such key point, determining whether the distance between the key point and the corresponding key point of the labeled pose is less than or equal to a preset distance threshold; and, in response to determining that it is, determining the average position of the key point across the target candidate frames based on the key point's weights in those frames.
  • In a second aspect, an embodiment of the present application provides an image processing apparatus, including: an obtaining unit configured to obtain an image in which object poses have been labeled, where the image contains at least two objects, the poses of different objects differ, and each pose is indicated by multiple key points; and a training unit configured to train a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network.
  • The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image.
  • In some embodiments, the apparatus further includes: a clustering unit configured to cluster multiple preset poses in a target image to obtain key point sets; and a determining unit configured to determine each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.
  • In some embodiments, the clustering unit is further configured to: cluster the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the vector corresponding to a preset pose equals the number of key points of that pose; and compose the key points of the pose corresponding to each cluster-center vector into a key point set.
  • In some embodiments, the training unit is further configured to: for each key point within each target candidate frame corresponding to each labeled frame, in response to determining that the position of the key point is outside the labeled frame, use a preset first weight as the weight of the key point in that target candidate frame; in response to determining that the position of the key point is within the labeled frame, use a preset second weight as the weight of the key point in that target candidate frame, the first preset weight being smaller than the second preset weight; and determine the average position of the key point across the target candidate frames based on these weights.
  • In some embodiments, the training unit is further configured to: for each key point within each target candidate frame corresponding to each labeled frame, determine whether the distance between the key point and the corresponding key point of the labeled pose is less than or equal to a preset distance threshold; and, in response to determining that it is, determine the average position of the key point across the target candidate frames based on the key point's weights in those frames.
  • In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the image processing method.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method of any embodiment of the image processing method is implemented.
  • In the image processing solution provided by the embodiments of the present application, an image in which object poses have been labeled is first obtained, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points.
  • A convolutional neural network is then trained based on the image and the pose annotations to obtain a trained convolutional neural network.
  • The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object. The degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the already labeled poses is then determined, and candidate frames whose coincidence exceeds a preset coincidence threshold are used as target candidate frames.
  • For each key point in the target candidate frames corresponding to each labeled frame, the average position of that key point across the target candidate frames is taken. Finally, the set of average key point positions is used as one pose detected in the image.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of an image processing method according to the present application;
  • FIG. 3 is a schematic diagram of an application scenario of an image processing method according to the present application;
  • FIG. 4 is a flowchart of still another embodiment of an image processing method according to the present application;
  • FIG. 5 is a schematic structural diagram of an embodiment of an image processing apparatus according to the present application;
  • FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • FIG. 1 illustrates an exemplary system architecture 100 to which an embodiment of an image processing method or an image processing apparatus of the present application can be applied.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as image processing applications, video applications, live broadcast applications, instant communication tools, mailbox clients, social platform software, and so on.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • the terminal devices 101, 102, and 103 can be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop computers and desktop computers.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple software programs or software modules (for example, to provide distributed services) or as a single software program or module. No specific limitation is made here.
  • the server 105 may be a server that provides various services, such as a background server that supports the terminal devices 101, 102, and 103.
  • The background server may analyze and otherwise process acquired data, such as an image in which object poses have been labeled, and feed the processing result (for example, a pose detected in the image) back to the terminal device.
  • the image processing method provided in the embodiment of the present application may be executed by the server 105 or the terminal devices 101, 102, and 103. Accordingly, the image processing apparatus may be provided in the server 105 or the terminal devices 101, 102, and 103.
  • It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers as required by the implementation.
  • With continued reference to FIG. 2, a flow 200 of an embodiment of the image processing method according to the present application is shown. The image processing method includes the following steps:
  • Step 201: Obtain an image in which object poses have been labeled, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points.
  • In this embodiment, the execution subject of the image processing method (for example, the server or a terminal device shown in FIG. 1) may acquire an image in which the pose of each object has been labeled.
  • The objects here can be people, faces, cats, other objects, and so on.
  • The pose can be represented by the coordinates of its key points. For example, when a person is in a standing posture versus a squatting posture, the distance between the coordinates of the nose key point and the coordinates of the toe key point differs.
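  • As a concrete illustration, a pose can be stored as a mapping from key point names to coordinates. The following Python snippet is purely illustrative; the key point names and coordinate values are assumptions, not data from the method:

```python
# Illustrative only: a pose represented by named key point coordinates (x, y).
standing = {"nose": (50.0, 10.0), "toe": (50.0, 170.0)}
squatting = {"nose": (50.0, 90.0), "toe": (50.0, 170.0)}

def keypoint_distance(pose, a, b):
    """Euclidean distance between two named key points of a pose."""
    (xa, ya), (xb, yb) = pose[a], pose[b]
    return ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5

# The nose-to-toe distance differs between the two postures:
print(keypoint_distance(standing, "nose", "toe"))   # 160.0
print(keypoint_distance(squatting, "nose", "toe"))  # 80.0
```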
  • Step 202: Train the convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network.
  • the training process includes steps 2021, 2022, and 2023, as follows:
  • Step 2021: Input the image into the convolutional neural network, and determine candidate poses for each object based on the network's previously set anchor poses.
  • In this embodiment, the above execution body may input the acquired image into the convolutional neural network so that, based on the anchor poses previously set in the network, the convolutional neural network obtains candidate poses for each object.
  • In practice, the convolutional neural network includes a region proposal network (RPN).
  • The size and position of each anchor pose of the convolutional neural network in the image are fixed.
  • Specifically, the execution subject may input the image into the region proposal network, which may determine the difference in size and position between each candidate pose and an anchor pose, and use these differences to represent the size and position of the candidate pose.
  • The size here can be expressed by area, by width and height, or by length and width, and the position can be expressed by coordinates.
  • the execution subject described above may determine multiple candidate poses.
  • In the training process, the above execution body can obtain the pose output by the convolutional neural network as the pose detected in the image, determine a loss value between this pose and the labeled pose based on a preset loss function, and then use this loss value for training to obtain the trained convolutional neural network.
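  • One common way to realize this step in anchor-based detectors is to let the region proposal network predict a per-key-point offset that is added to a fixed anchor pose. The patent does not fix the exact parameterization, so the additive offsets and the squared-error loss below are assumptions made for illustration:

```python
# Assumed parameterization: the network outputs one (dx, dy) offset per anchor
# key point; adding the offsets to the fixed anchor pose yields a candidate pose.

def decode_candidate(anchor_pose, offsets):
    """Candidate pose = anchor key points shifted by the predicted offsets."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(anchor_pose, offsets)]

def pose_loss(predicted_pose, labeled_pose):
    """Squared-error loss between predicted and labeled key points
    (a stand-in for whatever preset loss function is actually used)."""
    return sum((px - lx) ** 2 + (py - ly) ** 2
               for (px, py), (lx, ly) in zip(predicted_pose, labeled_pose))

anchor = [(40.0, 20.0), (40.0, 160.0)]         # fixed size and position
offsets = [(5.0, -2.0), (4.0, 3.0)]            # assumed network output
candidate = decode_candidate(anchor, offsets)
print(candidate)                               # [(45.0, 18.0), (44.0, 163.0)]
print(pose_loss(candidate, [(46.0, 18.0), (44.0, 165.0)]))  # 1.0 + 4.0 = 5.0
```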
  • Step 2022: Determine the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the already labeled poses, and use candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames.
  • In this embodiment, the above execution body may determine the degree of coincidence (Intersection over Union, IoU) between the candidate frame where each candidate pose is located and the labeled frame of an already labeled pose. After that, the execution body may select the candidate frames whose coincidence exceeds the preset coincidence threshold and use them as target candidate frames.
  • In practice, the frame of a pose may take its width (or length) from the leftmost and rightmost key point coordinates of the pose, and its height (or width) from the topmost and bottommost key point coordinates.
  • Specifically, the coincidence degree may be the ratio of the intersection between the candidate frame and the labeled frame to the union of the two. A large overlap between the candidate frame and the labeled frame indicates that the candidate frame frames the object accurately, so such a candidate frame can more precisely separate the object from the non-object.
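  • The frame construction and coincidence computation can be sketched as follows. The (x0, y0, x1, y1) frame representation and the 0.5 threshold are assumptions chosen for illustration, not values fixed by the method:

```python
# Sketch of Step 2022: a pose is a list of (x, y) key points, a frame is an
# (x0, y0, x1, y1) tuple, and coincidence is intersection over union (IoU).

def frame_of(pose):
    """Frame spanned by the pose's leftmost/rightmost and top/bottom key points."""
    xs = [x for x, _ in pose]
    ys = [y for _, y in pose]
    return min(xs), min(ys), max(xs), max(ys)

def iou(a, b):
    """Degree of coincidence of two frames: intersection area over union area."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_target_frames(candidate_poses, labeled_frame, threshold=0.5):
    """Keep candidate poses whose frame coincides enough with the labeled frame."""
    return [p for p in candidate_poses
            if iou(frame_of(p), labeled_frame) > threshold]

# The first candidate overlaps the labeled frame heavily, the second not at all:
candidates = [[(10.0, 20.0), (60.0, 120.0)], [(200.0, 20.0), (260.0, 120.0)]]
targets = select_target_frames(candidates, labeled_frame=(15.0, 25.0, 65.0, 125.0))
print(len(targets))  # 1
```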
  • Step 2023: For each key point in the target candidate frames corresponding to each labeled frame, take the average position of that key point across the target candidate frames, and use the set of average key point positions as one pose detected in the image.
  • In this embodiment, for each labeled frame, the execution body may take the average position of each key point across the candidate poses in the target candidate frames corresponding to that labeled frame. The execution body may then use the set of these average positions as one pose detected in the image.
  • A labeled frame and its corresponding target candidate frames indicate the same object.
  • When calculating the average position, the positions of a key point may all be given the same weight, or the weights set for the positions of the key points may differ.
  • In some optional implementations of this embodiment, Step 2023 of taking, for each key point in the target candidate frames corresponding to each labeled frame, the average position of that key point across at least two target candidate frames may include: for each such key point, in response to determining that the position of the key point is outside the labeled frame, using a preset first weight as the weight of the key point in that target candidate frame; in response to determining that the position of the key point is within the labeled frame, using a preset second weight, where the first preset weight is smaller than the second preset weight; and determining the average position of the key point across the target candidate frames based on these weights.
  • In practice, the above execution body may assign a smaller weight to coordinates of positions outside the labeled frame and a larger weight to coordinates of positions inside it. For example, suppose key point A and key point B lie inside the labeled frame while key point C lies outside it, with weights 1, 1, and 0.5 respectively. The resulting average position is (1 × position of A + 1 × position of B + 0.5 × position of C) / (1 + 1 + 0.5).
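  • A sketch of this weighted averaging follows, mirroring the numeric example above; the weights 1 and 0.5, the frame, and the coordinates are illustrative assumptions:

```python
# Weighted average position of one key point across target candidate frames:
# the (larger) second preset weight applies when the point lies inside the
# labeled frame, the (smaller) first preset weight when it lies outside.

def inside(point, frame):
    """Whether an (x, y) point lies within an (x0, y0, x1, y1) frame."""
    x, y = point
    x0, y0, x1, y1 = frame
    return x0 <= x <= x1 and y0 <= y <= y1

def weighted_average(points, labeled_frame, w_outside=0.5, w_inside=1.0):
    weights = [w_inside if inside(p, labeled_frame) else w_outside for p in points]
    total = sum(weights)
    x = sum(w * px for w, (px, _) in zip(weights, points)) / total
    y = sum(w * py for w, (_, py) in zip(weights, points)) / total
    return x, y

# Key points A and B fall inside the labeled frame, key point C outside it:
a, b, c = (10.0, 10.0), (12.0, 14.0), (40.0, 40.0)
print(weighted_average([a, b, c], labeled_frame=(0.0, 0.0, 20.0, 20.0)))
# x = (1*10 + 1*12 + 0.5*40) / (1 + 1 + 0.5) = 16.8, and y = 17.6
```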
  • In some optional implementations of this embodiment, Step 2023 may instead include: for each key point in the target candidate frames corresponding to each labeled frame, determining whether the distance between that key point and the corresponding key point of the labeled pose is less than or equal to a preset distance threshold; and, in response to determining that it is, determining the average position of the key point across the target candidate frames based on the key point's weights in those frames.
  • In these implementations, the execution body may determine, for each key point in the target candidate frames corresponding to a labeled frame, whether its distance to the corresponding key point of the labeled pose is less than or equal to the preset distance threshold, and select key points accordingly. That is, key points in some target candidate frames do not participate in computing the position average: a key point participates only if its distance to the labeled key point is sufficiently small.
  • For example, suppose the three target candidate frames a, b, and c corresponding to a labeled frame M each contain a nose-tip key point, and the distances between these key points and the nose-tip key point labeled in M are 1, 2, and 3, respectively. If the preset distance threshold is 2.5, the distances 1 and 2 for frames a and b are below the threshold, so the nose-tip key points in a and b participate in calculating the position average, while the one in c does not.
  • These implementations select, from the key points within the target candidate frames corresponding to a labeled frame, those closer to the labeled pose when determining the position average. This prevents key points with large deviations from participating in the calculation and thereby improves the accuracy of the determined pose.
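  • A minimal sketch of this filtering follows, reusing the nose-tip numbers from the example above (distances 1, 2, and 3 against a threshold of 2.5); the coordinates themselves are illustrative assumptions:

```python
# Drop key points whose distance to the labeled key point exceeds the preset
# threshold, then average only the survivors.
import math

def filtered_average(points, labeled_point, max_distance=2.5):
    """Average of the points lying within max_distance of the labeled point."""
    kept = [p for p in points if math.dist(p, labeled_point) <= max_distance]
    n = len(kept)
    return sum(x for x, _ in kept) / n, sum(y for _, y in kept) / n

labeled_nose = (100.0, 50.0)
# Nose-tip key points in target candidate frames a, b, c at distances 1, 2, 3:
noses = [(101.0, 50.0), (102.0, 50.0), (103.0, 50.0)]
print(filtered_average(noses, labeled_nose))  # (101.5, 50.0): frame c excluded
```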
  • FIG. 3 is a schematic diagram of an application scenario of the image processing method according to this embodiment.
  • In the application scenario of FIG. 3, the execution body 301 may first obtain an image 302 in which object poses have been labeled, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points.
  • Then, based on the image and the pose annotations, the convolutional neural network is trained to obtain the trained convolutional neural network.
  • The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses 303, determining candidate poses 304 for each object.
  • In this embodiment, the candidate poses in an image containing at least two objects can be filtered by coincidence degree so as to select target candidate frames that indicate the objects more accurately.
  • Moreover, averaging the key point positions makes it possible to accurately distinguish each pose in the image.
  • FIG. 4 illustrates a flowchart 400 of still another embodiment of an image processing method.
  • As shown in FIG. 4, the flow 400 of the image processing method includes the following steps:
  • Step 401: Cluster multiple preset poses in the target image to obtain key point sets.
  • In this embodiment, the execution subject on which the image processing method runs (for example, the server or a terminal device shown in FIG. 1) can obtain a target image and cluster multiple preset poses in the target image to obtain key point sets.
  • In practice, the foregoing execution subject may cluster the preset poses in multiple ways. For example, the position coordinates of each key point can be clustered separately to obtain a clustering result for that key point.
  • In some optional implementations of this embodiment, the foregoing Step 401 may include the following steps:
  • Each preset pose may be represented by a multi-dimensional vector, in which each dimension corresponds to the position coordinates of one key point of the preset pose.
  • One or more cluster centers can be obtained by clustering, and each cluster center here is itself a multi-dimensional vector.
  • The above execution subject may then compose the key points of the pose indicated by each cluster-center multi-dimensional vector into a key point set.
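  • As an illustration, the sketch below flattens each preset pose into a vector and runs a tiny k-means. The two-cluster setup and all coordinates are assumptions, the cluster center itself is taken as the anchor pose here, and a real system would use many more poses and a library implementation:

```python
# Minimal k-means over flattened pose vectors (x1, y1, x2, y2, ...); each
# cluster center, reshaped back into (x, y) pairs, yields one anchor pose.
import math

def kmeans(vectors, k, iters=20):
    centers = vectors[:k]                      # naive initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda j: math.dist(v, centers[j]))
            groups[i].append(v)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

# Four preset poses with two key points each, flattened to (x1, y1, x2, y2):
poses = [(0, 0, 10, 40), (1, 1, 11, 41), (30, 0, 40, 40), (31, 1, 41, 41)]
anchor_poses = [list(zip(c[0::2], c[1::2])) for c in kmeans(poses, k=2)]
print(anchor_poses)  # [[(0.5, 0.5), (10.5, 40.5)], [(30.5, 0.5), (40.5, 40.5)]]
```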
  • Step 402: Determine each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.
  • In this embodiment, the above execution body may determine each obtained key point set as an anchor pose. In this way, the positions of the resulting anchor poses are better differentiated. At the same time, this embodiment clusters multiple preset poses to obtain accurate anchor poses, which reduces the deviation between detected candidate poses and the anchor poses during pose detection.
  • Step 403: Obtain an image in which object poses have been labeled, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points.
  • In this embodiment, the above execution subject may obtain an image in which the pose of each object has been labeled. The objects here can be people, faces, cats, other objects, and so on, and a pose can be represented by the coordinates of its key points.
  • Step 404: Train the convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network.
  • the training process includes steps 4041, 4042, and 4043, as follows:
  • Step 4041: Input the image into the convolutional neural network, and determine candidate poses for each object based on the network's previously set anchor poses.
  • In this embodiment, the above execution body may input the acquired image into the convolutional neural network so that, based on the previously set anchor poses, the candidate poses of each object are obtained by the network.
  • In practice, the convolutional neural network includes a region proposal network, and the size and position of each anchor pose in the image are fixed.
  • Step 4042: Determine the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the already labeled poses, and use candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames.
  • the execution body may determine the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frame of the already labeled pose. After that, the execution body may select a candidate frame whose coincidence degree is greater than a preset coincidence degree threshold, and use the selected candidate frame as a target candidate frame.
  • Step 4043: For each key point in the target candidate frames corresponding to each labeled frame, take the average position of that key point across the target candidate frames, and use the set of average key point positions as one pose detected in the image.
  • In this embodiment, for each labeled frame, the execution body may take the average position of each key point across the candidate poses in the corresponding target candidate frames, and may then use the set of these average positions as one pose detected in the image.
  • The anchor poses obtained in this embodiment are better differentiated, which helps control the number of anchor poses while still obtaining a rich variety of them. In this way, the computation speed of the region proposal network can be increased while the deviation between detected candidate poses and anchor poses is kept small.
  • In addition, this embodiment can also cluster multiple preset poses to obtain accurate anchor poses, thereby further reducing the deviation between the detected candidate poses and the anchor poses.
  • As an implementation of the methods shown in the above figures, this application provides an embodiment of an image processing apparatus.
  • This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
  • As shown in FIG. 5, the image processing apparatus 500 in this embodiment includes an obtaining unit 501 and a training unit 502.
  • The obtaining unit 501 is configured to obtain an image in which object poses have been labeled, where the image contains at least two objects, the poses of different objects differ, and each pose is indicated by multiple key points.
  • The training unit 502 is configured to train the convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network, where the training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the coincidence between the candidate frame where each candidate pose is located and the labeled frames of the labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image.
  • In this embodiment, the obtaining unit 501 of the image processing apparatus 500 may obtain an image in which the pose of each object has been labeled. The objects here can be people, faces, cats, other objects, and so on, and a pose can be represented by the coordinates of its key points; for example, the distance between the coordinates of the nose key point and the coordinates of the toe key point differs between a standing posture and a squatting posture.
  • In this embodiment, the training unit 502 may input the acquired image into the convolutional neural network so as to obtain candidate poses for each object based on the anchor poses previously set in the network. It may then determine the coincidence of each candidate frame with the labeled frames, select the candidate frames whose coincidence exceeds a preset coincidence threshold, and use them as target candidate frames.
  • The training unit 502 may also take, for each key point in the target candidate frames corresponding to each labeled frame, the average position of that key point across the candidate poses in those target candidate frames.
  • In some optional implementations of this embodiment, the apparatus further includes: a clustering unit configured to cluster multiple preset poses in a target image to obtain key point sets; and a determining unit configured to determine each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.
  • In some optional implementations of this embodiment, the clustering unit is further configured to: cluster the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the vector corresponding to a preset pose equals the number of key points of that pose; and compose the key points of the pose corresponding to each cluster-center vector into a key point set.
  • In some optional implementations of this embodiment, the training unit is further configured to: for each key point within each target candidate frame corresponding to each labeled frame, in response to determining that the position of the key point is outside the labeled frame, use a preset first weight as the weight of the key point in that target candidate frame; in response to determining that the position of the key point is within the labeled frame, use a preset second weight as the weight of the key point in that target candidate frame, the first preset weight being smaller than the second preset weight; and determine the average position of the key point across the target candidate frames based on these weights.
  • In some optional implementations of this embodiment, the training unit is further configured to: for each key point within each target candidate frame corresponding to each labeled frame, determine whether the distance between the key point and the corresponding key point of the labeled pose is less than or equal to a preset distance threshold; and, in response to determining that it is, determine the average position of the key point across the target candidate frames based on the key point's weights in those frames.
  • FIG. 6 illustrates a schematic structural diagram of a computer system 600 suitable for implementing an electronic device according to an embodiment of the present application.
  • the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
  • As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU and/or GPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603.
  • The RAM 603 also stores various programs and data required for the operation of the system 600.
  • The central processing unit 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem.
  • The communication portion 609 performs communication processing via a network such as the Internet.
  • A drive 610 is also connected to the I/O interface 605 as necessary.
  • A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.
  • In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing a method shown in a flowchart.
  • the computer program may be downloaded and installed from a network through the communication portion 609, and / or installed from a removable medium 611.
  • the computer-readable medium of the present application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the foregoing.
  • The computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, the data signal carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • In the flowcharts and block diagrams, each block may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units described in the embodiments of the present application may be implemented by software or hardware.
  • The described units may also be provided in a processor, which may, for example, be described as: a processor including an obtaining unit and a training unit. The names of these units do not, in some cases, limit the units themselves.
  • For example, the obtaining unit may also be described as "a unit for obtaining an image in which an object's pose has been labeled".
  • As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the foregoing embodiments, or may exist alone without being assembled into the apparatus.
  • The computer-readable medium described above carries one or more programs. When the one or more programs are executed by the apparatus, the apparatus is caused to: obtain an image in which object poses have been labeled, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points.
  • The apparatus is further caused to train the convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network.
  • The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the already labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in the embodiments of the present application are an image processing method and apparatus. A specific embodiment of the method comprises: acquiring an image in which the poses of the subjects are already labelled; training a convolutional neural network on the basis of the image and the pose labels to obtain a trained convolutional neural network, the training process comprising: inputting the image into the convolutional neural network and, on the basis of preset anchor poses of the convolutional neural network, determining candidate poses of each subject; setting candidate frames having a degree of coincidence greater than a preset coincidence threshold as target candidate frames; for each key point in a target candidate frame corresponding to each labelled frame, taking the average position of the key point across the target candidate frames; and setting the set of average key point positions as a pose detected for the image. In the present embodiment, the candidate poses are filtered by means of the degree of coincidence and the key point positions are averaged in order to accurately distinguish the poses in an image.

Description

Image processing method and apparatus
This patent application claims priority to Chinese patent application No. 201811149818.4, filed on September 29, 2018 by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.) and entitled "Image Processing Method and Apparatus", which is incorporated herein by reference in its entirety.
Technical field
Embodiments of the present application relate to the field of computer technology, specifically to the field of Internet technology, and in particular to an image processing method and apparatus.
Background
When confirming human-body key points, it is sometimes necessary to confirm the key points of a single person and sometimes the key points of each of multiple people. In related technologies, when detecting the key points of each of multiple people, it is often difficult to obtain accurate detection results.
Summary of the Invention
The embodiments of the present application provide an image processing method and apparatus.
In a first aspect, an embodiment of the present application provides an image processing method, including: obtaining an image in which object poses have been labeled, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points; and training a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network. The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image.
In some embodiments, before the image is input into the convolutional neural network and candidate poses are determined from the network's previously set anchor poses, the method further includes: clustering multiple preset poses in a target image to obtain key point sets; and determining each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.
In some embodiments, clustering the multiple preset poses in the target image to obtain key point sets includes: clustering the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the vector corresponding to a preset pose equals the number of key points of that pose; and composing the key points of the pose corresponding to each cluster-center vector into a key point set.
In some embodiments, for each key point of the target candidate frames corresponding to each labeled frame, taking the average position of the key point across the candidate poses in the target candidate frames includes: for each such key point, in response to determining that the position of the key point is outside the labeled frame, using a preset first weight as the weight of the key point in that target candidate frame; in response to determining that the position of the key point is within the labeled frame, using a preset second weight as the weight of the key point in that target candidate frame, the first preset weight being smaller than the second preset weight; and determining the average position of the key point across the target candidate frames based on these weights.
In some embodiments, for each key point of the target candidate frames corresponding to each labeled frame, taking the average position of the key point across the candidate poses in the target candidate frames includes: for each such key point, determining whether the distance between the key point and the corresponding key point of the labeled pose is less than or equal to a preset distance threshold; and, in response to determining that it is, determining the average position of the key point across the target candidate frames based on the key point's weights in those frames.
In a second aspect, an embodiment of the present application provides an image processing apparatus, including: an obtaining unit configured to obtain an image in which object poses have been labeled, where the image contains at least two objects, the poses of different objects differ, and each pose is indicated by multiple key points; and a training unit configured to train a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network. The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image.
In some embodiments, the apparatus further includes: a clustering unit configured to cluster multiple preset poses in a target image to obtain key point sets; and a determining unit configured to determine each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.
In some embodiments, the clustering unit is further configured to: cluster the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the vector corresponding to a preset pose equals the number of key points of that pose; and compose the key points of the pose corresponding to each cluster-center vector into a key point set.
In some embodiments, the training unit is further configured to: for each key point within each target candidate frame corresponding to each labeled frame, in response to determining that the position of the key point is outside the labeled frame, use a preset first weight as the weight of the key point in that target candidate frame; in response to determining that the position of the key point is within the labeled frame, use a preset second weight as the weight of the key point in that target candidate frame, the first preset weight being smaller than the second preset weight; and determine the average position of the key point across the target candidate frames based on these weights.
In some embodiments, the training unit is further configured to: for each key point within each target candidate frame corresponding to each labeled frame, determine whether the distance between the key point and the corresponding key point of the labeled pose is less than or equal to a preset distance threshold; and, in response to determining that it is, determine the average position of the key point across the target candidate frames based on the key point's weights in those frames.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the image processing method.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method of any embodiment of the image processing method is implemented.
In the image processing solution provided by the embodiments of the present application, an image in which object poses have been labeled is first obtained, where the image contains at least two objects, different objects have different poses, and each pose is indicated by multiple key points. A convolutional neural network is then trained based on the image and the pose annotations to obtain a trained convolutional neural network. The training process includes: inputting the image into the convolutional neural network and, based on the network's previously set anchor poses, determining candidate poses for each object; determining the degree of coincidence between the candidate frame where each candidate pose is located and the labeled frames of the labeled poses, and using candidate frames whose coincidence exceeds a preset coincidence threshold as target candidate frames; for each key point within the target candidate frames corresponding to each labeled frame, taking the average position of that key point across the target candidate frames; and using the set of average key point positions as one pose detected in the image. This embodiment can filter the candidate poses in an image containing at least two objects by coincidence degree so as to select target candidate frames that indicate the objects more accurately, and averaging the key point positions makes it possible to accurately distinguish each pose in the image.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of an embodiment of an image processing method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of an image processing method according to the present application;
FIG. 4 is a flowchart of still another embodiment of an image processing method according to the present application;
FIG. 5 is a schematic structural diagram of an embodiment of an image processing apparatus according to the present application;
FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
具体实施方式detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in connection with the embodiments.

FIG. 1 shows an exemplary system architecture 100 to which embodiments of the image processing method or image processing apparatus of the present application may be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as image processing applications, video applications, live streaming applications, instant messaging tools, email clients, and social platform software.

The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or a single software module. No specific limitation is imposed here.

The server 105 may be a server providing various services, for example a backend server that supports the terminal devices 101, 102, and 103. The backend server may analyze and otherwise process acquired data such as images with annotated object poses, and feed the processing result (for example, a pose detected in an image) back to the terminal devices.

It should be noted that the image processing method provided by the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; accordingly, the image processing apparatus may be provided in the server 105 or in the terminal devices 101, 102, 103.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
With continued reference to FIG. 2, a flow 200 of an embodiment of the image processing method according to the present application is shown. The image processing method includes the following steps:

Step 201: acquire an image in which the poses of objects have been annotated, where the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points.

In this embodiment, the execution body of the image processing method (for example, the server or terminal device shown in FIG. 1) may acquire an image in which the poses of objects have been annotated. In the image, the poses of the objects are marked out. The objects here may be people, human faces, cats, articles, and so on. Specifically, a pose may be represented by the coordinates of its key points. For example, when a person stands versus squats, the distance between the coordinates of the nose-tip key point and the coordinates of the toe key point differs.
Step 202: based on the image and the pose annotations, train a convolutional neural network to obtain a trained convolutional neural network. The training process includes step 2021, step 2022, and step 2023, as follows:

Step 2021: input the image into the convolutional neural network, and determine candidate poses of each object based on anchor poses previously set for the convolutional neural network.

In this embodiment, the execution body may input the acquired image into the convolutional neural network, so that the convolutional neural network obtains candidate poses (proposals) of each object based on the anchor poses previously set in the network. Specifically, the convolutional neural network includes a Region Proposal Network (RPN). The size and position of an anchor pose in the image are fixed. The execution body may input the image into the region proposal network, which may determine the size difference and position difference between each candidate pose and an anchor pose and use these differences to express the size and position of the candidate pose. Here, size may be expressed as an area, or as a width and height (or length and width), and position may be expressed in coordinates. For each object in the image, the execution body may determine multiple candidate poses.
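As an illustration of how a candidate pose could be recovered from a fixed anchor pose and the offsets regressed by the region proposal network, consider the following minimal sketch. The per-key-point additive offset parameterization, the numpy implementation, and all names are assumptions for illustration, not part of the application:

    import numpy as np

    def decode_candidate_pose(anchor_keypoints, predicted_offsets):
        """Decode a candidate pose from a fixed anchor pose.

        anchor_keypoints: (K, 2) array of fixed anchor key point coordinates.
        predicted_offsets: (K, 2) array of per-key-point (dx, dy) offsets
        regressed by the region proposal network (assumed parameterization).
        Returns the (K, 2) key point coordinates of the candidate pose.
        """
        return anchor_keypoints + predicted_offsets

    # Example: a three-key-point anchor pose shifted by predicted offsets.
    anchor = np.array([[50.0, 40.0], [55.0, 80.0], [52.0, 120.0]])
    offsets = np.array([[2.0, -1.0], [1.5, 0.5], [-0.5, 2.0]])
    candidate = decode_candidate_pose(anchor, offsets)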
During training, the execution body may take the pose output by the convolutional neural network as the pose detected in the image and, based on a preset loss function, determine a loss value between that pose and the annotated pose. This loss value is then used for training to obtain the trained convolutional neural network.
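The application does not fix the form of the preset loss function. As one plausible instantiation, a mean squared error over corresponding key point coordinates could be used; the following sketch is an assumption for illustration only:

    import numpy as np

    def keypoint_loss(detected_pose, annotated_pose):
        """Mean squared error between detected and annotated key points.

        detected_pose / annotated_pose: (K, 2) arrays of key point
        coordinates. A stand-in for the application's unspecified
        'preset loss function'.
        """
        return float(np.mean((detected_pose - annotated_pose) ** 2))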
Step 2022: determine the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and take candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes.

In this embodiment, the execution body may determine the Intersection over Union (IoU) of the candidate box of each candidate pose with the annotation box of the annotated pose. The execution body may then select candidate boxes whose overlap is greater than a preset overlap threshold and take the selected candidate boxes as target candidate boxes. Specifically, the width and height of a pose's box may be the width (or length) spanned by the leftmost and rightmost coordinates of the key points included in the pose, and the height (or width) spanned by the topmost and bottommost coordinates. The degree of overlap may be the ratio of the intersection of the candidate box and the annotation box to their union. If the overlap between a candidate box and an annotation box is large, the candidate box frames the object with high accuracy, so the candidate box separates object from non-object more accurately.
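A minimal sketch of the pose box and the overlap computation described above, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the names and the box format are illustrative assumptions:

    def pose_box(keypoints):
        """Bounding box spanned by a pose's key point coordinates."""
        xs = [x for x, _ in keypoints]
        ys = [y for _, y in keypoints]
        return (min(xs), min(ys), max(xs), max(ys))

    def iou(box_a, box_b):
        """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

A candidate box would then be kept as a target candidate box when, for example, iou(candidate, annotation) exceeds the preset overlap threshold.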
Step 2023: for each key point within the target candidate boxes corresponding to each annotation box, take the average position of that key point across the target candidate boxes; take the set of average key point positions as one pose detected in the image.

In this embodiment, for each key point within the target candidate boxes corresponding to an annotation box, the execution body may take the average position of that key point across the candidate poses in the target candidate boxes corresponding to that annotation box. The execution body may thereby take the set of average key point positions of the target candidate boxes corresponding to the annotation box as one pose detected in the image. Corresponding annotation boxes and target candidate boxes indicate the same object.

Specifically, the same weight may be assigned to the position of the key point in each target candidate box when computing the position average. Alternatively, the weights assigned to the positions may differ.

It should be noted that although a position average is taken for every key point of the poses in the target candidate boxes, in this embodiment it is possible that the key point in some target candidate boxes does not participate in the position average.
In some optional implementations of this embodiment, taking, in step 2023, for each key point within the target candidate boxes corresponding to each annotation box, the average position of that key point across at least two target candidate boxes may include:

for each key point within each target candidate box corresponding to each annotation box, in response to determining that the position of the key point is outside the annotation box, taking a preset first weight as the weight of that key point in that target candidate box; for each key point within each target candidate box corresponding to each annotation box, in response to determining that the position of the key point is inside the annotation box, taking a preset second weight as the weight of that key point in that target candidate box, the first preset weight being smaller than the second preset weight; and determining, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.

In these optional implementations, when computing the position average, the execution body may apply a smaller weight to the coordinates of positions outside the annotation box and a larger weight to the coordinates of positions inside it. For example, if key point A, key point B, and key point C lie inside the annotation box, inside the annotation box, and outside the annotation box respectively, weights of 1, 1, and 0.5 may be applied to key point A, key point B, and key point C to compute the position average. The resulting position average is then (1 × position of A + 1 × position of B + 0.5 × position of C) / (1 + 1 + 0.5).
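The weighted average in the example above can be written compactly; the following sketch assumes numpy and the weights 0.5 and 1.0 from the worked example, with all names hypothetical:

    import numpy as np

    def weighted_keypoint_average(positions, inside_box,
                                  w_outside=0.5, w_inside=1.0):
        """Weighted average position of one key point across target boxes.

        positions: (N, 2) array, the key point's position in each of the N
        target candidate boxes corresponding to one annotation box.
        inside_box: (N,) boolean array, whether each position lies inside
        the annotation box.
        w_outside / w_inside: the first and second preset weights
        (w_outside < w_inside), values assumed from the worked example.
        """
        weights = np.where(inside_box, w_inside, w_outside)
        return (weights[:, None] * positions).sum(axis=0) / weights.sum()

    # The worked example: key points A and B inside the box, C outside.
    positions = np.array([[10.0, 10.0], [12.0, 10.0], [20.0, 14.0]])
    inside = np.array([True, True, False])
    average = weighted_keypoint_average(positions, inside)  # weights 1, 1, 0.5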
These implementations can weight the key point differently across target candidate boxes. Because key points outside the annotation box tend to be less accurate, this weighting scheme reduces their influence so as to obtain a more accurate average key point position and thereby determine the pose accurately.
In some optional implementations of this embodiment, taking, in step 2023, for each key point within the target candidate boxes corresponding to each annotation box, the average position of that key point across at least two target candidate boxes may include:

for each key point within each target candidate box corresponding to each annotation box, determining whether the distance between that key point and the corresponding key point of the annotated pose is less than or equal to a preset distance threshold; and, in response to determining that it is less than or equal to the threshold, determining, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.

In these optional implementations, the execution body may determine whether the distance between each key point in each target candidate box corresponding to an annotation box and the corresponding key point of the pose annotated in that annotation box is less than or equal to a preset distance threshold, and on that basis keep or discard the key point in each target candidate box corresponding to the annotation box. That is, in these implementations the key point in some target candidate boxes does not participate in the position average. Specifically, if the distance between a key point in one of the target candidate boxes corresponding to the annotation box and the annotated key point is small, it is determined that the key point may participate in computing the position average. If the distance between a key point in one of the target candidate boxes and the annotated key point is large, the key point in the candidate pose produced by the convolutional neural network has poor accuracy, and it may be determined that the key point does not participate in computing the position average.

For example, suppose the three target candidate boxes a, b, and c corresponding to an annotation box M each contain a nose-tip key point, and the distances between those key points and the nose-tip key point annotated in M are 1, 2, and 3 respectively. If the preset distance threshold is 2.5, the distances 1 and 2 for target candidate boxes a and b are both below the threshold, so the nose-tip key points in target candidate boxes a and b may participate in computing the position average.
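The distance screening in this example can be sketched as follows, assuming Euclidean distance and numpy; the threshold of 2.5 is taken from the example and all names are hypothetical:

    import numpy as np

    def filter_keypoints_by_distance(positions, annotated_position,
                                     threshold=2.5):
        """Keep only positions close enough to the annotated key point.

        positions: (N, 2) array of one key point's position in each target
        candidate box; annotated_position: (2,) annotated coordinates.
        Returns the positions whose Euclidean distance to the annotation
        is less than or equal to the threshold.
        """
        distances = np.linalg.norm(positions - annotated_position, axis=1)
        return positions[distances <= threshold]

With the distances 1, 2, and 3 of the example and a threshold of 2.5, only the first two candidate positions would be kept for the position average.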
These implementations can select, from among the key points in the target candidate boxes corresponding to an annotation box, those close to the annotation box to determine the position average, so that key points with large deviations are kept out of the computation, which in turn improves the accuracy of the determined pose.
With continued reference to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the image processing method according to this embodiment. In the application scenario of FIG. 3, the execution body 301 may acquire an image 302 in which the poses of objects have been annotated, where the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points. A convolutional neural network is trained based on the image and the pose annotations to obtain a trained convolutional neural network. The training process includes: inputting the image into the convolutional neural network, and determining candidate poses 304 of each object based on previously set anchor poses 303 of the convolutional neural network; determining the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and taking candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes 305; for each key point within the target candidate boxes corresponding to each annotation box, taking the average position 306 of that key point across the target candidate boxes; and taking the set of average key point positions as one pose 307 detected in the image.

This embodiment can screen the candidate poses in an image containing at least two objects by their degree of overlap, so as to select target candidate boxes that indicate the objects more accurately. Moreover, averaging the key point positions makes it possible to distinguish the individual poses in the image accurately.
With further reference to FIG. 4, a flow 400 of another embodiment of the image processing method is shown. The flow 400 of the image processing method includes the following steps:

Step 401: cluster a plurality of preset poses in a target image to obtain key point sets.

In this embodiment, the execution body on which the image processing method runs (for example, the server or terminal device shown in FIG. 1) may acquire a target image and cluster a plurality of preset poses in the target image to obtain key point sets. Specifically, the execution body may cluster the preset poses in various ways. For example, the coordinates of the position of each key point may be clustered to obtain a clustering result for each key point.
In some optional implementations of this embodiment, step 401 may include the following steps:

clustering the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the multi-dimensional vector corresponding to a preset pose is the same as the number of key points of the preset pose; and forming a key point set from the key points of the pose corresponding to the multi-dimensional vector of each cluster center.

In these implementations, a preset pose may be represented by a multi-dimensional vector. Each dimension of the multi-dimensional vector corresponds to the coordinates of the position of one key point of the preset pose. One or more cluster centers can be obtained by clustering, and each cluster center is also a multi-dimensional vector. The execution body may form the key points of the pose indicated by this multi-dimensional vector into a key point set.
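The application does not name a particular clustering algorithm. One common choice would be k-means over the pose vectors, with each cluster center reshaped back into a key point set; the following sketch assumes k-means and numpy, and all names are hypothetical:

    import numpy as np

    def cluster_anchor_poses(preset_poses, num_anchors, iterations=20, seed=0):
        """Cluster preset poses into anchor poses (a minimal k-means sketch).

        preset_poses: (P, K, 2) array of P preset poses with K key points
        each. Each pose is flattened into one vector whose dimensions
        correspond to its key points; the resulting cluster centers are
        reshaped back into key point sets.
        """
        rng = np.random.default_rng(seed)
        vectors = preset_poses.reshape(len(preset_poses), -1).astype(float)
        centers = vectors[rng.choice(len(vectors), num_anchors, replace=False)]
        for _ in range(iterations):
            # Assign each pose vector to its nearest cluster center.
            labels = np.argmin(
                np.linalg.norm(vectors[:, None] - centers[None], axis=2), axis=1)
            # Move each center to the mean of the pose vectors assigned to it.
            for c in range(num_anchors):
                if np.any(labels == c):
                    centers[c] = vectors[labels == c].mean(axis=0)
        return centers.reshape(num_anchors, -1, 2)  # one key point set per anchor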
Step 402: determine each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.

In this embodiment, the execution body may determine each of the obtained key point sets as an anchor pose. In this way, the positions of the resulting anchor poses are more differentiated. At the same time, this embodiment can cluster the plurality of preset poses to obtain accurate anchor poses, so that during pose detection the deviation between the detected candidate poses and the anchor poses can be reduced.
Step 403: acquire an image in which the poses of objects have been annotated, where the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points.

In this embodiment, the execution body may acquire an image in which the poses of objects have been annotated. In the image, the poses of the objects are marked out. The objects here may be people, human faces, cats, articles, and so on. Specifically, a pose may be represented by the coordinates of its key points.

Step 404: based on the image and the pose annotations, train a convolutional neural network to obtain a trained convolutional neural network. The training process includes step 4041, step 4042, and step 4043, as follows:

Step 4041: input the image into the convolutional neural network, and determine candidate poses of each object based on the previously set anchor poses of the convolutional neural network.

In this embodiment, the execution body may input the acquired image into the convolutional neural network, so that the convolutional neural network obtains candidate poses of each object based on the previously set anchor poses. Specifically, the convolutional neural network includes a region proposal network. The size and position of an anchor pose in the image are fixed.

Step 4042: determine the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and take candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes.

In this embodiment, the execution body may determine the degree of overlap between the candidate box of each candidate pose and the annotation box of the annotated pose. The execution body may then select candidate boxes whose overlap is greater than a preset overlap threshold and take the selected candidate boxes as target candidate boxes.

Step 4043: for each key point within the target candidate boxes corresponding to each annotation box, take the average position of that key point across the target candidate boxes; take the set of average key point positions as one pose detected in the image.

In this embodiment, for each key point within the target candidate boxes corresponding to an annotation box, the execution body may take the average position of that key point across the candidate poses in the target candidate boxes corresponding to that annotation box. The execution body may thereby take the set of average key point positions of the target candidate boxes corresponding to the annotation box as one pose detected in the image.

The anchor poses obtained in this embodiment are more differentiated, which helps keep the number of anchor poses under control while still covering a rich variety of anchor poses. In this way, both the computation speed of the region proposal network can be increased and the deviation between the detected candidate poses and the anchor poses can be kept small. Moreover, this embodiment can cluster the plurality of preset poses to obtain accurate anchor poses, further reducing the deviation between the detected candidate poses and the anchor poses.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an image processing apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied to various electronic devices.

As shown in FIG. 5, the image processing apparatus 500 of this embodiment includes an acquisition unit 501 and a training unit 502. The acquisition unit 501 is configured to acquire an image in which the poses of objects have been annotated, where the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points. The training unit 502 is configured to train a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network, the training process including: inputting the image into the convolutional neural network, and determining candidate poses of each object based on previously set anchor poses of the convolutional neural network; determining the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and taking candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes; for each key point within the target candidate boxes corresponding to each annotation box, taking the average position of that key point across the target candidate boxes; and taking the set of average key point positions as one pose detected in the image.

In some embodiments, the acquisition unit 501 of the image processing apparatus 500 may acquire an image in which the poses of objects have been annotated. In the image, the poses of the objects are marked out. The objects here may be people, human faces, cats, articles, and so on. Specifically, a pose may be represented by the coordinates of its key points. For example, when a person stands versus squats, the distance between the coordinates of the nose-tip key point and the coordinates of the toe key point differs.

In some embodiments, the training unit 502 may input the acquired image into the convolutional neural network, so that the convolutional neural network obtains candidate poses of each object based on the anchor poses previously set in the network. It may then select candidate boxes whose overlap is greater than a preset overlap threshold and take the selected candidate boxes as target candidate boxes. For each key point within the target candidate boxes corresponding to an annotation box, it may also take the average position of that key point across the candidate poses in the target candidate boxes corresponding to that annotation box.

In some optional implementations of this embodiment, the apparatus further includes: a clustering unit configured to cluster a plurality of preset poses in a target image to obtain key point sets; and a determining unit configured to determine each key point set as an anchor pose, where the key points included in different key point sets have different positions in the target image.

In some embodiments, the clustering unit is further configured to: cluster the multi-dimensional vectors corresponding to the preset poses, where the number of dimensions of the multi-dimensional vector corresponding to a preset pose is the same as the number of key points of the preset pose; and form a key point set from the key points of the preset pose corresponding to the multi-dimensional vector of each cluster center.

In some optional implementations of this embodiment, the training unit is further configured to: for each key point within each target candidate box corresponding to each annotation box, in response to determining that the position of the key point is outside the annotation box, take a preset first weight as the weight of that key point in that target candidate box; in response to determining that the position of the key point is inside the annotation box, take a preset second weight as the weight of that key point in that target candidate box, the first preset weight being smaller than the second preset weight; and determine, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.

In some optional implementations of this embodiment, the training unit is further configured to: for each key point within each target candidate box corresponding to each annotation box, determine whether the distance between that key point and the corresponding key point of the annotated pose is less than or equal to a preset distance threshold; and, in response to determining that it is less than or equal to the threshold, determine, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.
Referring now to FIG. 6, a schematic structural diagram of a computer system 600 suitable for implementing an electronic device of an embodiment of the present application is shown. The electronic device shown in FIG. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU and/or GPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The central processing unit 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.

In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network through the communication portion 609 and/or installed from the removable medium 611. When the computer program is executed by the central processing unit 601, the above-described functions defined in the method of the present application are performed. It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, the computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, capable of sending, propagating, or transmitting a program for use by, or in connection with, an instruction execution system, apparatus, or device. Program code contained on the computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, and the like, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutively shown blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit and a training unit. The names of these units do not in some cases limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires an image in which the poses of objects have been annotated".

As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs that, when executed by the apparatus, cause the apparatus to: acquire an image in which the poses of objects have been annotated, where the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points; and train a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network, the training process including: inputting the image into the convolutional neural network, and determining candidate poses of each object based on previously set anchor poses of the convolutional neural network; determining the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and taking candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes; for each key point within the target candidate boxes corresponding to each annotation box, taking the average position of that key point across the target candidate boxes; and taking the set of average key point positions as one pose detected in the image.

The above description is merely a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (12)

  1. An image processing method, comprising:
    acquiring an image in which the poses of objects have been annotated, wherein the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points;
    training a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network, the training process comprising:
    inputting the image into the convolutional neural network, and determining candidate poses of each object based on anchor poses previously set for the convolutional neural network;
    determining the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and taking candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes; and
    for each key point within the target candidate boxes corresponding to each annotation box, taking the average position of that key point across the target candidate boxes, and taking the set of average key point positions as one pose detected in the image.
  2. The method according to claim 1, wherein before inputting the image into the convolutional neural network and determining the candidate poses of each object based on the anchor poses previously set for the convolutional neural network, the method further comprises:
    clustering a plurality of preset poses in a target image to obtain key point sets; and
    determining each key point set as an anchor pose, wherein the key points included in different key point sets have different positions in the target image.
  3. The method according to claim 2, wherein clustering the plurality of preset poses in the target image to obtain the key point sets comprises:
    clustering the multi-dimensional vectors corresponding to the preset poses, wherein the number of dimensions of the multi-dimensional vector corresponding to a preset pose is the same as the number of key points of the preset pose; and
    forming a key point set from the key points of the pose corresponding to the multi-dimensional vector of each cluster center.
  4. The method according to claim 1, wherein taking, for each key point of the target candidate boxes corresponding to each annotation box, the average position of that key point in the candidate poses in the target candidate boxes comprises:
    for each key point within each target candidate box corresponding to each annotation box, in response to determining that the position of the key point is outside the annotation box, taking a preset first weight as the weight of that key point in that target candidate box; in response to determining that the position of the key point is inside the annotation box, taking a preset second weight as the weight of that key point in that target candidate box, the first preset weight being smaller than the second preset weight; and determining, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.
  5. The method according to claim 1, wherein taking, for each key point of the target candidate boxes corresponding to each annotation box, the average position of that key point in the candidate poses in the target candidate boxes comprises:
    for each key point within each target candidate box corresponding to each annotation box, determining whether the distance between that key point and the corresponding key point of the annotated pose is less than or equal to a preset distance threshold; and, in response to determining that it is less than or equal to the threshold, determining, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.
  6. An image processing apparatus, comprising:
    an acquisition unit configured to acquire an image in which the poses of objects have been annotated, wherein the image contains at least two objects, different objects have different poses, and each pose is indicated by a plurality of key points; and
    a training unit configured to train a convolutional neural network based on the image and the pose annotations to obtain a trained convolutional neural network, the training process comprising:
    inputting the image into the convolutional neural network, and determining candidate poses of each object based on anchor poses previously set for the convolutional neural network; determining the degree of overlap between the candidate box of each candidate pose and the annotation box of each annotated pose, and taking candidate boxes whose overlap is greater than a preset overlap threshold as target candidate boxes; for each key point within the target candidate boxes corresponding to each annotation box, taking the average position of that key point across the target candidate boxes; and taking the set of average key point positions as one pose detected in the image.
  7. The apparatus according to claim 6, wherein the apparatus further comprises:
    a clustering unit configured to cluster a plurality of preset poses in a target image to obtain key point sets; and
    a determining unit configured to determine each key point set as an anchor pose, wherein the key points included in different key point sets have different positions in the target image.
  8. The apparatus according to claim 7, wherein the clustering unit is further configured to:
    cluster the multi-dimensional vectors corresponding to the preset poses, wherein the number of dimensions of the multi-dimensional vector corresponding to a preset pose is the same as the number of key points of the preset pose; and
    form a key point set from the key points of the preset pose corresponding to the multi-dimensional vector of each cluster center.
  9. The apparatus according to claim 6, wherein the training unit is further configured to:
    for each key point within each target candidate box corresponding to each annotation box, in response to determining that the position of the key point is outside the annotation box, take a preset first weight as the weight of that key point in that target candidate box; in response to determining that the position of the key point is inside the annotation box, take a preset second weight as the weight of that key point in that target candidate box, the first preset weight being smaller than the second preset weight; and determine, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.
  10. The apparatus according to claim 6, wherein the training unit is further configured to:
    for each key point within each target candidate box corresponding to each annotation box, determine whether the distance between that key point and the corresponding key point of the annotated pose is less than or equal to a preset distance threshold; and, in response to determining that it is less than or equal to the threshold, determine, based on the weights of the key point in the target candidate boxes corresponding to the annotation box, the average position of that key point across the target candidate boxes.
  11. An electronic device, comprising:
    one or more processors; and
    a storage device for storing one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-5.
  12. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-5.
PCT/CN2018/115968 2018-09-29 2018-11-16 Image processing method and apparatus WO2020062493A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811149818.4 2018-09-29
CN201811149818.4A CN109389640A (en) 2018-09-29 2018-09-29 Image processing method and device

Publications (1)

Publication Number Publication Date
WO2020062493A1 true WO2020062493A1 (en) 2020-04-02

Family

ID=65418681

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/115968 WO2020062493A1 (en) 2018-09-29 2018-11-16 Image processing method and apparatus

Country Status (2)

Country Link
CN (1) CN109389640A (en)
WO (1) WO2020062493A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163841B (en) * 2019-04-12 2021-05-14 中科微至智能制造科技江苏股份有限公司 Method, device and equipment for detecting surface defects of object and storage medium
US10885625B2 (en) 2019-05-10 2021-01-05 Advanced New Technologies Co., Ltd. Recognizing damage through image analysis
CN110569703B (en) * 2019-05-10 2020-09-01 阿里巴巴集团控股有限公司 Computer-implemented method and device for identifying damage from picture
CN110378244B (en) * 2019-05-31 2021-12-31 曹凯 Abnormal posture detection method and device
CN112132913A (en) * 2019-06-25 2020-12-25 北京字节跳动网络技术有限公司 Image processing method, image processing apparatus, image processing medium, and electronic device
CN110738125B (en) * 2019-09-19 2023-08-01 平安科技(深圳)有限公司 Method, device and storage medium for selecting detection frame by Mask R-CNN
CN110765942A (en) * 2019-10-23 2020-02-07 睿魔智能科技(深圳)有限公司 Image data labeling method, device, equipment and storage medium
CN111695540B (en) * 2020-06-17 2023-05-30 北京字节跳动网络技术有限公司 Video frame identification method, video frame clipping method, video frame identification device, electronic equipment and medium
CN112907583B (en) * 2021-03-29 2023-04-07 苏州科达科技股份有限公司 Target object posture selection method, image scoring method and model training method
CN112819937B (en) * 2021-04-19 2021-07-06 清华大学 Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment
CN113326901A (en) * 2021-06-30 2021-08-31 北京百度网讯科技有限公司 Image annotation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080187172A1 (en) * 2004-12-02 2008-08-07 Nobuyuki Otsu Tracking Apparatus And Tracking Method
CN106355188A (en) * 2015-07-13 2017-01-25 阿里巴巴集团控股有限公司 Image detection method and device
CN107358149A (en) * 2017-05-27 2017-11-17 深圳市深网视界科技有限公司 A kind of human body attitude detection method and device
CN107463903A (en) * 2017-08-08 2017-12-12 北京小米移动软件有限公司 Face key independent positioning method and device
CN107909005A (en) * 2017-10-26 2018-04-13 西安电子科技大学 Personage's gesture recognition method under monitoring scene based on deep learning
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network

Also Published As

Publication number Publication date
CN109389640A (en) 2019-02-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18935375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.06.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18935375

Country of ref document: EP

Kind code of ref document: A1