CN117021099A - Human-computer interaction method oriented to any object and based on deep learning and image processing - Google Patents
Human-computer interaction method oriented to any object and based on deep learning and image processing
- Publication number
- CN117021099A CN117021099A CN202311059633.5A CN202311059633A CN117021099A CN 117021099 A CN117021099 A CN 117021099A CN 202311059633 A CN202311059633 A CN 202311059633A CN 117021099 A CN117021099 A CN 117021099A
- Authority
- CN
- China
- Prior art keywords
- area
- camera
- image
- human
- coordinates
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000003993 interaction Effects 0.000 title claims abstract description 68
- 238000000034 method Methods 0.000 title claims abstract description 57
- 238000012545 processing Methods 0.000 title claims abstract description 41
- 238000013135 deep learning Methods 0.000 title claims abstract description 18
- 238000006243 chemical reaction Methods 0.000 claims abstract description 22
- 238000003672 processing method Methods 0.000 claims abstract description 4
- 238000001514 detection method Methods 0.000 claims description 42
- 230000002452 interceptive effect Effects 0.000 claims description 42
- 210000000707 wrist Anatomy 0.000 claims description 27
- 238000001914 filtration Methods 0.000 claims description 22
- 230000008569 process Effects 0.000 claims description 21
- 210000004247 hand Anatomy 0.000 claims description 20
- 238000013528 artificial neural network Methods 0.000 claims description 11
- 239000011159 matrix material Substances 0.000 claims description 10
- 230000006870 function Effects 0.000 claims description 8
- 238000012216 screening Methods 0.000 claims description 8
- 230000002146 bilateral effect Effects 0.000 claims description 7
- 230000036544 posture Effects 0.000 claims description 7
- 230000000694 effects Effects 0.000 claims description 5
- 230000001133 acceleration Effects 0.000 claims description 4
- 238000009499 grossing Methods 0.000 claims description 4
- 238000004891 communication Methods 0.000 claims description 3
- 238000000605 extraction Methods 0.000 claims description 3
- 230000008439 repair process Effects 0.000 claims description 3
- 238000013519 translation Methods 0.000 claims description 3
- 210000000078 claw Anatomy 0.000 claims description 2
- 238000003708 edge detection Methods 0.000 abstract description 2
- 238000012549 training Methods 0.000 abstract description 2
- 238000005516 engineering process Methods 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 5
- 238000011161 development Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 210000000245 forearm Anatomy 0.000 description 2
- 238000004519 manufacturing process Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06T7/85—Stereo camera calibration
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/28—Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20024—Filtering details
- G06T2207/20032—Median filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30204—Marker
- G06T2207/30208—Marker matrix
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Geometry (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a human-computer interaction method, oriented to any object, based on deep learning and image processing. The method first detects the body joint coordinates of the interactor with a MediaPipe model and uses the coordinate relations to judge the interactor's state. According to the different states of the interactor, a potential task target area is then determined with either a saliency-and-edge-detection algorithm or a region growing algorithm. The size of the target area is obtained through OpenCV image processing and the depth image information, and the object posture is obtained through pixel-to-world coordinate conversion. Whether the interactor is holding an object is judged from the size and filling degree of the target area, and whether interaction is possible is judged from the position of the interactor's hand, the object size and the size of the robot gripper. Finally, the robot completes the human-computer interaction task for the unknown object according to the obtained object position and posture. The method acquires the coordinates and posture of unknown objects in the scene, so that the robot can face any object and carry out safe, stable and accurate human-computer interaction without training.
Description
Technical Field
The invention discloses a man-machine interaction method oriented to any object and based on deep learning and image processing, and belongs to the technical field of man-machine interaction.
Background
With the continuous development of intelligent robot technology, human-computer interaction plays a significant role in the future field of science and technology. Good interaction between robots and their collaborators can effectively improve production efficiency and product quality. Object detection is one of the most basic problems of computer vision and appears in many aspects of daily life, such as face recognition, intelligent traffic and industrial inspection, providing great convenience and efficient technical support for life and production. Although object detection technology is widely used and performs well, many challenges remain worth studying. Commonly used object detection networks, such as the YOLO series, SSD and Fast RCNN, can determine the location and type of an object fairly accurately. However, these networks can only identify a limited set of object classes; the same object becomes difficult to identify after being rotated by a certain angle, and misidentification easily occurs when the features of different classes (such as color and texture) are similar. Under these circumstances, researchers have applied salient object detection to human-computer interaction, which better solves the above problems and meets the need to determine the position and measure the pose of any unknown object in space during human-computer interaction.
With the continuous development of artificial intelligence technology, human-computer interaction using object detection networks has been applied in more and more scenes, such as intelligent school canteens, access control systems, and garbage detection and classification. When facing these scenes, robots should have a certain flexibility, and in human-computer interaction they sometimes need to recognize and interact with arbitrary objects in unrestricted postures. To identify a given class, existing detection networks require massive amounts of data to reach a recognition accuracy above 95%. In addition, adding a new class to an existing model requires modification of the original network, which is time-consuming and cumbersome. The invention therefore uses a detection network that requires few resources and places no restriction on the recognition area or the recognized object for human-computer interaction, which is of great significance for extending the functionality of existing object recognition technology.
Disclosure of Invention
The invention aims to provide a human-computer interaction method, oriented to any object, based on deep learning and image processing, so that a robot can perform safe, stable and accurate interaction with any object without training. The method detects the body joint coordinates of the interactor with a MediaPipe model and judges the interactor's state from these coordinates. Using the different state information of the interactor, a potential task target area is acquired with either a saliency detection, edge detection and screening algorithm, or an improved region growing algorithm. The pixel size and rotation angle of the target area image are then obtained through OpenCV image processing. Whether the interactor is holding an object is judged from the area and filling degree of the target area. The interactivity and the ideal interaction position are then determined from the hand position information and the target area information, and the rotation angle of the target object area is obtained from the depth image. Finally, the robot completes the interaction task for the unknown object according to the obtained object position and pose, without restriction.
The above aims of the invention are achieved by the following technical scheme:
a man-machine interaction method based on deep learning and image processing for any object comprises the following specific steps:
step S10, obtaining a color image matched with the depth image according to the conversion relation between the color and the world coordinates of the depth camera, and storing the matching relation to accelerate the conversion speed;
step S20, filtering and repairing the depth image;
step S30, obtaining pixel coordinates of the human skeleton joints of the interactor by using a MediaPipe human skeleton joint recognition model, and judging the posture of the human body relative to the camera, sideways or directly facing, according to the detected coordinates and the distance information between the joints;
step S40, when the detection result is that the interactor is sideways to the camera, detecting the human-computer interaction scene in real time by using a saliency detection neural network to obtain a real-time human-computer interaction saliency area, and filtering the network output result to optimize it;
step S50, when the detection result is that the interactor is sideways to the camera, carrying out contour screening on the output of the saliency detection network according to the detection result of the MediaPipe model, and screening the midpoints of specific outer contours to obtain the outer contour of the object area, judge the interaction intention and collect the potential object and human hand areas;
step S60, when the detection result is that the interactor is facing the camera, collecting the object area from the depth image by using a guided region growing algorithm;
step S70, according to the hand and the object area, obtaining the minimum circumscribed rectangle of the area by using an OpenCV image processing method, obtaining the object area, and obtaining the image rotation angle, the pixel width and the pixel height;
step S80, determining an area threshold and a filling degree threshold according to the hand information, and calculating corresponding data to judge whether the person holds the object in the hand of the interactor;
step S90, when the interactor is judged to be holding an object, judging the palm area range from the position of the interacting hand, judging whether human-computer interaction can be completed, determining the ideal clamping position of the manipulator during interaction, and converting the determined key points into the world coordinate system to obtain the rotation angles of the object about the X and Y axes of the world coordinate system;
and step S100, transmitting pose information of the object and the gripper gripping position to a robot control end through a vision processing end to guide the robot to complete an unknown object interaction task.
Preferably, the specific process of step S10 is as follows:
step S101, a Kinect v2 camera is arranged on one side of a working platform, and a proper placement position is selected by observing the formed image, so that the camera can shoot a human body;
step S102, calibrating the Kinect v2 camera by the Zhang Zhengyou calibration method to obtain the color camera intrinsic matrix K_c and its extrinsic parameters;
step S103, calling the official Kinect library function to obtain the depth camera intrinsic parameters K_d, and obtaining the depth camera extrinsic parameters from the color camera extrinsic parameters according to the hardware positional relation between the depth camera and the color camera;
step S104, obtaining the conversion relation between the color camera and the three-dimensional world coordinate system from the color camera intrinsic and extrinsic parameters, obtaining the conversion relation between the depth camera and the three-dimensional world coordinate system likewise, and, taking the three-dimensional world coordinate system as fixed, deriving the coordinate system conversion matrix between the depth camera and the color camera;
step S105, the multiplied result of the pixel coordinates and the conversion matrix is recorded and stored, self-adaptive translation is carried out according to the actual matching effect to repair errors, matching acceleration is carried out by using the function of the Numba library jit, and a color image matched with the depth image is circularly obtained.
Preferably, the specific process of step S20 is as follows:
step S201, filtering invalid depth points of the depth image, and then carrying out joint bilateral filtering;
step S202, median filtering is carried out on the combined bilateral filtering result;
step S203, performing an image open operation on the median filtering result.
Preferably, the specific process of step S30 is as follows:
step S301, inputting the mapped color image into the MediaPipe model for posture detection to obtain the pixel coordinates of the wrists of both hands, the shoulders and other joints;
step S302, obtaining the distance between the two shoulders from the obtained joint pixel coordinates, and judging the orientation of the human body posture relative to the camera: sideways or directly facing the camera;
step S303, when the interactor is sideways to the camera, the salient region extraction and image processing algorithm is subsequently applied; when the interactor is facing the camera, the improved region growing algorithm is subsequently applied.
Preferably, the specific process of step S40 is as follows:
step S401, inputting the converted small color image and the depth image after the restoration processing into a saliency detection neural network to obtain a neural network output;
step S402, discarding the region with lower confidence in the output result, selecting a proper threshold value to perform binarization processing on the network output result;
step S403, performing open operation on the image obtained after the binary processing, smoothing and denoising the image, and obtaining a reliable significance result.
Preferably, the specific process of step S50 is as follows:
step S501, performing contour detection on the processed saliency binary image to obtain all contours and contour points of all contours;
step S502, judging potential interactive hands according to the wrist coordinates of the two hands, and judging whether the interactive intention exists or not according to the distance between the potential interactive hands and the chest;
step S503, the wrist position coordinates of the interactive hand are taken as base points, the direction is judged according to the position relation between the body of the interactor and the arm of the interactor, and contour nodes on the opposite sides of the base points in all contours are screened out;
step S504, the pixel coordinates of the midpoints of the rest outlines are obtained, the distances from the midpoints to the interactive wrist are compared, and the area where the object is located is determined.
Preferably, the specific process of step S60 is as follows:
step S601, when the detection result is that the interactor is facing the camera, comparing the distances from the two wrists to the robot arm to select the potential interacting hand;
step S602, comparing the distance between the interactive hand and the chest, and analyzing whether the interactive hand has interactive intention;
step S603, for the hand with interaction intention, selecting a base point according to the pixel coordinates of the interacting hand, growing the depth image from the base point with a guided region growing algorithm, and collecting the target object area.
Preferably, the specific process of step S70 is as follows:
step S701, using the previously obtained object contour or object area, using an OpenCV image processing algorithm to obtain an area minimum external rectangle as an Anchor box of the object, and simultaneously obtaining the width, height, rotation angle and center coordinates of the rectangle;
step S702, according to the size information of the circumscribed rectangle and the rotation angle of the rectangular image, obtaining the included angle between the longer side of the rectangle and the vertical direction;
in step S703, a rectangular area is intercepted and stored, and pixel values of three channels of pixel points in the object area are all set to 0.
Preferably, the specific process of step S80 is as follows:
step S801, calculating the pixel area of the palm of the interactor according to the human body proportion so as to formulate an area threshold;
step S802, calculating the ratio of the number of pixels to the area over multiple trials in which the interactor's palm faces the camera and is stretched to the maximum extent, taking the average value as the filling degree threshold and recording it;
step S803, calculating the circumscribed-rectangle pixel area and the filling degree at the time of interaction, the rectangle pixel area being calculated from the width and height of the circumscribed rectangle;
step S804, comparing the real area and the filling degree with the corresponding threshold values, and judging the possibility of holding the object by the interactors.
Preferably, the specific process of step S90 is as follows:
step S901, judging the palm area range by using the wrist position and the palm size;
step S902, judging the upper and lower positions of the interactive hand-held object according to the position relation between the palm center position and the rectangular center;
step S903, determining the midpoint of the boundary between the palm area and the interactable area according to the coordinates of the palm center and the coordinates of the center of the object area;
step S904, determining the coordinates of the midpoint of the boundary of the area suitable for grabbing by the mechanical claw according to the included angle and the position of the object to be grabbed;
step S905, according to the rectangular area and the depth image information, obtaining a point which is close to the corresponding boundary and in the object;
step S906, converting the two points into a three-dimensional world coordinate system, calculating the distance between the two points, comparing the distance with the size of the manipulator paw, and judging whether an interaction task can be performed;
in step S907, the two points converted into the world coordinate system are projected onto the YOZ and XOZ planes respectively, and the rotation angles of the object about the X and Y axes are calculated.
Preferably, the specific process of step S100 is as follows:
step S1001, transmitting the object pose and the clamping position acquired by the vision processing end to the robot control end in a TCP/IP communication mode
Step S1002, after receiving the information, the robot control end moves to a designated position in a designated gesture, and the robot finishes the task of grabbing or following the movement according to the interactivity information.
Compared with the prior art, the invention has the following beneficial effects:
1. At present, most robot human-computer interaction systems can only complete sorting tasks for a limited number of target object categories; the recognition effect is affected when the object is partly occluded by the palm or rotated by a certain angle, and adding recognizable new categories is cumbersome. The method of the invention enables the robot to complete the identification, positioning and interactive operation of any unrecognized object.
2. The human-computer interaction method provided by the invention can not only locate the object in the scene, but also analyze the rotation angles of the object about the X and Y axes more reliably and at low cost.
Drawings
Fig. 1 is a general framework diagram of a robot unknown object interaction method of the present invention.
Fig. 2 is a schematic diagram of the coordinate system of the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and fully with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1, a first embodiment of the present invention provides a man-machine interaction method based on deep learning and image processing for any object based on the prior art, which includes the following steps:
s10, obtaining a color image matched with the depth image according to the conversion relation between the color and the world coordinates of the depth camera, and storing the matching relation to accelerate the matching speed. The method comprises the following specific steps:
step S101, a Kinect v2 camera is arranged on one side of a working platform, and a proper placement position is selected by observing the formed image, so that the camera can be ensured to shoot a human body.
Step S102, calibrating the Kinect v2 camera by the Zhang Zhengyou calibration method to obtain the color camera intrinsic matrix K_c and its extrinsic parameters.
Step S103, calling the official Kinect library function to obtain the depth camera intrinsic parameters K_d, and obtaining the depth camera extrinsic parameters from the color camera extrinsic parameters according to the hardware positional relation between the depth camera and the color camera.
Step S104, obtain the conversion relation between the color camera and the three-dimensional world coordinate system from the color camera intrinsic and extrinsic parameters, obtain the conversion relation between the depth camera and the three-dimensional world coordinate system likewise, and, taking the three-dimensional world coordinate system as fixed, derive the coordinate system conversion matrix between the depth camera and the color camera. Let a point in the pixel coordinate system be P_p = (u, v, 1)^T and the corresponding point in the world coordinate system be P = (X, Y, Z, 1)^T; the conversion relations between them for the color camera and the depth camera are obtained as formulas (1) and (2).
Since the depth camera lies in the same plane as the color camera, it can be assumed that d_c = d_d.
The conversion matrix T_dc from the depth image to the color image for a point under the same world coordinate system can be obtained from formulas (1) and (2), giving formula (3); the conversion of a point on the depth image into the color image is then:
P_p_c = T_dc * P_p_d    (4)
Step S105, the results of multiplying the pixel coordinates by the conversion matrix are recorded and stored, adaptive translation is applied according to the actual matching effect to repair errors, matching is accelerated using the jit function of the Numba library, and a color image matched with the depth image is obtained in a loop. With this acceleration, a higher frame rate and better real-time performance are achieved.
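By way of illustration, the following is a minimal Python sketch of the stored pixel mapping and Numba acceleration described in step S105. The 3×3 pixel-coordinate conversion matrix T_dc, the offsets dx and dy, and the function name are assumptions for illustration and are not taken from the disclosure.

```python
import numpy as np
from numba import jit

@jit(nopython=True, cache=True)
def map_depth_to_color(depth_img, color_img, T_dc, dx, dy):
    # T_dc: assumed 3x3 pixel-coordinate conversion matrix obtained from the
    # calibration of steps S102-S104; dx, dy: adaptive translation offsets
    # used to repair residual matching errors (step S105).
    h, w = depth_img.shape
    aligned = np.zeros((h, w, 3), dtype=np.uint8)
    for v in range(h):
        for u in range(w):
            x = T_dc[0, 0] * u + T_dc[0, 1] * v + T_dc[0, 2]
            y = T_dc[1, 0] * u + T_dc[1, 1] * v + T_dc[1, 2]
            s = T_dc[2, 0] * u + T_dc[2, 1] * v + T_dc[2, 2]
            uc = int(x / s) + dx
            vc = int(y / s) + dy
            if 0 <= uc < color_img.shape[1] and 0 <= vc < color_img.shape[0]:
                for c in range(3):
                    aligned[v, u, c] = color_img[vc, uc, c]
    return aligned
```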
Step S20, filtering and repairing the depth image, wherein the specific steps are as follows:
Step S201, filtering the invalid depth points of the depth image and then carrying out joint bilateral filtering. A reliable depth range can be selected according to the practical application; depth image pixels outside this range are set to zero before the joint bilateral filtering is applied.
Step S202, median filtering is carried out on the joint bilateral filtering result.
Step S203, an image opening operation is performed on the median filtering result to smooth the boundary and complete the restoration of the depth image.
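A minimal OpenCV sketch of the repair pipeline in steps S201-S203 is given below, assuming a Kinect v2 depth image in millimetres; the reliable depth range, filter parameters and kernel size are illustrative values, and jointBilateralFilter requires the opencv-contrib package.

```python
import cv2
import numpy as np

def repair_depth(depth_raw, color_gray, z_min=500, z_max=4500):
    # Step S201: zero out depth values outside an assumed reliable range,
    # then apply joint bilateral filtering guided by the gray color image.
    depth = np.where((depth_raw > z_min) & (depth_raw < z_max), depth_raw, 0)
    depth = depth.astype(np.float32)
    guide = color_gray.astype(np.float32)
    depth = cv2.ximgproc.jointBilateralFilter(guide, depth, 7, 25.0, 7.0)
    # Step S202: median filtering of the joint bilateral filtering result.
    depth = cv2.medianBlur(depth, 5)
    # Step S203: opening operation to smooth the boundary.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(depth, cv2.MORPH_OPEN, kernel)
```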
Step S30, obtaining the pixel coordinates of the human skeleton joints of the interactor by using the MediaPipe human skeleton joint recognition model, and judging the posture of the human body relative to the camera, sideways or directly facing, according to the detected coordinates and the distance information between the joints.
The method comprises the following specific steps:
Step S301, inputting the mapped color image into the MediaPipe model for posture detection to obtain the pixel coordinates of the wrists of both hands, the shoulders and other joints.
Step S302, the distance between the two shoulders is obtained from the obtained joint pixel coordinates, and the orientation of the human body posture relative to the camera, sideways or directly facing, is judged. The shoulder width of the interactor in the image is obtained from the pixel coordinates and depth values of the two shoulders, the height Z from the shoulders to the ground is obtained according to formula (5), and the height H of the interactor is calculated according to formula (6). The shoulder rotation angle of the interactor is judged from the theoretical shoulder width and the actually calculated shoulder width, and the posture of the interactor relative to the camera is judged accordingly.
H = Z / 0.818    (6)
Step S303, when the interactor is sideways to the camera, the salient region extraction and image processing algorithm is subsequently applied; when the interactor is facing the camera, the improved region growing algorithm is subsequently applied.
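The following Python sketch illustrates steps S301-S303 with the MediaPipe Pose solution; the theoretical shoulder width in pixels and the side_ratio threshold are assumed inputs used only to show the sideways/facing decision.

```python
import math
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def detect_orientation(color_bgr, theoretical_shoulder_px, side_ratio=0.6):
    # theoretical_shoulder_px: expected shoulder width in pixels, derived
    # from the interactor's height and depth (formulas (5)-(6)); side_ratio
    # is an illustrative threshold for the sideways decision.
    h, w = color_bgr.shape[:2]
    with mp_pose.Pose(static_image_mode=True) as pose:
        res = pose.process(cv2.cvtColor(color_bgr, cv2.COLOR_BGR2RGB))
    if res.pose_landmarks is None:
        return None
    lm = res.pose_landmarks.landmark
    def px(i):
        return int(lm[i].x * w), int(lm[i].y * h)
    joints = {
        "left_shoulder": px(mp_pose.PoseLandmark.LEFT_SHOULDER),
        "right_shoulder": px(mp_pose.PoseLandmark.RIGHT_SHOULDER),
        "left_wrist": px(mp_pose.PoseLandmark.LEFT_WRIST),
        "right_wrist": px(mp_pose.PoseLandmark.RIGHT_WRIST),
    }
    observed = math.dist(joints["left_shoulder"], joints["right_shoulder"])
    facing = "front" if observed >= side_ratio * theoretical_shoulder_px else "side"
    return joints, facing
```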
Step S40, when the detection result is that the interactor is sideways to the camera, the human-computer interaction scene is detected in real time with the saliency detection neural network to obtain the real-time human-computer interaction saliency area, and the network output is filtered to optimize the result. The method comprises the following specific steps:
and S401, inputting the converted small color image and the depth image subjected to restoration processing into a saliency detection neural network to obtain a neural network output. And detecting the human-computer interaction scene by using the saliency detection neural network, wherein the obtained network output result comprises a human body, an object region and other interference regions.
Step S402, discarding the region with lower confidence in the output result, selecting a proper threshold value, and performing binarization processing on the network output result. The saliency detection neural network outputs a gray scale map with the same size as the input gray scale map, and the gray scale value of each pixel is the judgment of whether the pixel belongs to a saliency region or not by the network, so that a proper threshold value is selected, and regions with low confidence coefficient are discarded, and are not considered to belong to the saliency region.
Step S403, an opening operation is performed on the binarized image to smooth and denoise it and obtain a reliable saliency result. The opening of an image A by a structuring element B is the erosion of A by B followed by a dilation by B.
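A short sketch of the post-processing of steps S402-S403 (confidence thresholding followed by an opening operation); the threshold value and kernel size are illustrative.

```python
import cv2

def refine_saliency(saliency_gray, conf_thresh=100):
    # Step S402: discard low-confidence pixels of the 0-255 saliency map.
    _, binary = cv2.threshold(saliency_gray, conf_thresh, 255, cv2.THRESH_BINARY)
    # Step S403: opening smooths the boundary and removes small noise blobs.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
```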
Step S50, when the detection result is that the interactor is sideways to the camera, contour screening is carried out on the output of the saliency detection network according to the detection result of the MediaPipe model, and the midpoints of specific outer contours are screened to obtain the outer contour of the object area, judge the interaction intention and collect the potential object and human hand areas. The method comprises the following specific steps:
step S501, performing contour detection on the processed saliency binary image to obtain all contours and contour points of each contour.
Step S502, judging potential interactive hands according to the wrist coordinates of the two hands, and judging whether the interactive intention exists or not according to the distance between the potential interactive hands and the chest. And converting the two-hand coordinates into a world coordinate system by using the two-hand pixel coordinates and the corresponding depth values, and selecting the hand which is closer to the mechanical arm in the space as the potential interaction hand. The midpoint of the two shoulders is selected as the chest position, and the distance from the interactive hand to the chest is calculated. And selecting a threshold value according to the calculated height of the interactor to judge whether the distance between the wrist and the chest is greater than the threshold value, and considering that the interactor has the interaction intention if the distance is greater than the threshold value.
In step S503, the wrist position coordinates of the interacting hand are used as the base point, the direction is determined according to the positional relationship between the interactor's body and arm, and the contour nodes on the opposite side of the base point in each contour are removed. The target interaction direction of the interactor is judged from the positional relationship between the interactor's body and arm, and all contour nodes lying towards and behind the interactor's body are removed, thereby eliminating the interactor's body area and part of the environment area.
Step S504, the pixel coordinates of the midpoints of the remaining contours are obtained, the distances from the midpoints to the interacting wrist are compared, and the area where the object is located is determined. The contour whose midpoint has the smallest distance is the contour of the object area, so the pixel coordinate of the midpoint of each closed contour is calculated, and the nearest contour distance is D_c = min(D_1, D_2, ..., D_n).
Let the wrist base point coordinates be (u_h, v_h) and the midpoint of the i-th contour be (u_i, v_i); the pixel distance is D_i = sqrt((u_i - u_h)^2 + (v_i - v_h)^2).
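The contour screening and nearest-midpoint selection of steps S501-S504 could look like the sketch below; keep_side is an assumed callable implementing the direction test of step S503.

```python
import cv2
import numpy as np

def nearest_object_contour(saliency_binary, wrist_px, keep_side):
    # Step S501: detect all closed contours of the saliency binary image.
    contours, _ = cv2.findContours(saliency_binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    best, best_dist = None, float("inf")
    for cnt in contours:
        # Step S503: keep only contour points on the object side of the wrist.
        pts = np.array([p for p in cnt.reshape(-1, 2) if keep_side(p)])
        if len(pts) == 0:
            continue
        # Step S504: compare the midpoint distance D_i to the wrist base point.
        mid = pts.mean(axis=0)
        dist = float(np.linalg.norm(mid - np.asarray(wrist_px)))
        if dist < best_dist:
            best, best_dist = cnt, dist
    return best
```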
Step S60, when the detection result is that the interactor is facing the camera, the object area is collected from the depth image by using a guided region growing algorithm. The method comprises the following specific steps:
In step S601, when the detection result is that the interactor is facing the camera, the distances from the two wrists to the robot arm are compared to select the potential interacting hand. The positions of the two hands in the world coordinate system are obtained from their pixel coordinates and the corresponding depth values, and the hand closer to the robot arm is selected as the potential interacting hand.
Step S602, comparing the distance between the interactive hand and the chest, and analyzing whether the interactive hand has interactive intention. And selecting a proper threshold value according to the height of the person to compare so as to judge whether the interaction intention exists.
Step S603, for the hand with interaction intention, a base point is selected according to the pixel coordinates of the interacting hand, the depth image is grown from the base point with the guided region growing algorithm, and the target object area is collected.
A common region growing algorithm expands to those of the eight neighbouring points around a pixel that satisfy the constraint condition, and repeats this process continuously. Letting pixels that satisfy the depth difference expand freely in the depth image in this common way causes the finally grown target object area to contain part of the interactor's torso, leading to area misjudgment, large size errors, distorted calculation of the rotation angle, inaccurate positioning and similar problems. Therefore, exploiting the fact that, during interaction, the direction from the wrist to the object is approximately the same as the direction from the wrist to the palm, the wrist and palm coordinates detected by the MediaPipe model are used to judge the imaging angle of the palm and wrist of the interacting hand, and hence the direction of the interacting hand, so as to determine the direction of the object relative to the wrist. The direction of the interactor's hand is used to guide the growth of the depth image pixels so that they grow only in that direction and adjacent directions, preventing the body from being absorbed into the object area and yielding a more accurate object area.
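A minimal sketch of a guided region growing pass of the kind described above; the seed point, the wrist-to-palm direction grow_dir, and the depth and angle tolerances are assumed inputs, not values taken from the disclosure.

```python
from collections import deque
import numpy as np

def guided_region_grow(depth, seed, grow_dir, depth_tol=15, angle_tol_deg=60):
    # seed: (u, v) pixel chosen from the interacting hand; grow_dir: unit
    # vector (in pixel coordinates) from wrist to palm; depth_tol (mm) and
    # angle_tol_deg are illustrative tolerances.
    h, w = depth.shape
    cos_tol = np.cos(np.deg2rad(angle_tol_deg))
    visited = np.zeros((h, w), dtype=np.bool_)
    region = np.zeros((h, w), dtype=np.uint8)
    q = deque([seed])
    visited[seed[1], seed[0]] = True
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    while q:
        u, v = q.popleft()
        region[v, u] = 255
        for dv, du in neighbors:
            nu, nv = u + du, v + dv
            if not (0 <= nu < w and 0 <= nv < h) or visited[nv, nu]:
                continue
            # Depth constraint of the ordinary growing algorithm.
            if abs(int(depth[nv, nu]) - int(depth[v, u])) > depth_tol:
                continue
            # Directional constraint: only grow towards (and near) grow_dir,
            # measured from the seed, so the torso is not absorbed.
            offset = np.array([nu - seed[0], nv - seed[1]], dtype=np.float64)
            n = np.linalg.norm(offset)
            if n > 0 and float(np.dot(offset / n, grow_dir)) < cos_tol:
                continue
            visited[nv, nu] = True
            q.append((nu, nv))
    return region
```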
Step S70, according to the hand and the object area, the minimum circumscribed rectangle of the area is obtained by using an OpenCV image processing method, and the object area is obtained, and the image rotation angle, the pixel width and the pixel height are obtained. The method comprises the following specific steps:
step S701, using the previously obtained object contour or object region, using OpenCV image processing algorithm to obtain the region minimum external rectangle as the anchor box of the object, and simultaneously obtaining the width, height, rotation angle and center coordinates of the rectangle.
Step S702, the included angle between the longer side of the rectangle and the vertical direction is obtained from the size information of the circumscribed rectangle and the rotation angle of the rectangle. The inclination direction of the object is judged from the width w and height h of the circumscribed rectangle; when the object is inclined to the left, the returned angle is subtracted from 90 degrees, and the resulting angle is the included angle between the object and the vertical direction of the image.
in step S703, a rectangular area is intercepted and stored, and pixel values of three channels of pixel points in the object area are all set to 0.
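A sketch of the minimum circumscribed rectangle extraction of steps S701-S702 with OpenCV is given below; OpenCV's minAreaRect angle convention changed between versions, so the conversion to the angle with the vertical direction assumes the pre-4.5 range (-90, 0] and is illustrative only.

```python
import cv2

def object_anchor_box(object_mask):
    # object_mask: binary image of the collected object region (step S50/S60).
    contours, _ = cv2.findContours(object_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    rect = cv2.minAreaRect(max(contours, key=cv2.contourArea))
    (cx, cy), (w, h), angle = rect
    # Included angle between the longer rectangle side and the image vertical,
    # following the w/h comparison of step S702 (assumed angle convention).
    theta = abs(angle) if h >= w else 90.0 - abs(angle)
    box = cv2.boxPoints(rect).astype(int)   # four corner points of the box
    return (cx, cy), (w, h), theta, box
```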
And S80, determining an area threshold and a filling degree threshold according to the hand information, and calculating corresponding data to judge whether the person holds the object in the hand of the interactor. The method comprises the following specific steps:
Step S801, the pixel area of the interactor's palm is calculated according to human body proportions to formulate the area thresholds. When fully extended, the width of the human palm is approximately equal to its length, the length of the palm is approximately 60%-80% of the length of the forearm, and the length of the palm is approximately twice the width of the palm. Two area thresholds can therefore be calculated. Let the pixel coordinates of the wrist and elbow of the interacting hand be (x_w, y_w) and (x_e, y_e); the forearm pixel length is L_arm = sqrt((x_w - x_e)^2 + (y_w - y_e)^2), and the two area thresholds are computed from L_arm using the above proportions.
Step S802, the ratio of the number of pixels to the area is calculated over multiple trials in which the interactor's palm faces the camera and is stretched to the maximum extent, and the average value is taken as the filling degree threshold and recorded. As in the previous step, with the hand facing the camera, stretched as far as possible and free from occlusion, the ratio of the effective area to the circumscribed rectangle area at maximum palm extension is measured several times with the guided region growing algorithm, and the average is recorded as the filling degree threshold.
In step S803, the area of the circumscribed rectangle pixel at the time of interaction and the degree of filling are calculated. And calculating the rectangular pixel area according to the width and the height of the circumscribed rectangle. According to the stored object area, the number of pixels with three channel pixel points of 0 in the previously stored rectangular area is obtained by using an OpenCV image processing algorithm and is used as the effective area of the object area, and the ratio of the value to the rectangular area is calculated to be used as the real filling degree.
Step S804, the real area and the filling degree are compared with the corresponding thresholds to judge the possibility that the interactor is holding an object. When the circumscribed rectangle area is larger than the first threshold, the interactor is considered certain to be holding an object; when it is smaller than the second threshold, the possibility of holding an object is considered low. When the circumscribed rectangle area is smaller than the first threshold but larger than the second threshold, an auxiliary judgment is made with the filling degree threshold: if the real filling degree is larger than the filling degree threshold, the interactor is considered to be holding an object. This avoids missed and false judgments when interacting with small articles.
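The two-threshold decision of steps S801-S804 can be summarised as the short sketch below; the threshold values themselves come from the palm proportions and measurements described above.

```python
def is_holding_object(rect_w, rect_h, filled_pixels,
                      area_thresh_hi, area_thresh_lo, fill_thresh):
    # filled_pixels: number of zeroed object pixels inside the stored
    # rectangle (step S803); the three thresholds come from steps S801-S802.
    rect_area = rect_w * rect_h
    fill = filled_pixels / rect_area if rect_area > 0 else 0.0
    if rect_area > area_thresh_hi:
        return True        # large region: certainly holding an object
    if rect_area <= area_thresh_lo:
        return False       # small region: unlikely to be holding an object
    return fill > fill_thresh   # ambiguous area: decide by filling degree
```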
Step S90, when the interactors are judged to hold things, the interaction hand position is used for judging the palm area range, judging whether man-machine interaction can be completed or not, determining the ideal clamping position of the mechanical hand during interaction, and converting the determined key points into a world coordinate system to obtain the rotation angle of the object around the X, Y axis of the world coordinate system. The method comprises the following specific steps:
in step S901, the palm area range is determined using the wrist position and the palm size. The angle of inclination of the interactive hand and the object held is considered to be approximately the same, so that the palm area can be determined by the angle between the circumscribed rectangle in the image and the vertical direction. The pixel width of the palm is approximately twice the pixel length from the wrist to the palm, and in the rectangular region, the region having the starting point of the palm and the vertical length of one half of the value is the palm region.
Step S902, judging the up-down position of the interactive hand-held object according to the position relation between the palm center position and the rectangular center. The interactive area which is more suitable for the gripper to grip can be judged through the upper and lower gripping position relationship, and the palm area and the interactive area are on opposite sides.
In step S903, the midpoint on the boundary between the palm region and the interactable region is determined from the coordinates of the palm center and of the center of the object region. Let the center coordinate of the circumscribed rectangle be (x_box_c, y_box_c), the palm center coordinate be (x_h, y_h), the midpoint coordinate on the adjacent boundary be (x_area_c, y_area_c), the included angle between the circumscribed rectangle and the vertical direction be θ, the palm pixel width be w_h, the length from the rectangle center to the palm region be h, the rectangle height be H_box, and the length of the interactable region be h + H_box/2. Taking as an example the case where the lower part of the object is held by the hand and the object in the image is inclined to the right, the length h from the rectangle center to the palm region is calculated as:
h = (y_h - y_box_c) / cos(θ) - w_h / 2    (11)
From this, the midpoint coordinates on the adjacent boundary can be calculated according to formula (12).
in step S904, coordinates of a midpoint of a boundary of the area suitable for gripper gripping are determined according to the included angle and the position of the object to be gripped by the hand. If the object is held by hand, the midpoint of the lower boundary of the rectangle is taken, and otherwise, the midpoint of the upper boundary of the rectangle is taken.
In step S905, a point in the object near the corresponding boundary is obtained based on the rectangular region and the depth image information. The pixel points on the frame are outside the object area, and the world coordinates of the pixel points are greatly different from the world coordinates of the object, so that iteration is required to be carried out in the image according to the included angle of the rectangular area until the difference between the depth value of the pixel points and the average depth value of the object is small, and the found pixel points can be judged to be in the object area.
Step S906, the two points are converted into the three-dimensional world coordinate system, the distance between them is calculated and compared with the size of the robot gripper, and whether the interaction task can be performed is judged. The size of the robot's interaction gripper is measured and a certain buffer distance is reserved. If the length of the interactable area is greater than this dimension, normal interaction is possible; otherwise interference will occur.
In step S907, the two points converted into the world coordinate system are projected onto the YOZ and XOZ planes respectively, and the rotation angles of the object about the X and Y axes are calculated. The two points are converted from pixel coordinates into the world coordinate system; the point with the relatively small Z coordinate is denoted (X_1, Y_1, Z_1) and the point with the relatively large Z coordinate is denoted (X_2, Y_2, Z_2). The lengths of the vector projected onto the YOZ and XOZ planes are calculated by formula (13), and the rotation angles of the object about the X and Y axes are calculated by formula (14).
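By way of illustration, the projections and tilt angles of step S907 could be computed as in the sketch below; the choice of the world Z axis as the reference direction for the two angles is an assumption and not taken from formulas (13)-(14).

```python
import math

def object_tilt_angles(p1, p2):
    # p1, p2: the two boundary points in world coordinates from step S906
    # (p1 assumed to have the smaller Z). The tilt is measured against the
    # world Z axis, which is an illustrative assumption.
    dx, dy, dz = p2[0] - p1[0], p2[1] - p1[1], p2[2] - p1[2]
    rot_x = math.degrees(math.atan2(dy, dz))  # projection onto the YOZ plane
    rot_y = math.degrees(math.atan2(dx, dz))  # projection onto the XOZ plane
    return rot_x, rot_y
```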
and step S100, transmitting pose information of the object and the gripper gripping position to a robot control end through a vision processing end to guide the robot to complete an unknown object interaction task. The method comprises the following specific steps:
in step S1001, the object pose and the clamping position acquired by the vision processing end are transmitted to the robot control end by means of TCP/IP communication.
Step S1002, after receiving the information, the robot control end moves to a designated position in a designated gesture, and the robot finishes the task of grabbing or following the movement according to the interactivity information.
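A minimal sketch of the TCP/IP transfer of step S1001; the host address, port and JSON field names are assumptions, since the disclosure only specifies TCP/IP communication between the vision processing end and the robot control end.

```python
import json
import socket

def send_pose(host, port, pose_msg):
    # host/port and the JSON field names below are assumptions; the patent
    # only specifies TCP/IP communication between the vision processing end
    # and the robot control end.
    data = json.dumps(pose_msg).encode("utf-8")
    with socket.create_connection((host, port)) as sock:
        sock.sendall(data)

# Example payload (illustrative field names):
# send_pose("192.168.1.10", 5005, {
#     "grip_point": [0.42, -0.11, 0.35],  # ideal clamping position, metres
#     "rot_x": 12.5, "rot_y": -4.0,       # tilt about world X and Y, degrees
#     "interactable": True,
# })
```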
Although embodiments of the invention have been disclosed above, they are not limited to the uses listed in the specification and the embodiments, and can be applied to various fields suitable for the invention. Additional modifications will readily occur to those skilled in the art. Therefore, without departing from the general concepts defined by the claims and their equivalents, the invention is not limited to the specific details and the illustrations shown and described herein.
Claims (11)
1. A man-machine interaction method for any object based on deep learning and image processing is characterized by comprising the following specific steps:
step S10, obtaining a color image matched with the depth image according to the conversion relation between the color and the world coordinates of the depth camera, and storing the matching relation to accelerate the conversion speed;
step S20, filtering and repairing the depth image;
step S30, obtaining pixel coordinates of the human skeleton joints of the interactor by using a MediaPipe human skeleton joint recognition model, and judging the posture of the human body relative to the camera, sideways or directly facing, according to the detected coordinates and the distance information between the joints;
step S40, when the detection result is that the interactor is sideways to the camera, detecting the human-computer interaction scene in real time by using a saliency detection neural network to obtain a real-time human-computer interaction saliency area, and filtering the network output result to optimize it;
step S50, when the detection result is that the interactor is sideways to the camera, carrying out contour screening on the output of the saliency detection network according to the detection result of the MediaPipe model, and screening the midpoints of specific outer contours to obtain the outer contour of the object area, judge the interaction intention and collect the potential object and human hand areas;
step S60, when the detection result is that the interactor is facing the camera, collecting the object area from the depth image by using a guided region growing algorithm;
step S70, according to the hand and the object area, obtaining the minimum circumscribed rectangle of the area by using an OpenCV image processing method, obtaining the object area, and obtaining the image rotation angle, the pixel width and the pixel height;
step S80, determining an area threshold and a filling degree threshold according to the hand information, and calculating corresponding data to judge whether the person holds the object in the hand of the interactor;
step S90, when the interactor is judged to be holding an object, judging the palm area range from the position of the interacting hand, judging whether human-computer interaction can be completed, determining the ideal clamping position of the manipulator during interaction, and converting the determined key points into the world coordinate system to obtain the rotation angles of the object about the X and Y axes of the world coordinate system;
and step S100, transmitting pose information of the object and the gripper gripping position to a robot control end through a vision processing end to guide the robot to complete an unknown object interaction task.
2. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S10 is as follows:
step S101, a Kinect v2 camera is arranged on one side of a working platform, and a proper placement position is selected by observing the formed image, so that the camera can shoot a human body;
step S102, calibrating the Kinect v2 camera by the Zhang Zhengyou calibration method to obtain the color camera intrinsic matrix K_c and its extrinsic parameters;
step S103, calling the official Kinect library function to obtain the depth camera intrinsic parameters K_d, and obtaining the depth camera extrinsic parameters from the color camera extrinsic parameters according to the hardware positional relation between the depth camera and the color camera;
Step S104, obtaining a conversion relation between the color camera and a three-dimensional world coordinate system through internal and external parameters of the color camera, obtaining a conversion relation between the depth camera and the three-dimensional world coordinate system through the depth camera, and converting by taking the three-dimensional world coordinate system as a constant to obtain a coordinate system conversion matrix of the depth camera and the color camera;
step S105, the multiplied result of the pixel coordinates and the conversion matrix is recorded and stored, self-adaptive translation is carried out according to the actual matching effect to repair errors, matching acceleration is carried out by using the function of the Numba library jit, and a color image matched with the depth image is circularly obtained.
3. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S20 is as follows:
step S201, filtering invalid depth points of the depth image, and then carrying out joint bilateral filtering;
step S202, median filtering is carried out on the combined bilateral filtering result;
step S203, performing an image open operation on the median filtering result.
4. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S30 is as follows:
step S301, inputting the mapped color image into the MediaPipe model for posture detection to obtain the pixel coordinates of the wrists of both hands, the shoulders and other joints;
step S302, obtaining the distance between the two shoulders from the obtained joint pixel coordinates, and judging the orientation of the human body posture relative to the camera: sideways or directly facing the camera;
step S303, when the interactor is sideways to the camera, the salient region extraction and image processing algorithm is subsequently applied; when the interactor is facing the camera, the improved region growing algorithm is subsequently applied.
5. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S40 is as follows:
step S401, inputting the converted small color image and the depth image after the restoration processing into a saliency detection neural network to obtain a neural network output;
step S402, discarding the region with lower confidence in the output result, selecting a proper threshold value to perform binarization processing on the network output result;
step S403, performing open operation on the image obtained after the binary processing, smoothing and denoising the image, and obtaining a reliable significance result.
6. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S50 is as follows:
step S501, performing contour detection on the processed saliency binary image to obtain all contours and contour points of all contours;
step S502, judging potential interactive hands according to the wrist coordinates of the two hands, and judging whether the interactive intention exists or not according to the distance between the potential interactive hands and the chest;
step S503, the wrist position coordinates of the interactive hand are taken as base points, the direction is judged according to the position relation between the body of the interactor and the arm of the interactor, and contour nodes on the opposite sides of the base points in all contours are screened out;
step S504, the pixel coordinates of the midpoints of the rest outlines are obtained, the distances from the midpoints to the interactive wrist are compared, and the area where the object is located is determined.
7. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S60 is as follows:
step S601, when the detection result is that the interactor is facing the camera, comparing the distances from the two wrists to the robot arm to select the potential interacting hand;
step S602, comparing the distance between the interactive hand and the chest, and analyzing whether the interactive hand has interactive intention;
step S603, for the hand with interaction intention, selecting a base point according to the pixel coordinates of the interacting hand, growing the depth image from the base point with a guided region growing algorithm, and collecting the target object area.
8. The human-computer interaction method based on deep learning and image processing for arbitrary objects according to claim 1, wherein the specific process of step S70 is as follows:
step S701, using the previously obtained object contour or object area, using an OpenCV image processing algorithm to obtain an area minimum external rectangle as an Anchor box of the object, and simultaneously obtaining the width, height, rotation angle and center coordinates of the rectangle;
step S702, according to the size information of the circumscribed rectangle and the rotation angle of the rectangular image, obtaining the included angle between the longer side of the rectangle and the vertical direction;
in step S703, a rectangular area is intercepted and stored, and pixel values of three channels of pixel points in the object area are all set to 0.
9. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S80 is as follows:
step S801, calculating the pixel area of the palm of the interactor according to the human body proportion so as to formulate an area threshold;
step S802, calculating the ratio of the number of pixels to the area over multiple trials in which the interactor's palm faces the camera and is stretched to the maximum extent, taking the average value as the filling degree threshold and recording it;
step S803, calculating the area and the filling degree of the circumscribed rectangle pixels during interaction, and calculating the area of the rectangle pixels according to the width and the height of the circumscribed rectangle;
step S804, comparing the real area and the filling degree with the corresponding threshold values, and judging the possibility of holding the object by the interactors.
10. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S90 is as follows:
step S901, judging the palm area range by using the wrist position and the palm size;
step S902, judging the upper and lower positions of the interactive hand-held object according to the position relation between the palm center position and the rectangular center;
step S903, determining the midpoint of the boundary between the palm area and the interactable area according to the coordinates of the palm center and the coordinates of the center of the object area;
step S904, determining the coordinates of the midpoint of the boundary of the area suitable for grabbing by the mechanical claw according to the included angle and the position of the object to be grabbed;
step S905, according to the rectangular area and the depth image information, obtaining a point which is close to the corresponding boundary and in the object;
step S906, converting the two points into a three-dimensional world coordinate system, calculating the distance between the two points, comparing the distance with the size of the manipulator paw, and judging whether an interaction task can be performed;
in step S907, the two points converted into the world coordinate system are projected onto the YOZ and XOZ planes respectively, and the rotation angles of the object about the X and Y axes are calculated.
11. The human-computer interaction method based on deep learning and image processing for any object according to claim 1, wherein the specific process of step S100 is as follows:
step S1001, transmitting the object pose and the clamping position acquired by the vision processing end to the robot control end in a TCP/IP communication mode;
step S1002, after receiving the information, the robot control end moves to a designated position in a designated gesture, and the robot finishes the task of grabbing or following the movement according to the interactivity information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311059633.5A CN117021099A (en) | 2023-08-22 | 2023-08-22 | Human-computer interaction method oriented to any object and based on deep learning and image processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311059633.5A CN117021099A (en) | 2023-08-22 | 2023-08-22 | Human-computer interaction method oriented to any object and based on deep learning and image processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117021099A true CN117021099A (en) | 2023-11-10 |
Family
ID=88626251
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311059633.5A Pending CN117021099A (en) | 2023-08-22 | 2023-08-22 | Human-computer interaction method oriented to any object and based on deep learning and image processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117021099A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576787A (en) * | 2024-01-16 | 2024-02-20 | 北京大学深圳研究生院 | Method, device and equipment for handing over based on active tracking and self-adaptive gesture recognition |
-
2023
- 2023-08-22 CN CN202311059633.5A patent/CN117021099A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576787A (en) * | 2024-01-16 | 2024-02-20 | 北京大学深圳研究生院 | Method, device and equipment for handing over based on active tracking and self-adaptive gesture recognition |
CN117576787B (en) * | 2024-01-16 | 2024-04-16 | 北京大学深圳研究生院 | Method, device and equipment for handing over based on active tracking and self-adaptive gesture recognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109255813B (en) | Man-machine cooperation oriented hand-held object pose real-time detection method | |
CN109934864B (en) | Residual error network deep learning method for mechanical arm grabbing pose estimation | |
CN105729468B (en) | A kind of robotic workstation based on the enhancing of more depth cameras | |
JP6415026B2 (en) | Interference determination apparatus, interference determination method, and computer program | |
CN112297013B (en) | Robot intelligent grabbing method based on digital twin and deep neural network | |
CN110298886B (en) | Dexterous hand grabbing planning method based on four-stage convolutional neural network | |
CN110480637B (en) | Mechanical arm part image recognition and grabbing method based on Kinect sensor | |
CN110211180A (en) | A kind of autonomous grasping means of mechanical arm based on deep learning | |
CN110355754A (en) | Robot eye system, control method, equipment and storage medium | |
CN113284179B (en) | Robot multi-object sorting method based on deep learning | |
CN110378325B (en) | Target pose identification method in robot grabbing process | |
CN111462154A (en) | Target positioning method and device based on depth vision sensor and automatic grabbing robot | |
CN112894815B (en) | Method for detecting optimal position and posture for article grabbing by visual servo mechanical arm | |
CN110796700B (en) | Multi-object grabbing area positioning method based on convolutional neural network | |
Ni et al. | A new approach based on two-stream cnns for novel objects grasping in clutter | |
CN106256512A (en) | Robotic device including machine vision | |
CN115213896A (en) | Object grabbing method, system and equipment based on mechanical arm and storage medium | |
CN117021099A (en) | Human-computer interaction method oriented to any object and based on deep learning and image processing | |
JP2018126862A (en) | Interference determination apparatus, interference determination method, and computer program | |
CN116249607A (en) | Method and device for robotically gripping three-dimensional objects | |
CN115861780B (en) | Robot arm detection grabbing method based on YOLO-GGCNN | |
KR20230061612A (en) | Object picking automation system using machine learning and method for controlling the same | |
CN115578460A (en) | Robot grabbing method and system based on multi-modal feature extraction and dense prediction | |
CN114998573A (en) | Grabbing pose detection method based on RGB-D feature depth fusion | |
CN113034526B (en) | Grabbing method, grabbing device and robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |