
CN118865310A - Target object detection method and target object detection model training method - Google Patents


Info

Publication number
CN118865310A
Authority
CN
China
Prior art keywords
point cloud
sub
target object
dimensional
point
Prior art date
Legal status
Pending
Application number
CN202310431099.XA
Other languages
Chinese (zh)
Inventor
李颖彦
范略
黄泽昊
王乃岩
Current Assignee
Beijing Tusimple Technology Co Ltd
Original Assignee
Beijing Tusimple Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Tusimple Technology Co Ltd
Priority to CN202310431099.XA
Publication of CN118865310A


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target object detection method, a target object detection device and a storage medium. The method comprises: selecting a first sub-point cloud corresponding to a target object from a point cloud according to the distribution of a plurality of point data in the point cloud; selecting a second sub-point cloud corresponding to the target object from the point cloud according to an image; generating a first prediction frame and a second prediction frame of the target object according to the first sub-point cloud and the second sub-point cloud respectively; selecting a third sub-point cloud and a fourth sub-point cloud from the point cloud according to the first prediction frame and the second prediction frame respectively; and detecting the target object according to the third sub-point cloud and the fourth sub-point cloud. With this technical scheme, data from multiple sensors can be fused, the accuracy of detecting objects in a sparse point cloud is improved, and the computational burden is kept small. In addition, a training method, a training device and a storage medium for a target object detection model are also provided.

Description

Target object detection method and target object detection model training method
Technical Field
The disclosure relates to detection of a target object, and in particular relates to a method for detecting a target object based on point cloud and a training method for a target object detection model.
Background
A lidar forms a point cloud by emitting laser beams and collecting the reflected beams. The point cloud data can be processed to perceive the spatial extent of a target object, so point clouds are widely used in the intelligent driving field. To reduce noise and improve the accuracy of detecting a target object in an environment, multi-modal sensor fusion may be performed across different types of sensors. One type of multi-modal sensor fusion relies on a dense bird's eye view (BEV), for example computing feature maps of bird's eye views generated by the image modality and the point cloud modality respectively. However, the size of a bird's eye view feature map is generally proportional to the square of the detection distance, so the computational burden makes this approach unsuitable for long-distance target object detection.
Another type of multi-modal sensor fusion does not rely on a dense bird's eye view, but instead projects the lidar point cloud onto the image plane of a camera to obtain image information, thereby guiding the selection of point data for detecting a target object. However, when image information is used to select point data in a point cloud, it is difficult to effectively obtain accurate point data and thereby improve the accuracy of target object detection. To make the point data meet the detection requirement, virtual points can be generated from the instance information of the image to enrich the point cloud; however, the related algorithms for generating virtual points incur a large computational load, which is unfavorable for practical use.
Therefore, how to detect target objects in a sparse point cloud through multi-modal sensor fusion, improving the accuracy of target object detection while keeping the computational burden small, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the disclosure provide a target object detection method, a target object detection device and a storage medium, which can fuse point cloud and image data, detect a target object in a sparse point cloud, and improve the accuracy of target object detection while keeping the computational burden small.
The embodiments of the disclosure further provide a training method, a training device and a storage medium for a target object detection model, which can provide training samples based on point cloud and image data so that the target object detection model can detect a target object in a sparse point cloud, improving the accuracy of target object detection while keeping the computational burden small.
In a first aspect, an embodiment of the present disclosure provides a method for detecting a target object, where the method for detecting a target object includes:
selecting a first sub-point cloud corresponding to the target object from the point clouds according to the distribution of a plurality of point data in the point clouds;
selecting a second sub-point cloud corresponding to the target object from the point clouds according to the image;
generating a first prediction frame and a second prediction frame of the target object according to the first sub-point cloud and the second sub-point cloud respectively;
selecting a third sub-point cloud and a fourth sub-point cloud from the point clouds according to the first prediction frame and the second prediction frame respectively; and
detecting the target object according to the third sub-point cloud and the fourth sub-point cloud.
In a second aspect, an embodiment of the present disclosure provides a training method of a target object detection model, where the training method of the target object detection model includes:
detecting a target object according to a target object detection primary model, so as to select a sub-point cloud corresponding to the target object from a point cloud;
generating a three-dimensional annotation frame and a two-dimensional annotation frame corresponding to the target object;
matching the sub-point cloud to the three-dimensional annotation frame according to a first matching degree of the sub-point cloud and the three-dimensional annotation frame and according to a second matching degree of a two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame; and
training the target object detection primary model with the sub-point cloud matched to the three-dimensional annotation frame, to obtain the target object detection model.
In a third aspect, an embodiment of the present disclosure provides an apparatus for detecting a target object, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the computer program to implement a method for detecting a target object as provided in an embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure provides an apparatus for training a target object detection model, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements a training method of the target object detection model as provided by the embodiment of the present disclosure when the processor executes the computer program.
In a fifth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for detecting a target object as provided by embodiments of the present disclosure.
In a sixth aspect, embodiments of the present disclosure provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method for a target object detection model as provided by embodiments of the present disclosure.
Based on the disclosure, in the target object detection method, the third sub-point cloud serves, via the first prediction frame, as an update of the first sub-point cloud, so that the data about the target object selected from the point cloud according to the distribution of the plurality of point data is more accurate. The fourth sub-point cloud serves, via the second prediction frame, as an update of the second sub-point cloud, so that the data about the target object selected from the point cloud according to the image is more accurate. Therefore, the target object detection method can fuse the point cloud and the image data, detect a target object in a sparse point cloud, and improve the accuracy of target object detection. Meanwhile, since virtual points do not need to be generated to enrich the point cloud, the target object detection method keeps the computational burden small. In addition, the training method of the target object detection model provides training samples according to the first matching degree of the sub-point cloud and the three-dimensional annotation frame and the second matching degree of the two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame, so that the target object detection model can be trained on point cloud and image data. When applied in the target object detection method, the trained model can therefore generate the corresponding prediction frames more accurately, detect the target object in a sparse point cloud, and improve the accuracy of target object detection while keeping the computational burden small.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. It is evident that the figures in the following description are only some embodiments of the invention, from which other figures can be obtained without inventive effort for a person skilled in the art. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
Fig. 1 is a schematic step diagram of a method for detecting a target object in an embodiment of the disclosure;
FIG. 2 is a schematic diagram illustrating steps of a training method of a target object detection model according to an embodiment of the disclosure;
FIG. 3 is a flow chart of a method for detecting a target object according to an embodiment of the disclosure;
FIG. 4 illustrates a case in an embodiment of the disclosure in which the three-dimensional center of a sub-point cloud falls outside the three-dimensional annotation frame;
FIG. 5 illustrates a center of a two-dimensional prediction box of a sub-point cloud within a two-dimensional annotation box in an embodiment of the disclosure;
FIG. 6 is a graph comparing different results of detecting a target object using a method of detecting a target object according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of a computer device in an embodiment of the disclosure.
Detailed Description
In order to better understand the technical solutions of the present invention, the following description will clearly and completely describe the technical solutions of the embodiments of the present disclosure with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the present disclosure, the term "plurality" means two or more, unless otherwise indicated. In the present disclosure, unless otherwise indicated, the use of the terms "first," "second," and the like are used to distinguish similar objects and are not intended to limit their positional relationship, timing relationship, or importance relationship. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other manners than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. In order to enable those skilled in the art to better understand the present application, some technical terms appearing in the embodiments of the present application are explained below:
Point Cloud (Point Cloud): the data of the surrounding environment acquired by a lidar, typically represented by a set of sparse three-dimensional spatial point data.
Image: data of the surrounding environment collected by the camera. Typically represented by a set of RGB values.
Detection of a target object in three-dimensional space: finding the position of the target object in the raw sensor data by an algorithm. The position occupied by the target object in three-dimensional space is typically represented by a three-dimensional prediction box, such as a cuboid.
Multimodal sensor fusion (Multi-modal Sensor Fusion): fusing and effectively processing the information of a plurality of sensors. The plurality of sensors may be of different types or of the same type.
Point cloud cluster (Point Cloud Cluster): a set of clustered point data segmented from the original point cloud. In some embodiments, one point cloud cluster corresponds to one target object.
In some embodiments of the application, the term "vehicle" is to be construed broadly to include any moving object, for example aircraft, watercraft, spacecraft, automobiles, trucks, vans, semi-trailers, motorcycles, golf carts, off-road vehicles, warehouse transport vehicles, agricultural vehicles, and vehicles traveling on rails, such as trams or trains and other rail vehicles. The "vehicle" in the present application may generally include: a power system, a sensor system, a control system, peripherals, and a computer system. In some embodiments, the vehicle may include more, fewer, or different systems.
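For a concrete illustration only, the following minimal Python sketch shows one possible in-memory representation of the terms defined above (point cloud, point cloud cluster, and three-dimensional prediction box). The class and field names are illustrative assumptions and are not part of this disclosure.

```python
# Illustrative data structures (assumed names and fields, not defined by this disclosure).
from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    center: np.ndarray   # (3,) x, y, z of the box center
    size: np.ndarray     # (3,) length, width, height
    yaw: float           # rotation angle around the vertical axis (radians)
    score: float = 1.0   # confidence of the prediction

# A point cloud is simply an (N, 3) array of x, y, z coordinates; extra
# columns such as intensity may be appended in practice.
point_cloud = np.random.rand(1000, 3) * 50.0
# A point cloud cluster is a subset of rows selected from the point cloud.
cluster = point_cloud[:100]
```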
Fig. 3 is a flowchart illustrating a method for detecting a target object in an embodiment of the disclosure. In an embodiment of the disclosure, a movable device such as a vehicle is equipped with a first sensor and a second sensor. The first sensor is, for example, a vehicle-mounted lidar for acquiring a point cloud (point cloud data), and the second sensor is, for example, a camera for acquiring an image (image data). While the vehicle is running, the vehicle-mounted lidar and the camera mounted on the vehicle can acquire point clouds and images of the surrounding environment of the vehicle at certain time intervals; the point clouds and images of the surrounding environment may of course also be acquired at other regular intervals during the running of the vehicle. The acquired point clouds and images can be sent to a server through a network. After the point cloud and the image are acquired, the server can process the acquired data by means of multi-modal sensor fusion and detect the target object in the three-dimensional space constructed from the surrounding environment of the vehicle.
Specifically, the type of the target object may be, for example, an obstacle, a pedestrian, an animal, a tree, a lane, a building, or the like, and the present invention is not limited in this respect. In addition, the vehicle may be equipped with other types or numbers of sensors to generate sensor data for improving the performance of the multi-modal sensor fusion. Other types of sensors are, for example, an integrated navigation device, an inertial navigation system (Inertial Navigation System, INS), an accelerometer, a gyroscope, a millimeter-wave radar, an ultrasonic radar, and the like, and the present invention is not limited in this respect. In this embodiment, the first sensor and the second sensor may also be mounted on a roadside device instead of on the movable object. The roadside device may, for example, communicate with the vehicle through a V2X (Vehicle to Everything) communication device, and send to the vehicle the unprocessed data of the first sensor and the second sensor, the data processing results of the first sensor and the second sensor, or prompts, alarms, or other information related to the data processing results, and the present invention is not limited in this respect.
First, the generation of point cloud clusters is explained. In this embodiment, the point cloud from the first sensor and the image from the second sensor may be used to detect the target object in the surrounding environment of the vehicle by the target object detection method. Referring to fig. 3, the method for detecting a target object includes selecting a first sub-point cloud PCS1 corresponding to the target object from a point cloud PC according to the distribution of a plurality of point data in the point cloud PC, and selecting a second sub-point cloud PCS2 corresponding to the target object from the point cloud PC according to an image I. The steps of obtaining the first sub-point cloud PCS1 and the second sub-point cloud PCS2 may be performed in parallel or sequentially, which is not limited by the present invention. Specifically, the point cloud PC is, for example, raw point cloud data acquired by a lidar. Since the selection of the first sub-point cloud PCS1 depends on the distribution of the plurality of point data in the raw point cloud data, the first sub-point cloud PCS1 can be regarded as, for example, a radar point cloud cluster. In addition, since the selection of the second sub-point cloud PCS2 depends on the information of the image I, the second sub-point cloud PCS2 can be regarded as, for example, an image point cloud cluster.
The step of selecting the first sub-point cloud PCS1 corresponding to the target object from the point cloud PC according to the distribution of the plurality of point data in the point cloud PC includes: predicting, based on the plurality of point data in the point cloud PC, the point data belonging to the foreground of the target object to obtain a plurality of foreground point data; and clustering the plurality of foreground point data to obtain the first sub-point cloud PCS1. The first sub-point cloud PCS1 includes a plurality of first point data corresponding to the target object. Specifically, the above step of selecting the first sub-point cloud PCS1 corresponding to the target object from the point cloud PC is one way of performing three-dimensional instance segmentation (3D Instance Segmentation) on the point cloud PC.
In this embodiment, the step of clustering the plurality of foreground point data to obtain the first sub-point cloud PCS1 includes: determining a plurality of center point data representing the target object, wherein each center point data represents the center of the target object predicted by one of the plurality of foreground point data; calculating the distances among the plurality of center point data respectively; and adding, to the first sub-point cloud PCS1, at least one of the plurality of foreground point data that satisfies a condition, wherein the condition is that the distance is less than or equal to a preset distance. Specifically, the above steps may be implemented, for example, by the following procedure:
First, the point cloud PC is divided into a plurality of grid cells according to its spatial range, where each grid cell may be a voxel. For each voxel, a deep learning model, for example a PointNet used as a point cloud feature extractor, may be applied to the point cloud PC to extract point cloud features, resulting in a set of voxels with initial voxel features.
After the voxel set with the initial voxel features is obtained, the initial voxel features may be input into a sparse-convolution-based network to obtain voxel features. After a voxel feature is obtained, it may be mapped onto the point data included in the corresponding voxel, and the point data feature may be formed by combining it with the geometric feature of each point data relative to the center of the voxel to which the point data belongs.
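The following simplified Python sketch illustrates the voxelization and per-point feature construction just described. It is an assumption-laden stand-in: mean pooling replaces the PointNet feature extractor and the sparse-convolution network, and the function name and voxel size are invented for illustration.

```python
# Simplified stand-in for the voxelization and per-point feature step
# (mean pooling replaces PointNet and the sparse-convolution network).
import numpy as np

def point_features_from_voxels(points: np.ndarray, voxel_size: float = 0.5):
    """points: (N, 3) lidar points. Returns an (N, 6) per-point feature."""
    voxel_ids = np.floor(points / voxel_size).astype(np.int64)          # voxel index of each point
    keys, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)   # (V, 3), (N,)
    inverse = inverse.reshape(-1)                                       # guard against (N, 1) shapes
    voxel_centers = (keys.astype(np.float64) + 0.5) * voxel_size        # (V, 3)

    # Stand-in "voxel feature": mean of the points falling in the voxel.
    voxel_feat = np.zeros_like(voxel_centers)
    np.add.at(voxel_feat, inverse, points)
    counts = np.bincount(inverse, minlength=len(keys)).reshape(-1, 1)
    voxel_feat /= counts

    # Map the voxel feature back to each point and append the offset of the
    # point from the center of its voxel as a geometric feature.
    return np.concatenate(
        [voxel_feat[inverse], points - voxel_centers[inverse]], axis=1)

features = point_features_from_voxels(np.random.rand(1000, 3) * 50.0)
```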
After the point data features are acquired, the server may input the point data features into a multi-layer perceptron (Multi-Layer Perceptron, MLP) to classify the point data and obtain the above foreground point data, and then perform clustering on the foreground point data with a clustering algorithm to obtain the set of point data predicted to belong to the target object, that is, the first sub-point cloud PCS1. Specifically, the step of clustering the plurality of foreground point data includes: determining a plurality of center point data representing the target object, where each center point data represents the center of the target object to which a foreground point data is predicted to belong; calculating the distances among the plurality of center point data respectively; and adding, to the first set of predicted point data, at least one of the plurality of foreground point data that satisfies a condition, where the condition is that the distance is less than or equal to a preset distance. Thus, the plurality of foreground point data can be grouped by clustering, resulting in a grouping of instances. That is, the first sub-point cloud PCS1 is a point cloud corresponding to the target object selected from the point cloud PC by instance segmentation.
In some cases, the clustering process of the point cloud may take the centers predicted by all foreground points as nodes and the distances between the points as the weights of edges, so as to construct a graph structure. Two points are considered connected if the distance between them is less than a certain threshold. All connected components in the graph are then found with a connected-component algorithm from graph theory, and each connected component found is regarded as one instance. Thus, each foreground point has a unique instance identifier indicating which instance the point belongs to.
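A minimal sketch of this clustering step is given below, assuming each foreground point has already predicted a center for its object; points whose predicted centers are within a preset distance are linked, and each connected component becomes one instance. The O(N^2) neighbor search and the distance threshold are simplifications for illustration.

```python
# Minimal union-find sketch of clustering foreground points by their
# predicted object centers (threshold and search strategy are assumptions).
import numpy as np

def cluster_by_predicted_centers(centers: np.ndarray, max_dist: float = 0.6):
    """centers: (N, 3) object centers predicted by the foreground points.
    Returns an (N,) array of instance identifiers."""
    n = len(centers)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Connect two foreground points when their predicted centers are closer
    # than max_dist (O(N^2) here; a real system would use a spatial index).
    for i in range(n):
        dists = np.linalg.norm(centers[i + 1:] - centers[i], axis=1)
        for j in (np.nonzero(dists <= max_dist)[0] + i + 1):
            ri, rj = find(i), find(int(j))
            if ri != rj:
                parent[rj] = ri

    # Each connected component is one instance.
    return np.array([find(i) for i in range(n)])

instance_ids = cluster_by_predicted_centers(np.random.rand(200, 3) * 30.0)
```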
In this embodiment, for the above steps of selecting the first sub-point cloud PCS1 corresponding to the target object from the point cloud PC, reference may be made to Lue Fan, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang, Fully Sparse 3D Object Detection, NeurIPS 2022. However, according to actual requirements, other methods may also be adopted to select the first sub-point cloud PCS1 corresponding to the target object from the point cloud PC.
Please refer to fig. 4. In this embodiment, the step of selecting the second sub-point cloud PCS2 corresponding to the target object from the point cloud PC according to the image I includes: aligning the image I with the point cloud PC; defining a three-dimensional frame (e.g., the three-dimensional frame 3DB in fig. 4) in the space where the point cloud PC is located, based on the image I; and selecting the part of the point cloud PC that falls into the three-dimensional frame 3DB to obtain the second sub-point cloud PCS2.
In this embodiment, the first sensor is a lidar and the second sensor is a camera (e.g., the camera Cam of fig. 4). The step of aligning the image I with the point cloud PC includes: calibrating the first sensor and the second sensor to align the image I with the point cloud PC. Specifically, by calibrating the sensor extrinsic parameters of the first sensor (lidar) and the second sensor (camera), a calibration matrix can be obtained. Therefore, the position of the camera can be defined in the three-dimensional space where the point cloud PC is located, and the information of the image I can be projected into that three-dimensional space.
In the present embodiment, the step of defining the three-dimensional frame 3DB in the space where the point cloud PC is located according to the image I includes: defining a first position (e.g., the first position P1 of fig. 4) of the camera Cam in the space where the point cloud PC is located. In addition, two-dimensional instance segmentation (2D Instance Segmentation) may be performed on the image I to obtain a second position (e.g., the second position P2 in fig. 4), in the space where the point cloud PC is located, of the target object in the image I. In particular, instance segmentation may be performed on the image I to obtain a mask (e.g., the mask M in fig. 3) corresponding to the target object. Because the data of the lidar and the data of the camera can be converted into each other through the calibration matrix, the position information obtained by performing instance segmentation on the image I can be converted, through the calibration matrix, into the three-dimensional space where the point cloud PC is located.
In the present embodiment, the three-dimensional frame 3DB is, for example, a view cone (frustum), and the shape of the three-dimensional frame 3DB follows from the description of the view cone. Specifically, the view cone (e.g., the view cone FT of fig. 4) extends from the first position P1 to the second position P2, and the contour of the surface of the view cone FT at the second position P2 is, for example, the contour of the mask M. For example, when the contour of the mask M is the contour of a vehicle, the contour of the surface of the view cone FT at the second position P2 is the contour of that vehicle, and this contour shape converges toward the first position P1 where the camera Cam is located. In this embodiment, the part of the point cloud PC whose point data fall into the view cone FT is the second sub-point cloud PCS2 corresponding to the target object selected from the point cloud PC according to the image I.
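As a hedged illustration of selecting the second sub-point cloud, the sketch below projects lidar points into the camera image with an assumed 3x4 lidar-to-image projection (calibration) matrix and keeps the points whose projections fall inside the instance mask, which corresponds to keeping the points inside the view cone. The matrix layout and mask format are assumptions, not specifics of this disclosure.

```python
# Hedged sketch: project lidar points into the camera image and keep those
# whose projections fall inside the instance mask, i.e. inside the view cone.
import numpy as np

def select_points_in_frustum(points: np.ndarray,
                             proj_mat: np.ndarray,
                             mask: np.ndarray) -> np.ndarray:
    """points: (N, 3) lidar points; proj_mat: (3, 4) lidar-to-image projection;
    mask: (H, W) boolean instance mask. Returns the selected points."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    cam = homo @ proj_mat.T                                             # (N, 3)
    in_front = cam[:, 2] > 0.1                          # keep points in front of the camera
    uv = cam[:, :2] / np.maximum(cam[:, 2:3], 1e-6)     # pixel coordinates
    u = uv[:, 0].astype(np.int64)
    v = uv[:, 1].astype(np.int64)
    h, w = mask.shape
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    keep = in_front & in_image
    idx = np.nonzero(keep)[0]
    keep[idx] = mask[v[idx], u[idx]]                    # inside the instance mask
    return points[keep]
```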
In other embodiments, the image I may not be subjected to two-dimensional instance segmentation. Specifically, a two-dimensional prediction frame corresponding to the target object may be generated according to the image I by other or conventional prediction methods of the target object, and the position information of the two-dimensional prediction frame may be converted into a three-dimensional space where the point cloud PC is located through a calibration matrix, so as to obtain a third position (e.g., a third position P3 in fig. 4) of the two-dimensional prediction frame in the space where the point cloud PC is located. In these embodiments, the view cone FT extends from the first position P1 to the third position P3, and the contour of the surface of the view cone FT at the third position P3 is, for example, the contour of such a two-dimensional prediction frame.
In the above embodiment, the coordinate system where the point cloud PC is located is, for example, a laser radar coordinate system, and the first sub-point cloud PCs1 and the second sub-point cloud PCs2 selected from the point cloud PC are also processed in the laser radar coordinate system. However, the point cloud PC, the first sub-point cloud PCs1 and/or the second sub-point cloud PCs2 may be converted into other kinds of coordinate systems for operation according to actual operation requirements, which is not limited in the present invention.
Next, alignment of the point cloud clusters is explained. Please refer to fig. 3. In this embodiment, the method for detecting a target object further includes: generating a first prediction frame B1 and a second prediction frame B2 of the target object according to the first sub-point cloud PCS1 and the second sub-point cloud PCS2 respectively, and selecting a third sub-point cloud PCS3 and a fourth sub-point cloud PCS4 from the point cloud PC according to the first prediction frame B1 and the second prediction frame B2 respectively. The first prediction frame B1 and the second prediction frame B2 are, for example, three-dimensional.
Specifically, the step of generating the first prediction frame B1 and the second prediction frame B2 of the target object according to the first sub-point cloud PCS1 and the second sub-point cloud PCS2, respectively, includes: extracting first point data characteristics of a plurality of first point data in a first sub-point cloud PCS1 to obtain first sub-point cloud characteristics; generating a first prediction block B1 representing a target object predicted based on the first sub-point cloud feature; extracting second point data characteristics of a plurality of second point data in a second sub-point cloud PCS2 to obtain second sub-point cloud characteristics; and generating a second prediction block B2 representing the predicted target object based on the second sub-point cloud feature. Specifically, the above steps may be regarded as respectively performing instance prediction on the first sub-point cloud PCS1 and the second sub-point cloud PCS2 by using the target object detection model, and may be implemented, for example, by the following procedures:
First, in this embodiment, the target object detection model may include a sparse instance recognition model, so that instance features may be extracted and instance prediction performed by a sparse instance recognition (Sparse Instance Recognition) model. Specifically, instance features may be extracted from the plurality of point data of the first sub-point cloud PCS1 / the second sub-point cloud PCS2 using an algorithm of dynamic pooling and dynamic broadcasting. The dynamic pooling may perform a weighted summation of the point data features of the first sub-point cloud PCS1 / the second sub-point cloud PCS2, so as to obtain the instance feature of the first sub-point cloud PCS1 / the second sub-point cloud PCS2. Then, the instance feature of the first sub-point cloud PCS1 / the second sub-point cloud PCS2 is broadcast back to the point data features in the first sub-point cloud PCS1 / the second sub-point cloud PCS2, and a target feature is obtained according to the differences between the instance feature and the point data features. By repeating the dynamic pooling and dynamic broadcasting processes several times, the obtained target feature can be used as the first sub-point cloud feature F1 / the second sub-point cloud feature F2. Next, a first prediction frame B1 representing the predicted target object may be generated based on the first sub-point cloud feature F1, and a second prediction frame B2 representing the predicted target object may be generated based on the second sub-point cloud feature F2. In other embodiments, the target object detection model may also include other types of instance recognition models according to the characteristics of the point cloud data, or the first sub-point cloud feature F1 / the second sub-point cloud feature F2 may be generated by other algorithms, which is not limited by the present invention.
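The following is a rough numerical sketch of the dynamic pooling / dynamic broadcasting data flow described above; uniform-weight pooling and simple concatenation stand in for the learned layers of the actual model, and the round count is arbitrary, so it should be read as an illustration only.

```python
# Rough numerical sketch of dynamic pooling / dynamic broadcasting
# (uniform weights and concatenation stand in for learned layers).
import numpy as np

def sparse_instance_feature(point_feats: np.ndarray, rounds: int = 3) -> np.ndarray:
    """point_feats: (N, C) point data features of one sub-point cloud.
    Returns the instance-level feature after the final pooling."""
    feats = point_feats.astype(np.float64)
    inst = feats.mean(axis=0)                     # dynamic pooling (uniform weights)
    for _ in range(rounds):
        # Dynamic broadcasting: give the instance feature back to every point
        # and keep the difference alongside the point feature.
        feats = np.concatenate([feats, feats - inst[None, :]], axis=1)
        # Dynamic pooling again on the updated point features.
        inst = feats.mean(axis=0)
    return inst

instance_feature = sparse_instance_feature(np.random.rand(64, 128))
# A regression head (not shown here) would map this feature to a
# three-dimensional prediction frame and a confidence score.
```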
With continued reference to fig. 3, in this embodiment, the step of selecting the third sub-point cloud PCS3 and the fourth sub-point cloud PCS4 from the point cloud PC according to the first prediction frame B1 and the second prediction frame B2 respectively includes: selecting the part of the point cloud PC that falls into the first prediction frame B1 to obtain the third sub-point cloud PCS3; and selecting the part of the point cloud PC that falls into the second prediction frame B2 to obtain the fourth sub-point cloud PCS4. Specifically, the third sub-point cloud PCS3 may be regarded as, for example, the radar point cloud cluster after the alignment process, and the fourth sub-point cloud PCS4 may be regarded as, for example, the image point cloud cluster after the alignment process.
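A simple sketch of this re-selection step is shown below: the points of the original point cloud that fall inside a yaw-rotated three-dimensional prediction frame are kept as the aligned sub-point cloud. The box parameterization (center, size, yaw) is an assumption consistent with the illustrative Box3D structure sketched earlier.

```python
# Sketch of selecting the points that fall inside a yaw-rotated 3D
# prediction frame (box parameterization is an assumption).
import numpy as np

def points_in_box(points: np.ndarray, center, size, yaw: float) -> np.ndarray:
    """points: (N, 3); center/size: length-3 sequences; yaw: rotation about
    the vertical axis in radians. Returns the points inside the box."""
    shifted = points - np.asarray(center, dtype=np.float64)
    c, s = np.cos(-yaw), np.sin(-yaw)                   # rotate into the box frame
    x = c * shifted[:, 0] - s * shifted[:, 1]
    y = s * shifted[:, 0] + c * shifted[:, 1]
    z = shifted[:, 2]
    half = np.asarray(size, dtype=np.float64) / 2.0
    inside = (np.abs(x) <= half[0]) & (np.abs(y) <= half[1]) & (np.abs(z) <= half[2])
    return points[inside]

third_sub_point_cloud = points_in_box(
    np.random.rand(1000, 3) * 50.0, center=[25.0, 25.0, 1.0],
    size=[4.5, 2.0, 1.8], yaw=0.3)
```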
Next, prediction from the point cloud clusters is described. In this embodiment, the method for detecting a target object further includes a step of detecting the target object according to the third sub-point cloud PCS3 and the fourth sub-point cloud PCS4. Specifically, this step includes: extracting third point data features of a plurality of third point data in the third sub-point cloud PCS3 to obtain a third sub-point cloud feature F3; generating a third prediction box (not shown) representing the predicted target object based on the third sub-point cloud feature F3; extracting fourth point data features of a plurality of fourth point data in the fourth sub-point cloud PCS4 to obtain a fourth sub-point cloud feature F4; generating a fourth prediction box (not shown) representing the predicted target object based on the fourth sub-point cloud feature F4; and detecting the target object according to the third prediction frame and the fourth prediction frame. Specifically, the above steps may be regarded as performing instance prediction on the third sub-point cloud PCS3 and the fourth sub-point cloud PCS4 respectively with the target object detection model.
The method for generating the third prediction frame of the target object according to the third sub-point cloud PCS3 and the method for generating the fourth prediction frame of the target object according to the fourth sub-point cloud PCS4 may be the same as or similar to the method for performing instance prediction on the first sub-point cloud PCS1 / the second sub-point cloud PCS2 with the target object detection model, but the model architecture, neural network structure, parameters and the like of the target object detection model used may be adjusted according to actual operation requirements. Specifically, reference may be made to Lue Fan, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang, Fully Sparse 3D Object Detection, NeurIPS 2022, or to Lue Fan, Yuxue Yang, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang, Super Sparse 3D Object Detection, arXiv:2301.02562v1. However, other instance prediction methods may also be employed, and the invention is not limited in this respect.
In addition, the target object detection models used to generate the first prediction frame B1, the second prediction frame B2, the third prediction frame, and the fourth prediction frame respectively may be implemented by neural networks. Specifically, the first sub-point cloud PCS1 may be input into a first neural network to generate the first prediction frame B1, the second sub-point cloud PCS2 may be input into a second neural network to generate the second prediction frame B2, and the third sub-point cloud PCS3 and the fourth sub-point cloud PCS4 may be input into a third neural network to detect the target object (i.e., the third sub-point cloud PCS3 and the fourth sub-point cloud PCS4 share the same neural network). In this embodiment, the first neural network, the second neural network, and the third neural network may have the same structure, with different parameters set according to actual requirements, such as requirements for optimizing algorithm performance. In some embodiments, the first neural network, the second neural network, and the third neural network may also be the same neural network, and the invention is not limited in this respect.
By generating the aligned radar point cloud cluster (the third sub-point cloud PCS3) and the aligned image point cloud cluster (the fourth sub-point cloud PCS4), two prediction frames, namely the third prediction frame and the fourth prediction frame, may be generated for the same target object, and the degree of overlap between the two prediction frames may be high. In this embodiment, the third prediction frame and the fourth prediction frame may each have a confidence level for the target object, and duplicates may be removed, for example, by non-maximum suppression (Non-Maximum Suppression, NMS). The confidence level is, for example, a value indicating the degree to which the target object belongs to a certain object class, and may be greater than 0 and less than or equal to 1. Specifically, the third prediction frame and the fourth prediction frame may each carry information such as center point coordinates, prediction frame size, and prediction frame rotation angle.
In this embodiment, the step of detecting the target object according to the third prediction frame and the fourth prediction frame includes deleting one of the third prediction frame and the fourth prediction frame according to the degree of overlap between them, and this step includes: calculating an intersection-over-union ratio (Intersection over Union, IoU) between the third prediction frame and the fourth prediction frame; and, in response to the intersection-over-union ratio being above a threshold, deleting whichever of the third prediction frame and the fourth prediction frame has the lower confidence. Specifically, when two prediction frames largely coincide (i.e., both are likely to detect the same target object, for example one prediction frame from the radar point cloud cluster and the other from the image point cloud cluster), the one with the higher confidence may be retained and the one with the lower confidence deleted. In this way, the point cloud data of the lidar and the image data of the camera can be effectively fused, improving the accuracy of detecting the target object.
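The de-duplication step can be sketched as follows. For brevity the overlap is computed on axis-aligned bird's-eye-view footprints; an actual implementation would typically use rotated 3D IoU, and the threshold value is an assumption.

```python
# Sketch of the de-duplication step between the third and fourth prediction
# frames (axis-aligned BEV IoU and the 0.5 threshold are simplifications).
def bev_iou(box_a, box_b) -> float:
    """Each box: (cx, cy, length, width) in the bird's-eye view."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def deduplicate(pred_a, pred_b, iou_threshold: float = 0.5):
    """pred_*: (bev_box, confidence). Drops the lower-confidence prediction
    when the two prediction frames overlap heavily."""
    if bev_iou(pred_a[0], pred_b[0]) > iou_threshold:
        return [pred_a if pred_a[1] >= pred_b[1] else pred_b]
    return [pred_a, pred_b]

kept = deduplicate(((10.0, 5.0, 4.5, 2.0), 0.9), ((10.2, 5.1, 4.4, 2.1), 0.7))
```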
In general, an image point cloud cluster generated by two-dimensional instance segmentation alone may incorporate, within the view cone, point data that do not belong to the target object, such as background point data behind the target object. Furthermore, a lidar tends to produce denser point data in the lateral direction, and a radar point cloud cluster generated by three-dimensional instance segmentation alone may incorporate point data that cluster together laterally but should not be included in the target object. In the embodiments of the disclosure, the first sub-point cloud PCS1 corresponding to the target object is selected from the point cloud PC according to the distribution of the plurality of point data in the point cloud PC, and the second sub-point cloud PCS2 corresponding to the target object is selected from the point cloud according to the image I. In addition, the target object detection method of the embodiments of the present disclosure includes generating the first prediction frame B1 and the second prediction frame B2 of the target object according to the first sub-point cloud PCS1 and the second sub-point cloud PCS2, and selecting the third sub-point cloud PCS3 and the fourth sub-point cloud PCS4 from the point cloud PC according to the first prediction frame B1 and the second prediction frame B2, respectively. The third sub-point cloud PCS3 updates the first sub-point cloud PCS1 through the first prediction frame B1, so that the data about the target object selected from the point cloud PC according to the distribution of the plurality of point data is more accurate, filling in missed points and removing spurious ones from the radar point cloud cluster, that is, realizing the alignment processing of the radar point cloud cluster. The fourth sub-point cloud PCS4 updates the second sub-point cloud PCS2 through the second prediction frame B2, so that the data about the target object selected from the point cloud PC according to the image I is more accurate, filling in missed points and removing spurious ones from the image point cloud cluster, that is, realizing the alignment processing of the image point cloud cluster. Therefore, the target object detection method of the embodiments of the disclosure can fuse the point cloud and the image data, detect the target object in a sparse point cloud, and improve the accuracy of target object detection. Meanwhile, since virtual points do not need to be generated to enrich the point cloud, the target object detection method of the embodiments of the disclosure keeps the computational burden small.
Fig. 6 shows a comparison of results of detecting target objects with and without the target object detection method of the embodiments of the present disclosure, using, for example, a point cloud collected by an on-vehicle lidar mounted on a vehicle and images I1 and I2 collected by a camera mounted on the vehicle. When multi-modal sensor fusion is performed without the target object detection method of the embodiments of the present disclosure, no detection results are generated for several roadblocks located farther away in image I1. This is because those roadblocks have very few points that lie very close to each other, causing instance recognition based on the radar point cloud cluster to fail (e.g., no three-dimensional instance is generated in region R1). This situation easily occurs when the target object is far away and/or the point cloud is sparse. When multi-modal sensor fusion is performed with the target object detection method of the embodiments of the present disclosure, instance recognition can be performed based on the image point cloud clusters, yielding instances In1 to In5 (e.g., the three-dimensional instance results generated in region R2) and showing the improvement in target object detection accuracy.
In addition, when multi-modal sensor fusion is performed without the target object detection method of the embodiments of the present disclosure, for the case where several persons are crowded together in image I2, the generated three-dimensional instance results are only the instance In6 and the instance In7 (the three-dimensional instance results generated in region R3), rather than the more than two instances that are clearly visible in image I2. This is because the three-dimensional instance segmentation process has difficulty resolving which three-dimensional instance each of a number of closely spaced points belongs to. When multi-modal sensor fusion is performed with the target object detection method of the embodiments of the present disclosure, instance recognition can be performed based on the image point cloud clusters, yielding instances In8 to In14 (e.g., the three-dimensional instance results generated in region R4), which correspond to the several instances identifiable in image I2 and show the improvement in target object detection accuracy.
Next, three-dimensional annotation frame assignment during model training is described. In this embodiment, at least one of the following models may be trained using the training method of the target object detection model of the embodiments of the present disclosure: the above-mentioned target object detection model used for instance prediction of the first sub-point cloud PCS1, the target object detection model used for instance prediction of the second sub-point cloud PCS2, the target object detection model used for instance prediction of the third sub-point cloud PCS3, and the target object detection model used for instance prediction of the fourth sub-point cloud PCS4.
The training method of the target object detection model of the embodiments of the disclosure includes: detecting the target object according to a target object detection primary model, so as to select a sub-point cloud corresponding to the target object from the point cloud. The target object detection primary model includes, for example, the above sparse instance recognition model or other types of neural network models that can perform instance prediction on the first sub-point cloud PCS1 / the second sub-point cloud PCS2 / the third sub-point cloud PCS3 / the fourth sub-point cloud PCS4, and the invention is not limited in this respect.
The training method of the target object detection model in the embodiments of the disclosure further includes: generating a three-dimensional annotation frame and a two-dimensional annotation frame corresponding to the target object; matching the sub-point cloud to the three-dimensional annotation frame according to the first matching degree of the sub-point cloud and the three-dimensional annotation frame and according to the second matching degree of the two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame; and training the target object detection primary model with the sub-point clouds matched to the three-dimensional annotation frames to obtain the target object detection model. Specifically, the three-dimensional annotation frame and the two-dimensional annotation frame are labeled manually and serve as ground truth (Ground Truth). When a sub-point cloud is matched to a three-dimensional annotation frame, the sub-point cloud is used as a positive sample for model training. When a sub-point cloud is not matched to any three-dimensional annotation frame, the sub-point cloud is used as a negative sample for model training. In some embodiments, the three-dimensional annotation frame and the two-dimensional annotation frame may also be labeled with the help of an algorithm, which is not limited by the present invention.
In this embodiment, the step of matching the sub-point cloud to the three-dimensional annotation frame according to the first matching degree of the sub-point cloud and the three-dimensional annotation frame and according to the second matching degree of the two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame includes: matching the sub-point cloud to the three-dimensional annotation frame in response to the three-dimensional center of the sub-point cloud falling within the three-dimensional annotation frame; and matching the sub-point cloud to the three-dimensional annotation frame in response to the three-dimensional center falling outside the three-dimensional annotation frame while the two-dimensional prediction frame and the two-dimensional annotation frame meet a matching requirement. Specifically, this embodiment adopts a two-round assignment scheme. In the first round of assignment, it is determined whether the three-dimensional center of the sub-point cloud falls within the three-dimensional annotation frame. When the sub-point cloud is, for example, an image point cloud cluster similar to the second sub-point cloud PCS2, it is determined whether the center of the image point cloud cluster falls within the three-dimensional annotation frame. If so, the sub-point cloud is matched to the three-dimensional annotation frame. If the three-dimensional center of the sub-point cloud falls outside the three-dimensional annotation frame, the second round of assignment is entered. In the second round of assignment, it is determined whether the two-dimensional prediction frame and the two-dimensional annotation frame meet the matching requirement.
In this embodiment, the matching requirement is, for example, that the intersection-over-union (IoU) ratio between the two-dimensional prediction frame and the two-dimensional annotation frame is above a threshold. The IoU ratio reflects the degree of overlap between the two-dimensional prediction frame and the two-dimensional annotation frame. Specifically, when the sub-point cloud is, for example, an image point cloud cluster similar to the second sub-point cloud PCS2, the generating process of the second sub-point cloud PCS2 involves two-dimensional instance segmentation, so the generated two-dimensional prediction frame has a high probability of overlapping strongly with the two-dimensional annotation frame. In contrast, when the sub-point cloud is, for example, a radar point cloud cluster similar to the first sub-point cloud PCS1, the generating process of the first sub-point cloud PCS1 does not involve two-dimensional instance segmentation, so the generated two-dimensional prediction frame has a low probability of overlapping strongly with the two-dimensional annotation frame. In this embodiment, if the two-dimensional prediction frame and the two-dimensional annotation frame meet the matching requirement, the sub-point cloud is matched to the three-dimensional annotation frame. If the two-dimensional prediction frame and the two-dimensional annotation frame do not meet the matching requirement, the sub-point cloud is not assigned to any suitable three-dimensional annotation frame in the two rounds of assignment, and the sub-point cloud is used as a negative sample for model training.
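The two-round assignment can be sketched as below, where the first round checks whether the three-dimensional center of the sub-point cloud falls inside the three-dimensional annotation frame and the second round checks the IoU between the two-dimensional prediction frame and the two-dimensional annotation frame against a threshold (0.3 here, within the 0.2 to 0.4 range discussed in this embodiment). The function names and box formats are illustrative assumptions.

```python
# Sketch of the two-round assignment used to build training samples
# (function names, box formats and the 0.3 threshold are illustrative).
def iou_2d(a, b) -> float:
    """2D boxes as (x1, y1, x2, y2) in image coordinates."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_positive_sample(center_in_3d_box: bool,
                       box_2d_pred, box_2d_gt,
                       iou_threshold: float = 0.3) -> bool:
    """Round 1: the 3D center of the sub-point cloud falls inside the 3D
    annotation frame. Round 2: the 2D prediction frame overlaps the 2D
    annotation frame with IoU above the threshold."""
    if center_in_3d_box:
        return True
    return iou_2d(box_2d_pred, box_2d_gt) >= iou_threshold

positive = is_positive_sample(False, (100, 80, 220, 200), (110, 90, 230, 210))
```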
Fig. 4 illustrates a situation in which the three-dimensional center of a sub-point cloud falls outside the three-dimensional annotation frame in an embodiment of the present disclosure. Taking the second sub-point cloud PCS2 as the image point cloud cluster as an example, the image point cloud cluster generated by the view cone FT generally includes a number of noise points. For example, in fig. 4, most of the point data of the second sub-point cloud PCS2 are gathered at positions close to the camera Cam, while a few point data are scattered throughout the view cone FT, so that the center C1 (three-dimensional center) of the second sub-point cloud PCS2 is pulled by the noise points toward positions far from the camera Cam, and the center C1 therefore fails to fall within the three-dimensional annotation frame GTB1.
FIG. 5 illustrates a situation in which the center of the two-dimensional prediction box of a sub-point cloud falls within the two-dimensional annotation box in an embodiment of the disclosure. Taking the second sub-point cloud PCS2 as the image point cloud cluster as an example, in this embodiment, the generating process of the second sub-point cloud PCS2 involves two-dimensional instance segmentation of the camera image to obtain the contour CT corresponding to the target object O. Thus, when two-dimensional target object detection is performed on the second sub-point cloud PCS2 to generate a two-dimensional prediction frame 2DB, this two-dimensional prediction frame 2DB may cover the contour CT of the target object O. On the other hand, the two-dimensional annotation frame GTB2 generated based on the camera image also covers the contour CT of the target object O. Therefore, the center C2 of the two-dimensional prediction frame 2DB of the sub-point cloud falls within the two-dimensional annotation frame GTB2, and the two-dimensional prediction frame 2DB has a high probability of overlapping strongly with the two-dimensional annotation frame GTB2. When the sub-point cloud obtained by detection with the target object detection primary model is substantially an image point cloud cluster similar to the second sub-point cloud PCS2, its two-dimensional prediction frame 2DB has a high probability of overlapping strongly with the two-dimensional annotation frame GTB2 and thus meeting the matching requirement, so that the sub-point cloud is used as a positive sample for model training and is not mistakenly used as a negative sample.
In the matching requirement of the present embodiment, the threshold of the IoU ratio between the two-dimensional prediction frame and the two-dimensional annotation frame may be a value between 0.2 and 0.4, for example 0.3. When the threshold is greater than or equal to 0.2, the required degree of overlap between the two-dimensional prediction frame and the two-dimensional annotation frame is high enough to avoid many sub-point clouds that should be regarded as negative samples being mistakenly regarded as positive samples. When the threshold is less than or equal to 0.4, the required degree of overlap between the two-dimensional prediction frame and the two-dimensional annotation frame is not too high, avoiding the situation in which many image point cloud clusters that should be regarded as positive samples are mistakenly regarded as negative samples in the second round of assignment.
Specifically, the sub-point cloud obtained by detecting the target object with the target object detection primary model is substantially an image point cloud cluster similar to the second sub-point cloud PCS2. The step of detecting the target object according to the target object detection primary model to select a sub-point cloud corresponding to the target object from the point cloud includes: selecting a sub-point cloud corresponding to the target object from the point cloud according to the image, wherein the point cloud is from the first sensor and the image is from the second sensor. The step of selecting the sub-point cloud corresponding to the target object from the point cloud according to the image includes: aligning the image with the point cloud; defining a three-dimensional frame in the space where the point cloud is located according to the image; and selecting the part of the point cloud that falls into the three-dimensional frame to obtain the sub-point cloud. The step of aligning the image with the point cloud includes: calibrating the first sensor and the second sensor to align the image with the point cloud. In addition, the first sensor is a lidar and the second sensor is a camera. The step of defining the three-dimensional frame in the space where the point cloud is located according to the image includes: defining a first position of the camera in the space where the point cloud is located; and performing instance segmentation on the image to obtain a second position, in the space where the point cloud is located, of the target object in the image, wherein the three-dimensional frame is a view cone extending from the first position to the second position. Specifically, the step of performing instance segmentation on the image to obtain the second position of the target object in the image in the space where the point cloud is located includes: performing instance segmentation on the image to obtain a mask corresponding to the target object, wherein the contour of the surface of the view cone at the second position is the contour of the mask.
In other embodiments, the image may not be instance segmented to obtain a mask corresponding to the target object. The image prediction frame corresponding to the target object may be generated according to the image by other or conventional prediction methods of the target object, and the position information of the image prediction frame may be converted into the space where the point cloud is located, for example, through a calibration matrix, so as to obtain the third position of the image prediction frame in the space where the point cloud is located. In these embodiments, the view cone extends from the first position to the third position, and the contour of the surface of the view cone at the third position is, for example, the contour of this image prediction frame.
Specifically, for the description of the sub-point cloud obtained by detection with the target object detection primary model, reference may be made to the above description of the second sub-point cloud PCS2 as an image point cloud cluster, which is not repeated here.
In the embodiments of the present disclosure, the training method of the target object detection model provides training samples according to the first matching degree between the sub-point cloud and the three-dimensional annotation frame and the second matching degree between the two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame, so that the target object detection model can be trained based on both the point cloud and the image data. When the target object detection model is applied to the target object detection method, the corresponding prediction frame can therefore be generated more accurately, detection of the target object in a sparse point cloud is achieved, and the accuracy of target object detection is improved while the computational burden is kept small.
Fig. 1 is a schematic step diagram of a method for detecting a target object in an embodiment of the present disclosure, which may be performed by an apparatus for detecting a target object. The apparatus may be implemented in software and/or hardware and may typically be integrated in a computer device. The computer device here may be integrated into the movable object, or may be external to or in communication with the movable object. For a detailed description or extension of the various steps of fig. 1, reference may be made to the above description of the embodiments of the present disclosure. Referring to fig. 1, a method for detecting a target object according to an embodiment of the disclosure includes:
Step S100: selecting a first sub-point cloud corresponding to the target object from the point clouds according to the distribution of a plurality of point data in the point clouds; selecting a second sub-point cloud corresponding to the target object from the point clouds according to the image;
After step S100 is completed, the process proceeds to step S102. Step S102: generating a first prediction frame and a second prediction frame of the target object according to the first sub-point cloud and the second sub-point cloud respectively;
After step S102 is completed, the process proceeds to step S104. Step S104: selecting a third sub-point cloud and a fourth sub-point cloud from the point clouds according to the first prediction frame and the second prediction frame respectively;
After step S104 is completed, the process proceeds to step S106. Step S106: detecting the target object according to the third sub-point cloud and the fourth sub-point cloud.
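By way of illustration only, a high-level sketch of steps S100 to S106 is given below. The five callables passed in stand for the modules described above (foreground clustering, frustum selection, prediction frame generation, in-frame point selection, and frame fusion); their names and signatures are assumptions made for the sketch, not actual APIs of this disclosure.

```python
# Illustrative sketch of the detection pipeline of Fig. 1.

def detect_target_object(point_cloud, image, calibration,
                         cluster_foreground_points, frustum_points,
                         predict_box_3d, points_in_box, fuse_boxes):
    # S100: two candidate sub-point clouds, one from the point distribution, one from the image
    first_sub = cluster_foreground_points(point_cloud)
    second_sub = frustum_points(point_cloud, image, calibration)

    # S102: a prediction frame from each sub-point cloud
    first_box = predict_box_3d(first_sub)
    second_box = predict_box_3d(second_sub)

    # S104: re-select the points of the full cloud that fall inside each prediction frame
    third_sub = points_in_box(point_cloud, first_box)
    fourth_sub = points_in_box(point_cloud, second_box)

    # S106: final detection from the refined sub-point clouds
    third_box = predict_box_3d(third_sub)
    fourth_box = predict_box_3d(fourth_sub)
    return fuse_boxes(third_box, fourth_box)
```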
The method for detecting a target object described in this embodiment can at least achieve the technical effects of the target object detection method described above: it can detect a target object in a sparse point cloud and improve the accuracy of target object detection while keeping the computational burden small.
Fig. 2 is a schematic step diagram of a method of training a target object detection model in an embodiment of the present disclosure, which may be performed by an apparatus for training a target object detection model. The apparatus may be implemented in software and/or hardware and may typically be integrated in a computer device. The computer device here may be integrated into the movable object, or may be external to or in communication with the movable object. For a detailed description or extension of the various steps of fig. 2, reference may be made to the above description of the embodiments of the present disclosure. Referring to fig. 2, a method of training a target object detection model according to an embodiment of the disclosure includes:
Step S200: detecting a target object according to a target object detection primary model so as to select a sub-point cloud corresponding to the target object from point clouds;
After step S200 is completed, the process proceeds to step S202. Step S202: generating a three-dimensional annotation frame and a two-dimensional annotation frame corresponding to the target object;
After step S202 is completed, the process proceeds to step S204. Step S204: matching the sub-point cloud to the three-dimensional annotation frame according to the first matching degree of the sub-point cloud and the three-dimensional annotation frame and according to the second matching degree of the two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame;
After step S204 is completed, the process proceeds to step S206. Step S206: training the target object detection primary model through the sub-point cloud matched with the three-dimensional annotation frame to obtain a target object detection model.
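By way of illustration only, a minimal sketch of the two-round sample assignment used in step S204 is given below. It assumes that each candidate carries the 3D center of its sub-point cloud and a 2D prediction frame, that each ground truth carries a 3D annotation frame and a 2D annotation frame, and that the two predicate callables are supplied by the caller; all names and the threshold value are illustrative.

```python
# Illustrative sketch of the two-round assignment of sub-point clouds to annotation frames.

def assign_samples(candidates, ground_truths, center_inside_3d_box, iou_2d,
                   iou_threshold=0.3):
    positives = []
    for cand in candidates:
        for gt in ground_truths:
            # First round: the sub-point cloud's 3D center falls inside the 3D annotation frame.
            if center_inside_3d_box(cand["center_3d"], gt["box_3d"]):
                positives.append((cand, gt))
                break
            # Second round: fall back to the 2D IoU between prediction and annotation frames.
            if iou_2d(cand["box_2d"], gt["box_2d"]) >= iou_threshold:
                positives.append((cand, gt))
                break
    return positives  # candidates matched to an annotation frame become training samples
```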
The training method of the target object detection model described in this embodiment can at least achieve the technical effects of the training method described above: it provides training samples based on both the point cloud and the image data, so that the target object detection model can detect a target object in a sparse point cloud and the accuracy of target object detection is improved while the computational burden is kept small.
The disclosed embodiments provide a computer device into which the apparatus for detecting a target object and/or the apparatus for training a target object detection model provided by the disclosed embodiments may be integrated. Fig. 7 is a block diagram of a computer device according to an embodiment of the present disclosure. The computer device 100 may include a memory 101, a processor 102, and a computer program stored on the memory 101 and executable by the processor 102; the processor 102 implements the method for detecting a target object and/or the method of training a target object detection model according to the embodiments of the present disclosure when executing the computer program. The computer device here may be integrated into a movable object, in which case the computer device may also be considered to be the movable object itself, such as a vehicle; the computer device may also be external to or in communication with the movable object.
The computer device provided by the embodiments of the present disclosure can implement the method for detecting a target object and/or the method of training a target object detection model, and can at least achieve the corresponding technical effects: detection of a target object in a sparse point cloud and improved accuracy of target object detection while the computational burden is kept small.
The disclosed embodiments also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, implement the method for detecting a target object and/or the method of training a target object detection model as provided by the disclosed embodiments.
A storage medium may be any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, and the like; non-volatile memory such as flash memory, magnetic media (e.g., a hard disk), or optical storage; registers or other similar types of memory elements, and so on. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or may be located in a second, different computer system connected to the first computer system through a network such as the Internet. The second computer system may provide program instructions to the first computer system for execution. The term "storage medium" may include two or more storage media that may reside in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
The apparatus, the device, and the storage medium for detecting a target object provided in the above embodiments can execute the method for detecting a target object provided in any embodiment of the present disclosure, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the method for detecting a target object provided in any embodiment of the present disclosure.
The apparatus, the device, and the storage medium for training a target object detection model provided in the above embodiments can execute the training method of the target object detection model provided in any embodiment of the present disclosure, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the training method of the target object detection model provided in any embodiment of the present disclosure.
Although exemplary embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above exemplary discussion is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. Therefore, the disclosed subject matter should not be limited to any single embodiment or example described herein, but rather should be construed in breadth and scope in accordance with the appended claims.

Claims (30)

1. A method of detecting a target object, comprising:
selecting a first sub-point cloud corresponding to a target object from point clouds according to the distribution of a plurality of point data in the point clouds;
selecting a second sub-point cloud corresponding to the target object from the point clouds according to the image;
generating a first prediction frame and a second prediction frame of the target object according to the first sub-point cloud and the second sub-point cloud respectively;
selecting a third sub-point cloud and a fourth sub-point cloud from the point clouds according to the first prediction frame and the second prediction frame, respectively; and
detecting the target object according to the third sub-point cloud and the fourth sub-point cloud.
2. The method of claim 1, wherein selecting a first sub-point cloud from the point clouds that corresponds to the target object based on a distribution of the plurality of point data in the point clouds comprises:
predicting, based on the plurality of point data, the point data belonging to the foreground of the target object to obtain a plurality of foreground point data; and
clustering the plurality of foreground point data to obtain the first sub-point cloud, wherein the first sub-point cloud comprises a plurality of first point data corresponding to the target object.
3. The method of claim 2, wherein clustering the plurality of foreground point data to obtain the first sub-point cloud comprises:
Determining a plurality of center point data representing the target object; wherein each of the plurality of center point data represents a center of the target object predicted by one of the plurality of foreground point data;
Respectively calculating the distances among the plurality of center point data; and
Adding at least one of the plurality of foreground point data satisfying a condition to the first sub-point cloud; wherein the condition is that the distance is less than or equal to a preset distance.
4. The method of claim 1, wherein the point cloud is from a first sensor, the image is from a second sensor, and selecting a second sub-point cloud from the point clouds that corresponds to the target object based on the image comprises:
aligning the image with the point cloud;
defining a three-dimensional frame in the space where the point cloud is located according to the image; and
selecting a part falling into the three-dimensional frame from the point cloud to obtain the second sub-point cloud.
5. The method of claim 4, wherein aligning the image with the point cloud comprises:
Calibrating the first sensor and the second sensor to align the image with the point cloud.
6. The method of claim 4, wherein the first sensor is a lidar and the second sensor is a camera, wherein defining a three-dimensional box in a space in which the point cloud resides from the image comprises:
defining a first position of the camera in the space where the point cloud is located; and
performing instance segmentation on the image to obtain a second position of the target object in the image in the space where the point cloud is located, wherein the three-dimensional frame is a view cone, and the three-dimensional frame extends from the first position to the second position.
7. The method of claim 6, wherein performing instance segmentation on the image to obtain a second location of the target object in the image in a space in which the point cloud is located comprises:
performing instance segmentation on the image to obtain a mask corresponding to the target object, wherein the contour of the surface of the view cone at the second position is the contour of the mask.
8. The method of claim 4, wherein the first sensor is a lidar and the second sensor is a camera, wherein defining a three-dimensional box in a space in which the point cloud resides from the image comprises:
defining a first position of the camera in the space where the point cloud is located;
generating a two-dimensional prediction frame corresponding to the target object according to the image; and
obtaining a third position of the two-dimensional prediction frame in the space where the point cloud is located, wherein the three-dimensional frame is a view cone and extends from the first position to the third position.
9. The method of claim 4, wherein the first sensor and the second sensor are mounted on a movable object or a roadside apparatus.
10. The method of claim 1, wherein generating a first prediction frame and a second prediction frame of the target object from the first sub-point cloud and the second sub-point cloud, respectively, comprises:
extracting first point data features of a plurality of first point data in the first sub-point cloud to obtain a first sub-point cloud feature;
generating the first prediction frame representing the predicted target object based on the first sub-point cloud feature;
extracting second point data features of a plurality of second point data in the second sub-point cloud to obtain a second sub-point cloud feature; and
generating the second prediction frame representing the predicted target object based on the second sub-point cloud feature.
11. The method of claim 1, wherein the first prediction frame and the second prediction frame are three-dimensional, wherein selecting a third sub-point cloud and a fourth sub-point cloud from the point clouds according to the first prediction frame and the second prediction frame, respectively, comprises:
selecting a part falling into the first prediction frame from the point cloud to obtain the third sub-point cloud; and
selecting a part falling into the second prediction frame from the point cloud to obtain the fourth sub-point cloud.
12. The method of claim 1, wherein detecting the target object from the third sub-point cloud and the fourth sub-point cloud comprises:
extracting third point data features of a plurality of third point data in the third sub-point cloud to obtain a third sub-point cloud feature;
generating a third prediction frame representing the predicted target object based on the third sub-point cloud feature;
extracting fourth point data features of a plurality of fourth point data in the fourth sub-point cloud to obtain a fourth sub-point cloud feature;
generating a fourth prediction frame representing the predicted target object based on the fourth sub-point cloud feature; and
detecting the target object according to the third prediction frame and the fourth prediction frame.
13. The method of claim 12, wherein detecting the target object according to the third prediction frame and the fourth prediction frame comprises:
deleting one of the third prediction frame and the fourth prediction frame according to a degree of overlap between the third prediction frame and the fourth prediction frame.
14. The method of claim 13, wherein the third prediction frame and the fourth prediction frame each have a confidence level representing the target object, and wherein deleting one of the third prediction frame and the fourth prediction frame based on a degree of overlap between the third prediction frame and the fourth prediction frame comprises:
calculating an intersection ratio between the third prediction frame and the fourth prediction frame; and
deleting, in response to the intersection ratio being higher than a threshold, the one of the third prediction frame and the fourth prediction frame having the lower confidence level.
15. The method of claim 1, wherein generating a first prediction frame and a second prediction frame of the target object from the first sub-point cloud and the second sub-point cloud, respectively, comprises:
inputting the first sub-point cloud to a first neural network to generate the first prediction frame; and
inputting the second sub-point cloud to a second neural network to generate the second prediction frame; and
wherein detecting the target object according to the third sub-point cloud and the fourth sub-point cloud comprises:
inputting the third sub-point cloud and the fourth sub-point cloud respectively to a third neural network to detect the target object.
16. The method of claim 15, wherein the first neural network, the second neural network, and the third neural network have the same structure.
17. The method of claim 15, wherein the first neural network, the second neural network, and the third neural network are the same neural network.
18. A method of training a target object detection model, comprising:
detecting a target object according to a target object detection primary model so as to select a sub-point cloud corresponding to the target object from point clouds;
generating a three-dimensional annotation frame and a two-dimensional annotation frame corresponding to the target object;
matching the sub-point cloud to the three-dimensional annotation frame according to a first matching degree of the sub-point cloud and the three-dimensional annotation frame and according to a second matching degree of a two-dimensional prediction frame corresponding to the sub-point cloud and the two-dimensional annotation frame; and
training the target object detection primary model through the sub-point cloud matched to the three-dimensional annotation frame to obtain the target object detection model.
19. The method of claim 18, wherein matching the sub-point cloud to the three-dimensional annotation frame according to a first degree of matching of the sub-point cloud to the three-dimensional annotation frame and according to a second degree of matching of a two-dimensional prediction frame corresponding to the sub-point cloud to the two-dimensional annotation frame comprises:
in response to a three-dimensional center of the sub-point cloud falling within the three-dimensional annotation frame, matching the sub-point cloud to the three-dimensional annotation frame; and
in response to the three-dimensional center falling outside the three-dimensional annotation frame while the two-dimensional prediction frame and the two-dimensional annotation frame meet a matching requirement, matching the sub-point cloud to the three-dimensional annotation frame.
20. The method of claim 19, wherein the matching requirement is that an intersection ratio between the two-dimensional prediction frame and the two-dimensional annotation frame is above a threshold.
21. The method of claim 20, wherein the threshold is a value between 0.2 and 0.4.
22. The method of claim 18, wherein detecting a target object according to a target object detection primary model to select a sub-point cloud corresponding to the target object from point clouds comprises:
selecting the sub-point cloud corresponding to the target object from the point clouds according to an image, wherein the point clouds are from a first sensor, the image is from a second sensor, and selecting the sub-point cloud corresponding to the target object from the point clouds according to the image comprises:
aligning the image with the point cloud;
defining a three-dimensional frame in the space where the point cloud is located according to the image; and
selecting a part falling into the three-dimensional frame from the point cloud to obtain the sub-point cloud.
23. The method of claim 22, wherein aligning the image with the point cloud comprises:
Calibrating the first sensor and the second sensor to align the image with the point cloud.
24. The method of claim 22, wherein the first sensor is a lidar and the second sensor is a camera, wherein defining a three-dimensional box in a space in which the point cloud resides from the image comprises:
defining a first position of the camera in the space where the point cloud is located; and
performing instance segmentation on the image to obtain a second position of the target object in the image in the space where the point cloud is located, wherein the three-dimensional frame is a view cone, and the three-dimensional frame extends from the first position to the second position.
25. The method of claim 24, wherein performing instance segmentation on the image to obtain a second location of the target object in the image in a space in which the point cloud is located comprises:
performing instance segmentation on the image to obtain a mask corresponding to the target object, wherein the contour of the surface of the view cone at the second position is the contour of the mask.
26. The method of claim 22, wherein the first sensor is a lidar and the second sensor is a camera, wherein defining a three-dimensional box in a space in which the point cloud resides from the image comprises:
defining a first position of the camera in the space where the point cloud is located;
generating an image prediction frame corresponding to the target object according to the image; and
obtaining a third position of the image prediction frame in the space where the point cloud is located, wherein the three-dimensional frame is a view cone and extends from the first position to the third position.
27. The method of claim 18, wherein the target object detection primary model is a neural network model.
28. An apparatus for detecting a target object, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-17 when executing the computer program.
29. An apparatus for training a target object detection model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 18-27 when the computer program is executed.
30. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-27.