CN118776545A - Semantic map construction method and device - Google Patents
- Publication number
- CN118776545A (application number CN202310399171.5A)
- Authority
- CN
- China
- Prior art keywords
- map
- frame
- semantic
- image
- map semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application provides a method for constructing a semantic map, comprising: obtaining information of map semantic elements in a multi-frame image; obtaining a data-association frame sequence corresponding to a map semantic element, wherein the images in the sequence contain that element; obtaining pixel-point matching pairs of the map semantic element in a first image frame and a second image frame, both of which are image frames in the data-association frame sequence; obtaining spatial position information of the map semantic element based on the pixel-point matching pairs, the camera pose corresponding to the first image frame and the camera pose corresponding to the second image frame; and constructing a three-dimensional semantic map based on the spatial position information of the map semantic elements. The method improves the real-time performance and robustness of semantic map construction.
Description
Technical Field
The application relates to the technical field of map construction, in particular to a semantic map construction method and device.
Background
Semantic map construction can be applied in many fields, such as robotics, mobile terminals and automatic driving of vehicles; when applied to automatic driving in particular, it is important for real-time positioning, path planning, speed planning and the like. Depending on the sensor used, existing semantic map construction schemes can be divided into lidar-based schemes and vision-based schemes.
However, lidar-based semantic map construction has drawbacks: lidar is expensive, inconvenient to install, power-hungry and short-lived, and its data are difficult to fuse, which hinders large-scale adoption in automatic driving. Existing vision-based semantic map construction schemes, in turn, mostly suffer from problems of real-time performance and robustness.
Disclosure of Invention
The embodiments of the application provide a method and a device for constructing a semantic map, which improve the real-time performance and robustness of semantic map construction.
In a first aspect, an embodiment of the present application provides a method for constructing a semantic map, including: acquiring information of a map semantic element in a multi-frame image; obtaining a data-association frame sequence corresponding to the map semantic element, wherein the images in the sequence contain the element; obtaining pixel-point matching pairs of the map semantic element in a first image frame and a second image frame, both of which are image frames in the data-association frame sequence; obtaining spatial position information of the map semantic element based on the pixel-point matching pairs, the camera pose corresponding to the first image frame and the camera pose corresponding to the second image frame; and constructing a three-dimensional semantic map based on the spatial position information of the map semantic elements.
In this method, the data-association frame sequence corresponding to each map semantic element is obtained through data association; pixel-point (also called general-point) matching is then performed for the map semantic element within that sequence to obtain pixel-point matching pairs, and the matching pairs are used to reconstruct the semantic element in three dimensions. Three-dimensional reconstruction of the semantic map is thus achieved without relying on feature points, improving the real-time performance and robustness of semantic map construction.
In one possible implementation, obtaining the data-association frame sequence corresponding to a map semantic element includes: obtaining bounding boxes (Bboxes) of map semantic elements in each of the multi-frame images; and, for each map semantic element, performing inter-frame matching based on the bounding boxes and taking the successfully matched image frames as the data-association frame sequence corresponding to that element. The similarity of the element's bounding boxes between successfully matched image frames is greater than a preset threshold.
In this possible implementation, the similarity of a map semantic element's bounding box between different image frames is used to determine which image frames have a data-association relationship for that element; for example, image frames whose bounding-box similarity for the element is greater than or equal to a preset threshold are determined to be associated with the element. Having a data-association relationship is understood to mean that the image frames contain the same map semantic element.
In order to facilitate calculating the similarity of the bounding boxes of map semantic elements, bounding boxes that do not meet a preset condition are preprocessed: for example, a bounding-box relaxation operation (Bbox relaxation) is applied to the edge contour of a map semantic element whose area is smaller than a threshold T1 or whose aspect ratio is larger than a threshold T2, yielding the bounding box of the element.
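The relaxation preprocessing above can be sketched as follows. This is a minimal Python illustration; the concrete threshold values T1, T2 and the fixed-padding relaxation rule are assumptions for demonstration, not values specified by the patent.

```python
# Illustrative sketch of the Bbox relaxation preprocessing.
# Thresholds t1/t2 and the padding rule are assumed, not from the patent.

def relax_bbox(bbox, t1=400.0, t2=5.0, pad=4):
    """bbox = (x_min, y_min, x_max, y_max) around an element's edge contour."""
    x0, y0, x1, y1 = bbox
    w, h = x1 - x0, y1 - y0
    area = w * h
    aspect = max(w, h) / max(min(w, h), 1e-6)
    # Relax (enlarge) the box only for small or very elongated contours,
    # e.g. a thin lamp post, so that IOU-based similarity stays meaningful.
    if area < t1 or aspect > t2:
        return (x0 - pad, y0 - pad, x1 + pad, y1 + pad)
    return bbox
```

A thin lamp-post contour such as `(0, 0, 2, 100)` would be padded, while a compact sign such as `(0, 0, 50, 50)` would pass through unchanged.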
In another possible implementation, the similarity of a map semantic element's bounding boxes between image frames is positively correlated with the intersection-over-union of those bounding boxes and inversely correlated with the distance between them.
For example, the bounding-box positions of all map semantic elements in the previous frame are mapped into the current frame. For each candidate matching pair within the same element category, the intersection-over-union (Bbox intersection over union, Bbox IOU) of the two bounding boxes is computed and multiplied by an IOU similarity factor Ki to obtain the IOU similarity Si of the pair; the reciprocal 1/d of the Bbox vertex distance between each element of the previous frame and each same-category element of the current frame is computed and multiplied by a vertex-distance similarity factor Kd (where Ki + Kd = 1) to obtain the vertex-distance similarity Sd of the pair. The similarity Si + Sd of each matching pair is then computed; when the maximum similarity satisfies Smax > Ts, the map semantic element of the previous frame obtains a data association in the current frame. Otherwise, the element is not data-associated.
Optionally, Ki and Kd are preset according to empirical values, for example Ki = 0.7 and Kd = 0.3, so that the Bbox IOU carries more weight in the similarity calculation.
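The Si + Sd similarity described above can be sketched as follows. This is a hedged illustration: the exact vertex-distance definition (mean corner distance here) and the form Sd = Kd/d are assumptions consistent with the text, using Ki = 0.7 and Kd = 0.3.

```python
import math

def bbox_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def vertex_distance(a, b):
    """Mean Euclidean distance between corresponding box corners (assumed form)."""
    ca = [(a[0], a[1]), (a[2], a[1]), (a[2], a[3]), (a[0], a[3])]
    cb = [(b[0], b[1]), (b[2], b[1]), (b[2], b[3]), (b[0], b[3])]
    return sum(math.dist(p, q) for p, q in zip(ca, cb)) / 4.0

def pair_similarity(prev_box, cur_box, ki=0.7, kd=0.3):
    """S = Si + Sd = Ki * IOU + Kd * (1 / d), with Ki + Kd = 1."""
    si = ki * bbox_iou(prev_box, cur_box)
    sd = kd / max(vertex_distance(prev_box, cur_box), 1e-6)
    return si + sd
```

A perfectly overlapping pair scores much higher than a shifted one, so the per-pair maximum Smax can be compared against the threshold Ts as described.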
In the process of extracting map semantic elements, missed detections may occur, i.e. a map semantic element (for example, a street lamp) is not extracted in some frame. This interrupts the data association, and when the same element is detected again in a new frame, it may be treated as a new, independently observed object because of the interruption in the previous frame, leading to erroneous duplicate reconstruction.
In view of this, the method for constructing a semantic map provided by the embodiments of the present application provides a keep-alive (KEEP ALIVE) mechanism for map semantic elements: based on this mechanism, the data-association frame sequence corresponding to a map semantic element is determined from the multi-frame images.
For example, if tracking is interrupted at time t+1, the Bbox at time t is mapped to time t+2. If association succeeds at time t+2, association of the map semantic element continues with the next frame; otherwise, the association sequence of the element is terminated at the image frame corresponding to time t.
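A minimal sketch of this keep-alive behavior, assuming a `match` helper that implements the IOU/vertex-distance matching described earlier; the function names and the track data layout are illustrative, not from the patent.

```python
# Hedged sketch of the KEEP ALIVE mechanism: when a tracked element is missed
# at time t+1, its last Bbox from time t is carried forward and re-matched at
# t+2 before the association sequence is terminated.

def keep_alive_track(track, detections_by_time, match, max_gap=1):
    """track: {'last_bbox': Bbox, 'last_t': int, 'frames': [int, ...]}."""
    t_next = track['last_t'] + 1
    for gap in range(max_gap + 1):
        t = t_next + gap
        if t not in detections_by_time:
            return False          # no further frames: stop at last_t
        hit = match(track['last_bbox'], detections_by_time[t])
        if hit is not None:
            track['last_bbox'] = hit
            track['last_t'] = t
            track['frames'].append(t)
            return True           # association recovered, keep tracking
    return False                  # unmatched beyond the allowed gap: terminate
```

With `max_gap=1`, a single missed detection at t+1 is bridged by re-matching the time-t box at t+2, exactly the case described in the text.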
In another possible implementation, in order to improve the accuracy of determining the spatial position of a map semantic element (which may also be called its three-dimensional reconstruction), the quality of the image frame pair must be ensured: the two frames of the best quality are selected from the element's data-association frames as the key frame pair. The choice of key frame pair depends on the completeness factor of the map semantic element in the image frames (the completeness factor for short), the parallax factor of the element between different image frames (the parallax factor for short), and the accuracy of the camera pose corresponding to the image frame containing the element (the pose accuracy for short). The key frame pair can therefore be determined from the element's data-association frame sequence according to the completeness, parallax and pose-accuracy factors.
For example, for any map semantic element, when extracting the key frame pair, the first image frame is selected by scanning the element's data-association frame sequence in reverse order according to the completeness factor, and the second image frame is the one maximizing the sum of the completeness factor, the parallax factor and the pose accuracy.
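The key-frame-pair selection can be sketched as follows; the three factor functions are placeholders for the completeness, parallax and pose-accuracy scores, whose exact definitions the text does not give, so treat them as assumptions.

```python
# Sketch of key-frame-pair selection: the first frame maximizes the element's
# completeness factor (searched in reverse order over the association
# sequence), the second maximizes completeness + parallax + pose accuracy.

def select_keyframe_pair(frames, completeness, parallax, pose_accuracy):
    # First key frame: scan the association sequence in reverse order and
    # take the first frame with the maximal completeness factor.
    first = max(reversed(frames), key=completeness)
    # Second key frame: maximize the summed score relative to the first.
    rest = [f for f in frames if f != first]
    second = max(rest, key=lambda f: completeness(f)
                 + parallax(first, f) + pose_accuracy(f))
    return first, second
```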
In another possible implementation, obtaining the pixel-point matching pairs of a map semantic element in the first and second image frames includes: sampling N first pixel points along the element's edge contour in the first image frame, N being a positive integer greater than 1; determining the fundamental matrix of the epipolar geometry from the camera pose corresponding to the first image frame, the camera pose corresponding to the second image frame and the intrinsic parameters of the camera; determining, from the fundamental matrix, the N epipolar lines of the N first pixel points in the second image frame; determining N second pixel points from the N intersections of these epipolar lines with the element's edge contour in the second image frame; and determining the element's pixel-point matching pairs in the frame pair from the N first and N second pixel points.
In this possible implementation, several pixel points are sampled along the element's edge contour in the image frame; for example, N pixel points are uniformly sampled on the edge contour in the first image frame (e.g. taking the contour center as the endpoint of a cluster of rays spaced 360/N degrees apart and using the ray-contour intersections as the N points). Matching these N points into the second image yields N matching pairs, which are then used for three-dimensional reconstruction of the element. This realizes a sparse reconstruction of the map semantic element and improves the real-time performance of three-dimensional semantic map construction.
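The epipolar step can be illustrated numerically. This is a numpy sketch: the world-to-camera pose convention x_cam = R·x + t and shared intrinsics K are assumptions; the candidate match is then the intersection of the returned line with the element's contour in the second frame.

```python
# Build the fundamental matrix F from two camera poses and intrinsics, then
# map a pixel of the first frame to its epipolar line l' = F x in the second.
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R1, t1, R2, t2):
    """World-to-camera poses x_cam = R x + t, shared intrinsics K."""
    R = R2 @ R1.T                # relative rotation, camera 1 -> camera 2
    t = t2 - R @ t1              # relative translation
    E = skew(t) @ R              # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv   # fundamental matrix F

def epipolar_line(F, px):
    """Coefficients (a, b, c) of the line a*u + b*v + c = 0 in frame 2."""
    l = F @ np.array([px[0], px[1], 1.0])
    return l / np.linalg.norm(l[:2])  # normalize the direction part
```

For a true correspondence the second-frame pixel lies on the returned line, which is the property the contour-intersection step relies on.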
To further improve the accuracy of three-dimensional semantic map construction, the method provided by the embodiments of the present application further includes, after constructing the three-dimensional semantic map from the spatial position information of each map semantic element, an optimization step for the element positions: for example, constructing a map-semantic-element error map that indicates, for each image frame of the element's data-association frames, the error of re-projecting the element from the three-dimensional semantic map into that frame; adjusting the element's spatial position information based on this error map; and optimizing the constructed three-dimensional semantic map based on the adjusted spatial position information.
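A hedged sketch of the re-projection-based position adjustment: the patent does not specify the optimizer, so plain gradient descent with finite differences is used here purely for illustration of how accumulated re-projection error can drive the 3D position update.

```python
import numpy as np

def project(P, X):
    """Project 3D point X with a 3x4 projection matrix P to pixel (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def refine_position(X0, cams, observations, step=1e-5, iters=300, eps=1e-6):
    """Minimize the summed squared re-projection error over all associated
    frames by plain gradient descent with central finite differences."""
    X = np.asarray(X0, dtype=float)

    def total_error(p3d):
        return sum(np.sum((project(P, p3d) - np.asarray(obs)) ** 2)
                   for P, obs in zip(cams, observations))

    for _ in range(iters):
        grad = np.zeros(3)
        for i in range(3):
            d = np.zeros(3)
            d[i] = eps
            grad[i] = (total_error(X + d) - total_error(X - d)) / (2 * eps)
        X = X - step * grad   # descend the re-projection error
    return X
```

In practice a Gauss-Newton or Levenberg-Marquardt solver would replace this loop; the point is only that every frame in the element's association sequence contributes an error term.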
In a second aspect, an embodiment of the present application further provides a method for planning a route of a vehicle, including constructing a three-dimensional semantic map of an environment around the vehicle based on the method described in the first aspect, and planning a route of the vehicle based on the three-dimensional semantic map.
In a third aspect, an embodiment of the present application further provides a semantic map construction apparatus, including an acquisition module, a data association module, a matching module, a determination module and a construction module. The acquisition module is used for acquiring information of map semantic elements in multi-frame images; the data association module is used for obtaining the data-association frame sequence corresponding to a map semantic element, the images in the sequence containing the element; the matching module is used for obtaining pixel-point matching pairs of the element in a first image frame and a second image frame, both frames belonging to the data-association frame sequence; the determination module is used for obtaining the spatial position information of the element based on the matching pairs, the camera pose corresponding to the first image frame and the camera pose corresponding to the second image frame; and the construction module is used for constructing a three-dimensional semantic map based on the spatial position information of the map semantic elements.
In one possible implementation, the data association module is specifically configured to: obtain the bounding boxes of map semantic elements in each of the multi-frame images; and, for each map semantic element, perform inter-frame matching based on the bounding boxes and take the successfully matched image frames as the data-association frame sequence corresponding to that element, the similarity of the element's bounding boxes between successfully matched frames being greater than a preset threshold.
In another possible implementation, the data association module is further configured to: acquire the edge contour of the map semantic element in each of the multi-frame images; for each frame, if the edge contour meets a preset condition, perform a relaxation operation on it and determine the element's bounding box from the relaxed contour, the preset condition being that the contour area is smaller than or equal to a first threshold or the contour aspect ratio is larger than or equal to a second threshold; and if the contour does not meet the preset condition, determine the element's bounding box directly from the contour.
In another possible implementation, the similarity of a map semantic element's bounding boxes between image frames is positively correlated with the intersection-over-union of those bounding boxes and inversely correlated with the distance between them.
In another possible implementation, the data association module is further configured to determine, based on the keep-alive (KEEP ALIVE) mechanism, the data-association frame sequence corresponding to the map semantic element from the multi-frame images.
In another possible implementation, the first image frame and the second image frame are determined from a sequence of data-associated frames corresponding to the map semantic element based on one or more of a completeness factor of the map semantic element in the image frames, a parallax factor of the map semantic element between different image frames, and a camera pose accuracy corresponding to the image frame in which the map semantic element is located.
In another possible implementation, the matching module is specifically configured to: sample N first pixel points along the map semantic element's edge contour in the first image frame, N being a positive integer greater than 1; determine the fundamental matrix of the epipolar geometry from the camera poses of the first and second image frames and the intrinsic parameters of the camera; determine, from the fundamental matrix, the N epipolar lines of the N first pixel points in the second image frame; determine N second pixel points from the N intersections of these epipolar lines with the element's edge contour in the second image frame; and determine the element's pixel-point matching pairs in the frame pair from the N first and N second pixel points.
In another possible implementation, the semantic map construction apparatus provided by the embodiments of the present application further includes an optimization module configured to adjust the spatial position information of a map semantic element based on the error map corresponding to the element, the error map indicating, for each image frame of the element's data-association frames, the error of re-projecting the element from the three-dimensional semantic map into that frame, and to adjust the three-dimensional semantic map based on the adjusted spatial position information of the element.
In a fourth aspect, an embodiment of the present application further provides a vehicle, including a device for constructing a semantic map according to any one of possible implementation manners of the third aspect.
In a fifth aspect, the present application provides a computing device comprising a memory and a processor, the memory having instructions stored therein which, when executed by the processor, cause the method of the first and/or second aspect to be carried out.
In a sixth aspect, the present application provides a computer storage medium comprising computer instructions which, when executed by a processor, cause the method of the first and/or second aspect to be carried out.
In a seventh aspect, the present application provides a computer program or computer program product comprising instructions which, when executed, cause a computer to perform the method of the first and/or second aspects.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Drawings
Fig. 1 is a system architecture diagram of a semantic map building device according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for constructing a semantic map according to an embodiment of the present application;
- Fig. 3 shows the IOU between the light-pole Bbox of the previous frame mapped into the current frame and the light-pole Bbox of the current frame;
- Fig. 4 shows the vertex distance between the light-pole Bbox of the previous frame mapped into the current frame and the light-pole Bbox of the current frame;
FIG. 5 is a diagram of a KEEP ALIVE mechanism for map semantic element tracking;
FIG. 6a is a schematic diagram of the integrity factor of a light pole in an image frame;
FIG. 6b is a schematic diagram of the integrity factor of a light pole in another image frame;
FIG. 7 is a schematic view of parallax at the same observation point in different image frames;
FIG. 8 illustrates a pose accuracy schematic for candidate image frames;
- FIG. 9 shows a schematic of constructing general-point matching pairs based on epipolar mapping;
FIG. 10 shows a schematic of fitting a point cloud of a linear object and a schematic of fitting a point cloud of a planar object;
- FIG. 11 shows the semantic association results of poles and signs constructed using the map-semantic-element association in the semantic map construction method provided by the embodiment of the present application;
FIG. 12 is a semantic map constructed by a vehicle using the semantic map construction method provided by the embodiment of the present application;
fig. 13 is a schematic structural diagram of a semantic map building device according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
To realize automatic driving, it is necessary to model the static and dynamic objects around the vehicle and provide input signals for the real-time positioning, path planning, speed planning, etc. of the automatic driving system. The static environment is modeled as a sparse semantic map, whose elements must include not only traffic-attribute information (such as lane lines, stop lines, road edges, pavement markers, traffic lights, traffic signs, lamp posts, guardrails and the like) but also the spatial position information of the various map elements, and in some special cases, such as traffic signs, even their orientation information.
Semantic map construction can be divided, according to the sensor adopted, into lidar-based and vision-based schemes; vision-based construction can be further divided into monocular, binocular, depth-camera and other technical routes depending on the visual sensor. Lidar is an active measuring device that directly measures the distance between the scanned object and the sensor, so if the category of the scanned object can be identified, the semantic information of map elements is relatively easy to acquire. However, the scanning-angle interval of the lidar beams is fixed: when an object is far from the sensor, the points obtained on it are sparse, and compared with an image, its attribute information is harder to obtain. Moreover, for information such as the characters on a sign, laser scanning points cannot recover the character outlines, so such information cannot be obtained at all. Lidar is also expensive, inconvenient to install, power-hungry, short-lived and hard to fuse with other data, which hinders its large-scale adoption in automatic driving. A binocular vision sensor has two cameras; by identifying the same feature points and using the baseline determined between the two cameras, corresponding points can be directly triangulated to obtain their spatial positions. However, such a sensor can only recover the spatial positions of points with distinctive features, and since semantic map elements usually carry no such feature points in the image, they cannot be constructed this way.
Since the binocular baseline is generally small, when the object is far from the camera, detection errors easily cause large triangulation errors, so a binocular camera only guarantees ranging accuracy for relatively close objects. In addition, binocular cameras have strict installation requirements, which hinders mass deployment. Depth-camera ranging schemes are easily affected by illumination, and illumination varies greatly in complex outdoor scenes, which is unfavorable for depth cameras.
Monocular camera ranging is typically implemented with an SFM (structure from motion) algorithm: motion information is used to obtain observations of the same object in different frames; the object is identified, the inter-frame baseline is computed from the motion information, and the object is triangulated to obtain its spatial position. The most common algorithm triangulates sparse feature points, but since map semantic elements usually carry no distinctive feature points, they cannot be triangulated this way. Dense reconstruction methods can recover dense results for semantic elements under the photometric-consistency assumption, but dense reconstruction is time-consuming and unfavorable for real-time map construction.
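The two-view triangulation underlying such SFM pipelines can be sketched with the standard linear (DLT) method; this is illustrative numpy code, not the patent's implementation.

```python
# Two-view triangulation by the linear (DLT) method: given projection
# matrices P1, P2 and a matched pixel pair, the 3D point is the null-space
# solution of the stacked homogeneous constraints.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # least-squares null vector of A
    X = vt[-1]
    return X[:3] / X[3]           # dehomogenize to a Euclidean 3D point
```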
Several technical schemes have been proposed in the related art. In scheme one, a target-detection Bbox is extracted from each frame, the image ROI inside the bounding box is cropped based on the Bbox, features of each ROI are extracted by deep learning, frames are then matched on the extracted features, and a tracking result is obtained from the matching degree.
This scheme extracts features from the image content inside the map-semantic-element Bbox and matches them to obtain inter-frame matching results. Because map semantic elements have no distinctive features, the matching degree inferred by deep learning from the Bbox image content is poor and effective tracking cannot be achieved. Missed detections are also not handled: when a frame misses a map semantic element, several matching sequences may appear for the same object, and reconstructing each sequence leads to multiple reconstructions of the same element.
In scheme two, based on feature-point extraction and tracking, the 3D positions of the feature points and the camera pose are estimated synchronously, together with the camera distortion parameters, and the map semantic elements are reconstructed from the reconstruction results of their feature points.
This scheme mainly reconstructs the 3D information of feature points; since map semantic elements usually have few or no feature points, it is unfavorable for reconstructing them.
In scheme three, the current frame image is first obtained through a camera; the current frame is then processed with a semantic segmentation algorithm to obtain its semantic information; the camera pose corresponding to the current frame is optimized by minimizing the reconstruction semantic error with respect to a reference frame; and depth information of immature points (points on the reference frame without depth information) and their corresponding points on the current frame is determined based on the optimized pose, forming mature points with depth. In this way, the camera pose corresponding to the image frame can be optimized from semantic information, so an accurate map point cloud is obtained more stably.
This scheme has the following drawbacks: its semantic loss function is the semantic class difference of pixel points, and when there are many classes this difference is not monotonic, which reduces the efficiency of error optimization; and it performs 3D recovery for all pixels in the key frame image, i.e. dense reconstruction, whose very large computation cost is unfavorable for real-time three-dimensional reconstruction.
To address the defects of the above schemes, the embodiments of the present application provide a semantic map construction method and device. Data association of map semantic elements yields the associated data of each element, solving the association difficulty caused by their lack of salient feature points. A pixel point matching relation is then built on top of this data association relation to realize three-dimensional reconstruction of the map semantic elements, so feature extraction and three-dimensional reconstruction do not depend on specific points of an element (such as the four corner points of a traffic sign), improving the real-time performance and robustness of three-dimensional semantic map construction.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
Fig. 1 is a system architecture diagram of a semantic map construction device according to an embodiment of the present application. As shown in FIG. 1, the device mainly comprises three subsystems: a semantic element multi-frame data association subsystem, a semantic element three-dimensional reconstruction subsystem, and a semantic element position optimization subsystem.
Image frame sequence information is obtained from end-side data, including the image data of each frame in the sequence, the camera pose corresponding to each frame, and the edge contour information of the map semantic elements in each frame. The semantic element multi-frame data association subsystem then associates the map semantic elements across the multi-frame images; the semantic element three-dimensional reconstruction subsystem completes their sparse reconstruction; and the semantic element position optimization subsystem optimizes their positions to improve the precision of the three-dimensional semantic map. In other words, processing the image frame sequence information through multi-frame data association, three-dimensional reconstruction of the semantic elements, and position optimization realizes three-dimensional semantic map construction with improved real-time performance, robustness, and precision.
Illustratively, in the semantic element multi-frame data association subsystem, a Bbox relaxation unit preprocesses the bounding box of each map semantic element: for example, a Bbox relaxation operation is applied to a map semantic element whose edge contour area is smaller than a threshold T1 or whose aspect ratio is larger than a threshold T2, which facilitates the subsequent IOU calculation on the Bboxes. An inter-frame similarity calculation unit computes the similarity of map semantic elements in different image frames, for example based on the IOU and the distances between the elements' Bboxes; elements whose similarity exceeds a preset threshold are data-associated. A KEEP ALIVE mechanism unit removes erroneous data associations caused by missed detections, realizing data association of map semantic elements across the multi-frame images.
In the semantic element three-dimensional reconstruction subsystem, a 2D edge point sampling unit samples the universal points of each map semantic element (a universal point can be understood as an image pixel point without a salient feature descriptor): for example, N pixel points are uniformly sampled on the element's edge contour, e.g. by taking the contour center as the ray endpoint and intersecting the contour with a cluster of rays spaced 360/N degrees apart. A 2D universal point matching unit matches the universal points of a map semantic element across image frames, for example by constructing the matching relation from the epipolar mapping and the data association relation, and triangulation based on the matched pairs yields the spatial positions of the universal points. A spatial Ransac fitting unit then fits three-dimensional lines and surfaces with a Ransac model to obtain the spatial position information of the map semantic elements, i.e. their three-dimensional reconstruction.
In the semantic element position optimization subsystem, a bidirectional error map modeling unit constructs a map semantic element error map, for example based on a bidirectional nearest neighbor principle; a semantic element re-projection unit re-projects the 3D map semantic elements into the associated images to obtain the error sum; and a map semantic element optimizing unit obtains the optimized position: for example, taking the three-dimensional reconstruction initial value of the element as the center, a position search space and a search step length are set, the element is placed at each position of the search space, the reprojection error is calculated, and the position with the minimum reprojection error is taken as the optimized position of the map semantic element.
The semantic map construction device provided by the embodiment of the application can be deployed at different terminals and applied to different scenes.
For example, in an automatic driving scene, the semantic map constructing device is deployed at a system layer of a vehicle machine system of a vehicle and is called by an upper layer application through a programmable interface; the construction device of the semantic map constructs three-dimensional position information of the semantic elements of the map based on the detection result of the semantic elements of the picture sequence and the pose information of the semantic elements, which are acquired by the visual sensor on the vehicle; and the vehicle performs automatic driving path planning according to the constructed 3D position information of the map semantic elements to guide the autonomous driving of the automatic driving vehicle.
In a crowdsourced map update scene, the semantic map construction device is deployed on a vehicle and constructs the three-dimensional position information of the map semantic elements based on the detection results of the picture sequence semantic elements and their pose information obtained by the visual sensor on the vehicle; the constructed 3D positions of the semantic map elements are compared with a reference map provided on the vehicle, changes to the reference map are detected, and the change information is transmitted to the cloud for map updating.
In a robot scene, the semantic map construction device is deployed on a robot to realize simultaneous localization and mapping (SLAM) with high robustness, high precision and high real-time performance, enabling automatic path planning for the robot.
It can be understood that the above application scenario is merely an example of a typical application scenario of the semantic map building device provided by the embodiment of the present application, and is not exhaustive, and the semantic map building device provided by the embodiment of the present application may be applied to other possible scenarios, which is not limited in detail.
Fig. 2 is a flowchart of a method for constructing a semantic map according to an embodiment of the present application. The method can be applied to the semantic map construction device in FIG. 1, and the semantic map construction device can be deployed at terminals such as vehicles, robots and the like to realize semantic map construction with high real-time performance, high robustness and high precision. As shown in fig. 2, the method for constructing a semantic map according to the embodiment of the present application includes steps S201 to S205.
In step S201, information of map semantic elements in a multi-frame image is acquired.
Taking a vehicle scene as an example, a vision sensor on the vehicle acquires an image frame sequence of the surrounding environment of the vehicle in real time, and outputs processed image frame data through processing the image frame sequence, wherein the processed image frame data comprises the image frame sequence data, map semantic elements in each image frame and camera pose of each image frame.
For example, a map semantic segmentation module is arranged in the vision sensor to segment the map semantic element images in the image frames. Optionally, the module may realize map semantic segmentation through a deep neural network algorithm, for example by deploying a trained map semantic segmentation model, or through other segmentation algorithms, for example a heuristic segmentation algorithm.
The camera pose of an image frame may be acquired by an inertial measurement unit (IMU), and may for example be a 6-degree-of-freedom (6DoF) pose: the IMU acquires the orientation information of the visual sensor and a positioning module obtains its position information, together giving the pose information of the visual sensor.
Visual sensors include, but are not limited to, monocular cameras, depth cameras, and the like.
The image frame sequence is not particularly limited in the embodiment of the application, for example, the image frame sequence acquired by the vision sensor can be an RGB image sequence or an RGB-D image sequence. An RGB-D image sequence means an image sequence having both RGB information and depth information.
The semantic map constructing device acquires the processed image frame data in real time, for example, receives the processed image frame data output by the visual sensor in real time, and obtains image frame sequence data, map semantic element data of each frame image in the image frame sequence and camera pose data of the image frame.
The information of the map semantic elements can be understood as some semantic elements required for constructing the map, for example, the map semantic elements can include traffic attribute information such as lane lines, stop lines, road edges, road surface marks, traffic lights, traffic signs, lamp posts, guardrails and the like, and spatial position information of various map elements, and even orientation information of the map elements for some special cases such as traffic signs.
The map semantic elements in the image frames can be understood as images with traffic attributes in the image frames and semantic information, such as lane lines, stop lines, road edges, road signs, traffic lights, traffic signs, lamp posts, guardrails, and the like in the image frames.
In one example, to facilitate execution of the subsequent steps, the image frame sequence data, the map semantic element data, and the camera pose data of the image frames are cached as raw data in correspondence with the map semantic element tracking results.
In step S202, a data-related frame sequence corresponding to the map semantic element is obtained.
For example, data association is performed for each map semantic element to obtain the data-associated frame sequence corresponding to that element.
The meaning of the data associated frame sequence corresponding to the map semantic elements is as follows: the image frame sequence comprising the map semantic elements, for example, in the data association frame sequence corresponding to the street lamp pole map semantic elements, each frame of image frame comprises a street lamp pole image.
Because a map semantic element image usually has no corner points (a lane line image, for example) and therefore no salient feature points, feature points cannot be extracted from it, so data association of semantic elements cannot be performed by feature point matching.
According to the embodiment of the application, the data association is carried out through the similarity of the bounding boxes of the map semantic elements, so that the data association frame sequence of the map semantic elements is obtained.
In one example, the similarity of map semantic elements in different image frames may be determined based on the IOU and the distance of their Bboxes. For example, the edge detection result of a map semantic element obtained from the end side (i.e. the segmentation result of the element image) comprises the pixel point set of the element's edge contour and the map semantic information, from which the element's Bbox is obtained. First a relaxation operation is performed on the Bbox, and then similarity is calculated on the relaxed Bboxes (for example from the IOU and the vertex distances of the Bboxes), yielding the inter-frame association relation of the map semantic elements.
Optionally, the method further comprises KEEP ALIVE prediction for unassociated map semantic elements: data association continues in the next frame based on the predicted position, and whether to end the construction of the element's association relation is judged according to whether a match is found.
For example, for any map semantic element of the current image frame, the area and the aspect ratio of the element's contour are first calculated; when the area is smaller than the threshold T1 or the aspect ratio is larger than the threshold T2, Bbox relaxation is performed, and otherwise the Bbox is kept unchanged. In the relaxation mode for an area smaller than T1, a new Bbox is obtained by expanding the width and the height about the original Bbox center. In the relaxation mode for an aspect ratio larger than T2, the height is fixed and the width is enlarged X times relative to the original width. Then Bbox inter-frame matching is performed. One inter-frame matching method is: the Bbox positions of all map semantic elements in the previous frame image are mapped into the current frame image; for each matching pair of the same map semantic element category, the Bbox IOU is calculated and multiplied by an IOU similarity factor Ki to obtain the IOU similarity Si. FIG. 3 shows the Bbox of a lamp post (a map semantic element) in the previous frame mapped into the current frame and matched by IOU with the lamp post's Bbox in the current frame. Meanwhile, the reciprocal 1/d of the vertex distances between the Bboxes of all map semantic elements of the previous frame and the Bboxes of all same-category elements of the current frame is calculated and multiplied by a vertex distance similarity factor Kd (Ki + Kd = 1) to obtain the vertex distance similarity Sd of each matching pair; FIG. 4 shows the Bbox of the lamp post of the previous frame mapped into the current frame.
The similarity S = Si + Sd of each matching pair is then calculated. When the maximum similarity Smax > Ts, the semantic element of the previous frame obtains a data association in the current frame, and the element corresponding to the matched Bbox in the current frame is associated with it. Otherwise, the semantic element is not data-associated.
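The relaxation and similarity computation described above can be sketched as follows. The threshold values T1 and T2, the expansion factor X, the weights Ki and Kd (Ki + Kd = 1), and the mapping of the vertex-distance term into a bounded score are all illustrative assumptions, not values fixed by the application.

```python
def relax_bbox(x, y, w, h, t1=400.0, t2=5.0, expand=2.0):
    """Bbox relaxation: expand small or elongated boxes about their center."""
    if w * h < t1:                        # area below T1: grow width and height
        cx, cy = x + w / 2, y + h / 2
        w, h = w * expand, h * expand
        x, y = cx - w / 2, cy - h / 2
    elif max(w, h) / min(w, h) > t2:      # aspect ratio above T2: widen only
        cx = x + w / 2
        w = w * expand
        x = cx - w / 2
    return x, y, w, h

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def similarity(a, b, ki=0.5, kd=0.5):
    """S = Ki * IOU + Kd * bounded inverse-distance of top-left vertices."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    d = (dx * dx + dy * dy) ** 0.5
    return ki * iou(a, b) + kd * (1.0 / (1.0 + d))
```

A matched pair is then accepted when its score exceeds the preset threshold Ts, as in the text above.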
Missed detections may occur during map semantic element extraction, interrupting the data association; when the same object is detected again in a new image frame, it would be treated as a separately observed new object because the association was broken in the previous frame, which can lead to erroneous reconstruction. To solve this problem, a KEEP ALIVE mechanism for map semantic elements is introduced into the data association process: for a given map semantic element, if tracking is interrupted at time t+1, the element's Bbox at time t is mapped to time t+2; if the element is associated at time t+2, association continues with the next frame, and otherwise the object's association sequence stops at the image frame corresponding to time t.
FIG. 5 is a diagram of the KEEP ALIVE mechanism for map semantic element tracking. When a map semantic element loses its association in some frame, the data association is not terminated, and association is attempted again in the next frame. As shown in FIG. 5, if the tracking segment (tracklet) of map semantic element 3 is interrupted at time t+1, its Bbox at time t is mapped to time t+2 and tracking association continues; if an association is obtained at time t+2, data association of the element continues.
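The bridging behaviour of the KEEP ALIVE mechanism can be illustrated with a small sketch: for one map semantic element, the frame indices where it was detected are grouped into tracklets, and a gap of up to `max_gap` missed frames is bridged instead of ending the association sequence. The function name and the frame-index representation are assumptions for illustration.

```python
def group_with_keep_alive(frames_seen, max_gap=1):
    """Group sorted detection frame indices of one map semantic element
    into tracklets, bridging gaps of up to max_gap missed frames."""
    tracklets = []
    current = [frames_seen[0]]
    for t in frames_seen[1:]:
        # a gap of at most max_gap missed frames keeps the tracklet alive
        if t - current[-1] <= max_gap + 1:
            current.append(t)
        else:                    # larger gap: close the sequence, start anew
            tracklets.append(current)
            current = [t]
    tracklets.append(current)
    return tracklets
```

With `max_gap=1`, an element detected at frames 0, 1, 3, 4 forms a single tracklet despite the miss at frame 2, matching the t / t+1 / t+2 behaviour in the text; with `max_gap=0` the miss would split it into two tracklets.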
In step S203, a pixel point matching pair of the map semantic element in the first image frame and the second image frame is obtained.
After the data association is carried out on the map semantic elements, the associated data (namely the data association frame sequence) corresponding to each map semantic element is obtained, and then the universal point matching is carried out on each map semantic element, so that the universal point matching pair of each map semantic element is obtained.
It should be noted that a universal point is a pixel point of the map semantic element image, for example a pixel point collected along the element's edge contour. Unlike a feature point, which requires a corner point or a texture discontinuity, a universal point needs no special feature extraction algorithm: it is obtained by directly sampling pixel points from the image region corresponding to the map semantic element, for example on its edge contour.
In one example, after the universal point matching pairs of a map semantic element are determined, its three-dimensional reconstruction result can be calculated by triangulation. The triangulation precision is related to the completeness of the universal point extraction in the image and to the parallax of the universal points across the image frame pair, as well as to the pose precision of the reconstructed image frame pair; extracting an optimal key frame pair from the image frame sequence in which the semantic element is observed therefore benefits the reconstruction precision.
By way of example, key frame pairs may be selected based on three factors. (1) Map semantic element integrity factor: when occlusion occurs, a semantic element is partially occluded in the image, can only be partially reconstructed, and the reconstruction accuracy is low. A closure integrity factor is calculated from the closure condition of the element's contour and indicates the integrity of the map semantic element.
Figs. 6a and 6b compare the integrity factor of a street lamp pole map semantic element in two image frames: the street lamp pole in Fig. 6a is occluded by a vehicle, so its integrity factor in the acquired frame is poor, while the pole in Fig. 6b is unoccluded, so its integrity factor is good.
(2) Parallax factor: the relative pose is calculated from the poses of the key frame pair observing the map semantic element; the relative pose determines the magnitude of the parallax, and the smaller the parallax, the larger the reconstruction error. As shown in Fig. 7, when the camera photographs the same observation point from pose 1, pose 2, pose 3 and pose 4, the parallax between the images taken from pose 1 and pose 4 is the largest because their relative pose is the largest.
(3) Pose accuracy of image frames: the pose accuracy of an image frame may be given by its covariance matrix. Fig. 8 shows a schematic diagram of pose accuracy corresponding to a candidate image frame. And calculating a pose covariance matrix of the candidate image frame, wherein the pose covariance matrix indicates the pose precision of the candidate image frame, and the pose precision corresponding to the candidate frame 2 is highest as shown in fig. 8.
In one example, the key frame pair is extracted as follows: the first frame is selected in reverse order from the map semantic element's data association sequence according to the integrity factor, and the second frame is selected as the frame maximizing the sum of the three factors.
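Selection of the second frame by the sum of the three factors might be sketched as follows; the factor values are assumed to be pre-normalized to [0, 1], and the unweighted sum, the function name, and the tuple layout are illustrative assumptions.

```python
def pick_second_frame(candidates):
    """candidates: list of (frame_id, integrity, parallax, pose_accuracy),
    each factor assumed normalized to [0, 1]. Return the frame id whose
    factor sum is largest."""
    return max(candidates, key=lambda c: c[1] + c[2] + c[3])[0]
```

A real implementation would derive integrity from contour closure, parallax from the relative pose of the frame pair, and pose accuracy from the pose covariance matrix, as described above.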
After the key frame pair is extracted, universal point matching is performed on it. For example, N universal points are uniformly sampled on the map semantic element contour in the first frame of the pair; optionally, the contour center is taken as the ray endpoint, and the intersections of the contour with a cluster of rays spaced 360/N degrees apart form the universal point set of the element. The fundamental matrix of the epipolar geometry is then calculated from the pose of the key frame pair and the camera intrinsics, and the epipolar mapping of the first frame's universal point set onto the second frame is obtained from the fundamental matrix. The intersections of each epipolar line with the element's contour in the second frame form the second-frame universal point set corresponding to the first frame's. This yields the universal point matching pairs of the map semantic element.
FIG. 9 shows a schematic of universal point matching pair construction based on epipolar mapping. As shown in FIG. 9, A1, B1, C1, D1, E1 is the universal point set of a map semantic element in the first image frame; e2A2, e2B2, e2C2, e2D2, e2E2 is its epipolar mapping set in the second image frame; the intersections of these epipolar lines with the map semantic element contour in the second frame are A2, B2, C2, D2, E2; and (A1, A2), (B1, B2), (C1, C2), (D1, D2), (E1, E2) are the universal point matching pairs of the map semantic element.
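The epipolar matching step can be sketched as follows: a universal point p1 in the first frame maps to the epipolar line l2 = F·p1 in the second frame, and its match is taken as the contour point of the same element nearest that line. The fundamental matrix F is assumed given (computed from the key-frame pair's relative pose and camera intrinsics); picking the nearest contour point is a simplification of the line-contour intersection described above.

```python
def epipolar_line(F, p1):
    """Epipolar line l = F * (x, y, 1)^T in homogeneous form (a, b, c),
    i.e. a*x + b*y + c = 0 in the second image."""
    x, y = p1
    return tuple(F[i][0] * x + F[i][1] * y + F[i][2] for i in range(3))

def match_on_contour(F, p1, contour2):
    """Return the second-frame contour point nearest the epipolar line of p1."""
    a, b, c = epipolar_line(F, p1)
    norm = (a * a + b * b) ** 0.5
    return min(contour2, key=lambda q: abs(a * q[0] + b * q[1] + c) / norm)
```

For a purely horizontal stereo translation, F = [[0,0,0],[0,0,-1],[0,1,0]] maps a point (x, y) to the horizontal line y' = y, so the match is the contour point at the same image row.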
In the embodiment of the present application, universal point matching directly samples pixel points on the contour of a map semantic element as universal points, so no feature extraction algorithm is needed to extract feature points, which improves the real-time performance of map construction. Since many pixel points on the element are collected as the universal point set, the construction of the element is not affected even when it is partially occluded; compared with schemes that must collect certain special points of an element (such as corner points or end points), this improves the robustness of semantic map construction.
In step S204, spatial location information of the map semantic element is obtained based on the pixel point matching pair, the camera pose corresponding to the first image frame, and the camera pose corresponding to the second image frame.
According to the universal point matching pairs, the camera pose corresponding to the first image frame, and the camera pose corresponding to the second image frame, a triangulation calculation is used to obtain the triangulation result of the universal points on the map semantic element, i.e. the spatial position of each universal point.
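A midpoint-style triangulation sketch for one matched universal point is shown below: each camera contributes a world-frame ray (origin, direction) derived from its pose and intrinsics (assumed already computed here), and the 3D point is the midpoint of the two rays' closest approach. This is one common triangulation method, not necessarily the one used in the application.

```python
def triangulate(o1, d1, o2, d2):
    """Midpoint of closest approach of rays o1 + s*d1 and o2 + t*d2."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w = tuple(x - y for x, y in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    den = a * c - b * b            # near zero means near-parallel rays
    s = (b * e - c * d) / den
    t = (a * e - b * d) / den
    p1 = tuple(o + s * di for o, di in zip(o1, d1))
    p2 = tuple(o + t * di for o, di in zip(o2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))
```

When the two rays truly intersect (ideal noise-free observation), the midpoint coincides with the intersection point.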
In step S205, a three-dimensional semantic map is constructed based on the spatial position information of the map semantic elements.
After the position information of each general point of the map semantic element is obtained, the space point cloud of the map semantic element is obtained, and as shown in fig. 10, line fitting or surface fitting is performed on the map semantic element according to the linear attribute or the surface attribute of the map semantic element, so as to obtain the three-dimensional reconstruction result of the map semantic element.
For example, when the map semantic element is a street lamp post, the map semantic element is a line attribute, and line fitting is performed according to the space point cloud corresponding to the universal point obtained by triangulation calculation, so as to obtain a three-dimensional reconstruction result of the street lamp post; when the map semantic element is a traffic sign, the map semantic element is a surface attribute, and surface fitting is performed according to the space point cloud corresponding to the universal point obtained through triangulation calculation, so that a three-dimensional reconstruction result of the traffic sign is obtained.
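For a line-attribute element such as a street lamp post, the Ransac-style line fit over the triangulated point cloud might look like the following minimal sketch: sample point pairs, count inliers within a distance threshold of the candidate line, and keep the best model. The threshold, iteration count, and fixed seed are illustrative assumptions.

```python
import random

def point_line_dist(p, o, d):
    """Distance from 3D point p to the line through o with unit direction d."""
    v = tuple(pi - oi for pi, oi in zip(p, o))
    t = sum(vi * di for vi, di in zip(v, d))
    proj = tuple(oi + t * di for oi, di in zip(o, d))
    return sum((pi - qi) ** 2 for pi, qi in zip(p, proj)) ** 0.5

def ransac_line(points, thresh=0.1, iters=100, seed=0):
    """Fit a 3D line by RANSAC; return (origin, unit direction, inlier count)."""
    rng = random.Random(seed)
    best = (None, None, -1)
    for _ in range(iters):
        p, q = rng.sample(points, 2)
        d = tuple(qi - pi for pi, qi in zip(p, q))
        n = sum(di * di for di in d) ** 0.5
        if n == 0:
            continue
        d = tuple(di / n for di in d)
        inliers = sum(1 for x in points if point_line_dist(x, p, d) < thresh)
        if inliers > best[2]:
            best = (p, d, inliers)
    return best
```

A surface-attribute element such as a traffic sign would instead sample point triples and fit a plane, with the same hypothesize-and-verify loop.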
In another example, to further increase the accuracy of three-dimensional semantic map construction, the construction method provided by the embodiment of the present application further includes position optimization steps S301 to S305 for the map semantic elements after the initial three-dimensional semantic map is constructed. For example, a map semantic element error map is constructed, indicating for each image frame of the element's data-associated frame sequence the error relative to the three-dimensional semantic map; the spatial position information of the map semantic element is adjusted based on the error map, and the constructed three-dimensional semantic map is optimized based on the adjusted spatial position information.
Three-dimensional reconstruction of map semantic elements based on universal point matching yields a preliminary reconstruction result whose precision, affected by element extraction precision, epipolar mapping precision and the like, can be improved further: when the three-dimensional reconstruction result is back-projected into the element's data-associated frame sequence, a certain deviation from the element's contour in the image exists. To improve the precision, the contour of the map semantic element in the image can be used as its observation, and the three-dimensional position of the element optimized to reduce the observation error, thereby improving the precision of the three-dimensional reconstruction of the semantic element.
For example, the position optimization of the map semantic elements may be achieved by:
Step 1: efficiently construct a map semantic element error map for each image frame based on the chamfer distance (chamfer distance) transformation principle. For example, pixels inside the contours of all map semantic elements in the frame are taken as foreground and the remaining pixels as background; the foreground error is 0 and the background error is the chamfer distance of the pixel. To calculate the projection error of a spatial point on an image, the point is projected into the image and the error is read from the chamfer distance transformation result. Because projecting a 3D point onto the image yields a floating-point result, i.e. the exact projection is a sub-pixel point, the chamfer distance error of the sub-pixel point needs to be computed.
Step 2: taking the three-dimensional reconstruction initial value of the map semantic element as the center, set a position search space and a search step length, place the element at each position of the search space, calculate its reprojection error, and take the position with the minimum reprojection error as the optimized position of the element.
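The two steps above can be sketched as follows: a per-frame error map costing 0 on the contour (foreground) and the distance to the nearest contour pixel elsewhere, a bilinear lookup for sub-pixel projections, and a grid search over offsets around the initial position. The brute-force distance computation (a real system would use a two-pass chamfer transform), the toy identity projection, and the 2D search space are all simplifying assumptions.

```python
def error_map(w, h, contour):
    """Chamfer-style map: 0 on contour pixels, distance to the nearest
    contour pixel elsewhere (brute force, for illustration only)."""
    return [[min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in contour) ** 0.5
             for x in range(w)] for y in range(h)]

def bilinear(err, x, y):
    """Sub-pixel error lookup, since 3D points project to floating-point pixels."""
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    return (err[y0][x0] * (1 - fx) * (1 - fy)
            + err[y0][x0 + 1] * fx * (1 - fy)
            + err[y0 + 1][x0] * (1 - fx) * fy
            + err[y0 + 1][x0 + 1] * fx * fy)

def grid_search(err, project, init, step=1.0, radius=1):
    """Try offsets around init; return the position with the lowest error."""
    best, best_e = init, float("inf")
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            pos = (init[0] + dx * step, init[1] + dy * step)
            u, v = project(pos)
            e = bilinear(err, u, v)
            if e < best_e:
                best, best_e = pos, e
    return best
```

In the full method the error summed over every frame of the element's data-associated sequence would be minimized, and the search space would be three-dimensional around the reconstruction initial value.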
The semantic map construction method provided by the embodiment of the present application was applied to a real-time vehicle map construction scene, for example the CarBU real-time map construction system, to verify the semantic map construction effect.
Fig. 11 shows the semantic association result of poles and boards obtained by associating map semantic elements with the construction method provided by the embodiment of the present application. In Fig. 11, different poles and boards have different IDs in the picture, and their IDs do not change across frames.
Fig. 12 is a semantic map constructed by a vehicle using the construction method provided by the embodiment of the present application. The vehicle can plan and drive its route according to the three-dimensional semantic map constructed in real time.
In the related art, only the road surface elements are subjected to IPM (inverse perspective mapping) transformation to obtain a three-dimensional reconstruction of the road surface elements; no three-dimensional reconstruction of the roadside elements is obtained, no three-dimensional optimization of the road surface elements is carried out, and the map construction accuracy is poor. The method for constructing a semantic map provided herein solves the problem of three-dimensional reconstruction when the roadside elements have no feature points: matching points are searched according to the data association and the epipolar mapping to obtain a preliminary three-dimensional reconstruction result. Meanwhile, the preliminary reconstruction result is further optimized according to the semantic data association result, so that the accuracy is further improved.
In summary, the method for constructing a semantic map provided by the embodiment of the application realizes the construction of a three-dimensional semantic map with high real-time performance, high robustness and high precision, meets the requirements of vehicles for highly real-time, robust and precise map construction, and thereby improves the running safety of the autonomous vehicle.
Based on the same conception as the foregoing semantic map construction method embodiments, an embodiment of the present application further provides a semantic map construction apparatus 1300, where the semantic map construction apparatus 1300 includes units or modules for implementing each step in the semantic map construction method shown in fig. 2-12.
Fig. 13 is a schematic structural diagram of a semantic map building device according to an embodiment of the present application. The device can be deployed on any device, equipment, platform or equipment cluster with computing capability that needs to construct a map, such as a vehicle or a robot, so as to realize high real-time, highly robust and high-precision construction of a three-dimensional semantic map.
As shown in fig. 13, the device 1300 for constructing a semantic map according to the embodiment of the present application includes an acquisition module 1301, a data association module 1302, a matching module 1303, a determination module 1304, and a construction module 1305, where the acquisition module 1301 is configured to acquire information of a map semantic element in a multi-frame image; the data association module 1302 is configured to obtain a data association frame sequence corresponding to the map semantic element, where an image in the data association frame sequence includes the map semantic element; the matching module 1303 is configured to obtain a pixel point matching pair of the map semantic element in a first image frame and a second image frame, where the first image frame and the second image frame are image frames in a data association frame sequence; the determining module 1304 is configured to obtain spatial location information of the map semantic element based on the pixel matching pair, the camera pose corresponding to the first image frame, and the camera pose corresponding to the second image frame; the construction module 1305 is configured to construct a three-dimensional semantic map based on the spatial location information of the semantic elements of the map.
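As an illustration of the determining module's task, the sketch below recovers a point's spatial position from one pixel matching pair and the two camera poses using ray-midpoint triangulation. This is a common textbook method rather than the patent's own reconstruction procedure, and the intrinsics, poses and helper names are all assumptions:

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the closest points of rays c1 + s*d1 and c2 + t*d2."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    r = tuple(x - y for x, y in zip(c2, c1))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    denom = a * c - b * b          # zero only for parallel rays (no parallax)
    s = (dot(r, d1) * c - dot(r, d2) * b) / denom
    t = (dot(r, d1) * b - dot(r, d2) * a) / denom
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))

def pixel_to_ray(u, v, f=100.0, cx=64.0, cy=64.0):
    """Back-project a pixel to a viewing-ray direction (identity rotation)."""
    return ((u - cx) / f, (v - cy) / f, 1.0)

# Stereo pair: identity rotations, 0.5 m baseline along x, point at (1, 0.5, 5).
X = (1.0, 0.5, 5.0)
u1, v1 = 100 * X[0] / X[2] + 64, 100 * X[1] / X[2] + 64
u2, v2 = 100 * (X[0] - 0.5) / X[2] + 64, 100 * X[1] / X[2] + 64
P = triangulate_midpoint((0.0, 0.0, 0.0), pixel_to_ray(u1, v1),
                         (0.5, 0.0, 0.0), pixel_to_ray(u2, v2))
# P recovers X up to floating-point noise.
```

The denominator vanishing for parallel rays is exactly why the parallax factor matters when the first and second image frames are selected.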
In one possible implementation, the data association module 1302 is specifically configured to: obtain a bounding box of the map semantic elements in each frame of the multi-frame images; for each map semantic element, carry out inter-frame matching based on the bounding boxes, and take the successfully matched image frames as the data association frame sequence corresponding to the map semantic element; the similarity of the bounding boxes of the map semantic element between the successfully matched image frames is greater than a preset threshold.
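A minimal sketch of the bounding-box-based inter-frame matching, assuming intersection-over-union (IoU) as the similarity measure and a greedy assignment; the threshold and data layout are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_frames(prev_boxes, cur_boxes, thresh=0.3):
    """Greedy inter-frame matching: a previous element keeps its ID when some
    current-frame box exceeds the IoU threshold."""
    matches, used = {}, set()
    for pid, pb in prev_boxes.items():
        best, best_iou = None, thresh
        for cid, cb in cur_boxes.items():
            if cid in used:
                continue
            s = iou(pb, cb)
            if s > best_iou:
                best, best_iou = cid, s
        if best is not None:
            matches[pid] = best
            used.add(best)
    return matches

prev = {7: (10, 10, 30, 50)}                           # element ID 7 in frame k
cur = {0: (12, 11, 31, 52), 1: (100, 100, 120, 140)}   # detections in frame k+1
matches = match_frames(prev, cur)                      # {7: 0}
```

Chaining such frame-to-frame matches is what yields the data association frame sequence in which an element's ID stays stable, as shown in fig. 11.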
In another possible implementation, the data association module 1302 is further configured to: acquire the edge contour of the map semantic elements in each frame of the multi-frame images; for the edge contour of a map semantic element in each frame of image, if the edge contour meets a preset condition, perform a relaxation operation on the edge contour and determine the bounding box of the map semantic element based on the edge contour after the relaxation operation, where the preset condition includes that the area of the edge contour is smaller than or equal to a first threshold or the aspect ratio of the edge contour is larger than or equal to a second threshold; and if the edge contour does not meet the preset condition, determine the bounding box of the map semantic element based on the edge contour.
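The relaxation logic can be sketched as below; the thresholds, padding amount and contour representation are illustrative assumptions:

```python
def bounding_box(contour, area_thresh=100.0, ar_thresh=5.0, pad=3):
    """Axis-aligned box around a contour; small or very elongated contours are
    relaxed (padded) first, since their tight boxes match unreliably."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    w, h = x1 - x0, y1 - y0
    area = w * h
    aspect = max(w, h) / max(min(w, h), 1e-9)
    if area <= area_thresh or aspect >= ar_thresh:
        return (x0 - pad, y0 - pad, x1 + pad, y1 + pad)  # relaxed box
    return (x0, y0, x1, y1)                              # tight box

pole = bounding_box([(10, 0), (12, 0), (10, 60), (12, 60)])   # thin pole: relaxed
board = bounding_box([(0, 0), (50, 0), (0, 50), (50, 50)])    # large board: tight
```

Padding a thin pole's box in this way gives it enough overlap with the next frame's box for the IoU-based matching to succeed.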
In another possible implementation, the similarity of the bounding boxes of a map semantic element between image frames is positively correlated with the intersection-over-union of the bounding boxes of the map semantic element between the image frames and negatively correlated with the distance between the bounding boxes of the map semantic element between the image frames.
In another possible implementation, the data association module 1302 is further configured to: determine, based on a keep-alive mechanism, the data association frame sequence corresponding to the map semantic element from the multi-frame images.
In another possible implementation, the first image frame and the second image frame are determined from a sequence of data-associated frames corresponding to the map semantic element based on one or more of a completeness factor of the map semantic element in the image frames, a parallax factor of the map semantic element between different image frames, and a camera pose accuracy corresponding to the image frame in which the map semantic element is located.
In another possible implementation, the matching module 1303 is specifically configured to: collect N first pixel points along the edge contour of the map semantic element in the first image frame, where N is a positive integer greater than 1; determine a fundamental matrix of the epipolar geometry based on the camera pose corresponding to the first image frame, the camera pose corresponding to the second image frame and the intrinsic parameters of the camera; determine, based on the fundamental matrix of the epipolar geometry, N epipolar lines of the N first pixel points on the second image frame; determine N second pixel points based on the N intersection points of the N epipolar lines with the edge contour of the map semantic element in the second image frame; and determine pixel point matching pairs of the map semantic element in the image frame pair based on the N first pixel points and the N second pixel points.
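The epipolar transfer used by the matching module can be sketched with plain 3x3 matrix arithmetic. The intrinsics, baseline and pixel values are illustrative assumptions; the identity rotation and pure horizontal baseline make the resulting epipolar line horizontal, which is easy to verify:

```python
def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def fundamental(K_inv, R, t):
    """F = K^-T [t]x R K^-1 (same intrinsics K assumed in both frames)."""
    tx = [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]
    K_invT = [[K_inv[j][i] for j in range(3)] for i in range(3)]
    return matmul(K_invT, matmul(tx, matmul(R, K_inv)))

def epipolar_line(F, u, v):
    """Line l = F x in the second frame for pixel (u, v) in the first frame."""
    x = (u, v, 1.0)
    return tuple(sum(F[i][k] * x[k] for k in range(3)) for i in range(3))

f, cx, cy = 100.0, 64.0, 64.0
K_inv = [[1 / f, 0.0, -cx / f], [0.0, 1 / f, -cy / f], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # identity rotation
t = (0.5, 0.0, 0.0)                                       # horizontal baseline
F = fundamental(K_inv, R, t)
a, b, c = epipolar_line(F, 84.0, 74.0)   # line a*u + b*v + c = 0 in frame 2
# Horizontal baseline -> horizontal epipolar line: every (u, 74) satisfies it.
```

Intersecting such a line with the element's edge contour in the second frame yields the matched second pixel point.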
In another possible implementation, the semantic map construction apparatus 1300 provided by the embodiment of the present application further includes an optimization module 1306, where the optimization module 1306 is configured to adjust the spatial position information of a map semantic element based on the error map corresponding to the map semantic element, the error map indicating the error information of re-projecting the map semantic element from the three-dimensional semantic map onto each image frame of its data association frame sequence; and to adjust the three-dimensional semantic map based on the adjusted spatial position information of the map semantic element.
The semantic map building apparatus 1300 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module in the semantic map building apparatus 1300 are respectively for implementing the corresponding flow of each method in fig. 2 to 12, which are not repeated herein for brevity.
Based on the same conception as the foregoing embodiments of the method, the present application also provides a computing device, which at least includes a processor and a memory, wherein the memory stores a program, and when the program is executed by the processor, the unit or module of each step in the method shown in fig. 2-12 may be implemented.
Fig. 14 is a schematic structural diagram of a computing device according to an embodiment of the present application.
As shown in fig. 14, the computing device 1400 includes at least one processor 1401, a memory 1402, and a communication interface 1403. The processor 1401, the memory 1402, and the communication interface 1403 are communicatively connected, and the communication connection may be implemented by a wired (e.g., bus) method or may be implemented by a wireless method. The communication interface 1403 is used to receive data sent by other devices (e.g., a sequence of image frames transmitted by a vision sensor); the memory 1402 stores computer instructions that the processor 1401 executes to perform the methods of the method embodiments described above.
It should be appreciated that in embodiments of the present application, the processor 1401 may be a central processing unit (CPU), and the processor 1401 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 1402 may include read only memory and random access memory, and provides instructions and data to the processor 1401. Memory 1402 may also include nonvolatile random access memory.
The memory 1402 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
It should be appreciated that the computing device 1400 according to the embodiment of the present application may perform the method shown in fig. 2-12 in implementing the embodiment of the present application, and the detailed description of the implementation of the method is referred to above, which is not repeated herein for brevity.
The embodiment of the application also provides a vehicle on which the computing device shown in fig. 14 is deployed; by processing the image sequence acquired by the vision sensor on the vehicle, highly real-time, highly robust and high-precision semantic map construction is realized.
Alternatively, the semantic map construction device provided by the embodiment of the application is deployed in the in-vehicle system of the vehicle, and highly real-time, highly robust and high-precision semantic map construction is realized by processing the image sequence acquired by the vision sensor on the vehicle.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above-mentioned method is implemented.
An embodiment of the present application provides a chip including at least one processor and an interface, through which the at least one processor obtains program instructions or data; the at least one processor is configured to execute the program instructions to implement the above-mentioned method.
Embodiments of the present application provide a computer program or computer program product comprising instructions which, when executed, cause a computer to perform the above-mentioned method.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description merely sets out specific embodiments of the present application and is not intended to limit the scope of the present application; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the scope of the present application.
Claims (20)
1. A method for constructing a semantic map, characterized by comprising the following steps:
acquiring information of map semantic elements in multi-frame images;
Obtaining a data association frame sequence corresponding to the map semantic elements, wherein images in the data association frame sequence comprise the map semantic elements;
obtaining pixel point matching pairs of the map semantic elements in a first image frame and a second image frame, wherein the first image frame and the second image frame are image frames in the data association frame sequence;
Based on the pixel point matching pair, the camera pose corresponding to the first image frame and the camera pose corresponding to the second image frame, obtaining the spatial position information of the map semantic element;
And constructing a three-dimensional semantic map based on the spatial position information of the map semantic elements.
2. The method of claim 1, wherein the obtaining a data association frame sequence corresponding to the map semantic elements comprises:
obtaining a bounding box of the map semantic elements in each frame of the multi-frame images;
for each map semantic element, carrying out inter-frame matching based on the bounding boxes, and taking the successfully matched image frames as the data association frame sequence corresponding to the map semantic element; wherein the similarity of the bounding boxes of the map semantic element between the successfully matched image frames is greater than a preset threshold.
3. The method of claim 2, wherein the obtaining a bounding box of map semantic elements in each of the plurality of frames of images comprises:
acquiring edge contours of map semantic elements in each frame of images in the multi-frame images;
for the edge contour of the map semantic element in each frame of image, if the edge contour does not meet a preset condition, performing a relaxation operation on the edge contour, and determining a bounding box of the map semantic element based on the edge contour after the relaxation operation; the preset condition comprises that the area of the edge contour is greater than or equal to a first threshold or the aspect ratio of the edge contour is less than or equal to a second threshold;
and if the edge contour meets the preset condition, determining a bounding box of the map semantic element based on the edge contour.
4. The method according to claim 2 or 3, wherein the similarity of the bounding boxes of the map semantic element between the image frames is positively correlated with the intersection-over-union of the bounding boxes of the map semantic element between the image frames and negatively correlated with the distance between the bounding boxes of the map semantic element between the image frames.
5. The method of any of claims 1-4, wherein the obtaining a data association frame sequence corresponding to the map semantic elements comprises:
determining, based on a keep-alive mechanism, the data association frame sequence corresponding to the map semantic element from the multi-frame images.
6. The method of any of claims 1-5, wherein the first image frame and the second image frame are determined from the data association frame sequence corresponding to the map semantic element based on one or more of a completeness factor of the map semantic element in an image frame, a parallax factor of the map semantic element between different image frames, and the camera pose accuracy corresponding to the image frame in which the map semantic element is located.
7. The method of any of claims 1-6, wherein the deriving pixel-point matching pairs of the map semantic elements in the first image frame and the second image frame comprises:
Collecting N first pixel points from the first image frame along the edge outline of the map semantic element, wherein N is a positive integer greater than 1;
determining a base matrix of epipolar geometry based on the camera pose corresponding to the first image frame, the camera pose corresponding to the second image frame, and the internal parameters of the camera;
determining, based on the fundamental matrix of the epipolar geometry, N epipolar lines of the N first pixel points on the second image frame;
determining N second pixel points based on N intersection points of the N epipolar lines with the edge contour of the map semantic element in the second image frame;
and determining pixel point matching pairs of the map semantic elements in an image frame pair based on the N first pixel points and the N second pixel points.
8. The method of any one of claims 1-7, further comprising:
based on an error map corresponding to a map semantic element, adjusting the spatial position information of the map semantic element; wherein the error map corresponding to the map semantic element indicates error information of re-projecting the map semantic element from the three-dimensional semantic map onto each image frame of the data association frame sequence corresponding to the map semantic element;
and adjusting the three-dimensional semantic map based on the adjusted spatial position information of the map semantic elements.
9. A method of route planning for a vehicle, comprising:
constructing a three-dimensional semantic map of the surroundings of the vehicle based on the method according to any one of claims 1-8;
and planning the driving route of the vehicle based on the three-dimensional semantic map.
10. A semantic map constructing apparatus, comprising:
the acquisition module is used for acquiring the information of the map semantic elements in the multi-frame images;
The data association module is used for obtaining a data association frame sequence corresponding to the map semantic elements, and images in the data association frame sequence comprise the map semantic elements;
The matching module is used for obtaining pixel point matching pairs of the map semantic elements in a first image frame and a second image frame, wherein the first image frame and the second image frame are image frames in the data association frame sequence;
the determining module is used for obtaining the spatial position information of the map semantic elements based on the pixel point matching pair, the camera pose corresponding to the first image frame and the camera pose corresponding to the second image frame;
the construction module is used for constructing a three-dimensional semantic map based on the spatial position information of the map semantic elements.
11. The apparatus of claim 10, wherein the data association module is specifically configured to:
obtain a bounding box of the map semantic elements in each frame of the multi-frame images;
for each map semantic element, carry out inter-frame matching based on the bounding boxes, and take the successfully matched image frames as the data association frame sequence corresponding to the map semantic element; wherein the similarity of the bounding boxes of the map semantic element between the successfully matched image frames is greater than a preset threshold.
12. The apparatus of claim 11, wherein the data association module is further configured to:
acquiring edge contours of map semantic elements in each frame of images in the multi-frame images;
for the edge contour of the map semantic element in each frame of image, if the edge contour meets a preset condition, perform a relaxation operation on the edge contour, and determine a bounding box of the map semantic element based on the edge contour after the relaxation operation; the preset condition comprises that the area of the edge contour is less than or equal to a first threshold or the aspect ratio of the edge contour is greater than or equal to a second threshold;
and if the edge contour does not meet the preset condition, determine a bounding box of the map semantic element based on the edge contour.
13. The apparatus of claim 11 or 12, wherein the similarity of the bounding boxes of the map semantic element between the image frames is positively correlated with the intersection-over-union of the bounding boxes of the map semantic element between the image frames and negatively correlated with the distance between the bounding boxes of the map semantic element between the image frames.
14. The apparatus of any of claims 10-13, wherein the data association module is further configured to:
determine, based on a keep-alive mechanism, the data association frame sequence corresponding to the map semantic element from the multi-frame images.
15. The apparatus of any of claims 10-14, wherein the first image frame and the second image frame are determined from the data association frame sequence corresponding to the map semantic element based on one or more of a completeness factor of the map semantic element in an image frame, a parallax factor of the map semantic element between different image frames, and the camera pose accuracy corresponding to the image frame in which the map semantic element is located.
16. The apparatus according to any one of claims 10-15, wherein the matching module is specifically configured to:
Collecting N first pixel points from the first image frame along the edge outline of the map semantic element, wherein N is a positive integer greater than 1;
determining a base matrix of epipolar geometry based on the camera pose corresponding to the first image frame, the camera pose corresponding to the second image frame, and the internal parameters of the camera;
determine, based on the fundamental matrix of the epipolar geometry, N epipolar lines of the N first pixel points on the second image frame;
determine N second pixel points based on N intersection points of the N epipolar lines with the edge contour of the map semantic element in the second image frame;
and determining pixel point matching pairs of the map semantic elements in an image frame pair based on the N first pixel points and the N second pixel points.
17. The apparatus according to any one of claims 10-16, further comprising:
The optimization module is used for adjusting the spatial position information of the map semantic element based on the error map corresponding to the map semantic element; wherein the error map corresponding to the map semantic element indicates error information of re-projecting the map semantic element from the three-dimensional semantic map onto each image frame of the data association frame sequence corresponding to the map semantic element;
and adjusting the three-dimensional semantic map based on the adjusted spatial position information of the map semantic elements.
18. A vehicle comprising a semantic map building apparatus according to any one of claims 10-17.
19. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, the processor executing the executable code to implement the method of any of claims 1-9.
20. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310399171.5A CN118776545A (en) | 2023-04-04 | 2023-04-04 | Semantic map construction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118776545A true CN118776545A (en) | 2024-10-15 |
Family
ID=92994424
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310399171.5A Pending CN118776545A (en) | 2023-04-04 | 2023-04-04 | Semantic map construction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118776545A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||