CN117516581A - End-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and neighborhood attention Transformer - Google Patents
- Publication number
- CN117516581A (application CN202311691441.6A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- bev
- data
- attention
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/26—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
- G01C21/34—Route searching; Route guidance
- G01C21/3446—Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Automation & Control Theory (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Image Analysis (AREA)
Abstract
The invention provides an end-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and a neighborhood attention Transformer. BEVFormer learns a unified BEV (bird's-eye-view) feature representation from multi-view images through a spatio-temporal Transformer, so as to capture the spatial relationships and temporal information in the input data. An RNN feature extractor extracts the dynamic features and associated features of vehicle motion from the vehicle's historical trajectory data and vehicle state data, and these features are fused into the BEV features. Finally, the vehicle control signals and trajectory are output through a neighborhood-attention-based Transformer planning model and a fully connected layer. The invention exploits spatial and temporal information across cameras and time steps, fuses vehicle motion features with bird's-eye-view features to better understand and analyze complex relationships such as vehicle behavior and environmental change, and captures the correlations between different positions in the features by applying self-attention within a local neighborhood, thereby improving the expressive capacity of the model.
Description
Technical Field
The invention belongs to the technical field of end-to-end automatic driving, and particularly relates to an end-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and a neighborhood attention Transformer.
Background
The statements in this section merely provide background information related to the present application and may not constitute prior art.
Automatic driving is becoming a key development direction for future transportation. Frequent traffic accidents and traffic congestion have become global challenges, and automatic driving technology offers new ways to address them. Using advanced perception and decision-making systems, it enables a vehicle to sense its environment autonomously, make safe driving decisions and contribute to efficient traffic flow. The development of automatic driving technology has gone through several stages. Initially, rule-based methods were widely used in automatic driving systems. These methods control vehicle behavior, such as lane keeping and compliance with traffic signals, through predefined rules and logic. However, this approach requires a large number of manually designed rules and cannot cope with complex driving scenarios and changing traffic conditions.
To address the limitations of the traditional approach, end-to-end automatic driving is attracting increasing interest. End-to-end automatic driving integrates perception, decision-making, planning and control into a unified system, and learns a driving model directly from raw sensor data to output a control signal or a future trajectory. It removes the complex rule design and decision reasoning of the traditional approach and offers better adaptability and flexibility. Despite significant advances, end-to-end automatic driving still faces challenges and limitations, including large data requirements, poor model interpretability and limited generalization to new scenes.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an end-to-end automatic driving trajectory planning system, method and training method fusing BEVFormer and a neighborhood attention Transformer, which improve the automatic driving system's understanding of the environment and its decision-making capability, and improve the accuracy and generalization of the driving system, so that the control signal or future trajectory of the autonomous vehicle can be output more accurately, providing support for safer and more intelligent automatic driving.
The technical problems to be solved by the present invention are not limited to the above-described problems, and any other technical problems not mentioned in the present application will be clearly understood from the following description by those skilled in the art to which the present application pertains.
The end-to-end automatic driving trajectory planning system fusing BEVFormer and the neighborhood attention Transformer is characterized by comprising:
the data acquisition module is used for acquiring multi-view images, historical track data of the vehicle and vehicle state data and preprocessing the acquired data;
the RNN feature extractor is used for extracting dynamic features and associated features of vehicle motion from the historical trajectory data and vehicle state data of the vehicle; for the historical trajectory data, the trend of motion change in the trajectory is captured through time-series processing, and the temporal dependence in the driving process is captured by the memory mechanism of the RNN feature extractor; by learning from the vehicle state data, the association between vehicle state and behavior is obtained;
the BEVFormer feature extraction module comprises a backbone neural network and at least one encoder layer, wherein each encoder layer comprises a BEV query mechanism, a spatial cross-attention module, a temporal self-attention module and a feed-forward network;
the backbone neural network is used for obtaining the view features $F_t$ of the multi-view images at time $t$;
the BEV query mechanism predefines a set of grid-shaped learnable parameters $Q$ as the BEVFormer queries, the query $Q_p$ located at $p=(x,y)$ being used for querying the corresponding grid cell region in the BEV plane;
the BEV query mechanism queries the temporal information of the BEV spatial feature $B_{t-1}$ through the temporal self-attention module, and extracts the spatial information of the queried BEV spatial feature $B_{t-1}$ through the spatial cross-attention module;
the spatial cross-attention module further obtains multi-view spatial information of the vehicle based on the view features $F_t$ of the multi-view images;
the feed-forward network obtains the refined BEV feature $B_t$ based on the spatial information of the view features $F_t$ of the multi-view images and the temporal and spatial information of the BEV spatial feature $B_{t-1}$;
the feature fusion module is used for fusing the dynamic features and associated features of vehicle motion extracted by the RNN feature extractor into the BEV features;
the neighborhood attention Transformer planning model plans the future trajectory of the vehicle based on the fused BEV features, and outputs the planning result through the fully connected layer and the visualization module.
Further, the data acquisition module comprises cameras with different visual angles, vehicle-mounted sensors and an inertial measurement unit.
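Further, the spatial cross-attention (SCA) of the spatial cross-attention module is computed as:
$$\mathrm{SCA}(Q_p, F_t) = \frac{1}{|\mathcal{V}_{hit}|} \sum_{i \in \mathcal{V}_{hit}} \sum_{j=1}^{N_{ref}} \mathrm{DeformAttn}\big(Q_p,\ \mathcal{P}(p,i,j),\ F_t^i\big)$$
wherein:
- $i$ is the index of the camera view, and $\mathcal{V}_{hit}$ is the set of camera views onto which the reference points project;
- $j$ is the index of the reference point;
- $N_{ref}$ is the total number of reference points per BEV query;
- $F_t^i$ is the feature of the $i$-th camera view;
- $Q_p$ is the BEV query located at $p$;
- $\mathcal{P}(p,i,j)$ is the projection function that maps the $j$-th 3D point $(x', y', z'_j)$ to a 2D point on the $i$-th view;
- $\mathrm{DeformAttn}(\cdot)$ is deformable attention;
the projection function is:
$$\mathcal{P}(p,i,j) = (x_{ij}, y_{ij})$$
where
$$z_{ij}\cdot[x_{ij}\ \ y_{ij}\ \ 1]^T = T_i\cdot[x'\ \ y'\ \ z'_j\ \ 1]^T$$
and $T_i$ is the known (3 × 4) projection matrix of the $i$-th camera.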
Further, given the BEV queries $Q$ at the current time $t$ and the historical BEV features $B_{t-1}$ preserved at time $t-1$, $B_{t-1}$ is first aligned to $Q$ according to the ego-motion of the vehicle, so that features at the same grid cell correspond to the same real-world location; the aligned historical BEV features are denoted $B'_{t-1}$. The temporal correlation between BEV features modeled by the temporal self-attention (TSA) module is expressed as:
$$\mathrm{TSA}(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q, B'_{t-1}\}} \mathrm{DeformAttn}(Q_p, p, V)$$
wherein:
- $Q_p$ represents the BEV query located at $p=(x,y)$;
the offsets in the temporal self-attention module are predicted from $Q$ and $B'_{t-1}$; for the first sample of each sequence, the temporal self-attention module degenerates to self-attention without temporal information, in which the BEV features $\{Q, B'_{t-1}\}$ are replaced with the duplicated BEV queries $\{Q, Q\}$.
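The alignment step and the degenerate first-sample case can be sketched as follows. This is a simplified illustration only: it assumes the ego-motion reduces to a whole-cell translation of the grid (rotation is ignored), scalar cell features, and zero filling for cells that leave the grid. The names `align_history` and `temporal_value_set` are hypothetical.

```python
from typing import List, Optional

Grid = List[List[float]]


def align_history(prev_bev: Grid, shift_rows: int, shift_cols: int,
                  fill: float = 0.0) -> Grid:
    """Shift B_{t-1} so that each cell covers the same world location as in the current grid."""
    height, width = len(prev_bev), len(prev_bev[0])
    aligned = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            src_r, src_c = r + shift_rows, c + shift_cols
            if 0 <= src_r < height and 0 <= src_c < width:
                aligned[r][c] = prev_bev[src_r][src_c]
    return aligned


def temporal_value_set(query: Grid, prev_bev: Optional[Grid],
                       shift_rows: int = 0, shift_cols: int = 0) -> List[Grid]:
    """Return the pair of feature maps that temporal self-attention attends over."""
    if prev_bev is None:
        # First sample of a sequence: no history, so {Q, B'_{t-1}} becomes {Q, Q}.
        return [query, query]
    return [query, align_history(prev_bev, shift_rows, shift_cols)]


Q_now = [[0.1, 0.2], [0.3, 0.4]]
B_prev = [[1.0, 2.0], [3.0, 4.0]]
first_frame = temporal_value_set(Q_now, None)
later_frame = temporal_value_set(Q_now, B_prev, shift_rows=1)
```

The sign convention of the shift (which direction counts as forward on the grid) is also an assumption of this sketch.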
Further, the feature fusion module concatenates the features extracted by the RNN feature extractor with the BEV features along the feature dimension.
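A minimal sketch of concatenation along the feature dimension is given below. The text does not state how a single motion feature vector is matched to the spatial BEV grid; copying the vector to every grid cell before concatenating is an assumption of this example, and `fuse_features` is a hypothetical name.

```python
from typing import List

BevFeatures = List[List[List[float]]]  # H x W x C nested lists


def fuse_features(bev: BevFeatures, motion: List[float]) -> BevFeatures:
    """Concatenate the RNN motion vector (length D) to every BEV cell, giving H x W x (C + D)."""
    return [[list(cell) + list(motion) for cell in row] for row in bev]


bev_example: BevFeatures = [[[1.0, 2.0], [3.0, 4.0]]]  # H=1, W=2, C=2
motion_example = [9.0]                                  # D=1
fused = fuse_features(bev_example, motion_example)
```

With this choice the spatial layout of the BEV grid is preserved, and only the channel count grows from C to C + D.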
Further, the neighborhood attention Transformer embeds the fused BEV features using 2 consecutive 3 × 3 overlapping convolutions; the neighborhood attention Transformer planning model is arranged as a stack of 4 levels, with an overlapping tokenizer upstream of the first level and a downsampler connected between every two adjacent levels, the downsampler being a 3 × 3 convolution with stride 2; the neighborhood attention mechanism is as follows:
given an input $Y \in \mathbb{R}^{n \times d}$, a matrix whose rows are $d$-dimensional token vectors, with linear projections $Q$, $K$, $V$ of $Y$ and relative positional biases $G(u,v)$;
the attention weights $A_u^k$ of the $u$-th input with neighborhood size $k$ are defined as the dot products of the $u$-th input's query projection $Q_u$ with the key projections of its $k$ nearest neighbors:
$$A_u^k = \begin{bmatrix} Q_u K_{\rho_1(u)}^T + G(u,\rho_1(u)) \\ Q_u K_{\rho_2(u)}^T + G(u,\rho_2(u)) \\ \vdots \\ Q_u K_{\rho_k(u)}^T + G(u,\rho_k(u)) \end{bmatrix}$$
wherein $\rho_v(u)$ denotes the $v$-th nearest neighbor of $u$;
then, the neighboring values $V_u^k$ are defined as a matrix whose rows are the value projections of the $k$ nearest neighbors of the $u$-th input:
$$V_u^k = \begin{bmatrix} V_{\rho_1(u)}^T & V_{\rho_2(u)}^T & \cdots & V_{\rho_k(u)}^T \end{bmatrix}^T$$
the neighborhood attention of the $u$-th token with neighborhood size $k$ is:
$$\mathrm{NA}_k(u) = \mathrm{softmax}\!\left(\frac{A_u^k}{\sqrt{d}}\right) V_u^k$$
wherein $\sqrt{d}$ is the scaling parameter; this operation is repeated for each pixel in the fused BEV feature.
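The neighborhood attention mechanism can be illustrated with a small, single-head Python sketch. To stay short it works on a 1D token sequence rather than the 2D BEV grid, takes the projections Q, K and V as given, and treats the relative positional bias as an optional callable that defaults to zero; all of these are simplifying assumptions, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Optional

Matrix = List[List[float]]


def neighbor_indices(u: int, n: int, k: int) -> List[int]:
    """Indices of the k nearest neighbors of token u; the window is clamped at the borders."""
    if k > n:
        raise ValueError("neighborhood size k must not exceed the number of tokens")
    start = min(max(u - k // 2, 0), n - k)
    return list(range(start, start + k))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_attention(Q: Matrix, K: Matrix, V: Matrix, k: int,
                           bias: Optional[Callable[[int, int], float]] = None) -> Matrix:
    """Compute softmax(A_u^k / sqrt(d)) V_u^k for every token u."""
    n, d = len(Q), len(Q[0])
    out: Matrix = []
    for u in range(n):
        nbrs = neighbor_indices(u, n, k)
        # A_u^k: query-key dot products plus the relative positional bias.
        scores = [sum(q * kk for q, kk in zip(Q[u], K[v])) + (bias(u, v) if bias else 0.0)
                  for v in nbrs]
        weights = softmax([s / math.sqrt(d) for s in scores])
        # Weighted sum of the rows of V_u^k.
        out.append([sum(w * V[v][c] for w, v in zip(weights, nbrs))
                    for c in range(len(V[0]))])
    return out


Q_demo = [[0.0], [0.0], [0.0], [0.0]]
K_demo = [[1.0], [1.0], [1.0], [1.0]]
V_demo = [[0.0], [3.0], [6.0], [9.0]]
na_out = neighborhood_attention(Q_demo, K_demo, V_demo, k=3)
```

With all-zero queries the attention weights are uniform, so each output is the mean of the values in its neighborhood. When k equals the number of tokens, every neighborhood covers the whole sequence and the result matches ordinary self-attention.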
The training method for the end-to-end automatic driving trajectory planning and prediction system fusing BEVFormer and the neighborhood attention Transformer is characterized by comprising the following steps:
S1, collecting the input data of the model: collecting multi-view images, historical trajectory data of the vehicle and vehicle state data, and preprocessing the collected data;
S2, taking the historical trajectory data and vehicle state data of the vehicle as the input of the RNN model, and extracting the dynamic features and associated features of vehicle motion from them with the RNN feature extractor;
S3, taking the multi-view images as the input of the BEVFormer model, and extracting BEV features carrying temporal and spatial information with the BEVFormer feature extraction module;
S4, fusing the dynamic features and associated features of vehicle motion extracted by the RNN into the BEV features through the feature fusion module;
S5, training, verifying and optimizing the neighborhood attention Transformer model: inputting the fused BEV features into the neighborhood attention Transformer for training; after the neighborhood attention Transformer output, applying a fully connected layer to map the features to the output space and output the future trajectory planning result of the vehicle; and repeating steps S2 to S5 to verify and optimize the model.
Further, the multi-view image in step S1 includes image data from cameras of different view angles; the historical track data of the vehicle is to record the motion track of the vehicle in the past period of time, and comprises position coordinates, speed, acceleration and course angle information of the vehicle; the vehicle state data is information of the current state of the vehicle, and comprises the speed, the acceleration, the steering angle, the vehicle power system parameters and the brake system parameters of the vehicle.
Further, the data preprocessing in step S1 includes:
(1) Data cleaning: cleaning the data, including processing missing values, outliers and repeated values;
(2) Feature selection: selecting characteristics related to the problems, and eliminating irrelevant characteristics;
(3) Data set partitioning: dividing the data set into a training set, a verification set and a test set;
(4) Data encoding: encoding the collected data and converting it into numerical data;
(5) Data normalization: the data is normalized to have zero mean and unit variance to eliminate dimensional differences between features.
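The normalization in item (5) is the usual z-score transform. The sketch below applies it to one feature column; the function name and the choice to map a constant column to zeros are assumptions of the example. In practice the mean and standard deviation are normally computed on the training set and reused for the validation and test sets.

```python
import math
from typing import List


def standardize(column: List[float]) -> List[float]:
    """Rescale one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    if std == 0.0:
        # Assumption: a constant column carries no information and maps to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


speeds = [2.0, 4.0, 6.0]  # illustrative values, e.g. vehicle speed in m/s
scaled = standardize(speeds)
```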
Further, the specific steps of model training, verification and optimization in step S5 are as follows:
(1) Initializing model parameters: initializing parameters of the model by using a random initialization method;
(2) Defining a loss function: determining a loss function to measure a difference between the model's predictions and the true value on the training data; the loss function is a mean square error or cross entropy function;
(3) Defining an optimization algorithm: selecting an optimization algorithm suitable for model training, such as gradient descent, stochastic gradient descent or Adam; stochastic gradient descent is selected for this task; the goal of the optimization algorithm is to minimize the loss function by adjusting the parameters of the model.
(4) Iterative training: performing iterative training of the model, each training iteration cycle comprising the steps of:
a. Forward propagation: the input data is propagated forward through the model to obtain the predicted value;
b. Loss calculation: the predicted value is compared with the true value, and the value of the loss function is calculated;
c. Back propagation: the gradient of the loss with respect to each parameter is calculated by the back-propagation algorithm;
d. Parameter update: the parameters of the model are updated according to the gradients using the optimization algorithm;
the above steps are repeated until a set stopping condition is reached (such as reaching the maximum number of iterations or convergence of the loss function);
(5) Verification and adjustment of the model: during training, periodically using the validation set to evaluate the performance of the model; and performing model adjustment, such as learning rate adjustment, regularization increase and the like, according to the performance of the verification set so as to optimize the performance of the model.
(6) Model preservation: the trained model parameters are saved for subsequent use and deployment.
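Steps (1) to (6) can be illustrated on a deliberately tiny stand-in model. The sketch below trains a one-variable linear model with stochastic gradient descent and a mean-squared-error loss; the model, the data, the learning rate and the stopping threshold are all assumptions chosen for the example and have nothing to do with the actual planning network. Learning-rate adjustment and regularization are omitted.

```python
import json
import random
from typing import Dict, List, Tuple

Pair = Tuple[float, float]


def mse(w: float, b: float, data: List[Pair]) -> float:
    """Mean squared error of the linear model y = w * x + b on a data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_sgd(train: List[Pair], val: List[Pair], lr: float = 0.1,
              max_epochs: int = 500, tol: float = 1e-10, seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)  # (1) random initialization
    samples = list(train)
    val_loss = mse(w, b, val)
    for _ in range(max_epochs):                 # stop at the maximum iteration count
        rng.shuffle(samples)
        for x, y in samples:
            pred = w * x + b                    # a. forward propagation
            error = pred - y                    # b. per-sample loss is error ** 2
            grad_w, grad_b = 2.0 * error * x, 2.0 * error   # c. gradients of the loss
            w, b = w - lr * grad_w, b - lr * grad_b         # d. parameter update
        val_loss = mse(w, b, val)               # (5) periodic check on the validation set
        if val_loss < tol:                      # stop once the loss has converged
            break
    return {"w": w, "b": b, "val_loss": val_loss}


train_set = [(x / 4.0, 2.0 * (x / 4.0) + 1.0) for x in range(5)]  # samples of y = 2x + 1
val_set = [(0.1, 1.2), (0.9, 2.8)]
params = train_sgd(train_set, val_set)
saved = json.dumps(params)                      # (6) keep the trained parameters
```

The serialized string `saved` stands in for writing the parameters to disk for later deployment.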
An automatic driving trajectory planning and prediction method based on the end-to-end automatic driving trajectory planning and prediction system fusing BEVFormer and the neighborhood attention Transformer is characterized in that
the method comprises the following steps:
Step 1, collecting the input data of the model: collecting multi-view images, historical trajectory data of the vehicle and vehicle state data, and preprocessing the collected data;
Step 2, taking the historical trajectory data and vehicle state data of the vehicle as the input of the RNN model, and extracting the dynamic features and associated features of vehicle motion from them with the RNN feature extractor;
Step 3, taking the multi-view images as the input of the BEVFormer model, and extracting BEV features carrying temporal and spatial information with the BEVFormer feature extraction module;
Step 4, fusing the dynamic features and associated features of vehicle motion extracted by the RNN feature extractor into the BEV features through the feature fusion module;
Step 5, prediction with the neighborhood attention Transformer model: inputting the fused BEV features into the neighborhood attention Transformer; after the neighborhood attention Transformer output, applying a fully connected layer to map the features to the output space and output the future trajectory planning result of the vehicle.
The invention provides an end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer, based on perception of the complex environment of the autonomous vehicle. In the corresponding trajectory planning and prediction method, the multi-view images are first taken as the input of BEVFormer, and the historical trajectory data and vehicle state data are taken as the input of the RNN feature extractor. BEVFormer learns a unified BEV feature representation through a spatio-temporal Transformer so as to effectively capture the spatial relationships and temporal information in the input data, and the RNN feature extractor extracts the dynamic features and associated features of vehicle motion from the historical trajectory data and vehicle state data. Then, the dynamic and associated features extracted by the RNN feature extractor are fused into the BEV features by the feature fusion module. Finally, the vehicle control signals and trajectory are output through the neighborhood-attention-based Transformer model and a fully connected layer. The neighborhood attention Transformer is a hierarchical Transformer designed to be efficient, accurate and scalable. Neighborhood attention itself is an efficient and scalable sliding-window attention mechanism that restricts the receptive field of each query to its nearest neighbors and approaches self-attention as the receptive field grows.
Therefore, the end-to-end automatic driving track planning system improves the understanding and decision making capability of the automatic driving system to the environment, and improves the accuracy and generalization of the automatic driving system, so that the control signal or the future track of the automatic driving vehicle can be more accurately output, and support is provided for realizing safer and intelligent automatic driving.
The beneficial effects of the invention are as follows:
1. The invention provides a feature fusion module that fuses the vehicle motion features obtained by the RNN feature extractor with the BEV features. The RNN feature extractor captures the timing information and patterns in the vehicle's historical trajectory data and vehicle state data. By fusing the vehicle motion features with the BEV features, this timing information is incorporated into the overall feature representation, which describes vehicle behavior and state more completely. Fusing different types of features therefore provides a richer, more diverse, multi-dimensional feature representation of the state of the vehicle and its surroundings, improving the expressive power of the model. In addition, by fusing the vehicle motion features with the bird's-eye-view features, the model can better understand and analyze complex relationships such as vehicle behavior and environmental change, further improving its performance on related tasks.
2. The invention provides a novel method fusing BEVFormer and a neighborhood attention Transformer. First, BEVFormer effectively aggregates spatio-temporal features from the multi-view cameras together with historical features. It is a spatio-temporal Transformer that can support a variety of automatic driving perception tasks. BEVFormer interacts with the spatial and temporal domains through predefined grid-shaped BEV queries and exploits spatial and temporal information across cameras and time steps, improving the performance and efficiency of the framework. Second, the neighborhood attention Transformer captures the associations between different positions in a feature by applying self-attention within a local neighborhood. This locality modeling makes the neighborhood attention Transformer well suited to fine-grained features and structures. In addition, the self-attention in the neighborhood attention Transformer can be computed in parallel, so the features at different positions are computed at the same time, which makes the neighborhood attention Transformer computationally efficient on large-scale data.
Drawings
FIG. 1 is a block diagram of the end-to-end automatic driving trajectory planning system fusing BEVFormer and the neighborhood attention Transformer according to the present invention.
FIG. 2 is a flow chart of the end-to-end automatic driving trajectory planning system fusing BEVFormer and the neighborhood attention Transformer according to the present invention.
FIG. 3 is a flow chart of training and testing the fused BEVFormer and neighborhood attention Transformer model according to the present invention.
FIG. 4 is a block diagram of the BEVFormer model architecture of the present invention.
FIG. 5 is a diagram of the neighborhood attention Transformer architecture of the present invention.
FIG. 6 is a block diagram of the neighborhood attention mechanism of the present invention.
Fig. 7 is a diagram showing the results of the end-to-end autopilot trajectory planning system of the present invention.
Fig. 8 is an exemplary diagram of an application scenario of the end-to-end automatic driving trajectory planning system of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings, but the protection of the invention is not limited thereto.
The invention discloses an end-to-end automatic driving trajectory planning and prediction system fusing BEVFormer and a neighborhood attention Transformer which, as shown in FIG. 1, comprises a data acquisition module, an RNN feature extractor, a BEVFormer feature extraction module, a feature fusion module and a neighborhood attention Transformer planning model.
The data acquisition module comprises cameras with different viewing angles, vehicle-mounted sensors and an inertial measurement unit; it is used for acquiring multi-view images, historical trajectory data of the vehicle and vehicle state data, and for preprocessing the acquired data.
The RNN feature extractor is used for extracting dynamic features and associated features of vehicle motion from the historical trajectory data and vehicle state data of the vehicle. For the historical trajectory data, the trend of motion change in the trajectory is captured through time-series processing, and the temporal dependence in the driving process is captured by the memory mechanism of the RNN feature extractor. By learning from the vehicle state data, the association between vehicle state and behavior is obtained.
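The memory mechanism mentioned above can be shown with a minimal recurrent cell. The text does not say which RNN variant is used, so the sketch assumes a plain (Elman) cell with a tanh activation and takes the final hidden state as the motion feature; the weights are fixed by hand instead of being learned, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rnn_step(x: Vector, h: Vector, W_x: Matrix, W_h: Matrix, b: Vector) -> Vector:
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(sum(wx * xi for wx, xi in zip(W_x[j], x))
                  + sum(wh * hi for wh, hi in zip(W_h[j], h))
                  + b[j])
        for j in range(len(b))
    ]


def extract_motion_feature(sequence: List[Vector], W_x: Matrix, W_h: Matrix,
                           b: Vector) -> Vector:
    """Run the cell over a trajectory/state sequence and return the last hidden state."""
    h = [0.0] * len(b)
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)  # h carries information from earlier steps
    return h


# Hand-picked weights for a 1-dimensional input and a 1-dimensional hidden state.
W_x_demo, W_h_demo, b_demo = [[1.0]], [[0.5]], [0.0]
feat_ab = extract_motion_feature([[0.2], [0.8]], W_x_demo, W_h_demo, b_demo)
feat_ba = extract_motion_feature([[0.8], [0.2]], W_x_demo, W_h_demo, b_demo)
```

Because the hidden state feeds back into each step, the two sequences above yield different features even though they contain the same values, which is the time dependence the extractor relies on.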
The BEVFormer feature extraction module, as shown in FIG. 4, includes a backbone neural network and at least one encoder layer, each encoder layer including a BEV query mechanism, a spatial cross-attention module, a temporal self-attention module and a feed-forward network. The backbone neural network is used for obtaining the view features of the multi-view images at time t. The BEV query mechanism predefines a set of grid-shaped learnable parameters as the BEVFormer queries, used for querying the corresponding grid cell regions in the BEV plane. The BEV query mechanism queries the temporal information of the BEV spatial features through the temporal self-attention module, and extracts the spatial information of the queried BEV spatial features through the spatial cross-attention module. The spatial cross-attention module also obtains multi-view spatial information of the vehicle based on the view features of the multi-view images. The feed-forward network obtains the refined BEV feature $B_t$ based on the spatial information of the view features of the multi-view images and the temporal and spatial information of the BEV spatial features.
The feature fusion module is used for fusing the dynamic feature and the associated feature of the vehicle running extracted by the RNN feature extractor into the BEV feature.
The neighborhood attention Transformer planning model plans the future trajectory of the vehicle based on the fused BEV features, and then outputs the planning result through the fully connected layer and the visualization module.
As shown in FIG. 2, the training method for the end-to-end automatic driving trajectory planning model fusing BEVFormer and the neighborhood attention Transformer comprises the following steps:
data collection, data preprocessing, feature extraction, construction of the neighborhood attention Transformer model, and model training, verification and optimization.
S1, constructing the model input:
data collection uses cameras, sensors, GPS devices, inertial Measurement Units (IMUs) or experimental devices to collect images of multiple perspectives, historical trajectory data of the vehicle, and vehicle state data. Wherein the multi-view image is used as input of the BEVFormer feature extraction module, and the historical track data of the vehicle and the vehicle state data are used as input of the RNN feature extractor. The multi-view image is image data obtained from cameras from different views, each providing visual information of the surroundings of the vehicle. The historical track data of the vehicle records the motion track of the vehicle in a period of time in the past, and the motion track comprises information such as position coordinates, speed, acceleration, course angle and the like of the vehicle; such data may be obtained by means of onboard sensors such as GPS or Inertial Measurement Units (IMU). The vehicle state data provides information about the current state of the vehicle, including the speed, acceleration, steering angle, vehicle powertrain parameters (e.g., engine speed, accelerator opening), brake system parameters, etc. These data may be acquired by onboard sensors and an Electronic Control Unit (ECU) of the vehicle.
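One way to organize the three kinds of input is sketched below with Python dataclasses. The field list follows the quantities named in the paragraph above, but the class names, field names, units and the use of image file paths are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrajectoryPoint:
    """One past pose of the vehicle (hypothetical field names and units)."""
    x: float             # position coordinate, m
    y: float             # position coordinate, m
    speed: float         # m/s
    acceleration: float  # m/s^2
    heading: float       # course angle, rad

    def as_vector(self) -> List[float]:
        return [self.x, self.y, self.speed, self.acceleration, self.heading]


@dataclass
class VehicleState:
    """Current state of the vehicle as read from on-board sensors and the ECU."""
    speed: float
    acceleration: float
    steering_angle: float
    engine_speed: float
    accelerator_opening: float
    brake_params: Dict[str, float] = field(default_factory=dict)


@dataclass
class DrivingSample:
    """One sample: multi-view images for BEVFormer, history and state for the RNN."""
    image_paths: Dict[str, str]      # camera name -> image file (hypothetical layout)
    history: List[TrajectoryPoint]
    state: VehicleState

    def rnn_sequence(self) -> List[List[float]]:
        """Flatten the history into one numeric vector per time step."""
        return [point.as_vector() for point in self.history]


sample = DrivingSample(
    image_paths={"front": "front.jpg", "rear": "rear.jpg"},
    history=[TrajectoryPoint(0.0, 0.0, 5.0, 0.1, 0.0),
             TrajectoryPoint(0.5, 0.0, 5.1, 0.1, 0.0)],
    state=VehicleState(5.1, 0.1, 0.02, 1500.0, 0.2),
)
```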
The multi-view images, the historical track data of the vehicle and the vehicle state data need to be subjected to data preprocessing before being input into the bevfomer feature extraction module and the RNN feature extractor, and the raw data are cleaned, converted and sorted to improve the data quality, reduce noise, process missing values and outliers, and convert the data into a format suitable for model training and analysis. The following is the step of data preprocessing:
(1) Data cleaning: the data is cleaned, including processing missing values, outliers, duplicate values, etc. Various methods may be used, such as filling in missing values, deleting outliers, etc.
(2) Feature selection: features relevant to the problem are selected and irrelevant features are removed, so as to improve the effectiveness and efficiency of the model. Feature selection may be performed by statistical analysis, feature correlation, feature importance, etc.; its goal is to keep the features with the highest relevance and importance from the original data.
(3) Data set partitioning: as shown in FIG. 3, the processed multi-view images, historical trajectory data of the vehicle and vehicle state data are divided into a training set, a validation set and a test set in a ratio of …:2:2. The training set is used to train the model and adjust its parameters, the validation set is used for model selection and hyperparameter tuning, and the test set is used for the final evaluation of the performance and generalization ability of the model.
(4) Data encoding: the collected and processed data are encoded and converted into numerical data so that the model can process them.
(5) Data normalization: the data are standardized to have zero mean and unit variance, so as to eliminate scale differences among features and improve the stability and convergence rate of the model.
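The partitioning and normalization steps above can be illustrated with a minimal sketch. The function names, the toy speed values and the 6:2:2 ratio are assumptions made for illustration only; the statistics are fitted on the training set alone, which is a common practice rather than something the text prescribes.

```python
import random
import statistics

def split_dataset(samples, ratios=(6, 2, 2), seed=0):
    """Shuffle and split samples into train/validation/test lists."""
    # The 6:2:2 ratio is an assumed example value, not taken from the text.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def fit_standardizer(values):
    """Return (mean, std) estimated from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std

def standardize(values, mean, std):
    """Shift and scale values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]

# Hypothetical vehicle speeds used as a single numeric feature.
speeds = [float(v) for v in range(10)]
train, val, test = split_dataset(speeds)
mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
val_z = standardize(val, mean, std)  # reuse the training statistics
```

Reusing the training mean and standard deviation for the validation and test sets keeps information from those sets out of the fitted statistics.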
S2, feature extraction
(1) A BEVFormer feature extraction module is established and used to extract BEV features with spatio-temporal characteristics:
A Transformer-based BEV generation framework is used that efficiently aggregates spatio-temporal features from multi-view cameras and historical BEV features through attention mechanisms. As shown in FIG. 4, the BEVFormer feature extraction module has a backbone neural network and 3 encoder layers; each encoder layer follows the conventional Transformer structure except for the BEV query mechanism, the spatial cross-attention module and the temporal self-attention module. In each encoder layer, the BEV queries are used to query features in BEV space from the multi-view images through attention. The spatial cross-attention module and the temporal self-attention module are attention layers that work with the BEV queries to look up and aggregate spatial features from the multi-camera images and temporal and spatial features from the historical BEV. Specifically, at time $t$, the BEVFormer feature extraction module first obtains the view features $F_t$ of the multi-view images through a backbone neural network (e.g., VGGNet), while the system retains the historical BEV feature $B_{t-1}$ from the previous time $t-1$. In each encoder layer, the BEV queries first query the temporal information of the previous BEV feature $B_{t-1}$ through the temporal self-attention module; the BEV queries then extract the spatial information of the BEV feature through the spatial cross-attention module, which also queries spatial information from the multi-camera features $F_t$ at time $t$; finally, after the feed-forward network, the encoder layer outputs the refined BEV features as input to the next encoder layer. After 3 stacked encoder layers, a unified BEV feature $B_t$ at the current time $t$ is generated for subsequent tasks.
First, BEV query mechanism:
The BEV query mechanism predefines a set of grid-shaped learnable parameters $Q \in \mathbb{R}^{H \times W \times C}$ as the BEVFormer queries, where $H$ and $W$ are the spatial shape of the BEV plane. Specifically, the query $Q_p \in \mathbb{R}^{1 \times C}$ located at $p = (x, y)$ of $Q$ is responsible for querying the corresponding grid cell region in the BEV plane. Each grid cell in the BEV plane corresponds to a real-world size of $s$ meters. The center of the BEV feature corresponds by default to the position of the ego vehicle. Before the BEV queries $Q$ are input into BEVFormer, a learnable positional embedding is added to them.
Second, spatial cross-attention module:
The computational cost of ordinary multi-head attention is high because of the large scale of the multi-camera 3D input. Spatial cross-attention is therefore built on deformable attention, a resource-efficient attention layer in which each BEV query $Q_p$ interacts only with its regions of interest in the camera views. The deformable attention mechanism is formulated as:
$$\mathrm{DeformAttn}(q, p, x) = \sum_{i=1}^{N_{head}} W_i \sum_{j=1}^{N_{key}} A_{ij} \cdot W'_i \, x(p + \Delta p_{ij})$$
wherein,
- $q$, $p$, $x$ represent the query feature, the reference point and the input features, respectively;
- $i$ is the index of the attention head and $j$ is the index of the sampled key;
- $N_{head}$ is the total number of attention heads, and $N_{key}$ is the number of sampled keys per attention head;
- $W_i \in \mathbb{R}^{C \times (C/N_{head})}$ and $W'_i \in \mathbb{R}^{(C/N_{head}) \times C}$ are learnable parameters, where $C$ is the feature dimension;
- $A_{ij} \in [0, 1]$ is the predicted attention weight, normalized by $\sum_{j=1}^{N_{key}} A_{ij} = 1$;
- $\Delta p_{ij} \in \mathbb{R}^2$ is the predicted offset relative to the reference point $p$;
- $x(p + \Delta p_{ij})$ is the feature at position $p + \Delta p_{ij}$, extracted by bilinear interpolation.
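The sampling step of this attention can be sketched as follows. This is a simplified illustration, assuming a single attention head, scalar features, and offsets and weights supplied by hand rather than predicted from the query; the learnable projections are omitted, and the function names are hypothetical.

```python
import math

def bilinear_sample(feature, px, py):
    """Bilinearly interpolate a 2D grid feature[row][col] at (px, py).

    px is the column (x) coordinate and py the row (y) coordinate;
    grid points outside the feature map contribute zero.
    """
    h, w = len(feature), len(feature[0])
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                weight = (1 - abs(px - xx)) * (1 - abs(py - yy))
                value += weight * feature[yy][xx]
    return value

def deform_attn_single_head(feature, p, offsets, weights):
    """Weighted sum of features sampled at p + offset, for one head."""
    assert abs(sum(weights) - 1.0) < 1e-9  # attention weights sum to 1
    return sum(a * bilinear_sample(feature, p[0] + dx, p[1] + dy)
               for a, (dx, dy) in zip(weights, offsets))

# Toy 2 x 2 feature map and two sampling points around p = (0.5, 0.5).
feature = [[0.0, 1.0], [2.0, 3.0]]
out = deform_attn_single_head(feature, (0.5, 0.5),
                              offsets=[(0.0, 0.0), (0.5, 0.5)],
                              weights=[0.5, 0.5])
```

Only the few sampled locations are read for each query, which is why the cost does not grow with the full size of the camera feature maps.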
However, deformable attention was designed for 2D perception, so some adjustments are needed for 3D scenes. Specifically, each query on the BEV plane is first lifted to a pillar-shaped query, $N_{ref}$ 3D reference points are sampled from the pillar, and these points are then projected into the 2D views. For a given BEV query, the projected 2D points fall only on some of the views, while the other views are not hit; the hit views are denoted $V_{hit}$. These 2D points are then treated as the reference points of the query $Q_p$, and features are sampled around them from the hit views $V_{hit}$. Finally, the weighted sum of the sampled features is taken as the output of spatial cross-attention. The spatial cross-attention (SCA) process is formulated as:
$$\mathrm{SCA}(Q_p, F_t) = \frac{1}{|V_{hit}|} \sum_{i \in V_{hit}} \sum_{j=1}^{N_{ref}} \mathrm{DeformAttn}\big(Q_p, P(p, i, j), F_t^i\big)$$
- $i$ is the index of the camera view;
- $j$ is the index of the reference point;
- $N_{ref}$ is the total number of reference points per BEV query;
- $F_t^i$ is the feature of the $i$-th camera view;
- $Q_p$ is a BEV query;
- $P(p, i, j)$ is the projection function that yields the $j$-th reference point on the $i$-th view image;
- $\mathrm{DeformAttn}(\cdot)$ is deformable attention.
Next, it is described how the reference points on the view images are obtained from the projection function $P$. First, the real-world position $(x', y')$ corresponding to the query $Q_p$ located at $p = (x, y)$ is calculated as follows:
$$x' = \left(x - \frac{W}{2}\right) \times s, \qquad y' = \left(y - \frac{H}{2}\right) \times s$$
- $H$, $W$ are the height and width of the BEV queries;
- $s$ is the resolution of the BEV grid;
- $(x', y')$ are coordinates with the position of the ego vehicle as the origin.
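As a small worked example, the sketch below maps a BEV grid index to ego-centred coordinates, assuming the mapping x' = (x - W/2) * s and y' = (y - H/2) * s. The 200 x 200 grid and the 0.5 m cell size are illustrative values, not values given in the text.

```python
def bev_grid_to_world(x, y, H, W, s):
    """Map BEV grid index p = (x, y) to ego-centred metres (x', y')."""
    x_world = (x - W / 2) * s
    y_world = (y - H / 2) * s
    return x_world, y_world

# Assumed example grid: 200 x 200 cells, 0.5 m per cell.
H, W, s = 200, 200, 0.5
center = bev_grid_to_world(100, 100, H, W, s)   # grid centre = ego vehicle
corner = bev_grid_to_world(0, 0, H, W, s)       # 50 m behind and beside it
```

With these values the BEV plane covers a 100 m x 100 m square centred on the ego vehicle.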
In 3D space, an object located at $(x', y')$ appears at some height $z'$ on the z-axis. A set of anchor heights $\{z'_j\}_{j=1}^{N_{ref}}$ is therefore predefined to ensure that cues appearing at different heights can be captured. In this way, a pillar of 3D reference points $(x', y', z'_j)_{j=1}^{N_{ref}}$ is obtained for each query $Q_p$. Finally, the 3D reference points are projected into the different views through the projection matrices of the cameras, and the projection function is expressed as:
$$P(p, i, j) = (x_{ij}, y_{ij})$$
where
$$z_{ij} \cdot [x_{ij} \;\; y_{ij} \;\; 1]^T = T_i \cdot [x' \;\; y' \;\; z'_j \;\; 1]^T$$
Wherein:
- $P(p, i, j)$ is the 2D point on the $i$-th view projected from the $j$-th 3D point $(x', y', z'_j)$;
- $T_i \in \mathbb{R}^{3 \times 4}$ is the known projection matrix of the $i$-th camera.
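The projection of a pillar of reference points can be sketched as follows, assuming a 3 x 4 projection matrix per camera and treating points with non-positive depth as not hitting the view. The camera matrix below is a hypothetical pinhole camera looking along the +x axis, chosen only so that the arithmetic is easy to follow.

```python
def project_pillar(x_w, y_w, anchor_heights, T):
    """Project pillar points (x', y', z'_j) with a 3x4 matrix T.

    Returns one (x_ij, y_ij) per anchor height, or None when the point
    lies behind the camera (depth z_ij <= 0) and cannot hit the view.
    """
    points_2d = []
    for z_w in anchor_heights:
        homog = (x_w, y_w, z_w, 1.0)
        u, v, depth = (sum(t * c for t, c in zip(row, homog)) for row in T)
        points_2d.append((u / depth, v / depth) if depth > 0 else None)
    return points_2d

# Hypothetical pinhole camera looking along +x with focal length 100.
T = [
    [0.0, -100.0, 0.0, 0.0],   # image x from lateral offset y'
    [0.0, 0.0, -100.0, 0.0],   # image y from height z'
    [1.0, 0.0, 0.0, 0.0],      # depth z_ij = forward distance x'
]
hits = project_pillar(10.0, 1.0, [0.0, 2.0], T)
behind = project_pillar(-5.0, 0.0, [0.0], T)
```

A full hit test would also check that the projected point lies inside the image bounds; that check is left out here to keep the sketch short.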
Third, temporal self-attention module:
In addition to spatial information, temporal information is also critical for the system to understand the surrounding environment. For example, without temporal cues it is challenging to infer the velocity of moving objects or to detect highly occluded objects from static images. To address this problem, a temporal self-attention module is designed to represent the current environment by incorporating historical BEV features. The specific steps are as follows:
Given the BEV queries $Q$ at the current time $t$ and the historical BEV feature $B_{t-1}$ saved at time $t-1$, $B_{t-1}$ is first aligned with $Q$ according to the motion of the ego vehicle, so that features on the same grid cell correspond to the same real-world location. The aligned historical BEV feature is denoted $B'_{t-1}$. However, from $t-1$ to $t$, movable objects travel in the real world with different offsets, so it is challenging to construct an accurate association of the same object between the BEV features of different times. The temporal correlation between BEV features is therefore modeled by the temporal self-attention (TSA) module, formulated as:
$$\mathrm{TSA}(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q, B'_{t-1}\}} \mathrm{DeformAttn}(Q_p, p, V)$$
Wherein,
- $Q_p$ represents the BEV query located at $p = (x, y)$.
Furthermore, unlike ordinary deformable attention, the offsets $\Delta p$ in the temporal self-attention module are predicted from the concatenation of $Q$ and $B'_{t-1}$. In particular, for the first sample of each sequence, the temporal self-attention module degenerates to self-attention without temporal information, in which the BEV features $\{Q, B'_{t-1}\}$ are replaced with the duplicated BEV queries $\{Q, Q\}$.
Compared with simply stacking BEV features, the temporal self-attention module can thus model long-term temporal dependencies more effectively. BEVFormer extracts temporal information from the previous BEV feature rather than from multiple stacked BEV features, which reduces the computational cost and the interfering information.
(2) The RNN feature extractor extracts vehicle dynamic features and associated features from the historical trajectory data and vehicle state data of the vehicle. The processing steps are as follows:
First, the vehicle motion data, comprising the historical trajectory data and the vehicle state data, are input into the RNN feature extractor (e.g., an LSTM). For the historical trajectory data, the RNN feature extractor considers the position, speed, acceleration and other information of the vehicle at each point in time, and captures the motion patterns and trends in the trajectory through time-series processing. The memory mechanism of the RNN feature extractor helps capture temporal dependencies during vehicle driving, such as whether the acceleration of the vehicle remains consistent over a period of time, or whether the position of the vehicle changes periodically. By learning these patterns and trends in the historical trajectory data, the RNN feature extractor extracts the dynamic features of vehicle driving, providing a reference for subsequent driving strategies. For the vehicle state data, the RNN feature extractor may consider various sensor data of the vehicle, such as steering wheel angle, accelerator pedal position and brake pedal status, as well as other status information, such as vehicle speed, turn signal status and whether cruise control is engaged. With these vehicle state data as input, the RNN feature extractor can learn the association between vehicle state and behavior, for example that the vehicle is more likely to take a certain action in a certain state. By extracting these associated features, the RNN feature extractor provides more accurate inputs for the vehicle driving strategy.
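The way a recurrent extractor turns a variable-length history into a fixed-length feature can be sketched with a plain tanh recurrent cell. This is a simplified stand-in for an LSTM, not the extractor of this application; the weights are untrained illustrative values and the (speed, acceleration) inputs are hypothetical.

```python
import math

def rnn_encode(sequence, W_x, W_h, b):
    """Run a plain tanh recurrent cell over a sequence of feature vectors.

    h_t = tanh(W_x x_t + W_h h_{t-1} + b); the final hidden state is
    returned as the fixed-length summary of the whole history.
    """
    hidden = [0.0] * len(b)
    for x_t in sequence:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(W_x[k], x_t))
                + sum(w * h for w, h in zip(W_h[k], hidden))
                + b[k]
            )
            for k in range(len(b))
        ]
    return hidden

# Hypothetical history: (speed, acceleration) at three time steps.
history = [(1.0, 0.0), (1.0, 0.5), (1.5, 0.5)]
W_x = [[0.5, 0.0], [0.0, 0.5]]   # illustrative, untrained weights
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
dynamic_feature = rnn_encode(history, W_x, W_h, b)
```

Because each hidden state depends on the previous one, the same measurements presented in a different order produce a different feature, which is what lets the extractor represent trends over time.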
S3, feature fusion:
The vehicle motion features extracted by the RNN feature extractor, comprising the historical trajectory features and state features of the vehicle, are fused into the BEV features extracted by BEVFormer. Here, the features extracted by the RNN feature extractor and the BEV features are concatenated along the feature dimension to form a longer feature vector. In addition, data consistency and alignment are maintained during fusion, ensuring temporal and spatial correspondence between the features extracted by the RNN feature extractor and the BEV features. Fusing features from different feature extraction methods enhances the expressive power of the features and improves the modeling capacity, robustness and generalization ability of the model.
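Concatenation along the feature dimension can be sketched as follows. The sketch assumes that the vehicle feature vector is appended to every BEV grid cell; this broadcasting is one possible reading of the fusion step, not a detail stated in the text, and all values are toy numbers.

```python
def fuse_features(bev, vehicle_feature):
    """Concatenate a vehicle feature vector onto every BEV cell.

    bev is an H x W grid of C-dimensional lists; the result is an
    H x W grid of (C + D)-dimensional lists. The input is not modified.
    """
    return [[cell + list(vehicle_feature) for cell in row] for row in bev]

bev = [[[0.1, 0.2], [0.3, 0.4]],
       [[0.5, 0.6], [0.7, 0.8]]]      # H = W = 2, C = 2 (toy values)
vehicle_feature = [9.0, 8.0, 7.0]      # D = 3 (toy values)
fused = fuse_features(bev, vehicle_feature)
```

The spatial layout of the BEV grid is unchanged; only the per-cell feature length grows from C to C + D.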
S4, training, verifying and optimizing the neighborhood attention Transformer model:
(1) Constructing the neighborhood attention Transformer planning model:
As shown in FIG. 5, the neighborhood attention Transformer planning model is an efficient, accurate and scalable hierarchical Transformer. It embeds the fused BEV features using 2 consecutive 3 × 3 convolutions (stride 2), which yields a spatial size of 1/4 of the input size. This is similar to using a 4 × 4 patch-and-embedding layer, but it uses overlapping convolutions rather than non-overlapping ones, which can introduce a useful inductive bias. On the other hand, the 2 convolutions introduce more parameters; this is handled by reconfiguring the model, which yields a better trade-off. The planning model consists of 4 stacked levels of neighborhood attention Transformer; an overlapping tokenizer is placed upstream of the first level, and a downsampler is placed between every two adjacent levels. Each downsampler halves the spatial size and doubles the number of channels, using a 3 × 3 convolution (stride 2). Since the overlapping tokenizer downsamples the spatial size by a factor of 4, the model generates feature maps of sizes $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$ and $\frac{H}{32} \times \frac{W}{32}$, which allows the neighborhood attention Transformer to transfer pre-trained models to downstream tasks more easily.
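The spatial sizes produced by this hierarchy can be checked with the standard convolution output-size formula. The sketch below assumes a padding of 1 for every 3 x 3 convolution, a 224 x 224 input and 64 channels after the tokenizer; none of these three values is stated in the text.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def nat_stage_shapes(height, width, channels, num_levels=4):
    """Spatial size and channel count entering each hierarchy level."""
    # Overlapping tokenizer: two consecutive 3x3, stride-2 convolutions.
    for _ in range(2):
        height, width = conv_out(height), conv_out(width)
    shapes = [(height, width, channels)]
    # One 3x3, stride-2 downsampler between adjacent levels.
    for _ in range(num_levels - 1):
        height, width = conv_out(height), conv_out(width)
        channels *= 2
        shapes.append((height, width, channels))
    return shapes

# Assumed example: 224 x 224 input, 64 channels after the tokenizer.
shapes = nat_stage_shapes(224, 224, 64)
```

For this input the four levels see maps of 1/4, 1/8, 1/16 and 1/32 of the input size, with the channel count doubling at each downsampler.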
As shown in FIG. 6, the neighborhood attention mechanism is an efficient and scalable sliding-window attention mechanism that localizes the attention span of each pixel to its nearest neighborhood, approaches self-attention as its span grows, and maintains translational equivariance. Compared with self-attention, whose cost is quadratic in the number of tokens, it has linear time and space complexity, and it introduces a local inductive bias similar to that of convolution. The specific steps are as follows:
Given an input $Y \in \mathbb{R}^{n \times d}$, a matrix whose rows are $d$-dimensional token vectors, the linear projections $Q$, $K$, $V$ of $Y$, and the relative positional biases $G(u, v)$, the attention weights $A_u^k$ of the $u$-th input with neighborhood size $k$ are defined as the dot products of the query projection $Q_u$ of the $u$-th input with the key projections of its $k$ nearest neighbors:
$$A_u^k = \begin{bmatrix} Q_u K_{\rho_1(u)}^T + G(u, \rho_1(u)) \\ Q_u K_{\rho_2(u)}^T + G(u, \rho_2(u)) \\ \vdots \\ Q_u K_{\rho_k(u)}^T + G(u, \rho_k(u)) \end{bmatrix}$$
where $\rho_v(u)$ denotes the $v$-th nearest neighbor of $u$.
Then, the neighboring values $V_u^k$ are defined as a matrix whose rows are the value projections of the $k$ nearest neighbors of the $u$-th input:
$$V_u^k = \begin{bmatrix} V_{\rho_1(u)}^T & V_{\rho_2(u)}^T & \cdots & V_{\rho_k(u)}^T \end{bmatrix}^T$$
The neighborhood attention of the $u$-th token with neighborhood size $k$ is defined as:
$$\mathrm{NA}_k(u) = \mathrm{softmax}\left(\frac{A_u^k}{\sqrt{d}}\right) V_u^k$$
where $\sqrt{d}$ is the scaling parameter. This operation is repeated for each pixel in the fused BEV features.
As can be seen from this definition, as $k$ increases, $A_u^k$ approaches the self-attention weights and $V_u^k$ approaches $V$ itself. Each pixel attends to its $k$ nearest neighbors, so pixels near the edges of the input still attend to a full neighborhood, unlike sliding-window schemes that pad around the input to handle edge cases. It is because of this difference that neighborhood attention approaches self-attention as the window size grows.
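A one-dimensional version of neighborhood attention can be sketched as follows, assuming the form softmax((Q_u K_v^T + G(u, v)) / sqrt(d)) applied over the k nearest neighbors of each token. The edge handling by shifting the window is an illustrative choice, the bias table G is optional, and the function names are hypothetical.

```python
import math

def nearest_neighbors(u, n, k):
    """Indices of the k nearest positions to u in a length-n sequence.

    The window is shifted, not padded, near the edges, so every token
    always attends to exactly k real tokens.
    """
    assert 1 <= k <= n
    start = min(max(u - k // 2, 0), n - k)
    return list(range(start, start + k))

def neighborhood_attention(Q, K, V, k, G=None):
    """1D neighborhood attention; G[u][v] is an optional positional bias."""
    n, d = len(Q), len(Q[0])
    out = []
    for u in range(n):
        nbrs = nearest_neighbors(u, n, k)
        logits = [
            (sum(q * kk for q, kk in zip(Q[u], K[v]))
             + (G[u][v] if G else 0.0)) / math.sqrt(d)
            for v in nbrs
        ]
        m = max(logits)                      # subtract max for stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        out.append([
            sum(e / total * V[v][c] for e, v in zip(exps, nbrs))
            for c in range(len(V[0]))
        ])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
local = neighborhood_attention(tokens, tokens, tokens, k=1)
full = neighborhood_attention(tokens, tokens, tokens, k=3)
```

With k equal to the sequence length every token attends to all tokens, which is ordinary self-attention; with k = 1 each token attends only to itself.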
(2) Model training, verification and optimization
Model training plays a vital role in an automatic driving system: training is the process by which the model learns and extracts useful patterns and regularities from data. The model is trained through the steps of initializing the model parameters, defining a loss function, defining an optimization algorithm, iterative training, and verifying and adjusting the model; during training, the generalization ability and overfitting of the model are monitored to obtain the best model performance.
The neighborhood attention Transformer planning model reaches the expected performance through model training, verification and optimization. Finally, the model is applied to predict the driving control and future trajectory of the autonomous vehicle in complex scenarios. When the model is deployed, factors such as model security and privacy protection also need to be considered. In the model application phase, the trained end-to-end model performs prediction or inference on new input data and generates the corresponding driving control or future trajectory.
The data set is divided into a training set, a validation set and a test set according to a certain ratio. The training and validation sets are then used for model training and verification: the BEV features after feature fusion are input into the neighborhood attention Transformer planning model for training, verification and optimization.
The method comprises the following specific steps:
(1) Initializing model parameters: the parameters of the neighborhood attention Transformer planning model are initialized using a random initialization method. The purpose of initialization is to give the model a starting point from which it can gradually adjust its parameters during training to fit the data.
(2) Defining a loss function: an appropriate loss function is chosen to measure the difference between the predictions of the model and the true values on the training data. Common loss functions include mean square error (MSE) and cross entropy; here the cross-entropy function is chosen as the loss function.
(3) Defining an optimization algorithm: an optimization algorithm suitable for model training is selected; common algorithms include gradient descent, stochastic gradient descent and Adam. This task uses stochastic gradient descent. The objective of the optimization algorithm is to minimize the loss function by adjusting the parameters of the model.
(4) Iterative training: iterative training of the model begins. Each training iteration comprises the following steps:
a. Forward propagation: the input data are propagated forward through the model to obtain the predicted values.
b. Calculating the loss: the predicted values are compared with the true values and the value of the loss function is calculated.
c. Back propagation: based on the value of the loss function, the gradient of the loss with respect to each parameter is calculated by the back propagation algorithm.
d. Updating parameters: the parameters of the model are updated according to the gradients using the optimization algorithm.
The above steps are repeated until a set stopping condition is reached, such as reaching the maximum number of iterations or convergence of the loss function.
(5) Verification and adjustment of the model: during training, the performance of the model is periodically evaluated using the validation set. The model is adjusted according to its performance on the validation set, for example by adjusting the learning rate or adding regularization, so as to optimize the performance of the model.
(6) Model preservation: the trained model parameters are saved for subsequent use and deployment.
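The iterative training cycle above can be illustrated on a toy problem. The sketch below fits a one-feature logistic model with stochastic gradient descent and a cross-entropy loss; it stands in for the planning model only to show the forward, loss, backward and update steps, and the data, learning rate and stopping tolerance are assumed values. Validation and model saving are omitted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(data, lr=0.5, max_epochs=200, tol=1e-3, seed=0):
    """Fit a 1-feature logistic model with SGD and cross-entropy loss."""
    rng = random.Random(seed)
    data = list(data)                           # do not reorder the caller's list
    w, b = rng.uniform(-0.1, 0.1), 0.0          # (1) random initialisation
    history = []
    for _ in range(max_epochs):
        rng.shuffle(data)
        epoch_loss = 0.0
        for x, y in data:
            p = sigmoid(w * x + b)              # a. forward propagation
            p = min(max(p, 1e-12), 1 - 1e-12)   # keep log() finite
            epoch_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # b. loss
            grad = p - y                        # c. gradient of loss w.r.t. logit
            w -= lr * grad * x                  # d. parameter update
            b -= lr * grad
        history.append(epoch_loss / len(data))
        if history[-1] < tol:                   # stopping condition: converged
            break
    return w, b, history

# Toy separable data: negative inputs are class 0, positive inputs class 1.
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, history = train_sgd(samples)
```

The loop stops either when the mean epoch loss falls below the tolerance or when the maximum number of epochs is reached, mirroring the two stopping conditions named above.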
Model generation and optimization: for each test set sample, the test set sample is input into a trained model for prediction, the model propagates forward according to the input characteristics, and the output result is calculated layer by layer. In the trained model, the model generates control signals or future trajectories of the vehicle based on current inputs. The generated vehicle control signal or future trajectory is then transferred to a specific execution unit of the vehicle to control the vehicle. At the same time, data in the actual execution process continues to be collected for further optimization and iteration of the model.
FIG. 7 shows the result display of the end-to-end automatic driving trajectory planning system. Referring to FIG. 7, the interface consists of four parts: frame (1) shows the current driving state of the vehicle, including driving, stopping, braking, accelerating, starting, etc.; frame (2) displays the current time and the signal state of the vehicle; frame (3) displays the four display interface keys of the end-to-end prediction system, including vehicle history track information, vehicle state information and end-to-end prediction information, each of which opens the corresponding display; frame (4) shows the vehicle-related information for the key selected in frame (3).
In detail, the "vehicle history track information" interface displays the track of the vehicle up to the current moment; the vehicle history track data mainly comprises the position information, time stamp, speed, direction, etc. of the vehicle. The "vehicle state information" interface displays the state of the vehicle at the current time; the vehicle state data mainly includes the vehicle ID, vehicle speed, current traveling direction, acceleration, steering angle, vehicle type, vehicle state, etc. The "end-to-end prediction information" interface shows the driving control signal or future track of the autonomous vehicle predicted at the current time from the information described above.
FIG. 8 shows an example application scenario of the end-to-end automatic driving trajectory planning system.
It should be understood at first that fig. 8 is presented by way of example only and is not intended to limit the scope of the present application.
Referring to FIG. 8, the traffic scenario in the figure is taken as an example to study the driving control or future track of the ego vehicle during driving. The ego vehicle collects various information, including its historical track information, vehicle state information and multi-view image information; it then processes this information and passes it to the trained prediction model, which outputs the driving control or future track of the vehicle. Finally, all prediction results are displayed on the central control screen of the vehicle, showing the driver the real-time driving strategy of the vehicle and providing related suggestions.
The above list of detailed descriptions is only specific to practical embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent manners or modifications that do not depart from the technical scope of the present invention should be included in the scope of the present invention.
Claims (11)
1. An end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer, characterized by comprising:
a data acquisition module for acquiring multi-view images, historical trajectory data of the vehicle and vehicle state data, and preprocessing the acquired data;
an RNN feature extractor for extracting dynamic features and associated features of vehicle driving from the historical trajectory data and vehicle state data of the vehicle; for the historical trajectory data, the motion trend in the trajectory is captured through time-series processing, and the temporal dependencies during vehicle driving are captured by the memory mechanism of the RNN feature extractor; by learning the vehicle state data, the association between vehicle state and behavior is obtained;
a BEVFormer feature extraction module comprising a backbone neural network and at least one encoder layer, wherein each encoder layer comprises a BEV query mechanism, a spatial cross-attention module, a temporal self-attention module and a feed-forward network;
the backbone neural network is used for obtaining the view features $F_t$ of the multi-view images at time $t$;
the BEV query mechanism predefines a set of grid-shaped learnable parameters $Q \in \mathbb{R}^{H \times W \times C}$ as the BEVFormer queries, for querying the corresponding grid cell region at $p = (x, y)$ in the BEV plane;
the BEV query mechanism queries the temporal information of the BEV spatial feature $B_{t-1}$ through the temporal self-attention module, and extracts the spatial information of the BEV spatial feature $B_{t-1}$ through the spatial cross-attention module;
the spatial cross-attention module further obtains multi-view spatial information of the vehicle based on the view features $F_t$ of the multi-view images;
the feed-forward network obtains the refined BEV feature $B_t$ based on the spatial information of the view features $F_t$ of the multi-view images and the temporal information and spatial information of the BEV spatial feature $B_{t-1}$;
a feature fusion module for fusing the dynamic features and associated features of vehicle driving extracted by the RNN feature extractor into the BEV features;
and a neighborhood attention Transformer planning model for planning the future trajectory of the vehicle based on the fused BEV features and outputting the planning result through a fully connected layer and a visualization module.
2. The end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer according to claim 1, characterized in that the data acquisition module comprises cameras of different viewing angles, on-board sensors and inertial measurement units.
3. The end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer according to claim 1, characterized in that the spatial cross-attention (SCA) process of the spatial cross-attention module is formulated as:
$$\mathrm{SCA}(Q_p, F_t) = \frac{1}{|V_{hit}|} \sum_{i \in V_{hit}} \sum_{j=1}^{N_{ref}} \mathrm{DeformAttn}\big(Q_p, P(p, i, j), F_t^i\big)$$
- $i$ is the index of the camera view;
- $j$ is the index of the reference point;
- $N_{ref}$ is the total number of reference points per BEV query;
- $F_t^i$ is the feature of the $i$-th camera view;
- $Q_p$ is a BEV query;
- $P(p, i, j)$ is the projection function, giving the 2D point on the $i$-th view projected from the $j$-th 3D point $(x', y', z'_j)$;
- $\mathrm{DeformAttn}(\cdot)$ is deformable attention;
the projection function is formulated as:
$$P(p, i, j) = (x_{ij}, y_{ij})$$
where
$$z_{ij} \cdot [x_{ij} \;\; y_{ij} \;\; 1]^T = T_i \cdot [x' \;\; y' \;\; z'_j \;\; 1]^T$$
wherein $T_i \in \mathbb{R}^{3 \times 4}$ is the known projection matrix of the $i$-th camera.
4. The end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer according to claim 1, characterized in that, given the BEV queries $Q$ at the current time $t$ and the historical BEV feature $B_{t-1}$ saved at time $t-1$, $B_{t-1}$ is first aligned with $Q$ according to the motion of the ego vehicle, so that features on the same grid cell correspond to the same real-world location, the aligned historical BEV feature being denoted $B'_{t-1}$; the temporal correlation between BEV features modeled by the temporal self-attention module is expressed as follows:
$$\mathrm{TSA}(Q_p, \{Q, B'_{t-1}\}) = \sum_{V \in \{Q, B'_{t-1}\}} \mathrm{DeformAttn}(Q_p, p, V)$$
wherein,
- $Q_p$ represents the BEV query located at $p = (x, y)$;
the offsets in the temporal self-attention module are predicted from the concatenation of $Q$ and $B'_{t-1}$; for the first sample of each sequence, the temporal self-attention module degenerates to self-attention without temporal information, in which the BEV features $\{Q, B'_{t-1}\}$ are replaced with the duplicated BEV queries $\{Q, Q\}$.
5. The end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer according to claim 1, characterized in that the feature fusion module concatenates the features extracted by the RNN feature extractor and the BEV features along the feature dimension.
6. The end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer according to claim 1, characterized in that the neighborhood attention Transformer embeds the fused BEV features using 2 consecutive 3 × 3 overlapping convolutions; multiple levels of the neighborhood attention Transformer planning model are arranged in a stack, an overlapping tokenizer is arranged upstream of the first level, and a downsampler is connected between every two adjacent levels, the downsampler being a 3 × 3 convolution with stride 2; the neighborhood attention mechanism is as follows:
given an input $Y \in \mathbb{R}^{n \times d}$, a matrix whose rows are $d$-dimensional token vectors, the linear projections $Q$, $K$, $V$ of $Y$, and the relative positional biases $G(u, v)$;
the attention weights $A_u^k$ of the $u$-th input with neighborhood size $k$ are defined as the dot products of the query projection $Q_u$ of the $u$-th input with the key projections of its $k$ nearest neighbors:
$$A_u^k = \begin{bmatrix} Q_u K_{\rho_1(u)}^T + G(u, \rho_1(u)) \\ Q_u K_{\rho_2(u)}^T + G(u, \rho_2(u)) \\ \vdots \\ Q_u K_{\rho_k(u)}^T + G(u, \rho_k(u)) \end{bmatrix}$$
where $\rho_v(u)$ denotes the $v$-th nearest neighbor of $u$;
then, the neighboring values $V_u^k$ are defined as a matrix whose rows are the value projections of the $k$ nearest neighbors of the $u$-th input:
$$V_u^k = \begin{bmatrix} V_{\rho_1(u)}^T & V_{\rho_2(u)}^T & \cdots & V_{\rho_k(u)}^T \end{bmatrix}^T$$
the neighborhood attention of the $u$-th token with neighborhood size $k$ is:
$$\mathrm{NA}_k(u) = \mathrm{softmax}\left(\frac{A_u^k}{\sqrt{d}}\right) V_u^k$$
where $\sqrt{d}$ is the scaling parameter, and this operation is repeated for each pixel in the fused BEV features.
7. A training method of the end-to-end automatic driving trajectory planning system fusing BEVFormer and a neighborhood attention Transformer according to any one of claims 1-6, characterized by comprising the following steps:
S1, collecting the input data of the model: collecting multi-view images, historical trajectory data of the vehicle and vehicle state data, and preprocessing the collected data;
S2, feature extraction: the historical trajectory data and vehicle state data of the vehicle are used as the input of the RNN feature extractor, which extracts the dynamic features and associated features of vehicle driving from them; the multi-view images are used as the input of the BEVFormer model, and BEV features with temporal and spatial characteristics are extracted by the BEVFormer feature extraction module;
S3, fusing the dynamic features and associated features of vehicle driving extracted by the RNN feature extractor into the BEV features through the feature fusion module;
S4, training, verifying and optimizing the neighborhood attention Transformer model: the fused BEV features are input into the neighborhood attention Transformer for training; after the output of the neighborhood attention Transformer, a fully connected layer is applied to map the features to the output space and output the future trajectory planning result of the vehicle; and the steps S2-S5 are repeated to verify and optimize the model.
8. The training method of an automatic driving trajectory planning system according to claim 7, characterized in that the multi-view images in step S1 include image data from different view cameras; the historical track data of the vehicle is to record the motion track of the vehicle in the past period of time, and comprises position coordinates, speed, acceleration and course angle information of the vehicle; the vehicle state data is information of the current state of the vehicle, and comprises the speed, the acceleration, the steering angle, the vehicle power system parameters and the brake system parameters of the vehicle.
9. The method of training an automatic driving trajectory planning prediction model according to claim 7, characterized in that the data preprocessing in step S1 comprises:
(1) Data cleaning: cleaning the data, including handling missing values, outliers and duplicate values;
(2) Feature selection: selecting features relevant to the task and eliminating irrelevant features;
(3) Data set partitioning: dividing the data set into a training set, a verification set and a test set;
(4) Data encoding: encoding the collected data and converting it into numerical data;
(5) Data normalization: the data is normalized to have zero mean and unit variance to eliminate dimensional differences between features.
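The preprocessing steps above can be sketched as follows, under stated assumptions: rows are already numeric lists, cleaning is limited to dropping rows with missing values and exact duplicates (outlier handling, feature selection and encoding are omitted), and the split ratios are arbitrary. Fitting the mean and standard deviation on the training set only is a common practice assumed here, not something the claim specifies. All function names are hypothetical.

```python
import math
import random

def clean(rows):
    """Drop rows with missing values and exact duplicates, keeping first occurrences."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if any(v is None for v in row) or key in seen:
            continue
        seen.add(key)
        out.append(list(row))
    return out

def split(rows, train=0.7, val=0.15, seed=0):
    """Shuffle and divide rows into training, verification and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def fit_standardizer(rows):
    """Per-feature mean and standard deviation; a zero deviation is replaced by 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def standardize(rows, means, stds):
    """Scale each feature to zero mean and unit variance using fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

In use, `fit_standardizer` would be called on the training set and the resulting statistics passed to `standardize` for all three sets, so that no information from the verification or test sets leaks into training.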
10. The training method of the automatic driving trajectory planning system according to claim 7, wherein the specific steps of model training, verification and optimization in step S4 are:
(1) Initializing model parameters: initializing parameters of the model by using a random initialization method;
(2) Defining a loss function: determining a loss function to measure a difference between the model's predictions and the true value on the training data; the loss function is a mean square error or cross entropy function;
(3) Defining an optimization algorithm: selecting an optimization algorithm suitable for model training, wherein the optimization algorithm is gradient descent, stochastic gradient descent or Adam, and this task uses stochastic gradient descent;
(4) Iterative training: performing iterative training of the model, each training iteration cycle comprising the steps of:
a. forward propagation: the input data is propagated forward through the model to obtain a predicted value;
b. calculating loss: comparing the predicted value with the true value, and calculating the value of the loss function;
c. back propagation: calculating the gradient of the loss with respect to each parameter through the back propagation algorithm;
d. updating parameters: updating the parameters of the model according to the gradients by using the optimization algorithm;
repeating steps a to d until the set stop condition is reached;
(5) Verification and adjustment of the model: during training, periodically using the verification set to evaluate the performance of the model; and adjusting the model according to its performance on the verification set, for example by adjusting the learning rate or increasing regularization, so as to optimize the performance of the model;
(6) Model saving: the trained model parameters are saved for subsequent use and deployment.
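Steps (1) to (6) can be sketched on a deliberately tiny stand-in model, a line y = w*x + b fitted with mean square error and stochastic gradient descent. This is an illustration of the loop structure only: the model, the hyperparameter values, the stop tolerance and the JSON file format are all assumptions, and gradients are written out by hand instead of using a back propagation library.

```python
import json
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(train_set, val_set, lr=0.05, max_epochs=500, tol=1e-10, seed=0, path=None):
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)  # (1) random initialization
    best_loss, best_params = float("inf"), (w, b)
    for _ in range(max_epochs):                            # (4) iterative training
        samples = list(train_set)
        rng.shuffle(samples)
        for x, y in samples:
            pred = w * x + b                               # a. forward propagation
            err = pred - y                                 # b. per-sample loss is err ** 2
            grad_w, grad_b = 2 * err * x, 2 * err          # c. gradients of the loss
            w, b = w - lr * grad_w, b - lr * grad_b        # d. SGD parameter update
        val_loss = mse(w, b, val_set)                      # (5) evaluate on verification set
        if val_loss < best_loss:
            best_loss, best_params = val_loss, (w, b)
        if val_loss < tol:                                 # stop condition (assumed)
            break
    if path is not None:                                   # (6) save the best parameters
        with open(path, "w") as fh:
            json.dump({"w": best_params[0], "b": best_params[1]}, fh)
    return best_params, best_loss
```

Keeping the parameters with the lowest verification loss, instead of the last ones, is one simple way to act on the verification results; learning rate schedules or regularization would slot into the same place in the loop.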
11. An automatic driving track planning prediction method based on the end-to-end automatic driving track planning system fusing BEVFormer and the neighborhood attention Transformer, characterized in that the method comprises the following steps:
Step 1, collecting model input data: collecting multi-view images, historical track data of the vehicle and vehicle state data, and preprocessing the collected data;
Step 2, taking the historical track data and the vehicle state data of the vehicle as the input of an RNN feature extractor, and extracting dynamic features and associated features of vehicle running from them by using the RNN feature extractor;
Step 3, taking the multi-view images as the input of a BEVFormer model, and extracting BEV features carrying temporal and spatial information by using the BEVFormer feature extraction module;
Step 4, fusing the dynamic features and associated features of vehicle running extracted by the RNN feature extractor into the BEV features through a feature fusion module;
Step 5, prediction with the neighborhood attention Transformer model: inputting the fused BEV features into the neighborhood attention Transformer for training; after the neighborhood attention Transformer output, a fully connected layer is applied to map the features to the output space, and the future track planning result of the vehicle is output.
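The data flow of Steps 4 and 5 can be illustrated with the following minimal sketch. It makes several assumptions that the claim does not state: fusion is done by appending the vehicle dynamics vector to every BEV cell (channel-wise concatenation), the attention stage is replaced here by plain mean pooling to keep the example short, and the fully connected layer emits `horizon` (x, y) waypoints. All names and shapes are hypothetical.

```python
def fuse(bev, dyn):
    """Append the dynamics vector to the channels of every BEV cell.

    bev is a nested list of shape H x W x C; dyn is a list of length D.
    The result has shape H x W x (C + D).
    """
    return [[cell + dyn for cell in row] for row in bev]

def mean_pool(fused):
    """Average all cells into one feature vector (stand-in for the attention stage)."""
    cells = [cell for row in fused for cell in row]
    return [sum(ch) / len(cells) for ch in zip(*cells)]

def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def plan(bev, dyn, weight, bias, horizon):
    """Map fused features to a list of future (x, y) waypoints."""
    flat = linear(mean_pool(fuse(bev, dyn)), weight, bias)
    return [(flat[2 * t], flat[2 * t + 1]) for t in range(horizon)]
```

For a planning horizon of T waypoints, `weight` needs 2T rows of length C + D; in a trained system these values would come from the saved model parameters.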
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311691441.6A CN117516581A (en) | 2023-12-11 | 2023-12-11 | End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117516581A true CN117516581A (en) | 2024-02-06 |
Family
ID=89762705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311691441.6A Pending CN117516581A (en) | 2023-12-11 | 2023-12-11 | End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117516581A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117935173A (en) * | 2024-03-21 | 2024-04-26 | 安徽蔚来智驾科技有限公司 | Target vehicle identification method, field end server and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180120843A1 (en) * | 2016-11-03 | 2018-05-03 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for Controlling Vehicle Using Neural Network |
CN113954864A (en) * | 2021-09-22 | 2022-01-21 | 江苏大学 | Intelligent automobile track prediction system and method fusing peripheral vehicle interaction information |
CN114372116A (en) * | 2021-12-30 | 2022-04-19 | 华南理工大学 | Vehicle track prediction method based on LSTM and space-time attention mechanism |
KR102388806B1 (en) * | 2021-04-30 | 2022-05-02 | (주)에이아이매틱스 | System for deciding driving situation of vehicle |
CN115937821A (en) * | 2022-12-12 | 2023-04-07 | 上海人工智能创新中心 | Full-stack automatic driving planning method and unified architecture system thereof |
CN116258242A (en) * | 2022-12-14 | 2023-06-13 | 北京理工大学 | Reactive track prediction method and system for automatic driving vehicle |
CN116853272A (en) * | 2023-07-12 | 2023-10-10 | 江苏大学 | Automatic driving vehicle behavior prediction method and system integrating complex network and graph converter |
Non-Patent Citations (6)
Title |
---|
ALI HASSANI et al.: "Neighborhood Attention Transformer", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 16 May 2023 (2023-05-16), pages 3 * |
HE YOUGUO: "Predicting pedestrian tracks around moving vehicles based on conditional variational transformer", PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS, 26 May 2023 (2023-05-26) * |
LI ZHIQI: "BEVFormer: Learning Bird's-Eye-View Representation from Multi-camera Images via Spatiotemporal", COMPUTER VISION-ECCV 2022, 31 December 2022 (2022-12-31) * |
LI ZHIQI et al.: "BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", 17TH EUROPEAN CONFERENCE ON COMPUTER VISION, 13 July 2022 (2022-07-13), pages 1 * |
CAI YINGFENG et al.: "Vehicle behavior prediction based on attention mechanism", JOURNAL OF JIANGSU UNIVERSITY, vol. 41, no. 2, 31 March 2020 (2020-03-31) * |
ZHAO HONG et al.: "Fundamentals of Intelligent Computing Technology and Applications", 31 August 2022, BEIJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS PRESS, pages: 191 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110164128B (en) | City-level intelligent traffic simulation system | |
CN114842028B (en) | Cross-video target tracking method, system, electronic equipment and storage medium | |
CN113705636B (en) | Method and device for predicting track of automatic driving vehicle and electronic equipment | |
Srikanth et al. | Infer: Intermediate representations for future prediction | |
CN114970321A (en) | Scene flow digital twinning method and system based on dynamic trajectory flow | |
Bai et al. | Deep learning based motion planning for autonomous vehicle using spatiotemporal LSTM network | |
CN110281949B (en) | Unified hierarchical decision-making method for automatic driving | |
CN113538520B (en) | Pedestrian trajectory prediction method and device, electronic equipment and storage medium | |
CN113592905B (en) | Vehicle driving track prediction method based on monocular camera | |
CN112329645B (en) | Image detection method, device, electronic equipment and storage medium | |
Spannaus et al. | AUTOMATUM DATA: Drone-based highway dataset for the development and validation of automated driving software for research and commercial applications | |
CN112381132A (en) | Target object tracking method and system based on fusion of multiple cameras | |
CN114820708A (en) | Peripheral multi-target trajectory prediction method based on monocular visual motion estimation, model training method and device | |
CN114372503A (en) | Cluster vehicle motion trail prediction method | |
CN117516581A (en) | End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer | |
Sadid et al. | Dynamic Spatio-temporal Graph Neural Network for Surrounding-aware Trajectory Prediction of Autonomous Vehicles | |
US20220284623A1 (en) | Framework For 3D Object Detection And Depth Prediction From 2D Images | |
CN111695627A (en) | Road condition detection method and device, electronic equipment and readable storage medium | |
CN115457081A (en) | Hierarchical fusion prediction method based on graph neural network | |
KR102563346B1 (en) | System for monitoring of structural and method ithereof | |
CN117390590B (en) | CIM model-based data management method and system | |
CN114620059A (en) | Automatic driving method and system thereof, and computer readable storage medium | |
CN118397046A (en) | Highway tunnel pollutant emission estimation method based on video vehicle track tracking | |
Lu et al. | Monocular semantic occupancy grid mapping with convolutional variational auto-encoders | |
CN118314180A (en) | Point cloud matching method and system based on derivative-free optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||