CN110530371B - Indoor map matching method based on deep reinforcement learning - Google Patents
- Publication number
- CN110530371B (application CN201910840334.2A)
- Authority
- CN
- China
- Prior art keywords
- coordinates
- corrected
- map
- network
- reinforcement learning
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/005—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 with correlation of navigation data from several sources, e.g. map or contour matching
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/10—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
- G01C21/12—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
- G01C21/16—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
- G01C21/165—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
- G01C21/206—Instruments for performing navigational calculations specially adapted for indoor navigation
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Automation & Control Theory (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Navigation (AREA)
Abstract
The invention discloses an indoor map matching method based on deep reinforcement learning, which comprises the following steps: S1, acquiring data from the pedestrian inertial navigation module and preprocessing the data to obtain pixel coordinates related to a map; S2, constructing a local map generation module according to the pixel coordinates obtained in step S1; S3, defining the corrected coordinate generated once the correction code for the current state is obtained; S4, jointly representing the pixel coordinate information to be corrected and the local map as the state of the current position; S5, designing a reward mechanism according to the consistency of the corrected single-point coordinate with the label coordinate and the similarity of the corrected track with the standard path; S6, constructing a double-network model of a target value network and a current value network, and taking the MSE between the target value network output and the current value network output as the loss function; and S7, outputting the positioning coordinates corrected by the reinforcement learning model.
Description
Technical Field
The invention belongs to the technical field of indoor positioning, and particularly relates to an indoor map matching method based on deep reinforcement learning.
Background
In the age of the rapid development of Internet of Things technology, most applications are associated to some degree with location services. For moving objects the need for positioning is even more evident, and positioning technology has therefore received wide attention. However, the precision and the cost of a positioning technique have always been in tension: if the cost is too high, most Internet of Things applications are priced out; if a low-cost scheme is adopted, the positioning accuracy is unsatisfactory. From the perspective of market demand, the higher the positioning accuracy the better, so all positioning technologies keep improving in accuracy while costs gradually fall with industrial scale, and a "high-accuracy, low-cost" positioning solution is undoubtedly the trend of the future market. At present GNSS positioning is widely available, but a major drawback of GNSS is that it cannot cover indoor environments; in fact, about 80% of people's daily activities take place indoors, so the importance of indoor positioning technology is self-evident.
Conventional indoor positioning methods can be divided into deployment-dependent and deployment-independent techniques. Deployment-dependent techniques include Wi-Fi positioning, Bluetooth positioning, UWB positioning, RFID positioning, and the like; deployment-independent techniques include geomagnetic positioning, inertial sensor positioning, and the like. Unlike Wi-Fi positioning and similar technologies, inertial sensor positioning requires no advance deployment and therefore suits more demanding scenarios such as counter-terrorism and rescue; at the same time, inertial sensors are inexpensive, which favours large-scale adoption. How to realize indoor positioning relying on inertial sensors thus becomes an urgent problem to be solved.
A literature search shows that the paper "Fusion indoor positioning based on particle filtering and map matching" (Zhouyi, Luhang, Lushuai, et al., Journal of University of Electronic Science and Technology of China, 2018, v.47(03): 97-102) provides a fused indoor positioning method based on particle filtering and map matching. The technique combines WiFi fingerprint positioning and pedestrian dead reckoning (PDR) through particle filtering and corrects the positioning result by matching it against an indoor map. WiFi positioning is realized with a two-stage scheme combining SVC and SVR; the PDR obtains the user's step count, step length and heading from the accelerometer and magnetometer, which are used to model user behaviour in the particle filter; finally, the information of the two parts is fused with the map information to obtain the final position. However, the technique relies on WiFi information, so the positioning area must be deployed in advance and a corresponding WiFi fingerprint map must be constructed, and the resulting tracks still cross through walls. The reason is that existing map matching methods such as particle filtering focus on optimizing the local track and do not optimize the track from a global perspective.
Disclosure of Invention
The present invention aims to provide an indoor map matching method based on deep reinforcement learning, so as to solve or alleviate the above-mentioned problems.
In order to achieve the purpose, the invention adopts the technical scheme that:
an indoor map matching method based on deep reinforcement learning, comprising:
s1, acquiring data of the pedestrian inertial navigation module and preprocessing the data to obtain pixel coordinates related to a map;
s2, constructing a local map generation module according to the pixel coordinates obtained in the step S1;
s3, defining the corrected coordinate generated once the correction code for the current state is obtained;
s4, jointly representing the pixel coordinate information to be corrected and the local map as the state of the current position;
s5, designing a reward mechanism according to the consistency of the corrected coordinates of the single point and the label coordinates and the similarity of the corrected track and the standard path;
s6, constructing a double-network model of a target value network and a current value network, and taking the MSE between the target value network output and the current value network output as the loss function;
and S7, outputting the positioning coordinates corrected by the reinforcement learning model.
Preferably, in step S1, the relative geodetic location coordinates during pedestrian travel are collected, and the geodetic location coordinates are subjected to coordinate conversion to generate pixel coordinates related to the map.
Preferably, the map is cut according to the pixel coordinates generated in step S1, and a local map related to the pixel coordinates is generated.
Preferably, the reward mechanism of step S5 is: comprehensively considering the consistency of the corrected single-point coordinate with the label coordinate and the similarity of the corrected track with the standard path, and returning a quantitative value.
Preferably, the reward is graded according to the design of the action space and the Euclidean distance from the truth data: if the number output by the model is correct, the reward is 1; otherwise the reward decays level by level, each level being 0.75 times the previous value.
Preferably, in step S6 the current value network quantizes the states, actions and reward values of the above steps into corresponding Q values through a value-iteration network based on the Bellman equation; the target value network and the current value network have the same network structure, except that the network parameters need to be copied over at fixed time steps.
The indoor map matching method based on deep reinforcement learning provided by the invention has the following beneficial effects:
according to the method, a deep reinforcement learning model is designed and built according to inertial navigation data and map data, data fusion of map information and inertial navigation track information is completed, and map matching is achieved.
In addition, the method abandons traditional image processing techniques and extracts local map features with a neural network, which greatly improves computation speed. Secondly, in view of the wall-crossing problem of map-matched tracks that conventional techniques cannot solve, a method that fuses map and track information through reinforcement learning is proposed for the first time to eliminate wall crossing and complete map matching. Finally, once the deep reinforcement learning model has been fully trained, the saved model can be used directly to complete the track correction task.
In conclusion, the method has the advantages of strong global optimization capability, low technical complexity, strong generalization and the like, and is particularly suitable for severe environments in which the positioning device cannot be deployed in advance.
Drawings
Fig. 1 is a schematic diagram of an overall network according to an embodiment.
Fig. 2 is a partial data presentation diagram of an embodiment.
Fig. 3 is a partial map generation result diagram of the embodiment.
Fig. 4 is a diagram illustrating an operation space control mode according to the embodiment.
Fig. 5 is a diagram of a state data structure of an embodiment.
FIG. 6 is a diagram illustrating a status feature correction process according to an embodiment.
Fig. 7 is an illustration of a reward mechanism according to an embodiment.
Fig. 8 is a flowchart of the dual network model structure according to the embodiment.
FIG. 9 is a diagram showing the comparative effect of the coarse positioning trace and the true trace according to the embodiment.
FIG. 10 is a graph showing the comparison between the rough positioning trajectory and the reinforcement learning correction trajectory according to the embodiment.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of the embodiments; to those skilled in the art, various changes within the spirit and scope of the invention as defined in the appended claims will be apparent, and everything produced using the inventive concept is protected.
According to one embodiment of the application, the indoor map matching method based on the deep reinforcement learning of the scheme comprises the following steps:
s1, acquiring data of the pedestrian inertial navigation module and preprocessing the data to obtain pixel coordinates related to a map;
s2, constructing a local map generation module according to the pixel coordinates obtained in the step S1;
s3, defining the corrected coordinate generated once the correction code for the current state is obtained;
s4, jointly representing the pixel coordinate information to be corrected and the local map as the state of the current position;
s5, designing a reward mechanism according to the consistency of the corrected coordinates of the single point and the label coordinates and the similarity of the corrected track and the standard path;
s6, constructing a double-network model of a target value network and a current value network, and taking the MSE between the target value network output and the current value network output as the loss function;
and S7, outputting the positioning coordinates corrected by the reinforcement learning model.
Referring to FIG. 1, in one embodiment of the present application the environment generates an initial state s_t, the current value network feeds back an output action, and the two interact with the environment continuously. Taking one moment as an example: the interaction generates the corresponding current state s_t, current action a_t and immediate reward r_t, and transitions to the state s_{t+1} at the next moment; these are recorded as the quadruple (s_t, a_t, r_t, s_{t+1}) and saved to the replay memory unit. After a certain number of time steps, (s_t, a_t) and s_{t+1} are drawn at random from the replay memory unit and fed to the current value network and the target value network respectively; combined with the corresponding r_t, they constitute the final loss function, against which the network parameters are optimized.
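As an illustration of the bookkeeping just described, the following Python sketch shows a minimal replay memory that stores the quadruples (s_t, a_t, r_t, s_{t+1}) and samples random minibatches; the class name, capacity and batch handling are illustrative assumptions, not taken from the patent.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores interaction quadruples (s_t, a_t, r_t, s_t1) and samples random minibatches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest quadruples are discarded automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # random draws break the temporal correlation between consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return list(states), list(actions), list(rewards), list(next_states)

    def __len__(self):
        return len(self.buffer)
```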
The coordinate to be corrected is recorded as ins_ori, the coordinate used as a label is recorded as ins_label, the side length of the coding map used for encoding is recorded as map_len_big, the side length of the state map used to represent the state is recorded as map_len_small, and the size of the pixel group controlling the error range is recorded as pixel_group_len.
According to an embodiment of the present application, the steps S1 to S7 are described in detail below.
Step S1, acquiring data and preprocessing the inertial navigation module of the pedestrian;
Researchers wear the inertial navigation equipment and walk along corridors and rooms, and the acquired data are transmitted to a terminal for storage. The data used in the study are the relative geodetic coordinates of the walker during travel, as shown in FIG. 2, where ins denotes the inertial navigation data and label denotes the truth data. Since the scene currently considered is map matching on a flat map, only two-dimensional data, i.e. the x and y directions, are considered. In addition, because the data have been scale-converted, they are expressed in pixel units.
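As a concrete illustration of the scale conversion mentioned above, the following Python sketch maps relative coordinates (in metres) onto map pixel coordinates; the metres-per-pixel factor and the origin offset are illustrative assumptions that depend on the particular floor plan.

```python
import numpy as np

def to_pixel_coords(relative_xy_m, metres_per_pixel=0.05, origin_px=(0.0, 0.0)):
    """Convert relative positions (metres, x/y only) into map pixel coordinates.

    metres_per_pixel and origin_px are illustrative; they come from the scale and
    alignment of the particular floor-plan image (the image y axis may point
    downwards, in which case the y component would additionally be flipped).
    """
    xy = np.asarray(relative_xy_m, dtype=float)            # shape (N, 2): x, y in metres
    return xy / metres_per_pixel + np.asarray(origin_px)   # pixel units, same shape
```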
Step S2, constructing a local map generation module;
and the local map generation module is determined according to the inertial navigation position at the previous moment and the inertial navigation position at the current moment. Specifically, the original map is subjected to directional cutting by taking the pixel coordinate of the current moment as a center and taking the position included angle between the current moment and the previous moment as a direction, so that a final local map is obtained. Obviously, such a local map is not only related to the current position, but also to the position at the previous moment. The final local map reflects the location information together with the constraint information of the map.
As shown in fig. 3, the left side is the original map used in the experiment and the right side is the local cut map containing history information, centred on the pixel coordinate [552.00, 420.00], oriented along the angle towards the pixel coordinate [420.00, 300.00], with a side length of 500 pixels.
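The directional cutting described above can be sketched in Python as follows, assuming the floor plan is available as a 2-D numpy array and using scipy for the rotation; the padding value, rotation sign convention and function name are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def local_map(floor_plan, cur_xy, prev_xy, side=500):
    """Cut a side x side local map centred on cur_xy, oriented along the step prev_xy -> cur_xy.

    floor_plan : 2-D numpy array (floor-plan image), indexed [row, col] = [y, x].
    cur_xy, prev_xy : pixel coordinates (x, y) at the current and previous moments.
    """
    cx, cy = int(round(cur_xy[0])), int(round(cur_xy[1]))
    # walking direction of the last step, in degrees (sign convention depends on the image axes)
    heading = np.degrees(np.arctan2(cur_xy[1] - prev_xy[1], cur_xy[0] - prev_xy[0]))

    half = side // 2
    half_big = int(np.ceil(half * np.sqrt(2))) + 1    # enlarged window so the rotated crop stays inside

    # pad so windows near the map border remain valid (0 = "outside the map")
    padded = np.pad(floor_plan, half_big, mode="constant", constant_values=0)
    py, px = cy + half_big, cx + half_big             # centre position inside the padded image
    window = padded[py - half_big: py + half_big, px - half_big: px + half_big]

    # rotate so the walking direction points along a fixed reference axis, then crop the centre
    rotated = ndimage.rotate(window, heading, reshape=False, order=0, mode="constant", cval=0)
    c = half_big
    return rotated[c - half: c + half, c - half: c + half]
```

Under these assumptions, local_map(plan, (552.0, 420.0), (420.0, 300.0)) would correspond to the 500-pixel directional cut shown on the right of fig. 3.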
Step S3, designing an action space control mode;
the regression problem of the track correction is converted into a classification problem through a self-defined coding mode, and the track can be corrected by solving a proper correction code in the current state.
Taking ins_ori as the centre and map_len_big as the side length, a local map about the position can be obtained; it is called the coding map because its main function is to encode the action.
As shown in fig. 4, ins_ori is always located at the centre of the coding map, i.e. cell No. 12, and the number of the position to be corrected is calculated from the position of ins_label. For example, when ins_label lies two cells to the right of ins_ori and one cell up, the corresponding action code is 19, i.e. action_number = 19.
The algorithm is as follows:
1) Calculate the difference between ins_label and ins_ori in the x and y directions, quantised in units of one pixel group:
diff_x = round((x_label − x_ori) / pixel_group_len), diff_y = round((y_label − y_ori) / pixel_group_len)
2) Calculate the number of the centre point of the coding map. With n = map_len_big / pixel_group_len cells per side, the centre number is (n² − 1) / 2, i.e. 12 for n = 5.
3) Calculate the final action code:
action_number = (n² − 1) / 2 + diff_x + n · diff_y
When an out-of-range condition occurs, e.g. |diff_x| is greater than (n − 1) / 2, diff_x is forced to the maximum boundary value; the same applies to diff_y.
By this, the process of action numbering ends.
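The encoding and its inverse can be sketched in Python as follows; the default values map_len_big = 250 and pixel_group_len = 50 (giving a 5 x 5 coding map) and the numbering convention (centre cell 12, +1 per cell to the right, +n per cell up) are assumptions chosen to reproduce the example above, where two cells right and one cell up yields code 19.

```python
def encode_action(ins_ori, ins_label, map_len_big=250, pixel_group_len=50):
    """Encode the label position as a cell number on an n x n coding map centred on ins_ori."""
    n = map_len_big // pixel_group_len          # cells per side (5 in the example)
    half = (n - 1) // 2
    dx = round((ins_label[0] - ins_ori[0]) / pixel_group_len)
    dy = round((ins_label[1] - ins_ori[1]) / pixel_group_len)
    dx = max(-half, min(half, dx))              # clamp out-of-range offsets to the boundary
    dy = max(-half, min(half, dy))
    centre = (n * n - 1) // 2                   # 12 when n = 5
    return centre + dx + n * dy

def decode_action(action_number, map_len_big=250, pixel_group_len=50):
    """Invert encode_action: recover the (dx, dy) cell offsets from the action number."""
    n = map_len_big // pixel_group_len
    half = (n - 1) // 2
    row, col = divmod(action_number, n)
    return col - half, row - half               # (dx, dy) in cells
```

Under these assumptions, encode_action((552, 420), (652, 470)) returns 19, and decode_action(19) returns the offsets (2, 1).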
Step S4, designing a state space conversion module;
and retrieving the track coordinates to be corrected through an index rule, and correcting the track coordinates iteratively.
A data structure is designed to represent the current coordinate information and, at the same time, to store the current action information. As shown in FIG. 5, the first two columns hold two sets of indices into the given track, from which the original coordinate to be corrected, denoted ins_ori, can be retrieved; the last column holds the action code action_number. In other words, given only a coordinate to be corrected, the corrected coordinate can be computed in reverse from a correct action code. This corrected coordinate is denoted ins_reverse.
The specific correction process is shown in fig. 6: starting from the initial state (0, 0), the coordinate to be corrected is located through the data index and its components (x_ori, y_ori) are extracted; the action code 19 corresponding to the current state is obtained from the environment and decoded according to the coding rule designed in fig. 4, giving the cell offsets (diff_x, diff_y).
The corrected coordinate components are then:
x_reverse = x_ori + diff_x · pixel_group_len, y_reverse = y_ori + diff_y · pixel_group_len
After the state transition, the subsequent coordinates in the sequence to be corrected are corrected in the same way.
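Continuing the sketch above, the iterative correction of a coordinate sequence can be written as follows; the function reuses decode_action from the previous sketch, and the loop structure is an illustrative assumption.

```python
def correct_trajectory(ins_ori_seq, action_codes, map_len_big=250, pixel_group_len=50):
    """Apply one action code per point to obtain the corrected coordinates ins_reverse."""
    corrected = []
    for (x_ori, y_ori), code in zip(ins_ori_seq, action_codes):
        dx, dy = decode_action(code, map_len_big, pixel_group_len)   # cell offsets from the code
        corrected.append((x_ori + dx * pixel_group_len,              # shift by one pixel group per cell
                          y_ori + dy * pixel_group_len))
    return corrected
```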
Step S5, designing a reward mechanism;
The reward is graded according to the design of the action space and the Euclidean distance from the truth data: if the number output by the model is correct, the reward is 1; otherwise the reward decays level by level, each level being 0.75 times the previous value.
As shown in fig. 7, if the action number corresponding to the current label is 19, then, taking 19 as the centre, the farther a number is from 19 the smaller its reward value (the colour gradually deepens). Specifically, if the model output is 19, the reward is 1; if the network outputs 13, 14, 18, 23 or 24, the reward value is 0.75 (a decay of one quarter), and so on. Of course, if greater discrimination is desired, the decay can be made stronger.
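A minimal sketch of this graded reward, assuming the level of an action is its Chebyshev ring distance from the label cell on the coding grid (which reproduces the fig. 7 example, where cells 13, 14, 18, 23 and 24 all receive 0.75 when the label is 19); a Euclidean-distance grading could be substituted without changing the structure.

```python
def reward(predicted_code, label_code, n=5, decay=0.75):
    """1.0 for the correct cell; multiplied by `decay` for every ring farther from it."""
    pr, pc = divmod(predicted_code, n)
    lr, lc = divmod(label_code, n)
    ring = max(abs(pr - lr), abs(pc - lc))     # Chebyshev ring distance on the n x n coding grid
    return decay ** ring

# reward(19, 19) -> 1.0; reward(13, 19) or reward(24, 19) -> 0.75; two rings away -> 0.5625
```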
Step S6, building a double-network model;
as shown in fig. 8, the present model has two CNN networks in the design process. The network structures of the two CNN networks are completely identical, and the parameters of the CNN network 2 are duplicated to the CNN network 1 at a fixed time step.
Specifically, for the action A and the state s_t at time t, feedback from the environment yields an immediate reward R and a next state s_{t+1}. The two states pass through the networks on the two sides respectively. Because the state s_t by itself contains only a coordinate and no map information, the states on both sides first go through map clipping, i.e. the map is cut out centred on the coordinate. The output parts of network 1 and network 2 differ; they pass through Process 1 and Process 2 respectively.
Wherein:
Process 1: the action number is converted into a one-hot code, denoted A_hot; A_hot is multiplied with the last layer of the network to obtain the Q_value corresponding to that number. In other words, Process 1 selects the Q value corresponding to action A.
Process 2: the purpose of this operation is to obtain Q_target. Process 2 first takes the maximum over the output of CNN network 2 and then adds the immediate reward R to obtain the final Q_target.
The final loss function expression is obtained:
Loss = MSE(Q_target, Q_value) = (Q_target − Q_value)², with Q_value = Q₁(s_t, A) from Process 1 and Q_target = R + γ · max_a Q₂(s_{t+1}, a) from Process 2, where γ is the discount factor of the Bellman equation.
For each coordinate to be corrected, the forward pass described above is executed and a loss is obtained. As in conventional supervised learning, the network parameters are continuously optimized against this loss by back propagation.
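A minimal PyTorch sketch of the double-network update described in step S6 follows; the toy CNN architecture, the discount factor gamma and the direction of the periodic parameter copy (current network to target network, as in standard DQN) are illustrative assumptions rather than the exact networks of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Toy CNN over the local-map crop; outputs one Q value per action code (25 for a 5x5 grid)."""
    def __init__(self, n_actions=25):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5, stride=2)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5, stride=2)
        self.head = nn.Linear(16, n_actions)

    def forward(self, x):                                 # x: (batch, 1, H, W) local-map crops
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)        # (batch, 16)
        return self.head(x)                               # (batch, n_actions)

current_net, target_net = QNet(), QNet()
target_net.load_state_dict(current_net.state_dict())     # identical structure, copied parameters
optimizer = torch.optim.Adam(current_net.parameters(), lr=1e-3)

def update(states, actions, rewards, next_states, gamma=0.9):
    """One gradient step: MSE between Q_value (current net, Process 1) and Q_target (target net, Process 2)."""
    q_value = current_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q of the taken action
    with torch.no_grad():
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values  # R + gamma * max_a Q'
    loss = F.mse_loss(q_value, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# every fixed number of updates, repeat the parameter copy:
# target_net.load_state_dict(current_net.state_dict())
```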
Step S7, outputting the positioning coordinate corrected by the reinforcement learning model;
FIG. 9 compares the coarse positioning track with the ground-truth track: the black dot track is the inertial navigation output, which suffers from wall crossing, deviation at turning moments and similar problems; the dashed track is the ground truth.
Fig. 10 compares the coarse positioning track with the reinforcement-learning corrected track: the black dot track is the inertial navigation output and the dashed track is the corrected track after reinforcement learning has completed map matching. It can be seen that the inertial navigation track without map matching suffers from wall crossing and deviation at turning moments, whereas after map matching through reinforcement learning the track essentially coincides with the ground truth and obvious problems such as wall crossing no longer occur.
The invention has the following beneficial effects:
according to the method, a deep reinforcement learning model is designed and built according to inertial navigation data and map data, data fusion of map information and inertial navigation track information is completed, and map matching is achieved.
In addition, the method abandons traditional image processing techniques and extracts local map features with a neural network, which greatly improves computation speed. Secondly, in view of the wall-crossing problem of map-matched tracks that conventional techniques cannot solve, a method that fuses map and track information through reinforcement learning is proposed for the first time to eliminate wall crossing and complete map matching. Finally, once the deep reinforcement learning model has been fully trained, the saved model can be used directly to complete the track correction task.
In conclusion, the method has the advantages of strong global optimization capability, low technical complexity, strong generalization and the like, and is particularly suitable for severe environments in which the positioning device cannot be deployed in advance.
While the embodiments of the invention have been described in detail in connection with the accompanying drawings, it is not intended to limit the scope of the invention. Various modifications and changes may be made by those skilled in the art without inventive step within the scope of the appended claims.
Claims (5)
1. An indoor map matching method based on deep reinforcement learning is characterized by comprising the following steps:
s1, acquiring data of the pedestrian inertial navigation module and preprocessing the data to obtain pixel coordinates related to a map;
s2, constructing a local map generation module according to the pixel coordinates obtained in the step S1;
s3, defining that once the correction code for the current state is obtained, the corresponding corrected coordinate can be generated, including: converting the regression problem of track correction into a classification problem through a self-defined coding mode, and correcting the track by solving a suitable correction code for the current state;
s4, jointly representing the pixel coordinate information to be corrected and the local map as the state of the current position;
s5, designing a reward mechanism according to the consistency of the corrected coordinates of the single point and the label coordinates and the similarity of the corrected track and the standard path, wherein the reward mechanism is as follows: comprehensively considering the consistency of the corrected coordinates of the single point and the label coordinates and the similarity of the corrected track and the standard path, and returning a quantitative numerical value;
s6, constructing a double-network model of a target value network and a current value network, and taking the MSE between the target value network output and the current value network output as the loss function;
and S7, outputting the positioning coordinates corrected by the reinforcement learning model.
2. The deep reinforcement learning-based indoor map matching method according to claim 1, wherein: in step S1, the relative geodetic location coordinates of the pedestrian during traveling are collected, and the geodetic location coordinates are subjected to coordinate conversion, so as to generate pixel coordinates related to the map.
3. The deep reinforcement learning-based indoor map matching method according to claim 1, wherein: the map is cut according to the pixel coordinates generated in step S1, and a local map associated with the pixel coordinates is generated.
4. The deep reinforcement learning-based indoor map matching method according to claim 1, wherein the reward is graded according to the design of the action space and the Euclidean distance from the truth data: if the number output by the model is correct, the reward is 1; otherwise the reward decays level by level, each level being 0.75 times the previous value.
5. The deep reinforcement learning-based indoor map matching method according to claim 4, wherein the current value network quantizes the state, the action and the reward value to corresponding Q values through a value iteration network based on a Bellman equation in step S6; the target value network and the current value network have the same network structure, except that the network parameters need to be copied at certain time steps.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910840334.2A CN110530371B (en) | 2019-09-06 | 2019-09-06 | Indoor map matching method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910840334.2A CN110530371B (en) | 2019-09-06 | 2019-09-06 | Indoor map matching method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110530371A CN110530371A (en) | 2019-12-03 |
CN110530371B true CN110530371B (en) | 2021-05-18 |
Family
ID=68667273
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910840334.2A Active CN110530371B (en) | 2019-09-06 | 2019-09-06 | Indoor map matching method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110530371B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061277B (en) * | 2019-12-31 | 2022-04-05 | 歌尔股份有限公司 | Unmanned vehicle global path planning method and device |
CN112146660B (en) * | 2020-09-25 | 2022-05-03 | 电子科技大学 | Indoor map positioning method based on dynamic word vector |
CN113008226B (en) * | 2021-02-09 | 2022-04-01 | 杭州电子科技大学 | Geomagnetic indoor positioning method based on gated cyclic neural network and particle filtering |
CN114001736A (en) * | 2021-11-09 | 2022-02-01 | Oppo广东移动通信有限公司 | Positioning method, positioning device, storage medium and electronic equipment |
CN114858158A (en) * | 2022-04-26 | 2022-08-05 | 河南省吉立达机器人有限公司 | Mobile robot repositioning method based on deep learning |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN106709449A (en) * | 2016-12-22 | 2017-05-24 | 深圳市深网视界科技有限公司 | Pedestrian re-recognition method and system based on deep learning and reinforcement learning |
CN108255182A (en) * | 2018-01-30 | 2018-07-06 | 上海交通大学 | A kind of service robot pedestrian based on deeply study perceives barrier-avoiding method |
CN108680174A (en) * | 2018-05-10 | 2018-10-19 | 长安大学 | A method of map match abnormal point is improved based on machine learning algorithm |
CN109059939A (en) * | 2018-06-27 | 2018-12-21 | 湖南智慧畅行交通科技有限公司 | Map-matching algorithm based on Hidden Markov Model |
CN109407676A (en) * | 2018-12-20 | 2019-03-01 | 哈尔滨工业大学 | The moving robot obstacle avoiding method learnt based on DoubleDQN network and deeply |
CN109871010A (en) * | 2018-12-25 | 2019-06-11 | 南方科技大学 | method and system based on reinforcement learning |
CN109855616A (en) * | 2019-01-16 | 2019-06-07 | 电子科技大学 | A kind of multiple sensor robot air navigation aid based on virtual environment and intensified learning |
Non-Patent Citations (1)
Title |
---|
A low-frequency trajectory data matching algorithm combining historical data and reinforcement learning; Sun Wenbin et al.; Acta Geodaetica et Cartographica Sinica; 2016-11-15 (No. 11); pp. 1328-1344 *
Also Published As
Publication number | Publication date |
---|---|
CN110530371A (en) | 2019-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110530371B (en) | Indoor map matching method based on deep reinforcement learning | |
Qingyun et al. | Cross-modality fusion transformer for multispectral object detection | |
Zhou et al. | To learn or not to learn: Visual localization from essential matrices | |
Mahjourian et al. | Geometry-based next frame prediction from monocular video | |
CN108491763B (en) | Unsupervised training method and device for three-dimensional scene recognition network and storage medium | |
US12008762B2 (en) | Systems and methods for generating a road surface semantic segmentation map from a sequence of point clouds | |
CN109272493A (en) | A kind of monocular vision odometer method based on recursive convolution neural network | |
CN112989220A (en) | Motion trajectory processing method, medium, device and equipment | |
CN115071762A (en) | Pedestrian trajectory prediction method, model and storage medium oriented to urban scene | |
Gilitschenski et al. | Deep context maps: Agent trajectory prediction using location-specific latent maps | |
CN112288776A (en) | Target tracking method based on multi-time step pyramid codec | |
Yao et al. | Goal-lbp: Goal-based local behavior guided trajectory prediction for autonomous driving | |
Radwan | Leveraging sparse and dense features for reliable state estimation in urban environments | |
Hou et al. | Fe-fusion-vpr: Attention-based multi-scale network architecture for visual place recognition by fusing frames and events | |
Sharjeel et al. | Real time drone detection by moving camera using COROLA and CNN algorithm | |
CN116486489A (en) | Three-dimensional hand object posture estimation method and system based on semantic perception graph convolution | |
CN111738092A (en) | Method for recovering shielded human body posture sequence based on deep learning | |
CN116309705A (en) | Satellite video single-target tracking method and system based on feature interaction | |
Kang et al. | ETLi: Efficiently annotated traffic LiDAR dataset using incremental and suggestive annotation | |
CN117058474B (en) | Depth estimation method and system based on multi-sensor fusion | |
Wang et al. | EFRNet-VL: An end-to-end feature refinement network for monocular visual localization in dynamic environments | |
Yao et al. | MLP-based Efficient Convolutional Neural Network for Lane Detection | |
Jeong et al. | Fast and Lite Point Cloud Semantic Segmentation for Autonomous Driving Utilizing LiDAR Synthetic Training Data | |
CN117570960A (en) | Indoor positioning navigation system and method for blind guiding robot | |
Cui et al. | Ellipse loss for scene-compliant motion prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |