Autonomous Robot Goal Seeking and Collision Avoidance in the Physical World: An Automated Learning and Evaluation Framework Based on the PPO Method
Figure 1. (a) Agent learning its surroundings using LiDAR data while navigating to the goal. The agent has four discrete actions available in this environment: Action 1 is not moving, Action 2 is turning clockwise, Action 3 is turning counter-clockwise, and Action 4 is moving forward. (b) Coordinate frame of the agent, where 0 rad is the forward direction of the agent.
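For concreteness, the sketch below shows one way the four discrete actions might be mapped to differential-drive velocity commands for a TurtleBot-style robot. The specific linear and angular speeds are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical mapping from the four discrete actions to (linear, angular)
# velocity commands for a differential-drive robot. The speed values below
# are assumptions for illustration only.
ACTION_TO_CMD = {
    1: (0.0, 0.0),    # Action 1: not moving
    2: (0.0, -1.5),   # Action 2: turn clockwise (negative angular velocity)
    3: (0.0, 1.5),    # Action 3: turn counter-clockwise
    4: (0.15, 0.0),   # Action 4: move forward
}

def action_to_velocity(action: int) -> tuple[float, float]:
    """Return the (linear m/s, angular rad/s) command for a discrete action."""
    return ACTION_TO_CMD[action]
```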
Figure 2. (a) Structure of the actor network. The network takes six inputs. The LiDAR input consists of 360 infrared readings of the distances between the agent and the obstacles around it. The past-action input consists of the left and right wheel speeds from the previous time step. The remaining four inputs are single values. The inputs pass through a ReLU activation into a hidden layer of 256 nodes, then through another ReLU activation into a second hidden layer of 256 nodes. The output passes through a ReLU and then a SoftMax activation, yielding one of the four action values. (b) Structure of the critic network. The network takes the same inputs as the actor network; its output is the value of the agent's current state.
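As a reading aid, here is a minimal PyTorch sketch of an actor network matching this description (360 LiDAR readings + 2 past wheel speeds + 4 scalar inputs = 366 input features, two hidden layers of 256 ReLU units, and a SoftMax over the 4 actions). This is an illustrative reconstruction, not the authors' code; the class and variable names are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Illustrative actor network: 360 LiDAR readings, 2 past wheel speeds,
    and 4 scalar inputs (e.g., goal/obstacle distance and heading) mapped to
    4 action probabilities. Layer sizes follow the caption; names are assumed."""

    def __init__(self, lidar_dim: int = 360, n_actions: int = 4):
        super().__init__()
        input_dim = lidar_dim + 2 + 4  # LiDAR + past wheel speeds + 4 scalars
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
            nn.ReLU(),            # ReLU before SoftMax, as described in the caption
            nn.Softmax(dim=-1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)   # probability distribution over the 4 actions
```

The critic network would share the same input layout but end in a single linear output producing the state value, as described in panel (b).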
Figure 3. PPO agent update workflow. The PPO agent is updated every 500 time steps. Each trajectory memory tuple contains the current state $s_t$, the action taken $a_t$, the reward received $R_{t+1}$, the logarithmic probability of the action taken $\log_{prob}(a_t)$, and the binary episode termination flag $Flag$. The goal of the actor network is to maximize the value function output by the critic network. Operations within the blue box are specific to the actor network updates, whereas operations within the red box are specific to the critic network updates. Solid arrows represent direct manipulation of the variables, whereas dotted arrows represent the updates of the networks.
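The following is a condensed sketch of the update cycle described above: transitions are buffered for 500 time steps, discounted returns are computed from the rewards and termination flags, the actor is updated with the clipped surrogate loss, and the critic regresses toward the returns. It is a schematic reconstruction under common PPO conventions; the hyperparameter defaults (`gamma`, `eps_clip`) and the single shared optimizer are assumptions, not the authors' implementation.

```python
import torch

def ppo_update(actor, critic, memory, optimizer, gamma=0.99, eps_clip=0.2, epochs=50):
    """memory: list of (state, action, reward, old_logprob, done) tuples collected
    over the last 500 time steps; optimizer is assumed to cover both networks."""
    states, actions, rewards, old_logprobs, dones = zip(*memory)

    # Discounted returns, resetting at episode termination flags.
    returns, G = [], 0.0
    for r, done in zip(reversed(rewards), reversed(dones)):
        G = r + gamma * G * (1.0 - done)
        returns.insert(0, G)

    states = torch.stack(states)
    actions = torch.tensor(actions)
    old_logprobs = torch.stack(old_logprobs).detach()
    returns = torch.tensor(returns, dtype=torch.float32)

    for _ in range(epochs):
        values = critic(states).squeeze(-1)
        advantages = returns - values.detach()

        dist = torch.distributions.Categorical(actor(states))
        logprobs = dist.log_prob(actions)
        ratios = torch.exp(logprobs - old_logprobs)

        # Clipped surrogate objective (actor) plus value regression (critic).
        actor_loss = -torch.min(
            ratios * advantages,
            torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages,
        ).mean()
        critic_loss = torch.nn.functional.mse_loss(values, returns)

        optimizer.zero_grad()
        (actor_loss + 0.5 * critic_loss).backward()
        optimizer.step()
```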
Figure 4. Overall workflow. The user specifies the required parameters at the beginning of the current training round. In each episode, the agent receives the new state $s_t$ upon taking action $a_t$. When the episode termination condition is met (i.e., the binary episode termination flag $Flag$ is set to 1), the current goal rate is calculated and appended to an array, which is saved to a CSV file for further analysis.
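A minimal sketch of the outer loop implied by this workflow is shown below: run episodes, track how often the goal is reached, and write the running goal rate to a CSV file. The environment/agent interface (`select_action`, `store`, `info["goal_reached"]`) and the file name are assumptions for illustration.

```python
import csv

def run_training(env, agent, n_episodes=5000, csv_path="goal_rate.csv"):
    """Illustrative outer loop; assumes a gym-style env whose step() returns
    (state, reward, done, info) and reports info["goal_reached"] at termination."""
    goal_rates, goals = [], 0
    for ep in range(1, n_episodes + 1):
        state, done, info = env.reset(), False, {}
        while not done:
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(action)
            agent.store(state, action, reward, done)  # feeds the PPO trajectory memory
            state = next_state
        goals += int(info.get("goal_reached", False))
        goal_rates.append(goals / ep)  # running goal rate after this episode

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["episode", "goal_rate"])
        writer.writerows(enumerate(goal_rates, start=1))
```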
Figure 5. (a) Current and past goal distance and (b) goal heading representation. The green arrow represents the path of the agent.
Figure 6. (a) Current and past obstacle distance and (b) obstacle heading representation.
Figure 7. (a) Reward plots related to goal heading and distance. The agent receives the maximum reward when it faces the goal directly and is close to it. (b) Reward plots related to obstacle heading and distance. The agent receives the largest penalty when it faces the obstacle directly and is close to it.
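The exact functional forms of the reward are not reproduced here; the sketch below illustrates one common way to shape such a reward, rewarding progress toward the goal and alignment with it while penalizing proximity and alignment to the nearest obstacle. All coefficients and thresholds are assumptions, not the paper's values.

```python
import math

def shaped_reward(goal_dist, prev_goal_dist, goal_heading,
                  obstacle_dist, obstacle_heading,
                  goal_reached=False, collided=False):
    """Illustrative reward shaping. Headings are in radians relative to the agent's
    forward direction (0 rad); distances in metres. All weights are assumed."""
    if goal_reached:
        return 100.0
    if collided:
        return -100.0

    # Reward progress toward the goal and facing it directly (heading near 0).
    progress = 10.0 * (prev_goal_dist - goal_dist)
    heading_bonus = 2.0 * math.cos(goal_heading)

    # Penalize being close to an obstacle, most strongly when facing it.
    obstacle_penalty = 0.0
    if obstacle_dist < 0.5:  # assumed safety radius
        obstacle_penalty = -5.0 * max(0.0, math.cos(obstacle_heading)) * (0.5 - obstacle_dist)

    return progress + heading_bonus + obstacle_penalty
```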
Figure 8. The custom simulated environment. Dimensions = 12.50 m × 8.00 m; environment size ≈ 82.75 m².
Figure 9. (a) Physical Maze 1, (b) Physical Maze 2, and (c) Physical Maze 3. Each maze measures 12.50 m × 8.00 m, with an environment size of approximately 82.75 m².
Figure 10. Goal rate for all training episodes in the simulated environment with obstacles, shown in Figure 8. One advantage of simulated training is the abundance of training episodes without a time-consuming setup. After about 5000 episodes, the agent's goal rate converged to about 79%.
Figure 11. Total number of goals reached across all training episodes in the simulated environment with obstacles, shown in Figure 8. As the agent improves at reaching the target, the curve becomes more linear, which is visible here toward the end of the 5000 episodes.
Figure 12. Average rewards for all training episodes in the simulated environment with obstacles. As the agent improves at the task, the average rewards become more positive and converge to a value, as shown here.
Figure 13. Confidence interval plot of 100 independent PPO training runs across 5000 episodes. The shaded region around the mean represents the 95% confidence interval, quantifying run variability and the robustness of the training process.
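For reference, a short sketch of how such a confidence band could be computed from the per-run goal-rate curves, assuming the 100 runs are stored as rows of a NumPy array; the normal-approximation interval (mean ± 1.96 · standard error) is one standard choice and may differ from the authors' exact procedure.

```python
import numpy as np

def confidence_band(goal_rates: np.ndarray, z: float = 1.96):
    """goal_rates: array of shape (n_runs, n_episodes), e.g., (100, 5000).
    Returns (mean, lower, upper) curves for an approximate 95% confidence interval."""
    mean = goal_rates.mean(axis=0)
    sem = goal_rates.std(axis=0, ddof=1) / np.sqrt(goal_rates.shape[0])
    return mean, mean - z * sem, mean + z * sem
```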
Figure 14. Average cumulative goals over all 5 runs of 100 trials in the physical environments shown in Figure 9. The agent reaches goals most often in Maze 1, which is an open maze, and least often in Maze 2, which contains more walls.
Abstract
1. Introduction
- We redesigned and implemented a personalized automated learning and evaluation framework based on an existing repository [34], allowing users to specify custom training and testing parameters for both simulated and physical robots. This makes the platform adaptable and accessible to a wider range of users with limited resources.
- To establish a robust benchmark for robotics research, we designed a simulated environment in Gazebo to train the agent. The training environment closely mimics typical indoor scenarios encountered by robots, with common obstacles such as walls and barriers, replicating real-world challenges.
- Our implementation provides statistical metrics, such as goal rate, for detailed comparison and analysis, and can be extended to output additional data such as trajectory paths and robot LiDAR readings. The insights gleaned from this study are invaluable to practitioners and researchers for their specific robotic navigation tasks.
- We conducted extensive physical experiments to evaluate real-world navigation performance under varied environment configurations. Our results show that the agent trained in simulation achieves average goal rates between 0.724 and 0.952 across our three physical mazes. Simulated training is not restricted by physical constraints such as the robot's battery power and is more efficient for training data collection. This finding demonstrates the value of simulated training with RL for real-world mobile navigation.
2. PPO Algorithm Background
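For context, the PPO method used throughout this work optimizes the clipped surrogate objective introduced by Schulman et al. [3]:

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
$$

where $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clip parameter (see Appendix B).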
3. Integration of the PPO Algorithm for Robot Navigation
3.1. Agent Model
3.2. PPO Workflow
3.3. Overall Training Workflow
Algorithm 1. Training Pseudocode
User Input:
3.4. Reward Function
4. Experimental Set-Up
4.1. Source Simulated Maze
4.2. Testing Physical Environments
5. Results Analysis
5.1. Simulation Training
5.2. Physical Testing
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Sensor Specifications
- Hardware: 360 Laser Distance Sensor LDS-01, Hitachi-LG Data Storage, Inc., Tokyo, Japan
- Quantity: 1
- Dimensions: 69.5 (W) × 95.5 (D) × 39.5 (H) mm
- Distance Range: 120–3500 mm
- Sensor Position: 192 mm from the ground
Appendix B. Simulation Training Parameters
- Episode limit: 5000
- Update time step (memory limit): 500
- Policy update epochs: 50
- PPO clip parameter:
- Discount factor (γ):
Appendix C. Physical Training and Testing Parameters
- Number of runs: 5
- Number of trials of each run: 100
- Update time step (memory limit): 500
- Policy update epochs: 50
- PPO clip parameter:
- Discount factor (γ):
References
- Gonzalez-Aguirre, J.A.; Osorio-Oliveros, R.; Rodriguez-Hernandez, K.L.; Lizárraga-Iturralde, J.; Morales Menendez, R.; Ramirez-Mendoza, R.A.; Ramirez-Moreno, M.A.; Lozoya-Santos, J.d.J. Service robots: Trends and technology. Appl. Sci. 2021, 11, 10702. [Google Scholar] [CrossRef]
- O’Brien, M.; Williams, J.; Chen, S.; Pitt, A.; Arkin, R.; Kottege, N. Dynamic task allocation approaches for coordinated exploration of Subterranean environments. Auton. Robot. 2023, 47, 1559–1577. [Google Scholar] [CrossRef]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
- Heess, N.; Tb, D.; Sriram, S.; Lemmon, J.; Merel, J.; Wayne, G.; Tassa, Y.; Erez, T.; Wang, Z.; Eslami, S.; et al. Emergence of locomotion behaviours in rich environments. arXiv 2017, arXiv:1707.02286. [Google Scholar]
- Wang, Y.; Wang, L.; Zhao, Y. Research on door opening operation of mobile robotic arm based on reinforcement learning. Appl. Sci. 2022, 12, 5204. [Google Scholar] [CrossRef]
- Plasencia-Salgueiro, A.d.J. Deep Reinforcement Learning for Autonomous Mobile Robot Navigation. In Artificial Intelligence for Robotics and Autonomous Systems Applications; Springer: Berlin/Heidelberg, Germany, 2023; pp. 195–237. [Google Scholar]
- Holubar, M.S.; Wiering, M.A. Continuous-action reinforcement learning for playing racing games: Comparing SPG to PPO. arXiv 2020, arXiv:2001.05270. [Google Scholar]
- Del Rio, A.; Jimenez, D.; Serrano, J. Comparative Analysis of A3C and PPO Algorithms in Reinforcement Learning: A Survey on General Environments. IEEE Access 2024, 12, 146795–146806. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
- Kim, K. Multi-agent deep Q network to enhance the reinforcement learning for delayed reward system. Appl. Sci. 2022, 12, 3520. [Google Scholar] [CrossRef]
- Pérez-Gil, Ó.; Barea, R.; López-Guillén, E.; Bergasa, L.M.; Gomez-Huelamo, C.; Gutiérrez, R.; Diaz-Diaz, A. Deep reinforcement learning based control for Autonomous Vehicles in CARLA. Multimed. Tools Appl. 2022, 81, 3553–3576. [Google Scholar] [CrossRef]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
- Barth-Maron, G.; Hoffman, M.W.; Budden, D.; Dabney, W.; Horgan, D.; Tb, D.; Muldal, A.; Heess, N.; Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv 2018, arXiv:1804.08617. [Google Scholar]
- Egbomwan, O.E.; Liu, S.; Chaoui, H. Twin Delayed Deep Deterministic Policy Gradient (TD3) Based Virtual Inertia Control for Inverter-Interfacing DGs in Microgrids. IEEE Syst. J. 2022, 17, 2122–2132. [Google Scholar] [CrossRef]
- Kargin, T.C.; Kołota, J. A Reinforcement Learning Approach for Continuum Robot Control. J. Intell. Robot. Syst. 2023, 109, 1–14. [Google Scholar] [CrossRef]
- Cheng, W.C.A.; Ni, Z.; Zhong, X. A new deep Q-learning method with dynamic epsilon adjustment and path planner assisted techniques for Turtlebot mobile robot. In Proceedings of the Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, Orlando, FL, USA, 13 June 2023; Volume 12529, pp. 227–237. [Google Scholar]
- Chen, Y.; Liang, L. SLP-Improved DDPG Path-Planning Algorithm for Mobile Robot in Large-Scale Dynamic Environment. Sensors 2023, 23, 3521. [Google Scholar] [CrossRef]
- He, N.; Yang, S.; Li, F.; Trajanovski, S.; Kuipers, F.A.; Fu, X. A-DDPG: Attention mechanism-based deep reinforcement learning for NFV. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), Tokyo, Japan, 25–28 June 2021; pp. 1–10. [Google Scholar]
- Gu, Y.; Zhu, Z.; Lv, J.; Shi, L.; Hou, Z.; Xu, S. DM-DQN: Dueling Munchausen deep Q network for robot path planning. Complex Intell. Syst. 2023, 9, 4287–4300. [Google Scholar] [CrossRef]
- Jia, L.; Li, J.; Ni, H.; Zhang, D. Autonomous mobile robot global path planning: A prior information-based particle swarm optimization approach. Control Theory Technol. 2023, 21, 173–189. [Google Scholar] [CrossRef]
- Hamami, M.G.M.; Ismail, Z.H. A Systematic Review on Particle Swarm Optimization Towards Target Search in The Swarm Robotics Domain. Arch. Comput. Methods Eng. 2022, 1–20. [Google Scholar] [CrossRef]
- Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
- Wang, H.; Ding, Y.; Xu, H. Particle swarm optimization service composition algorithm based on prior knowledge. J. Intell. Manuf. 2022, 1–19. [Google Scholar] [CrossRef]
- Escobar-Naranjo, J.; Caiza, G.; Ayala, P.; Jordan, E.; Garcia, C.A.; Garcia, M.V. Autonomous navigation of robots: Optimization with DQN. Appl. Sci. 2023, 13, 7202. [Google Scholar] [CrossRef]
- Sumiea, E.H.; Abdulkadir, S.J.; Alhussian, H.S.; Al-Selwi, S.M.; Alqushaibi, A.; Ragab, M.G.; Fati, S.M. Deep deterministic policy gradient algorithm: A systematic review. Heliyon 2024, 10, e30697. [Google Scholar] [CrossRef] [PubMed]
- Kahn, G.; Villaflor, A.; Ding, B.; Abbeel, P.; Levine, S. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 5129–5136. [Google Scholar]
- Liang, Z.; Cao, J.; Jiang, S.; Saxena, D.; Chen, J.; Xu, H. From multi-agent to multi-robot: A scalable training and evaluation platform for multi-robot reinforcement learning. arXiv 2022, arXiv:2206.09590. [Google Scholar]
- Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 2013, 47, 253–279. [Google Scholar] [CrossRef]
- Ju, H.; Juan, R.; Gomez, R.; Nakamura, K.; Li, G. Transferring policy of deep reinforcement learning from simulation to reality for robotics. Nat. Mach. Intell. 2022, 4, 1077–1087. [Google Scholar] [CrossRef]
- Gromniak, M.; Stenzel, J. Deep reinforcement learning for mobile robot navigation. In Proceedings of the 2019 4th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), Nagoya, Japan, 13–15 July 2019; pp. 68–73. [Google Scholar]
- Andy, W.C.C.; Marty, W.Y.C.; Ni, Z.; Zhong, X. An automated statistical evaluation framework of rapidly-exploring random tree frontier detector for indoor space exploration. In Proceedings of the 2022 4th International Conference on Control and Robotics (ICCR), Guangzhou, China, 2–4 December 2022; pp. 1–7. [Google Scholar]
- Frost, M.; Bulog, E.; Williams, H. Autonav RL Gym. 2019. Available online: https://github.com/SfTI-Robotics/Autonav-RL-Gym (accessed on 24 April 2022).
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
- ROBOTIS-GIT. turtlebot3_machine_learning. 2018. Available online: https://github.com/ROBOTIS-GIT/turtlebot3_machine_learning (accessed on 24 April 2022).
- Gazebo. Open Source Robotics Foundation. 2014. Available online: http://gazebosim.org/ (accessed on 24 April 2022).
- ROBOTIS-GIT. LDS Specifications. Available online: https://emanual.robotis.com/docs/en/platform/turtlebot3/features/#components (accessed on 24 April 2022).
| | Maze 1 | Maze 2 | Maze 3 |
|---|---|---|---|
| Average Goal Rate | 0.952 | 0.724 | 0.881 |