JP2021034050A

JP2021034050A - AUV action plan and operation control method based on reinforcement learning

Info

Publication number: JP2021034050A
Application number: JP2020139299A
Authority: JP
Inventors: 玉山孫; Yushan Sun; 祥瑞冉; Xiangrui Ran; 国成張; Guocheng Zhang; 岳明李; Yueming Li; 建曹; Jian Cao; 力鋒王; Lifeng Wang; 相斌王; Xiangbin Wang; 昊徐; Ko Jo; 新雨呉; Xinyu Wu; 陳飛馬; Chenfei Ma
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2019-08-21
Filing date: 2020-08-20
Publication date: 2021-03-01
Anticipated expiration: 2040-08-20
Also published as: CN110333739B; JP6854549B2; CN110333739A

Abstract

To provide an AUV action planning and motion control method that defines a three-layer structure for the tasks of an autonomous underwater vehicle (AUV), namely a task layer, an action layer, and a motion layer, and that executes AUV action planning and controls the motion of the AUV when an emergency occurs. SOLUTION: The control commands generated so that the underwater robot completes a planned action are defined as motions. When executing a tunnel detection task, the AUV plans its actions in real time using the deep reinforcement learning DQN algorithm and builds the corresponding deep-learning action networks, thereby planning the tunnel detection task. The motion network of the AUV is trained by the DDPG method: with the AUV as the environment model, a mapping from force to state is obtained, thereby achieving motion control of the AUV. SELECTED DRAWING: Figure 1

Description

The present invention belongs to the technical field of underwater robots and specifically relates to an AUV action planning and motion control method.

The 21st century is the century of ocean utilization, and countries around the world recognize the importance of the marine industry. China has also announced and implemented an important maritime strategy. Because China is currently developing rapidly and is a populous country with limited land resources, marine resources have become an important resource reserve supporting sustainable development. The development and exploration of marine resources is an important premise and foundation for implementing a maritime strategy. As a major piece of underwater technology and equipment, the autonomous underwater vehicle (AUV) has become a practical and effective development tool in civilian, military, and scientific research applications in the ocean, and is an important means of marine development and exploration. The application of AUVs and the research, development, and upgrading of the related technology can be expected to attract attention in the future, and will be an important means for countries to gain a leading position in the marine field. AUV research requires the application of a variety of high-end technologies, including navigation and positioning, underwater target detection, communication, intelligent control, simulation, energy systems, and planning.

Planning and control technology is one of the key elements for making AUVs intelligent, and is the premise and foundation for an AUV to make autonomous decisions and complete its work tasks. The underwater environment is complex, dynamic, unstructured, and uncertain, so information about it is difficult to obtain; an AUV therefore inevitably faces unexpected emergencies when working underwater. Because underwater communication is limited, the AUV must rely on its own decisions to deal with emergencies, which requires it to modify its original planning instructions and replan in response to the emergency. The present invention focuses on AUV planning technology in complex environments. Taking a pressurized water supply tunnel as the complex underwater environment and the tunnel detection task as a representative application, it proposes an AUV action planning and control method based on reinforcement learning.

Detection of pressurized water supply tunnels is one of the important items of hydraulic engineering management. A pressurized water supply tunnel is an underground water channel that connects hydraulic projects such as hydroelectric power plants. At present, most hazards at small and medium-sized reservoirs result from failure to discover risks in the water tunnel in time. Over long-term operation, a tunnel develops defects and problems such as sedimentation, corrosion, leakage, and obstructions. During the flood season in particular, water tunnels are prone to problems caused by pipeline aging, which directly affect the safe operation of the underwater works. Regular inspection of the tunnel to understand the condition of the underwater works has therefore become increasingly important. However, because some tunnels have small diameters, the flow velocity is high during the flood season, and the underwater engineering environment is complex, workers cannot enter the tunnel to inspect it, and other detection equipment must be used in place of inspectors to complete the inspection task.

An autonomous underwater vehicle (AUV) is well suited as a platform for carrying underwater detection equipment. It can carry out target tasks underwater autonomously and safely over long periods, offers strong flexibility and state-keeping ability in the complex water environment of a tunnel, can carry underwater detection devices and sensors to meet detection needs, and can complete tunnel detection tasks autonomously. It is expected to serve as the main means of tunnel detection. The present invention designs an action planning and control method for the AUV tunnel detection task based on reinforcement learning algorithms, improving the environmental adaptability of the AUV and its decision-making ability in emergencies.

The present invention addresses two problems. First, when performing complex tasks, underwater robots are insufficiently intelligent and rely too heavily on human experience. Second, control methods designed for conventional underwater robots on the basis of intelligent algorithms require an accurate environment model, which limits training experience and makes them difficult to apply in real environments.

An AUV action planning and motion control method based on reinforcement learning comprises the following steps.
Tunnel detection by the underwater robot is defined as the total task, that is, the task. The actions for completing the task include moving to a target, wall tracking, and obstacle avoidance. The specific control commands generated so that the robot completes a planned action underwater are defined as motions.
When the AUV performs the tunnel detection task, it plans actions in real time using the deep reinforcement learning DQN algorithm, based on the underwater environment being detected. That is, an action planning architecture based on calls to multiple action networks is built; the environment-state feature inputs and the output motions of the three actions are defined according to the needs of the task; the corresponding deep-learning action networks are built; and the reward functions are designed.
The planning system completes the tunnel detection task by calling the trained action networks.
The control system completes the planned actions by calling the trained motion network.
In the AUV action planning and motion control method based on reinforcement learning, the process of building the corresponding deep-learning action networks and designing the reward functions includes the following steps.
The tunnel detection task is decomposed into an action sequence. Global path planning plans a number of feasible path points based on prior environment information, and the AUV departs from its deployment position and arrives at each path point in turn.
Because the path points are a global plan made for a known environment, during transit the AUV calls the obstacle avoidance action, based on the real-time environment state, to reach each path point safely. During the tunnel detection task itself, the AUV mainly calls the wall tracking action and completes the task according to the predetermined detection target.
The decision module comprises global data, a decision system, an action library, and an evaluation system. The global data stores task information, situation information, and planning knowledge. The decision system is a self-learning planning system combined with the DQN algorithm; it is trained in advance, extracts the trained network parameters from the action library before executing a planning task, and then plans the current action and motion with the current environment state information as input. The evaluation system is the reward function system of the reinforcement learning algorithm; each time the AUV plans and executes one action and motion, it provides a reward based on the environment state and the task information. All data is stored in the global database.
The process of moving to the target includes the following steps.
The move-to-target action has the AUV sail toward the target point while adjusting its heading angle when it detects no obstacle. The feature inputs mainly consider the position and angle relationship between the AUV and the target point. Specifically, six input dimensions are used: the current AUV position coordinates $(x_{AUV}, y_{AUV})$, the target point coordinates $(x_{goal}, y_{goal})$, the current heading angle $\theta$, and the target heading angle $\beta$, where $\beta$ is the heading angle of the AUV when it is sailing toward the target.
For the reward function, since the move-to-target action has the AUV sail to the target point in an obstacle-free environment, the reward function is composed of two terms.
The first term, $r_{11}$, considers the change in the distance between the AUV and the target point.

The second term, $r_{12}$, considers the change in the heading angle of the AUV: the closer the heading angle is to the target heading, the larger the reward. The angle $\alpha$ between the current AUV heading and the target heading is
$$\alpha = \theta - \beta$$
The smaller the absolute value of $\alpha$, the larger the reward obtained. Specifically,
$$r_{12} = k_A \cos(\alpha)$$
where $k_A$ is the reward coefficient for the move-to-target process.
The total reward is the weighted sum of the two terms:
$$r_1 = k_{11} r_{11} + k_{12} r_{12}$$
where $k_{11}$ and $k_{12}$ are the respective weights.
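The move-to-target reward described above can be sketched in Python as follows. The formula for $r_{11}$ is not reproduced in this text, so the distance term here (previous distance minus current distance) is only an assumed stand-in that is positive when the AUV gets closer; the function name and the default coefficient values are likewise hypothetical.

```python
import math


def move_to_target_reward(prev_dist, dist, theta, beta,
                          k_a=1.0, k11=0.5, k12=0.5):
    """Weighted move-to-target reward r1 = k11*r11 + k12*r12.

    r11 is an ASSUMED form (the source formula is missing): the decrease
    in distance to the target since the previous step.
    r12 = k_A * cos(alpha) with alpha = theta - beta, as stated in the text.
    All default coefficients are illustrative placeholders.
    """
    r11 = prev_dist - dist          # assumption: positive when approaching
    alpha = theta - beta            # angle between heading and target heading
    r12 = k_a * math.cos(alpha)     # largest when |alpha| is smallest
    return k11 * r11 + k12 * r12
```

With these placeholder weights, an AUV that closes 1 m while pointing straight at the target receives 0.5 + 0.5 = 1.0, and one that holds its distance while pointing directly away receives -0.5.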
The wall tracking process includes the following steps.
The AUV wall tracking action considers the distance and relative angle between the AUV and the wall. The AUV obtains its distances $x_4$ and $x_5$ from the wall through two ranging sonars mounted fore and aft on one side.
The current AUV heading angle $\theta$ is obtained from the magnetic compass, and the current wall angle $\theta_{wall}$ is estimated.

In the equation, $l_{AUV}$ is the distance between the fore and aft sonars. The environment-state feature inputs for the wall tracking action are $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $\theta$, $\theta_{wall}$, and the target tracking distance $l_{goal}$, where $x_1$ to $x_5$ are the data measured by the three forward sonars and the fore and aft sonars on one side. These eight feature inputs fully describe the state relationship between the AUV and the wall. A distance threshold is set for checking the sonar data; if the threshold is exceeded during training, the current training cycle ends.
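The equation for $\theta_{wall}$ is not reproduced in this text. The sketch below shows one plausible geometric reading, assuming that $x_4$ is the fore sonar and $x_5$ the aft sonar on the same side, and that the wall angle is the heading corrected by the arctangent of the range difference over the sonar baseline $l_{AUV}$. The sign convention and both function names are assumptions, not the patent's formula.

```python
import math


def estimate_wall_angle(theta, x4, x5, l_auv):
    """ASSUMED estimate of the wall angle from two side ranging sonars.

    theta: current heading angle from the compass (rad)
    x4, x5: fore and aft side-sonar ranges to the wall (assumed order)
    l_auv: distance between the two sonars
    The sign of the correction depends on which side the sonars face;
    it is chosen arbitrarily here.
    """
    return theta + math.atan2(x4 - x5, l_auv)


def wall_tracking_state(x1, x2, x3, x4, x5, theta, l_auv, l_goal):
    """Build the 8-dimensional wall-tracking state listed in the text."""
    theta_wall = estimate_wall_angle(theta, x4, x5, l_auv)
    return [x1, x2, x3, x4, x5, theta, theta_wall, l_goal]
```

When both side sonars read the same range, the AUV is parallel to the wall and the estimated wall angle equals the heading.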
The reward function is designed to keep the AUV as parallel to the wall as possible and at a fixed distance from it. The reward signal for the wall tracking action based on a virtual target point is mainly composed of four terms, and the reward signal for the general wall tracking action is mainly composed of two terms.
The first term considers the angle between the AUV and the current wall, as in equation (6): when the angle between the AUV and the wall increases and exceeds a threshold, a negative reward is obtained, and when the angle decreases, a positive reward is obtained.

The second term considers the distance between the fore and aft ends of the AUV and the wall, as in equation (7): when the difference between the AUV-to-wall distance and the preset value decreases, a positive reward is obtained, and when this difference increases, a negative reward is obtained. The tracking distance is allowed to lie within ±0.2 m of the preset value, and within this tracking range the reward for this term is 0. Here, the distance to the wall is the average of the data from the two ranging sonars on the same side.

The total reward $r_2$ for the general wall tracking action is the weighted sum of the two terms:
$$r_2 = k_{21} r_{21} + k_{22} r_{22}$$
where $k_{21}$ and $k_{22}$ are the respective weights.
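Equations (6) and (7) are not reproduced in this text, so the following Python sketch only mirrors the sign rules stated in the prose. The reward magnitude `step`, the default weights, and all function names are hypothetical; only the ±0.2 m dead band, the averaging of the two side sonars, and the weighted sum come from the description.

```python
def angle_term(prev_angle, angle, threshold, step=0.1):
    """r21 sketch: penalize a growing AUV-wall angle beyond the threshold."""
    if abs(angle) > abs(prev_angle) and abs(angle) > threshold:
        return -step
    if abs(angle) < abs(prev_angle):
        return step
    return 0.0


def distance_term(prev_x4, prev_x5, x4, x5, l_goal, band=0.2, step=0.1):
    """r22 sketch: reward shrinking error between wall distance and l_goal.

    The wall distance is the mean of the two same-side sonar ranges.
    Inside the +/- band around l_goal the term is zero, as in the text.
    """
    prev_err = abs((prev_x4 + prev_x5) / 2.0 - l_goal)
    err = abs((x4 + x5) / 2.0 - l_goal)
    if err <= band:
        return 0.0
    if err < prev_err:
        return step
    if err > prev_err:
        return -step
    return 0.0


def wall_tracking_reward(r21, r22, k21=0.5, k22=0.5):
    """r2 = k21*r21 + k22*r22 for the general wall tracking action."""
    return k21 * r21 + k22 * r22
```

Using fixed-magnitude rewards keeps the sketch simple; the actual equations may scale the reward with the size of the change.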
In tracking based on a virtual target point, the virtual target point is one created for a wall with an outer right angle or an inner right angle. In an outer right-angle environment, the front side sonar detects no obstacle and its input is the maximum detection range, so a virtual wall is constructed and a virtual target point is added. In an inner right-angle environment, when the forward sonars detect a wall, a virtual target point is constructed on the side opposite the current target wall that the AUV is facing.
The reward function based on the virtual target point is constructed as follows.

$$r_{24} = k_B \cos(\alpha)$$
where $k_B$ is the reward coefficient for the wall tracking process.
The total reward $r_2$ for the tracking action based on a virtual target point is the weighted sum of the four terms:
$$r_2 = k_{21} r_{21} + k_{22} r_{22} + k_{23} r_{23} + k_{24} r_{24}$$
where $k_{23}$ and $k_{24}$ are the respective weights.
When the AUV has gradually adjusted until it is tracking the next section of wall, for example when the ranging sonars detect the target wall again in an outer right-angle environment, or when the forward sonars no longer detect a wall ahead in an inner right-angle environment, the virtual target point is deleted and the general wall tracking action network is called.
The obstacle avoidance process includes the following steps.
For the AUV obstacle avoidance action, the environment-state feature inputs include the data from the three forward sonars and from the front sonar on each side. Because the AUV must approach the target point while avoiding obstacles, the feature inputs further include the current position coordinates of the AUV $(x_{AUV}, y_{AUV})$, the target point coordinates $(x_{goal}, y_{goal})$, the current heading angle $\theta$, and the target heading angle $\beta$, for a total of 11 input dimensions.
For the reward function, the reward signal is divided into three terms. The first term is the reward $r_{31}$ obtained from the distance of the AUV to the obstacle: when the AUV approaches an obstacle it receives a negative reward as a warning, and when it moves away from the obstacle it receives a positive reward, which encourages it to sail clear of obstacles. On collision with an obstacle, it receives a reward of -1 and the current training cycle ends.

The second term is the reward $r_{32}$ generated from the current distance between the AUV and the target point. It encourages the AUV to sail toward the target point while avoiding obstacles: when the AUV moves away from the target point it receives a negative reward, and when it approaches the target point it receives a positive reward. On arrival at the target point, it receives a positive reward of 1.0 and the training cycle ends.

The third term is the reward $r_{33}$ generated from the angle $\alpha$ between the AUV and the current target. It likewise encourages the AUV to sail toward the target point, but its main purpose is to have the AUV learn to adjust its heading angle toward the current target heading, thereby shortening the path:
$$r_{33} = k_C \cos(\alpha)$$
where $k_C$ is the reward coefficient for the obstacle avoidance process.
The final total reward signal is the weighted sum of these three terms:
$$r_3 = k_{31} r_{31} + k_{32} r_{32} + k_{33} r_{33}$$
where $k_{31}$ to $k_{33}$ are the respective weights.
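A minimal Python sketch of the three-term obstacle avoidance reward follows. The formulas for $r_{31}$ and $r_{32}$ are not reproduced in this text, so fixed-magnitude rewards (`step`) stand in for them; `collision_range`, `arrive_radius`, the default weights, and the function names are all hypothetical. The terminal values -1 and 1.0, the cosine heading term, and the weighted sum are taken from the description.

```python
import math


def obstacle_term(prev_range, cur_range, collision_range, step=0.1):
    """r31 sketch. Returns (reward, episode_done)."""
    if cur_range <= collision_range:      # collision: reward -1, end cycle
        return -1.0, True
    if cur_range < prev_range:            # approaching the obstacle
        return -step, False
    if cur_range > prev_range:            # moving away from the obstacle
        return step, False
    return 0.0, False


def target_term(prev_dist, cur_dist, arrive_radius, step=0.1):
    """r32 sketch. Returns (reward, episode_done)."""
    if cur_dist <= arrive_radius:         # arrival: reward 1.0, end cycle
        return 1.0, True
    if cur_dist > prev_dist:              # moving away from the target
        return -step, False
    if cur_dist < prev_dist:              # approaching the target
        return step, False
    return 0.0, False


def obstacle_avoidance_reward(r31, r32, alpha,
                              k_c=1.0, k31=1.0, k32=1.0, k33=1.0):
    """r3 = k31*r31 + k32*r32 + k33*r33 with r33 = k_C * cos(alpha)."""
    r33 = k_c * math.cos(alpha)
    return k31 * r31 + k32 * r32 + k33 * r33
```

Returning an `episode_done` flag alongside each term is one way to express "the training cycle ends" on collision or arrival; the source does not say how this is signaled.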
Reinforcement learning trains a mapping from motion to environment. With the robot as the environment, the force and torque obtained through DDPG training are applied to the underwater robot, and the velocity and angular velocity of the robot are computed using the AUV model. The reward $r_4 = -|\Delta v + \Delta\Psi|$ is designed from the errors between the velocity and angular velocity and the target velocity and target angular velocity, where $\Delta v$ is the velocity error and $\Delta\Psi$ is the heading error.
In addition, random disturbance forces are added to the AUV model during training, and the DDPG-based control system is obtained by training. After training of the control system is complete, target commands are obtained from the current position of the robot and the target path according to a path tracking strategy, and the DDPG control system controls the robot to follow the planned commands.
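The control reward and the random disturbance described above can be sketched as follows. The reward is implemented literally as $r_4 = -|\Delta v + \Delta\Psi|$; the direction of each error (target minus current) and the uniform disturbance distribution are assumptions, and the function names are hypothetical.

```python
import random


def control_reward(v, v_target, psi, psi_target):
    """r4 = -|dv + dpsi|, as written in the text.

    dv and dpsi are taken as target minus current (assumed convention).
    """
    dv = v_target - v
    dpsi = psi_target - psi
    return -abs(dv + dpsi)


def disturbed_force(force, max_disturbance, rng=random):
    """Add a random disturbance to a commanded force during training.

    A uniform distribution on [-max_disturbance, max_disturbance] is an
    ASSUMPTION; the source only says random interference is added.
    """
    return force + rng.uniform(-max_disturbance, max_disturbance)
```

Note that with the reward in this literal form, a velocity error and a heading error of opposite sign can cancel each other inside the absolute value.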

The beneficial effects of the present invention are as follows.
1. The three-layer planning system designed in the present invention decomposes the total task into the sub-actions of moving to the target and avoiding obstacles, designs the environment state models and reward functions, and reduces the dimensionality of the space through policy optimization during operation. It can therefore plan safe, collision-free paths even in complex environment models, which solves the "curse of dimensionality" problem.
In addition, the present invention has a high level of intelligence: planning does not depend on manual programming, and robot control is achieved without relying on human experience.
2. The present invention applies a deep reinforcement learning algorithm to the action planning system and extracts high-dimensional data features through a neural network, which solves the problem of perception in continuous environment states, and it uses reinforcement learning for action decision planning. Three typical actions are defined according to the needs of the tunnel detection task: moving to the target point, wall tracking, and obstacle avoidance. An action network is built for each action, and the corresponding environment state variables and reward function are designed. For wall corners, a tracking method based on virtual target points is proposed. Each action achieves its goal, and the tunnel detection task is completed by calling the action networks, so the algorithm is highly stable and generalizes well.
3. Because the present invention trains the mapping from force to velocity using the kinematic model of the AUV as the environment, the control method does not require an accurate environment model. This solves the problem that training experience is limited and application in real environments is difficult. Compared with other intelligent control algorithms, the method is broadly applicable: once trained successfully, it can be applied to a variety of tasks.

A schematic diagram of the tasks of the autonomous underwater vehicle divided into three layers.
A schematic diagram of the decomposition of the task.
A schematic diagram of the wall tracking action.
A schematic diagram of the outer right-angle wall environment.
A schematic diagram of the inner right-angle wall environment.
A schematic diagram of the obstacle avoidance action.
A layout diagram of the AUV sonars.

Embodiment 1
This embodiment is an AUV action plan and motion control method based on reinforcement learning.

The present invention defines a three-layer structure for the tasks of an autonomous underwater vehicle, namely a task layer, an action layer, and a motion layer. AUV action planning is executed when an emergency occurs, and motion control of the AUV is performed by a Deep Deterministic Policy Gradient (DDPG) controller.

The implementation process includes the following three parts.
(1) Hierarchical design of the tasks of the autonomous underwater vehicle
(2) Construction of the action planning system
(3) Design based on the DDPG control algorithm

The process of (1) above is as follows.
To organize the tunnel detection task of the underwater robot into layers, the concepts of task, action, and motion are defined for tunnel detection by an autonomous underwater vehicle. Tunnel detection by the AUV is defined as the total task. To complete the total task, three typical actions are defined: moving to a target, wall tracking, and obstacle avoidance. The specific control commands generated as the robot sails underwater to complete a planned action are defined as motions, for example turning left by n degrees, turning right by n degrees, or advancing at a speed of n knots.
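The task, action, and motion concepts defined above can be pictured as simple data types. The Python sketch below is only an illustration: the class names, field names, and the sign convention for turns are assumptions rather than anything specified in the patent.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three typical actions that make up the tunnel detection task."""
    MOVE_TO_TARGET = "move_to_target"
    WALL_TRACKING = "wall_tracking"
    OBSTACLE_AVOIDANCE = "obstacle_avoidance"


@dataclass(frozen=True)
class Motion:
    """One low-level control command.

    turn_deg: positive for a left turn, negative for a right turn
              (sign convention assumed for illustration).
    speed_knots: forward speed in knots.
    """
    turn_deg: float
    speed_knots: float


# The total task is completed by calling on these three actions.
TUNNEL_DETECTION_TASK = tuple(Action)

# Example motions: turn left 10 degrees, or advance straight at 2 knots.
turn_left = Motion(turn_deg=10.0, speed_knots=0.0)
advance = Motion(turn_deg=0.0, speed_knots=2.0)
```

Separating the three levels this way makes explicit that an action network outputs motions, while the task layer only sequences actions.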

As shown in FIG. 1, the architecture of the action planning system of the autonomous underwater vehicle is divided into three layers: a total task layer, an action layer, and a motion layer. The model is a framework layered from the bottom up. The motion layer can be regarded as the process by which the AUV interacts with the environment: the AUV executes motions that act on the environment, and through this layer the planning system acquires real-time environment data and its own state data and updates its global planning knowledge by learning and training on top of its previous learning experience. The historical experience of environment state data in the training sample library is compared and analyzed against the current environment state, and the comparison results and the updated planning knowledge are fed back to the total task layer. The total task layer is the upper layer of the planning system: it mainly analyzes the current environment state, outputs a planning result according to a specific internal strategy, and sends it to the action layer in the form of an action sequence; that is, it plans the action sequence according to the current environment state data. The action layer is the middle layer: it mainly considers the local environment state information acquired by the motion layer and selects actions according to a specific strategy, based on the upper-layer planning result from the total task layer. In summary, the total task layer provides the upper-layer planning result based on environment state data, the action layer selects and executes actions based on that result, and in the motion layer the AUV executes basic motions according to the action strategy and senses changes in the environment state. The three-layer framework thus forms a bottom-up planning and decision model.

また、前記（２）の過程は以下の通りである。
ＡＵＶは、トンネル検出タスクを実行する際に、タスクのニーズに応じて、グローバル経路計画によって指定されたクリティカル経路ポイントに順次到着する。ただし、実際の作業過程では、急に現れた障害物やトンネル壁の損傷によるトンネル壁環境の変化など、未知の環境情報が存在するため、安全性を確保するために、ＡＵＶは環境情報と自身の状況に基づいてタイムリーに応答する必要がある。ディープ強化学習に基づく行動計画システムは、反応式に基づく計画アーキテクチャを採用しており、環境状態と動作の間のマッピング関係を構築することにより、ＡＵＶは環境の変化に応じて動作をすばやく計画することができ、緊急環境変化に対するＡＵＶの対処能力を向上できる。 The process of (2) above is as follows.
When performing a tunnel detection task, the AUV arrives in sequence at the critical route points specified by the global route planning, as the task requires. In actual operation, however, unknown environmental information exists, such as obstacles that appear suddenly or changes in the tunnel wall environment caused by wall damage. To ensure safety, the AUV must therefore respond in a timely manner based on the environmental information and its own state. The behavior planning system based on deep reinforcement learning adopts a reactive planning architecture: by building a mapping between environmental states and actions, it lets the AUV plan actions quickly as the environment changes, which improves the AUV's ability to cope with sudden environmental changes.

本発明は、研究対象としてインテリジェント加圧送水トンネルを検出するＡＵＶを採用し、このＡＵＶは、ＡＵＶに装備した水中音響機器やセンサーなどを利用して水中環境を検出し、ディープ強化学習ＤＱＮアルゴリズムを使用して行動計画をリアルタイムで行い、つまり、マルチ行動ネットワークコール呼び出しに基づく行動計画アーキテクチャを構築し、タスクのニーズに応じて３つの基本動作の環境状態特徴の入力及び出力の動作を定義し、対応するディープ学習動作ネットワークを構築し、報酬関数を設計し、壁追跡行動では、壁の隅の問題に対しては、仮想ターゲットポイントに基づく追跡方法が提案されている。 The present invention takes as its research object an intelligent AUV for pressurized water supply tunnel detection. The AUV senses the underwater environment with its onboard underwater acoustic equipment and sensors, and performs behavior planning in real time with the deep reinforcement learning DQN algorithm. That is, a behavior planning architecture based on calling multiple behavior networks is constructed; the environmental state feature inputs and the output actions of the three basic behaviors are defined according to the task requirements; the corresponding deep learning behavior networks are constructed; and the reward functions are designed. For the wall tracking behavior, a tracking method based on virtual target points is proposed to handle the wall corner problem.

行動層の計画の問題については、本発明は、トンネル検出を適用背景の代表例として、ターゲットへの移動行動、壁追跡行動、及び障害物回避行動という３つの代表的行動を提案し、底層の基本行動を定義し、行動ネットワークを設計し、計画システムは、トレーニング済み行動ネットワークを呼び出すことでトンネル検出タスクを完了する。トンネル検出タスクの場合、このタスクは一連の行動シーケンスに分解でき、図２に示すように、グローバル経路計画は、事前環境情報に基づいて複数の実行可能な経路ポイントを計画し、ＡＵＶは配置位置から出発し、各経路ポイントに順次到着する。 For the planning problem of the behavior layer, the present invention takes tunnel detection as a representative application background and proposes three typical behaviors: moving to a target, wall tracking, and obstacle avoidance. The basic behaviors of the bottom layer are defined and the behavior networks are designed, and the planning system completes the tunnel detection task by calling the trained behavior networks. The tunnel detection task can be decomposed into a sequence of behaviors. As shown in FIG. 2, the global route planning plans a number of feasible route points based on prior environmental information, and the AUV departs from the deployment position and arrives at each route point in sequence.

航渡タスクは、ＡＵＶが各経路の開始ポイントから各クリティカルポイントに到着することであり、各航渡タスクごとに異なる速度制約を設定することができる。経路ポイントは既知環境下のグローバル計画であるため、航渡中、ＡＵＶはリアルタイム環境状態に従って障害物回避行動を呼び出して、経路ポイントに安全に到着するため、各トラックは一意ではない。トンネル検出タスクは経路ポイント３から始まり経路ポイント４で終わり、ＡＵＶは主に壁追跡行動を呼び出して、所定の検出目標に従ってタスクを完了する。 The voyage task is that the AUV arrives at each critical point from the start point of each route, and different speed constraints can be set for each voyage task. Since the route point is a global plan under known environment, each track is not unique because the AUV calls the obstacle avoidance action according to the real-time environmental condition and safely arrives at the route point during the voyage. The tunnel detection task starts at route point 3 and ends at route point 4, where the AUV primarily calls wall tracking actions to complete the task according to a predetermined detection target.

さらに、アーキテクチャ内の検知モジュール（ソナーを含む）は、ＡＵＶセンサーのデータを取得し、行動のニーズに応じてデータを分析することで、リアルタイムなＡＵＶ状態情報と環境情報を検出する。決定モジュールは、グローバルデータ、決定システム、行動ライブラリ及び評価システムを含む、計画システムの中核である。グローバルデータには、タスク情報、状況情報、計画知識などが記憶されており、決定システムは、ＤＱＮアルゴリズムと組み合わせた自己学習計画システムでもあり、決定システムは、まず大量のトレーニングを行い、計画タスクを実行するに先立って行動データベースからトレーニング済みネットワークパラメータを抽出し、次に、現在の環境状態情報を入力として、現在の行動動作を計画し、評価システムは強化学習アルゴリズムの報酬関数システムであり、ＡＵＶが行動動作を計画して実行すると、状態環境とタスク情報に基づいて報酬を提供し、すべてのデータはグローバルデータベースに記憶されている。 In addition, the sensing module (including sonar) in the architecture acquires the AUV sensor data and analyzes it according to the needs of each behavior, thereby detecting real-time AUV state information and environmental information. The decision module is the core of the planning system and includes global data, a decision system, a behavior library, and an evaluation system. The global data stores task information, situation information, planning knowledge, and so on. The decision system is a self-learning planning system combined with the DQN algorithm: it is first trained extensively, and before a planning task is executed it extracts the trained network parameters from the behavior database and then plans the current behavior and action with the current environmental state information as input. The evaluation system is the reward function system of the reinforcement learning algorithm: after the AUV plans and executes a behavior or action, it provides a reward based on the environmental state and the task information. All data is stored in the global database.
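How the decision module might choose which trained behavior network to call can be sketched as below. This is a minimal illustration and not the patent's implementation: the `Behavior` enum, the `select_behavior` function, the `in_tunnel_segment` flag, and the priority order (obstacle avoidance first) are all assumptions made for the example.

```python
from enum import Enum


class Behavior(Enum):
    MOVE_TO_TARGET = "move_to_target"
    WALL_TRACKING = "wall_tracking"
    OBSTACLE_AVOIDANCE = "obstacle_avoidance"


def select_behavior(obstacle_detected: bool, in_tunnel_segment: bool) -> Behavior:
    """Pick which trained behavior network to call for the current state.

    Assumed priority: avoid obstacles first, track the wall inside the
    tunnel detection segment, otherwise head for the next route point.
    """
    if obstacle_detected:
        return Behavior.OBSTACLE_AVOIDANCE
    if in_tunnel_segment:
        return Behavior.WALL_TRACKING
    return Behavior.MOVE_TO_TARGET
```

In a full system the returned value would index into the behavior library to load the corresponding network parameters.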

２．１）ターゲットへの移動
ＡＵＶは、トンネル検出タスクを実行する過程に亘って、予めグローバルに計画されたターゲットポイントに到着する必要があり、経路を最短にするために、ターゲットポイントへの移動行動は、ＡＵＶが障害物を検知していないときに向首角を調整しながらターゲットポイントへ航行するようにし、したがって、ターゲットへの移動行動過程におけるＡＵＶのリアルタイム向首をできるだけターゲット方向付近に制御する必要がある。ターゲットへの移動行動のニーズに応じて、図２に示すように、特徴入力量は主にＡＵＶとターゲットポイントの位置及び角度の関係を考慮し、具体的には、現在のＡＵＶ位置座標（ｘ_AUV,ｙ_AUV）、ターゲットポイント座標（ｘ_goal,ｙ_goal）、現在の向首角θ及びターゲット向首角βの計６次元の入力を設定する。ターゲット向首角βは、ＡＵＶがターゲットへ航行しているときの向首角である。 2.1) Moving to a target While performing the tunnel detection task, the AUV must reach target points that have been planned globally in advance. To keep the route as short as possible, the move-to-target behavior makes the AUV navigate toward the target point while adjusting its heading angle whenever it detects no obstacle, so the real-time heading of the AUV must be kept as close to the target direction as possible. As shown in FIG. 2, the feature inputs for this behavior mainly reflect the position and angle relationship between the AUV and the target point: the current AUV position coordinates (x_AUV, y_AUV), the target point coordinates (x_goal, y_goal), the current heading angle θ, and the target heading angle β, for a total of six input dimensions. The target heading angle β is the heading angle when the AUV is navigating toward the target.

２．１．１）報酬関数の設計
ターゲットへの移動行動は、主にＡＵＶが障害物無し環境でターゲットポイントへ航行するように駆動し、したがって、具体的な報酬関数は、２項に設定され、第１項ｒ₁₁はＡＵＶとターゲットポイントの距離の変化を考慮し、具体的には、

第２項ｒ₁₂はＡＵＶの向首角の変化を考慮し、ＡＵＶがターゲット向首に調整して航行するように促し、首角がターゲットに近いほど、ターゲット値の報酬値が大きく、現在のＡＵＶ向首とターゲット向首との夾角αが、
α＝θ−β （２）であり、
αの絶対値が小さいほど、取得する報酬値が大きく、具体的には、
ｒ₁₂＝ｋ_Aｃｏｓ（α）（３）
式中、ｋ_Aはターゲットへの移動過程に対応する報酬係数であり、
総報酬値は２項を加重したものであり、
ｒ₁＝ｋ₁₁ｒ₁₁＋ｋ₁₂ｒ₁₂ （４）
式中、ｋ₁₁、ｋ₁₂はそれぞれ加重値である。 2.1.1) Design of the reward function The move-to-target behavior mainly drives the AUV toward the target point in an obstacle-free environment, so the reward function has two terms. The first term r₁₁ reflects the change in the distance between the AUV and the target point, specifically,

The second term r₁₂ reflects the change in the AUV heading angle and encourages the AUV to steer toward the target heading: the closer the heading angle is to the target heading, the larger the reward. The angle α between the current AUV heading and the target heading is
α = θ − β (2),
and the smaller the absolute value of α, the larger the reward obtained, specifically,
r₁₂ = k_A cos(α) (3)
where k_A is the reward coefficient for the move-to-target process.
The total reward is the weighted sum of the two terms,
r₁ = k₁₁r₁₁ + k₁₂r₁₂ (4)
where k₁₁ and k₁₂ are the respective weights.
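The heading term (3) and the weighted sum (4) can be sketched in Python as follows. Equation (1) for r₁₁ is not reproduced in this text, so the distance term used here (previous distance minus current distance) is an assumption, as are the function names and the default coefficient values.

```python
import math


def heading_reward(theta, beta, k_a=1.0):
    """r12 = k_A * cos(alpha) with alpha = theta - beta (equations (2) and (3))."""
    alpha = theta - beta
    return k_a * math.cos(alpha)


def distance_reward(prev_dist, curr_dist):
    """Assumed stand-in for r11: positive when the AUV moved closer to the target."""
    return prev_dist - curr_dist


def move_to_target_reward(prev_dist, curr_dist, theta, beta,
                          k11=1.0, k12=1.0, k_a=1.0):
    """r1 = k11 * r11 + k12 * r12 (equation (4))."""
    r11 = distance_reward(prev_dist, curr_dist)
    r12 = heading_reward(theta, beta, k_a)
    return k11 * r11 + k12 * r12
```

Because cos(α) peaks at α = 0, the heading term is largest when the AUV points exactly along the target heading and turns negative once the heading error exceeds 90°.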

２．２）壁追跡
ほとんどのトンネルの距離が長いため、水利プロジェクト全体が１０ｋｍ以上に達する可能性があり、ＡＵＶがトンネルの入口に入ると、手動による介入が困難になり、このため、ＡＵＶがトンネル環境に応じて自律的に検出タスクを完了することが求められる。衝突を回避するには、ＡＵＶは壁から安全な距離だけ離れる必要があり、そして、水中の光源や視認性などによって制限されて、ＡＵＶと壁の間の距離が画像収集の品質にも直接影響し、したがって、ＡＵＶには、壁から一定の距離を保持しながら壁に沿って航行する能力が求められる。 2.2) Wall tracking Due to the long distance of most tunnels, the entire irrigation project can reach more than 10km, and once the AUV enters the tunnel entrance, manual intervention becomes difficult, which makes the AUV difficult. It is required to complete the detection task autonomously according to the tunnel environment. To avoid collisions, the AUV must be a safe distance from the wall, and the distance between the AUV and the wall directly affects the quality of image acquisition, limited by underwater light sources and visibility. Therefore, the AUV is required to have the ability to navigate along the wall while maintaining a certain distance from the wall.

２．２．１）上記ＡＵＶの壁追跡機能のニーズに応じて、この行動は主にＡＵＶと壁の距離及び相対角度の情報を考慮する。図３に示すように、ＡＵＶが自体の右側の壁を追跡して航行する例では、ＡＵＶは、右側に配置された前後の２つのレンジングソナーを通じて壁からのＡＵＶの距離ｘ₄とｘ₅を取得する。 2.2.1) Depending on the needs of the wall tracking function of the AUV, this action mainly considers information on the distance and relative angle between the AUV and the wall. As shown in FIG. 3, in the example where the AUV navigates by tracking the wall on the right side of itself, the AUV measures the distance x ₄ and x ₅ of the AUV from the wall through two front and rear ranging sonars placed on the right side. get.

本実施形態では、ＡＵＶは、合計７個のレンジングソナーが設けられており、図７に示すように、ＡＵＶの前端には３つの前方ソナー（図７の１、２、３）が設けられ、ＡＵＶの両側のそれぞれに２つのソナー（図７の４、５、と６、７）が設けられ、各側にある２つのソナーはそれぞれ前後でそれぞれ１つ設けられ、前端のものをフロントソナー、後端のものをリアソナーと呼ぶ。 In the present embodiment, the AUV is provided with a total of seven ranging sonars, and as shown in FIG. 7, three front sonars (1, 2, 3 in FIG. 7) are provided at the front end of the AUV. Two sonars (4, 5, and 6, 7 in FIG. 7) are provided on each side of the AUV, one on each side is provided on the front and back, and the front end is the front sonar. The rear end is called the rear sonar.

方位磁針で現在のＡＵＶ向首角θを取得し、現在の壁角度θ_wallを推定する。

式中、ｌ_AUVは前後の２つのソナーの間の距離であり、壁追跡行動の環境状態の特徴入力はｘ₁、ｘ₂、ｘ₃、ｘ₄、ｘ₅、θ、θ_wall及びターゲット追跡距離ｌ_goalに設定され、ここで、ｘ₁〜ｘ₅はそれぞれ３つの前方ソナーと一側に設けられた前後ソナー（本実施形態では、１−５ソナーと記載する）により測定されたデータであり、特徴入力量は８個であり、前方ソナーと側面ソナーのデータを含み、前方ソナーは主に壁環境での前方の壁からの距離ｘ₁を検出し、以上の特徴変量はＡＵＶと壁の間の状態関係を完全に記述することができる。距離閾値を設定してソナーデータについて判断を行い、トレーニング中に閾値を超えると、現在のトレーニング周期を終了する。 The current AUV head angle θ is acquired with the compass, and the current wall angle θ _wall is estimated.

In the equation, l_AUV is the distance between the front and rear sonars. The environmental state feature inputs of the wall tracking behavior are x₁, x₂, x₃, x₄, x₅, θ, θ_wall, and the target tracking distance l_goal, where x₁ to x₅ are the data measured by the three forward sonars and by the front and rear sonars on one side (sonars 1 to 5 in this embodiment). There are thus eight feature inputs, covering forward and side sonar data; the forward sonars mainly detect the distance x₁ to the wall ahead in a wall environment. These feature variables fully describe the state relationship between the AUV and the wall. A distance threshold is set to check the sonar data, and if the threshold is exceeded during training, the current training episode ends.
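The eight-dimensional input can be assembled as in the sketch below. The wall-angle equation itself is not reproduced in this text, so `relative_wall_angle` is only one plausible geometric reading (two parallel side beams spaced l_AUV apart) and should not be taken as the patent's formula; both function names are hypothetical.

```python
import math


def relative_wall_angle(x_front, x_rear, l_auv):
    """Angle between the AUV axis and the wall from two same-side sonar ranges.

    Assumed geometry: the sonars sit l_auv apart along the hull and both
    measure perpendicular to it; zero means the AUV is parallel to the wall.
    """
    return math.atan2(x_front - x_rear, l_auv)


def wall_tracking_features(x, theta, theta_wall, l_goal):
    """Assemble the eight inputs: x1..x5, theta, theta_wall, l_goal."""
    if len(x) != 5:
        raise ValueError("expected five sonar ranges x1..x5")
    return [*x, theta, theta_wall, l_goal]
```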

２．２．２）報酬関数の設計
ＡＵＶの壁追跡行動学習において、報酬関数は、ＡＵＶができるだけ壁に平行し、壁となす角度を約０°に維持し、壁とは一定の距離を保持するようにすることに用いられる。 2.2.2) Design of reward function In AUV wall tracking behavior learning, the reward function keeps the AUV as parallel to the wall as possible, keeps the angle with the wall at about 0 °, and keeps a certain distance from the wall. It is used to make it work.

以上の要素を考慮して、仮想ターゲットポイントに基づく壁追跡行動の報酬信号が主に４項に設定され、一般的な壁追跡行動の報酬信号が主に２項に設定される。 In consideration of the above factors, the reward signal for the wall tracking action based on the virtual target point is mainly set in the fourth term, and the reward signal for the general wall tracking action is mainly set in the second term.

第１項は、式（６）のようにＡＵＶと現在の壁がなす角度を考慮し、ＡＵＶと壁の角度が増大し閾値を超えると、負の報酬値を取得し、ＡＵＶと壁の角度が減小すると、正の報酬値を取得し、

第２項は、式（７）のようにＡＵＶの前後両端と壁の距離を考慮し、ＡＵＶと壁の距離と予め設定された値との差が減小すると、正の報酬を取得し、この差が増大すると、負の報酬を取得し、追跡距離が予め設定された値の±０．２ｍの範囲にあることができ、追跡範囲内のこの項の報酬値が０である場合、この場所と壁の距離値は、同一側面にある２つのレンジングソナーデータによる平均値である。

一般的な壁追跡行動の総報酬ｒは２項の報酬を加重したものであり、
ｒ₂＝ｋ₂₁ｒ₂₁＋ｋ₂₂ｒ₂₂ （８）
式中、ｋ₂₁、ｋ₂₂はそれぞれ加重値である。 The first term, as in equation (6), reflects the angle between the AUV and the current wall: when that angle increases and exceeds a threshold, a negative reward is obtained, and when it decreases, a positive reward is obtained,

The second term, as in equation (7), reflects the distance between the front and rear ends of the AUV and the wall: when the difference between the AUV-to-wall distance and the preset value decreases, a positive reward is obtained, and when it increases, a negative reward is obtained. The tracking distance is allowed to lie within ±0.2 m of the preset value, and within this tracking range the reward for this term is 0. The AUV-to-wall distance used here is the average of the two ranging sonar readings on the same side.

The total reward r₂ for general wall tracking behavior is the weighted sum of the two terms,
r₂ = k₂₁r₂₁ + k₂₂r₂₂ (8)
where k₂₁ and k₂₂ are the respective weights.
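The two terms described above can be sketched as follows. Equations (6) and (7) are not reproduced in this text, so the reward magnitudes (`step`) and the exact branching are assumptions; only the sign behavior, the ±0.2 m band, the sonar averaging, and the weighted sum (8) come from the description.

```python
def angle_term(prev_angle, curr_angle, threshold, step=1.0):
    """r21 sketch: penalize a growing wall angle beyond the threshold, reward a shrinking one."""
    prev_a, curr_a = abs(prev_angle), abs(curr_angle)
    if curr_a > prev_a and curr_a > threshold:
        return -step
    if curr_a < prev_a:
        return step
    return 0.0


def wall_distance(x_front, x_rear):
    """Average of the two same-side ranging sonar readings."""
    return 0.5 * (x_front + x_rear)


def distance_term(prev_dist, curr_dist, l_goal, tol=0.2, step=1.0):
    """r22 sketch: zero inside the +/- tol band, otherwise reward a shrinking error."""
    prev_err, curr_err = abs(prev_dist - l_goal), abs(curr_dist - l_goal)
    if curr_err <= tol:
        return 0.0
    if curr_err < prev_err:
        return step
    if curr_err > prev_err:
        return -step
    return 0.0


def wall_tracking_reward(r21, r22, k21=1.0, k22=1.0):
    """Equation (8): r2 = k21 * r21 + k22 * r22."""
    return k21 * r21 + k22 * r22
```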

２．２．３）仮想ターゲットポイントに基づく追跡方法
一般的な壁環境では、壁追跡行動には、ターゲットの向首角とターゲットの追跡距離のみを考慮すればよく、ターゲットへの移動行動や障害物回避行動に比べて、実際ターゲットポイントによる案内がないので、壁の隅などのような特殊な環境の場合は、正確な計画結果を提供することができない。壁の隅の問題は、ＡＵＶ壁追跡行動における主な難問であり、本発明では、主に２種類の壁の隅の環境、つまり外直角環境と内直角環境を考慮する。壁の隅の環境の特殊性のため、外直角を追跡する場合、ＡＵＶの前方にあるレンジングソナーが壁を検出できず、ＡＵＶはタイムリーに向首角を調整できず、ターゲットを失うことがある。内側の壁の隅の場合、基本報酬の設計に前方の障害物を考慮しないので、衝突が発生する。 2.2.3) Tracking method based on virtual target points In a general wall environment, the wall tracking behavior only needs to consider the target heading angle and the target tracking distance. Unlike the move-to-target and obstacle avoidance behaviors, it is not guided by an actual target point, so it cannot give correct planning results in special environments such as wall corners. The wall corner problem is the main difficulty in AUV wall tracking, and the present invention mainly considers two kinds of corner environments: the outer right angle and the inner right angle. Because of the particular geometry of a corner, when the AUV tracks an outer right angle, its front ranging sonar cannot detect the wall, so the AUV cannot adjust its heading angle in time and may lose the target. At an inner corner, the basic reward design does not account for the obstacle ahead, so a collision occurs.

この問題に対しては、本発明は、ＡＵＶ壁追跡をガイドするための仮想ターゲットポイントを構築する方法を提案する。図４及び図５には、外直角の壁と内直角の壁について構築される仮想ターゲットポイントが示されている。環境が外直角である場合、フロントソナーが障害物を検出していないときに入力が最大検出距離であるので、仮想壁は点線のように構築され、これに基づいて仮想ターゲットポイントが追加される。仮想ターゲットポイントの位置は、ＡＵＶ位置、レンジングソナーデータ、及び安全距離Ｌ₁によって決定される。

環境が内直角である場合、図５に示すように、仮想壁を構築できず、ＡＵＶがタイムリーに方向を変更して前方の壁の障害物を回避することを考慮すると、前方ソナーにより壁が検知されると、ＡＵＶが対向する現在のターゲット壁の他方の側で仮想ターゲットポイントが構築され、仮想ターゲットポイントの位置はＡＵＶ位置、向首角及び安全距離Ｌ₂により決定される。

２種の環境のいずれにも安全距離Ｌ₁とＬ₂が設定され、シミュレーションテストを行った結果、その値がターゲット追跡距離程度である場合、行動計画の効果が良好である。仮想ターゲットポイントに基づく報酬関数の構築は以下のとおりである。

ｒ₂₄＝ｋ_Bｃｏｓ（α）（１４）
式中、ｋ_Bは壁追跡過程に対応する報酬係数であり、
仮想ターゲットポイントに基づく追跡行動の総報酬ｒは４項の報酬を加重したものである。
ｒ₂＝ｋ₂₁ｒ₂₁＋ｋ₂₂ｒ₂₂＋ｋ₂₃ｒ₂₃＋ｋ₂₄ｒ₂₄（１５）
式中、ｋ₂₃、ｋ₂₄はそれぞれ加重値であり、
報酬係数ｋ₂₃とｋ₂₄値が大きいため、壁の隅の環境ではＡＵＶは仮想ターゲットポイントにより案内される傾向がある。 To address this problem, the present invention proposes constructing virtual target points to guide AUV wall tracking. FIGS. 4 and 5 show the virtual target points constructed for an outer right-angle wall and an inner right-angle wall. In the outer right-angle environment, the front sonar input equals the maximum detection distance when it detects no obstacle, so a virtual wall is constructed as shown by the dotted line, and a virtual target point is added on that basis. The position of the virtual target point is determined by the AUV position, the ranging sonar data, and the safety distance L₁.

In the inner right-angle environment, as shown in FIG. 5, a virtual wall cannot be constructed. Since the AUV must change direction in time to avoid the wall ahead, when the forward sonar detects a wall, a virtual target point is constructed on the side of the AUV opposite the current target wall. Its position is determined by the AUV position, the heading angle, and the safety distance L₂.

Safety distances L₁ and L₂ are set in both environments. Simulation tests showed that behavior planning works well when their values are about equal to the target tracking distance. The reward function based on the virtual target point is constructed as follows.

r₂₄ = k_B cos(α) (14)
where k_B is the reward coefficient for the wall tracking process.
The total reward r₂ of the tracking behavior based on virtual target points is the weighted sum of the four terms,
r₂ = k₂₁r₂₁ + k₂₂r₂₂ + k₂₃r₂₃ + k₂₄r₂₄ (15)
where k₂₃ and k₂₄ are the respective weights.
Because the weights k₂₃ and k₂₄ are relatively large, the AUV tends to be guided by the virtual target point in a wall corner environment.

ＡＵＶが次の部分の壁を追跡するまで徐々に調整したとき、つまり、外直角環境におけるレンジングソナーが再度ターゲット壁を検知したか、内直角環境における前方ソナーがさらに前方の壁を検知しない場合、仮想ターゲットポイントを削除し、一般的な壁追跡行動ネットワークを呼び出す。 When the AUV is gradually adjusted until it tracks the next part of the wall, that is, if the ranging sonar in the outer right angle environment detects the target wall again, or if the forward sonar in the inner right angle environment does not detect the further forward wall. Delete the virtual target point and call the general wall tracking behavior network.
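The switching between the general wall tracking network and the virtual-target variant, together with the four-term sum (15), can be sketched as below. The function names, the string labels, the `detect_range` threshold, and the default weights are assumptions for illustration; the text only states that the side front sonar reads its maximum range at an outer corner and that a forward sonar detects the wall at an inner corner.

```python
def select_wall_network(side_front_range, forward_range, max_range, detect_range):
    """Choose which wall tracking mode to use from the current sonar ranges.

    Inner corner: a forward sonar sees a wall closer than detect_range.
    Outer corner: the side front sonar has lost the wall (reads max_range).
    Otherwise the general wall tracking network is called.
    """
    if forward_range < detect_range:
        return "virtual_target_inner"
    if side_front_range >= max_range:
        return "virtual_target_outer"
    return "general"


def virtual_target_reward(r21, r22, r23, r24,
                          k21=1.0, k22=1.0, k23=2.0, k24=2.0):
    """Equation (15): weighted sum of the four wall tracking reward terms."""
    return k21 * r21 + k22 * r22 + k23 * r23 + k24 * r24
```

The larger default values for `k23` and `k24` mirror the statement that these weights dominate near a corner; the specific value 2.0 is arbitrary.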

２．３）障害物回避
障害物回避行動は、行動計画システムのキーであり、ＡＵＶの自律的決定レベルを決定し、ＡＵＶが作業タスクを安全的に実施できるかを左右する。 2.3) Obstacle Avoidance Obstacle avoidance behavior is a key to the action planning system, which determines the autonomous decision level of the AUV and determines whether the AUV can safely perform work tasks.

２．３．１）ＡＵＶ障害物回避行動のニーズに応じて、図６に示すように、行動計画システムは、周辺の障害物環境情報を十分に取得する必要があるので、環境状態の特徴入力には、３つの前方ソナーと両側のそれぞれに設けられたフロントソナーによるデータが含まれる。 2.3.1) As shown in FIG. 6, the action planning system needs to sufficiently acquire the surrounding obstacle environmental information according to the needs of the AUV obstacle avoidance action, so that the characteristics of the environmental state are input. Contains data from three front sonars and front sonars on each side.

ＡＵＶは、障害物回避を実行しながらターゲットポイントの方向へ近づき、ＡＵＶとターゲットポイントの相対位置情報を取得する必要があるので、特徴入力は、ＡＵＶの現在の位置座標（ｘ_AUV,ｙ_AUV）、ターゲットポイント位置座標（ｘ_goal,ｙ_goal）、現在の向首角θ及びターゲット向首角βという計１１次元の入力を含む。 Because the AUV must approach the target point while avoiding obstacles, it needs the relative position of the AUV and the target point. The feature inputs therefore also include the current AUV position coordinates (x_AUV, y_AUV), the target point coordinates (x_goal, y_goal), the current heading angle θ, and the target heading angle β, which together with the five sonar readings gives a total of 11 input dimensions.

２．３．２）報酬関数の設計
障害物回避行動は、ＡＵＶが急に現れた障害物を回避しターゲットポイントに順調に到着するようにするために用いられ、したがって、報酬信号分が３項に分けられ、第１項は障害物に対するＡＵＶ距離に基づいて得られた報酬値ｒ₃₁であり、式１６に示すように、ＡＵＶが障害物に近づくと、負の報酬の警告を取得し、ＡＵＶが障害物から離間すると、正の報酬を取得し、ＡＵＶが障害物から離間して航行するように促し、障害物と衝突すると報酬値−１を取得し、現在のトレーニング周期を終了する。

第２項は、現在のＡＵＶとターゲットポイントの距離に基づいて生じる報酬値ｒ₃₂であり、ＡＵＶが障害物を回避しながらターゲットポイントへ航行するように促し、このため、ＡＵＶがターゲットポイントから離間すると、負の報酬を取得し、ターゲットポイントに近づくと正の報酬を取得し、ＡＵＶがターゲットポイントに到着すると、正の報酬値１．０を取得し、トレーニング周期を終了する。

第３項は、ＡＵＶと現在のターゲットがなす角度αに基づいて生じる報酬ｒ₃₃であり、同様にＡＵＶがターゲットポイントの方向へ航行するように促すが、この項の報酬は、主に、現在のターゲット向首に近くなるように向首角を調整することをＡＵＶに学習させ、経路の長さを減らすようにするためである。
ｒ₃₃＝ｋ_Cｃｏｓ（α）（１８）
式中、ｋ_Cは障害物回避過程に対応する報酬係数であり、
最後の総報酬信号はこの３項の報酬値を加重したものに等しく、
ｒ₃＝ｋ₃₁ｒ₃₁＋ｋ₃₂ｒ₃₂＋ｋ₃₃ｒ₃₃ （１９）
式中、ｋ₃₁〜ｋ₃₃はそれぞれ加重値である。 2.3.2) Design of the reward function The obstacle avoidance behavior lets the AUV avoid obstacles that appear suddenly and still reach the target point, so the reward signal has three terms. The first term is the reward r₃₁ based on the distance from the AUV to the obstacle, as shown in equation (16): the AUV receives a negative reward as a warning when it approaches an obstacle and a positive reward when it moves away, which encourages it to keep clear of obstacles. On collision with an obstacle it receives a reward of −1 and the current training episode ends.

The second term is the reward r₃₂ based on the current distance between the AUV and the target point, which encourages the AUV to navigate to the target point while avoiding obstacles: the AUV receives a negative reward when it moves away from the target point and a positive reward when it approaches. When the AUV arrives at the target point, it receives a positive reward of 1.0 and the training episode ends.

The third term is the reward r₃₃ based on the angle α between the AUV heading and the current target heading. It likewise encourages the AUV to navigate toward the target point, but its main purpose is to teach the AUV to adjust its heading angle toward the current target heading so as to shorten the path.
r₃₃ = k_C cos(α) (18)
where k_C is the reward coefficient for the obstacle avoidance process.
The final total reward signal is the weighted sum of these three terms,
r₃ = k₃₁r₃₁ + k₃₂r₃₂ + k₃₃r₃₃ (19)
where k₃₁ to k₃₃ are the respective weights.
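A compact sketch of the three-term reward and the episode-termination rule follows. Equations (16) and (17) are not reproduced in this text, so the per-step magnitude `step`, the function name, and the argument layout are assumptions; the −1 collision reward, the +1.0 arrival reward, equation (18), and the weighted sum (19) come from the description.

```python
import math


def obstacle_avoidance_reward(prev_obs_dist, obs_dist, prev_goal_dist, goal_dist,
                              alpha, collided=False, arrived=False,
                              k31=1.0, k32=1.0, k33=1.0, k_c=1.0, step=0.1):
    """Return (reward, done) for one step of the obstacle avoidance behavior."""
    # r31: obstacle term; a collision gives -1 and ends the episode.
    if collided:
        r31 = -1.0
    elif obs_dist < prev_obs_dist:
        r31 = -step
    elif obs_dist > prev_obs_dist:
        r31 = step
    else:
        r31 = 0.0
    # r32: target term; arrival gives +1.0 and ends the episode.
    if arrived:
        r32 = 1.0
    elif goal_dist < prev_goal_dist:
        r32 = step
    elif goal_dist > prev_goal_dist:
        r32 = -step
    else:
        r32 = 0.0
    # r33: heading term, equation (18).
    r33 = k_c * math.cos(alpha)
    reward = k31 * r31 + k32 * r32 + k33 * r33
    return reward, collided or arrived
```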

さらに、前記（３）の過程は以下のとおりである。
強化学習は、動作から環境へのマッピングをトレーニングするものであり、ロボットモデルを環境モデルとして、動作からロボットモデルへのマッピングをトレーニングすることができる。したがって、本発明では、直接ロボットを環境として、ファジー流体力学パラメータのロボットの運動学及び動力学モデル、即ちＡＵＶモデルを作成し、ＤＤＰＧトレーニングを通じて力とトルクを得て水中ロボットに作用させ、ＡＵＶモデルを用いて計算することによりロボットの速度と角速度を得て、速度、角速度とターゲット速度、ターゲット角速度との誤差を利用して報酬値ｒ₄＝-｜△ｖ＋△Ψ｜を設計し、ここで、△ｖは速度誤差であり、△Ψは向首誤差である。また、トレーニング中のＡＵＶモデルにランダム干渉力を追加することで、動的に変化している水中環境をシミュレーションし、それにより、抗干渉能力を有するＤＤＰＧに基づく完全な制御システムがトレーニングにより得られる。制御システムのトレーニングが完了した後、ロボットの現在の位置及びターゲット経路から、経路追跡戦略に従ってターゲット命令を得て、ＤＤＰＧ制御システムを用いてロボットを計画命令に従うように制御する。 Further, the process of (3) above is as follows.
Reinforcement learning trains a mapping from actions to the environment; with the robot model taken as the environment model, the mapping from actions to the robot model can be trained. In the present invention, therefore, the robot itself is treated as the environment: a kinematic and dynamic model of the robot with fuzzy hydrodynamic parameters, namely the AUV model, is created. The forces and torques obtained through DDPG training are applied to the underwater robot, and the AUV model is used to compute the robot's velocity and angular velocity. The reward r₄ = −|Δv + ΔΨ| is designed from the errors between the velocity and angular velocity and their targets, where Δv is the velocity error and ΔΨ is the heading error. Random disturbance forces are also added to the AUV model during training to simulate a dynamically changing underwater environment, so that training yields a complete DDPG-based control system with disturbance rejection capability. Once the control system has been trained, target commands are derived from the robot's current position and the target route according to a route tracking strategy, and the DDPG control system controls the robot to follow the planned commands.
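The control reward and the random disturbance can be sketched as follows. The reward follows r₄ = −|Δv + ΔΨ| literally; the uniform noise model, the function names, and the `max_disturbance` parameter are assumptions, since the text does not say how the disturbance force is drawn.

```python
import random


def control_reward(v, v_target, psi, psi_target):
    """r4 = -|dv + dpsi|, with dv and dpsi the velocity and heading errors."""
    dv = v_target - v
    dpsi = psi_target - psi
    return -abs(dv + dpsi)


def disturbed_force(tau, max_disturbance, rng=random):
    """Add a uniform random disturbance to the commanded force (assumed noise model)."""
    return tau + rng.uniform(-max_disturbance, max_disturbance)
```

Taken literally, errors of opposite sign cancel inside the absolute value; summing |Δv| and |ΔΨ| separately would avoid that, but that is a design variation and not what the text states.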

前記ＤＤＰＧの制御システムは動作ネットワークに対応し、ＤｅｅｐＤｅｔｅｒｍｉｎｉｓｔｉｃＰｏｌｉｃｙＧｒａｄｉｅｎｔ（ＤＤＰＧ）は、ＡｃｔｏｒＣｒｉｔｉｃとＤＱＮを組み合わせたアルゴリズムであり、ＡｃｔｏｒＣｒｉｔｉｃの安定性及び収束性を向上させる。その構想は、ＤＱＮ構造中のメモリバンク、及び構造が同じであるが、パラメータの更新頻度が異なる２つのニューラルネットワークの構想をＡｃｔｏｒＣｒｉｔｉｃに適用することである。さらに、Ｄｅｔｅｒｍｉｎｉｓｔｉｃ構想を利用して、従来のＡｃｔｏｒＣｒｉｔｉｃが連続動作区間においてランダムにスクリーニングするという方式を、連続空間において２つだけの動作値を出力するように変更する。 The DDPG control system corresponds to the action network. Deep Deterministic Policy Gradient (DDPG) is an algorithm that combines Actor-Critic with DQN and improves the stability and convergence of Actor-Critic. Its idea is to apply to Actor-Critic two elements of the DQN structure: the memory bank (experience replay), and the use of two neural networks that share the same structure but update their parameters at different frequencies. In addition, following the Deterministic idea, the random sampling over the continuous action range used by conventional Actor-Critic is replaced by outputting only two action values in the continuous space.

Ｃｒｉｔｉｃシステムでは、Ｃｒｉｔｉｃの学習過程はＤＱＮと類似しており、下式のように現実Ｑ値と推定Ｑ値の損失関数を用いてネットワーク学習を行う。

式中、Ｑ（ｓ,ａ）は、状態推定ネットワークに基づいて得られるものであり、ａは動作推定ネットワークから伝送してきた動作である。前の部分Ｒ＋γｍａｘ_aＱ（ｓ’,ａ）は現実Ｑ値であり、ＤＱＮと異なるのは、Ｑ値を計算するときに、貪欲法を使用して動作ａ’を選択するのではなく、動作現実ネットワークによりここでのａ’を得るのである。前記のように、Ｃｒｉｔｉｃの状態推定ネットワークのトレーニングは、現実Ｑ値と推定Ｑ値の二乗損失に基づくものであり、推定Ｑ値は、現在の状態ｓに基づいて、動作推定ネットワークにより出力される動作ａの入力状態を推定ネットワークに入力して得るものであり、現実Ｑ値は、現実の報酬Ｒに基づいて、次の時刻の状態ｓ’と、動作現実ネットワークにより得られた動作ａ’を状態現実ネットワークに入力して得たＱ値とを加算して得るものである。 In the Critic system, the learning process of Critic is similar to DQN, and network learning is performed using the loss function of the actual Q value and the estimated Q value as shown in the following equation.

In the equation, Q(s, a) is obtained from the state estimation network, and a is the action passed in from the action estimation network. The first part, R + γ max_a Q(s', a), is the actual Q value. Unlike DQN, the action a' used to compute this Q value is not selected greedily but is produced by the action reality (target) network. As described above, the state estimation network of the Critic is trained on the squared loss between the actual Q value and the estimated Q value. The estimated Q value is obtained by feeding the current state s and the action a output by the action estimation network into the state estimation network. The actual Q value is obtained by adding the actual reward R to the discounted Q value obtained by feeding the next state s' and the action a' from the action reality network into the state reality network.
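The loss equation itself is not reproduced in this text. For reference, the standard DDPG critic loss that matches the description above can be written as follows; the primed networks Q' and μ' stand for the state reality and action reality networks, and the mini-batch size N and the index i are notational assumptions.

```latex
\begin{aligned}
y_i &= R_i + \gamma\, Q'\!\left(s_i',\, \mu'(s_i' \mid \theta^{\mu'}) \,\middle|\, \theta^{Q'}\right),\\
L(\theta^{Q}) &= \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^{2}.
\end{aligned}
```

Here y_i plays the role of the actual Q value and Q(s_i, a_i | θ^Q) the estimated Q value.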

Ａｃｔｏｒシステムでは、下記式に基づいて動作推定ネットワークのパラメータを更新する。

ｓは状態を表し、ｓ_tはｔ時刻での状態であり、ａは動作を表し、θ^Qとθμはネットワークの重みパラメータを表す。 In the Actor system, the parameters of the motion estimation network are updated based on the following equation.

Here s denotes the state, s_t is the state at time t, a denotes the action, and θ^Q and θ^μ are the network weight parameters.

同じ状態について、システムが２つの異なる動作ａ１とａ２を出力し、状態推定ネットワークから２つのＱ値Ｑ１及びＱ２がフィードバックされる場合、Ｑ１＞Ｑ２であれば、動作１を用いると、より多くの報酬を得て、この場合、Ｐｏｌｉｃｙｇｒａｄｉｅｎｔの構想によれば、ａ１の確率が増加し、ａ２の確率が低下し、つまり、Ａｃｔｏｒはできるだけ大きなＱ値を取得しようとする。したがって、Ａｃｔｏｒの損失については、得たフィードバックＱ値が大きいほど、損失が小さく、得たフィードバックＱ値が小さいほど、損失が大きいと理解でき、このため、状態推定ネットワークから戻されたＱ値を負にすればよい。 Suppose that, for the same state, the system outputs two different actions a1 and a2, and the state estimation network feeds back two Q values, Q1 and Q2. If Q1 > Q2, taking action a1 yields more reward, so under the policy gradient idea the probability of a1 is increased and that of a2 is decreased; in other words, the Actor tries to obtain as large a Q value as possible. The Actor loss can therefore be understood as follows: the larger the fed-back Q value, the smaller the loss, and the smaller the Q value, the larger the loss. It is thus sufficient to negate the Q value returned by the state estimation network.
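The sign convention described above can be illustrated with a few lines of Python. Averaging over a batch of Q values is an assumption (the text only says the Q value is negated), and `actor_loss` is a hypothetical name.

```python
def actor_loss(q_values):
    """Actor loss as the negated mean of the Q values fed back by the critic."""
    return -sum(q_values) / len(q_values)
```

For example, `actor_loss([2.0])` is lower than `actor_loss([1.0])`, so minimizing this loss pushes the Actor toward actions that the Critic scores higher.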

ＤＤＰＧコントローラの構想は、強化学習アルゴリズム中の動作をロボットの推力及びトルクに対応させ、アルゴリズム中の状態をロボットの速度及び角速度に対応させることである。アルゴリズムに対して学習トレーニングを行うことにより力から状態へのマッピング関係が実現される。 The concept of the DDPG controller is to make the movement in the reinforcement learning algorithm correspond to the thrust and torque of the robot, and the state in the algorithm to correspond to the speed and angular velocity of the robot. By performing learning training on the algorithm, the mapping relationship from force to state is realized.

ＤＤＰＧをＡＵＶ制御に適用するには、まず、Ｃｒｉｔｉｃニューラルネットワーク構造Ｑ（ｓ_tａ_t｜θ^Q）及びＡｃｔｏｒニューラルネットワーク構造μ（ｓ_t｜θμ）、（θ^Qとθμはネットワークの重みパラメータを示す。）を作成する。次に、それぞれＣｒｉｔｉｃとＡｃｔｏｒの２つの構造中に、ターゲットネットワーク（ｔａｒｇｅｔ＿ｎｅｔ）と予測ネットワーク（ｅｖａｌ＿ｎｅｔ）との２つのニューラルネットワークを作成する。次に、ＤＤＰＧの動作出力を制御システムの作用力τとして、制御システムが出力する作用力でロボットの動きを制御すると、ＤＤＰＧ制御システムをＡＵＶの現在の状態ｓからロボットの受ける力へのマッピングとすることができ、式（２１）の

と組み合わせ、関数で
τ＝μ（ｓ_t｜θμ）（２２）として表し、
ロボット状態ｓは主にロボットの速度と向首として示され、
Ｖ＝［ｕ,ｖ,ｒ］
Ψ＝［０,θ,Ψ］（２３）
式中、ｕ、ｖ、ｒはそれぞれＡＵＶの縦方向速度、横方向速度及び角速度であり、ΨはＡＵＶの向首角であり、
水平運動であるので、ｖ、ｒは無視され、このため、
τ＝μ（ｓ_t）＝μ（ｕ（ｔ）,Ψ（ｔ））（２４）
この式は、制御システムの出力力がロボットの速度、向首及びトリム角がターゲット命令のようになるように制御することを示す。 To apply DDPG to AUV control, a Critic neural network Q(s_t, a_t | θ^Q) and an Actor neural network μ(s_t | θ^μ) are first created, where θ^Q and θ^μ are the network weight parameters. Next, two neural networks, a target network (target_net) and a prediction network (eval_net), are created in each of the Critic and Actor structures. The action output of DDPG is then taken as the force τ applied by the control system, and this force controls the motion of the robot, so the DDPG control system can be regarded as a mapping from the current AUV state s to the force acting on the robot. Combining this with the expression from equation (21),

gives the function τ = μ(s_t | θ^μ) (22).
The robot state s is mainly represented by the robot's velocity and heading,
V = [u, v, r]
Ψ = [0, θ, Ψ] (23)
where u, v, and r are the longitudinal velocity, lateral velocity, and angular velocity of the AUV, respectively, and Ψ is the heading angle of the AUV.
Since the motion is horizontal, v and r are ignored, and therefore
τ = μ(s_t) = μ(u(t), Ψ(t)) (24)
This equation shows that the force output by the control system drives the robot's velocity, heading, and trim angle toward the target commands.

実施形態２
実施形態１に記載のファジー流体力学パラメータのＡＵＶモデルの作成過程は、一般的なＡＵＶダイナミックモデリングの過程であり、本分野の従来技術を用いて実現でき、上記過程をより明瞭にするために、本実施形態では、ファジー流体力学パラメータのＡＵＶモデルの作成過程を説明するが、ただし、本発明は、以下のファジー流体力学パラメータのＡＵＶモデルの作成方式を含むが、それに制限されない。ファジー流体力学パラメータのＡＵＶモデルの作成過程には、
水中ロボットの流体力学方程を作成するステップと、

式中、ｆ−ランダム干渉力；Ｍ−システムの慣性係数行列、Ｍ＝Ｍ_RB+Ｍ_A≧０；Ｍ_RB−キャリアの慣性行列、

且つ

；Ｍ_A−追加品質係数行列、

；

−コリオリ力・求心力係数行列、

；Ｃ_RB−求心力係数行列；

−コリオリ力（モーメント）係数行列、

；

−粘性流体力係数行列、

；τ−制御入力ベクター；ｇ₀−静圧荷重ベクター、研究し易さからゼロとする；

−復元力／トルクベクター。 Embodiment 2
The process of creating an AUV model of fuzzy hydrodynamic parameters according to the first embodiment is a general process of AUV dynamic modeling, which can be realized by using the prior art in the art, and in order to clarify the above process. In the present embodiment, the process of creating an AUV model of fuzzy hydrodynamic parameters will be described, but the present invention includes, but is not limited to, the following method of creating an AUV model of fuzzy hydrodynamic parameters. In the process of creating an AUV model of fuzzy hydrodynamic parameters,
Steps to create a hydrodynamic equation for an underwater robot,

where f - random disturbance force; M - system inertia coefficient matrix, M = M_RB + M_A ≥ 0; M_RB - inertia matrix of the vehicle body,

and

; M_A - added mass coefficient matrix,

;

- Coriolis and centripetal force coefficient matrix,

; C_RB - centripetal force coefficient matrix;

- Coriolis force (moment) coefficient matrix,

;

- viscous hydrodynamic force coefficient matrix,

; τ - control input vector; g₀ - static load vector, set to zero for ease of study;

- restoring force/torque vector.
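The hydrodynamic equation is not reproduced in this text. The symbols listed above match the standard marine vehicle model, which in its usual form reads as follows. The symbols ν (body-fixed velocity vector), η (position and attitude vector), C, C_A, D, and g(η) are the conventional names for the quantities whose symbols are missing above and are an assumption here, as is the placement of f on the right-hand side.

```latex
M\dot{\nu} + C(\nu)\,\nu + D(\nu)\,\nu + g(\eta) + g_0 = \tau + f,
\qquad M = M_{RB} + M_A,\quad C(\nu) = C_{RB}(\nu) + C_A(\nu)
```

With g₀ set to zero as stated, the model reduces to inertia, Coriolis and centripetal, damping, and restoring terms balanced by the control input and the random disturbance.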

From the configuration of its actuators, the autonomous underwater vehicle rolls only slightly and mainly uses its thrusters for heave (ascending and diving), surge, yaw and pitch, so its kinematic model can be described approximately by a system of equations with five degrees of freedom.

[formula image]

where X, Y, Z, M and N are the forces (moments) acting on the underwater robot in each degree of freedom as a result of its actuators; they include the gravity and buoyancy on the robot, the thrust of the thrusters, the hydrodynamic forces caused by the robot's motion, and some external forces from the environment;
M is the mass of the robot's fully submerged displacement;
x_G, y_G and z_G are the coordinates of the robot's center of gravity in the hull coordinate system;
I_y and I_z are the moments of inertia of the robot's mass about the y and z axes of the hull coordinate system, respectively;
u, v, ω, q and r are the robot's longitudinal velocity, lateral velocity, vertical velocity, trim angular velocity and turning angular velocity in the hull coordinate system, respectively;
[formula image] are the (angular) accelerations of the corresponding degrees of freedom in the hull coordinate system;
[formula image] and similar terms are all first-order or second-order hydrodynamic derivatives of the hull, which can be obtained by theoretical calculation, captive model tests, identification and approximate estimation.

Example
The main purpose of the present invention is to let an underwater robot autonomously make behavior decisions and perform motion control in an underwater environment based on the current environmental state, thereby freeing people from complicated programming. Specifically, this is realized as follows.

1) Use programming software to build a behavior planning simulation system for the autonomous underwater vehicle based on deep reinforcement learning, and obtain the robot's optimal decision policy through simulation training. Specifically:
1.1) Build the environment model, determine the initial position and the target point, and initialize the algorithm parameters.
1.2) Determine the environmental state and the robot task at the current time t, and decompose the task into behaviors: moving to the target, wall tracking and obstacle avoidance.
1.3) Based on the current state, select moving to the target, wall tracking or obstacle avoidance, and decompose the behavior into actions.
1.4) Execute action a, observe the new state s', and obtain the reward value R.
1.5) Train the neural network to obtain the Q value of each action, and output the action with the maximum Q value.
1.6) Update the Q function.
1.7) Check the state at the current time: if the target state has been reached, go to 1.8); otherwise, go to 1.4).
1.8) After the selected behavior is completed, update the Q function.
1.9) Check whether detection is complete: if so, go to 1.10); otherwise, go to 1.3).
1.10) Check whether the Q value has converged: if so, end training or planning; otherwise, reinitialize the robot position and go to 1.2).
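The control flow of steps 1.4) to 1.7) and the convergence check of 1.10) can be sketched on a toy problem. The sketch below is not the patent's DQN: a small Q table stands in for the neural network, the three behaviors are collapsed into a one-dimensional corridor in which the robot moves left or right toward a target cell, and the function name `train_corridor`, the reward values and all hyperparameters are hypothetical.

```python
import random


def train_corridor(n_states=5, gamma=0.9, lr=0.5, eps=0.2,
                   max_episodes=500, tol=1e-4, seed=0):
    """Tabular Q-learning on a 1-D corridor (stand-in for the DQN loop)."""
    rng = random.Random(seed)
    moves = (-1, +1)                                # action 0: left, action 1: right
    q = [[0.0, 0.0] for _ in range(n_states)]       # 1.1) initialize parameters
    goal = n_states - 1
    for _ in range(max_episodes):
        s = 0                                       # 1.10) reset the robot position
        biggest_change = 0.0
        while s != goal:                            # 1.7) repeat until target state
            # 1.5) pick the action with the largest Q value (epsilon-greedy)
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # 1.4) execute the action, observe s' and the reward R
            s_next = min(max(s + moves[a], 0), goal)
            r = 1.0 if s_next == goal else -0.01
            # 1.6) update the Q function toward R + gamma * max_a Q(s', a)
            target = r if s_next == goal else r + gamma * max(q[s_next])
            change = lr * (target - q[s][a])
            q[s][a] += change
            biggest_change = max(biggest_change, abs(change))
            s = s_next
        if biggest_change < tol:                    # 1.10) Q values have converged
            break
    return q


q_table = train_corridor()
```

After training, the greedy action in every non-target cell is "right", which is the shortest route to the target in this toy environment.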

2) Use the DDPG controller to control the robot so that it completes the actions output by planning. Specifically, this includes the following steps.
2.1) Initialize the parameters.
2.2) Run the outer loop.
2.2.1) Randomly generate the target heading and the target speed.
2.2.2) Run the inner loop.
2.2.2.1) Run the DDPG algorithm and output the action τ = a = μ(s_t | θ^μ).
2.2.2.2) Calculate the acceleration of the AUV from the AUV kinematic model.

[formula image]

2.2.2.3) From the AUV kinematic model, calculate the AUV speed [formula image], heading angle [formula image] and trim angle [formula image], then the speed error Δv, heading error ΔΨ and trim error Δθ, and obtain the reward value according to the reward policy:
r = −|Δv + ΔΨ + Δθ|
2.2.2.4) If the control error is 0, add 1 to the reward (r += 1) and end the inner loop.
2.2.2.5) Update the critic neural network to minimize the loss.

[formula image]

2.2.2.6) Update the actor by gradient descent.

[formula image]

2.2.2.7) Update the network parameters.

[formula image]

2.2.2.8) When the number of inner-loop steps is reached, end the inner loop.
2.2.3) When the number of outer-loop steps is reached, end the outer loop.
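Two pieces of the inner loop above are simple enough to sketch directly: the reward of step 2.2.2.3) with the bonus of step 2.2.2.4), and a parameter update for step 2.2.2.7). The update formula is not reproduced in this text, so the sketch assumes the soft target update commonly used with DDPG. The function names, the tolerance `tol` (used because an exactly zero floating-point error is rare) and the blend factor `rate` are all hypothetical.

```python
def control_reward(dv, dpsi, dtheta, tol=1e-3):
    """Return (reward, done) for the speed, heading and trim errors.

    reward = -|dv + dpsi + dtheta|; when every error is within `tol`
    (assumed stand-in for "control error is 0"), add 1 and end the loop.
    """
    r = -abs(dv + dpsi + dtheta)
    done = abs(dv) <= tol and abs(dpsi) <= tol and abs(dtheta) <= tol
    if done:
        r += 1.0
    return r, done


def soft_update(target_params, online_params, rate=0.01):
    """Assumed DDPG soft update: target <- rate * online + (1 - rate) * target."""
    return [rate * w + (1.0 - rate) * w_t
            for w, w_t in zip(online_params, target_params)]


reward, finished = control_reward(0.2, -0.5, 0.1)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], rate=0.1)
```

As the formula is written, the signed errors are summed before the absolute value is taken, so errors of opposite sign partly cancel; a variant that sums the absolute values of the individual errors would avoid this, but that would be a change to the stated reward.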

Claims

An AUV action plan and motion control method based on reinforcement learning, including:
a step of defining tunnel detection by the underwater robot as the overall task, defining the behaviors that complete the task as moving to the target, wall tracking and obstacle avoidance, and defining as actions the specific control commands generated for the robot to complete a planned behavior underwater;
a step in which, when the AUV performs the tunnel detection task, it plans behaviors in real time with the deep reinforcement learning DQN algorithm according to the underwater environment being detected, that is, a behavior planning architecture based on calls to multiple behavior networks is built, the environmental state features used as input and the output actions of the three behaviors are defined according to the task requirements, the corresponding deep learning behavior networks are built, and the reward functions are designed;
a step in which the planning system completes the tunnel detection task by calling the trained behavior networks; and
a step in which the control system completes the planned behavior by calling the trained action network.
The process of building the corresponding deep learning behavior networks and designing the reward functions includes the following steps:
The tunnel detection task is decomposed into a behavior sequence; global route planning plans a series of feasible route points from prior environmental information, and the AUV departs from its deployment position and reaches each route point in turn.
Because the route points are a global plan made under a known environment, while traveling the AUV calls the obstacle avoidance behavior according to the real-time environmental state so as to reach each route point safely; in the tunnel detection task the AUV mainly calls the wall tracking behavior and completes the task according to the given detection target.
The decision module includes global data, a decision system, a behavior library and an evaluation system. The global data stores task information, situation information and planning knowledge. The decision system is a self-learning planning system combined with the DQN algorithm: it is trained first, loads the trained network parameters from the behavior library before each planning task, and then plans the current behavior and action with the current environmental state information as input. The evaluation system is the reward function system of the reinforcement learning algorithm: each time the AUV plans and executes one behavior and action plan, it gives a reward based on the environmental state and the task information. All data are stored in the global database.
Among the above behaviors, the process of moving to the target includes the following steps.
The move-to-target behavior makes the AUV adjust its heading angle and navigate toward the target point when no obstacle is detected. The feature inputs mainly consider the position and angle relationship between the AUV and the target point, specifically the current AUV position coordinates (x_AUV, y_AUV), the target point coordinates (x_goal, y_goal), the current heading angle θ and the target heading angle β, six input dimensions in total, where the target heading angle β is the heading angle when the AUV is sailing directly toward the target.
Regarding the reward function, the move-to-target behavior mainly drives the AUV to sail toward the target point in an obstacle-free environment, so the reward function is set as two terms.
The first term r₁₁ considers the change in the distance between the AUV and the target point.

[formula image]

The second term r₁₂ considers the change in the AUV heading angle: the closer the heading angle is to the target heading, the larger the reward value. The angle α between the current AUV heading and the target heading is
α = θ − β (2),
and the smaller the absolute value of α, the larger the reward obtained,
r₁₂ = k_A cos(α) (3)
where k_A is the reward coefficient for the move-to-target process.
The total reward value is the weighted sum of the two terms,
r₁ = k₁₁ r₁₁ + k₁₂ r₁₂ (4)
where k₁₁ and k₁₂ are the respective weights.
The wall tracking process among the above behaviors includes the following steps:
The AUV wall tracking behavior considers the distance and relative angle between the AUV and the wall. The AUV obtains its distances x₄ and x₅ from the wall through the two ranging sonars, front and rear, mounted on one side.
The current AUV heading angle θ is obtained from the compass, and the current wall angle θ_wall is estimated,

[formula image]

The total reward r₂ of the ordinary wall tracking behavior is the weighted sum of two reward terms,
r₂ = k₂₁ r₂₁ + k₂₂ r₂₂ (8)
where k₂₁ and k₂₂ are the respective weights.
In tracking based on virtual target points, the virtual target points are built for walls with an outer right angle and an inner right angle. When the environment is an outer right angle, the front-side sonar detects no obstacle and its input is the maximum detection distance, so a virtual wall is built and a virtual target point is added. When the environment is an inner right angle and the forward-looking sonar detects the wall, a virtual target point is built on the other side of the AUV relative to the current target wall.
The reward function based on the virtual target points is built as follows,

[formula image]

The third term is the reward r₃₃ generated from the angle α between the AUV heading and the current target. It likewise drives the AUV to sail toward the target point, but its main purpose is to make the AUV learn to adjust its heading angle toward the current target heading and so shorten the path,
r₃₃ = k_C cos(α)
where k_C is the reward coefficient for the wall obstacle avoidance process.
The final total reward signal is the weighted sum of the three reward terms,
r₃ = k₃₁ r₃₁ + k₃₂ r₃₂ + k₃₃ r₃₃
where k₃₁ to k₃₃ are the respective weights.
Reinforcement learning trains a mapping from actions to the environment. With the robot treated as the environment, DDPG training yields the force and torque that act on the underwater robot, the AUV model is used to calculate the robot's speed and angular velocity, and the reward value r₄ = −|Δv + ΔΨ| is designed from the errors between the speed and angular velocity and the target speed and target angular velocity, where Δv is the speed error and ΔΨ is the heading error.
In addition, random interference forces are added to the AUV model during training, so that training yields a DDPG-based control system. After the control system has been trained, target commands are obtained from the robot's current position and the target route according to a route tracking strategy, and the DDPG control system controls the robot to follow the planning commands.
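The move-to-target reward of equations (2) to (4) in the claim above can be sketched as follows. The formula for the distance term r₁₁ is not reproduced in this text, so the sketch takes r₁₁ as an input rather than guessing it. The function names and the default coefficient values are hypothetical.

```python
import math


def heading_reward(theta, beta, k_a=1.0):
    """r12 = k_A * cos(alpha), with alpha = theta - beta (equations (2) and (3))."""
    alpha = theta - beta
    return k_a * math.cos(alpha)


def move_to_target_reward(r11, theta, beta, k_a=1.0, k11=0.5, k12=0.5):
    """r1 = k11 * r11 + k12 * r12 (equation (4)); r11 is supplied by the caller."""
    return k11 * r11 + k12 * heading_reward(theta, beta, k_a)


aligned = heading_reward(theta=0.3, beta=0.3)                      # heading on target
reversed_total = move_to_target_reward(r11=2.0, theta=0.0, beta=math.pi)
```

Because cos(α) peaks at α = 0 and falls off symmetrically, the heading term rewards alignment regardless of which side the AUV deviates to, which matches the statement that a smaller |α| earns a larger reward.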

The AUV action plan and motion control method based on reinforcement learning according to claim 1, wherein, in the process of building virtual target points for walls with an outer right angle and an inner right angle, when the environment is an outer right angle, the position of the virtual target point is determined from the AUV position, the ranging sonar data and the safety distance L₁.

The AUV action plan and motion control method based on reinforcement learning according to claim 2, wherein, in the process of building virtual target points for walls with an outer right angle and an inner right angle, when the environment is an inner right angle, the position of the virtual target point is determined from the AUV position, the heading angle and the safety distance L₂.

The AUV action plan and motion control method based on reinforcement learning according to claim 1, 2 or 3, wherein the process of controlling the robot to follow the planning commands using the DDPG control system includes:
a step in which the DDPG controller maps the actions in the reinforcement learning algorithm to the thrust and torque of the robot and the states in the algorithm to the speed and angular velocity of the robot, and the algorithm is trained to learn the mapping from force to state; and
a step in which, to apply DDPG to AUV control, a Critic neural network structure Q(s_t, a_t | θ^Q) and an Actor neural network structure μ(s_t | θ^μ) are first built, where θ^Q and θ^μ are the weight parameters of the networks, two neural networks, a target network target_net and a prediction network equal_net, are built in each of the Critic and Actor structures, the action output of DDPG is then taken as the acting force τ of the control system, and the force output by the control system controls the motion of the robot.
Combined with the function above, this is expressed as τ = μ(s_t | θ^μ).
The robot state s is mainly represented by the speed and heading of the robot,
V = [u, v, r]
Ψ = [0, θ, Ψ]
where u, v and r are the longitudinal velocity, lateral velocity and angular velocity of the AUV, respectively, and Ψ is the heading angle of the AUV.
v and r are ignored.
The equation τ = μ(s_t) = μ(u(t), Ψ(t)) expresses that the control system outputs a force that drives the robot's speed, heading and trim angle to the target commands.

The AUV action plan and motion control method based on reinforcement learning according to claim 4, wherein the Critic performs network learning using a loss function of the real Q value and the estimated Q value,

[formula image]

where Q(s, a) is obtained from the state estimation network, and a is the action passed from the action estimation network;
R + γ max_a Q(s', a) is the real Q value, which is obtained by adding the actual reward R to the Q value that the state reality network gives for the next-time state s' and the action a' produced by the action reality network.
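The relation between the real Q value and the estimated Q value in the claim above can be sketched as follows. The loss formula itself is not reproduced in this text, so the sketch assumes the mean squared error that is usual for DDPG critics. The function names and the `done` flag are hypothetical, and `next_q` stands for the Q value that the state reality network returns for s' and a'.

```python
def real_q_value(reward, next_q, gamma=0.99, done=False):
    """Real (target) Q value: R + gamma * Q'(s', a'); just R at a terminal step."""
    return reward if done else reward + gamma * next_q


def critic_loss(estimated_q, real_q):
    """Assumed loss: mean squared difference between real and estimated Q values."""
    if len(estimated_q) != len(real_q) or not estimated_q:
        raise ValueError("need two non-empty batches of equal length")
    return sum((y - q) ** 2 for q, y in zip(estimated_q, real_q)) / len(estimated_q)


y = real_q_value(reward=1.0, next_q=2.0, gamma=0.5)
loss = critic_loss([1.0, 3.0], [2.0, 1.0])
```

Keeping the target computation separate from the loss makes it explicit that the real Q value comes from the reality (target) networks, while only the estimation network is adjusted to reduce the loss.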

The AUV action plan and motion control method based on reinforcement learning according to claim 5, wherein the Actor updates the parameters of the action estimation network based on the following equation.