Open AccessArticle

A Study on Path Planning for Curved Surface UV Printing Robots Based on Reinforcement Learning

Jie Liu

^*,

Xianxin Lin

Chengqiang Huang

Zelong Cai

Zhenyong Liu

Minsheng Chen

and

Zhicong Li

Guangdong Provincial Key Laboratory of Industrial Intelligent Inspection Technology, School of Mechatronic Engineering and Automation, Foshan University, Foshan 528225, China

Author to whom correspondence should be addressed.

Mathematics 2025, 13(4), 648; https://doi.org/10.3390/math13040648

Submission received: 17 January 2025 / Revised: 5 February 2025 / Accepted: 14 February 2025 / Published: 16 February 2025

Download

Browse Figures

Versions Notes

Abstract

In robotic surface UV printing, the irregular shape of the workpiece and frequent curvature changes require the printing robot to maintain the nozzle’s perpendicular orientation to the surface during path planning, which imposes high demands on trajectory accuracy and path smoothness. To address this challenge, this paper proposes a reinforcement-learning-based path planning method. First, an ideal main path is defined based on the nozzle characteristics, and then a robot motion accuracy model is established and transformed into a Markov Decision Process (MDP) to improve path accuracy and smoothness. Next, a framework combining Generative Adversarial Imitation Learning (GAIL) and Soft Actor–Critic (SAC) methods is proposed to solve the MDP problem and accelerate the convergence of SAC training. Experimental results show that the proposed method outperforms traditional path planning methods, as well as Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). Specifically, the maximum Cartesian space error in path accuracy is reduced from 1.89 mm with PSO and 2.29 mm with GA to 0.63 mm. In terms of joint space smoothness, the reinforcement learning method achieves the smallest standard deviation, especially with a standard deviation of 0.00795 for joint 2, significantly lower than 0.58 with PSO and 0.729 with GA. Moreover, the proposed method also demonstrates superior training speed compared to the baseline SAC algorithm. The experimental results validate the application potential of this method in intelligent manufacturing, particularly in industries such as automotive manufacturing, aerospace, and medical devices, with significant practical value.

Keywords:

UV printing; complex surface; path planning; reinforcement learning; SAC; robot

MSC:

68T40; 93C85; 70B15

1. Introduction

Ultraviolet (UV) printing technology, due to its high forming speed and superior printing quality, has gradually replaced traditional screen printing techniques in modern industrial production. It is widely applied in product packaging, trademarks, production date marking, pharmaceutical traceability, and artistic coloring. With the advancement of intelligent manufacturing, UV printing has also been extensively used in customized production, intelligent assembly lines, and flexible manufacturing systems, making it particularly suitable for manufacturing processes with complex shapes and customization requirements. For instance, UV printing has been widely employed in the intelligent manufacturing of automotive exteriors, aerospace components, and consumer electronics, as well as on medical device surfaces with special geometric structures. However, traditional UV printing equipment, due to its limited degrees of freedom, is typically restricted to planar printing and struggles to meet the demands of multi-degree-of-freedom curved surface printing. Specifically, for curved, inclined, cylindrical, or complex irregular surfaces, traditional equipment faces challenges in achieving efficient and precise printing. The emergence of industrial robots compensates for these limitations. With their multi-degree-of-freedom and high-precision dynamic motion control capabilities, industrial robots enable full-surface printing on irregular workpieces, overcoming the spatial constraints of conventional equipment. By integrating UV printing systems, robots can dynamically adjust nozzle angles and printing distances, precisely adapting to complex surface structures. This provides an efficient and flexible solution for curved surface UV printing, meeting the high-precision and high-flexibility demands of intelligent manufacturing, especially in large-scale customization, irregular surface processing, and complex design scenarios.

In UV printing robot path planning, both trajectory accuracy and smoothness are crucial factors affecting printing quality. Most existing research primarily focuses on curved surface path generation and path optimization. However, current methods still face the following challenges:

Path generation methods based on CAD/point cloud data mainly focus on geometric modeling [1] but lack optimization for trajectory smoothness and precision. Current research predominantly addresses spray gun modeling and coating thickness optimization, with relatively little focus on path accuracy and smoothness optimization. This may lead to suboptimal performance in high-speed, high-precision processing, negatively impacting printing quality.
Traditional optimization algorithms, such as Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), have been applied in path optimization. However, these methods [2] tend to fall into local optima in high-dimensional optimization problems, exhibit slow convergence rates, and have limited capability in trajectory smoothness optimization.
Reinforcement learning (RL), known for its adaptive optimization capabilities, has been applied to path optimization problems [3]. However, existing RL methods suffer from challenges such as convergence difficulties, low training efficiency, and inadequate trajectory smoothness optimization.

To address these issues, this paper proposes a reinforcement learning framework based on Generative Adversarial Imitation Learning and Soft Actor–Critic (GAIL-SAC). This method aims to plan a smooth trajectory in joint space while ensuring smoothness and precision in Cartesian space, thereby enhancing the stability and reliability of the printing trajectory.

The main contributions of this paper are as follows:

A curved surface trajectory generation method based on CNC path transformation is proposed, which integrates CAD models and point cloud data. A conversion strategy from CNC machining paths to robot trajectories is designed to ensure path accuracy and operational feasibility.
A robot motion accuracy model is established and formulated as an MDP problem, where the SAC reinforcement learning algorithm is used to optimize it. The reinforcement learning algorithm is applied in joint space trajectory planning, ensuring trajectory smoothness not only in joint space but also in Cartesian space with high accuracy.
A GAIL-SAC reinforcement learning framework is proposed, leveraging imitation learning to improve training efficiency and reinforcement learning to optimize trajectory precision, thereby enhancing the algorithm’s convergence speed and stability.
Experimental validation demonstrates that the proposed method outperforms existing methods in terms of trajectory smoothness and accuracy while significantly reducing training time and improving the robustness of trajectory optimization.

2. Related Work

This section reviews relevant research on curved surface path planning for UV printing robots, providing a detailed discussion on path generation methods, optimization approaches, and the application of reinforcement learning in path planning. Furthermore, it identifies the existing research gaps and challenges in this field.

2.1. Curved Surface Path Generation Method

Currently, many researchers have focused on surface path generation, with most methods based on Computer-Aided Design (CAD) models and point cloud data. In the research based on CAD models, Nieto Bastida and others from National Taiwan University [4] proposed a method that uses 3D point clouds of the printing workpiece as a geometric representation, which allows for the visualization of point cloud models and the generation of printing trajectories. Weber et al. [5] further refined this approach by modeling Bezier triangular surfaces to determine the optimal initial trajectory for spraying the workpiece. Typically, the initial steps of path planning involve using CAD models as input [6], followed by data classification, and then applying existing algorithms to generate the optimal path. Building on this, D. Gleeson et al. [7] used the CAD model as part of the initial trajectory and minimized the deviation between coating thickness and the target thickness to obtain the path of the applicator. Additionally, for obtaining point cloud data of actual objects, researchers [8] have often used depth cameras to capture the point cloud data of real objects for automatic path planning. Y. Meng et al. [9] used a 3D scanner to obtain point cloud data of a steel helmet and generated surface paths by processing the point cloud and applying B-spline curves. Similarly, Shah [10] used depth cameras to extract detailed point cloud data from the workpiece surface to determine the precise angle and depth the robot’s end-effector must maintain to ensure it remains perpendicular to the workpiece surface. This method is particularly useful for processing complex surfaces, as it generates normal trajectories to ensure the tool remains perpendicular to the workpiece surface.

Similarly, point-cloud-based methods are often used for surface Computer Numerical Control (CNC) path generation. Although multi-axis CNC machining shares similarities with the robotic printing process, current research has not deeply explored how to convert CNC paths into robotic paths, and this conversion process still lacks detailed discussion. For example, one study [11] proposed a method based on arc-length parameterization and Cartesian trajectory conversion, which improves the accuracy and smoothness of paths in free-form surface machining. Another study [12] combined CAD models to generate CNC machining data and optimized it through on-site point cloud measurements to improve the smoothness, stability, and machining precision of free-form surfaces. Although these methods effectively generate surface paths, they lack proper modeling of robotic kinematic characteristics, and the optimization of path smoothness and accuracy still receives insufficient attention.

2.2. Traditional Optimization Algorithms (GA and PSO) and Their Limitations

In the path optimization of printing robots, most research has focused on the optimization of spray models and coating thickness. For example, Zeng [13] developed a static variable posture spray gun model and a dynamic variable posture spray gun model along an arc path, proposing a spray gun optimization method based on variable spray angles to solve problems such as low spraying efficiency and excessive paint waste. Subsequently, Zhang Y [14] proposed a spraying path planning method based on patch boundary curves, which significantly reduced paint waste during the spraying process by optimizing the distance between the spraying path and the patch boundary. These studies demonstrate that coating thickness models play an active role in improving spraying quality. However, these methods often overlook the impact of trajectory smoothness and accuracy during the robot’s motion on printing quality. Smoothness is a key indicator in robotic processing, as even small cornering in a trajectory can lead to tangential discontinuities, causing vibration and impact during motion, which severely affects the robot’s performance in high-speed, high-precision operations. Currently, for robot motion smoothness, research often adopts fifth-order polynomials for trajectory planning. Lu et al. [15] established kinematic inverse equations, mapping data points and control points to joint space, and used fifth-order B-spline curves for secondary trajectory planning, achieving smooth motion. However, for multi-objective optimization problems, it is often necessary to combine other intelligent algorithms to obtain a more comprehensive solution.

Commonly used methods are metaheuristic algorithms, which are employed to enhance path accuracy and smoothness. Zhu and Pan [16] proposed an improved Genetic Algorithm (IGA), which solves the problems of slow convergence and unsmooth trajectories by introducing techniques such as direction-guided population initialization, noncommon point crossover, and range mutation. However, the GA algorithm still tends to fall into local optima during path planning. The Learning and Median-Based Spider-Wasp Optimizer (LMBSWO) [17] is an improved metaheuristic algorithm that combines the mechanisms of the Spider-Wasp Optimization algorithm with learning strategies, enhancing both global search and local optimization capabilities. It generates shorter and smoother paths, demonstrating superior accuracy and smoothness. PSO [18] performs well in trajectory optimization, but in high-dimensional optimization problems, the particle update strategy can lead to path discontinuities, affecting trajectory smoothness.

2.3. Reinforcement Learning Path Optimization Method

Compared to traditional metaheuristic algorithms, such as Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), reinforcement learning (RL) algorithms have been widely applied to path planning and trajectory optimization problems due to their independence from pre-established models and their ability to adaptively handle complex and dynamic environments. Unlike GA and PSO, which tend to fall into local optima and exhibit slow convergence, RL algorithms possess global optimization capabilities, enabling them to find better solutions in high-dimensional spaces and under complex constraints. Therefore, applying reinforcement learning to surface printing path planning can better address the optimization challenges of complex curved paths and provide more efficient and precise solutions than traditional algorithms.

Several researchers have analyzed reinforcement-learning-based path planning. Prianto proposed a path planning method based on the SAC algorithm [19], which improves path smoothness by optimizing the Q-value function through maximum entropy reinforcement learning, enhancing both path stability and efficiency. The authors in Ref. [20] introduced an improved Soft Actor–Critic (SAC) algorithm, which optimizes the exploration capability of path planning by incorporating a maximum entropy framework. This significantly improves the convergence speed and learning efficiency of robot path planning and effectively generates the optimal path.

However, in practical applications, reinforcement learning algorithms often face the challenge of convergence difficulties during the training process. The main reasons for this include sparse rewards, high data dimensionality, and the uncertainty of dynamic environments. To address these issues, some studies [21] have designed multi-objective reward functions that include path length penalties, dynamic obstacle avoidance, and goal-reaching rewards. Additionally, trajectory initialization strategies and hybrid training mechanisms have been introduced to accelerate policy convergence. Furthermore, targeted improvements [22] have been made to the deep reinforcement learning algorithm by incorporating methods such as hybrid action space design, the introduction of LSTM, and dynamic reward function optimization, which enhance data collection efficiency and accelerate training convergence. The authors in Ref. [23] proposed a reinforcement learning algorithm based on Double Deep Q-Networks (DDQN) and optimized the reward structure with an intrinsic reward mechanism. By incorporating time-series feature extraction networks (such as TimesNet) and a dual reward mechanism, the policy efficiency is improved, and overestimation bias is reduced.

Although there has been some progress in surface path generation and path optimization for UV printing, an effective solution that simultaneously ensures both path accuracy and smoothness is still lacking. In path optimization, traditional optimization algorithms, such as Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), often fall into local optima and have weak iteration capabilities, which limits their application in complex path planning tasks. While Reinforcement Learning (RL) algorithms possess global optimality and robustness in complex environments, they often face convergence difficulties during training. Therefore, there is a lack of an optimization framework that can guarantee both surface path accuracy and smoothness, while also accelerating the convergence of RL training. To address this, this paper proposes a surface path generation method based on CNC path transformation and introduces the Generative Adversarial Imitation Learning—Soft Actor–Critic (GAIL-SAC) framework. This method enhances the training efficiency of reinforcement learning while optimizing the smoothness and accuracy of robot surface UV printing paths, providing a more optimal solution for UV printing robot surface path planning.

3. Surface Path Planning Method for Spray Printing

3.1. Generate Main Path

In current research, the primary path generation methods mostly use point cloud data slicing based on CAD models [24]. However, this approach is computationally complex and time-consuming. To generate the primary spray printing path more efficiently, this paper adopts a CNC-based path generation method. The principle of this method leverages the similarity between the five-axis CNC fine machining process and the six-axis robotic spray printing process [25]. Using five-axis CNC machining software, such as Siemens NX 1899(UG), a precise machining tool path is first designed. This tool path is then exported as CNC data and subsequently converted into a surface spray printing path for the six-axis robot.

As shown in Figure 1, the five-axis CNC fine machining process establishes a workpiece coordinate system {WCS}, where the origin is located at the center of the bottom of the workpiece. Using the right-hand rule, a local tool coordinate system {LCS} is established based on three key directions: feed direction I, surface normal direction N, and cross-product direction

J = I \times N

[26]. In the local coordinate system {LCS} of the five-axis machine tool, the tool direction

T_{w} (B, C)

is determined by the rotational axes B and C of the five-axis machine. The tool trajectory

L^{c}

consists of a sequence of tool positions and orientations at each point. Therefore,

L^{c}

is defined by Equation (1):

L^{c} = {P_{1}^{T_{1}}, P_{2}^{T_{2}}, \dots, P_{k - 1}^{T_{k - 1}}, P_{k}^{T_{k}}}

(1)

where

P_{k}^{T_{k}}

represents the k-

t h

point of the tool axis path

L^{C}

and

T_{k}

represents the tool matrix at the k-

t h

point.

To describe the tool path as the robotic spray head path, the i-

t h

point of the tool path

L^{C}

, denoted as

P_{i}^{T_{i}}

, is described by Equation (2):

P_{i}^{T_{i}} = (x_{i}, y_{i}, z_{i}, B_{i}, C_{i})

(2)

where

x_{i}

y_{i}

, and

z_{i}

represent the coordinates of the x-axis, y-axis, and z-axis, respectively, in the

{W C S}

coordinate system and

B_{i}

and

C_{i}

denote the rotation angles of the B and C axes in the tool coordinate system, respectively [27]. According to the principles of rigid body rotation in space, the B axis corresponds to the rotational mode of the variable rotation angle in robotic kinematics, whereas the C axis corresponds to the rotational mode of the fixed rotation angle in robotic kinematics [28]. Therefore, the conversion from the five-axis tool path to the six-axis robotic path can be achieved. This paper provides the following conversion method:

T_{t o o l, i} = [\begin{matrix} 1 & 0 & 0 & x_{i} \\ 0 & 1 & 0 & y_{i} \\ 0 & 0 & 1 & z_{i} \\ 0 & 0 & 0 & 1 \end{matrix}]

(3)

T_{B C, i} = R_{z} (C_{i}) \cdot T_{t o o l, i} \cdot R_{y} (B_{i})

(4)

T_{C^{'}, i} = T_{B C, i} \cdot R_{y} (π)

(5)

where the matrix

T_{t o o l, i}

represents the position of the current tool axis, which includes only the Cartesian coordinates

x_{i}, y_{i}, z_{i}

and does not contain the tool axis orientation,

T_{B C, i}

represents the 4 × 4 matrix after applying rotations by angles

C_{i}

and

B_{i}

R_{y} (π)

is a rotation matrix, and

T_{C^{'}, i}

represents the matrix obtained by rotating

T_{B C, i}

around the Y-axis of the {LCS} coordinate system by an angle of

π

. After the transformations from Equations (3)–(5), the working coordinate system {LCS} can be considered as representing the robot’s end-effector {TCP} coordinate system.

The five-axis CNC machine tool has five degrees of freedom, namely

x_{i}

y_{i}

z_{i}

directions and

B_{i}

C_{i}

axes, whereas the six-axis robot has six degrees of freedom:

x_{i}

y_{i}

z_{i}

φ_{i}, β_{i}

and

Φ_{i}

. The main difference between them lies in the sixth degree of freedom. Assuming that the Cartesian coordinate sequence of the spray printing path is

L^{c} = {C_{1}^{T_{1}}, C_{2}^{T_{2}}, \dots, C_{k - 1}^{T_{k - 1}}, C_{k}^{T_{k}}}

, where the i-

t h

point is represented as

P_{i}^{T_{i}} = (x_{i}, y_{i}, z_{i}, B_{i}, C_{i})

, the spray head is required to maintain a stable posture during the printing process to ensure that the {TCP} remains perpendicular to the printed component at all times [29]. The Z-axis direction of the {TCP} is defined to be opposite to the surface normal vector direction, as shown in Figure 2. The description of the spray head {TCP} can be expressed as:

T_{T C P, i} = [\begin{matrix} x_{i} \\ \vec{I_{i}} & \vec{J_{i}} & \vec{N_{i}} & y_{i} \\ z_{i} \\ 0 & 0 & 0 & 1 \end{matrix}]

(6)

where i = 1, 2, 3, …, n.

x_{i}

y_{i}

z_{i}

represent the x, y, and z coordinates of the i-

t h

trajectory point

P_{i}

in the tool path

L^{c}

, the vector

\vec{N_{i}} = {(n_{x_{i}}, n_{y_{i}}, n_{z_{i}})}^{T}

indicates the rotation information of the Z-axis of the {TCP}, with its values derived from the corresponding entries in the

T_{C^{'}, i}

matrix, and the vector

\vec{I_{i}}

represents the rotation information of the TCP’s X-axis and can be defined as the forward direction of the tool path. This direction can be calculated using the difference between the subsequent point

\vec{I_{i + 1}} - \vec{I_{i}} = {(x_{i + 1} - x_{i}, y_{i + 1} - y_{i}, z_{i + 1} - z_{i})}^{T}

. Since the Z-axis and X-axis information of the {TCP} coordinate system are already determined, the positive direction of the Y-axis can be established through the ZX plane formed by the Z-axis and X-axis, yielding

{\vec{J}}_{i} = {(o_{x i}, o_{y i}, o_{z i})}^{T}

, where i = 1, 2, …, n. Thus, the matrix

T_{T C P, i}

can represent the Cartesian space coordinate information of the i-

t h

point along the current tool path L.

3.2. Establishment of Robot Motion Accuracy Model

In current research, the Denavit–Hartenberg (DH) parameters are commonly used to describe robotic models [30]. The transformation matrix between two links can be expressed as:

⁠_{i}^{i - 1} T = [\begin{matrix} cos θ_{i} & - sin θ_{i} \cos α_{i} & sin θ_{i} \sin α_{i} & a_{i} \cos θ_{i} \\ sin θ_{i} & cos θ_{i} \cos α_{i} & - cos θ_{i} \sin α_{i} & a_{i} \sin θ_{i} \\ 0 & sin α_{i} & cos α_{i} & d_{i} \\ 0 & 0 & 0 & 1 \end{matrix}]

(7)

where

⁠_{i}^{i - 1} T

represents the transformation matrix from the

i - 1

-th coordinate system to the i-th coordinate system,

θ_{i}

denotes the rotation angle of joint i,

α_{i}

represents the torsion angle of joint i, indicating the rotation from the

z_{i - 1}

-axis to the

z_{i}

-axis,

a_{i}

is the link length of joint i, representing the displacement along the

x_{i}

-axis, and

d_{i}

denotes the displacement of joint i, representing the displacement along the

z_{i}

-axis. The relationship between the robot end-effector and the robot base can then be defined as:

{}_{6}^{0}T = {}_{1}^{0}T \cdot {}_{2}^{1}T \cdot {}_{3}^{2}T \cdot {}_{4}^{3}T \cdot {}_{5}^{4}T \cdot {}_{6}^{5}T = [\begin{matrix} n_{x} & o_{x} & a_{x} & p_{x} \\ n_{y} & o_{y} & a_{y} & p_{y} \\ n_{z} & o_{z} & a_{z} & p_{z} \\ 0 & 0 & 0 & 1 \end{matrix}] .

(8)

where

{[n_{x}, n_{y}, n_{z}]}^{T}

represents the rotational components in the x-axis direction of the robot end-effector coordinate system,

{[o_{x}, o_{y}, o_{z}]}^{T}

represents the rotational components in the y-axis direction of the robot end-effector coordinate system,

{[a_{x}, a_{y}, a_{z}]}^{T}

represents the rotational components in the z-axis direction of the robot end-effector coordinate system, and

{[p_{x}, p_{y}, p_{z}]}^{T}

represents the displacement vector of the robot end-effector coordinate system.

In joint space, let the robot move by

Δ θ

at a certain moment; at this time, the position of the i-

t h

joint of the robot can be expressed as:

θ_{i} = θ_{t - 1}^{i} + Δ θ_{t}^{i}

(9)

In Cartesian space, the end-effector posture of the robot after movement can be obtained using Equations (8) and (9).

In current research, path optimization for surface spray printing robots mainly focuses on selecting spray guns and spray heads, as well as optimizing spraying uniformity [31]. However, there is limited research on path accuracy, particularly for high-precision specific spray heads, such as UV spray printing heads, which have stricter requirements for path accuracy and smoothness. This subsection primarily explores the methods for establishing a robotic motion accuracy model and how to ensure smoothness in the robot’s movement.

First, by introducing the path accuracy AC, the motion trajectory accuracy of the robot during the spray printing process is evaluated. AC can be modeled as shown in Equation (10):

AC = E_{P} + E_{e} + C

(10)

where

E_{p}

represents the positional error between the robot end-effector and the standard path at time t. It can be expressed by Equation (11):

E_{P} = ∥(P_{p, t}, P_{n e a r, t})∥

(11)

where

| | \cdot | |

represents the Euclidean distance,

P_{p, t} = (x, y, z)

denotes the Cartesian position coordinates of the robot’s end-effector {TCP}, and

P_{near} = (x_{near}, y_{near}, z_{near})

represents a point in the robot’s work coordinate system that is located near the standard path of the robot’s end-effector.

In Cartesian space [32], a six-degree-of-freedom robot is typically represented using Euler angles

φ, θ

and

ϕ

. In this study, their respective errors are defined as

φ_{e r r o r}, θ_{e r r o r}

, and

ϕ_{e r r o r}

, representing the deviation between the actual and desired angle values.

E_{e}

denotes the orientation error of the robot end-effector relative to the corresponding point on the standard path at time t, and can be described by Equation (12):

E_{e} = \sqrt{\frac{ϕ_{error}^{2} + θ_{error}^{2} + ψ_{error}^{2}}{3}}

(12)

C represents the completion level of the spray printing task, which can be expressed by Equation (13):

C = \frac{l e n (L_{t})}{l e n (L_{m})}

(13)

where

l e n (L_{t})

represents the path length traversed by the spray head of the robot end-effector at the current time t and

l e n (L_{m})

represents the total length of the reference path. These are calculated by Equations (14) and (15), respectively:

l e n (L_{t}) = Σ | | (P_{p, 1}, P_{p, 2}, \dots, P_{p, k}) | |

(14)

l e n (L_{m}) = Σ | | (P_{m, 1}, P_{m, 2}, \dots, P_{m, n}) | |

(15)

where

P_{p, k}

represents the position of the robot end-effector spray head at time step k and

P_{m, n}

represents the position of the n-th point on the reference primary path.

In summary, the path accuracy of the m-th path of L at time t is formulated by Equation (16):

{AC}_{t} = ∥(P_{p, t}, P_{n e a r, t})∥ + \sqrt{\frac{ϕ_{error}^{2} + θ_{error}^{2} + ψ_{error}^{2}}{3}} + \frac{l e n (L_{t})}{l e n (L_{m})}

(16)

3.3. Markov Decision Process (MDP)

The Markov Decision Process (MDP) is a core concept in reinforcement learning, providing a mathematical model for solving decision-making problems that involve randomness and temporal dependencies. In an MDP, the agent learns through interactions with the environment, which consist of states, actions, and rewards. The goal is to select the appropriate policy to maximize long-term rewards. An MDP is typically represented as a five-tuple:

M = {S, A, P, R, γ}

, where:

S (State set): Represents all possible states the agent can be in. For instance, in path planning, the state could include the robot’s position, orientation, and other relevant information at a given time. Each state encapsulates the full information about the environment.
A (Action set): Represents the set of actions the agent can take in each state. The action set can be discrete (e.g., move up, move down, move left, move right) or continuous (e.g., adjusting the robot’s speed or direction). Each action corresponds to a specific behavior.
P(s′ | s, a) (State transition function): Describes the probability of transitioning from state s to state $s^{'}$ after performing action a. It reflects the dynamic characteristics of the environment. For example, in path planning, the robot may fail to reach the desired position due to external disturbances.
R(s, a) (Reward function): Represents the immediate reward received after taking action a in state s. The reward is a scalar value used to measure the quality of the outcome of an action. In path planning problems, rewards can be related to factors such as path length, smoothness, and obstacle avoidance, with the goal of maximizing the cumulative reward.
$γ$ (Discount factor): Used to balance the trade-off between current rewards and future rewards, with values typically in the range [0, 1]. When $γ$ is close to 1, the agent focuses more on long-term returns; when $γ$ is smaller (closer to 0), the agent focuses more on short-term rewards.

In an MDP, the agent’s goal is to choose a policy

π

that maximizes its long-term cumulative reward. The policy

π

is a mapping from states to actions, defining which action should be taken in each given state.

One important objective in MDPs is to find an optimal policy

π^{*}

, such that starting from any state, the agent maximizes its cumulative reward. The cumulative reward is typically represented as an expected value, defined as:

V^{π} (s) = E [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t}) ∣ s_{0} = s]

where

V^{π} (s)

represents the expected return starting from state s under policy

π

and

γ^{t}

is the discount factor for future rewards.

For the state–action value function

Q^{π} (s, a)

, it represents the expected return obtained by taking action a in state s, and following policy

π

thereafter:

Q^{π} (s, a) = E [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t}) ∣ s_{0} = s, a_{0} = a]

To obtain the optimal policy, the agent iteratively updates

V (s)

and

Q (s, a)

, using the Bellman equation for dynamic programming. The optimal state-value function and the optimal action-value function satisfy the following Bellman equations:

V^{*} (s) = max_{a} Q^{*} (s, a)

Q^{*} (s, a) = R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) max_{a^{'}} Q^{*} (s^{'}, a^{'})

The agent can find the optimal policy

π^{*} (s) = arg {max}_{a} Q^{*} (s, a)

that maximizes the return by solving these equations.

3.4. Reinforcement Learning SAC Algorithm

Soft Actor–Critic (SAC) [33] is an off-policy reinforcement learning algorithm based on maximum entropy, aimed at optimizing the agent’s policy to maximize long-term returns while ensuring sufficient exploration by increasing the entropy of the policy, thereby avoiding premature convergence to local optima. SAC has shown excellent performance in many reinforcement learning tasks, particularly those involving high-dimensional continuous action spaces, such as robot control and path planning. The core idea of SAC is to introduce a maximum entropy objective, which optimizes not only the reward function but also the entropy of the policy, encouraging the agent to maintain adequate exploration. The Markov Decision Process (MDP) provides a structured framework for reinforcement learning, describing how an agent can affect the state by selecting actions in the environment and receive rewards. MDP is composed of the following five elements:

MDP = {S, A, P, R, γ}

In SAC, the Q-value network is used to estimate the long-term return for a given state–action pair. SAC utilizes the state transition function and reward function from the MDP framework, combined with the idea of maximum entropy, to optimize the agent’s behavior policy, maximizing the return while increasing exploration.

The network structure of the SAC algorithm mainly consists of two Q-value networks (Critic), a policy network (Actor), and a target Q-value network (Target Q-network) [34]. Each network in SAC plays the following role:

Q-Network(Critic): Used to evaluate the return of each state–action pair. SAC uses a dual Q-network ( $Q_{1}$ and $Q_{2}$ ), where the two Q-value networks independently estimate the return for the same state–action pair. By using dual Q-networks, SAC reduces the problem of overestimation of Q-values, enhancing the stability of the learning process. The Q-value network update formula is given by Equation (17):

$L_{Q} (θ_{i}) = E_{(s, a, r, s^{'}) \sim D} [(Q_{θ_{i}} (s, a) - (r + γ \cdot m i n Q_{θ_{j}} (s^{'}, a^{'}) - α \log π_{ϕ} (a^{'} | s^{'}))))$

(17)

where $Q_{θ_{i}} (s, a)$ represents the return estimate for taking action a in state s, $γ$ is the discount factor, which determines the importance given to future rewards, $α$ is the temperature parameter, which controls the balance between the reward r and the entropy of the policy, $π_{θ} (a^{'} | s^{'})$ is the action selection probability output by the policy network for action $a^{'}$ in state $s^{'}$ , and D represents the replay buffer, which stores the experience sequences $(s, a, r, s^{'})$ obtained by the actor during its interaction with the environment.
Target Q-Network: To stabilize the training of the Q-network, SAC introduces a target Q-network. The target Q-network is used to compute the target $Q_{{\bar{θ}}_{i}, i = 1, 2}$ , preventing overestimation of the Q-values. The target Q-value network is updated through soft updates, as follows:

$\bar{θ} = τ θ + (1 - τ) \bar{θ}$

(18)

where $τ$ is the soft update ratio, which controls the update rate of the target Q-value network, typically set to a small value (e.g., 0.005).
Policy Network (Actor): The purpose of the policy network is to generate the probability distribution of actions given a state. In SAC, the policy is represented by a Gaussian distribution, which outputs the mean and standard deviation of the actions. The goal of the policy network is to maximize the return while maintaining exploration. The update objective of the policy network is as follows:

$L_{S A C} = J_{π} (ϕ) = E_{s_{t} \sim D} [Q_{θ} (s_{t}, a_{t}) - α \log π_{ϕ} (a_{t} | s_{t})]$

(19)

where $Q_{θ} (s_{t}, a_{t})$ represents the return estimate from the Q-network and $log π_{ϕ} (a_{t} | s_{t})$ represents the entropy term of the policy, indicating the randomness in choosing action $a_{t}$ in state $s_{t}$ .
Temperature Parameter ( $α$ ): The temperature parameter controls the balance between the entropy of the policy and the return. A higher temperature encourages more exploration, while a lower temperature strengthens the maximization of the return. The update formula for the temperature parameter is as follows:

$α = \frac{1}{N} \sum_{t = 0}^{N} [log π_{θ} (a_{t} | s_{t}) - H]$

(20)

where H is the target entropy and N is the number of samples. The dynamic adjustment of the temperature helps SAC achieve a balance between exploration and exploitation.

3.5. Generative Adversarial Imitation Learning

Generative Adversarial Imitation Learning (GAIL) combines Generative Adversarial Networks (GAN) with Inverse Reinforcement Learning (IRL) in an imitation learning framework, extending IRL within the structural framework of GANs. This approach improves upon the limitations of IRL in terms of poor representation ability and low computational efficiency. Goodfellow et al. [35] proposed a new framework for evaluating generative models through an adversarial process: GAN. GAN simultaneously trains two models: a generator G, which captures the data distribution, and a discriminator D, which estimates whether a sample comes from the expert data or from the generator [36]. The expert data are represented as

D^{E} = {τ_{1}, τ_{2}, \dots, τ_{n}}

, which denotes the expert demonstration trajectories used for training the generative adversarial model. To enable the generator to capture the distribution

P_{d a t a}

of data x, the generator constructs a mapping function

G (z; θ)

internally, which maps a noise distribution

P_{z} (z)

to the data space. Here,

θ

represents the parameters of the generator. The discriminator

D (x; ϖ)

is used to determine the probability that x comes from expert data rather than from the generator

P_{d a t a}

, where

ϖ

represents the parameters of the discriminator. GAN simultaneously trains the generator and discriminator, adjusting the parameters of the generator to minimize

log (D_{ϖ} (x))

. The training objective function is given by:

min max V (D, G) = E_{x \sim P_{d a t a}} [log (D_{ϖ} (x))] + E_{z \sim P_{z} (z)} [log (1 - D_{ϖ} (G_{θ}))]

(21)

where x represents real data and z represents the noise variable.

The loss function of the discriminator D can be defined as:

L_{D} = - E_{x \sim P_{data}} [log (D_{ϖ} (x))] - E_{z \sim P_{z} (z)} [log (1 - D_{ϖ} (G_{θ} (z)))]

(22)

The GAIL algorithm applies the concept of GAN to imitation learning [37]. The model primarily consists of two components: the generator G and the discriminator D. The generator is responsible for generating a random policy, while the discriminator is used to determine whether the input trajectory sequence comes from expert demonstrations or the generator [38]. The core idea of GAIL is to learn an optimal policy by minimizing the distance between the occupancy measure

ρ_{π}

of the stochastic policy and the occupancy measure

ρ_{π^{*}}

of the expert policy. The generator produces a trajectory sequence

τ_{i}

to accomplish the task based on the current state, which is then fed into the discriminator along with an expert trajectory

τ^{*}

sampled from expert demonstrations. The discriminator computes the distance between the state–action pairs

(s, a)

from the generated trajectory and the state–action pairs

(s^{*}, a^{*})

from the expert trajectory, and then feeds back the result to the generator, encouraging it to generate better policies. This process is specifically implemented through the adversarial process described by Equation (23):

L_{G A I L} = min max E_{π_{θ}} [log D_{ϖ} (s, a)] + E_{π^{*}} [log (1 - D_{ϖ} (s, a))]

(23)

where

π^{*}

represents the expert policy,

π_{θ}

represents a stochastic policy,

θ

is the parameter of the generator, and

ϖ

is the parameter of the discriminator. Based on the results from the discriminator,

θ

is continuously updated, optimizing the stochastic policy

π_{θ}

4. Framework for Path Planning of Complex Surface Spray Printing

This section proposes a GAIL-SAC-based framework for complex surface path planning, as shown in Figure 3.

Figure 3 illustrates how GAIL-SAC addresses the MDP problem. This process consists of three stages:

Stage 1: The primary path is generated, which is designed using multi-axis CNC machining software such as UG. The data are exported as CNC data, and a conversion algorithm is developed to transform the CNC tool path data into a Cartesian path for the robot.
Stage 2: In this stage, robot motion planning is performed based on the primary path generated in the first stage. The robot moves with different joint angles at different positions to complete the path planning. This process is treated as an MDP problem, which is solved using SAC.
Stage 3: This stage utilizes the GAIL-SAC framework to improve the convergence speed and trajectory accuracy of reinforcement learning training, and its algorithmic process is shown in Table 1.

4.1. Spray Printing Trajectory Generation Scheme

In this stage, according to the requirements of spray printing, a multi-axis fine machining tool path trajectory

L^{c} = {P_{1}^{T_{1}}, P_{2}^{T_{2}}, \dots, P_{k - 1}^{T_{k - 1}}, P_{k}^{T_{k}}}

is designed using CNC machining software. The path trajectory is then converted into the ideal primary spray printing path L for the robot using Equation (6) from Section 3.1. The robot takes different actions at different positions, which results in different motion paths, each with varying accuracy. This can be treated as an MDP problem, which will be modeled in the next subsection.

4.2. Spray Printing Trajectory Planning Scheme

In the second phase, based on the robot path generated in the first phase, a smooth trajectory is planned in the robot’s joint space using a deep reinforcement learning algorithm. Simultaneously, the established robot motion accuracy model is applied to constrain the trajectory, ensuring both path accuracy and smoothness in the Cartesian space. To accomplish this task, the path planning problem of the printing robot is formulated as a Markov Decision Process (MDP), which involves the components (S, A, P, R). Given the MDP characteristics, this section introduces the Soft Actor–Critic (SAC) reinforcement learning algorithm to specifically address the problem.

Training Environment: In Section 3.2, Equation (8) is used to perform DH modeling of the robot, establishing the robot motion accuracy model. The robot is defined as the agent, and Equation (16) is used to help the agent learn the optimal policy for the MDP.

MDP Formulation: When the robot takes n actions

Δ θ_{1, 2, \dots, 6}

in joint space, a series of trajectory points

L = (P_{1}, P_{2}, P_{3}, P_{4}, \dots P_{n})

is generated. This process is described as an MDP problem, which is defined as follows:

State Space:

s_{t}

represents the state of the agent at time t, and therefore, it needs to include the position and orientation information of the robot end-effector spray head. That is,

s_{t} = (P_{t}, D_{t}, A H_{t})

, where

P_{t}

represents the pose and orientation of the robot end-effector spray head at time t, which can be expressed as

P_{t} = (x_{t}, y_{t}, z_{t}, φ_{t}, β_{t}, Φ_{t})

D_{t}

represents the point on the ideal primary path closest to the robot end-effector at time t, and is defined as

D_{t} = (\begin{matrix} {x^{D}}_{t}, {y^{D}}_{t}, {z^{D}}_{t}, {φ^{D}}_{t}, {β^{D}}_{t}, {Φ^{D}}_{t} \end{matrix})

. Finally,

A H_{t} = (\begin{matrix} {x^{A H}}_{t}, {y^{A H}}_{t}, {z^{A H}}_{t}, {φ^{A H}}_{t}, {β^{A H}}_{t}, {Φ^{A H}}_{t} \end{matrix})

represents the point reached by the robot after motion.

Action Space:

a_{t} = (Δ θ_{1}, Δ θ_{2}, Δ θ_{3}, \dots, Δ θ_{n})

represents the action of the robot at time t, where

Δ θ_{i}

denotes the angular increment of the i-th joint of the spray printing robot. The actions are continuous and bounded. The reward function reflects the expected reward for performing action

a_{t}

in state

s_{t}

, and can be expressed as:

R (s_{t}, a_{t}) = E_{π} [R_{t + 1} | s_{t} = s, a_{t} = a]

(24)

Reward Function: Due to the large state space of the system, it is extremely challenging to set rewards for all states, which leads to the problem of sparse rewards. This can result in a slow learning process or even make it impossible to learn effectively. To enable the robotic arm to quickly learn the optimal task path, a specially designed reward function allows the robotic arm to mimic the reference path, thereby enhancing the learning speed of path planning and increasing the success rate of task execution. Based on the task requirements, rewards and penalties can be categorized into two scenarios:

(a) The robotic arm receives a reward when its end effector moves in the direction of the reference path; conversely, a negative reward is applied when it moves in the opposite direction.

(b) The closer the end effector is to the reference path, the greater the reward received; if the distance is too great, a penalty is incurred, with the penalty increasing as the distance increases. Additionally, the accuracy and smoothness of the planned path must also be considered. Therefore, the reward after executing action

a_{t}

is designed as follows:

r (s_{t}, a_{t}) = - log ((\partial E_{P t} + ε E_{e t} + η C_{t}) + 1)

(25)

The design of the logarithmic function compresses the reward data into a certain range, preventing excessively large reward differences caused by variations in actions. In reinforcement learning, this reward mechanism can guide the robotic arm to quickly approach the main printing path, facilitating faster policy convergence and improving learning efficiency.

SAC Algorithm: SAC is a model-free reinforcement learning framework based on the Actor–Critic architecture, which incorporates the principle of maximum entropy. This allows it to achieve high returns while maintaining strong exploratory capabilities and high robustness. By combining the reward algorithm with the SAC reinforcement learning method, the reward function guides the agent in learning how to select optimal actions, while SAC trains the agent to choose suitable actions based on the defined reward function and state. In summary, SAC addresses the MDP by maximizing the expected reward and entropy within the Actor–Critic framework.

4.3. Reinforcement Learning GAIL-SAC

In recent years, Deep Reinforcement Learning (DRL) has made significant progress in solving complex problems. However, in high-dimensional environments, DRL often faces challenges such as large state spaces, sparse rewards, and high data dimensionality, leading to difficulties in convergence and stability. To address these issues, researchers have explored reward function design to provide finer-grained guidance, and experience-based methods like Hindsight Experience Replay (HER) and Prioritized Experience Replay (PER) to enhance training efficiency. While effective in low-dimensional settings, these approaches often underperform in high-dimensional environments.

Imitation Learning (IL) offers a promising solution by guiding reinforcement learning with expert trajectories or efficient data to provide prior knowledge. Combining IL with DRL can reduce exploration difficulty, accelerate convergence, and improve performance in high-dimensional environments. This hybrid approach represents a viable direction for enhancing the efficiency and stability of reinforcement learning algorithms.

The GAIL algorithm is inspired by maximum entropy IRL and generative adversarial networks (GANs). The objective of the GAIL algorithm can be understood as matching the current policy distribution with that of an expert policy, such that the discriminator cannot distinguish between the current and expert policies. However, since the GAIL algorithm relies on expert data to generate policies, if the strategies within this dataset are suboptimal or unable to achieve the goals, the performance of the generated policies cannot be guaranteed. Therefore, this paper proposes the GAIL-SAC algorithm, which combines the exploratory advantages of reinforcement learning with the strategy constraints inherent in imitation learning.

The framework of the GAIL-SAC algorithm is illustrated in Figure 4. As shown in Figure 4, the model consists of a value network, a policy network, and a discriminator network, with only the policy network retained during deployment. The experience pool comprises an expert experience pool and a trajectory experience pool, with trajectory data in the expert pool represented as

(s_{0}^{E}, a_{0}^{E}, \dots, s_{t}^{E}, a_{t}^{E})

. The trajectory experience pool stores path data generated through the interaction of the current policy with the environment, represented as

(s_{t}, a_{t}, s_{t + 1}, r_{t})

. The training loss function

L (θ)

for the policy network is composed of two components:

L_{S A C} (θ)

and

L_{G A I L} (θ)

. Therefore, the focus of the GAIL-SAC algorithm is on how to adjust the weight parameter

ω

during agent training to modulate the influences of

L_{S A C} (θ)

and

L_{G A I L} (θ)

on the policy network, thereby stabilizing the training of the optimal spraying strategy. The weight parameter

ω

follows a nonlinear decay strategy, transitioning from imitation learning dominance in the early training phases to reinforcement learning dominance in the later phases, which can be described as follows:

L (θ) = \{\begin{matrix} (1 - ω) L_{S A C} (θ) + ω L_{G A I L} (θ) & , & T \leq N_{G A I L} \\ L_{S A C} (θ) & , & T > N_{G A I L} \end{matrix}

(26)

ω = \frac{1}{1 + e x p (0.05 (i - N_{G A I L} / 2))}

(27)

where T represents the current training episode and

N_{G A I L}

denotes the number of episodes participating in the training of the loss function

L_{G A I L} (θ)

constructed based on imitation learning.

5. Simulation and Experiments

5.1. Experimental Environment Configuration

All experiments in this study were conducted on a CPU environment running Windows 11, with PyTorch version 1.3.1 and an 8-core CPU. The models in this chapter used the Adam optimizer to update the network parameters with an initial learning rate of 0.0004. In the experiments, the size of the experience replay buffer was set to 256, and the learning rates for both the Actor and Critic networks in the GAIL-SAC algorithm were 0.0004. This section validates the proposed method’s effectiveness in surface printing path planning through simulation experiments. The robot used in the experiments is the KUKA KR210, with a payload capacity of 20 kg and a maximum working radius of 2100 mm.

The UV nozzle used in this study is the Toshiba F3, manufactured by Toshiba Corporation in Tokyo, Japan. It is a piezoelectric on-demand printing head with a print width of 53.95 mm and a physical resolution of 600 dpi. The simulation environment used is the PyBullet robot simulator. In Section 5.2, the convergence of the GAIL-SAC framework is verified by designing a reacher experiment in a simple reacher environment, and the reward curves of the SAC algorithm and other mainstream improvement methods are compared. In Section 5.3, a simulation experiment of the printing robot trajectory planning is designed to verify the effectiveness of the GAIL-SAC reinforcement learning framework, comparing the performance of the proposed algorithm with traditional algorithms in terms of convergence speed, trajectory accuracy, and trajectory smoothness.

5.2. Main Path Generation Experiment

To validate the effectiveness of the main path generation method proposed in Section 3.1, this subsection presents experimental verification. Initially, the main path was designed and generated using the Computer-Aided Manufacturing (CAM) module in UG software (Siemens NX 1899), with the main parameters summarized in Table 2. This process simulates the operation of a UV nozzle, and the generated CNC data are illustrated in Figure 5a. After applying the transformation outlined in this study, the results are shown in Figure 5b. It is evident that the trajectory data of the robot align with the CNC machining path, meeting the experimental expectations where the path points correspond with the intended design.

5.3. Convergence of GAIL-SAC

To validate the convergence of the proposed GAIL-SAC framework, this section designs and sets up a simulation experiment for the Reacher task using PyBullet. PyBullet is a widely used robotics simulator that facilitates the simulation of robotic arm motion and collision detection. In the simulation, we employ the KUKA KR210 robotic arm with initial joint angles set to

[0, - 90, 90, - 90, 90, 0]

, enabling the arm to move from its initial position to the target point. The goal of the simulation experiment is to move the robot’s end-effector within a radius of 0.05 units from the target point and maintain this position for a certain duration, which is considered a success.The detailed experimental procedure is shown in Figure 6. The experiment compares the convergence of the GAIL-SAC algorithm with that of the baseline SAC algorithm, as well as Behavioral Cloning (BC) and Hindsight Experience Replay (HER). The algorithm is trained for 600 episodes, with each episode consisting of 50 steps.

The rewards from the experiments are shown in Figure 7. The reward curve indicates that the baseline SAC algorithm struggled in the reacher environment due to excessively sparse rewards, resulting in insufficient trainable data and even convergence failures. In contrast, the HER (Hindsight Experience Replay) algorithm performed well by effectively addressing the issue of sparse rewards in the reacher environment. This study employed the future strategy of HER, which involves randomly selecting

s_{t}

from the collected episode sequences as the target and assigning a reward of 1. The good convergence of the HER-SAC algorithm highlights the importance of dense rewards for algorithm convergence. Additionally, the BC-SAC algorithm showed inferior performance. Notably, during the experimental process, the BC-SAC reward curve initially did not converge; it only achieved the results shown in Figure 7 after data processing through the HER algorithm. In comparison, the proposed GAIL-SAC algorithm exhibited greater stability and improved convergence, demonstrating that our enhanced approach significantly enhances the agent’s ability to converge in sparse reward environments.

5.4. Printing Path Planning Experiment

This section focuses on validating the effectiveness of the proposed GAIL-SAC framework in three key aspects: convergence, trajectory accuracy, and trajectory smoothness within the printing environment simulation.

The experimental process of this study is shown in Figure 8. The experiment primarily focuses on planning a trajectory in joint space that ensures both smoothness in joint space and accuracy in Cartesian space. The main path

L^{c} = {P_{1}^{(T_{1})}, P_{2}^{(T_{2})}, \dots, P_{k - 1}^{(T_{k - 1})}, P_{k}^{(T_{k})}}

designed in Section 2.1 is used as the reference path. Then, the Actor network outputs different joint angle values based on the robot’s state, driving the robot’s motion. Next, the reward function, defined in Equation (25) of Section 3.2, assigns a reward to the action. The Critic and D networks evaluate the Actor network based on the reward value and update the networks, guiding the Actor network to learn the optimal policy and plan the optimal path. After training, the output path is compared with the reference standard path, and the evaluation is conducted based on the Cartesian coordinates (xyz) and the corresponding Euler angle information of each point.

After training, the output path is compared with the reference standard path. The evaluation is conducted based on the Cartesian coordinates (xyz) and the corresponding Euler angles of the points, ensuring a comprehensive assessment of the trajectory’s accuracy and smoothness.

From the perspective of convergence, this simulation compared three different methods. The experimental results, shown in Figure 9, indicate that the previous SAC-HER method struggled to converge under the mid to high-dimensional data of this experiment, remaining ineffective even after 1000 training iterations. In contrast, the SAC algorithm, which employed the designed reward function, achieved dense rewards and demonstrated convergence. The BC algorithm, utilizing a fifth-degree polynomial as expert instances, exhibited an upward trend after 1000 training iterations.

When comparing these algorithms, the proposed GAIL-SAC algorithm performed optimally in high-dimensional data settings. Notably, while all the aforementioned algorithms showed a steady increase in reward trends, indicating a convergence tendency, experimental validation revealed that trajectory accuracy was poor when the reward was less than −10. Only when the reward exceeded −10 did the trajectory accuracy begin to improve. Table 3 presents the final converged reward of GAIL-SAC, confirming that it also achieved the best trajectory accuracy in this experiment.

To validate the superiority of the proposed algorithm in terms of trajectory accuracy and smoothness, a comparative analysis was conducted between the proposed algorithm and metaheuristic algorithms, such as Particle Swarm Optimization (PSO) [39] and Genetic Algorithm (GA) [40]. In the design of the metaheuristic algorithms, the primary path was first generated by solving the inverse kinematics based on the established robot DH parameter model, resulting in a series of joint data. The path was then optimized in the joint space using the metaheuristic algorithms, while constraints were applied in the Cartesian space to ensure both the smoothness and accuracy of the robot’s path in Cartesian space. Experimental results show that the comprehensive fitness functions of PSO and GA effectively balance the path smoothness and accuracy. The fitness function is defined as the weighted sum of the target deviations, as shown in the following form:

f (x) = w_{1} E_{cartesian} + w_{2} E_{joint-smooth} + w_{3} E_{cartesian-smooth}

(28)

where

E_{cartesian}

represents the Cartesian space error, which is used to measure the deviation of the optimized path from the reference path, including both positional and orientation errors. It is defined as:

E_{c a r t e s i a n} = \sum_{i = 1}^{N} (‖ p_{i} - p_{i}^{r e f} ‖^{2} + λ {‖ log (R_{i}^{T} R_{i}^{ref}) ‖}_{F}^{2})

(29)

where

p_{i}

and

p_{i}^{ref}

represent the actual position and reference position of the i-th path point, respectively,

R_{i}^{T}

and

R_{i}^{ref}

denote the actual and reference rotation matrices of the i-th path point, respectively, and

λ

is the weight factor between the positional error and the orientation error.

E_{joint-smooth}

represents the joint smoothness error, which is used to limit the rapid changes in joint angles and ensure the smoothness of the robotic arm’s motion. It is defined as:

E_{joint-smooth} = \sum_{i = 2}^{N - 1} \sum_{j = 1}^{n} {(q_{i + 1, j} - 2 q_{i, j} + q_{i - 1, j})}^{2}

(30)

where

q_{(i, j)}

represents the j-th joint angle of the i-th path point.

w_{1}

w_{2}

, and

w_{3}

are the weight coefficients for Cartesian error, joint smoothness error, and Cartesian smoothness error, respectively, and are used to adjust the influence of each component.

Figure 10 shows the path performance in Cartesian space of different path planning algorithms. By comparing the trajectory projections on the X-Z, X-Y, and Y-Z planes, the differences in path accuracy between the GAIL-SAC algorithm (RL) and traditional algorithms (MOVE), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA) are highlighted. From the trajectory comparisons in the figure, it can be seen that the path generated by the MOVE algorithm exhibits noticeable jitter, particularly in the Y-Z plane (Figure 10c), where the path deviates significantly from the ideal path (STAND, red). This indicates that the MOVE algorithm has a lower path accuracy. This is primarily due to the fact that the MOVE algorithm directly computes the path inverse kinematics in Cartesian space, which may lead to singularities or no solutions when approaching the workspace boundaries, resulting in unstable robot paths. In contrast, the PSO and GA algorithms optimize the path accuracy to some extent and reduce jitter, but certain trajectory deviations are still observable in the Y-Z plane (Figure 10c), exhibiting some instability. The RL (GAIL-SAC) algorithm, by performing trajectory planning in joint space, generates a smoother path that closely follows the ideal path (STAND), especially in the Y-Z plane, where the RL algorithm’s path almost coincides with the ideal path, demonstrating its superiority in complex environments and path accuracy.

Table 4 and Figure 11 further quantify the differences in path accuracy among the algorithms. Table 4 shows the Mean Squared Error (MSE) of different algorithms in the X-Z, X-Y, and Y-Z planes. It can be seen that the MOVE algorithm has the largest error, with an MSE of 1231.13 mm² in the X-Z plane, 1672.26 mm² in the X-Y plane, and 1560.59 mm² in the Y-Z plane, all of which are significantly higher than those of other algorithms. In contrast, the MSE of the PSO and GA algorithms is noticeably lower than MOVE, particularly in the Y-Z plane, where the errors are 2.29 mm² and 1.89 mm², respectively. However, they still do not compare to the MSE of the RL algorithm. The RL (GAIL-SAC) algorithm has the smallest MSE, with 40.02 mm² in the X-Z plane, 4.88 mm² in the X-Y plane, and 4.77 mm² in the Y-Z plane. These values are significantly lower than those of other algorithms, indicating that the RL algorithm provides the most accurate path-following capability in all planes.

Additionally, Figure 11 presents the maximum Cartesian space error for each algorithm in different planes, further quantifying the differences in path accuracy. From the figure, it can be seen that the maximum error of the MOVE algorithm reaches 9.01 mm in both the X-Z and Y-Z planes, indicating significant path deviations in critical areas. The maximum errors for the PSO and GA algorithms are 2.29 mm and 1.89 mm, respectively, which are improvements over the MOVE algorithm but still higher than the RL algorithm. In contrast, the RL (GAIL-SAC) algorithm has a maximum error of only 0.63 mm, the lowest among all algorithms. Especially in the Y-Z plane, the maximum error of the RL algorithm is almost zero, demonstrating the smallest trajectory deviation and proving its superiority in path planning.

In summary, compared to other algorithms, the RL algorithm significantly reduces trajectory errors, especially in complex tasks and boundary environments, maintaining higher robustness and accuracy. Therefore, the proposed GAIL-SAC algorithm in this study demonstrates stronger path planning capabilities and higher accuracy in practical applications.

Figure 12 illustrates the velocity variations in joint space for different algorithms, assessing their performance in improving the smoothness of robotic motion. Figure 12a–d depict the velocity fluctuations of the conventional path planning algorithm, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and the proposed reinforcement-learning-based GAIL-SAC algorithm. By analyzing these figures, it is evident that different algorithms exhibit distinct velocity fluctuations in joint space. Figure 12e–j further provide a comparative analysis of velocity variations across joints 0 to 5. The results indicate that the conventional path planning algorithm exhibits poor smoothness in joint space, characterized by significant velocity fluctuations. In contrast, the GA and PSO algorithms demonstrate improved smoothness in joint space, significantly outperforming traditional algorithms. Compared to these approaches, the proposed GAIL-SAC algorithm achieves the best smoothness in joint space during robotic motion, with almost no noticeable abrupt velocity fluctuations, highlighting its superior smooth performance.

To further quantify the velocity variations and analyze the smoothness of different algorithms in detail, Figure 13 presents the standard deviation of velocity fluctuations in joint space. The velocity variation standard deviation serves as a quantitative measure of motion smoothness, where lower values indicate a smoother trajectory. In Figure 13, the GAIL-SAC algorithm exhibits a significantly lower velocity variation standard deviation compared to other algorithms, particularly at joint 2, where its standard deviation is only 0.00795, far lower than that of the GA algorithm (0.729) and the PSO algorithm (0.58). Similar trends are observed across other joints. Therefore, combining the visual comparison in Figure 12 and the quantitative analysis in Figure 13, the GAIL-SAC algorithm demonstrates a distinct advantage in joint space smoothness, achieving the lowest velocity variation standard deviation across all joints, thus exhibiting the optimal motion smoothness. The experimental results confirm that the GAIL-SAC algorithm outperforms the conventional path planning algorithm, GA, and PSO in improving the smoothness of robotic motion.

6. Case Analysis

To validate the practical effectiveness of the proposed algorithm, we conducted printing experiments. This section applies the planned path data directly to a real robot. The case study utilized a KUKA robot equipped with a mainstream UV print head from Seer. The experimental subject was a motorcycle windshield, characterized by its complex curvature. To minimize waste and facilitate repeatability, we placed A4 paper on the surface of the windshield, allowing us to demonstrate the experimental results without affecting the final printed outcome.

Before the experiments commenced, we established a simulated environment that closely resembled the actual printing process to ensure safety and protect the UV print head. During the simulation, the robot’s movement trajectories were validated to prevent any collisions or unexpected incidents during operation. The printing robot used in this experiment was the KUKA KR210 model, with its simulated environment illustrated in Figure 14a and the corresponding actual experimental setup shown in Figure 14b.

As shown in Figure 15, Figure 15a–e illustrate the robot’s movement positions at various stages of the printing process. Throughout this process, it is crucial for the TCP of the print head to remain perpendicular to the surface of the printed workpiece to meet the requirements of UV printing. The actual results align well with the simulation outcomes. Figure 15f–j correspond to the printing results at each movement stage. The printed results demonstrate that the proposed algorithm effectively meets the demands of practical processing.

In the UV printing process, the smoothness of the robot’s end-effector motion in Cartesian space significantly affects the printing quality. The intensity of velocity fluctuations directly determines the uniformity and consistency of the ink ejected from the nozzle. Therefore, this paper compares the velocity variations in the X, Y, and Z directions across different path planning algorithms, with the experimental results shown in Figure 16. From the velocity variation curves, it is evident that the traditional path planning algorithm (MOVE) exhibits the most significant trajectory jitter and velocity fluctuations, particularly in the Y direction, where substantial jumps are observed, indicating unstable motion that fails to meet the precise printing requirements. In contrast, the GA and PSO algorithms demonstrate improvements in velocity smoothness, with their velocity curves being more stable than that of MOVE. Specifically, PSO has smaller fluctuations in the Z direction, while GA performs relatively better in the X direction. The RL (GAIL-SAC) algorithm shows the least velocity variation and the smoothest trajectory, with the velocity variation in the X, Y, and Z directions significantly reduced, indicating the best trajectory smoothness during actual robot operation.

To further quantify the smoothness of the different algorithms, this paper calculates the standard deviation of the velocity variation rate in the X, Y, and Z directions. The experimental results are shown in Figure 17. The analysis reveals that the MOVE algorithm has the highest velocity variation rate standard deviation, with values of 1474.8, 1919.1, and 1326.8 in the X, Y, and Z directions, respectively, indicating large trajectory fluctuations and poor motion smoothness. Both GA and PSO algorithms show improvements in smoothness in all directions, with a significant reduction in velocity fluctuations compared to the MOVE algorithm. Specifically, PSO reduces the standard deviation in the Z direction to 602.6, while GA reduces the standard deviation in the Y direction to 950.3, reflecting the distinct optimization advantages of both algorithms in different directions. However, the RL (GAIL-SAC) algorithm achieves the lowest velocity variation rate standard deviation, with values of 4.18, 4.67, and 4.32 in the X, Y, and Z directions, respectively. This indicates that the RL algorithm provides the smoothest trajectory in all directions, effectively reducing velocity fluctuations and enhancing motion stability. Overall, while GA and PSO improve the trajectory smoothness in Cartesian space to some extent, the RL (GAIL-SAC) algorithm performs best, effectively suppressing velocity jumps and ensuring the stability of the printing process, thereby further improving print quality.

7. Practical Implications Analysis

With the continuous development of intelligent manufacturing technologies, the application of robotic technology in precision manufacturing has been increasingly widespread. The Generative Adversarial Imitation Learning—Soft Actor–Critic (GAIL-SAC) framework proposed in this paper, specifically designed for the robotic surface path planning of UV printing, demonstrates broad application prospects. This method optimizes the robot’s printing trajectory by combining reinforcement learning and imitation learning, enhancing both the smoothness and accuracy of the trajectory, while also improving the robustness of path planning in complex environments.

In the field of intelligent manufacturing, especially in high-precision printing and printing applications, UV printing technology is gradually replacing traditional screen printing. The research results in this paper can significantly improve the path planning accuracy of printing robots, particularly in printing tasks involving complex surfaces and irregular objects. By introducing the GAIL-SAC framework, the robot can achieve more efficient and accurate printing on irregular surfaces, reduce printing errors, and improve both production efficiency and product quality.

Specific application areas include the following:

Automotive Industry: Automotive parts, such as bodies, hoods, doors, spoilers, and interior components, often have complex surface shapes. Traditional spraying methods fail to meet the high-precision spraying requirements. UV printing technology can accurately print patterns, colors, or provide protective coatings on these irregular surfaces, thereby enhancing the aesthetics of automotive exteriors and extending the lifespan of parts.
Aerospace: Aircraft components such as fuselages, wings, tail fins, engines, and turbine blades require clear identification, including production numbers, model types, airline logos, and safety marks. UV printing technology can accurately print these marks on complex surfaces, ensuring compliance with aviation safety standards and providing durability. Additionally, it offers uniform and high-quality coatings for complex aerospace parts, thereby improving production efficiency.
Medical Equipment: In medical device manufacturing, UV printing technology can meet the personalized printing needs of devices with special geometric shapes, such as custom prosthetics, orthotics, surgical instruments, and medical monitoring equipment, ensuring surface printing accuracy and enhancing both the functionality and aesthetics of the devices.
Food Packaging: UV printing technology can provide precise pattern printing solutions for beverage bottles, cans, and other packaging, enhancing brand recognition and increasing market competitiveness.
Consumer Electronics: In consumer electronic products such as smartphones, tablets, and other devices, UV printing technology can achieve high-quality coating printing, providing protection against static electricity, fingerprints, and enhancing the product’s lifespan and consumer experience.

These applications not only enhance production efficiency but also effectively reduce costs, increase production flexibility, and improve product quality. Particularly in large-scale customization and small-batch production, irregular surface printing technology shows significant advantages. With the continuous development of this technology, it is expected to bring more efficient, flexible, and personalized production modes to a variety of industries.

8. Conclusions

This paper presents a novel theoretical framework for robotic surface UV printing path planning, aiming to enhance the accuracy and smoothness of printing paths, with significant application value in freeform surface printing. The main contributions of this paper include the following: (1) based on the CAD model of the printing workpiece, a method to convert CNC data into robot path data is proposed, which is used to design the primary path suitable for robotic surface printing; (2) a motion accuracy model for the printing robot is established, integrating both the positional accuracy of the path and the attitude accuracy of the robot’s end-effector nozzle; and (3) the robot motion accuracy model is described using the Markov Decision Process (MDP), and the GAIL-SAC algorithm is proposed by improving the SAC algorithm, which combines the advantages of reinforcement learning with traditional path planning methods to obtain the optimal surface printing path. Experimental validation shows the following: (1) the improved SAC algorithm demonstrates better convergence performance in environments with sparse rewards and high-dimensional data; (2) in the robotic surface printing simulation environment, the proposed algorithm outperforms traditional GA and PSO algorithms in terms of trajectory accuracy, and exhibits better joint-space trajectory smoothness, proving the effectiveness of the proposed method in improving printing trajectory accuracy and smoothness; and (3) in real robot experiments, the feasibility of the proposed algorithm in practical applications is verified, and the experimental results show that in Cartesian space, the smoothness of the robot’s end-effector TCP motion path is superior to that of traditional GA and PSO algorithms.

However, the limitations of this paper lie in considering only the accuracy and smoothness of the printing path. Nozzle collision is a critical factor affecting printing accuracy and stability, especially when the robot’s end-effector approaches complex surfaces or irregular objects. To address this issue, future research will integrate obstacle avoidance algorithms and introduce obstacle avoidance constraints in the path planning process, incorporating them into the reinforcement learning reward function to ensure that the robot nozzle maintains a safe distance from the workpiece surface, effectively preventing nozzle-workpiece collisions. Additionally, during the reinforcement learning training process, the computation of large amounts of training data and high-dimensional state space may lead to long training times and excessive computational resource consumption. To improve the algorithm’s efficiency, future research will explore distributed computing frameworks to parallel process multiple training environments, reducing training time and computational costs, or use transfer learning techniques to pre-train models in similar tasks or environments, reducing training time and improving algorithm convergence speed.

In conclusion, the GAIL-SAC framework proposed in this study effectively enhances the printing path’s accuracy and smoothness, demonstrating the great potential of reinforcement learning in robotic surface UV printing path planning. With continuous technological advancements, the method presented in this paper is expected to find widespread application in intelligent manufacturing, automotive, aerospace, medical devices, and other fields, driving the manufacturing industry toward a more intelligent and personalized direction.

Author Contributions

Conceptualization, J.L. and X.L.; methodology, J.L. and X.L.; software, J.L. and X.L.; validation, X.L., C.H. and Z.C.; formal analysis, C.H. and Z.C.; investigation, J.L. and X.L.; resources, J.L. and X.L.; data curation, Z.L. (Zhenyong Liu) and Z.L. (Zhicong Li); writing—original draft preparation, J.L. and X.L.; writing—review and editing, J.L. and X.L.; visualization, J.L.; supervision, J.L.; project administration, J.L. and M.C.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Project of Natural Science Foundation of Guangdong Province, China grant number 2022B1515120025 and partially funded by the Guangdong Province University Student Science and Technology Innovation Training Special Fund (No. pdjh2023b0545) and the Foshan University Student Academic Fund (No. xsjj202302kjb10).

Data Availability Statement

The data used in this study were obtained from a key research project funded by the Guangdong Provincial Natural Science Foundation. Due to confidentiality agreements, the data are classified and cannot be made publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

{WCS}	Workpiece Coordinate System
{LCS}	Local Tool Coordinate System
RL	Reinforcement Learning
MDP	Markov Decision Process
PSO	Particle Swarm Optimization
GA	Genetic Algorithm
SAC	Soft Actor–Critic
GAIL	Generative Adversarial Imitation Learning
GAIL-SAC	Generative Adversarial Imitation Learning and Soft Actor–Critic
CAM	Computer-Aided Manufacturing
BC	Behavioral Cloning
HER	Hindsight Experience Replay
TCP	Tool Center Point
CNC	Computer Numerical Control
CAD	Computer-Aided Design

References

Verduyn, A.; De Schutter, J.; Decré, W.; Vochten, M. Shape-based path adaptation and simulation-based velocity optimization of initial tool trajectories for robotic spray painting. In Proceedings of the 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE), Auckland, New Zealand, 26–30 August 2023; pp. 1–8. [Google Scholar]
Gao, R.; Zhou, Q.; Cao, S.; Jiang, Q. Apple-Picking Robot Picking Path Planning Algorithm Based on Improved PSO. Electronics 2023, 12, 1832. [Google Scholar] [CrossRef]
Huang, Z.; Chen, G.; Shen, Y.; Wang, R.; Liu, C.; Zhang, L. An Obstacle-Avoidance Motion Planning Method for Redundant Space Robot via Reinforcement Learning. Actuators 2023, 12, 69. [Google Scholar] [CrossRef]
Nieto Bastida, S.; Lin, C.Y. Autonomous Trajectory Planning for Spray Painting on Complex Surfaces Based on a Point Cloud Model. Sensors 2023, 23, 9634. [Google Scholar] [CrossRef] [PubMed]
Weber, A.M.; Gambao, E.; Brunete, A. A Survey on Autonomous Offline Path Generation for Robot-Assisted Spraying Applications. Actuators 2023, 12, 403. [Google Scholar] [CrossRef]
Bedaka, A.K.; Lin, C.Y. CAD-based robot path planning and simulation using OPEN CASCADE. Procedia Comput. Sci. 2018, 133, 779–785. [Google Scholar] [CrossRef]
Gleeson, D.; Jakobsson, S.; Salman, R.; Ekstedt, F.; Sandgren, N.; Edelvik, F.; Carlson, J.S.; Lennartson, B. Generating optimized trajectories for robotic spray painting. IEEE Trans. Autom. Sci. Eng. 2022, 19, 1380–1391. [Google Scholar] [CrossRef]
Park, J.H.; Lim, Y.E.; Choi, J.H.; Hwang, M.J. Trajectory-based 3D point cloud ROI determination methods for autonomous mobile robot. IEEE Access 2023, 11, 8504–8522. [Google Scholar] [CrossRef]
Meng, Y.; Jiang, Y.; Li, Y.; Pang, G.; Tong, Q. Research on point cloud processing and grinding trajectory planning of steel helmet based on 3D scanner. IEEE Access 2023, 12, 3085–3097. [Google Scholar] [CrossRef]
Shah, S.H.; Khan, S.G.; Tran, C.C. Surface Normal Generation and Compliance Control for Robotic Based Machining Operations. In Proceedings of the 2024 9th International Conference on Control and Robotics Engineering (ICCRE), Osaka, Japan, 10–12 May 2024; pp. 74–79. [Google Scholar]
Wu, L.; Zang, X.; Yin, W.; Zhang, X.; Li, C.; Zhu, Y.; Zhao, J. Pose and Path Planning for Industrial Robot Surface Machining Based on Direction Fields. IEEE Robot. Autom. Lett. 2024, 9, 10455–10462. [Google Scholar] [CrossRef]
Wang, G.; Li, W.; Jiang, C.; Zhu, D.; Li, Z.; Xu, W.; Zhao, H.; Ding, H. Trajectory planning and optimization for robotic machining based on measured point cloud. IEEE Trans. Robot. 2021, 38, 1621–1637. [Google Scholar] [CrossRef]
Zeng, Y.; Yu, Y.; Zhao, X.; Liu, Y.; Liu, J.; Liu, D. Trajectory planning of spray gun with variable posture for irregular plane based on boundary constraint. IEEE Access 2021, 9, 52902–52912. [Google Scholar] [CrossRef]
Zhang, Y.; Xu, C.; Xiao, H.; Zhou, B.; Zeng, Y. Planning method of offset spray path for patch considering boundary factors. Math. Probl. Eng. 2018, 2018, 6067391. [Google Scholar] [CrossRef]
Lu, S.; Ding, B.; Li, Y. Minimum-jerk trajectory planning pertaining to a translational 3-degree-of-freedom parallel manipulator through piecewise quintic polynomials interpolation. Adv. Mech. Eng. 2020, 12, 1687814020913667. [Google Scholar] [CrossRef]
Zhu, J.; Pan, D. Improved Genetic Algorithm for Solving Robot Path Planning Based on Grid Maps. Mathematics 2024, 12, 4017. [Google Scholar] [CrossRef]
Gao, Y.; Li, Z.; Wang, H.; Hu, Y.; Jiang, H.; Jiang, X.; Chen, D. An Improved Spider-Wasp Optimizer for Obstacle Avoidance Path Planning in Mobile Robots. Mathematics 2024, 12, 2604. [Google Scholar] [CrossRef]
Hsieh, H.T.; Chu, C.H. Improving optimization of tool path planning in 5-axis flank milling using advanced PSO algorithms. Robot. Comput.-Integr. Manuf. 2013, 29, 3–11. [Google Scholar] [CrossRef]
Prianto, E.; Park, J.H.; Bae, J.H.; Kim, J.S. Deep reinforcement learning-based path planning for multi-arm manipulators with periodically moving obstacles. Appl. Sci. 2021, 11, 2587. [Google Scholar] [CrossRef]
Zhao, T.; Wang, M.; Zhao, Q.; Zheng, X.; Gao, H. A path-planning method based on improved soft actor-critic algorithm for mobile robots. Biomimetics 2023, 8, 481. [Google Scholar] [CrossRef] [PubMed]
von Eschwege, D.; Engelbrecht, A. Soft Actor-Critic Approach to Self-Adaptive Particle Swarm Optimisation. Mathematics 2024, 12, 3481. [Google Scholar] [CrossRef]
He, Y.; Hu, R.; Liang, K.; Liu, Y.; Zhou, Z. Deep Reinforcement Learning Algorithm with Long Short-Term Memory Network for Optimizing Unmanned Aerial Vehicle Information Transmission. Mathematics 2024, 13, 46. [Google Scholar] [CrossRef]
Huang, Y.; Zhou, C.; Zhang, L.; Lu, X. A Self-Rewarding Mechanism in Deep Reinforcement Learning for Trading Strategy Optimization. Mathematics 2024, 12, 4020. [Google Scholar] [CrossRef]
Chen, W.; Li, X.; Ge, H.; Wang, L.; Zhang, Y. Trajectory planning for spray painting robot based on point cloud slicing technique. Electronics 2020, 9, 908. [Google Scholar] [CrossRef]
He, S.; Hu, C.; Lin, S.; Zhu, Y. An online time-optimal trajectory planning method for constrained multi-axis trajectory with guaranteed feasibility. IEEE Robot. Autom. Lett. 2022, 7, 7375–7382. [Google Scholar] [CrossRef]
He, S.; Hu, C.; Lin, S.; Zhu, Y.; Tomizuka, M. Real-time time-optimal continuous multi-axis trajectory planning using the trajectory index coordination method. ISA Trans. 2022, 131, 639–649. [Google Scholar] [CrossRef] [PubMed]
Praniewicz, M.; Kurfess, T.R.; Saldana, C. Error qualification for multi-axis BC-type machine tools. J. Manuf. Syst. 2019, 52, 211–216. [Google Scholar] [CrossRef]
Xie, S.; Sun, L.; Chen, G.; Wang, Z.; Wang, Z. A novel solution to the inverse kinematics problem of general 7r robots. IEEE Access 2022, 10, 67451–67469. [Google Scholar] [CrossRef]
Chen, W.; Liu, J.; Tang, Y.; Huan, J.; Liu, H. Trajectory optimization of spray painting robot for complex curved surface based on exponential mean Bézier method. Math. Probl. Eng. 2017, 2017, 4259869. [Google Scholar] [CrossRef]
Gao, G.; Sun, G.; Na, J.; Guo, Y.; Wu, X. Structural parameter identification for 6 DOF industrial robots. Mech. Syst. Signal Process. 2018, 113, 145–155. [Google Scholar] [CrossRef]
Ren, J.; Sun, Y.; Hui, J.; Ahmad, R.; Ma, Y. Coating thickness optimization for a robotized thermal spray system. Robot. Comput.-Integr. Manuf. 2023, 83, 102569. [Google Scholar] [CrossRef]
Teng, Q.; Yi, J.; Zhu, X.; Zhang, Y. Extraction method of position and posture information of robot arm picking up target based on RGB-D data. Therm. Sci. 2020, 24, 1481–1488. [Google Scholar] [CrossRef]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
Zuo, G.; Zhao, Q.; Huang, S.; Li, J.; Gong, D. Adversarial imitation learning with mixed demonstrations from multiple demonstrators. Neurocomputing 2021, 457, 365–376. [Google Scholar] [CrossRef]
Kidera, S.; Shintani, K.; Tsuneda, T.; Yamane, S. Combined Constraint on Behavior Cloning and Discriminator in Offline Reinforcement Learning. IEEE Access 2024, 12, 19942–19951. [Google Scholar] [CrossRef]
Tsurumine, Y.; Matsubara, T. Goal-aware generative adversarial imitation learning from imperfect demonstration for robotic cloth manipulation. Robot. Auton. Syst. 2022, 158, 104264. [Google Scholar] [CrossRef]
Xu, L.; Cao, M.; Song, B. A new approach to smooth path planning of mobile robot based on quartic Bezier transition curve and improved PSO algorithm. Neurocomputing 2022, 473, 98–106. [Google Scholar] [CrossRef]
Wang, F.; Wu, Z.; Bao, T. Time-jerk optimal trajectory planning of industrial robots based on a hybrid WOA-GA algorithm. Processes 2022, 10, 1014. [Google Scholar] [CrossRef]

Figure 1. Five-axis machining process.

Figure 2. TCP diagram of a UV printing robot.

Figure 3. A Complex surface path planning framework based on GAIL-SAC.

Figure 4. GAIL-SAC network algorithm framework.

Figure 5. Main printing path data generation diagram. (a) The generated CNC data. (b) The robot data obtained after conversion.

Figure 6. Reacher experiment flowchart.

Figure 7. Reward curve in the reacher environment.

Figure 8. Spray printing experiment flowchart.

Figure 9. Comparison of algorithm rewards for the printing environment.

Figure 10. Cartesian space path comparison diagram. (a) Comparison of algorithm path in the X-Z plane. (b) Comparison of algorithm path in the X-Y plane. (c) Comparison of algorithm path in the Y-Z plane.

Figure 11. Comparison of algorithm rewards for the printing environment.

Figure 12. Comparison chart of joint space smoothness. (a)Velocity fluctuations in joint space for the conventional path planning algorithm. (b) Velocity fluctuations in joint space for the Genetic Algorithm (GA). (c) Velocity fluctuations in joint space for the Particle Swarm Optimization (PSO) algorithm. (d) Velocity fluctuations in joint space for the GAIL-SAC algorithm. (e) Comparative analysis of velocity fluctuations in joint 0 across algorithms. (f) Comparative analysis of velocity fluctuations in joint 1 across algorithms. (g) Comparative analysis of velocity fluctuations in joint 2 across algorithms. (h) Comparative analysis of velocity fluctuations in joint 3 across algorithms. (i) Comparative analysis of velocity fluctuations in joint 4 across algorithms. (j) Comparative analysis of velocity fluctuations in joint 5 across algorithms.

Figure 13. Comparison chart of joint space smoothness.

Figure 14. Comparison between simulated robots and real robots. (a) Simulation environment. (b) Real environment.

Figure 15. Spray printing process and spray printing effect diagram. (a) Robot’s movement position 1 during the printing process. (b) Robot’s movement position 2 during the printing process. (c) Robot’s movement position 3 during the printing process. (d) Robot’s movement position 4 during the printing process. (e) Robot’s movement position 5 during the printing process. (f) Printing result at position 1. (g) Printing result at position 2. (h) Printing result at position 3. (i) Printing result at position 4. (j) Printing result at position 5.

Figure 16. Comparison chart of smoothness in Cartesian space. (a) Cartesian space velocity variation plot of the conventional path planning algorithm. (b) Cartesian space velocity variation plot of the GA algorithm. (c) Cartesian space velocity variation plot of the PSO algorithm. (d) Cartesian space velocity variation plot of the GAIL-SAC algorithm. (e) Comparison of velocity variation in the X-axis direction for different algorithms. (f) Comparison of velocity variation in the Y-axis direction for different algorithms. (g) Comparison of velocity variation in the Z-axis direction for different algorithms.

Figure 17. Comparison chart of the standard deviation of velocity variation in Cartesian space.

Table 1. The GAIL-SAC algorithm process proposed in this article.

GAIL-SAC Algorithm Steps
1: Input: $θ_{1}, θ_{2}, ϕ$	Initialize neural network parameters
2: ${\bar{θ}}_{1} \leftarrow θ_{1}, {\bar{θ}}_{2} \leftarrow θ_{2}$	Initialize target Q-network parameters
3: $D \leftarrow \emptyset$	Initialize reply buffer D
4: $D^{E} = {τ_{1}, τ_{2}, \dots, τ_{n}}$	Initialize Expert buffer $D^{E}$
5: for each iteration do
6: for each environment do
7: $a_{t} \sim π_{ϕ} (a_{t} ∣ s_{t})$	The action selected through strategy $π_{ϕ} (a_{t} ∣ s_{t})$ based on the current state
8: $s_{t + 1} \sim p (s_{t + 1}, s_{t}, a_{t})$	The robot reaches the next state $s_{t + 1}$ to receive an immediate reward r
9: $D \leftarrow D \cup {(s_{t}, a_{t}, r_{t}, s_{t + 1})}$	Store $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in replay buffer D
10: end for
11: for each gradient step do
12: $θ_{i} \leftarrow θ_{i} - λ_{Q} {\hat{\nabla}}_{θ_{i}} J_{Q} (θ_{i}) f o r i \in {1, 2}$	Update Q-network parameter $θ_{i}$
13: $ϖ \leftarrow ϖ - η \nabla_{ϖ} L_{D}$	Update discriminator D-network parameters $ϖ$
14: Equation (26)	Determine the update formula for the Actor- network
15: if $T > N_{GAIL}$
16: $\begin{matrix} ϕ & \leftarrow ϕ + η ((1 - ω) \nabla_{ϕ} L_{S A C} + ω \nabla_{ϖ} L_{GAIL}) \end{matrix}$	Update Actor-network parameters $ϕ$
17: else
18: $ϕ \leftarrow ϕ - λ_{π} {\hat{\nabla}}_{ϕ} J_{π} (ϕ)$	Update Actor-network parameters $ϕ$
19: $α \leftarrow α - λ {\hat{\nabla}}_{α} J (α)$	Update temperature parameter $α$
20: ${\bar{θ}}_{i} = τ θ_{i} + (1 - τ) {\bar{θ}}_{i} f o r i \in {1, 2}$	Update target Q-network parameter ${\bar{θ}}_{i}$
21: end for
22: end for

Table 2. CNC machining parameter settings.

Parameters	Setting Method
Processing method	Variable profile milling
tool	Spherical milling cutter
Knife axis vector	Vertical to the processed component
guide	Boundary curve
Path direction	one-way
Machine tool type	Five axis machine tool (BC axis)

Table 3. The final convergence reward value of the algorithm.

Algorithm Comparison	Training Epochs	Reward
SAC	3000	−10.12
BC	3000	−15.23
GAIL-SAC	3000	−6.72

Table 4. Comparison of MSEs at different plane positions.

Algorithm	X-Z Plane MSE (mm²)	X-Y Plane MSE (mm²)	Y-Z Plane MSE (mm²)
PSO	55.436	7.987	3.138
MOVE	1231.129	1672.257	1560.586
RL	40.025	4.877	4.771
GA	51.443	5.533	8.249

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, J.; Lin, X.; Huang, C.; Cai, Z.; Liu, Z.; Chen, M.; Li, Z. A Study on Path Planning for Curved Surface UV Printing Robots Based on Reinforcement Learning. Mathematics 2025, 13, 648. https://doi.org/10.3390/math13040648

AMA Style

Liu J, Lin X, Huang C, Cai Z, Liu Z, Chen M, Li Z. A Study on Path Planning for Curved Surface UV Printing Robots Based on Reinforcement Learning. Mathematics. 2025; 13(4):648. https://doi.org/10.3390/math13040648

Chicago/Turabian Style

Liu, Jie, Xianxin Lin, Chengqiang Huang, Zelong Cai, Zhenyong Liu, Minsheng Chen, and Zhicong Li. 2025. "A Study on Path Planning for Curved Surface UV Printing Robots Based on Reinforcement Learning" Mathematics 13, no. 4: 648. https://doi.org/10.3390/math13040648

APA Style

Liu, J., Lin, X., Huang, C., Cai, Z., Liu, Z., Chen, M., & Li, Z. (2025). A Study on Path Planning for Curved Surface UV Printing Robots Based on Reinforcement Learning. Mathematics, 13(4), 648. https://doi.org/10.3390/math13040648

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Study on Path Planning for Curved Surface UV Printing Robots Based on Reinforcement Learning

Abstract

1. Introduction

2. Related Work

2.1. Curved Surface Path Generation Method

2.2. Traditional Optimization Algorithms (GA and PSO) and Their Limitations

2.3. Reinforcement Learning Path Optimization Method

3. Surface Path Planning Method for Spray Printing

3.1. Generate Main Path

3.2. Establishment of Robot Motion Accuracy Model

3.3. Markov Decision Process (MDP)

3.4. Reinforcement Learning SAC Algorithm

3.5. Generative Adversarial Imitation Learning

4. Framework for Path Planning of Complex Surface Spray Printing

4.1. Spray Printing Trajectory Generation Scheme

4.2. Spray Printing Trajectory Planning Scheme

4.3. Reinforcement Learning GAIL-SAC

5. Simulation and Experiments

5.1. Experimental Environment Configuration

5.2. Main Path Generation Experiment

5.3. Convergence of GAIL-SAC

5.4. Printing Path Planning Experiment

6. Case Analysis

7. Practical Implications Analysis

8. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI