The analysis in this review is organized into five distinct sections to provide a comprehensive and in-depth assessment of each paper. In
Section 5.1, an examination of the DRL method employed in each paper is presented.
Section 5.2 delves into the formulation of state and action spaces, closely examining the modeling choices, including the dimensionality of the state space and the representation of the action space.
Section 5.3 is dedicated to the analysis of reward functions within the selected papers, including an investigation of their theoretical foundations and implications on the learning process.
Section 5.4 highlights the methodologies employed for training and testing data generation, exploring the simulation processes and, where relevant, the origins of the empirical data. Last,
Section 5.5 synthesizes the key findings and conclusions from each paper, offering a cohesive overview of the collective contributions of the reviewed studies.
5.1. RL Methods
This report focuses on how DRL is used to optimize the dynamic hedging problem. However, there are studies [5,30,31] that do not formally use DRL, i.e., they do not approximate the optimal value function or policy with an NN. Nevertheless, Refs. [5,31] are included in this analysis because they were the first two studies to formulate the dynamic hedging problem with a standard RL framework (state, action and reward). In [
5], the optimal hedging problem is initially framed as a Markov decision process (MDP) that may be solved with model-based DP. Through DP, the study computes the optimal hedge strategy recursively at each time step, with the aid of basis function expansions to approximate the continuous state solution. Ref. [
5] extends this DP result with a more traditional RL approach, relaxing the requirement for a precise model of the environment dynamics. Ref. [
5] treats one output of the Q-function, which can be found using Q-learning, as the option price, while a second output represents the optimal hedge for a given state–action pair. Therefore, the model introduced in [
5], which is called Q-learning Black–Scholes (QLBS), allows the estimated option price and optimal hedge to be reflected with a single model. Ref. [
5] then shows that QLBS can be extended to approximate a parametrized optimal
Q-function by using fitted Q-iteration (FQI). Note that Ref. [
30] expands upon the work of [
5] with numerical experiments. Thus, although Refs. [
5,
30] use function approximation techniques to solve the RL problem, the studies do not use an NN.
Ref. [
31] was the next to frame the dynamic option hedging problem with RL, using SARSA as the RL method. Recall that the SARSA method requires a discrete look-up table for all
Q-values. To allow for a higher-dimensional environment, Ref. [
31] applies a non-linear regression to approximate the Q-function, taking the current state–action pair as its input. Akin to [
30], Ref. [
31] uses a non-NN function approximation method. However, Refs. [
5,
30,
31] are pioneering works regarding the use of RL to optimize the dynamic hedging problem.
Moving now to studies that use DRL, recall that DQN uses NNs to approximate the
Q-function, thereby allowing for a continuous state space. However, DQN requires a max operation over all possible next actions to approximate the
Q-value target, a condition that makes DQN intractable when a continuous action space needs to be searched [
20]. In the context of hedging problems, which ideally require the ability to buy or sell varying amounts of shares depending on the underlying asset price, policy methods, which accommodate continuous action spaces, were more prevalent across the analyzed papers.
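For reference, the bootstrapped target that requires this maximization can be written in standard DQN notation (the symbols below are generic and are not taken from any particular reviewed paper):
$$y_t = r_t + \gamma \max_{a' \in \mathcal{A}} Q\left(s_{t+1}, a'; \theta^-\right),$$
where $\theta^-$ denotes the target network parameters. When $\mathcal{A}$ is a continuum of possible hedge positions, the inner maximization has no closed form and would itself have to be solved numerically at every update, which is why policy-based methods are preferred.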
Ref. [
49] uses both DQN and DDPG and compares the performance of each. The result of this comparison is discussed in detail with the rest of the results in
Section 5.5. Another comparison of value- and policy-based methods is given in [
32], which compares DQN to PPO. Ref. [
32] also looks at a third method, called DQN pop-art, which clips the rewards of DQN to limit large gradient changes. The limitations of DQN are further examined in [
33], which compares DQN to a contextualized multi-arm bandit (CMAB) approach. Ref. [
33] describes that CMAB considers all rewards as being independent and identically distributed for each time step, meaning that the action selection does not impact future rewards.
Ref. [
1] also points to the limitations of DQN in the hedging environment and employs a DDPG formulation with two separate
Q-functions to serve as critics. The first
Q-function evaluates the expected hedging cost, whereas the second
Q-function evaluates the squared hedging cost, which is a measure of the hedging cost variance [
1]. Refs. [
51,
55,
60] also use the DDPG algorithm. Ref. [
60] compares the performance of DDPG to the NN approximation of a model-based DP strategy. Ref. [
59] modifies the DDPG algorithm, allowing the DDPG actor network to not only produce the action but also estimate the associated variance derived from the reward function. The critic network therefore assesses both the anticipated reward and variance for each actor’s output. Consequently, an optimization of the critic network leads to high weights being placed on actions that lead to high expected rewards and low expected variance [
59]. Ref. [
59] calls this modified DDPG method DDPG with uncertainty (DDPG-U).
Ref. [
57] found DDPG to be unstable in the option hedging environment, as it states that DDPG “seemed to either not converge at all or started to learn and then later diverge”. Ref. [
57] may have encountered this poor DDPG performance for various reasons. A lack of convergence may be due to the choice of algorithm hyperparameters (learning rates, discount factor, exploration parameter, batch size and more). The optimal hyperparameters differ for each training set and environment, and poor hyperparameter choices can significantly impact the learning process [
61]. Moreover, the phenomenon of early learning followed by divergence may be an occurrence of catastrophic forgetting, a phenomenon that occurs in non-stationary environments when new experiences encountered by the RL agent erase previously learned optimal actions, as described in [
62]. Further, Fujimoto et al., 2018, describe that the DDPG critic network may overestimate
Q-values, leading to unnecessary exploitation by the actor. Ref. [
63] proposes a twin delayed DDPG (TD3) algorithm that uses a double critic. The idea of a double critic (double Q-learning) for DRL is discussed thoroughly in [
64]. The TD3 update uses the minimum target value of the two critics, and the action range is clipped to smooth the actor’s policy update [
57]. The authors of Ref. [
57] employ the TD3 algorithm in their study.
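As a concrete illustration of the double-critic idea described above, the following minimal Python sketch computes a TD3-style target value for a single transition; the stand-in actor and critic callables and all parameter values are assumptions made for the sketch, not the networks or settings used in [57] or [63].

```python
import numpy as np

def td3_target(reward, next_state, target_actor, target_critic_1, target_critic_2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, a_min=-1.0, a_max=1.0):
    """TD3-style target: clipped-noise target action, then the minimum of two target critics."""
    noise = np.clip(np.random.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = np.clip(target_actor(next_state) + noise, a_min, a_max)
    q1 = target_critic_1(next_state, next_action)
    q2 = target_critic_2(next_state, next_action)
    return reward + gamma * min(q1, q2)  # the smaller estimate limits overestimation

# Stand-in actor and critics, for demonstration only.
actor = lambda s: 0.5 * np.tanh(s.sum())
critic_a = lambda s, a: float(-(a - 0.3) ** 2 + s.mean())
critic_b = lambda s, a: float(-(a - 0.2) ** 2 + s.mean())
y = td3_target(reward=0.1, next_state=np.array([0.0, 1.0]),
               target_actor=actor, target_critic_1=critic_a, target_critic_2=critic_b)
```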
A further extension to DDPG is in [
58], which uses a distributional DDPG variant. Distributional DDPG is akin to the distributional DQN algorithm proposed in [
19], which defines a value distribution $Z(s, a)$ over a discrete set of fixed return atoms to which newly generated values can be allocated. Ref. [65] shows that distributional DQN may be implemented for continuous action spaces with an actor–critic method, wherein the critic estimates $Z(s, a)$ and the actor updates the policy accordingly. For the RL hedging agent, [58] uses quantile regression (QR) to estimate $Z(s, a)$, a distributional DDPG variant first proposed in [
66]. Ref. [
58] makes this choice because two of its risk measures, value at risk (VaR) and conditional value at risk (CVaR), can be efficiently calculated with a quantile-based representation of outputs [
58]. This method is called distributed distributional deep deterministic policy gradient with QR (D4PG-QR). Ref. [
58] was the first to employ D4PG-QR to optimize dynamic hedging.
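To illustrate the quantile-based representation that D4PG-QR relies on, the following minimal Python sketch computes the quantile regression (pinball) loss between a set of predicted return quantiles and sampled target returns; the array sizes and values are illustrative assumptions, and this is a simplification of the losses used in [58,66], not their implementation.

```python
import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples):
    """Pinball (quantile regression) loss between predicted return quantiles and target samples."""
    n = len(pred_quantiles)
    taus = (np.arange(n) + 0.5) / n                         # quantile midpoints tau_i
    u = target_samples[:, None] - pred_quantiles[None, :]   # pairwise errors
    loss = np.where(u >= 0, taus * u, (taus - 1.0) * u)     # asymmetric penalty per quantile
    return loss.mean()

# Illustrative usage with made-up numbers.
pred = np.linspace(-2.0, 2.0, 8)                            # 8 predicted return quantiles
targets = np.random.default_rng(2).normal(0.0, 1.0, size=32)
print(quantile_regression_loss(pred, targets))
```

Risk measures such as VaR and CVaR can then be read off directly from the estimated quantiles, which is the motivation given in [58].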
Another novel method in the DRL for option hedging literature is a TRPO algorithm variant that [
52] proposes. Ref. [
52] allows for a varied risk aversion parameter to be included as an input, allowing the RL agent to be trained to make decisions based on various risk appetites. Adding a risk aversion parameter to the state avoids having to retrain the agent each time the risk aversion level changes. Ref. [
52] calls this TRPO variant trust region volatility optimization (TRVO). Ref. [
56] introduces another method that incorporates variable risk aversion, proposing the construction of a linear combination of the state vector to learn general policies for a given hedging portfolio. Ref. [
56] details that this linear combination is mapped into actions. By including the risk aversion parameter as a state feature, Ref. [
56] explains that the policy is learned with respect to risk appetite. Ref. [
56] uses an NN to optimize the actor and critic outputs.
Other papers that use policy methods are [
50,
53,
54]. Ref. [
54] uses a process wherein the return from an MC episode that follows a certain policy is computed and compared to the current baseline state value function. The policy gradient is then computed based on the comparison, and a new policy is generated [
54]. The state value function is parametrized and trained alongside the policy, which is akin to an actor–critic method [
54]. Ref. [
50] also employs a policy search but does so directly, and the RL agent does not learn the value function at all. Ref. [
50] cites [
9], stating that this direct policy search allows for asymptotic convergence of deterministic policies, adequate exploration and use in continuous action spaces.
A final method used in the analyzed literature is importance weighted actor–learner architecture (IMPALA) [
53]. The IMPALA method was first introduced in [
67], which describes IMPALA as a distributed DRL technique wherein multiple actors operate in parallel. Each actor maintains its own respective value function estimate, and each actor therefore generates a separate trajectory [
67]. The learner aggregates and updates the policy based on the distribution of all actor experiences [
67]. Ref. [
53] cites [
67] in saying that the IMPALA method enables better data efficiency and stability.
5.2. State and Action Spaces
With the RL algorithms for each study now presented, different state and action spaces may be investigated. Regarding the state space formulation for the dynamic hedging problem, all the analyzed studies include the stock price in the state space. Further, all works except [
58] consider the current holding (previous action) in the state space. Ref. [
58] makes no mention of why the current holding is not included. All authors include the time to expiration of the option in the state space except for [
49,
52,
58]. Ref. [
58] includes the BS Vega and Gamma in the state, whereas Refs. [
49,
52] include the BS Delta. As described in
Section 1, the BS Vega, BS Gamma and BS Delta are all BS Greeks, and the BS Greeks are all functions of the BS option price and time [
3]. In addition to [
49,
52], Refs. [51,59,60] also include the BS Delta in their state spaces. Several papers explicitly explain that the BS Delta can be deduced from the option and stock prices, and its inclusion in the state space only serves to unnecessarily augment the input size [
1,
31,
32]. However, Ref. [
49] claims that including as much state information as possible leads to increased robustness. Further, Ref. [
59] expresses that a larger state space leads the agent to learn further complexities inside the training data. Ref. [
60] performs a comparison of DRL agents trained with BS Delta and BS Delta-free state spaces to show that the inclusion of the BS Delta leads to better learning in the dynamic hedging environment.
Refs. [
49,
52,
56,
60] include the option price in the state vector. Option price inclusion is notable in [
49], which also includes volatility, the BS Delta, the stock price and the number of shares. Recall that the BS model option price is dependent only on time, stock price, volatility and strike price [
3]. Therefore, by including the option price in the state, the agent of [
49] can learn the time and strike price indirectly. Concerning volatility, Refs. [
57,
59] place volatility in their respective state spaces. Ref. [
59] includes volatility alongside the BS Delta, stock price and time.
Given that many state space variables for a hedging problem are correlated through the BS model, it is natural to wonder about redundancy and to examine if the state space can be reduced. To examine this analytically, a feature engineering study or sensitivity analysis can be conducted on the input parameters. This is partly discussed in [
59], which creates a relationship matrix between input variables but only includes volatility and the BS Greeks (Theta, Gamma, Vega and Delta). Refs. [
5,
30] transform the volatility, stock price, drift $\mu$ and time into one stationary variable, $X_t$. This transformation is given as $X_t = -\left(\mu - \tfrac{\sigma^2}{2}\right)t + \ln S_t$ [5,30].
Ref. [
56] mentions using news feeds and market sentiment as input features, but it is not explicitly mentioned how this state inclusion can be accomplished. Market sentiment would be an interesting variable for a sensitivity analysis, as one would think that market sentiment and stock price are highly correlated. Although enabling a model to include extra features is by no means a bad idea, it is valuable to consider the tradeoff between information and efficiency. Human traders do consider a vast array of variables when making hedging decisions, but humans do not face the computational burdens of automated agents. Expanding the number of states considered by an RL agent inevitably extends the time required for the agent to explore and compute the true value function or policy for each state [
9].
Looking now toward action spaces, Refs. [
5,
30] formulate a problem in which the agent can act by selecting a discrete number of underlying stocks to buy or sell. A discrete action space is also used in [
31,
32], which both specify the position in terms of option contracts on 100 shares each, and the DRL agent may take a trading action within a corresponding discrete range of share holdings. As [
33] uses a multi-arm bandit approach (CMAB), it uses 51 actions (bandits to select), representing 25 buy actions for the underlying stock, 25 sell actions and 1 hold action. In the second experiment in [
33], short selling is prohibited, and the action space is limited to 26 choices. Refs. [
49,
53,
54] also frame the action at each time step as the buying of some discrete number of shares.
The rest of the analyzed articles use a continuous action space wherein the agent can buy or sell a continuous number of underlying shares. Although many of these continuous action spaces are described in a broad manner, Refs. [
51,
57,
58,
59] all specifically mention that the action space has upper and lower bounds, and the action is a continuous fraction between the maximum and minimum hedge. Ref. [
58] details that the maximum hedge for their agent is determined by ensuring that at least one of the two following conditions is met:
The ratio of the post-hedging BS Gamma to the pre-hedging BS Gamma is within the range of [0, 1].
The ratio of the post-hedging BS Vega to the pre-hedging BS Vega is within the range of [0, 1].
Ref. [
57] uses a clipping formula for its outputted action, as mentioned in the description of the TD3 method in the previous section. Ref. [
57] writes that the action at time step $t$ is obtained by clipping the output of the parametrized policy to the admissible hedge range.
5.3. Reward Formulations
Many of the analyzed works use a mean-variance formulation for the reward function. The mean-variance objective is a core concept of modern portfolio theory developed in [
68]. The goal of a mean-variance reward function is reflective of the name, as the mean-variance objective is to maximize the expected mean reward while minimizing the expected variance of the reward. A mean-variance objective is the core idea of hedging, wherein traders try to limit the potential for large losses while still trying to maximize the potential for gains. A generalized version of the mean-variance reward for a single time step $t$ is given as
$$R_t = \delta w_t - \frac{\lambda}{2}\left(\delta w_t\right)^2,$$
where $\delta w_t$ denotes the change in the value of the hedging portfolio over the step, recalling that $\lambda$ is a measure of risk aversion. The mean-variance reward formulation is used in [
5,
30,
31,
32,
33,
49,
51,
59]. Ref. [
1] uses a variation of the mean-variance reward formulation, letting $C_t$ be the hedging loss from time $t$ onward and defining
$$Y(t) = \mathbb{E}[C_t] + c\sqrt{\mathbb{E}[C_t^2] - \left(\mathbb{E}[C_t]\right)^2}.$$
Ref. [1] trains the DDPG agent to minimize $Y(0)$, which is a minimization of cumulative hedging losses for an entire episode. Recall that Ref. [1] uses DDPG with two separate Q-function critics. The first critic evaluates the expected cost $\mathbb{E}[C_t]$, whereas the second critic evaluates the expected squared cost $\mathbb{E}[C_t^2]$, from which the hedging cost variance is recovered [1]. Refs. [
52,
54,
56,
57] all use the incremental wealth ($\delta w_t$) as the reward at each time step and do not consider the variance.
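As a concrete illustration of the per-step rewards discussed above, the following minimal Python sketch expresses an incremental-wealth reward and a mean-variance reward of the generalized form given earlier; the function names and numerical values are assumptions made for the sketch rather than definitions taken from any reviewed paper.

```python
def incremental_wealth(portfolio_value_now, portfolio_value_prev):
    """Incremental wealth: the change in hedging portfolio value over one step."""
    return portfolio_value_now - portfolio_value_prev

def mean_variance_reward(delta_w, risk_aversion):
    """Per-step mean-variance reward: reward the wealth change and penalize its square."""
    return delta_w - 0.5 * risk_aversion * delta_w ** 2

# Illustrative usage with made-up portfolio values and risk aversion.
dw = incremental_wealth(101.3, 100.0)
reward = mean_variance_reward(dw, risk_aversion=0.1)
```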
Refs. [
53,
55] both look ahead to the end of an episode before computing the reward. In [
55], the episode reward is the difference between the cash held at expiry and the payoff of the short position in the call option (the agent is short the call, so the loss at expiry is $\max(S_T - K, 0)$, where $K$ is the strike and $S_T$ is the terminal stock price). In [
53], entire episodes are simulated, and the hedging agent is rewarded a +1 if the profit is positive and −1 if the profit is negative. Ref. [
50] also waits to the end of the episode to compute the reward. The reward in [
50] is the CVaR of the payoff on the call option that the agent has shorted plus the sum of the profits generated over the episode. CVaR and VaR are risk measures [
4]. VaR finds a single point estimate of the maximum potential loss given a threshold (confidence level or risk aversion level) [
4], whereas CVaR quantifies the expected loss beyond the single point estimate found with VaR [
69]. Ref. [
58] also incorporates CVaR and VaR into the reward, and the distributional nature of these risk measures is one of the underlying reasons behind its choice of a distributional DRL algorithm (D4PG-QR). Specifically, Ref. [58] looks at a 30-day trading period and tries to minimize the loss plus 1.645 times the standard deviation, while also attempting to minimize the VaR and CVaR at 95% confidence levels. Note that 1.645 is approximately the 95% quantile of the standard normal distribution ($\Phi^{-1}(0.95) \approx 1.645$), which explains the significance of the standard deviation multiplier in the reward function used in [58].
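As a concrete illustration of these two risk measures, the following minimal Python sketch estimates VaR and CVaR at a 95% confidence level from a synthetic sample of hedging losses; the empirical estimator and the loss distribution are assumptions for illustration, not the procedure of [50] or [58]. For normally distributed losses, the 95% VaR estimated this way lies approximately 1.645 standard deviations above the mean loss, consistent with the standard deviation multiplier above.

```python
import numpy as np

def var_cvar(losses, confidence=0.95):
    """Empirical VaR and CVaR of a loss sample (larger values mean larger losses)."""
    var = np.quantile(losses, confidence)    # point estimate of the loss at the given threshold
    cvar = losses[losses >= var].mean()      # expected loss beyond the VaR estimate
    return var, cvar

# Synthetic hedging losses for illustration only.
losses = np.random.default_rng(3).normal(loc=1.0, scale=2.0, size=10_000)
var95, cvar95 = var_cvar(losses, confidence=0.95)
```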
5.4. Data Generation Processes
Once the RL algorithm, state space, action space and reward formulation are all configured, the RL agent needs data for both training and testing. In the dynamic hedging environment, data stem from the underlying asset price process. The asset price dynamics, $S_t$, for a continuous GBM process are given by
$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t.$$
This equation may be discretized to obtain a simulation model of the asset price process in discrete time:
$$S_{t+\Delta t} = S_t \exp\left[\left(\mu - \tfrac{1}{2}\sigma^2\right)\Delta t + \sigma\sqrt{\Delta t}\,Z_t\right], \quad Z_t \sim \mathcal{N}(0, 1).$$
The discretization size is given by $\Delta t = T/N$, where $T$ is the time to maturity and $N$ is the number of discretization steps. For each newly generated stock price, the BS model may be used to compute the BS option price and the BS Greeks, providing the hedging agent with all requisite information at every time step. The generation of stock paths using GBM is straightforward, and due to its pivotal role as the underlying process for BS option price and Delta calculations, GBM is employed for training and testing data generation for at least one experiment in nearly every analyzed study, with only three exceptions [
53,
56,
58]. As is discussed in
Section 5.5, the first experimental step in most papers is to train a DRL agent using GBM stock paths. For testing, more GBM stock paths are then generated to compare the hedging performance of the proposed RL agent against the BS Delta hedging strategy. If no market frictions are considered and the option is European, outperforming the BS Delta hedge in the limit of infinite testing episodes is theoretically impossible. Therefore, comparing the DRL agent to a BS Delta strategy under no market frictions serves as a check that the DRL agent has learned a theoretically viable strategy.
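To make this data generation step concrete, the following minimal Python sketch simulates one discretized GBM path and computes the BS price and Delta along it; the parameter values (drift, volatility, strike, maturity, number of steps) are illustrative assumptions rather than settings taken from any reviewed study.

```python
import numpy as np
from scipy.stats import norm

def simulate_gbm_path(s0, mu, sigma, T, n_steps, rng):
    """Simulate one discretized GBM stock path S_0, ..., S_N."""
    dt = T / n_steps
    z = rng.standard_normal(n_steps)
    increments = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.concatenate(([0.0], np.cumsum(increments))))

def bs_call_price_and_delta(s, K, tau, r, sigma):
    """Black-Scholes price and Delta of a European call with time to maturity tau > 0."""
    d1 = (np.log(s / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    price = s * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2)
    return price, norm.cdf(d1)  # the call Delta is N(d1)

# Illustrative parameters (assumed, not taken from any specific paper).
rng = np.random.default_rng(0)
path = simulate_gbm_path(s0=100.0, mu=0.05, sigma=0.2, T=0.25, n_steps=63, rng=rng)
times_to_maturity = np.linspace(0.25, 0.0, 64)
for s, tau in zip(path[:-1], times_to_maturity[:-1]):  # skip expiry to avoid tau = 0
    price, delta = bs_call_price_and_delta(s, K=100.0, tau=tau, r=0.0, sigma=0.2)
```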
As discussed in
Section 1, the assumption of constant volatility in GBM processes is not reflective of a true market environment [
8]. Ref. [
70] introduces the SABR (stochastic alpha, beta, rho) model, which accounts for non-constant volatility by modeling the volatility as a stochastic process. Ref. [
1] uses the SABR model for training and testing data generation after completing its first experiment with a GBM data process for training and testing. Ref. [
58] uses only SABR to generate data. Ref. [
58] does not explain why it does not perform an experiment with an agent trained on GBM data. Ref. [
71] introduces a Delta hedging strategy for the SABR model (Bartlett Delta hedging) that accounts for both the change in the asset price and the anticipated accompanying change in implied volatility. Thus, once Ref. [
1] trains its agent on SABR data in its second experiment, it compares the agent’s test performance to both the BS Delta and the Bartlett Delta strategies using testing data generated with an SABR model.
Another common stochastic volatility model is the Heston [
72] model. In the Heston model, volatility undergoes a stochastic evolution influenced by various parameters, including the volatility of volatility, the ongoing estimation of asset price variance and a mean-reversion rate for this estimation [
72]. To train DRL agents for an environment with stochastic volatility, the Heston model is used to generate training data in [
49,
54,
57,
59]. Note that, to formulate a BS-based hedging strategy for an option written on an underlying Heston model path, the BS Greeks can be computed at each time step by using the square root of the variance process as the implied volatility [
56].
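For completeness, the following minimal Python sketch simulates one Heston asset-price and variance path with a simple Euler (full-truncation) scheme; the scheme and all parameter values are illustrative assumptions, not the simulation setup of any reviewed paper. The square root of the simulated variance can then be used as the implied volatility for BS-based Greeks, as noted in [56].

```python
import numpy as np

def simulate_heston_path(s0, v0, mu, kappa, theta, xi, rho, T, n_steps, rng):
    """Euler (full-truncation) simulation of one Heston asset-price/variance path."""
    dt = T / n_steps
    s, v = np.empty(n_steps + 1), np.empty(n_steps + 1)
    s[0], v[0] = s0, v0
    for i in range(n_steps):
        z1 = rng.standard_normal()
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()  # correlated shocks
        v_pos = max(v[i], 0.0)  # truncation keeps the variance used in the diffusion non-negative
        s[i + 1] = s[i] * np.exp((mu - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1)
        v[i + 1] = v[i] + kappa * (theta - v_pos) * dt + xi * np.sqrt(v_pos * dt) * z2
    return s, v

# Illustrative parameters (assumed for the sketch).
rng = np.random.default_rng(1)
prices, variances = simulate_heston_path(s0=100.0, v0=0.04, mu=0.05, kappa=2.0, theta=0.04,
                                         xi=0.3, rho=-0.7, T=0.25, n_steps=63, rng=rng)
implied_vols = np.sqrt(np.maximum(variances, 0.0))  # used for BS Greeks as in [56]
```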
Although Refs. [
49,
54,
57,
59] train and test their agents on simulation data, each of these studies performs an additional sim-to-real test in which it tests the simulation-trained DRL agent on empirical data. Ref. [
49] uses end-of-day S&P500 asset prices from 2019–2022, omitting 2020 due to the COVID-19 pandemic. Ref. [
49] also collects option prices for each day on each of the underlying assets. The data in [
49] are from ivolatility (
https://www.ivolatility.com). For its sim-to-real test, Ref. [
54] uses S&P500 index prices from 1 January 2016 to 31 December 2019 and uses Wharton Research Data Services (WRDS) to find the prices of a set of European call options written on the S&P500 index (strike price of USD 1800.00, various maturities). Ref. [
59] uses WRDS to obtain S&P500 index data, uses this index as the underlying asset price and uses 59,550 options with different strikes and maturities. The dataset used in [
59] was originally created in [
73]. Ref. [
57] uses S&P index intraday options from 3 January 2006 to 31 December 2013, totaling 2009 trading days. Ref. [
57] writes that the P&L calculation required for the agent’s reward function is computed using the mid-price of the option quote. Ref. [
57] uses data obtained from the Chicago Board of Options Exchange (CBOE) LiveVol data shop. In addition to using empirical data for a sim-to-real test, [
57] trains its DRL agent using the empirical dataset from LiveVol.
Others that train a DRL agent with empirical data are [
51,
53]. Ref. [
53] uses stock price data from the Ho Chi Minh Stock Exchange (HSX) and uses the Ha Noi Stock Exchange (HNX) for option prices. The data used in [
53] are from 25 September 2017 to 21 May 2020 (17 May 2019 to 21 May 2020 for testing). Ref. [
51] uses S&P500, S&P100 and DJIA price paths from years 2004–2020, obtained from the CBOE data shop. Ref. [
51] performs a random 70–30 train–test split on the data. A final data generation process is considered in [
50]. Ref. [
50] trains and tests the DRL agent using data generated with a time series GAN. A GAN is an ML technique designed to learn the distribution of a collection of training examples in order to reproduce the given distribution [
74]. The key results of all 17 analyzed works are presented in the next section.
5.5. Comparison of Results
With the complete learning process for each RL study now outlined, the results of these works may be analyzed. Ref. [
30] performs numerical experiments for prior work, [
5]. Ref. [
30] considers no market frictions, and the QLBS model is tested with a fixed risk aversion parameter $\lambda$ (used in the mean-variance reward formulation). GBM stock price paths are generated for training, and actions are multiplied by a uniform random variable in the range $[1-\eta, 1+\eta]$ for a chosen noise level $\eta$ [
30]. When compared to the BS Delta hedging strategy, Ref. [
30] shows that most actions are optimal, even with noise. Ref. [
30] concludes that the agent can learn from random actions and potentially even learn the BS model itself.
The analysis put forth in [
31] is the first to consider transaction costs in the hedging environment [31]; it models the cost of crossing the bid–ask spread for a given tick size, formulating the transaction cost of trading $n$ shares as $\mathrm{cost}(n) = 0.1 \times \left(|n| + 0.01n^2\right)$, where 0.1 is the tick size and $n$ is the number of purchased shares. Note that the market impact is also considered in [31], as it adds the quadratic term $0.01n^2$ to the cost function. After comparing a transaction-cost-free DRL model with the BS Delta using a GBM simulation to show that the agent can learn to effectively hedge, Ref. [
31] shows that, in the presence of transaction costs, the transitions between hedging actions are smaller for the RL agent than the BS Delta strategy. The smoother path of DRL actions, i.e., not buying (or selling) many shares in one time step due to a large change in stock price, reflects the RL agent’s cost consciousness [
31]. Ref. [
32] finds a similar result to [
31] in that its DRL agent is much less aggressive than the BS Delta strategy when hedging in an environment with transaction costs. This result found in [
32] is confirmed with kernel density plots of volatility, t-statistics and total cost. Recall that [
32] also compares the performance of three algorithms, DQN, DQN with pop-art (clipped actions) and PPO. Ref. [
32] shows that the policy-based PPO algorithm converges much faster in all tests, whereas DQN with pop-art displays a more stable convergence profile than DQN.
The results of [
52] are similar to [
31,
32]. Ref. [
52] shows that the policies learned by its TRVO DRL agent are robust across multiple risk aversion parameters. Ref. [
56] also incorporates multiple risk aversion parameters into its model and finds similar results to [
52]. Ref. [
56] shows that, when risk aversion is low, the agent prefers to leave positions unhedged, leading to reduced transaction costs. When the risk aversion is high, Ref. [
56] shows that the agent prefers to hedge often, but like [
31], the DRL agent action path is smoother and has fewer large hedging position changes than the BS Delta path, reflecting the cost consciousness of the RL agent. Ref. [
51] draws similar conclusions to [
55]. Ref. [
59] models the environment with proportional transaction costs and simulates the performance of both DDPG and DDPG-U DRL agents trained using GBM price paths. Ref. [
59] finds that, although both DRL agents outperform the BS Delta strategy, the DDPG-U method outperforms DDPG by achieving a lower expected cost and a similar variance.
Ref. [
1] also finds that its trained DRL agent is cost-conscious, outperforming the BS Delta strategy with GBM training and testing data when transaction costs are considered. The key result of [
1] is that, when the current holding is close to the new required BS holding, the RL agent action transition closely matches that of a BS Delta hedge. However, if the new BS holding is much higher than the current one, the DRL agent prefers to under-hedge relative to the BS Delta (and vice versa for over-hedging). Ref. [
1] also shows that, as the rebalance frequency increases, the enhancement of the DRL agent over the BS Delta becomes more pronounced. The BS Delta method is not cost-conscious like the DRL agent, and transaction costs accumulate as the position is rebalanced more frequently [
1]. Ref. [
1] then performs a test with stochastic volatility using the SABR model, and the DRL agent trained on SABR data once again outperforms the BS Delta and Bartlett Delta strategies.
Ref. [
49] conducts a comparison between DQN and DDPG and shows that employing discrete actions with DQN leads to reduced P&L variance but also introduces rounding errors, resulting in a lower mean P&L compared to DDPG. Ref. [
49] then shows that the DDPG agent trained on GBM data outperforms the BS Delta strategy when transaction fees are considered. Ref. [
49] repeats the DDPG training and testing process with Heston model data, using a Wilmott Delta hedging strategy as a benchmark. A Wilmott Delta strategy is akin to BS Delta hedging, except that the strategy defines a non-constant no-transaction region to account for transaction costs [
75]. Ref. [
49] shows that its DRL agent outperforms the Wilmott Delta strategy. To test the robustness of the GBM-trained DRL agent and the Heston-model-trained DRL agent, Ref. [
49] performs sim-to-real tests to evaluate the DRL agents’ performance on empirical test data. In this sim-to-real test, Ref. [
49] finds that both DRL agents exhibit significantly worse performance on empirical test data compared to when the DRL agents are tested using data that align with the training model. To improve the robustness of its agents, Ref. [
49] incorporates a domain randomization technique. Domain randomization results in different training environments, and agents can learn more generalized policies [
49]. However, Ref. [
49] concludes that the DRL agents trained with domain randomization show no significant improvements when tested on empirical data.
Ref. [
59] also performs a sim-to-real experiment, training a DRL with data from a GBM process and testing the agent on real data. Ref. [
59] concludes that its DRL agent shows the desired robustness and matches its simulated results, wherein the DRL agent learns a cost-efficient policy and achieves the desired outcome of low cost and low volatility. Ref. [
54] finds similar results to [
59] in its sim-to-real and stochastic volatility tests. Ref. [
57] also uses empirical data for its study, and it finds that its DRL agent (TD3) outperforms a BS Delta strategy in terms of expected cost and variance for both constant and stochastic volatility. Ref. [
57] concludes that the DRL agent is superior to the BS Delta strategy in both a sim-to-real test and a test wherein the DRL agent is trained on empirical data. Note that, based on the information in the respective papers, it is difficult to reason why [
54,
57,
59] obtain preferable results in sim-to-real tests, whereas [
49] does not. In the field of RL for dynamic hedging, the absence of a universally acknowledged benchmark or standardized testing dataset complicates the analysis of variations in DRL agent performance across different papers. For example, in the field of image classification, the performance of separate AI algorithms can be easily compared using the MNIST dataset. There is no such comparison method in the field of dynamic hedging with RL, making for a difficult comparison of algorithms put forth by separate parties.
In addition to [
49] showing that the hedging performance of a DQN agent is bested by a DDPG agent, Ref. [
32] shows that PPO outperforms DQN. Further, Ref. [
33] compares a CMAB method to DQN with GBM simulated data and shows that CMAB has faster convergence, higher data efficiency and better hedging results than DQN, both when transaction costs are absent and when they are incorporated. Ref. [
60] compares the performance of DDPG and an NN-approximated DP solution to the optimal hedging strategy. Using GBM simulated data, Ref. [
60] concludes that, as the time to maturity, volatility and risk aversion increase, the DRL solution outperforms the approximated DP. This is an important result, as approximate DP solutions are prevalent in the dynamic option hedging field, and those using model-free RL argue that model-based approximate DP solutions can lead to errors stemming from the calibration of model parameters [
54].
The results of [
58], which trains and tests its D4PG-QR agent on data from an SABR model process, show that, in the presence of increased transaction costs, the DRL agent learns to decrease the BS Gamma hedge. The DRL agent in [
58] beats the BS Delta, the BS Delta–Gamma and the BS Delta–Vega strategies in both expected cost and VaR (95%) when volatility is both constant and stochastic. Note that Delta–Gamma hedging involves taking a position that makes the hedging portfolio both Delta- and Gamma-neutral [
4]. Similarly, a Delta–Vega hedge aims to make the hedging portfolio Delta- and Vega-neutral [
4].
Other works with interesting results are [
51,
53], which use empirical data for training. Ref. [
51] shows that a DDPG agent outperforms a BS Delta strategy in an empirical testing setting. Ref. [
51] also shows that the BS Delta is bested by the DDPG agent when hedging empirical American options. However, one should note that the BS Delta strategy is derived without the consideration of early exercise [
4]. Ref. [
53] shows that its DRL agent yields a profit higher than the market return on the empirical testing dataset. A final notable result comes from [
50], which shows that an agent trained on data produced by GANs achieves a higher profit than a DRL agent trained on GBM data. This represents a potential future direction for the literature, as GANs may provide a closer representation of real data than GBM processes [
50].