Deep Learnable Strategy Templates for Multi-Issue Bilateral Negotiation

Pallavi Bagga, Royal Holloway, University of London, Egham, United Kingdom, pallavi.bagga@rhul.ac.uk
Nicola Paoletti, Royal Holloway, University of London, Egham, United Kingdom, nicola.paoletti@rhul.ac.uk
Kostas Stathis, Royal Holloway, University of London, Egham, United Kingdom, kostas.stathis@rhul.ac.uk

arXiv:2201.02455v1 [cs.MA] 7 Jan 2022

ABSTRACT
We study how to exploit the notion of strategy templates to learn strategies for multi-issue bilateral negotiation. Each strategy template consists of a set of interpretable, parameterized tactics that are used to decide an optimal action at any time. We use deep reinforcement learning, through an actor-critic architecture, to estimate the tactic parameter values for a threshold utility, for when to accept an offer, and for how to generate a new bid. This contrasts with existing work that only estimates the threshold utility for those tactics. We pre-train the strategy by supervision from a dataset collected using "teacher strategies", thereby decreasing the exploration time required for learning during negotiation. As a result, we build automated agents for multi-issue negotiations that can adapt to different negotiation domains without the need to be pre-programmed. We show empirically that our approach outperforms the state of the art in terms of both individual and social efficiency.

KEYWORDS
Multi-Issue Negotiation, Deep Reinforcement Learning, Bilateral Automated Negotiation, Interpretable Negotiation Strategies

1 INTRODUCTION
We are concerned with the problem of modelling a self-interested agent that negotiates with an opponent over multiple issues while learning to optimally adapt its strategy. For instance, an agent trying to buy a laptop settles the price of the laptop on behalf of its owner based on a number of other issues, such as laptop type, delivery time, payment method and delivery location [18]. For realistic and complex environments, we assume that our agent has no previous knowledge of the opponent's preferences or of its negotiating characteristics [5]. Moreover, the utility of offers exchanged during the negotiation decreases over time (in negotiation scenarios with a discount factor); thus, timely decisions on rejecting or accepting an offer, and making acceptable offers, are essential [16]. Furthermore, in a multi-issue negotiation there are likely to be a number of different offers at any given utility level. Since they all yield the same utility, our agent is indifferent between these offers, so a further challenge is to select the offer that maximizes the utility to the opponent while maintaining our desired utility level (i.e., to aim for a "win-win" solution) [32].

Existing work addressing the above challenges follows four main approaches. (a) Hand-crafted predefined heuristics: these have been proposed in a number of settings with competitive results [13], and although interpretable (e.g., [1, 2]), they are often characterized by ad-hoc parameter/weight settings that are difficult to adapt to different domains. (b) Meta-heuristic (or evolutionary) methods: these work well across domains and improve iteratively using a fitness function as a guide to quality; however, every agent decision must be delivered by the meta-heuristic, which is inefficient and does not result in a human-interpretable and reusable negotiation strategy.
(c) Machine learning algorithms: these show the best results with respect to run-time adaptability [8, 27], but their working hypotheses are often not interpretable, which may hinder their eventual adoption by users due to the lack of transparency in the decision-making they offer. (d) Interpretable strategy templates: developed in [10] to guide the use of a series of tactics whose optimal use can be learned during negotiation. The structure of such templates depends on a number of learnable choice parameters, which determine which acceptance and bidding tactic to employ at any particular time during negotiation. As these tactics represent hypotheses to be tested, defined by the agent developer, they can be explained to a user, and can in turn depend on learnable parameters. The outcome of this work is an agent model that formulates a strategy template for bid acceptance and generation, so that an agent using it can make optimal decisions about the choice of tactics while negotiating in different domains [10].

The benefit of (d) is that it can combine (a), (b) and (c) by using heuristics for the components of the template and meta-heuristics or machine learning for evaluating the choice parameter values of these components. The problem with (d), however, is that the choice parameters of the components of the acceptance and bidding templates are learned once (during training) and used in all the different negotiation settings (during testing) [10]. This one-size-fits-all choice of tactics does not accumulate learning experience and may be unsuitable for unknown domains or unknown opponents. In other words, the mechanism for learning the choice parameter values in [10] abstracts away from what is learned in a specific domain once the negotiation has finished, and therefore cannot transfer it to new domains or unseen opponents.

To address this limitation of (d), we propose using Deep Reinforcement Learning (DRL) to estimate the choice parameter values of the components in strategy templates. We call the resulting interpretable strategy templates "Deep Learnable Strategy Templates" (DLSTs). Our contribution is an experimental study of DLSTs showing that agents employing them learn parameter values from and across negotiation experiences, and are hence capable of transferring knowledge from one domain to another, or of using the experience gained against one opponent against a different one. This approach leads to adaptive and generalizable strategy templates. We also perform extensive evaluation experiments based on the ANAC tournaments [22] against agents with learning capabilities (readily available in GENIUS [33]) in a variety of domains with different sizes and competitiveness levels [33], each with two different profiles. The agents used for comparison span a wide range of strategies and techniques (e.g., AgreeableAgent2018 uses frequency-based opponent modelling, AgentHerb logistic regression, SAGA a genetic algorithm (GA), KakeSoba tabu search, Rubick Gaussian distributions, and Caduceus2016 a mixture of GA, algorithm portfolio and experts). Empirically, the DLST-based agent negotiation model outperforms existing strategies in terms of individual as well as social welfare utilities.

The remainder of the paper is organized as follows. In Section 2, we discuss previous work related to learning-based multi-issue negotiation. In Section 3, we describe the negotiation settings considered in this paper. In Section 4, the proposed DLST-based negotiation model is introduced, followed by the methods used in Section 5. In Section 6, we experimentally evaluate the performance of the proposed model.
We conclude in Section 7, where we also outline an open problem worth pursuing in the future as a result of this work.

2 RELATED WORK
Existing approaches based on reinforcement learning have focused on methods such as tabular Q-learning for bidding [12] and for finding the optimal concession [34, 35], or DQN for bid acceptance [27], which are not well suited to continuous action spaces. Such spaces, however, are the main focus of this work, since we estimate the threshold target utility value below which no bid is accepted from, or proposed to, the opponent agent. Moreover, to perform this effectively, these agents are required to conclude many prior negotiations with an opponent in order to learn the opponent's behaviour. Consequently, their approach, and reinforcement learning in general, is not appropriate for one-off negotiation with an unknown opponent. The recently proposed adaptive negotiation model in [8, 9] uses DRL for continuous action spaces, but its motivation is significantly different from ours. In our work, the agent predicts the tactic choices for the acceptance and bidding strategies at any particular time, as well as learning the threshold utility used within one of those tactics, whereas [8, 9] use DRL for a complete agent strategy while negotiating with multiple sellers concurrently in e-market-like scenarios. Moreover, we focus on building generalized, decoupled and interpretable decision components, i.e., separate acceptance and bidding strategies are learned based on interpretable templates containing different tactics to be employed at different times in different domains. Other closely related multi-issue DRL-based negotiation work appears in [10, 11]. Unlike [10, 11], which use meta-heuristic optimization to learn the strategy parameter values once and then use them in all negotiation settings, we use DRL, and the strategy parameter values may differ across negotiation settings. Also, unlike [10, 11], we abstract away from handling user preference uncertainty and from generating near-Pareto-optimal bids under preference uncertainty.

3 NEGOTIATION SETTINGS
As in [10], we assume that our negotiation environment E consists of two agents A_u and A_o negotiating with each other over some domain D. A domain D consists of n different independent issues, D = (I_1, I_2, ..., I_n), with each issue taking a finite set of k possible discrete or continuous values, I_i = (v_1^i, ..., v_k^i). In our experiments, we consider issues with discrete values. An agent's bid ω is a mapping from each issue to a chosen value (denoted by c_i for the i-th issue), i.e., ω = (v_{c_1}^1, ..., v_{c_n}^n). The set of all possible bids or outcomes is called the outcome space Ω, with ω ∈ Ω. The outcome space is common knowledge to the negotiating parties and stays fixed during a single negotiation session.

Negotiation protocol. Before the agents can begin the negotiation and exchange bids, they must agree on a negotiation protocol P, which determines the valid moves agents can take at any state of the negotiation [17]. Here, we consider the alternating offers protocol [28], with possible Actions = {offer(ω), accept, reject}.
One of the agents (say A_u) starts the negotiation by making an offer x_{A_u→A_o} to the other agent (say A_o). Agent A_o can either accept or reject the offer. If it accepts, the negotiation ends with an agreement; otherwise A_o makes a counter-offer to A_u. This exchange of offers continues until one of the agents accepts an offer (a successful negotiation) or the deadline is reached (a failed negotiation).

Time constraints. We impose a real-time deadline t_end on the negotiation process for both theoretical and practical reasons. The pragmatic reason is that without a deadline the negotiation might go on forever, especially in the absence of a discount factor. Secondly, with unlimited time an agent may simply try a huge number of proposals to learn the opponent's preferences [7]. However, a real-time deadline poses several challenges: agents should be more willing to concede near the deadline, as a break-off yields zero utility (or the reservation utility, if any) for both agents; a real-time deadline also makes it necessary to employ a strategy for deciding when to accept an offer; and deciding when to accept involves predicting whether or not a significantly better opportunity might occur in the future. Moreover, we assume that negotiations are sensitive to time, i.e., time impacts the utilities of the negotiating parties; in other words, the value of an agreement decreases over time.

Negotiation session. Formally, for each negotiation session between two agents A_u and A_o, let x^t_{A_u→A_o} ∈ Actions denote the offer action proposed by agent A_u to agent A_o at time t. A negotiation history H^t_{A_u↔A_o} between agents A_u and A_o up to time t can be represented as in (1):

H^t_{A_u↔A_o} := (x^{t_1}_{p_1→p_2}, x^{t_2}_{p_3→p_4}, ..., x^{t_n}_{p_n→p_{n+1}})   (1)

where t_n ≤ t and the negotiation actions are ordered over time. Also, p_j = p_{j+2}, i.e., the negotiation process strictly follows the alternating-offers protocol. Given a negotiation thread between agents A_u and A_o, the action performed by A_u at time t' after receiving an offer x_{A_o→A_u} at time t from A_o can be any member of the set Actions if t' < t_end, i.e., if the negotiation deadline has not been reached. Furthermore, we assume boundedly rational agents: given the limited time, information privacy and limited computational resources, agents cannot compute the optimal strategy to be carried out during the negotiation.

Figure 1: Interaction between the components of the DLST-based agent negotiation model.

Utility. We assume that each negotiating agent has its own private preference profile, which describes how bids are preferred over other bids. This profile is given in terms of a utility function U, defined as a weighted sum of evaluation functions e_i(v^i_{c_i}), as shown in (2). Each issue is evaluated separately and contributes linearly, without depending on the values of the other issues; hence U is referred to as a linear additive utility space. Here, w_i are the normalized weights indicating the importance of each issue to the user, and e_i(v^i_{c_i}) is an evaluation function that maps the value v^i_{c_i} of the i-th issue to a utility.

U(ω) = U(v^1_{c_1}, ..., v^n_{c_n}) = Σ_{i=1}^{n} w_i · e_i(v^i_{c_i}),  where Σ_{i=1}^{n} w_i = 1   (2)

Whenever the negotiation terminates without an agreement, each negotiating party receives its corresponding utility based on the private reservation value u_res (the reservation value is the minimum acceptable utility for an agent; it may vary across parties and domains, but in our settings it is the same for both parties). If the negotiation terminates with an agreement, each agent receives the discounted utility of the agreed bid, i.e., U^d(ω) = U(ω) · d_D^t, where d_D is a discount factor in the interval [0, 1] and t ∈ [0, 1] is the current normalized time.
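To make the linear additive utility in (2) and its discounted variant U^d concrete, here is a minimal Python sketch; the issue names, weights and evaluation tables are illustrative placeholders, not the paper's domains.

```python
# Minimal sketch of a linear additive utility space (illustrative values only).
weights = {"price": 0.5, "delivery": 0.3, "payment": 0.2}        # w_i, summing to 1
evaluations = {                                                   # e_i(v) in [0, 1]
    "price":    {"low": 1.0, "medium": 0.6, "high": 0.2},
    "delivery": {"1 day": 1.0, "1 week": 0.5},
    "payment":  {"cash": 0.8, "card": 1.0},
}

def utility(bid):
    """U(omega) = sum_i w_i * e_i(v_{c_i}), as in Eq. (2). A bid maps issues to values."""
    return sum(weights[i] * evaluations[i][v] for i, v in bid.items())

def discounted_utility(bid, t, d_D=1.0):
    """U^d(omega) = U(omega) * d_D^t, with t in [0, 1] the normalised time."""
    return utility(bid) * (d_D ** t)

bid = {"price": "medium", "delivery": "1 day", "payment": "card"}
print(utility(bid))                          # 0.5*0.6 + 0.3*1.0 + 0.2*1.0 = 0.8
print(discounted_utility(bid, t=0.5, d_D=0.9))
```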
4 DLST-BASED NEGOTIATION MODEL
When building a negotiation agent, we normally consider three phases: the pre-negotiation phase (estimation of the agent owner's preferences, preference elicitation), the negotiation phase (offer generation, opponent modelling) and the post-negotiation phase (assessing the optimality of offers) [23]. In this paper, we are interested in the second phase, which involves a Decide component for choosing an optimal action a_t given the negotiation history H^{t-1}_{A_o→A_u}. As in [10], we assume that our agent A_u is situated in an environment E (containing the opponent agent A_o) where, at any time t, A_u senses the current state S_t of E and represents it as a set of internal attributes, as shown in Figure 1; this component is implicit in [10]. For the estimation of the threshold utility, the set of state attributes includes information derived from the sequence of previous bids offered by A_o (e.g., the utility of the most recently received opponent bid ω_t^o, the utility of the best opponent bid so far O_best, the average utility of all opponent bids O_avg and their variability O_sd), information stored in A_u's knowledge base (e.g., the number of bids B in the given partial order, d_D, u_res, Ω, and n), and the current negotiation time t. This internal state representation, denoted by s_t, is used by the agent (in the acceptance and bidding strategies) to decide which action a_t to execute from the set Actions, based on the negotiation protocol P at time t. Executing the action then changes the state of the environment to S_{t+1}. The state s_t for the acceptance strategy involves the following attributes in addition to those above: the fixed target utility u, the dynamic and learnable target utility ū_t, U(ω), the quantile level q (which changes with time t), and Q_{Û(Ω_t^o)}(q). The state s_t for the bidding strategy involves, in addition to the state attributes used for estimating the dynamic threshold utility, the following attributes: b^Boulware, the Pareto-optimal bid set PS, b^opp(ω_t^o), and U(Ω_{≥ū_t}) (as discussed in the subsequent section).

The action a_t is derived via two functions, f_a and f_b, for the acceptance and bidding strategies, respectively, as in [10]. The function f_a takes as inputs s_t, a dynamic threshold utility ū_t (defined later in the Methods section) and the sequence of past opponent bids Ω_t^o, and outputs a discrete action a_t, either accept or reject. When f_a returns reject, f_b computes what to bid next, with inputs s_t and ū_t; see (3)–(4). This separation of acceptance and bidding strategies is not rare; see for instance [6]. Both f_a and f_b consist of a set of tactics as defined in [10].

f_a(s_t, ū_t, Ω_t^o) = a_t,  a_t ∈ {accept, reject}   (3)
f_b(s_t, ū_t, Ω_t^o) = a_t,  a_t ∈ {offer(ω), ω ∈ Ω}   (4)

We assume incomplete information about the opponent's preferences; therefore, Decide uses the estimated model Û_o. In particular, Û_o is estimated at time t using information from Ω_t^o; see the Methods section for more details. Unlike [10], we employ DRL in the acceptance strategy template and the bidding strategy template, in addition to the threshold utility (represented by the three green boxes in Figure 1), within the Decide component.
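Putting (3)–(4) together, the following is a minimal sketch of one Decide step: f_a first decides whether to accept the opponent's last offer, and only if it rejects does f_b generate a counter-offer. The toy stand-ins for f_a and f_b and the state attributes are illustrative, not the paper's learned templates.

```python
def decide(state, u_bar_t, opponent_bids, f_a, f_b):
    """One Decide step following Eqs. (3)-(4): f_a chooses accept/reject on the
    opponent's last offer; only on reject does f_b produce a counter-offer."""
    if f_a(state, u_bar_t, opponent_bids) == "accept":
        return ("accept", None)
    return ("offer", f_b(state, u_bar_t, opponent_bids))

# Illustrative stand-ins for the learned strategies (not the paper's templates):
def toy_f_a(state, u_bar_t, opponent_bids):
    # accept iff the utility of the last received bid clears the dynamic threshold
    return "accept" if state["last_offer_utility"] >= u_bar_t else "reject"

def toy_f_b(state, u_bar_t, opponent_bids):
    # counter with the stored own bid whose utility is closest to (but above) u_bar_t
    above = [b for b in state["own_bids"] if b["utility"] >= u_bar_t]
    return min(above, key=lambda b: b["utility"])

state = {"last_offer_utility": 0.55,
         "own_bids": [{"issues": {"price": "high"}, "utility": 0.9},
                      {"issues": {"price": "medium"}, "utility": 0.7}]}
print(decide(state, 0.7, [], toy_f_a, toy_f_b))
# -> ('offer', {'issues': {'price': 'medium'}, 'utility': 0.7})
```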
Each DRL component is based on an actor-critic architecture [30] and has its own Evaluate and Negotiation Experience components. Evaluate refers to a critic that helps our agent learn the dynamic threshold utility ū_t, the acceptance strategy template parameters and the bidding strategy template parameters from the new experience collected during the negotiation against each opponent agent. More specifically, it is a function of K random experiences (K < N) fetched from the agent's memory. Here, learning is retrospective, since it depends on the reward r_t obtained from E by performing a_t at s_t. The rewards for the critics used to estimate the threshold utility (r_t^{ū}, Eq. (5)), the choice parameter values of the bidding strategy template (r_t^{bid}, Eq. (6)) and those of the acceptance strategy template (r_t^{acc}, Eq. (7)) depend on the discounted user utility of the last bid received from the opponent, ω_t^o, or of the bid ω^acc accepted by either party:

r_t^{ū}  = U_u(ω^acc, t) on agreement;  U_u(ω_t^o, t) on a received offer;  −1 otherwise.   (5)
r_t^{bid} = U_u(ω^acc, t) on agreement;  −1 otherwise.   (6)
r_t^{acc} = U_u(ω^acc, t) on agreement and U_o(ω^acc, t) ≤ U_u(ω^acc, t);  U_u(ω_t^o, t) on rejection and U_o(ω_t^o, t) ≥ U_u(ω_t^o, t);  −1 otherwise.   (7)

Equations (5) and (6) are straightforward. In (7), U_o(ω, t) can be used because the reward is computed by the environment E where the opponent agent resides. In other words, we assume that E has access to A_o's real preferences, i.e., U_o, but these preferences are not observable by our agent A_u. The first case of r_t^{acc} deals with an agreed bid and returns a positive reward if the bid gives higher utility to our agent than to the opponent. The second case deals with a rejected bid and returns a positive reward if the bid gives lower utility to our agent than to the opponent. In all other cases, a negative value is returned. Also, in (5), (6) and (7), U_u(ω, t) is the discounted reward of ω, defined in (8).

U_u(ω, t) = U_u(ω) · d^t,  d ∈ [0, 1]   (8)

In (8), d is a temporal discount factor that encourages the agent to negotiate without delay. We should not confuse d, which is typically unknown to the agent, with the discount factor d_D used to compute the utility of an agreed bid.
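The case analyses in (5)–(7) translate directly into code. A minimal sketch, with hypothetical function names, assuming the environment exposes U_u and (for r^acc only) U_o:

```python
def r_threshold(U_u, omega_acc, omega_o, t, agreed, received_offer):
    """r^{u-bar}_t, Eq. (5): reward for the threshold-utility critic."""
    if agreed:
        return U_u(omega_acc, t)
    if received_offer:
        return U_u(omega_o, t)
    return -1.0

def r_bid(U_u, omega_acc, t, agreed):
    """r^{bid}_t, Eq. (6): reward for the bidding-template critic."""
    return U_u(omega_acc, t) if agreed else -1.0

def r_acc(U_u, U_o, omega_acc, omega_o, t, agreed, rejected):
    """r^{acc}_t, Eq. (7): reward for the acceptance-template critic.

    U_o is only visible to the environment, never to the agent itself.
    """
    if agreed and U_o(omega_acc, t) <= U_u(omega_acc, t):
        return U_u(omega_acc, t)
    if rejected and U_o(omega_o, t) >= U_u(omega_o, t):
        return U_u(omega_o, t)
    return -1.0
```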
Negotiation Experience stores historical information about the N previous interactions of an agent with other agents. Experience elements are of the form ⟨s_t, a_t, r_t, s_{t+1}⟩, where s_t is the internal state representation of the negotiation environment E, a_t is the performed action, r_t is a scalar reward received from the environment and s_{t+1} is the new agent state after executing a_t.

Strategy templates. The strategy templates of [10] are a general form of parametric strategies for acceptance and bidding. These strategies apply different tactics at different phases of the negotiation. The total number of phases n and the number of tactics n_i to choose from at each phase i = 1, ..., n are the only parameters fixed in advance. For each phase i, the duration δ_i (i.e., t_{i+1} = t_i + δ_i) and the particular choice of tactic are learnable parameters. The latter is encoded with choice parameters c_{i,j}, where i = 1, ..., n and j = 1, ..., n_i, such that if c_{i,j} is true then the (i, j)-th tactic is selected for phase i. Tactics can be parametric in turn, and depend on learnable parameters p_{i,j}. We consider the same set of admissible tactics as [10]. The key difference is that our approach allows the entire strategy to evolve (within the space of strategies entailed by the template) at every negotiation, which makes it more adaptable and generalizable.

The tactics used for acceptance strategies are:
• U_u(ω_t), the estimated utility of the bid ω_t that our agent would propose at time t.
• Q_{U_u(Ω_t^o)}(a · t + b), where U_u(Ω_t^o) is the distribution of (estimated) utility values of the bids in Ω_t^o, Q_{U_u(Ω_t^o)}(p) is the quantile function of that distribution, and a and b are learnable parameters. In other words, we consider the p-th best utility received from the opponent, where p is a learnable (linear) function of the negotiation time t. In this way, the tactic automatically and dynamically decides how much the agent should concede at time t. Here, p_{i,j} = {a, b}.
• ū_t, the dynamic DRL-based utility threshold.
• u, a fixed utility threshold.

The bidding tactics are:
• b^Boulware, a bid generated by a time-dependent Boulware strategy [15].
• PS(a · t + b), which extracts a bid from the set of Pareto-optimal bids PS, derived using the NSGA-II algorithm [14] under U_u and Û_o (meta-heuristics, rather than brute force, have the potential to deal efficiently with continuous issues). In particular, it selects the bid that assigns a weight of a · t + b to our agent's utility (and 1 − (a · t + b) to the opponent's), where a and b are learnable parameters determining how this weight scales with the negotiation time t. The TOPSIS algorithm [20] is used to derive such a bid, given the weighting a · t + b as input. Here, p_{i,j} = {a, b}.
• b^opp(ω_t^o), a tactic that generates a bid by manipulating the last bid received from the opponent, ω_t^o. The bid is modified in a greedy fashion by randomly changing the value of the least relevant issue (w.r.t. U) of ω_t^o.
• ω ~ U(Ω_{≥ū_t}), a random bid above our DRL-based utility threshold ū_t, where U(S) is the uniform distribution over S and Ω_{≥ū_t} is the subset of Ω whose bids have estimated utility above ū_t w.r.t. U.

Below, we give an example of a concrete acceptance strategy learned with our model. We use a specific domain (Party), as discussed in Section 6, and show how the strategy adapts in other negotiation domains (Grocery and Outfit) against the opponent strategy of [10].

(a) Party domain:
t ∈ [0.000, 0.0361) → U_u(ω_t^o) ≥ max( Q_{U(Ω_t^o)}(−0.20 · t + 0.22), ū_t )
t ∈ [0.0361, 1.000] → U_u(ω_t^o) ≥ max( u, Q_{U(Ω_t^o)}(−0.10 · t + 0.64) )

(b) Grocery domain:
t ∈ [0.000, 0.2164) → U_u(ω_t^o) ≥ max( U_u(ω_t), Q_{U(Ω_t^o)}(−0.55 · t + 0.05), ū_t )
t ∈ [0.2164, 0.3379) → U_u(ω_t^o) ≥ max( U_u(ω_t), Q_{U(Ω_t^o)}(−0.60 · t + 1.40) )
t ∈ [0.3379, 1.000] → U_u(ω_t^o) ≥ max( Q_{U(Ω_t^o)}(−0.22 · t + 0.29), ū_t )

(c) Outfit domain:
t ∈ [0.000, 0.1545) → U_u(ω_t^o) ≥ Q_{U(Ω_t^o)}(−0.50 · t + 0.70)
t ∈ [0.1545, 0.3496) → U_u(ω_t^o) ≥ max( ū_t, Q_{U(Ω_t^o)}(−0.50 · t + 0.90) )
t ∈ [0.3496, 1.000] → U_u(ω_t^o) ≥ U_u(ω_t)

We can observe that the phase durations on the left-hand side of the rules differ across domains: in the first domain (Party) the first rule triggers when t ∈ [0.0, 0.0361), while in the second (Grocery) and third (Outfit) domains the first rule triggers for t ∈ [0.0, 0.2164) and t ∈ [0.0, 0.1545), respectively. The parameters and tactic choices on the right-hand side of the rules also differ: in the first domain (Party), during the very early phase of the negotiation the strategy uses a quantile tactic as well as the dynamic threshold utility. However, in the second domain (Grocery), the strategy also employs the future bid utility along with the quantile bid and dynamic threshold utility tactics, whereas in the third domain (Outfit) it initially employs only the quantile bid tactic.
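As an illustration of the quantile tactic Q_{U(Ω_t^o)}(a · t + b) used in the rules above, the following sketch computes the acceptance threshold from the estimated utilities of the opponent's past bids; the utility values, the clipping of the quantile level to [0, 1] and the assumed value of ū_t are illustrative choices, not the paper's implementation.

```python
import numpy as np

def quantile_tactic_threshold(opponent_bid_utilities, t, a, b):
    """Q_{U(Omega_o^t)}(a*t + b): the (a*t + b)-quantile of the utilities of the
    opponent's bids received so far; a and b are the learnable tactic parameters."""
    p = float(np.clip(a * t + b, 0.0, 1.0))     # keep the quantile level valid
    return float(np.quantile(opponent_bid_utilities, p))

# Example: the learned first-phase Party rule above,
# U_u(omega_t^o) >= max(Q(-0.20*t + 0.22), u_bar_t), evaluated at t = 0.02
# with an assumed dynamic threshold u_bar_t = 0.75.
utils_so_far = [0.35, 0.42, 0.47, 0.51, 0.58]   # estimated utilities of past opponent bids
threshold = max(quantile_tactic_threshold(utils_so_far, t=0.02, a=-0.20, b=0.22), 0.75)
accept = 0.62 >= threshold                       # U_u of the newly received offer -> False
```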
5 METHODS
In our approach, we first use supervised learning (SL) to pre-train our agent on supervision examples collected from existing "teacher" negotiation strategies, as inspired by [9, 10]. The pre-trained strategy is then evolved via RL using the experience and rewards collected while interacting with other agents in the negotiation environment. This combination of SL and RL enhances the process of learning an optimal strategy: applying RL alone from scratch would require a large amount of experience before reaching a reasonable strategy, which might hinder the online performance of our agent, whereas starting from a pre-trained policy ensures quicker convergence (as demonstrated empirically in [9, 10]).

5.1 Data set collection
To collect the data set for pre-training our agent via SL, we used the GENIUS simulation environment [26]. In particular, in our experiments we generate supervision data using the existing DRL-based state-of-the-art agent negotiation model of [10] by negotiating it against the winning strategies of the ANAC 2019 competition, i.e., AgentGG, KakeSoba and SAGA (readily available in GENIUS and requiring minimal changes to work in our negotiation settings), assuming no user preference uncertainty, in three different domains (Laptop, Holiday, and Party).

5.2 Strategy representation
We represent both f_a (3) and f_b (4) using artificial neural networks (ANNs) [19], as these are powerful function approximators and benefit from extremely effective learning algorithms, unlike [10], which used a meta-heuristic optimization algorithm. We also use an ANN to predict the target threshold utility ū_t, as in [10].

5.2.1 ANN. In particular, we use feed-forward neural networks, i.e., functions organized into several layers, where each layer comprises a number of neurons that process information from the previous layer. More details can be found in [19]. We keep the ANN configuration the same as in [10].
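Before detailing the RL component, here is a minimal sketch of the supervised pre-training step described above. It uses scikit-learn purely for illustration (the paper does not specify this library), and the file names and feature choices are hypothetical placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: state vectors s_t recorded while the "teacher" strategies negotiated in GENIUS;
# y: the teacher's outputs in those states (e.g., its threshold utility).
# The file names and shapes below are illustrative placeholders.
X = np.load("teacher_states.npy")      # (num_samples, num_state_attributes)
y = np.load("teacher_actions.npy")     # (num_samples,) or (num_samples, num_outputs)

pretrained_actor = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
pretrained_actor.fit(X, y)             # imitate the teacher before RL fine-tuning

# The fitted weights would then initialise the DDPG actor network, which is
# subsequently refined online with the rewards of Eqs. (5)-(7).
```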
5.2.2 DRL. During our experiments, the agent negotiates with fixed-but-unknown opponent strategies in a negotiation environment, which can be learnt by our agent after some simulation runs. Hence, we consider our negotiation environment to be fully observable. For this dynamic and episodic environment, we use a model-free, off-policy RL approach that generates a deterministic policy based on the policy gradient method to support continuous control. More specifically, as in [10], we use the Deep Deterministic Policy Gradient (DDPG) algorithm, an actor-critic RL approach that generates a deterministic action selection policy for the negotiating agent [25]. We consider a model-free RL approach because our problem is to make an agent decide what action to take next in a negotiation dialogue, rather than to predict the new state of the environment. In other words, we are not learning a model of the environment, as the strategies of the opponents are not observable properties of the environment's state. Thus, our agent's emphasis is on learning what action to take next rather than the state transition function of the environment. We adopt the off-policy approach (i.e., the agent evaluates or improves a policy different from the one used to take the action) for independent exploration of continuous action spaces [25].

When in a state s_t, DDPG uses a so-called actor network μ to select an action act_t, and a so-called critic network Q to predict the value Q_t, at state s_t, of the action selected by the actor:

act_t = μ(s_t | θ^μ)   (9)
Q_t(s_t, act_t | θ^Q) = Q(s_t, μ(s_t | θ^μ) | θ^Q)   (10)

In (9) and (10), θ^μ and θ^Q are, respectively, the learnable parameters of the actor and critic neural networks. The parameters of the actor network are updated by the deterministic policy gradient method [29]. The objective of the actor policy function is to maximize the expected return J calculated by the critic function, as in (11). See [24] for further details on DDPG.

J = E[ Q(s, act | θ^Q) | s = s_t, act = μ(s_t) ]   (11)

In our experiments, for predicting the dynamic threshold utility the actor function is a single-output regression ANN, whereas for the acceptance and bidding strategies it is a multiple-output regression ANN. In particular, when predicting ū_t, act_t corresponds to ū_t; for the acceptance and bidding strategy templates, act_t consists of a vector of multiple outputs (δ_i, (c_{i,j}, p_{i,j})_{j=1,...,n_i})_{i=1,...,n}, including the duration of each negotiation phase δ_i, the Boolean choice parameters c_{i,j} and the set of learnable parameters p_{i,j} for each tactic j that can be used in negotiation phase i.
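A minimal PyTorch sketch of the actor-critic pair in (9)–(11) and of the actor update that maximizes J; the use of PyTorch, the network sizes and dimensions are assumptions for illustration, not the paper's ANN configuration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta^mu): maps a state to a continuous action (Eq. 9)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Sigmoid(),   # outputs in [0, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, act | theta^Q): estimates the value of the actor's action (Eq. 10)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, act):
        return self.net(torch.cat([s, act], dim=-1))

# Actor update: ascend J = E[Q(s, mu(s))] (Eq. 11), i.e. minimise -Q.
actor, critic = Actor(state_dim=10, action_dim=1), Critic(state_dim=10, action_dim=1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, 10)                     # a sampled mini-batch of K experiences
actor_loss = -critic(states, actor(states)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```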
5.3 Opponent modelling
We consider a negotiation environment with uncertainty about the opponent's preferences. To derive an estimate Û_o of the opponent model during negotiation, we use the distribution-based frequency model proposed in [31], as also done in [10]. In this model, the empirical frequency of the issue values in Ω_t^o provides an educated guess about the opponent's most preferred issue values. The issue weights are estimated by analysing disjoint windows of Ω_t^o, giving an idea of how the opponent's preferences shift from its previous negotiation strategy over time.
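A minimal sketch of the frequency-based idea: issue-value utilities are estimated from how often each value appears in the opponent's past bids (the window-based weight estimation of [31] is omitted for brevity); the function name and example bids are illustrative.

```python
from collections import Counter

def estimate_opponent_value_utilities(opponent_bids):
    """Estimate the opponent's per-issue value utilities from the empirical
    frequency of issue values in its past bids Omega_o^t."""
    counts = {}
    for bid in opponent_bids:                       # each bid: {issue: value}
        for issue, value in bid.items():
            counts.setdefault(issue, Counter())[value] += 1
    # Normalise so the most frequent (presumably most preferred) value gets utility 1.
    return {
        issue: {v: c / max(counter.values()) for v, c in counter.items()}
        for issue, counter in counts.items()
    }

bids = [{"price": "high", "delivery": "1 week"},
        {"price": "high", "delivery": "1 day"},
        {"price": "medium", "delivery": "1 week"}]
print(estimate_opponent_value_utilities(bids))
# {'price': {'high': 1.0, 'medium': 0.5}, 'delivery': {'1 week': 1.0, '1 day': 0.5}}
```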
6 EXPERIMENTAL RESULTS AND DISCUSSION
All the experiments are performed using the GENIUS tool [26] and are designed to test the following two hypotheses:
• Hypothesis A: the DLST-based negotiation approach outperforms the "teacher" strategies in known negotiation settings in terms of individual and social efficiency.
• Hypothesis B: the DLST-based negotiation approach outperforms not-seen-before strategies and adapts to different negotiation settings in terms of individual and social efficiency.

6.1 Performance metrics
We measure the performance of each agent in terms of the following widely adopted metrics, inspired by the ANAC competition:
• U_ind^total: the utility gained by an agent averaged over all negotiations (↑);
• U_ind^s: the utility gained by an agent averaged over all successful negotiations (↑);
• U_soc: the utility gained by both negotiating agents averaged over all successful negotiations (↑);
• P_avg: the average minimal distance of agreements from the Pareto frontier (↓);
• S_%: the proportion of successful negotiations (↑).
The first and second measures represent the individual efficiency of an outcome, whereas the third and fourth correspond to the social efficiency of agreements.

6.2 Experimental settings
Our proposed DLST-based agent negotiation model is evaluated against state-of-the-art strategies that participated in ANAC'17 and ANAC'18, which were designed independently by different research groups. No agent has information about the other agents' strategies beforehand. Details of all these strategies are available in [3, 21]. We evaluate our approach on a total of 11 negotiation domains, which differ from each other in terms of size and opposition [4], to ensure good negotiation characteristics and to reduce any biases. The domain size refers to the number of issues, whereas the opposition refers to the minimum distance from all possible outcomes to the point representing complete satisfaction of both negotiating parties, (1, 1). (The value of opposition reflects the competitiveness between the parties in the domain: strong opposition means that a gain for one party comes at a loss for the other, whereas weak opposition means that both parties either lose or gain simultaneously [4].) For the experiments of Hypothesis B, we choose 3 readily available small-sized, 2 medium-sized and 3 large-sized domains. Of these domains, 2 have high, 3 medium and 3 low opposition (see [33] for more details). For each configuration, each agent plays both roles in the negotiation (e.g., buyer and seller in the Laptop domain) to compensate for any utility differences between the preference profiles. We call the agent's role, together with the user's preferences, the user profile. We set u_res and d_D to their respective default values, whereas the deadline is set to 180s, normalized to [0, 1] (and known to both negotiating parties in advance). For NSGA-II during the Pareto-bid generation phase, we choose a population size of 2% × |Ω|, 2 generations and a mutation count of 0.1. With these hyperparameters, on our machine (CPU: 8 cores, 2.10 GHz; RAM: 32 GB) the run-time of NSGA-II never exceeded the timeout of 10s for deciding an action at each turn, while still retrieving empirically good solutions.

6.3 Empirical evaluation
We now evaluate and discuss the two hypotheses introduced at the beginning of this section.

6.3.1 Hypothesis A: the DLST-based agent outperforms the "teacher" strategies. We performed a total of 1200 negotiation sessions (n × (n−1)/2 × x × y × z = 1200, where n = 5 is the number of agents in a tournament, x = 2 because agents play both sides, y = 3 is the number of domains and z = 20 because each tournament is repeated 20 times) to evaluate the performance of the DLST-based agent against the four "teacher" strategies (ANESIA [10], AgentGG, KakeSoba and SAGA) in three domains (Laptop, Holiday, and Party). These strategies were used to collect the dataset in the same domains for supervised training before the DRL process begins. Table 1 reports the average results for each agent over all profiles in each domain. Clearly, the DLST-based agent outperforms the "teacher" strategies in terms of both individual and social efficiency.

6.3.2 Hypothesis B: adaptive behaviour of DLST-based agents. We further evaluated the performance of the DLST-based agent against opponent agents from ANAC'17 and ANAC'18 that were unseen during training and that are capable of learning from previous negotiations. For this, we performed two experiments, one against the ANAC'17 agents and one against the ANAC'18 agents, each with a total of 29120 negotiation sessions (n × (n−1)/2 × x × y × z = 29120, where n = 14, x = 2, y = 8 and z = 20). The results in Table 2 are averaged over all domains and demonstrate that the DLST-based agent learns to make the optimal choice of tactics at run time and outperforms the other 8 strategies in terms of U_ind^s and U_soc. We also observed that our agent outperforms the current state of the art (ANESIA) in a tournament with ANAC'17 and ANAC'18 strategies in all the domains used for evaluation, as shown in Figures 2-5. This indicates that the DLST approach of dynamically adapting the parameters of the acceptance and bidding strategies consistently improves on the ANESIA approach of keeping these parameters fixed once the agent is deployed.

Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
Laptop domain
DLST-agent | 0.0 ± 0.0 | 1.71 ± 0.03 | 0.91 ± 0.02 | 0.91 ± 0.02 | 1.00
ANESIA | 0.0 ± 0.0 | 1.66 ± 0.20 | 0.86 ± 0.03 | 0.86 ± 0.03 | 1.00
KakeSoba | 0.03 ± 0.12 | 1.48 ± 0.53 | 0.77 ± 0.20 | 0.82 ± 0.06 | 0.94
SAGA | 0.01 ± 0.06 | 1.45 ± 0.48 | 0.89 ± 0.13 | 0.89 ± 0.10 | 0.99
AgentGG* | 0.22 ± 0.35 | 1.14 ± 0.65 | 0.71 ± 0.38 | 0.91 ± 0.09 | 0.78
Holiday domain
DLST-agent | 0.05 ± 0.11 | 1.74 ± 0.14 | 0.96 ± 0.14 | 0.96 ± 0.14 | 1.00
ANESIA | 0.06 ± 0.1 | 1.74 ± 0.14 | 0.85 ± 0.15 | 0.85 ± 0.15 | 1.00
KakeSoba | 0.21 ± 0.35 | 1.53 ± 0.5 | 0.84 ± 0.27 | 0.92 ± 0.07 | 0.91
SAGA | 0.19 ± 0.36 | 1.55 ± 0.5 | 0.70 ± 0.25 | 0.77 ± 0.12 | 0.91
AgentGG* | 0.46 ± 0.58 | 1.16 ± 0.82 | 0.74 ± 0.45 | 0.96 ± 0.03 | 0.67
Party domain
DLST-agent | 0.15 ± 0.38 | 1.53 ± 0.6 | 0.74 ± 0.31 | 0.77 ± 0.14 | 0.87
ANESIA | 0.37 ± 0.32 | 1.06 ± 0.5 | 0.52 ± 0.27 | 0.62 ± 0.14 | 0.83
KakeSoba | 0.33 ± 0.32 | 1.11 ± 0.51 | 0.64 ± 0.3 | 0.75 ± 0.12 | 0.84
SAGA | 0.15 ± 0.16 | 1.36 ± 0.26 | 0.61 ± 0.19 | 0.63 ± 0.16 | 0.87
AgentGG* | 0.38 ± 0.42 | 0.92 ± 0.6 | 0.62 ± 0.4 | 0.77 ± 0.12 | 0.71

Table 1: Performance comparison of the DLST-agent with the "teacher" strategies in the three domains (Laptop, Holiday and Party; all readily available in GENIUS). Best results are in bold. * indicates that user preference uncertainty is considered.

Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
Comparison of DLST and ANESIA with ANAC 2017 agent strategies
DLST-agent | 0.0 ± 0.0 | 1.17 ± 0.12 | 0.90 ± 0.0 | 0.93 ± 0.0 | 1.0
ANESIA | 0.0 ± 0.0 | 1.16 ± 0.12 | 0.70 ± 0.25 | 0.76 ± 0.26 | 0.89
PonpokoAgent | 0.70 ± 0.49 | 0.44 ± 0.70 | 0.62 ± 0.19 | 0.93 ± 0.04 | 0.89
ShahAgent | 0.54 ± 0.54 | 0.79 ± 0.79 | 0.57 ± 0.07 | 0.64 ± 0.04 | 0.75
Mamenchis | 0.50 ± 0.05 | 0.80 ± 0.80 | 0.66 ± 0.16 | 0.82 ± 0.18 | 0.89
AgentKN | 0.0 ± 0.0 | 1.17 ± 0.0 | 0.65 ± 0.05 | 0.65 ± 0.05 | 1.0
Rubick | 1.08 ± 0.0 | 1.00 ± 0.0 | 0.50 ± 0.09 | 0.64 ± 0.04 | 0.76
ParsCat2 | 0.54 ± 0.54 | 0.80 ± 0.08 | 0.66 ± 0.16 | 0.82 ± 0.04 | 0.57
SimpleAgent | 1.08 ± 0.0 | 0.90 ± 0.0 | 0.57 ± 0.14 | 0.57 ± 0.14 | 1.0
AgentF | 1.18 ± 0.0 | 1.07 ± 0.06 | 0.51 ± 0.0 | 0.81 ± 0.0 | 0.89
TucAgent | 0.08 ± 0.29 | 0.90 ± 0.03 | 0.65 ± 0.38 | 0.52 ± 0.16 | 0.69
MadAgent | 0.67 ± 0.05 | 1.09 ± 0.17 | 0.57 ± 0.0 | 0.57 ± 0.0 | 1.0
GeneKing | 1.08 ± 0.0 | 0.99 ± 0.14 | 0.75 ± 0.0 | 0.67 ± 0.24 | 0.63
Farma17 | 0.77 ± 0.49 | 0.44 ± 0.70 | 0.65 ± 0.19 | 0.93 ± 0.04 | 0.79
Comparison of DLST and ANESIA with ANAC 2018 agent strategies
DLST-agent | 0.00 ± 0.08 | 1.54 ± 0.17 | 0.86 ± 0.07 | 0.87 ± 0.06 | 0.91
ANESIA | 0.00 ± 0.09 | 1.41 ± 0.16 | 0.74 ± 0.14 | 0.84 ± 0.14 | 0.78
AgentHerb | 0.02 ± 0.05 | 0.79 ± 0.11 | 0.78 ± 0.02 | 0.78 ± 0.11 | 0.61
AgreeableAgent | 0.05 ± 0.11 | 1.12 ± 0.23 | 0.53 ± 0.10 | 0.56 ± 0.05 | 0.54
Sontag | 0.03 ± 0.07 | 0.73 ± 0.18 | 0.78 ± 0.08 | 0.79 ± 0.07 | 0.59
Agent33 | 0.04 ± 0.07 | 0.74 ± 0.18 | 0.68 ± 0.09 | 0.78 ± 0.09 | 0.79
AgentNP1 | 0.04 ± 0.06 | 0.73 ± 0.16 | 0.65 ± 0.10 | 0.65 ± 0.1 | 0.69
FullAgent | 0.02 ± 0.04 | 0.67 ± 0.12 | 0.69 ± 0.05 | 0.77 ± 0.12 | 0.61
ATeamAgent | 0.09 ± 0.06 | 0.58 ± 0.13 | 0.75 ± 0.10 | 0.75 ± 0.08 | 0.75
ConDAgent | 0.06 ± 0.09 | 1.16 ± 0.20 | 0.68 ± 0.11 | 0.65 ± 0.11 | 0.56
GroupY | 0.03 ± 0.06 | 0.66 ± 0.15 | 0.53 ± 0.07 | 0.54 ± 0.06 | 0.58
Yeela | 0.04 ± 0.06 | 0.68 ± 0.14 | 0.73 ± 0.08 | 0.73 ± 0.07 | 0.66
Libra | 0.10 ± 0.09 | 0.54 ± 0.19 | 0.71 ± 0.08 | 0.56 ± 0.04 | 0.77
ExpRubick | 0.00 ± 0.02 | 1.10 ± 0.18 | 0.78 ± 0.08 | 0.80 ± 0.12 | 0.91
Table 2: Performance comparison of the DLST-agent with existing strategies, averaged over all 8 domains (Airport Site, Camera, Energy, Fitness, Flight, Grocery, Itex-Cypress, Outfit; all readily available in GENIUS). Best results are in bold.

Figure 2: Comparison of DLST-agent vs ANESIA in terms of the agreement rate S_% (↑).
Figure 3: Comparison of DLST-agent vs ANESIA in terms of the social welfare utility U_soc (↑).
Figure 4: Comparison of DLST-agent vs ANESIA in terms of the individual utility over successful negotiations U_ind^s (↑).
Figure 5: Comparison of DLST-agent vs ANESIA in terms of the individual utility over all negotiations U_ind^total (↑).

7 CONCLUSIONS AND FUTURE WORK
This work uses actor-critic-based deep reinforcement learning to support negotiation in domains with multiple issues. In particular, it exploits the "interpretable" strategy templates used in the state of the art to learn the best combination of acceptance and bidding tactics at any negotiation time, and, among its tactics, it uses an adaptive threshold utility; all are learned using the DDPG algorithm, starting from an initial neural network strategy derived via supervised learning. We have empirically evaluated the performance of our DLST-based approach against the "teacher strategies" as well as the agent strategies of the ANAC'17 and ANAC'18 competitions (since those tournaments allowed learning from previous negotiations) in different settings, showing that our agent outperforms opponents known at training time and can effectively transfer its knowledge to environments with previously unseen opponent agents and domains. An open problem worth pursuing in the future is how to learn transferable strategies for concurrent bilateral negotiations over multiple issues.

REFERENCES
[1] Bedour Alrayes, Ozgur Kafali, and Kostas Stathis. 2014. CONAN: a heuristic strategy for COncurrent Negotiating AgeNts. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems. 1585–1586.
[2] Bedour Alrayes, Özgür Kafalı, and Kostas Stathis. 2018. Concurrent bilateral negotiation for open e-markets: the CONAN strategy. Knowledge and Information Systems 56, 2 (2018), 463–501.
[3] Reyhan Aydoğan, Katsuhide Fujita, Tim Baarslag, Catholijn M Jonker, and Takayuki Ito. 2018. ANAC 2017: Repeated multilateral negotiation league. In International Workshop on Agent-Based Complex Automated Negotiation. Springer, 101–115.
[4] Tim Baarslag, Katsuhide Fujita, Enrico H Gerding, Koen Hindriks, Takayuki Ito, Nicholas R Jennings, Catholijn Jonker, Sarit Kraus, Raz Lin, Valentin Robu, et al. 2013. Evaluating practical negotiating agents: Results and analysis of the 2011 international competition. Artificial Intelligence 198 (2013), 73–103.
[5] Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. 2016. Learning about the opponent in automated bilateral negotiation: a comprehensive survey of opponent modeling techniques. Autonomous Agents and Multi-Agent Systems 30, 5 (2016), 849–898.
[6] Tim Baarslag, Koen Hindriks, Mark Hendrikx, Alexander Dirkzwager, and Catholijn Jonker. 2014. Decoupling negotiating agents to explore the space of negotiation strategies. In Novel Insights in Agent-based Complex Automated Negotiation. Springer, 61–83.
[7] Tim Baarslag, Koen Hindriks, Catholijn Jonker, Sarit Kraus, and Raz Lin. 2012. The first automated negotiating agents competition (ANAC 2010). In New Trends in Agent-based Complex Automated Negotiations. Springer, 113–135.
[8] Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. 2020. A Deep Reinforcement Learning Approach to Concurrent Bilateral Negotiation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. ijcai.org, 297–303.
[9] Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. 2021. ANEGMA: an automated negotiation model for e-markets. Autonomous Agents and Multi-Agent Systems 35, 2 (2021), 1–28.
[10] Pallavi Bagga, Nicola Paoletti, and Kostas Stathis. 2020. Learnable strategies for bilateral agent negotiation over multiple issues. arXiv preprint arXiv:2009.08302 (2020).
[11] Pallavi Bagga, Nicola Paoletti, and Kostas Stathis. 2021. Pareto Bid Estimation for Multi-Issue Bilateral Negotiation under User Preference Uncertainty. In 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, 1–6.
[12] Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. 2019. RLBOA: A Modular Reinforcement Learning Framework for Autonomous Negotiating Agents. In AAMAS. 260–268.
[13] Stefania Costantini, Giovanni De Gasperis, Alessandro Provetti, and Panagiota Tsintza. 2013. A heuristic approach to proposal-based negotiation: with applications in fashion supply chain management. Mathematical Problems in Engineering 2013 (2013).
[14] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 2 (2002), 182–197.
[15] S Shaheen Fatima, Michael Wooldridge, and Nicholas R Jennings. 2001. Optimal negotiation strategies for agents with incomplete information. In International Workshop on Agent Theories, Architectures, and Languages. Springer, 377–392.
[16] Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. 2002. Multi-issue negotiation under time constraints. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1. 143–150.
[17] Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. 2005. A comparative study of game theoretic and evolutionary models of bargaining for software agents. Artificial Intelligence Review 23, 2 (2005), 187–205.
[18] S Shaheen Fatima, Michael J Wooldridge, and Nicholas R Jennings. 2006. Multi-issue negotiation with deadlines. Journal of Artificial Intelligence Research 27 (2006), 381–417.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[20] Ching-Lai Hwang and Kwangsun Yoon. 1981. Methods for multiple attribute decision making. In Multiple Attribute Decision Making. Springer, 58–191.
[21] Catholijn Jonker, Reyhan Aydogan, Tim Baarslag, Katsuhide Fujita, Takayuki Ito, and Koen Hindriks. 2017. Automated negotiating agents competition (ANAC). In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[22] Catholijn M Jonker, Valentin Robu, and Jan Treur. 2007. An agent architecture for multi-attribute negotiation using incomplete preference information. Autonomous Agents and Multi-Agent Systems 15, 2 (2007), 221–252.
[23] Usha Kiruthika, Thamarai Selvi Somasundaram, and S Kanaga Suba Raja. 2020. Lifecycle model of a negotiation agent: A survey of automated negotiation techniques. Group Decision and Negotiation 29, 6 (2020), 1239–1262.
[24] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[25] Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016).
[26] Raz Lin, Sarit Kraus, Tim Baarslag, Dmytro Tykhonov, Koen Hindriks, and Catholijn M Jonker. 2014. Genius: An integrated environment for supporting the design of generic automated negotiators. Computational Intelligence 30, 1 (2014), 48–70.
[27] Yousef Razeghi, Celal Ozan Berk Yavaz, and Reyhan Aydoğan. 2020. Deep reinforcement learning for acceptance strategy in bilateral negotiations. Turkish Journal of Electrical Engineering & Computer Sciences 28, 4 (2020), 1824–1840.
[28] Ariel Rubinstein. 1982. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society (1982), 97–109.
[29] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
[30] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[31] Okan Tunalı, Reyhan Aydoğan, and Victor Sanchez-Anguix. 2017. Rethinking frequency opponent modeling in automated negotiation. In International Conference on Principles and Practice of Multi-Agent Systems. Springer, 263–279.
[32] Colin R Williams, Valentin Robu, Enrico H Gerding, and Nicholas R Jennings. 2012. IAMhaggler: A negotiation agent for complex environments. In New Trends in Agent-based Complex Automated Negotiations. Springer, 151–158.
[33] Colin R Williams, Valentin Robu, Enrico H Gerding, and Nicholas R Jennings. 2014. An overview of the results and insights from the third automated negotiating agents competition (ANAC 2012). Novel Insights in Agent-based Complex Automated Negotiation (2014), 151–162.
[34] Yoshiaki Yasumura, Takahiko Kamiryo, Shohei Yoshikawa, and Kuniaki Uehara. 2009. Acquisition of a concession strategy in multi-issue negotiation. Web Intelligence and Agent Systems: An International Journal 7, 2 (2009), 161–171.
[35] Shohei Yoshikawa, Yoshiaki Yasumura, and Kuniaki Uehara. 2008. Strategy acquisition on multi-issue negotiation without estimating opponent's preference. In KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications. Springer, 371–380.