Deep Learnable Strategy Templates for Multi-Issue Bilateral
Negotiation
Pallavi Bagga, Royal Holloway, University of London, Egham, United Kingdom, pallavi.bagga@rhul.ac.uk
Nicola Paoletti, Royal Holloway, University of London, Egham, United Kingdom, nicola.paoletti@rhul.ac.uk
Kostas Stathis, Royal Holloway, University of London, Egham, United Kingdom, kostas.stathis@rhul.ac.uk
ABSTRACT
We study how to exploit the notion of strategy templates to learn strategies for multi-issue bilateral negotiation. Each strategy template consists of a set of interpretable parameterized tactics that are used to decide an optimal action at any time. We use deep reinforcement learning through an actor-critic architecture to estimate the tactic parameter values for a threshold utility, when to accept an offer, and how to generate a new bid. This contrasts with existing work that only estimates the threshold utility for those tactics. We pre-train the strategy by supervision from a dataset collected using "teacher strategies", thereby decreasing the exploration time required for learning during negotiation. As a result, we build automated agents for multi-issue negotiations that can adapt to different negotiation domains without the need to be pre-programmed. We empirically show that our work outperforms the state-of-the-art in terms of individual as well as social efficiency.
KEYWORDS
Multi-Issue Negotiation, Deep Reinforcement Learning, Bilateral
Automated Negotiation, Interpretable Negotiation Strategies
1 INTRODUCTION
We are concerned with the problem of modelling a self-interested agent that negotiates with an opponent over multiple issues while learning to optimally adapt its strategy. For instance, an agent trying to buy a laptop settles the price of the laptop on behalf of its owner based on a number of other issues, such as the laptop type, delivery time, payment method and delivery location [18].
For realistic and complex environments, we assume that our agent has no previous knowledge of the opponent's preferences and negotiating characteristics [5]. Also, the utility of offers exchanged during the negotiation decreases over time (in negotiation scenarios with a discount factor); thus, timely decisions on rejecting or accepting an offer, and making acceptable offers, are crucial [16]. Moreover, in a multi-issue negotiation, there are likely to be a number of different offers at any given utility level. Since they all result in the same utility, our agent is indifferent between these offers. So, there is a further challenge of selecting the offer which maximizes the utility to the opponent, whilst maintaining our desired utility level (i.e., aiming for a "win-win" solution) [32].
Existing work consists of four main approaches addressing the above-mentioned challenges. (a) Hand-crafted predefined heuristics – these have been proposed in a number of settings with competitive results [13], and although interpretable (e.g., [1, 2]), they are often characterized by ad-hoc parameter/weight settings that are difficult to adapt to different domains. (b) Meta-heuristic (or evolutionary) methods – these work well across domains and improve iteratively using a fitness function (as a guide for quality); however, in these approaches every agent decision needs to be delivered by the meta-heuristic, which is not efficient and does not result in a human-interpretable and reusable negotiation strategy. (c) Machine learning algorithms – these show the best results with respect to run-time adaptability [8, 27], but often their working hypotheses are not interpretable, a fact that may hinder their eventual adoption by users due to the lack of transparency in the decision-making that they offer. (d) Interpretable strategy templates – developed in [10] to guide the use of a series of tactics whose optimal use can be learned during negotiation. The structure of such templates depends upon a number of learnable choice parameters, determining which acceptance and bidding tactic to employ at any particular time during the negotiation. As these tactics represent hypotheses to be tested, defined by the agent developer, they can be explained to a user, and can in turn depend on learnable parameters. The outcome of this work is an agent model that formulates a strategy template for bid acceptance and generation so that an agent using it can make optimal decisions about the choice of tactics while negotiating in different domains [10].
The benefit of (d) is that it can combine (a), (b) and (c) by using heuristics for the components of the template and meta-heuristics or machine learning for evaluating the choice parameter values of these components. The problem with (d), however, is that the choice parameters of the components for the acceptance and bidding templates are learned once (during training) and used in all the different negotiation settings (during testing) [10]. This one-size-fits-all choice of tactics does not accumulate learning experience and may be unsuitable for unknown domains or unknown opponents. In other words, the current mechanism for learning the choice parameter values in [10] abstracts away from what is learned in a specific domain once the negotiation has finished, and therefore cannot transfer it to new domains or unseen opponents. To address this limitation of (d), we propose to use Deep Reinforcement Learning (DRL) to estimate the choice parameter values of components in strategy templates. We name the proposed interpretable strategy templates "Deep Learnable Strategy Templates (DLSTs)". Our contribution is an experimental study of DLSTs showing that agents employing them can learn parameter values from and across negotiation experiences, and are hence capable of transferring knowledge from one domain to another, or of reusing the experience gained against one opponent when facing another. This approach leads to "adaptive" and generalizable strategy templates. We also perform extensive evaluation experiments based on the ANAC tournaments [22] against agents with learning capabilities (readily available in GENIUS [33]) in a variety of domains with different sizes and competitiveness levels [33], each with two different profiles. The agents used for comparison span a wide range of strategies and techniques1. Empirically, the DLST-based agent negotiation model outperforms existing strategies in terms of individual as well as social welfare utilities.
1 E.g., AgreeableAgent2018: frequency-based opponent modelling; AgentHerb: logistic regression; SAGA: genetic algorithm (GA); KakeSoba: tabu search; Rubick: Gaussian distribution; Caduceus2016: mixture of GA, algorithm portfolio and experts.
The remainder of the paper is organized as follows. In Section 2,
we discuss the previous work related to learning-based multi-issue
negotiation. In Section 3, we give a description of negotiation settings considered in this paper. Then, in Section 4, the proposed
DLST-based negotiation model is introduced followed by various
methods and methodologies in Section 5. Subsequently, in Section 6,
we experimentally evaluate the performance efficiency of the proposed model. We conclude in Section 7 where we also outline an
open problem worth pursuing in the future, as a result of this work.
2 RELATED WORK
Existing approaches based on reinforcement learning have focused on methods such as tabular Q-learning for bidding [12] and for finding the optimal concession [34, 35], or DQN for bid acceptance [27], which are not optimal for continuous action spaces. Such spaces, however, are the main focus of this work, in order to estimate the threshold target utility value below which no bid is accepted from, or proposed to, the opponent agent. Also, to perform this effectively, these agents are required to conclude many prior negotiations with an opponent in order to learn the opponent's behaviour. Consequently, such approaches, and reinforcement learning in general, are not appropriate for one-off negotiation with an unknown opponent. The recently proposed adaptive negotiation model in [8, 9] uses DRL for continuous action spaces, but its motivation is significantly different from ours. In our work, the agent attempts to predict the tactic choices for the acceptance and bidding strategies at any particular time, as well as to learn the threshold utility used by some of the tactics within those strategies, whereas [8, 9] use DRL for a complete agent strategy while negotiating with multiple sellers concurrently in e-market-like scenarios. Moreover, we focus on building a generalized, decoupled and interpretable decision component, i.e., separate acceptance and bidding strategies are learned based on interpretable templates containing different tactics to be employed at different times in different domains. Other closely related multi-issue DRL-based negotiation work appears in [10, 11]. Unlike [10, 11], which use meta-heuristic optimization to learn the strategy parameter values once and then use them in all negotiation settings, we use DRL and the strategy parameter values may differ across negotiation settings. Also, unlike [10, 11], we abstract away from handling user preference uncertainty and generating near-Pareto-optimal bids under preference uncertainty.
3 NEGOTIATION SETTINGS
As in [10], we assume that our negotiation environment $E$ consists of two agents $A_u$ and $A_o$ negotiating with each other over some domain $D$. A domain $D$ consists of $n$ different independent issues, $D = (I_1, I_2, \ldots, I_n)$, with each issue taking a finite set of $k$ possible discrete or continuous values, $I_i = (v^i_1, \ldots, v^i_k)$. In our experiments, we consider issues with discrete values. An agent's bid $\omega$ is a mapping from each issue to a chosen value (denoted by $c_i$ for the $i$-th issue), i.e., $\omega = (v^1_{c_1}, \ldots, v^n_{c_n})$. The set of all possible bids or outcomes is called the outcome space $\Omega$, s.t. $\omega \in \Omega$. The outcome space is common knowledge to the negotiating parties and stays fixed during a single negotiation session.
Negotiation protocol. Before the agents can begin the negotiation
and exchange bids, they must agree on a negotiation protocol 𝑃,
which determines the valid moves agents can take at any state of the
negotiation [17]. Here, we consider the alternating offers protocol
[28], with possible 𝐴𝑐𝑡𝑖𝑜𝑛𝑠 = {offer (𝜔), accept, reject}. One of the
agents (say 𝐴𝑢 ) starts a negotiation by making an offer 𝑥𝐴𝑢 →𝐴𝑜 to
the other agent (say 𝐴𝑜 ). The agent 𝐴𝑜 can either accept or reject
the offer. If it accepts, the negotiation ends with an agreement,
otherwise 𝐴𝑜 makes a counter-offer to 𝐴𝑢 . This process of making
offers continues until one of the agents either accepts an offer
(i.e., successful negotiation) or the deadline is reached (i.e., failed
negotiation).
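To make the protocol concrete, the following minimal sketch simulates one alternating-offers session under a real-time deadline; the `accepts`/`propose` agent interface and the helper name are illustrative assumptions, not part of [28] or of GENIUS.

```python
import time

def run_session(agent_u, agent_o, deadline_seconds=180.0):
    """Sketch of one alternating-offers session (illustrative interface).
    Returns (agreed_bid, normalized_time) on agreement, or (None, 1.0) on failure."""
    start = time.time()
    agents = [agent_u, agent_o]          # agent_u makes the first offer
    offer_on_table = None
    turn = 0
    while True:
        t = (time.time() - start) / deadline_seconds   # normalized time in [0, 1]
        if t >= 1.0:
            return None, 1.0                           # deadline reached: failed negotiation
        agent = agents[turn % 2]
        if offer_on_table is not None and agent.accepts(offer_on_table, t):
            return offer_on_table, t                   # agreement
        offer_on_table = agent.propose(t)              # implicit reject plus counter-offer
        turn += 1
```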
Time Constraints. We impose a real-time deadline $t_{end}$ on the negotiation process for both theoretical and practical reasons. The pragmatic reason is that without a deadline the negotiation might go on forever, especially in the absence of discount factors. Secondly, with unlimited time an agent may simply try a huge number of proposals to learn the opponent's preferences [7]. However, a real-time deadline also poses several challenges: agents should be more willing to concede near the deadline, as a break-off yields zero utility (or the reservation utility, if any) for both agents; it makes it necessary to employ a strategy to decide when to accept an offer; and deciding when to accept involves some prediction of whether or not a significantly better opportunity might occur in the future.
Moreover, we assume that the negotiations are sensitive to time,
i.e., time impacts the utilities of the negotiating parties. In other
words, the value of an agreement decreases over time.
Negotiation session. Formally, for each negotiation session between two agents $A_u$ and $A_o$, let $x^t_{A_u \rightarrow A_o} \in Actions$ denote the offer action proposed by agent $A_u$ to agent $A_o$ at time $t$. A negotiation history $H^t_{A_u \leftrightarrow A_o}$ between agents $A_u$ and $A_o$ until time $t$ can be represented as in (1):

$$H^t_{A_u \leftrightarrow A_o} := \big(x^{t_1}_{p_1 \rightarrow p_2},\ x^{t_2}_{p_3 \rightarrow p_4},\ \cdots,\ x^{t_n}_{p_n \rightarrow p_{n+1}}\big) \qquad (1)$$

where $t_n \le t$ and the negotiation actions are ordered over time. Also, $p_j = p_{j+2}$, i.e., the negotiation process strictly follows the alternating-offers protocol. Given a negotiation thread between agents $A_u$ and $A_o$, the action performed by $A_u$ at time $t'$ after receiving an offer $x_{A_o \rightarrow A_u}$ at time $t$ from $A_o$ can be one from the set Actions if $t' < t_{end}$, i.e., the negotiation deadline is not reached.
Furthermore, we assume bounded rational agents due to the fact
that given the limited time, information privacy, and limited computational resources, agents cannot calculate the optimal strategy
to be carried out during the negotiation.
Figure 1: Interaction between the components of DLST-based agent negotiation model
Utility. We assume that each negotiating agent has its own private preference profile which describes how bids are preferred over other bids. This profile is given in terms of a utility function $U$, defined as a weighted sum of evaluation functions $e_i(v^i_{c_i})$, as shown in (2). Each issue is evaluated separately and contributes linearly, without depending on the values of the other issues; hence $U$ is referred to as a Linear Additive Utility space. Here, $w_i$ are the normalized weights indicating the importance of each issue to the user and $e_i(v^i_{c_i})$ is an evaluation function that maps the value $v^i_{c_i}$ of the $i$-th issue to a utility.

$$U(\omega) = U(v^1_{c_1}, \ldots, v^n_{c_n}) = \sum_{i=1}^{n} w_i \cdot e_i(v^i_{c_i}), \quad \text{where } \sum_{i=1}^{n} w_i = 1 \qquad (2)$$
Whenever the negotiation terminates without any agreement, each negotiating party gets the utility corresponding to its private reservation value ($u_{res}$).2 In case the negotiation terminates with an agreement, each agent receives the discounted utility of the agreed bid, i.e., $U^d(\omega) = U(\omega) \cdot d_D^{\,t}$. Here, $d_D$ is a discount factor in the interval $[0, 1]$ and $t \in [0, 1]$ is the current normalized time.
2 The reservation value is the minimum acceptable utility for an agent. It may vary for different parties and different domains. In our settings, it is the same for both parties.
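To illustrate (2) and the discounted utility $U^d(\omega)$, the sketch below evaluates a bid under a linear additive profile; the issues, weights and evaluation values are invented purely for the example.

```python
# Illustrative linear additive profile (issues, weights and evaluations are made up).
weights = {"type": 0.5, "delivery": 0.3, "payment": 0.2}            # normalized: sum to 1
evaluations = {                                                      # e_i: issue value -> [0, 1]
    "type": {"gaming": 1.0, "ultrabook": 0.6, "basic": 0.2},
    "delivery": {"1 week": 1.0, "2 weeks": 0.5, "1 month": 0.1},
    "payment": {"card": 1.0, "cash": 0.4},
}

def utility(bid):
    """U(omega) = sum_i w_i * e_i(v_ci), Eq. (2)."""
    return sum(weights[i] * evaluations[i][v] for i, v in bid.items())

def discounted_utility(bid, t, d_D=1.0):
    """U^d(omega) = U(omega) * d_D ** t, with normalized time t in [0, 1]."""
    return utility(bid) * (d_D ** t)

bid = {"type": "gaming", "delivery": "2 weeks", "payment": "cash"}
print(utility(bid))                       # 0.73
print(discounted_utility(bid, 0.5, 0.8))  # 0.73 * 0.8**0.5
```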
4 DLST-BASED NEGOTIATION MODEL
When building a negotiation agent, we normally consider three phases: the pre-negotiation phase (i.e., estimation of the agent owner's preferences, preference elicitation), the negotiation phase (i.e., offer generation, opponent modelling) and the post-negotiation phase (i.e., assessing the optimality of offers) [23]. In this paper, we are interested in the second phase, which involves a Decide component for choosing an optimal action $a_t$ given the negotiation history $H^{t-1}_{A_o \rightarrow A_u}$. As in [10], we assume that our agent $A_u$ is situated in an environment $E$ (containing the opponent agent $A_o$) where, at any time $t$, $A_u$ senses the current state $S_t$ of $E$ and represents it as a set of internal attributes, as shown in Figure 1; however, this component is implicit in [10]. For the estimation of the threshold utility, the set of state attributes includes information derived from the sequence of previous bids offered by $A_o$ (e.g., the utility of the most recently received opponent bid $\omega^o_t$, the utility of the best opponent bid so far $O_{best}$, the average utility of all opponent bids $O_{avg}$ and their variability $O_{sd}$), information stored in $A_u$'s knowledge base (e.g., the number of bids $B$ in the given partial order, $d_D$, $u_{res}$, $\Omega$, and $n$), and the current negotiation time $t$. This internal state representation, denoted by $s_t$, is used by the agent (in the acceptance and bidding strategies) to decide which action $a_t$ to execute from the set Actions, based on the negotiation protocol $P$ at time $t$. Action execution then changes the state of the environment to $S_{t+1}$. For the acceptance strategy, the state $s_t$ involves the following attributes in addition to those above: the fixed target utility $u$, the dynamic and learnable target utility $\bar{u}_t$, $U(\omega)$, the quantile value $q$ which changes w.r.t. time $t$, and $Q_{\hat{U}(\Omega^o_t)}(q)$. For the bidding strategy, the state $s_t$ involves the following attributes: $b^{Boulware}$, the Pareto-optimal bid set $PS$, $b^{opp}(\omega^o_t)$, and $\mathcal{U}(\Omega_{\ge \bar{u}_t})$ (as discussed in the subsequent section), in addition to the state attributes used for estimating the dynamic threshold utility value.
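A possible encoding of the internal state $s_t$ used for the threshold-utility estimation is sketched below; the exact feature set and ordering are assumptions for illustration, not the precise representation used in our implementation.

```python
import numpy as np

def threshold_state(opponent_utils, t, d_D, u_res, num_bids_B, outcome_space_size, num_issues):
    """Sketch of the state features for estimating u_bar_t.
    opponent_utils: utilities (under U_u) of the bids received from the opponent so far."""
    ou = np.asarray(opponent_utils, dtype=np.float32) if opponent_utils else np.zeros(1, dtype=np.float32)
    return np.array([
        ou[-1],               # utility of the most recently received opponent bid
        ou.max(),             # best opponent bid so far (O_best)
        ou.mean(),            # average opponent bid utility (O_avg)
        ou.std(),             # variability of opponent bids (O_sd)
        num_bids_B,           # number of bids in the given partial order
        d_D,                  # discount factor
        u_res,                # reservation value
        outcome_space_size,   # |Omega|
        num_issues,           # n
        t,                    # current normalized time
    ], dtype=np.float32)
```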
The action $a_t$ is derived via two functions, $f_a$ and $f_b$, for the acceptance and bidding strategies, respectively, as in [10]. The function $f_a$ takes as input $s_t$, a dynamic threshold utility $\bar{u}_t$ (defined later in the Methods section), and the sequence of past opponent bids $\Omega^o_t$, and outputs a discrete action $a_t$, either accept or reject. When $f_a$ returns reject, $f_b$ computes what to bid next, with inputs $s_t$ and $\bar{u}_t$; see (3)-(4). This separation of acceptance and bidding strategies is not uncommon, see for instance [6]. Also, $f_a$ and $f_b$ consist of a set of tactics as defined in [10].

$$f_a(s_t, \bar{u}_t, \Omega^o_t) = a_t, \quad a_t \in \{accept, reject\} \qquad (3)$$
$$f_b(s_t, \bar{u}_t, \Omega^o_t) = a_t, \quad a_t \in \{offer(\omega), \omega \in \Omega\} \qquad (4)$$
We assume incomplete opponent preference information; therefore, Decide uses the estimated opponent model $\hat{U}_o$. In particular, $\hat{U}_o$ is estimated at time $t$ using information from $\Omega^o_t$ (see the Methods section for more details). Unlike [10], we employ DRL in the acceptance strategy template and the bidding strategy template, in addition to the threshold utility (represented by the three green boxes in Figure 1), within the Decide component. Each DRL component is based on an actor-critic architecture [30] and has its own Evaluate and Negotiation Experience components.
Evaluate refers to a critic helping our agent learn the dynamic threshold utility $\bar{u}_t$, the acceptance strategy template parameters and the bidding strategy template parameters, using the new experience collected during the negotiation against each opponent agent. More specifically, it is a function of $K$ random experiences ($K < N$) fetched from the agent's memory. Here, learning is retrospective, since it depends on the reward $r_t$ obtained from $E$ by performing $a_t$ at $s_t$. The reward values for the critics used for estimating the threshold utility (i.e., $r^{\bar{u}_t}_t$), and the choice parameter values of the bidding (i.e., $r^{bid}_t$) and acceptance strategy templates (i.e., $r^{acc}_t$), depend on the discounted user utility of the last bid received from the opponent, $\omega^o_t$, or of the bid accepted by either party, $\omega^{acc}$, and are defined in (5), (6) and (7), respectively.

$$r^{\bar{u}_t}_t = \begin{cases} U_u(\omega^{acc}, t) & \text{on agreement} \\ U_u(\omega^o_t, t) & \text{on received offer} \\ -1 & \text{otherwise} \end{cases} \qquad (5)$$

$$r^{bid}_t = \begin{cases} U_u(\omega^{acc}, t) & \text{on agreement} \\ -1 & \text{otherwise} \end{cases} \qquad (6)$$

$$r^{acc}_t = \begin{cases} U_u(\omega^{acc}, t) & \text{on agreement and } U_o(\omega^{acc}, t) \le U_u(\omega^{acc}, t) \\ U_u(\omega^o_t, t) & \text{on rejection and } U_o(\omega^o_t, t) \ge U_u(\omega^o_t, t) \\ -1 & \text{otherwise} \end{cases} \qquad (7)$$
Rewards $r^{\bar{u}_t}_t$ (5) and $r^{bid}_t$ (6) are straightforward. In (7), $U_o(\omega, t)$ appears in the reward definition because the reward is received from the environment $E$ where the opponent agent resides. In other words, we assume that $E$ has access to $A_o$'s real preferences, i.e., $U_o$, but these preferences are not observable by our agent $A_u$. The first case of $r^{acc}_t$ deals with an agreed bid and returns a positive reward value if the bid gives higher utility to our agent than to the opponent. The second case deals with a rejected bid and returns a positive reward value if the bid gives lower utility to our agent than to the opponent. In all other cases, it returns a negative value. Also, in (5), (6) and (7), $U_u(\omega, t)$ is the discounted utility of $\omega$, defined in (8).

$$U_u(\omega, t) = U_u(\omega) \cdot d^{\,t}, \quad d \in [0, 1] \qquad (8)$$
In (8), 𝑑 is a temporal discount factor to encourage the agent to
negotiate without delay. We should not confuse 𝑑, which is typically
unknown to the agent, with the discount factor used to compute
the utility of an agreed bid (𝑑𝐷 ).
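A direct transcription of (5)-(8) might look as follows; the outcome labels and the discount value `d` are assumptions made for illustration.

```python
def discounted(u_of_bid, t, d=0.95):
    """U_u(omega, t) = U_u(omega) * d**t, Eq. (8); the value of d is assumed here."""
    return u_of_bid * (d ** t)

def reward_threshold(outcome, u_u_agreed, u_u_last_opp, t):
    """Eq. (5): reward for the threshold-utility critic."""
    if outcome == "agreement":
        return discounted(u_u_agreed, t)
    if outcome == "received_offer":
        return discounted(u_u_last_opp, t)
    return -1.0

def reward_bid(outcome, u_u_agreed, t):
    """Eq. (6): reward for the bidding-template critic."""
    return discounted(u_u_agreed, t) if outcome == "agreement" else -1.0

def reward_acc(outcome, u_u, u_o, t):
    """Eq. (7): reward for the acceptance-template critic; u_u and u_o are the
    (undiscounted) utilities of the agreed or rejected bid for our agent and the opponent."""
    if outcome == "agreement" and u_o <= u_u:
        return discounted(u_u, t)
    if outcome == "rejection" and u_o >= u_u:
        return discounted(u_u, t)
    return -1.0
```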
Negotiation Experience stores historical information about 𝑁
previous interactions of an agent with other agents. Experience
elements are of the form ⟨𝑠𝑡 , 𝑎𝑡 , 𝑟𝑡 , 𝑠𝑡 +1 ⟩, where 𝑠𝑡 is the internal
state representation of the negotiation environment 𝐸, 𝑎𝑡 is the performed action, 𝑟𝑡 is a scalar reward received from the environment
and 𝑠𝑡 +1 is the new agent state after executing 𝑎𝑡 .
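The Negotiation Experience component behaves like a standard replay memory; a minimal version (capacity $N$, sampling $K$ experiences for Evaluate) could be the following sketch.

```python
import random
from collections import deque

class NegotiationExperience:
    """Sketch of the experience memory storing <s_t, a_t, r_t, s_{t+1}> tuples."""
    def __init__(self, capacity_N=10000):
        self.memory = deque(maxlen=capacity_N)   # oldest experiences are dropped first

    def store(self, s_t, a_t, r_t, s_next):
        self.memory.append((s_t, a_t, r_t, s_next))

    def sample(self, K):
        """Return K random experiences (K < N) for the critic's Evaluate step."""
        return random.sample(self.memory, min(K, len(self.memory)))
```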
Strategy templates. The strategy templates of [10] are a general
form of parametric strategies for acceptance and bidding. These
strategies apply different tactics at different phases of the negotiation. The total number of phases 𝑛 and the number of tactics 𝑛𝑖
to choose from at each phase 𝑖 = 1, . . . , 𝑛 are the only parameters
fixed in advance. For each phase 𝑖, the duration 𝛿𝑖 (i.e., 𝑡𝑖+1 = 𝑡𝑖 +𝛿𝑖 )
and the particular choice of tactic are learnable parameters. The
latter is encoded with choice parameters 𝑐𝑖,𝑗 , where 𝑖 = 1, . . . , 𝑛
and 𝑗 = 1, . . . , 𝑛𝑖 , such that if 𝑐𝑖,𝑗 is true then the (𝑖, 𝑗)-th tactic is
selected for phase 𝑖. Tactics can be parametric in turn, and depend
on learnable parameters p𝑖,𝑗 .
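One way to represent such a template programmatically is sketched below: a list of phases, each with a learnable duration $\delta_i$, Boolean choice parameters $c_{i,j}$ and tactic parameters $\mathbf{p}_{i,j}$; the class and field names are ours, not taken from [10].

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tactic:
    name: str                                      # e.g. "quantile", "fixed_threshold", "drl_threshold"
    params: dict = field(default_factory=dict)     # learnable p_{i,j}, e.g. {"a": -0.2, "b": 0.22}

@dataclass
class Phase:
    delta: float                                   # learnable phase duration delta_i
    choices: List[bool]                            # learnable Boolean choice parameters c_{i,j}
    tactics: List[Tactic]                          # admissible tactics, fixed in advance

def active_tactics(phases, t):
    """Return the tactics selected (c_{i,j} = True) for the phase containing normalized time t."""
    start = 0.0
    for phase in phases:
        if start <= t < start + phase.delta:
            return [tac for c, tac in zip(phase.choices, phase.tactics) if c]
        start += phase.delta
    last = phases[-1]                              # fall back to the final phase if t == 1.0
    return [tac for c, tac in zip(last.choices, last.tactics) if c]
```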
We consider the same set of admissible tactics as [10]. The key difference is that our approach allows the entire strategy (within the space of strategies entailed by the template) to evolve at every negotiation, which makes it more adaptable and generalizable. The tactics used for acceptance strategies are:
• $U_u(\omega_t)$, the estimated utility of the bid $\omega_t$ that our agent would propose at time $t$.
• $Q_{U_u(\Omega^o_t)}(a \cdot t + b)$, where $U_u(\Omega^o_t)$ is the distribution of (estimated) utility values of the bids in $\Omega^o_t$, $Q_{U_u(\Omega^o_t)}(p)$ is the quantile function of such a distribution, and $a$ and $b$ are learnable parameters. In other words, we consider the $p$-th best utility received from the opponent, where $p$ is a learnable (linear) function of the negotiation time $t$. In this way, this tactic automatically and dynamically decides how much the agent should concede at time $t$. Here, $\mathbf{p}_{i,j} = \{a, b\}$.
• $\bar{u}_t$, the dynamic DRL-based utility threshold.
• $u$, a fixed utility threshold.
The bidding tactics are:
• $b^{Boulware}$, a bid generated by a time-dependent Boulware strategy [15].
• $PS(a \cdot t + b)$, which extracts a bid from the set of Pareto-optimal bids $PS$, derived using the NSGA-II algorithm3 [14] under $U_u$ and $\hat{U}_o$. In particular, it selects the bid that assigns a weight of $a \cdot t + b$ to our agent's utility (and $1 - (a \cdot t + b)$ to the opponent's), where $a$ and $b$ are learnable parameters telling how this weight scales with the negotiation time $t$. The TOPSIS algorithm [20] is used to derive such a bid, given the weighting $a \cdot t + b$ as input. Here, $\mathbf{p}_{i,j} = \{a, b\}$.
• $b^{opp}(\omega^o_t)$, a tactic that generates a bid by manipulating the last bid received from the opponent, $\omega^o_t$. The bid is modified in a greedy fashion by randomly changing the value of the least relevant issue (w.r.t. $U$) of $\omega^o_t$.
• $\omega \sim \mathcal{U}(\Omega_{\ge \bar{u}_t})$, a random bid above our DRL-based utility threshold $\bar{u}_t$.4
3 Meta-heuristics (instead of brute force) for Pareto-optimal solutions have the potential to deal efficiently with continuous issues.
4 $\mathcal{U}(S)$ is the uniform distribution over $S$, and $\Omega_{\ge \bar{u}_t}$ is the subset of $\Omega$ whose bids have estimated utility above $\bar{u}_t$ w.r.t. $U$.
Below, we give an example of a concrete acceptance strategy learned with our model. We use, as we will discuss in Section 6, a specific domain (Party) and we show how the strategy adapts in other negotiation domains (Grocery and Outfit) against the opponent strategy of [10].

(a) Party Domain
$t \in [0.000, 0.0361) \rightarrow U_u(\omega^o_t) \ge \max\{Q_{U(\Omega^o_t)}(-0.20 \cdot t + 0.22),\ \bar{u}_t\}$
$t \in [0.0361, 1.000] \rightarrow U_u(\omega^o_t) \ge \max\{u,\ Q_{U(\Omega^o_t)}(-0.10 \cdot t + 0.64)\}$

(b) Grocery Domain
$t \in [0.000, 0.2164) \rightarrow U_u(\omega^o_t) \ge \max\{U_u(\omega_t),\ Q_{U(\Omega^o_t)}(-0.55 \cdot t + 0.05),\ \bar{u}_t\}$
$t \in [0.2164, 0.3379) \rightarrow U_u(\omega^o_t) \ge \max\{U_u(\omega_t),\ Q_{U(\Omega^o_t)}(-0.60 \cdot t + 1.40)\}$
$t \in [0.3379, 1.000] \rightarrow U_u(\omega^o_t) \ge \max\{Q_{U(\Omega^o_t)}(-0.22 \cdot t + 0.29),\ \bar{u}_t\}$

(c) Outfit Domain
$t \in [0.000, 0.1545) \rightarrow U_u(\omega^o_t) \ge Q_{U(\Omega^o_t)}(-0.50 \cdot t + 0.70)$
$t \in [0.1545, 0.3496) \rightarrow U_u(\omega^o_t) \ge \max\{\bar{u}_t,\ Q_{U(\Omega^o_t)}(-0.50 \cdot t + 0.90)\}$
$t \in [0.3496, 1.000] \rightarrow U_u(\omega^o_t) \ge U_u(\omega_t)$
We can observe that the phase durations learned (left-hand side of the rules) differ across domains: e.g., in the first domain (Party) the first rule triggers when $t \in [0.0, 0.0361)$, while in the second (Grocery) and third (Outfit) domains the first rule triggers for $t \in [0.0, 0.2164)$ and $t \in [0.0, 0.1545)$, respectively. Similarly, the tactics and parameters on the right-hand side of the rules differ: in the first domain (Party), during the very early phase of the negotiation, the strategy uses a quantile tactic as well as the dynamic threshold utility; in the second domain (Grocery), the strategy also employs the future bid utility along with the quantile and dynamic threshold utility tactics; whereas in the third domain (Outfit) it only employs the quantile tactic.
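To show how such a learned strategy is executed, the sketch below implements the Party-domain rules (a) directly; here `u_bar_t` and `u_fixed` stand for the DRL-based and fixed thresholds, and the quantile is taken over the estimated utilities of the opponent's past bids (the orientation of the quantile and the handling of an empty bid history are assumptions).

```python
import numpy as np

def accept_party(u_received, opp_bid_utils, t, u_bar_t, u_fixed):
    """Learned Party-domain acceptance rules (a): accept iff the received bid's utility
    clears the phase-dependent threshold. Assumes at least one opponent bid was received."""
    q = lambda p: np.quantile(opp_bid_utils, np.clip(p, 0.0, 1.0))   # Q_{U(Omega_t^o)}(p)
    if t < 0.0361:
        threshold = max(q(-0.20 * t + 0.22), u_bar_t)
    else:
        threshold = max(u_fixed, q(-0.10 * t + 0.64))
    return u_received >= threshold
```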
5 METHODS
In our approach, we first use supervised learning (SL) to pre-train our agent using supervision examples collected from existing "teacher" negotiation strategies, as inspired by [9, 10]. The pre-trained strategy is then evolved via RL using experience and rewards collected while interacting with other agents in the negotiation environment. This combination of SL and RL enhances the process of learning an optimal strategy: applying RL alone from scratch would require a large amount of experience before reaching a reasonable strategy, which might hinder the online performance of our agent, whereas starting from a pre-trained policy ensures quicker convergence (as demonstrated empirically in [9, 10]).
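The SL stage can be realized as behavioural cloning of the teacher data; a minimal sketch (the loss, optimizer and hyperparameters are assumptions, not those of [9, 10]) is:

```python
import torch
import torch.nn as nn

def pretrain_actor(actor, dataset, epochs=50, lr=1e-3):
    """Supervised pre-training sketch: regress the actor's output onto the actions
    taken by the 'teacher' strategies. `dataset` yields (state, teacher_action) tensors."""
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for state, teacher_action in dataset:
            opt.zero_grad()
            loss = loss_fn(actor(state), teacher_action)
            loss.backward()
            opt.step()
    return actor   # this policy then seeds the DDPG actor for RL fine-tuning
```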
5.1 Data set collection
In order to collect the data set for pre-training our agent via SL, we used the GENIUS simulation environment [26]. In particular, in our experiments we generate supervision data using the existing DRL-based state-of-the-art agent negotiation model [10] by negotiating it against the winning strategies of the ANAC 2019 competition, i.e., AgentGG, KakeSoba and SAGA (readily available in GENIUS and requiring minimal changes to work for our negotiation settings), assuming no user preference uncertainty, in three different domains (Laptop, Holiday, and Party).
5.2 Strategy Representation
We represent both $f_a$ (3) and $f_b$ (4) using artificial neural networks (ANNs) [19], as these are powerful function approximators and benefit from extremely effective learning algorithms, unlike [10], which used a meta-heuristic optimization algorithm. We also use an ANN to predict the target threshold utility $\bar{u}_t$, as in [10].
5.2.1 ANN. In particular, we use feed-forward neural networks, i.e., functions organized into several layers, where each layer comprises a number of neurons that process information from the previous layer. More details can be found in [19]. Also, we keep the ANN configuration the same as in [10].
5.2.2 DRL. During our experiments, the agent negotiates with
fixed-but-unknown opponent strategies in a negotiation environment, which can be learnt by our agent after some simulation runs.
Hence, we consider our negotiation environment as fully-observable.
Following this, for our dynamic and episodic environment, we use a
model-free, off-policy RL approach which generates a deterministic
policy based on the policy gradient method to support continuous
control. More specifically, as in [10], we use Deep Deterministic
Policy Gradient (DDPG) algorithm, which is an actor-critic RL approach and generates a deterministic action selection policy for
the negotiating agent [25]. We consider a model-free RL approach
because our problem is how to make an agent decide what action
to take next in a negotiation dialogue rather than predicting the
new state of the environment. In other words, we are not learning
a model of the environment, as the strategies of the opponents are
not observable properties of the environment’s state. Thus, our
agent’s emphasis is more on learning what action to take next and
not the state transition function of the environment. We consider
the off-policy approach (i.e., the agent evaluates or improves a policy different from the one used to take actions) for independent exploration of continuous action spaces [25]. When in a state $s_t$, DDPG uses a so-called actor network $\mu$ to select an action $act_t$, and a so-called critic network $Q$ to predict the value $Q_t$ at state $s_t$ of the action selected by the actor:

$$act_t = \mu(s_t \mid \theta^{\mu}) \qquad (9)$$
$$Q_t(s_t, act_t \mid \theta^{Q}) = Q(s_t, \mu(s_t \mid \theta^{\mu}) \mid \theta^{Q}) \qquad (10)$$

In (9) and (10), $\theta^{\mu}$ and $\theta^{Q}$ are, respectively, the learnable parameters of the actor and critic neural networks. The parameters of the actor network are updated by the Deterministic Policy Gradient method [29]. The objective of the actor policy function is to maximize the expected return $J$ calculated by the critic function using (11). See [24] for further details on DDPG.

$$J = \mathbb{E}\big[\,Q(s, act \mid \theta^{Q})\,\big|_{s = s_t,\ act = \mu(s_t)}\,\big] \qquad (11)$$

In our experiments, for predicting the dynamic threshold utility, the actor function is a single-output regression ANN; for the acceptance and bidding strategies, it is a multiple-output regression ANN. In particular, when predicting $\bar{u}_t$, $act_t$ corresponds to $\bar{u}_t$; whereas, for the acceptance and bidding strategy templates, $act_t$ consists of a vector of multiple outputs $\big(\delta_i, (c_{i,j}, \mathbf{p}_{i,j})_{j=1,\ldots,n_i}\big)_{i=1,\ldots,n}$, including the duration of each negotiation phase $\delta_i$, the Boolean choice parameters $c_{i,j}$ and a set of learnable parameters $\mathbf{p}_{i,j}$ for each tactic $j$ that can be used in negotiation phase $i$.
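A minimal DDPG actor-critic pair matching (9)-(11) is sketched below in PyTorch; the layer sizes and activations are assumptions, since we keep the ANN configuration of [10].

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps a state to an action vector in [0, 1]^action_dim (Eq. 9)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Sigmoid())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, act | theta_Q): value of the action selected by the actor (Eq. 10)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, s, act):
        return self.net(torch.cat([s, act], dim=-1))

def actor_loss(actor, critic, states):
    """The actor maximizes J = E[Q(s, mu(s))] (Eq. 11), i.e., it minimizes -J."""
    return -critic(states, actor(states)).mean()
```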
5.3 Opponent modelling
We consider a negotiation environment with uncertainty about the opponent's preferences. To derive an estimate of the opponent model $\hat{U}_o$ during negotiation, we use the distribution-based frequency model proposed in [31], as also done in [10]. In this model, the empirical frequency of the issue values in $\Omega^o_t$ provides an educated guess of the opponent's most preferred issue values. The issue weights are estimated by analysing disjoint windows of $\Omega^o_t$, giving an idea of how the opponent's preferences shift over time.
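A simplified frequency-based estimate of $\hat{U}_o$ in the spirit of [31] is sketched below (issue weights are kept uniform and the window-based weight update is omitted, so this is an illustration rather than the exact model).

```python
from collections import defaultdict

class FrequencyOpponentModel:
    """Sketch: estimate the opponent's value preferences from how often each value
    appears in its bids; issue weights are uniform here for brevity."""
    def __init__(self, issues):
        self.issues = list(issues)
        self.counts = {i: defaultdict(int) for i in self.issues}

    def update(self, opponent_bid):
        for issue, value in opponent_bid.items():
            self.counts[issue][value] += 1

    def estimated_utility(self, bid):
        """Hat-U_o(omega): average of normalized value frequencies across issues."""
        total = 0.0
        for issue, value in bid.items():
            max_count = max(self.counts[issue].values(), default=0)
            total += (self.counts[issue][value] / max_count) if max_count else 0.0
        return total / len(self.issues)
```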
6 EXPERIMENTAL RESULTS AND DISCUSSIONS
All the experiments are performed using the GENIUS tool [26] and are designed to test the following two hypotheses:
• Hypothesis A: the DLST-based negotiation approach outperforms the "teacher" strategies in known negotiation settings in terms of individual and social efficiency.
• Hypothesis B: the DLST-based negotiation approach outperforms not-seen-before strategies and adapts to different negotiation settings in terms of individual and social efficiency.
6.1 Performance metrics
We measure the performance of each agent in terms of the following widely adopted metrics, inspired by the ANAC competition:
• $U_{ind}^{total}$: the utility gained by an agent averaged over all negotiations (↑);
• $U_{ind}^{s}$: the utility gained by an agent averaged over all successful negotiations (↑);
• $U_{soc}$: the utility gained by both negotiating agents averaged over all successful negotiations (↑);
• $P_{avg}$: the average minimal distance of agreements from the Pareto frontier (↓);
• $S_{\%}$: the proportion of successful negotiations (↑).
The first and second measures represent the individual efficiency of an outcome, whereas the third and fourth correspond to the social efficiency of agreements.
6.2 Experimental settings
Our proposed DLST-based agent negotiation model is evaluated against state-of-the-art strategies that participated in ANAC'17 and ANAC'18, which were designed independently by different research groups. Each agent has no information about the other agents' strategies beforehand. Details of all these strategies are available in [3, 21]. We evaluate our approach on a total of 11 negotiation domains which differ from each other in terms of size and opposition [4], to ensure good negotiation characteristics and to reduce any biases. The domain size refers to the number of issues, whereas opposition5 refers to the minimum distance from all possible outcomes to the point representing complete satisfaction of both negotiating parties, (1, 1). For the experiments of Hypothesis B, we choose 3 readily available small-sized, 2 medium-sized, and 3 large-sized domains. Of these domains, 2 have high, 3 medium and 3 low opposition (see [33] for more details).
For each configuration, each agent plays both roles in the negotiation (e.g., buyer and seller in the Laptop domain) to compensate for any utility differences in the preference profiles. We call user profile the agent's role along with the user's preferences. Also, we set $u_{res}$ and $d_D$ to their respective default values, whereas the deadline is set to 180s, normalized in [0, 1] (known to both negotiating parties in advance). For NSGA-II during the Pareto-bid generation phase, we choose a population size of 2% × |Ω|, 2 generations and a mutation count of 0.1. With these hyperparameters, on our machine6 the run-time of NSGA-II never exceeded the given timeout of 10s for deciding an action at each turn, while still retrieving empirically good solutions.
6.3 Empirical Evaluation
We evaluate and discuss the two hypotheses introduced at the beginning of the section.
6.3.1 Hypothesis A: DLST-based agent outperforms "teacher" strategies. We performed a total of 1200 negotiation sessions7 to evaluate the performance of the DLST-based agent against the four "teacher" strategies (ANESIA [10], AgentGG, KakeSoba and SAGA) in three domains (Laptop, Holiday, and Party). These strategies were used to collect the dataset in the same domains for supervised training before the DRL process begins. Table 1 reports the average results over all the domains and profiles for each agent. Clearly, the DLST-based agent outperforms the "teacher" strategies in terms of individual efficiency as well as social efficiency.
6.3.2 Hypothesis B: adaptive behaviour of DLST-based agents. We further evaluated the performance of the DLST-based agent against opponent agents from ANAC'17 and ANAC'18 that were unseen during training and are capable of learning from previous negotiations. For this, we performed two experiments, against ANAC'17 and ANAC'18 agents respectively, each with a total of 29120 negotiation sessions8. Results in Table 2 are averaged over all domains, and demonstrate that the DLST-based agent learns to make the optimal choice of tactics at run time and outperforms the other strategies in terms of $U_{ind}^s$ and $U_{soc}$. We also observed that our agent outperforms the current state-of-the-art (ANESIA) in a tournament with ANAC'17 and ANAC'18 strategies in all the domains used for evaluation, as shown in Figures 2-5. This indicates that the DLST approach of dynamically adapting the parameters of the acceptance and bidding strategies consistently improves on the ANESIA approach of keeping these parameters fixed once the agent is deployed.
5 The value of opposition reflects the competitiveness between parties in the domain. Strong opposition means a gain of one party is at the loss of the other, whereas weak opposition means that both parties either lose or gain simultaneously [4].
6 CPU: 8 cores, 2.10 GHz; RAM: 32 GB.
7 $n \times (n-1)/2 \times x \times y \times z = 1200$, where $n = 5$ is the number of agents in a tournament; $x = 2$, because agents play both sides; $y = 3$ is the number of domains; and $z = 20$, because each tournament is repeated 20 times.
8 $n \times (n-1)/2 \times x \times y \times z = 29120$, where $n = 14$, $x = 2$, $y = 8$, $z = 20$.
Laptop Domain
Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
DLST-agent | 0.0 ± 0.0 | 1.71 ± 0.03 | 0.91 ± 0.02 | 0.91 ± 0.02 | 1.00
ANESIA | 0.0 ± 0.0 | 1.66 ± 0.20 | 0.86 ± 0.03 | 0.86 ± 0.03 | 1.00
KakeSoba | 0.03 ± 0.12 | 1.48 ± 0.53 | 0.77 ± 0.20 | 0.82 ± 0.06 | 0.94
SAGA | 0.01 ± 0.06 | 1.45 ± 0.48 | 0.89 ± 0.13 | 0.89 ± 0.10 | 0.99
AgentGG* | 0.22 ± 0.35 | 1.14 ± 0.65 | 0.71 ± 0.38 | 0.91 ± 0.09 | 0.78

Holiday Domain
Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
DLST-agent | 0.05 ± 0.11 | 1.74 ± 0.14 | 0.96 ± 0.14 | 0.96 ± 0.14 | 1.00
ANESIA | 0.06 ± 0.1 | 1.74 ± 0.14 | 0.85 ± 0.15 | 0.85 ± 0.15 | 1.00
KakeSoba | 0.21 ± 0.35 | 1.53 ± 0.5 | 0.84 ± 0.27 | 0.92 ± 0.07 | 0.91
SAGA | 0.19 ± 0.36 | 1.55 ± 0.5 | 0.70 ± 0.25 | 0.77 ± 0.12 | 0.91
AgentGG* | 0.46 ± 0.58 | 1.16 ± 0.82 | 0.74 ± 0.45 | 0.96 ± 0.03 | 0.67

Party Domain
Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
DLST-agent | 0.15 ± 0.38 | 1.53 ± 0.6 | 0.74 ± 0.31 | 0.77 ± 0.14 | 0.87
ANESIA | 0.37 ± 0.32 | 1.06 ± 0.5 | 0.52 ± 0.27 | 0.62 ± 0.14 | 0.83
KakeSoba | 0.33 ± 0.32 | 1.11 ± 0.51 | 0.64 ± 0.3 | 0.75 ± 0.12 | 0.84
SAGA | 0.15 ± 0.16 | 1.36 ± 0.26 | 0.61 ± 0.19 | 0.63 ± 0.16 | 0.87
AgentGG* | 0.38 ± 0.42 | 0.92 ± 0.6 | 0.62 ± 0.4 | 0.77 ± 0.12 | 0.71

Table 1: Performance comparison of the DLST-agent with the "teacher" strategies for the three domains (Laptop, Holiday, and Party; all readily available in GENIUS). Best results are in bold. * means user preference uncertainty is considered.
Comparison of DLST and ANESIA with ANAC 2017 Agent Strategies
Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
DLST-agent | 0.0 ± 0.0 | 1.17 ± 0.12 | 0.90 ± 0.0 | 0.93 ± 0.0 | 1.0
ANESIA | 0.0 ± 0.0 | 1.16 ± 0.12 | 0.70 ± 0.25 | 0.76 ± 0.26 | 0.89
PonpokoAgent | 0.70 ± 0.49 | 0.44 ± 0.70 | 0.62 ± 0.19 | 0.93 ± 0.04 | 0.89
ShahAgent | 0.54 ± 0.54 | 0.79 ± 0.79 | 0.57 ± 0.07 | 0.64 ± 0.04 | 0.75
Mamenchis | 0.50 ± 0.05 | 0.80 ± 0.80 | 0.66 ± 0.16 | 0.82 ± 0.18 | 0.89
AgentKN | 0.0 ± 0.0 | 1.17 ± 0.0 | 0.65 ± 0.05 | 0.65 ± 0.05 | 1.0
Rubick | 1.08 ± 0.0 | 1.00 ± 0.0 | 0.50 ± 0.09 | 0.64 ± 0.04 | 0.76
ParsCat2 | 0.54 ± 0.54 | 0.80 ± 0.08 | 0.66 ± 0.16 | 0.82 ± 0.04 | 0.57
SimpleAgent | 1.08 ± 0.0 | 0.90 ± 0.0 | 0.57 ± 0.14 | 0.57 ± 0.14 | 1.0
AgentF | 1.18 ± 0.0 | 1.07 ± 0.06 | 0.51 ± 0.0 | 0.81 ± 0.0 | 0.89
TucAgent | 0.08 ± 0.29 | 0.90 ± 0.03 | 0.65 ± 0.38 | 0.52 ± 0.16 | 0.69
MadAgent | 0.67 ± 0.05 | 1.09 ± 0.17 | 0.57 ± 0.0 | 0.57 ± 0.0 | 1.0
GeneKing | 1.08 ± 0.0 | 0.99 ± 0.14 | 0.75 ± 0.0 | 0.67 ± 0.24 | 0.63
Farma17 | 0.77 ± 0.49 | 0.44 ± 0.70 | 0.65 ± 0.19 | 0.93 ± 0.04 | 0.79

Comparison of DLST and ANESIA with ANAC 2018 Agent Strategies
Agent | P_avg (↓) | U_soc (↑) | U_ind^total (↑) | U_ind^s (↑) | S_% (↑)
DLST-agent | 0.00 ± 0.08 | 1.54 ± 0.17 | 0.86 ± 0.07 | 0.87 ± 0.06 | 0.91
ANESIA | 0.00 ± 0.09 | 1.41 ± 0.16 | 0.74 ± 0.14 | 0.84 ± 0.14 | 0.78
AgentHerb | 0.02 ± 0.05 | 0.79 ± 0.11 | 0.78 ± 0.02 | 0.78 ± 0.11 | 0.61
AgreeableAgent | 0.05 ± 0.11 | 1.12 ± 0.23 | 0.53 ± 0.10 | 0.56 ± 0.05 | 0.54
Sontag | 0.03 ± 0.07 | 0.73 ± 0.18 | 0.78 ± 0.08 | 0.79 ± 0.07 | 0.59
Agent33 | 0.04 ± 0.07 | 0.74 ± 0.18 | 0.68 ± 0.09 | 0.78 ± 0.09 | 0.79
AgentNP1 | 0.04 ± 0.06 | 0.73 ± 0.16 | 0.65 ± 0.10 | 0.65 ± 0.1 | 0.69
FullAgent | 0.02 ± 0.04 | 0.67 ± 0.12 | 0.69 ± 0.05 | 0.77 ± 0.12 | 0.61
ATeamAgent | 0.09 ± 0.06 | 0.58 ± 0.13 | 0.75 ± 0.10 | 0.75 ± 0.08 | 0.75
ConDAgent | 0.06 ± 0.09 | 1.16 ± 0.20 | 0.68 ± 0.11 | 0.65 ± 0.11 | 0.56
GroupY | 0.03 ± 0.06 | 0.66 ± 0.15 | 0.53 ± 0.07 | 0.54 ± 0.06 | 0.58
Yeela | 0.04 ± 0.06 | 0.68 ± 0.14 | 0.73 ± 0.08 | 0.73 ± 0.07 | 0.66
Libra | 0.10 ± 0.09 | 0.54 ± 0.19 | 0.71 ± 0.08 | 0.56 ± 0.04 | 0.77
ExpRubick | 0.00 ± 0.02 | 1.10 ± 0.18 | 0.78 ± 0.08 | 0.80 ± 0.12 | 0.91

Table 2: Performance comparison of the DLST-agent with existing strategies, averaged over all 8 domains (Airport Site, Camera, Energy, Fitness, Flight, Grocery, Itex-Cypress, Outfit; all readily available in GENIUS). Best results are in bold.
Figure 2: Comparison of DLST-agent vs ANESIA in terms of agreement rate $S_{\%}$ (↑)
Figure 3: Comparison of DLST-agent vs ANESIA in terms of social welfare utility $U_{soc}$ (↑)
Figure 4: Comparison of DLST-agent vs ANESIA in terms of individual utility over successful negotiations $U_{ind}^{s}$ (↑)
Figure 5: Comparison of DLST-agent vs ANESIA in terms of individual utility over all negotiations $U_{ind}^{total}$ (↑)
7 CONCLUSIONS AND FUTURE WORK
This work uses actor-critic deep reinforcement learning to support negotiation in domains with multiple issues. In particular, it exploits "interpretable" strategy templates from the state-of-the-art to learn the best combination of acceptance and bidding tactics at any negotiation time and, among its tactics, it uses an adaptive threshold utility, all learned using the DDPG algorithm starting from an initial neural network strategy derived via supervised learning. We have empirically evaluated the performance of our DLST-based approach against the "teacher strategies" as well as the agent strategies of the ANAC'17 and ANAC'18 competitions (since those tournaments allowed learning from previous negotiations) in different settings, showing that our agent outperforms opponents known at training time and can effectively transfer its knowledge to environments with previously unseen opponent agents and domains.
An open problem worth pursuing in the future is how to learn
transferable strategies for concurrent bilateral negotiations over
multiple issues.
REFERENCES
[1] Bedour Alrayes, Ozgur Kafali, and Kostas Stathis. 2014. CONAN: a heuristic strategy for COncurrent Negotiating AgeNts. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. 1585-1586.
[2] Bedour Alrayes, Özgür Kafalı, and Kostas Stathis. 2018. Concurrent bilateral negotiation for open e-markets: the CONAN strategy. Knowledge and Information Systems 56, 2 (2018), 463-501.
[3] Reyhan Aydoğan, Katsuhide Fujita, Tim Baarslag, Catholijn M Jonker, and Takayuki Ito. 2018. ANAC 2017: Repeated multilateral negotiation league. In International Workshop on Agent-Based Complex Automated Negotiation. Springer, 101-115.
[4] Tim Baarslag, Katsuhide Fujita, Enrico H Gerding, Koen Hindriks, Takayuki Ito, Nicholas R Jennings, Catholijn Jonker, Sarit Kraus, Raz Lin, Valentin Robu, et al. 2013. Evaluating practical negotiating agents: Results and analysis of the 2011 international competition. Artificial Intelligence 198 (2013), 73-103.
[5] Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. 2016. Learning about the opponent in automated bilateral negotiation: a comprehensive survey of opponent modeling techniques. Autonomous Agents and Multi-Agent Systems 30, 5 (2016), 849-898.
[6] Tim Baarslag, Koen Hindriks, Mark Hendrikx, Alexander Dirkzwager, and Catholijn Jonker. 2014. Decoupling negotiating agents to explore the space of negotiation strategies. In Novel Insights in Agent-based Complex Automated Negotiation. Springer, 61-83.
[7] Tim Baarslag, Koen Hindriks, Catholijn Jonker, Sarit Kraus, and Raz Lin. 2012. The first automated negotiating agents competition (ANAC 2010). In New Trends in Agent-based Complex Automated Negotiations. Springer, 113-135.
[8] Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. 2020. A Deep Reinforcement Learning Approach to Concurrent Bilateral Negotiation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. ijcai.org, 297-303.
[9] Pallavi Bagga, Nicola Paoletti, Bedour Alrayes, and Kostas Stathis. 2021. ANEGMA: an automated negotiation model for e-markets. Autonomous Agents and Multi-Agent Systems 35, 2 (2021), 1-28.
[10] Pallavi Bagga, Nicola Paoletti, and Kostas Stathis. 2020. Learnable strategies for bilateral agent negotiation over multiple issues. arXiv preprint arXiv:2009.08302 (2020).
[11] Pallavi Bagga, Nicola Paoletti, and Kostas Stathis. 2021. Pareto Bid Estimation for Multi-Issue Bilateral Negotiation under User Preference Uncertainty. In 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, 1-6.
[12] Jasper Bakker, Aron Hammond, Daan Bloembergen, and Tim Baarslag. 2019. RLBOA: A Modular Reinforcement Learning Framework for Autonomous Negotiating Agents. In AAMAS. 260-268.
[13] Stefania Costantini, Giovanni De Gasperis, Alessandro Provetti, and Panagiota Tsintza. 2013. A heuristic approach to proposal-based negotiation: with applications in fashion supply chain management. Mathematical Problems in Engineering 2013 (2013).
[14] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 2 (2002), 182-197.
[15] S Shaheen Fatima, Michael Wooldridge, and Nicholas R Jennings. 2001. Optimal negotiation strategies for agents with incomplete information. In International Workshop on Agent Theories, Architectures, and Languages. Springer, 377-392.
[16] Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. 2002. Multi-issue negotiation under time constraints. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1. 143-150.
[17] Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. 2005. A comparative study of game theoretic and evolutionary models of bargaining for software agents. Artificial Intelligence Review 23, 2 (2005), 187-205.
[18] S Shaheen Fatima, Michael J Wooldridge, and Nicholas R Jennings. 2006. Multi-issue negotiation with deadlines. Journal of Artificial Intelligence Research 27 (2006), 381-417.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[20] Ching-Lai Hwang and Kwangsun Yoon. 1981. Methods for multiple attribute decision making. In Multiple Attribute Decision Making. Springer, 58-191.
[21] Catholijn Jonker, Reyhan Aydogan, Tim Baarslag, Katsuhide Fujita, Takayuki Ito, and Koen Hindriks. 2017. Automated negotiating agents competition (ANAC). In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[22] Catholijn M Jonker, Valentin Robu, and Jan Treur. 2007. An agent architecture for multi-attribute negotiation using incomplete preference information. Autonomous Agents and Multi-Agent Systems 15, 2 (2007), 221-252.
[23] Usha Kiruthika, Thamarai Selvi Somasundaram, and S Kanaga Suba Raja. 2020. Lifecycle model of a negotiation agent: A survey of automated negotiation techniques. Group Decision and Negotiation 29, 6 (2020), 1239-1262.
[24] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[25] Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016).
[26] Raz Lin, Sarit Kraus, Tim Baarslag, Dmytro Tykhonov, Koen Hindriks, and Catholijn M Jonker. 2014. Genius: An integrated environment for supporting the design of generic automated negotiators. Computational Intelligence 30, 1 (2014), 48-70.
[27] Yousef Razeghi, Celal Ozan Berk Yavaz, and Reyhan Aydoğan. 2020. Deep reinforcement learning for acceptance strategy in bilateral negotiations. Turkish Journal of Electrical Engineering & Computer Sciences 28, 4 (2020), 1824-1840.
[28] Ariel Rubinstein. 1982. Perfect equilibrium in a bargaining model. Econometrica: Journal of the Econometric Society (1982), 97-109.
[29] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
[30] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[31] Okan Tunalı, Reyhan Aydoğan, and Victor Sanchez-Anguix. 2017. Rethinking frequency opponent modeling in automated negotiation. In International Conference on Principles and Practice of Multi-Agent Systems. Springer, 263-279.
[32] Colin R Williams, Valentin Robu, Enrico H Gerding, and Nicholas R Jennings. 2012. Iamhaggler: A negotiation agent for complex environments. In New Trends in Agent-based Complex Automated Negotiations. Springer, 151-158.
[33] Colin R Williams, Valentin Robu, Enrico H Gerding, and Nicholas R Jennings. 2014. An overview of the results and insights from the third automated negotiating agents competition (ANAC2012). Novel Insights in Agent-based Complex Automated Negotiation (2014), 151-162.
[34] Yoshiaki Yasumura, Takahiko Kamiryo, Shohei Yoshikawa, and Kuniaki Uehara. 2009. Acquisition of a concession strategy in multi-issue negotiation. Web Intelligence and Agent Systems: An International Journal 7, 2 (2009), 161-171.
[35] Shohei Yoshikawa, Yoshiaki Yasumura, and Kuniaki Uehara. 2008. Strategy acquisition on multi-issue negotiation without estimating opponent's preference. In KES International Symposium on Agent and Multi-Agent Systems: Technologies and Applications. Springer, 371-380.