Secure Failure Detection and Consensus in Trustedpals
Secure Failure Detection and Consensus in Trustedpals
Secure Failure Detection and Consensus in Trustedpals
1 INTRODUCTION
C
ONSIDER a set of parties who wish to correctly compute
some common function F of their local inputs, while
keeping their local data as private as possible, but who do not
trust each other, nor the channels by which they commu-
nicate. This is the problem of Secure Multiparty Computation
(SMC) [1]. SMC is a very general security problem, i.e., it can
be used to solve various real-life problems such as
distributed voting, private bidding and online auctions,
sharing of signature or decryption functions and so on.
Unfortunately, solving SMC iswithout extra assump-
tionsvery expensive in terms of communication (number
of messages), resilience (amount of redundancy), and time
(number of synchronous rounds).
TrustedPals [2] is a smart card-based security framework
for solving SMC which allows much more efficient solutions
to the problem. Conceptually, TrustedPals considers a
distributed system in which processes are locally equipped
with tamper-proof security modules. In practice, processes
are implemented as a Java desktop application and security
modules are realized using Java Card Technology enabled
smart cards [3], tamper-proof Subscriber Identity Modules
(SIM) [4] like those used in mobile phones, or storage cards
with built-in tamper-proof processing devices [5]. Roughly
speaking, solving SMC among processes is achieved by
havingsecuritymodules jointlysimulate a Trusted Third Party
(TTP), as we now explain.
To solve SMC in the TrustedPals framework, the
function F is coded as a Java function and is distributed
within the network in an initial setup phase. Then, processes
hand their input value to their security module and the
framework accomplishes the secure distribution of the input
values. Finally, all security modules compute F and return
the result to their process. The network of security modules
sets up confidential and authenticated channels between
each other and operates as a secure overlay within the
distribution phase. Roughly speaking, within this secure
overlay arbitrary and malicious behavior of an attacker is
reduced to process crashes and message omissions. Trus-
tedPals therefore allows to reduce the security problem of
SMC to a problem of fault-tolerant synchronization [2], an
area which has a long research tradition in fault-tolerant
distributed computing (see, for example, Lynch [6]). How-
ever, solving the synchronizationproblemalone is not trivial,
especially since we investigate it under message omission
failures, a failure scenario which is rather unusual. Further-
more, the reduction from security to fault-tolerance creates a
new set of validation obligations regarding the integration of
a fault-tolerant algorithm into a secure system, which is also
far from being trivial.
The initial definition of TrustedPals and its implementa-
tion assumed a synchronous network setting, i.e., a setting in
which all important timing parameters of the network are
bounded and known. This makes TrustedPals sensitive to
unforeseen variations in network delay and therefore not
very suitable for deployment in networks like the Internet.
610 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
. R. Cortinas, A. Lafuente, M. Larrea, and I. Soraluze are with the Facultad
de Informatica, University of the Basque Country UPV/EHU, Paseo
Manuel de Lardizabal 1, E-20018 Donostia, Spain.
E-mail: {roberto.cortinas, alberto.lafuente, mikel.larrea,
iratxe.soraluze}@ehu.es.
. F.C. Freiling is with the Friedrich-Alexander-University Erlangen-
Nurnberg, Lehrstuhl fur Informatik 1 (IT-Sicherheitsinfrastrukturen),
Am Wolfmantel 46, 91058 Erlangen, Germany.
E-mail: felix.freiling@cs.fau.de.
. M. Ghajar-Azadanlou is with the Bayer Business Services, 51368
Leverkusen, Germany.
E-mail: Marjan.Ghajar-Azadanlou@rwth-aachen.de.
. L.D. Penso is with the Faculty of Mathematics and Economics, Institute for
Optimization and Operations Research, University of Ulm, 89069 Ulm,
Germany. E-mail: lucia.penso@uni-ulm.de.
Manuscript received 2 July 2010; revised 23 June 2011; accepted 9 Feb. 2012;
published online 17 Feb. 2012.
For information on obtaining reprints of this article, please send e-mail to:
tdsc@computer.org, and reference IEEECS Log Number TDSC-2010-07-0109.
Digital Object Identifier no. 10.1109/TDSC.2012.23.
1545-5971/12/$31.00 2012 IEEE Published by the IEEE Computer Society
In this paper, we explore how to make TrustedPals
applicable in environments with less synchrony. More
precisely, we show how to solve the asynchronous version
of SMC using asynchronous synchronization algorithms
inspired by recent results in fault-tolerant distributed
computing: we use an asynchronous consensus algorithm
and encapsulate (some very weak) timing assumptions
within a device known as a failure detector [7].
The concept of a failure detector has been investigated in
quite some detail in systems with merely crash faults [8]. In
such systems, correct processes (i.e., processes which do not
crash) must eventually permanently suspect crashed pro-
cesses. There is very little work on failure detection and
consensus in message omission environments. In fact, it is
not clear what a sensible definition of a failure detector (and
consensus) is in such environments because the notion of a
correct process can have several different meanings (e.g., a
process with no failures whatsoever or a process which
does not crash but just omits some messages). In this work,
instead of correct processes, we will consider well-connected
processes, i.e., those processes which are able to compute
and communicate without omissions with a majority of
processes.
1.1 Related Work
Our work on TrustedPals can be regarded as building
failure detectors for arbitrary (Byzantine) failures, which has
been investigated previously (see, for example, Malkhi and
Reiter [9], Kihlstrom et al. [10], Doudou et al. [11], Doudou
et al. [12], Haeberlen et al. [13], [14], and more recently
Haeberlen and Kuznetsov [15]). In contrast to previous
works on Byzantine failure detectors, we use security
modules to avoid the tar pits of this area. This contrasts
TrustedPals to the large body of work that tackles Byzantine
faults directly, like Castro and Liskovs Practical Byzantine
Fault-Tolerance [16] or more recently the Aardvark system
by Clement et al. [17] and the Turquois protocol by Moniz
et al. [18]. While being conceptually simpler, Byzantine-
tolerant protocols necessarily have to assume a two-thirds
majority of correct processes in nonsynchronous settings
[19] while TrustedPals needs only a simple majority (due to
the availability of the secure overlay network). Next to an
improved resilience, TrustedPals by design can provide
secrecy of data against attackers, a notion that can only be
achieved in Byzantine-tolerant algorithms by applying
complex secret sharing mechanisms [20]. All these advan-
tages result from using security modules to constrain the
attacker in such a way that Byzantine faults are reduced to
general omission faults.
Delporte-Gallet et al. [21] were the first to investigate
nonsynchronous settings in the TrustedPals context, always
with the (implicit) motivation to make TrustedPals more
practical. Following the approach of Chandra and Toueg [7]
(and similar to this paper) they separate the trusted system
into an asynchronous consensus layer and a partially
synchronous failure detection layer. The main difference
however is that they assume that transient omissions are
masked by a piggybacking scheme while we detect transient
omissions and, therefore, we do not need to piggyback all
message history. Besides, they solve a different version of
consensus than we do: roughly speaking, message omissions
can cause processes to only be able to communicate indirectly
and we admit processes to participate in consensus even if
they cannot communicate directly. Delporte-Gallet et al. [21]
only guarantee that all processes that can communicate
directly with eachother solve consensus. Incontrast, we allow
also another set of processes to propose and decide: those
which are able to send and receive messages even indirectly.
As a minor difference, we focus on the class T of eventually
perfect failure detectors whereas they [21] implement the
failure detector. Furthermore, they [21] do not describe
how to integrate failure detection and consensus within the
TrustedPals framework: a realistic adversary who is able to
selectively influence the communication messages of the
algorithms for failure detectionandconsensus cancause their
consensus algorithmto fail. This problemis partly addressed
in a recent paper [22] where consensus and failure detection
are integrated for efficiency purposes, not for security.
Apart from Delporte-Gallet et al. [21], other authors also
investigated solving consensus in systems with omission
faults. Work by Dolev et al. [23], [24] also follows the failure
detector approach to solve consensus, however they focus
on the class o(om) of failure detectors. Babaoglu et al. [25]
also follow the path of o to solve consensus in partition-
able systems. Alternatively, Santoro and Widmayer [26]
assume a synchronous system model, and Moniz et al. [27]
use randomization.
Recently, solving SMC without security modules has
received some attention focusing mainly on two-party
protocols [28], [29], [30], [31], [32], [33]. In systems with
security modules, Avoine and Vaudenay [34] examined the
approach of jointly simulating a TTP. This approach was
later extended by Avoine et al. [35] who show that in a
system with security modules fair exchange can be reduced
to a special form of consensus. They derive a solution to fair
exchange in a modular way so that the agreement abstrac-
tion can be implemented in diverse manners. Benenson et al.
[2] extended this idea to the general problem of SMC and
showed that the use of security modules cannot improve the
resilience of SMC but enables more efficient solutions for
SMC problems. All these papers assume a synchronous
network model.
Ben-Or et al. [36] were the first to investigate solutions to
the asynchronous variant of SMC which is slightly more
involved than its synchronous counterpart because the
failure to provide input to F by some party cannot be
attributed solely to that partyit can also be due to
unpredictable delays in the network. This has consequences
regarding the resilience of SMC. While it is possible to solve
SMC in the synchronous setting with a simple majority of
benign processes [37], [38], in asynchronous settings a two-
thirds majority is necessary [39].
Correia et al. [40] present a system which employs a
real-time distributed security kernel to solve SMC. The
architecture is very similar to that of TrustedPals as it also
uses the notion of architectural hybridization [41]. How-
ever, the adversary model of Correia et al. [40] assumes
that the attacker only has remote access to the system
while TrustedPals allows the owner of a security module
to be the attacker. Like other previous works [2], [34], [35],
Correia et al. [40] also assume a synchronous network
model at least in a part of the system.
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 611
1.2 Contributions
In this paper, we present a modular redesign of TrustedPals
using consensus and failure detection as modules. More
specifically, we make the following technical contributions:
. We show how to solve asynchronous Secure Multi-
party Computation by implementing TrustedPals in
asynchronous systems with a (weak) failure detec-
tor. We do this by reducing the problem of SMC to
the problem of uniform consensus in omission
failure environments. As a corollary we show that
in systems with security modules and weak timing
assumptions the resilience of asynchronous SMC can
be improved from a two-thirds majority to a simple
majority of benign processes.
. We propose a new definition of connectedness in
omission environments. Informally, a process is in-
connected if it does not crash and, despite omissions,
receives either directly or indirectly all messages that
a majority of processes sends to it. Similarly, a
process is out-connected if it does not crash and all
messages it sends are received by a majority of
processes. We also consider well-connected processes,
which are those processes that are both in-connected
and out-connected.
. We give a novel definition of consensus in the new
omission model, by refining the termination prop-
erty of consensus (Every in-connected process
eventually decides some value), and an algorithm
which uses the failure detector class T(om) to
solve consensus. That algorithm is an adaptation of
the classic algorithm by Chandra and Toueg for the
crash model.
. We give a novel definition of T in the omission
model, T(om), and we show how to implement it in
a system with weak synchrony assumptions in the
spirit of partial synchrony.
. We integrate failure detection and consensus se-
curely in TrustedPals by employing message pad-
ding and dummy traffic, tools known from the area
of privacy enhancing techniques.
1.3 Paper Outline
This paper is structured as follows: In Section 2 we give an
overview over TrustedPals, its architecture and the motiva-
tion behind the definitions and model and show how
asynchronous SMC can be reduced to uniform consensus.
In Section 3 we fully formalize the system model of
TrustedPals. In Section 4 we show how to solve consensus
using an abstract failure detector of the class T(om). In
Section 5 we show how to implement the failure detector
T(om) in the omission failure model under very weak
synchrony assumptions. In Section 6 we describe how to
integrate failure detection and consensus securely in the
TrustedPals framework. We conclude in Section 7.
2 TRUSTEDPALS IN WEAKLY SYNCHRONOUS
SYSTEMS
We now present an overview over the TrustedPals system
architecture and show how the reduction from SMC to
consensus can be done in nonsynchronous systems.
Additionally, this section is meant as an informal introduc-
tion giving a high level view of the definitions used in the
remainder of the paper. All necessary notions are fully
formalized later in Section 3.
2.1 Untrusted and Trusted System
We formalize the system assumptions within a hybrid
model, i.e., the model is divided into two parts (see Fig. 1).
The lower part consists of n processes which represent the
untrusted hosts. The upper part equally consists of n processes
which represent the security modules. Due to the lack of
mutual trust among untrusted hosts, we call the former part
the untrusted system. Since the security modules trust each
other we call the latter part the trusted system.
The processes in the untrusted system (i.e., the hosts)
execute (possibly untrustworthy) user applications like
e-banking or e-voting programs. Because of the untrust-
worthy nature of these processes, they use the trusted
system as a subsystem to solve the involved security
problems. The trusted system consists of software running
inside the security modules. This software must have been
certified by some accepted authority. It is not possible for the
user to install arbitrary software on the trusted system. The
tamper-proof nature of the trusted processes allows to
protect stored and transmitted information even from the
untrusted processes on which they reside. The authority can
be an independent service provider (like a network
operator) and is only necessary within the bootstrap phase
of the system, not during any operational phases (like
running the SMC algorithms).
Formally, the connection between the untrusted and
trusted system is achieved by associating each process in
the untrusted system (i.e., each host) with exactly one
process in the trusted system (i.e., a security module) and
vice versa. Hence, every untrusted process has a trusted
pal (an associated trusted process). Since host and security
module reside on the same physical machine, we assume
that for each association there is a bidirectional eventually
timely and secure communication channel, e.g., implemen-
ted by shared variables or message passing communication
in the host operating system.
612 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 1. The untrusted and trusted system.
For better readability, we sometimes refer to untrusted
processes simply as hosts and to trusted processes simply as
security modules. We refer to TrustedPals as the collection of
trusted processes together with the services they provide to
the untrusted processes.
2.2 Defining Asynchronous SMC with TrustedPals
As mentioned in the introduction, SMC allows the set of
hosts to correctly compute some common function F of
their local values. While the original work on TrustedPals
[2] used the synchronous definition of SMC given by
Goldreich [42], we follow the definition of asynchronous
SMC given by Ben-Or et al. [36]. The difference between the
synchronous and asynchronous form of SMC lies in the fact
that F is not computed on all inputs but rather on a subset
of inputs of size at least n f, where f is an upper bound
on the number of faulty (i.e., nonbenign) processes. A
process is benign if, besides not crashing, it follows its
protocol (i.e., it is non-Byzantine).
More formally, let x
1
; . . . ; x
n
be the private inputs of each
host. In asynchronous SMC, the result r is computed on the
inputs from a subset C of hosts of size at least n f, i.e.,
r = F(y
1
; . . . ; y
n
) where y
i
= x
i
if host i is in C and some
default value (e.g., y
i
= 0) otherwise. The result r should be
computed reliably and securely, i.e., as if all hosts were
using a TTP. This means that the individual inputs remain
secret to other hosts (apart from what is given away by r)
and that malicious hosts can neither prevent the computa-
tion from taking place nor influence r in favorable ways.
In this paper, we use the following definition of
(asynchronous) SMC: A protocol solves secure multiparty
computation with TrustedPals if it satisfies the following
properties [36], [42]:
. SMC-Validity. If a host receives a result, then that
result was computed by applying F on the inputs
from a subset C of hosts of size at least n f, i.e.,
r = F(y
1
; . . . ; y
n
) where y
i
= x
i
if host i is in C and
some default value (e.g., y
i
= 0) otherwise. The set C
is the same for all hosts that receive a result.
. SMC-Agreement. No two values returned by security
modules differ.
. SMC-Termination. Every benign host eventually
receives a result from its associated security module.
. SMC-Privacy. Faulty hosts learn nothing about the
input values of benign hosts (apart from what is
given away by the result r and the input values of all
faulty hosts).
Recall that these properties abstractly specify what
happens when a TTP is used to solve the problem [36], [42].
2.3 Untrusted System: Assumptions
Within the untrusted system each pair of hosts is connected
by a pair of unidirectional communication channels, one in
each direction. We assume that there is a minimal set of
reliable channels in the system. The rest of the channels can
be lossy. Recall that every message sent through a reliable
channel is eventually delivered at the destination. We
assume no particular ordering relation on channels.
2.3.1 Failure Model
The process failure model we assume in the untrusted
system is the Byzantine failure model [43]. A Byzantine
process can behave arbitrarily. We will assume a majority of
benign processes in the untrusted system.
2.3.2 Timing
We assume that a local real-time clock is available to each
process in the untrusted system, but clocks are not
necessarily synchronized within the network. The untrusted
system is assumed to be partially synchronous, meaning that
eventually unknown bounds on processing and commu-
nication delays hold for the majority of benign processes.
The model is a variant of the partial synchrony models of
Dwork et al. [44]. The differences are that the bounds must
hold for just a majority of processes, and that we assume a
set of reliable, eventually timely channels connecting those
processes.
2.4 Trusted System: Assumptions
The trusted system can be considered as an overlay
networka network that is built on top of another
networkover the untrusted system. Nodes in the overlay
network can be thought of as being connected by virtual or
logical channels. In practice, for example, smart cards could
form the overlay network which runs on top of the Internet
modeled by the untrusted processes. In the trusted system,
each process has also an outgoing and an incoming
communication channel with every other process.
2.4.1 Trust Model
Within the trusted system we assume that any two commu-
nicating parties can establish mutual message confidenti-
ality, message integrity, and message authentication. This
can be realized, for example, by exchanging cryptographic
keys during a setup phase of the system. As mentioned
above, we assume that the code running within the trusted
system has been certified by some trusted authority, i.e.,
nodes in the trusted system may assume that each others
programs have not been tampered with. The trusted
authority acts only during the setup phase of the system,
not during the operational phase.
2.4.2 Timing
Security modules do not have any clock, they have just a
simple step counter, whereby a step consists of possibly
consuming a message, executing a local computation and
possibly sending a message. Passing of time is checked by
counting the number of steps executed. Roughly speaking,
the timing assumptions for the processes in the trusted
system are the same as those of the untrusted system, i.e., we
assume partial synchrony. However, as we will explain next,
in case the trusted process is associated with an untrusted
process which is faulty/malicious, the trusted process may
not rely on any timing assumptions whatsoever.
2.4.3 Failure Model
Like the untrusted system, the trusted system is also prone
to attacks. However, the assumptions on the security
modules and the possibility to establish secure channels
reduce the options of the malicious hosts to attacks on the
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 613
liveness of the system, i.e., 1) destruction of the security
module, 2) interception of messages between the channel
and the security module, or 3) changes in the frequency
of the step counter. This way, in the trusted system we
assume the failure model of general omission and some
asynchrony that can only affect those trusted processes
associated with faulty processes in the untrusted system
(i.e., security modules residing on Byzantine hosts).
The concept of omission faults, meaning that a process
drops a message either while sending (send omission) or
while receiving it (receive omission), was introduced by
Hadzilacos [45] and later generalized by Perry and Toueg
[46]. In the general omission model processes can fail by
crashing or experience either send omissions or receive
omissions. In our system we allow the possibility of transient
omissions, i.e., a process may temporarily drop messages
and later on reliably deliver messages again. Of course,
permanent omissions are possible too.
Besides omissions, trusted processes can become arbi-
trarily slow (asynchronous) although the physical system in
which they operate (i.e., the untrusted system) is partially
synchronous. This models the effect of two different types
of attacks by malicious hosts which we now explain:
. Timing attacks. Recall that security modules use a step
counter to be aware of passing of time. The speed of
the step counter is controlled by the associated host.
In a timing attack, a malicious host arbitrarily changes
the speed of the step counter of its security module. In
that way, it can make the security module work faster
(although the speed is physically bounded by the
hosts own clock) or slower (not bounded). As a
consequence, the behavior of its security module
could become asynchronous and thus the commu-
nication through all the virtual channels which are
adjacent to that security module would become
asynchronous as well.
. Buffering attacks. As we have previously pointed out,
a malicious host can intercept messages between its
security module and the communication channels.
Removal of an intercepted message is modeled as a
message omission. In a buffering attack, the host does
not remove the messages but stores them in a buffer
and later on injects them into the communication
channel after an arbitrary delay. This means that the
message is not omitted but communication through
that channel may become asynchronous in the
trusted system.
Observe that both attacks affect the timing behavior of the
attacked security module. However, buffering attacks are
more selective, since a particular communication channel of
a security module could be attacked (become asynchronous
in the trusted system) without affecting the rest of the
communication channels of that security module.
In addition to the previous types of attacks, message
reordering attacks are treated as a particular case of
message buffering attacks, since in our algorithms no
message is delivered if it is not the expected one (messages
carry a unique sequence number). Also, note that the case of
buffer overflow in the smart card (e.g., if the attacker buffers
a lot of messages and then passes all those messages at the
same time to the smart card) can be naturally treated as if
the smart card omitted the reception of some message(s).
To summarize, processes in the trusted system can fail by
crashing or omitting messages. Additional types of failure
include the process or any of its incoming or outgoing
communication channels becoming asynchronous and/or
lossy. We will fullyformalize this faultybehavior inSection3.
2.4.4 Classes of Processes
Earlier work on failure detectors and partial synchrony [7],
[44] assumed a majority of correct processes in order to solve
consensus. However, as observed earlier we allow faulty
processes to participate in consensus provided that they
keep their ability to compute (no crash) and to communicate
without omissions with a majority of processes.
In order to circumvent some transient omissions, we
admit the possibility of indirect communication between two
processes. For example, if there are omissions in the
communication channel from a process p to another process
q, but both of them have no omissions with a third process r,
process p could indirectly communicate with q through r
without any omission. This way, a process will be considered
a well-connected process as long as it is able to communicate
with a majority of processes without any omission, even if it
has suffered some omissions. The set of well-connected
processes will be formalized in Section 3.
Furthermore, we should notice that the connectedness of a
process can be asymmetric, since it can suffer send omissions
and receive omissions independently, e.g., a process can be
able to send to a majority of processes, but not be able to
receive from a majority of processes (because it has had too
many receive omissions). Following this motivation, we
consider the following classes of processes, based on their
ability to communicate (we give only rough and intuitive
definitions here and fully formalize these notions later in
Section 3):
. A process is in-connected if it does not crash and it
receives all messages that some well-connected
process sends to it.
. A process is out-connected if it does not crash and all
messages it sends to some well-connected process
are received by that process.
Based on these definitions, well-connected processes are
both in-connected and out-connected. Observe that every
out-connected process can send information to any in-
connected process with no omissions. Fig. 2 shows an
example where arcs represent channels with no omissions.
The majority of well-connected processes corresponds to
the set x; y; w; r; v. Processes p and q are out-connected,
while process s is in-connected. Finally, process u is neither
in-connected nor out-connected.
614 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 2. Examples for classes of processes.
2.4.5 Connecting the Failure Assumptions
To connect the failure assumptions from trusted and
untrusted systems we make the following assumption: a
benign process in the untrusted system, i.e., a benign host,
implies that its associated process in the trusted system is
well-connected. Since there is a majority of benign hosts, the
previous assumption implies that there is a majority of well-
connected processes in the trusted system. Observe that a
nonbenign host does not necessarily imply a non well-
connected process in the trusted system.
2.5 Uniform Consensus
We are now ready to give the definition of consensus used in
this paper. Intuitively, in the consensus problem, every
process proposes a value, and correct processes must
eventually decide on some common value that has been
proposed. In the crash model, every correct process is required
to eventually decide some value. This is called the Termina-
tion property of consensus. The difference between (regular)
consensus and uniform consensus lies in the uniform
agreement property that demands that noncorrect processes
are not allowed to decide differently from correct processes.
In order to adapt consensus to the omission model, we
argue that only the Termination property has to be redefined.
This property now involves in-connected processes, since,
although they can experience some omissions, in-connected
processes will be able to receive the decision. The properties
of uniform consensus in the omission model are thus the
following:
. Termination. Every in-connected process eventually
decides some value.
. Integrity. Every process decides at most once.
. Uniform agreement. No two processes decide differ-
ently.
. Validity. If a process decides v, then v was proposed
by some process.
2.6 Solving Asynchronous SMC with TrustedPals
The original work on TrustedPals [2] reduced the problem
of SMC to that of Uniform Interactive Consistency (UIC) in
the trusted system. Roughly speaking, within the trusted
system, trusted processes exchange the inputs, then
compute the function F and synchronize before returning
the result back into the untrusted system. The implementa-
tion was based on an algorithm for uniform consensus by
Parve dy and Raynal [47]. We now argue that the
asynchronous variant of SMC can also be solved using
uniform consensus in TrustedPals. We give pseudocode in
form of a procedure in Fig. 3.
For every trusted process i, after receiving the input
value x
i
from its host, the trusted process sends x
i
to all
other processes in the trusted system. Thereafter, it waits for
the receipt of at least (n 1)=2| values from other trusted
processes and collects the pairs (j; x
j
) in a local set V
i
, where
x
j
is the value received from process j. Next, the set V
i
is
proposed to uniform consensus. Let V denote the decision
value of uniform consensus. From V , the process constructs
the vector of inputs for F (as used in the definition of SMC),
computes F on that vector and returns the result to its host.
We now argue that this algorithm implements SMC for a
majority of benign hosts if uniform consensus can be solved
in the trusted system. Recall that a majority of benign hosts
implies that a majority of processes in the trusted system
will be well-connected and thus in-connected.
First consider SMC-Termination. Although the system is
asynchronous, the distribution of input values to all other
processes will terminate since only a majority of values has
to be received and a majority of hosts are benign. Therefore,
all the trusted processes of benign hosts will eventually
propose a value to uniform consensus. From the Termina-
tion property of uniform consensus, every such process will
eventually decide and return the computed value to its host.
Now consider SMC-Validity. Since all security modules
on benign hosts enter uniform consensus, they all propose a
set V
i
containing a majority of values. From the Validity
property of uniform consensus, the decided value V on
every such process will also contain a majority of values. The
uniform agreement property of uniform consensus in turn
guarantees that any returned result will be computed on the
same vector of inputs. SMC-Agreement trivially follows
from the protocol and the uniform agreement property of
uniform consensus.
Proving SMC-Privacy is much more intricate because it
depends critically on how the trusted system (i.e., Trus-
tedPals) operates. Intuitively, the trusted system should
leak no information other than what is communicated at its
interface (i.e., the input of x
i
and the output of the result r).
This will be subject of Section 6 where we integrate all
algorithms in the trusted system such that we achieve a
form of unobservability [48], [49], a notion known from the
area of privacy-enhancing techniques that theoretically
closes all side channels through which confidential in-
formation may leak.
Fig. 4 summarizes the layers and interfaces of the
proposed modular architecture for TrustedPals. A message
exchange is performed in the transport layer, which is under
control of the untrusted host. The security mechanisms for
message encryption/decryption run in the layer termed
Security on the security module. In the failure detector
and consensus/SMC layers run the failure detection and
consensus/SMC algorithms, respectively. Finally, in the
application layer, which again is under the control of the
untrusted host, application software offering user interfaces
to consensus/SMC operate.
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 615
Fig. 3. Solving SMC using uniform consensus.
3 FORMALIZATION OF THE TRUSTED SYSTEM
As explained above, the reduction from SMC to Consensus
assumes that there exists an algorithm for Uniform Con-
sensus in the trusted system. Describing such an algorithm
will be the main subject of the remainder of this paper.
But before we describe the algorithm, we first formalize all
necessary concepts within the trusted system.
3.1 Processes and Channels
We model a distributed system as a set of n > 1 processes
= p
1
; p
2
; . . . ; p
n
which are connected through pairwise
bidirectional communication channels in a fully connected
topology. In the following, we will also use p, q, r, etc., to
denote processes. We denote the channel from p to q by c
pq
.
3.2 Algorithms and Events
An algorithm A consists of a set of deterministic automata,
one for each process. We give our algorithms in an event-
basednotationandthus assume that a local FIFOevent queue
is part of the local state of every process. Within an execution
step, a process takes an event from the queue, performs a
state transition according to the event, and then may send a
message or add a new event to the queue. Message arrivals
are treated as events too, i.e., when a message arrives, an
appropriate event is added to the queue. It is received by
the process when this event is processed. We assume that
every process which does not crash executes infinitely many
events.
3.3 Global Clock
We use a discrete global clock to simplify the presentation
of our model. However, no process has access to this clock;
it is merely a fictional device. For simplicity we take the
range T of the clock to be the set of natural numbers.
Steps (i.e., event executions) on processes are always
associated with a certain global time. We assume a linear
model of event execution, i.e., for every instance in time
there is at most one event in the system which is executed.
3.4 Process Failures
Processes can experience different kinds of failures: crash
failures, omission failures, and timing failures.
A crash failure set F
c
is a subset of processes.
Informally, F
c
contains all processes that will eventually
crash.
A send-omission failure set F
so
is a relation over
. Informally, (p; q) F
so
means that process p experiences
at least one send omission toward q. If (p; q) , F
so
then p
never experiences a send omission toward q.
Similarly, a receive-omission failure set F
ro
is a
relation over . Informally, (p; q) F
ro
means that process
q experiences at least one receive omission from p. So if
(p; q) , F
ro
then q never experiences a receive omission
from p.
Some processes may experience timing failures. Timing
failures refer to process asynchrony. We define an asyn-
chronous process failure set F
ap
as a subset of processes.
Intuitively, F
ap
contains all processes which are asynchro-
nous in the system. Processes which are not in F
ap
are
eventually synchronous [44] meaning that their processing
speed is eventually bounded. Formally, a process p is
synchronous if there exists a known bound such that the
time between the execution of any two steps of p is bounded
by . A process p is eventually synchronous if there exists a
time after which p is synchronous (additionally, can be
unknown). Note that this implies that the relative process
speeds between any pair of eventually synchronous
processes is bounded. In our system model, both and
the time after which holds are unknown.
A process failure set F = (F
c
; F
so
; F
ro
; F
ap
) is a tuple
consisting of a crash failure set, a send-omission failure
set, a receive-omission failure set, and an asynchronous
process failure set.
We define the set of correct processes to be the set of all
processes that neither crash nor experience any omission
nor are asynchronous. We denote this set with c. Formally
c = p : p , F
c
. p , F
ap
.
(\q : (p; q) , F
so
. (q; p) , F
ro
):
As we will see, we do not need to assume the existence of
a majority of correct processes. Moreover, the set of correct
processes could even be empty.
3.5 Send and Receive
Processes can send a message using the Send primitive. The
event Send(p; m; q; t) means that at time t process p sends m
to q. More precisely, m is inserted into the channel c
pq
unless
p experiences a send omission of m toward q. If c
pq
is a
reliable channel, for any message m inserted into c
pq
, the
channel guarantees that eventually an appropriate event is
added to the local event queue of process q. However, in
case of buffering attacks or channel asynchrony/loss, there
could be no time bound for the event in q to be added. If c
pq
is not reliable or q experiences a receive omission, then m is
removed from the channel without adding an appropriate
event to the event queue of q. When this event is processed,
we say that the message is received at time t, formalized as
the occurrence of the event Receive(p; m; q; t). We allow
processes to selectively wait for messages.
No particular order relations are defined in the reception
of messages. We assume that every message m from p to q is
tagged with a unique sequence number assigned by p.
616 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 4. The architecture of our system.
3.6 Channel Failures
Given a channel c
pq
, we say that a message m from p to q is
received timely if the receive event of m happens at most
clock ticks after the send event of m, being a time
bound. We say that the channel c
pq
is eventually timely if
there exists a point in time t such that all messages from p to
q sent and received after t are received timely. Formally
t T : bound : \m : Receive(p; m; q; t
r
) .
Send(p; m; q; t
s
) . t < t
s
< t
r
= (t
r
t
s
_ ):
Again, in our partially synchronous system model both
and the time after which holds are unknown.
We define an asynchronous/lossy channel failure set F
ac
as a relation over such that c
pq
F
ac
iff the
channel c
pq
is asynchronous and/or lossy. Note that if
c
pq
, F
ac
then c
pq
is eventually timely and reliable. Note also
that c
pq
, F
ac
does not necessarily imply that c
qp
, F
ac
.
Considering both process and channel failures, we now
define the relation p q to denote an eventually timely and
failure-free direct communication from p to q
p q = (p , F
c
. p , F
ap
) . (q , F
c
. q , F
ap
) .
(p; q) , F
so
. (p; q) , F
ro
. c
pq
, F
ac
:
Note that not necessarily p; q c.
3.7 Indirect Communication
Two processes p and q may communicate directly through
the channel c
pq
, or indirectly using message relaying
through some path of any length. When considering indirect
communication, messages from p can be received timely by
q despite the fact that the channel c
pq
is not eventually timely
and reliable. Based on this, we define the relation p
+
q to
denote an eventually timely and failure-free (direct or
indirect) communication from p to q. Informally, p
+
q
means that there is a path from p to q such that for every pair
of adjacent processes r and s along the path, r s is
satisfied. Formally
p
+
q = (p q) . (r : (p r . r
+
q)):
Clearly, the relation
+
is transitive.
3.8 Well-Connected Processes
Consider the graph composed of processes and
+
relations.
Since we assume a majority of benign hosts (as presented at
the end of Section 2.4), and every two benign hosts imply
two-way connected processes via
+
relations, there exists a
connected component in our system containing a majority
of processes. We denote by c
+
the set of processes in this
component. A process p is well-connected iff p c
+
.
Roughly speaking, every pair of processes in the set c
+
can communicate reliably, eventually timely, and without
omissions between them. Observe also that every correct
process is well-connected, i.e., c _ c
+
.
3.9 In-Connected and Out-Connected Processes
We now define the set of in-connected processes as follows:
1) Every process in c
+
is in-connected, and 2) a process p is
in-connected if there exists a process q c
+
for which q
+
p.
Formally
in-connected = p : p c
+
. q c
+
: q
+
p:
Similarly, the set of out-connected processes is defined as
follows: 1) Every process in c
+
is out-connected, and 2) a
process p is out-connected if there exits a process q c
+
for
which p
+
q. Formally
out-connected = p : p c
+
. q c
+
: p
+
q:
3.10 T(om) Failure Detector
The range of the T(om) failure detector class is 1 = 2