Nothing Special   »   [go: up one dir, main page]

Secure Failure Detection and Consensus in Trustedpals

Download as pdf or txt
Download as pdf or txt
You are on page 1of 16

Secure Failure Detection

and Consensus in TrustedPals


Roberto Cortin as, Felix C. Freiling, Marjan Ghajar-Azadanlou,
Alberto Lafuente, Mikel Larrea, Lucia Draque Penso, and Iratxe Soraluze
AbstractWe present a modular redesign of TrustedPals, a smart card-based security framework for solving Secure Multiparty
Computation (SMC). Originally, TrustedPals assumed a synchronous network setting and allowed to reduce SMC to the problem of
fault-tolerant consensus among smart cards. We explore how to make TrustedPals applicable in environments with less synchrony
and show how it can be used to solve asynchronous SMC. Within the redesign we investigate the problem of solving consensus in a
general omission failure model augmented with failure detectors. To this end, we give novel definitions of both consensus and the class
T of failure detectors in the omission model, which we call T(om), and show how to implement T(om) and have consensus in such
a system with very weak synchrony assumptions. The integration of failure detection and consensus into the TrustedPals framework
uses tools from privacy enhancing techniques such as message padding and dummy traffic.
Index TermsFailure detection, fault-tolerance, smart cards, consensus, secure multiparty computation, message padding, dummy
traffic, general omission model, security performance, reliability.

1 INTRODUCTION
C
ONSIDER a set of parties who wish to correctly compute
some common function F of their local inputs, while
keeping their local data as private as possible, but who do not
trust each other, nor the channels by which they commu-
nicate. This is the problem of Secure Multiparty Computation
(SMC) [1]. SMC is a very general security problem, i.e., it can
be used to solve various real-life problems such as
distributed voting, private bidding and online auctions,
sharing of signature or decryption functions and so on.
Unfortunately, solving SMC iswithout extra assump-
tionsvery expensive in terms of communication (number
of messages), resilience (amount of redundancy), and time
(number of synchronous rounds).
TrustedPals [2] is a smart card-based security framework
for solving SMC which allows much more efficient solutions
to the problem. Conceptually, TrustedPals considers a
distributed system in which processes are locally equipped
with tamper-proof security modules. In practice, processes
are implemented as a Java desktop application and security
modules are realized using Java Card Technology enabled
smart cards [3], tamper-proof Subscriber Identity Modules
(SIM) [4] like those used in mobile phones, or storage cards
with built-in tamper-proof processing devices [5]. Roughly
speaking, solving SMC among processes is achieved by
havingsecuritymodules jointlysimulate a Trusted Third Party
(TTP), as we now explain.
To solve SMC in the TrustedPals framework, the
function F is coded as a Java function and is distributed
within the network in an initial setup phase. Then, processes
hand their input value to their security module and the
framework accomplishes the secure distribution of the input
values. Finally, all security modules compute F and return
the result to their process. The network of security modules
sets up confidential and authenticated channels between
each other and operates as a secure overlay within the
distribution phase. Roughly speaking, within this secure
overlay arbitrary and malicious behavior of an attacker is
reduced to process crashes and message omissions. Trus-
tedPals therefore allows to reduce the security problem of
SMC to a problem of fault-tolerant synchronization [2], an
area which has a long research tradition in fault-tolerant
distributed computing (see, for example, Lynch [6]). How-
ever, solving the synchronizationproblemalone is not trivial,
especially since we investigate it under message omission
failures, a failure scenario which is rather unusual. Further-
more, the reduction from security to fault-tolerance creates a
new set of validation obligations regarding the integration of
a fault-tolerant algorithm into a secure system, which is also
far from being trivial.
The initial definition of TrustedPals and its implementa-
tion assumed a synchronous network setting, i.e., a setting in
which all important timing parameters of the network are
bounded and known. This makes TrustedPals sensitive to
unforeseen variations in network delay and therefore not
very suitable for deployment in networks like the Internet.
610 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
. R. Cortinas, A. Lafuente, M. Larrea, and I. Soraluze are with the Facultad
de Informatica, University of the Basque Country UPV/EHU, Paseo
Manuel de Lardizabal 1, E-20018 Donostia, Spain.
E-mail: {roberto.cortinas, alberto.lafuente, mikel.larrea,
iratxe.soraluze}@ehu.es.
. F.C. Freiling is with the Friedrich-Alexander-University Erlangen-
Nurnberg, Lehrstuhl fur Informatik 1 (IT-Sicherheitsinfrastrukturen),
Am Wolfmantel 46, 91058 Erlangen, Germany.
E-mail: felix.freiling@cs.fau.de.
. M. Ghajar-Azadanlou is with the Bayer Business Services, 51368
Leverkusen, Germany.
E-mail: Marjan.Ghajar-Azadanlou@rwth-aachen.de.
. L.D. Penso is with the Faculty of Mathematics and Economics, Institute for
Optimization and Operations Research, University of Ulm, 89069 Ulm,
Germany. E-mail: lucia.penso@uni-ulm.de.
Manuscript received 2 July 2010; revised 23 June 2011; accepted 9 Feb. 2012;
published online 17 Feb. 2012.
For information on obtaining reprints of this article, please send e-mail to:
tdsc@computer.org, and reference IEEECS Log Number TDSC-2010-07-0109.
Digital Object Identifier no. 10.1109/TDSC.2012.23.
1545-5971/12/$31.00 2012 IEEE Published by the IEEE Computer Society
In this paper, we explore how to make TrustedPals
applicable in environments with less synchrony. More
precisely, we show how to solve the asynchronous version
of SMC using asynchronous synchronization algorithms
inspired by recent results in fault-tolerant distributed
computing: we use an asynchronous consensus algorithm
and encapsulate (some very weak) timing assumptions
within a device known as a failure detector [7].
The concept of a failure detector has been investigated in
quite some detail in systems with merely crash faults [8]. In
such systems, correct processes (i.e., processes which do not
crash) must eventually permanently suspect crashed pro-
cesses. There is very little work on failure detection and
consensus in message omission environments. In fact, it is
not clear what a sensible definition of a failure detector (and
consensus) is in such environments because the notion of a
correct process can have several different meanings (e.g., a
process with no failures whatsoever or a process which
does not crash but just omits some messages). In this work,
instead of correct processes, we will consider well-connected
processes, i.e., those processes which are able to compute
and communicate without omissions with a majority of
processes.
1.1 Related Work
Our work on TrustedPals can be regarded as building
failure detectors for arbitrary (Byzantine) failures, which has
been investigated previously (see, for example, Malkhi and
Reiter [9], Kihlstrom et al. [10], Doudou et al. [11], Doudou
et al. [12], Haeberlen et al. [13], [14], and more recently
Haeberlen and Kuznetsov [15]). In contrast to previous
works on Byzantine failure detectors, we use security
modules to avoid the tar pits of this area. This contrasts
TrustedPals to the large body of work that tackles Byzantine
faults directly, like Castro and Liskovs Practical Byzantine
Fault-Tolerance [16] or more recently the Aardvark system
by Clement et al. [17] and the Turquois protocol by Moniz
et al. [18]. While being conceptually simpler, Byzantine-
tolerant protocols necessarily have to assume a two-thirds
majority of correct processes in nonsynchronous settings
[19] while TrustedPals needs only a simple majority (due to
the availability of the secure overlay network). Next to an
improved resilience, TrustedPals by design can provide
secrecy of data against attackers, a notion that can only be
achieved in Byzantine-tolerant algorithms by applying
complex secret sharing mechanisms [20]. All these advan-
tages result from using security modules to constrain the
attacker in such a way that Byzantine faults are reduced to
general omission faults.
Delporte-Gallet et al. [21] were the first to investigate
nonsynchronous settings in the TrustedPals context, always
with the (implicit) motivation to make TrustedPals more
practical. Following the approach of Chandra and Toueg [7]
(and similar to this paper) they separate the trusted system
into an asynchronous consensus layer and a partially
synchronous failure detection layer. The main difference
however is that they assume that transient omissions are
masked by a piggybacking scheme while we detect transient
omissions and, therefore, we do not need to piggyback all
message history. Besides, they solve a different version of
consensus than we do: roughly speaking, message omissions
can cause processes to only be able to communicate indirectly
and we admit processes to participate in consensus even if
they cannot communicate directly. Delporte-Gallet et al. [21]
only guarantee that all processes that can communicate
directly with eachother solve consensus. Incontrast, we allow
also another set of processes to propose and decide: those
which are able to send and receive messages even indirectly.
As a minor difference, we focus on the class T of eventually
perfect failure detectors whereas they [21] implement the
failure detector. Furthermore, they [21] do not describe
how to integrate failure detection and consensus within the
TrustedPals framework: a realistic adversary who is able to
selectively influence the communication messages of the
algorithms for failure detectionandconsensus cancause their
consensus algorithmto fail. This problemis partly addressed
in a recent paper [22] where consensus and failure detection
are integrated for efficiency purposes, not for security.
Apart from Delporte-Gallet et al. [21], other authors also
investigated solving consensus in systems with omission
faults. Work by Dolev et al. [23], [24] also follows the failure
detector approach to solve consensus, however they focus
on the class o(om) of failure detectors. Babaoglu et al. [25]
also follow the path of o to solve consensus in partition-
able systems. Alternatively, Santoro and Widmayer [26]
assume a synchronous system model, and Moniz et al. [27]
use randomization.
Recently, solving SMC without security modules has
received some attention focusing mainly on two-party
protocols [28], [29], [30], [31], [32], [33]. In systems with
security modules, Avoine and Vaudenay [34] examined the
approach of jointly simulating a TTP. This approach was
later extended by Avoine et al. [35] who show that in a
system with security modules fair exchange can be reduced
to a special form of consensus. They derive a solution to fair
exchange in a modular way so that the agreement abstrac-
tion can be implemented in diverse manners. Benenson et al.
[2] extended this idea to the general problem of SMC and
showed that the use of security modules cannot improve the
resilience of SMC but enables more efficient solutions for
SMC problems. All these papers assume a synchronous
network model.
Ben-Or et al. [36] were the first to investigate solutions to
the asynchronous variant of SMC which is slightly more
involved than its synchronous counterpart because the
failure to provide input to F by some party cannot be
attributed solely to that partyit can also be due to
unpredictable delays in the network. This has consequences
regarding the resilience of SMC. While it is possible to solve
SMC in the synchronous setting with a simple majority of
benign processes [37], [38], in asynchronous settings a two-
thirds majority is necessary [39].
Correia et al. [40] present a system which employs a
real-time distributed security kernel to solve SMC. The
architecture is very similar to that of TrustedPals as it also
uses the notion of architectural hybridization [41]. How-
ever, the adversary model of Correia et al. [40] assumes
that the attacker only has remote access to the system
while TrustedPals allows the owner of a security module
to be the attacker. Like other previous works [2], [34], [35],
Correia et al. [40] also assume a synchronous network
model at least in a part of the system.
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 611
1.2 Contributions
In this paper, we present a modular redesign of TrustedPals
using consensus and failure detection as modules. More
specifically, we make the following technical contributions:
. We show how to solve asynchronous Secure Multi-
party Computation by implementing TrustedPals in
asynchronous systems with a (weak) failure detec-
tor. We do this by reducing the problem of SMC to
the problem of uniform consensus in omission
failure environments. As a corollary we show that
in systems with security modules and weak timing
assumptions the resilience of asynchronous SMC can
be improved from a two-thirds majority to a simple
majority of benign processes.
. We propose a new definition of connectedness in
omission environments. Informally, a process is in-
connected if it does not crash and, despite omissions,
receives either directly or indirectly all messages that
a majority of processes sends to it. Similarly, a
process is out-connected if it does not crash and all
messages it sends are received by a majority of
processes. We also consider well-connected processes,
which are those processes that are both in-connected
and out-connected.
. We give a novel definition of consensus in the new
omission model, by refining the termination prop-
erty of consensus (Every in-connected process
eventually decides some value), and an algorithm
which uses the failure detector class T(om) to
solve consensus. That algorithm is an adaptation of
the classic algorithm by Chandra and Toueg for the
crash model.
. We give a novel definition of T in the omission
model, T(om), and we show how to implement it in
a system with weak synchrony assumptions in the
spirit of partial synchrony.
. We integrate failure detection and consensus se-
curely in TrustedPals by employing message pad-
ding and dummy traffic, tools known from the area
of privacy enhancing techniques.
1.3 Paper Outline
This paper is structured as follows: In Section 2 we give an
overview over TrustedPals, its architecture and the motiva-
tion behind the definitions and model and show how
asynchronous SMC can be reduced to uniform consensus.
In Section 3 we fully formalize the system model of
TrustedPals. In Section 4 we show how to solve consensus
using an abstract failure detector of the class T(om). In
Section 5 we show how to implement the failure detector
T(om) in the omission failure model under very weak
synchrony assumptions. In Section 6 we describe how to
integrate failure detection and consensus securely in the
TrustedPals framework. We conclude in Section 7.
2 TRUSTEDPALS IN WEAKLY SYNCHRONOUS
SYSTEMS
We now present an overview over the TrustedPals system
architecture and show how the reduction from SMC to
consensus can be done in nonsynchronous systems.
Additionally, this section is meant as an informal introduc-
tion giving a high level view of the definitions used in the
remainder of the paper. All necessary notions are fully
formalized later in Section 3.
2.1 Untrusted and Trusted System
We formalize the system assumptions within a hybrid
model, i.e., the model is divided into two parts (see Fig. 1).
The lower part consists of n processes which represent the
untrusted hosts. The upper part equally consists of n processes
which represent the security modules. Due to the lack of
mutual trust among untrusted hosts, we call the former part
the untrusted system. Since the security modules trust each
other we call the latter part the trusted system.
The processes in the untrusted system (i.e., the hosts)
execute (possibly untrustworthy) user applications like
e-banking or e-voting programs. Because of the untrust-
worthy nature of these processes, they use the trusted
system as a subsystem to solve the involved security
problems. The trusted system consists of software running
inside the security modules. This software must have been
certified by some accepted authority. It is not possible for the
user to install arbitrary software on the trusted system. The
tamper-proof nature of the trusted processes allows to
protect stored and transmitted information even from the
untrusted processes on which they reside. The authority can
be an independent service provider (like a network
operator) and is only necessary within the bootstrap phase
of the system, not during any operational phases (like
running the SMC algorithms).
Formally, the connection between the untrusted and
trusted system is achieved by associating each process in
the untrusted system (i.e., each host) with exactly one
process in the trusted system (i.e., a security module) and
vice versa. Hence, every untrusted process has a trusted
pal (an associated trusted process). Since host and security
module reside on the same physical machine, we assume
that for each association there is a bidirectional eventually
timely and secure communication channel, e.g., implemen-
ted by shared variables or message passing communication
in the host operating system.
612 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 1. The untrusted and trusted system.
For better readability, we sometimes refer to untrusted
processes simply as hosts and to trusted processes simply as
security modules. We refer to TrustedPals as the collection of
trusted processes together with the services they provide to
the untrusted processes.
2.2 Defining Asynchronous SMC with TrustedPals
As mentioned in the introduction, SMC allows the set of
hosts to correctly compute some common function F of
their local values. While the original work on TrustedPals
[2] used the synchronous definition of SMC given by
Goldreich [42], we follow the definition of asynchronous
SMC given by Ben-Or et al. [36]. The difference between the
synchronous and asynchronous form of SMC lies in the fact
that F is not computed on all inputs but rather on a subset
of inputs of size at least n f, where f is an upper bound
on the number of faulty (i.e., nonbenign) processes. A
process is benign if, besides not crashing, it follows its
protocol (i.e., it is non-Byzantine).
More formally, let x
1
; . . . ; x
n
be the private inputs of each
host. In asynchronous SMC, the result r is computed on the
inputs from a subset C of hosts of size at least n f, i.e.,
r = F(y
1
; . . . ; y
n
) where y
i
= x
i
if host i is in C and some
default value (e.g., y
i
= 0) otherwise. The result r should be
computed reliably and securely, i.e., as if all hosts were
using a TTP. This means that the individual inputs remain
secret to other hosts (apart from what is given away by r)
and that malicious hosts can neither prevent the computa-
tion from taking place nor influence r in favorable ways.
In this paper, we use the following definition of
(asynchronous) SMC: A protocol solves secure multiparty
computation with TrustedPals if it satisfies the following
properties [36], [42]:
. SMC-Validity. If a host receives a result, then that
result was computed by applying F on the inputs
from a subset C of hosts of size at least n f, i.e.,
r = F(y
1
; . . . ; y
n
) where y
i
= x
i
if host i is in C and
some default value (e.g., y
i
= 0) otherwise. The set C
is the same for all hosts that receive a result.
. SMC-Agreement. No two values returned by security
modules differ.
. SMC-Termination. Every benign host eventually
receives a result from its associated security module.
. SMC-Privacy. Faulty hosts learn nothing about the
input values of benign hosts (apart from what is
given away by the result r and the input values of all
faulty hosts).
Recall that these properties abstractly specify what
happens when a TTP is used to solve the problem [36], [42].
2.3 Untrusted System: Assumptions
Within the untrusted system each pair of hosts is connected
by a pair of unidirectional communication channels, one in
each direction. We assume that there is a minimal set of
reliable channels in the system. The rest of the channels can
be lossy. Recall that every message sent through a reliable
channel is eventually delivered at the destination. We
assume no particular ordering relation on channels.
2.3.1 Failure Model
The process failure model we assume in the untrusted
system is the Byzantine failure model [43]. A Byzantine
process can behave arbitrarily. We will assume a majority of
benign processes in the untrusted system.
2.3.2 Timing
We assume that a local real-time clock is available to each
process in the untrusted system, but clocks are not
necessarily synchronized within the network. The untrusted
system is assumed to be partially synchronous, meaning that
eventually unknown bounds on processing and commu-
nication delays hold for the majority of benign processes.
The model is a variant of the partial synchrony models of
Dwork et al. [44]. The differences are that the bounds must
hold for just a majority of processes, and that we assume a
set of reliable, eventually timely channels connecting those
processes.
2.4 Trusted System: Assumptions
The trusted system can be considered as an overlay
networka network that is built on top of another
networkover the untrusted system. Nodes in the overlay
network can be thought of as being connected by virtual or
logical channels. In practice, for example, smart cards could
form the overlay network which runs on top of the Internet
modeled by the untrusted processes. In the trusted system,
each process has also an outgoing and an incoming
communication channel with every other process.
2.4.1 Trust Model
Within the trusted system we assume that any two commu-
nicating parties can establish mutual message confidenti-
ality, message integrity, and message authentication. This
can be realized, for example, by exchanging cryptographic
keys during a setup phase of the system. As mentioned
above, we assume that the code running within the trusted
system has been certified by some trusted authority, i.e.,
nodes in the trusted system may assume that each others
programs have not been tampered with. The trusted
authority acts only during the setup phase of the system,
not during the operational phase.
2.4.2 Timing
Security modules do not have any clock, they have just a
simple step counter, whereby a step consists of possibly
consuming a message, executing a local computation and
possibly sending a message. Passing of time is checked by
counting the number of steps executed. Roughly speaking,
the timing assumptions for the processes in the trusted
system are the same as those of the untrusted system, i.e., we
assume partial synchrony. However, as we will explain next,
in case the trusted process is associated with an untrusted
process which is faulty/malicious, the trusted process may
not rely on any timing assumptions whatsoever.
2.4.3 Failure Model
Like the untrusted system, the trusted system is also prone
to attacks. However, the assumptions on the security
modules and the possibility to establish secure channels
reduce the options of the malicious hosts to attacks on the
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 613
liveness of the system, i.e., 1) destruction of the security
module, 2) interception of messages between the channel
and the security module, or 3) changes in the frequency
of the step counter. This way, in the trusted system we
assume the failure model of general omission and some
asynchrony that can only affect those trusted processes
associated with faulty processes in the untrusted system
(i.e., security modules residing on Byzantine hosts).
The concept of omission faults, meaning that a process
drops a message either while sending (send omission) or
while receiving it (receive omission), was introduced by
Hadzilacos [45] and later generalized by Perry and Toueg
[46]. In the general omission model processes can fail by
crashing or experience either send omissions or receive
omissions. In our system we allow the possibility of transient
omissions, i.e., a process may temporarily drop messages
and later on reliably deliver messages again. Of course,
permanent omissions are possible too.
Besides omissions, trusted processes can become arbi-
trarily slow (asynchronous) although the physical system in
which they operate (i.e., the untrusted system) is partially
synchronous. This models the effect of two different types
of attacks by malicious hosts which we now explain:
. Timing attacks. Recall that security modules use a step
counter to be aware of passing of time. The speed of
the step counter is controlled by the associated host.
In a timing attack, a malicious host arbitrarily changes
the speed of the step counter of its security module. In
that way, it can make the security module work faster
(although the speed is physically bounded by the
hosts own clock) or slower (not bounded). As a
consequence, the behavior of its security module
could become asynchronous and thus the commu-
nication through all the virtual channels which are
adjacent to that security module would become
asynchronous as well.
. Buffering attacks. As we have previously pointed out,
a malicious host can intercept messages between its
security module and the communication channels.
Removal of an intercepted message is modeled as a
message omission. In a buffering attack, the host does
not remove the messages but stores them in a buffer
and later on injects them into the communication
channel after an arbitrary delay. This means that the
message is not omitted but communication through
that channel may become asynchronous in the
trusted system.
Observe that both attacks affect the timing behavior of the
attacked security module. However, buffering attacks are
more selective, since a particular communication channel of
a security module could be attacked (become asynchronous
in the trusted system) without affecting the rest of the
communication channels of that security module.
In addition to the previous types of attacks, message
reordering attacks are treated as a particular case of
message buffering attacks, since in our algorithms no
message is delivered if it is not the expected one (messages
carry a unique sequence number). Also, note that the case of
buffer overflow in the smart card (e.g., if the attacker buffers
a lot of messages and then passes all those messages at the
same time to the smart card) can be naturally treated as if
the smart card omitted the reception of some message(s).
To summarize, processes in the trusted system can fail by
crashing or omitting messages. Additional types of failure
include the process or any of its incoming or outgoing
communication channels becoming asynchronous and/or
lossy. We will fullyformalize this faultybehavior inSection3.
2.4.4 Classes of Processes
Earlier work on failure detectors and partial synchrony [7],
[44] assumed a majority of correct processes in order to solve
consensus. However, as observed earlier we allow faulty
processes to participate in consensus provided that they
keep their ability to compute (no crash) and to communicate
without omissions with a majority of processes.
In order to circumvent some transient omissions, we
admit the possibility of indirect communication between two
processes. For example, if there are omissions in the
communication channel from a process p to another process
q, but both of them have no omissions with a third process r,
process p could indirectly communicate with q through r
without any omission. This way, a process will be considered
a well-connected process as long as it is able to communicate
with a majority of processes without any omission, even if it
has suffered some omissions. The set of well-connected
processes will be formalized in Section 3.
Furthermore, we should notice that the connectedness of a
process can be asymmetric, since it can suffer send omissions
and receive omissions independently, e.g., a process can be
able to send to a majority of processes, but not be able to
receive from a majority of processes (because it has had too
many receive omissions). Following this motivation, we
consider the following classes of processes, based on their
ability to communicate (we give only rough and intuitive
definitions here and fully formalize these notions later in
Section 3):
. A process is in-connected if it does not crash and it
receives all messages that some well-connected
process sends to it.
. A process is out-connected if it does not crash and all
messages it sends to some well-connected process
are received by that process.
Based on these definitions, well-connected processes are
both in-connected and out-connected. Observe that every
out-connected process can send information to any in-
connected process with no omissions. Fig. 2 shows an
example where arcs represent channels with no omissions.
The majority of well-connected processes corresponds to
the set x; y; w; r; v. Processes p and q are out-connected,
while process s is in-connected. Finally, process u is neither
in-connected nor out-connected.
614 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 2. Examples for classes of processes.
2.4.5 Connecting the Failure Assumptions
To connect the failure assumptions from trusted and
untrusted systems we make the following assumption: a
benign process in the untrusted system, i.e., a benign host,
implies that its associated process in the trusted system is
well-connected. Since there is a majority of benign hosts, the
previous assumption implies that there is a majority of well-
connected processes in the trusted system. Observe that a
nonbenign host does not necessarily imply a non well-
connected process in the trusted system.
2.5 Uniform Consensus
We are now ready to give the definition of consensus used in
this paper. Intuitively, in the consensus problem, every
process proposes a value, and correct processes must
eventually decide on some common value that has been
proposed. In the crash model, every correct process is required
to eventually decide some value. This is called the Termina-
tion property of consensus. The difference between (regular)
consensus and uniform consensus lies in the uniform
agreement property that demands that noncorrect processes
are not allowed to decide differently from correct processes.
In order to adapt consensus to the omission model, we
argue that only the Termination property has to be redefined.
This property now involves in-connected processes, since,
although they can experience some omissions, in-connected
processes will be able to receive the decision. The properties
of uniform consensus in the omission model are thus the
following:
. Termination. Every in-connected process eventually
decides some value.
. Integrity. Every process decides at most once.
. Uniform agreement. No two processes decide differ-
ently.
. Validity. If a process decides v, then v was proposed
by some process.
2.6 Solving Asynchronous SMC with TrustedPals
The original work on TrustedPals [2] reduced the problem
of SMC to that of Uniform Interactive Consistency (UIC) in
the trusted system. Roughly speaking, within the trusted
system, trusted processes exchange the inputs, then
compute the function F and synchronize before returning
the result back into the untrusted system. The implementa-
tion was based on an algorithm for uniform consensus by
Parve dy and Raynal [47]. We now argue that the
asynchronous variant of SMC can also be solved using
uniform consensus in TrustedPals. We give pseudocode in
form of a procedure in Fig. 3.
For every trusted process i, after receiving the input
value x
i
from its host, the trusted process sends x
i
to all
other processes in the trusted system. Thereafter, it waits for
the receipt of at least (n 1)=2| values from other trusted
processes and collects the pairs (j; x
j
) in a local set V
i
, where
x
j
is the value received from process j. Next, the set V
i
is
proposed to uniform consensus. Let V denote the decision
value of uniform consensus. From V , the process constructs
the vector of inputs for F (as used in the definition of SMC),
computes F on that vector and returns the result to its host.
We now argue that this algorithm implements SMC for a
majority of benign hosts if uniform consensus can be solved
in the trusted system. Recall that a majority of benign hosts
implies that a majority of processes in the trusted system
will be well-connected and thus in-connected.
First consider SMC-Termination. Although the system is
asynchronous, the distribution of input values to all other
processes will terminate since only a majority of values has
to be received and a majority of hosts are benign. Therefore,
all the trusted processes of benign hosts will eventually
propose a value to uniform consensus. From the Termina-
tion property of uniform consensus, every such process will
eventually decide and return the computed value to its host.
Now consider SMC-Validity. Since all security modules
on benign hosts enter uniform consensus, they all propose a
set V
i
containing a majority of values. From the Validity
property of uniform consensus, the decided value V on
every such process will also contain a majority of values. The
uniform agreement property of uniform consensus in turn
guarantees that any returned result will be computed on the
same vector of inputs. SMC-Agreement trivially follows
from the protocol and the uniform agreement property of
uniform consensus.
Proving SMC-Privacy is much more intricate because it
depends critically on how the trusted system (i.e., Trus-
tedPals) operates. Intuitively, the trusted system should
leak no information other than what is communicated at its
interface (i.e., the input of x
i
and the output of the result r).
This will be subject of Section 6 where we integrate all
algorithms in the trusted system such that we achieve a
form of unobservability [48], [49], a notion known from the
area of privacy-enhancing techniques that theoretically
closes all side channels through which confidential in-
formation may leak.
Fig. 4 summarizes the layers and interfaces of the
proposed modular architecture for TrustedPals. A message
exchange is performed in the transport layer, which is under
control of the untrusted host. The security mechanisms for
message encryption/decryption run in the layer termed
Security on the security module. In the failure detector
and consensus/SMC layers run the failure detection and
consensus/SMC algorithms, respectively. Finally, in the
application layer, which again is under the control of the
untrusted host, application software offering user interfaces
to consensus/SMC operate.
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 615
Fig. 3. Solving SMC using uniform consensus.
3 FORMALIZATION OF THE TRUSTED SYSTEM
As explained above, the reduction from SMC to Consensus
assumes that there exists an algorithm for Uniform Con-
sensus in the trusted system. Describing such an algorithm
will be the main subject of the remainder of this paper.
But before we describe the algorithm, we first formalize all
necessary concepts within the trusted system.
3.1 Processes and Channels
We model a distributed system as a set of n > 1 processes
= p
1
; p
2
; . . . ; p
n
which are connected through pairwise
bidirectional communication channels in a fully connected
topology. In the following, we will also use p, q, r, etc., to
denote processes. We denote the channel from p to q by c
pq
.
3.2 Algorithms and Events
An algorithm A consists of a set of deterministic automata,
one for each process. We give our algorithms in an event-
basednotationandthus assume that a local FIFOevent queue
is part of the local state of every process. Within an execution
step, a process takes an event from the queue, performs a
state transition according to the event, and then may send a
message or add a new event to the queue. Message arrivals
are treated as events too, i.e., when a message arrives, an
appropriate event is added to the queue. It is received by
the process when this event is processed. We assume that
every process which does not crash executes infinitely many
events.
3.3 Global Clock
We use a discrete global clock to simplify the presentation
of our model. However, no process has access to this clock;
it is merely a fictional device. For simplicity we take the
range T of the clock to be the set of natural numbers.
Steps (i.e., event executions) on processes are always
associated with a certain global time. We assume a linear
model of event execution, i.e., for every instance in time
there is at most one event in the system which is executed.
3.4 Process Failures
Processes can experience different kinds of failures: crash
failures, omission failures, and timing failures.
A crash failure set F
c
is a subset of processes.
Informally, F
c
contains all processes that will eventually
crash.
A send-omission failure set F
so
is a relation over
. Informally, (p; q) F
so
means that process p experiences
at least one send omission toward q. If (p; q) , F
so
then p
never experiences a send omission toward q.
Similarly, a receive-omission failure set F
ro
is a
relation over . Informally, (p; q) F
ro
means that process
q experiences at least one receive omission from p. So if
(p; q) , F
ro
then q never experiences a receive omission
from p.
Some processes may experience timing failures. Timing
failures refer to process asynchrony. We define an asyn-
chronous process failure set F
ap
as a subset of processes.
Intuitively, F
ap
contains all processes which are asynchro-
nous in the system. Processes which are not in F
ap
are
eventually synchronous [44] meaning that their processing
speed is eventually bounded. Formally, a process p is
synchronous if there exists a known bound such that the
time between the execution of any two steps of p is bounded
by . A process p is eventually synchronous if there exists a
time after which p is synchronous (additionally, can be
unknown). Note that this implies that the relative process
speeds between any pair of eventually synchronous
processes is bounded. In our system model, both and
the time after which holds are unknown.
A process failure set F = (F
c
; F
so
; F
ro
; F
ap
) is a tuple
consisting of a crash failure set, a send-omission failure
set, a receive-omission failure set, and an asynchronous
process failure set.
We define the set of correct processes to be the set of all
processes that neither crash nor experience any omission
nor are asynchronous. We denote this set with c. Formally
c = p : p , F
c
. p , F
ap
.
(\q : (p; q) , F
so
. (q; p) , F
ro
):
As we will see, we do not need to assume the existence of
a majority of correct processes. Moreover, the set of correct
processes could even be empty.
3.5 Send and Receive
Processes can send a message using the Send primitive. The
event Send(p; m; q; t) means that at time t process p sends m
to q. More precisely, m is inserted into the channel c
pq
unless
p experiences a send omission of m toward q. If c
pq
is a
reliable channel, for any message m inserted into c
pq
, the
channel guarantees that eventually an appropriate event is
added to the local event queue of process q. However, in
case of buffering attacks or channel asynchrony/loss, there
could be no time bound for the event in q to be added. If c
pq
is not reliable or q experiences a receive omission, then m is
removed from the channel without adding an appropriate
event to the event queue of q. When this event is processed,
we say that the message is received at time t, formalized as
the occurrence of the event Receive(p; m; q; t). We allow
processes to selectively wait for messages.
No particular order relations are defined in the reception
of messages. We assume that every message m from p to q is
tagged with a unique sequence number assigned by p.
616 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 4. The architecture of our system.
3.6 Channel Failures
Given a channel c
pq
, we say that a message m from p to q is
received timely if the receive event of m happens at most
clock ticks after the send event of m, being a time
bound. We say that the channel c
pq
is eventually timely if
there exists a point in time t such that all messages from p to
q sent and received after t are received timely. Formally
t T : bound : \m : Receive(p; m; q; t
r
) .
Send(p; m; q; t
s
) . t < t
s
< t
r
= (t
r
t
s
_ ):
Again, in our partially synchronous system model both
and the time after which holds are unknown.
We define an asynchronous/lossy channel failure set F
ac

as a relation over such that c
pq
F
ac
iff the
channel c
pq
is asynchronous and/or lossy. Note that if
c
pq
, F
ac
then c
pq
is eventually timely and reliable. Note also
that c
pq
, F
ac
does not necessarily imply that c
qp
, F
ac
.
Considering both process and channel failures, we now
define the relation p q to denote an eventually timely and
failure-free direct communication from p to q
p q = (p , F
c
. p , F
ap
) . (q , F
c
. q , F
ap
) .
(p; q) , F
so
. (p; q) , F
ro
. c
pq
, F
ac
:
Note that not necessarily p; q c.
3.7 Indirect Communication
Two processes p and q may communicate directly through
the channel c
pq
, or indirectly using message relaying
through some path of any length. When considering indirect
communication, messages from p can be received timely by
q despite the fact that the channel c
pq
is not eventually timely
and reliable. Based on this, we define the relation p
+
q to
denote an eventually timely and failure-free (direct or
indirect) communication from p to q. Informally, p
+
q
means that there is a path from p to q such that for every pair
of adjacent processes r and s along the path, r s is
satisfied. Formally
p
+
q = (p q) . (r : (p r . r
+
q)):
Clearly, the relation
+
is transitive.
3.8 Well-Connected Processes
Consider the graph composed of processes and
+
relations.
Since we assume a majority of benign hosts (as presented at
the end of Section 2.4), and every two benign hosts imply
two-way connected processes via
+
relations, there exists a
connected component in our system containing a majority
of processes. We denote by c
+
the set of processes in this
component. A process p is well-connected iff p c
+
.
Roughly speaking, every pair of processes in the set c
+
can communicate reliably, eventually timely, and without
omissions between them. Observe also that every correct
process is well-connected, i.e., c _ c
+
.
3.9 In-Connected and Out-Connected Processes
We now define the set of in-connected processes as follows:
1) Every process in c
+
is in-connected, and 2) a process p is
in-connected if there exists a process q c
+
for which q
+
p.
Formally
in-connected = p : p c
+
. q c
+
: q
+
p:
Similarly, the set of out-connected processes is defined as
follows: 1) Every process in c
+
is out-connected, and 2) a
process p is out-connected if there exits a process q c
+
for
which p
+
q. Formally
out-connected = p : p c
+
. q c
+
: p
+
q:
3.10 T(om) Failure Detector
The range of the T(om) failure detector class is 1 = 2

TRUE; FALSE. Informally, it consists of a set of out-


connected processes as well as a Boolean flag indicating
whether the process considers itself as in-connected. Observe
that such a failure detector is a trust based one instead of a
suspicion-based one.
A T(om) failure detector history H is a function
H : T 1. For example, H(p; t) = (p; q; r; FALSE)
means that at time t process p considers processes p, q, and r
as out-connected and does not consider itself as in-
connected. We denote by H
1
and H
2
the projection of the
tuple returned by H to its first and second element, i.e.,
H
1
(p; t) = p; q; r and H
2
(p; t) = FALSE, respectively, in
the previous example.
We formally define the class of eventually perfect failure
detectors for the omission model, T(om), as the set of all
failure detectors satisfying the following three properties:
. In-connectedness. Eventually, every process will per-
manently consider itself as in-connected iff it is in-
connected. Formally
\F : \p : p in-connected =
t
1
T : \t
2
T : t
2
> t
1
: H
2
(p; t
2
) = TRUE:
. Strong completeness. Every process that is not out-
connected will not be permanently considered as out-
connected by any in-connected process. Formally
\F : \p in-connected : \q , out-connected :
\t
1
T : t
2
T : t
2
> t
1
: q , H
1
(p; t
2
):
. Eventual strong accuracy. Eventually every process
that is out-connected will be permanently consid-
ered as out-connected by every in-connected pro-
cess. Formally
\F : \p in-connected : \q out-connected :
t
1
T : \t
2
T : t
2
> t
1
: q H
1
(p; t
2
):
This specification looks for in-connected processes to agree
on the set of out-connected processes. To this end, every
process must determine whether it is in-connected (In-
connectedness). Due to timing/buffering attacks or channel
asynchrony, some processes may temporarily be consid-
ered as if they were out-connected without really being it,
so it is not possible for in-connected processes to agree on
the exact set of out-connected processes. Alternatively,
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 617
Eventual Strong Accuracy ensures that out-connected
processes will eventually be trusted forever. Finally, Strong
Completeness avoids trusting forever a process that is not
out-connected.
Our definition of an eventually perfect failure detector
results naturally, i.e., at no additional cost, from the uniform
(periodical, all-to-all) communication pattern used in our
failure detection algorithm. We use this communication
pattern in order to avoid side channel attacks (see Section 6)
that otherwise an adversary could try in Byzantine
environments. Said this, we guess that in a non-Byzantine
omissive setting a failure detector of type o or could be
strong enough to solve consensus, as it is the case in the
crash failure model.
4 T(om)-BASED CONSENSUS IN THE TRUSTED
SYSTEM
We now focus our attention on the consensus layer of the
TrustedPals architecture (see Fig. 4), in which we implement
uniform consensus (as defined in Section 2.5) using the
failure detector class T(om) defined in the previous section.
Note that we fully operate within the trusted system, i.e., the
consensus algorithm itself is asynchronous, meaning that it
tolerates arbitrary phases of asynchrony. We first give the
consensus algorithm and then prove its correctness.
4.1 Consensus Algorithm
Figs. 5 and 6 present an algorithm solving consensus in
the omission model using T(om). It is an adaptation of
the well-known o-based Chandra-Toueg consensus algo-
rithm [7] (which also works with a T failure detector).
The use of T(om) by every process p is modeled by
means of the following two variables: a Boolean variable
I am InConnected
p
which provides the In-connectedness
property, and a set OutConnected
p
which provides the
Strong Completeness and Eventual Strong Accuracy
properties.
The algorithm is based on the rotating coordinator
paradigm. It executes in rounds, and each round is
coordinated by a single process, which tries to impose a
value to the rest of participants. If it succeeds, then it takes a
decision and reliably broadcasts [50] it to all processes,
which adopt it. Otherwise, i.e., if the current coordinator is
suspected, then processes advance to the next round. Each
roundis dividedinfour phases: a voting phase, a proposition
phase, an acknowledgment phase, and a (potential) decision
phase. An adequate use of the T(om) failure detector
ensures that, if not earlier, eventually a process that is well-
connected will succeed in imposing a value and thus will
decide.
We now comment on the modifications required to adapt
the original Chandra-Toueg algorithm:
. In Phase 2, the current coordinator waits for a
majority of estimates while it considers itself as in-
connected in order not to block.
. In Phase 3, every process p waits for the new
estimate proposed by the current coordinator while
p considers itself as in-connected and the coordina-
tor as out-connected in order not to block.
. In Phase 4, if the current coordinator sent a valid
estimate in Phase 2, it waits for replies of out-
connected processes while it considers itself as in-
connected in order not to block.
When a process p sends a message m to another process
q, the following relaying approach is assumed: 1) p sends m
to all processes, including q, except p itself, and 2) whenever
p receives for the first time a message m whose actual
destination is another process q, p forwards m to all
processes (except the process from which p has received
m and p itself). This approach can take advantage of the
underlying all-to-all implementation of the T(om) failure
detector without generating any extra message apart from
the periodical all-to-all communication pattern of the failure
detector, as we will see in the next section.
4.2 Correctness Proof
Since the algorithm is very similar to the one proposed by
Chandra and Toueg [7], we only sketch the correctness
proof here. First of all, observe that uniform agreement is
preserved, because we keep the original mechanism based
on majorities to decide on a value. Also, it is easy to see that
integrity and validity are satisfied. Finally, in order to
show that termination is satisfied, we first show that the
algorithm does not block in any of its wait instructions:
. In Phase 2, if the current coordinator p is not in-
connected, it will eventually stop waiting because
the failure detector will eventually set I am
InConnected
p
to FALSE. On the other hand, if p is
in-connected, it will eventually receive a majority of
estimates since by definition there is a majority of
well-connected processes in the system. Hence, no
coordinator blocks forever in the wait instruction of
Phase 2.
. In Phase 3, every process p waits for the new
estimate proposed by the current coordinator or a
NEXT message while p considers itself as in-
connected and the coordinator as out-connected.
Clearly, by the properties of T(om) no process
blocks forever in the wait instruction of Phase 3.
. In Phase 4, the current coordinator waits for replies
of out-connected processes while it considers itself
as in-connected. Again, by the properties of T(om)
no coordinator blocks forever in the wait instruction
of Phase 4.
By the previous facts, eventually some well-connected
process c will coordinate a round in which:
. In Phase 2, the coordinator c will receive a majority
of estimates, because I am InConnected
c
will be
permanently set to TRUE (by the properties of
T(om)) and there is a majority of well-connected
processes in the system. Hence, c will send a valid
estimate to all processes at the end of Phase 2.
. InPhase 3, everywell-connectedprocess p will receive
cs validestimate, because I am InConnected
p
will be
permanently set to TRUE and c will be permanently in
OutConnected
p
(by the properties of T(om)). Hence,
p will send an ACKmessage to c at the end of Phase 3.
618 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 619
Fig. 5. Solving consensus in the omission model using T(om): main algorithm.
. In Phase 4, the coordinator c will receive a majority of
ACK messages, because I am InConnected
c
will be
permanently set to TRUE and all well-connected
processes will be permanently in OutConnected
c
(by
the properties of T(om)) and there is a majority of
well-connected processes in the system. Hence, c will
R-broadcast the decision, and every in-connected
process will eventually decide.
5 FAILURE DETECTION IN THE TRUSTED SYSTEM
Recall from Section 2 that at the transport layer TrustedPals
merely assumes a partially synchronous system model. On
top of this model, we now explain how to implement a
failure detector of class T(om), i.e., we focus on the failure
detection layer of the TrustedPals architecture (see Fig. 4).
The failure detection algorithm presented in this section
determines the connectivity relations defined in Section 3
and builds a basis for the consensus algorithm of Section 4.
The algorithm is based on heartbeat messages that every
process sends periodically to the rest of processes. This
schema provides a kind of delayed message forwarding,
which enables indirect communication of information
attached to heartbeat messages. At the same time, it provides
the support for piggybacking consensus level messages, as
we will see in Section 6.
5.1 Failure Detector Algorithm
Figs. 7, 8, and 9 present an algorithm implementing T(om)
according to the properties defined in Section 3. The
algorithm provides to every process p a set of out-
connected processes, OutConnected
p
and a Boolean vari-
able, I am InConnected
p
. The set OutConnected
p
will
eventually and permanently contain all the out-connected
processes in case p is in-connected. Regarding the
I am InConnected
p
variable, it will be TRUE if p is in-
connected.
The algorithm is based on the periodical communication
of heartbeat messages from every process to the rest of
processes. Roughly speaking, heartbeat messages carry
information about connectivity among processes. When a
process p receives a heartbeat message, it uses that
connectivity information to update its perception of the
connectivity of the rest of processes. This information,
together with ps own connectivity, gives p a view of the
current system connectivity, which will be propagated to
the rest of processes attached to ps subsequent heartbeat
messages. Next we explain in detail how our algorithm
implements this approach.
Every process p has a matrix M
p
of n n elements
representing connectivity information ( relations between
every pair of processes). Every heartbeat message from p
carries the matrix M
p
and a sequence number used to detect
message omissions. Received messages are buffered to be
delivered in FIFO order. When a heartbeat from process q is
received by process p, p updates M
p
according to the state of
its input channels and with the received M
q
. Actually, M
p
represents the transposed adjacency matrix, a (0,1)-matrix of
a directed graph, where the value of the element M
p
[p[[q[
shows if there is an arc from q to p. M
p
has the information
needed to calculate the set of out-connected processes and
the value of the I am InConnected
p
variable. The algorithm
calculates powers of the adjacency matrix to find paths of
any length between processes, which correspond to transi-
tive relations q
+
p, and therefore representing indirect
communication paths from q to p. The set of out-connected
processes, along with the I am InConnected
p
variable, is
computed in the update Connectivity() procedure (Fig. 8),
which is called every time a value of the matrix M
p
is
changed. Observe that it is important for a process p to
check its own in-connectivity to verify the validity of the
information contained in M
p
. The in-connectivity condition
of p (more than n=2| processes communicate properly with
it) is checked in the update Connectivity() procedure too,
and its value is output to the I am InConnected
p
variable.
In the algorithm, every process p executes three tasks:
. In Task 1 (line 13), p periodically sends to the rest
of processes a heartbeat message including the
matrix M
p
and a unique sequence number.
. In Task 2 (line 20), if p does not receive the expected
message from a process q (according to the
next receive
p
[q[ sequence number) in the expected
time, the value of M
p
[p[[q[ is set to 0.
. In Task 3 (line 27), received messages are pro-
cessed. The messages p receives from another
process q are inserted in a FIFO buffer Buffer
p
[q[
(line 28), and delivered following the sequence
number next receive
p
[q[. Once delivered the next
expected message from a process q, the condition of
empty buffer means that there is no message left
from q, so M
p
[p[[q[ is set to 1.
The procedure deliver next message() (Fig. 9) is used to
update the adjacency matrix M
p
using the information
carried by the message. In the procedure, process p copies
into M
p
the row q of the matrix M
q
received from q. This
way, p learns about qs input connectivity. With respect to
every other process u, a mechanism based on version
numbers is used to avoid copying old information about us
input connectivity. Process p will only copy into M
p
the row
u of M
q
if its associated version number is higher.
The periodical exchange of matrices provides a mechan-
ism to indirectly communicate connectivity information
which allows to determine the existence of eventually
timely paths corresponding to
+
relations. This will be the
case for processes in the set c
+
and processes that are in- or
out-connected as defined in Section 3. Besides, a process p
which is not out-connected due to timing/buffering attacks
or channel asynchrony could behave as out-connected if
several reliable (i.e., neither omissive nor lossy) paths from p
to c
+
perfectly alternate in such a way that, eventually, there
is always a path from p to c
+
along which no time-out
expires. In other words, the set of paths emulates an
eventually timely path from p to c
+
, but there is not a stable
eventually timely path from p to c
+
. An analogous reasoning
620 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 6. Solving consensus in the omission model using T(om): adopting
the decision.
can be made for a process that, not being in-connected,
could behave as in-connected. Although this strange
behavior might exist, for simplicity we assume that it does
not occur. Observe that the assumption of a majority of
benign processes connected via
+
relations makes this
form of pseudoconnectedness harmless to the system. Indeed,
if needed, pseudoconnectedness could be detected by
checking if paths along which no time-out expires remain
unchanged (i.e., are stable) or not, although we do not
address this issue here.
5.2 Correctness Proof
We now show that the algorithm of Figs. 7, 8, and 9
implements T(om) in the omission model.
Observation 1. At every process p, the matrix M
p
is updated
with its own connectivity information and with the
matrices M
q
received in the heartbeat messages. The
updated M
p
and its version number V ersion
p
are sent
with ps next heartbeat message. The local delay in
process p to send M
p
and V ersion
p
is bounded in the
algorithm by the period of Task 1 of p, which is finite if p
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 621
Fig. 7. T(om) in the omission model: main algorithm.
is eventually synchronous and has not crashed. This
way, an indirect communication schema is obtained.
Lemma 1. \p; q , iff q
+
p, then eventually and permanently
(M
p
)
n
[p[[q[ _ 1.
Proof. If q
+
p, then, by definition of
+
, eventually and
permanently there is a reliable and eventually timely
path with no omission of any length from q to p. By
Task 1 q sends periodically a heartbeat message
including its updated M
q
and V ersion
q
[q[ to the rest of
processes. Every process r receiving by Task 3 a new
V ersion
q
[q[ will update from M
q
, in the procedure
deliver next message(), the row q of M
r
as well as
V ersion
r
[q[. Process r will update matrix M
r
too with its
own connectivity information: by Task 3 of r, M
r
[r[[q[ is
set to 1 every time Buffer
r
[q[ becomes empty; by Task 2
M
r
[r[[q[ is set to 0 when the next expected message from
q is not received timely by r. By Observation 1, M
r
[r[[q[
and V ersion
r
[r[ are propagated to every process s if s is
eventually synchronous and has not crashed. (Note that,
as a particular case, r or s may be p.) When some
message is delivered by Task 3 of p, by the procedure
deliver next message(), p will update M
p
and calculate
(M
p
)
n
if some value in M
p
has changed. By the
definition of the relation
+
, if q
+
p then (M
p
)
n
[p[[q[
will be evaluated eventually and permanently to a
positive value, otherwise, if not q
+
p, (M
p
)
n
[p[[q[ will
not be set to 1 permanently because, by definition, in
every possible path from q to p there will be two
processes, r and s, such that not r s (r and s could be
q and p), and therefore, (M
p
)
n
[p[[q[ = 0. '
Lemma 2. \p in-connected, \q out-connected, eventually
and permanently q OutConnected
p
.
Proof. By definition of out-connected process, there is
some well-connected process r such that q
+
r. By
definition of in-connected process, there is some well-
connected process s such that s
+
p. By transitivity, for
every well-connected process u, q
+
u, and in particular
q
+
s. By Lemma 1, (M
s
)
n
[s[[q[ _ 1. By the procedure
deliver next message(), s will update M
s
copying all the
rows from at least the rest of well-connected processes.
As a consequence of that, since [ c
+
[ > n=2|, column q
of (M
s
)
n
will include eventually and permanently more
than n=2| positive values. Again by Lemma 1 and the
procedure deliver next message(), column q of (M
p
)
n
will have eventually and permanently a positive value
for more than n=2| processes. As a consequence,
according to the procedure update Connectivity(), q will
be permanently included in the set OutConnected
p
. '
Lemma 3. \p in-connected, \q , out-connected, q is not
permanently in OutConnected
p
.
Proof. Since q is not out-connected, it does not exist a well-
connected process r such that q
+
r. By Lemma 1,
(M
p
)
n
[r[[q[ _ 1 only when q receives messages timely
from r, however, since q is not out-connected, this will not
occur permanently. As a consequence, since [ c
+
[ > n=2|,
the number of processes s such that (M
p
)
n
[s[[q[ _ 1 will
not be permanently greater than n=2|, and by the
procedure update Connectivity(), q will not be perma-
nently included in the set OutConnected
p
. '
Lemma 4. \p , iff p in-connected, then eventually and
permanently I am InConnected
p
= TRUE.
Proof. If p in-connected, \r c
+
; r
+
p. By Lemma 1, and
followingasimilar reasoningtothe proof of Lemma2, now
applied to row p of (M
p
)
n
, iff p in-connected the
procedure update Connectivity() will eventually and
permanently set I am InConnected
p
to TRUE (line 45). '
Theorem 1. The algorithm of Figs. 7, 8, and 9 implements
T(om) in the omission model.
622 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
Fig. 9. T(om) in the omission model: procedure deliver_next_message().
Fig. 8. T(om) in the omission model: procedure update_Connectivity().
Proof. The strong completeness, eventual strong accuracy,
and in-connectivity properties of T(om) are satisfied by
Lemmas 3, 2, and 4, respectively. '
6 INTEGRATING FAILURE DETECTION AND
CONSENSUS SECURELY
Most properties of SMC have their direct counterparts in
properties provided by the consensus abstraction used in
TrustedPals. One notable difference is the property of SMC-
Privacy, which relies on subtle issues not relevant to fault-
tolerant synchronization. As an example, it is possible to
successfully attack TrustedPals if the attacker can distin-
guish different message types on the network. Recall that
every smart cardhas to process messages fromthe consensus
protocol and messages from the failure detector (see Fig. 4).
Therefore, it is important to integrate the consensus and
failure detector algorithms securely in TrustedPals.
6.1 Types of Messages
In TrustedPals there are two types of messages sent over the
channel: messages by the consensus algorithm and failure
detector messages. We call the former protocol messages and
the latter heartbeats. Heartbeats are time critical, i.e., they
should not be delayed by the transport layer, while protocol
messages are asynchronous, i.e., eventual delivery is
sufficient for them.
The idea of TrustedPals is to use heartbeats as the
transport mechanism for protocol messages, providing an
implicit relaying mechanism to the consensus layer. Every
heartbeat has a small fixed-size message field called the
payload. Similar to network transport protocols (like IP)
protocol messages are inserted into the payload of heart-
beats when they are sent. If a protocol message is larger
than the size of the payload, it is fragmented into smaller
parts which are sent one after the other. Similar to the
fragmentation mechanism in IP, unique identifiers and
sequence numbers allow to piece together the fragments at
the destination in the correct order. If there is no protocol
message to be sent, the payload of a heartbeat can remain
empty. In this way, we achieve that TrustedPals generates
fixed size messages in fixed (periodic) time intervals.
Observe that the use of heartbeats as the transport
mechanism for protocol messages has a negative impact on
the performance of the consensus algorithm, since the
consensus messages are delayed until messages of the
failure detection layer are sent. This is the price to pay for
hiding the communication pattern of the consensus algo-
rithm to the adversary.
6.2 Avoiding Message and Traffic Analysis
Traffic analysis refers to an attack technique which tries to
derive information about messages by simply analyzing the
variation of the times in which they are sent and received.
Since we send protocol messages within heartbeats we
increase the difficulty for the attacker to distinguish an
empty heartbeat from a full heartbeat by only looking
at the timing of traffic. Since heartbeats are sent in an all-to-
all pattern, it is also hard to distinguish which process is
sending protocol messages to which other process. This
approach ensures unobservability regarding protocol mes-
sages, a notion known from the area of privacy-enhancing
techniques [48], [49].
Of course, the attacker could simply look into the contents
of a heartbeat to discern a full froman empty message. That is
whywe employcryptography onthe channel. We implement
a secure channel satisfying confidentiality, integrity, and
authenticity using standard techniques from cryptography
[51]. The idea is to use the assumed keying material within
the security modules to encrypt all messages and to use
message authentication codes or digital signatures to ensure
the integrity and authenticity of the channel.
Important from a message-analysis point of view is that
all heartbeats are indistinguishable from each other. As
mentioned above, all heartbeats messages have the same
length and if they are encrypted they will ideally look like
random data. Hence, from just analyzing the contents of a
heartbeat it is impossible to distinguish a full from an
empty heartbeat. Note that information about source and
destination of a heartbeat must be sent unencrypted to
allow routing. However, this information must also be
stored within the encrypted part of the message to ensure
authenticity.
7 CONCLUSIONS
We have presented a modular redesign of TrustedPals, a
smart card-based security framework capable of efficiently
solving Secure Multiparty Computation. The framework is
based on a two-part architecture. One part represents the
untrusted system, consisting of untrusted, Byzantine hosts.
In the other part, representing the trusted system, security
modules reduce the security problem to a fault-tolerant
consensus among smart cards.
The modular redesign allows TrustedPals to face the
consensus problem in the general-omission failure model,
which is more benign than the Byzantine model. Addi-
tionally, the trusted system has to deal with attacks which
cannot be filtered by smart cards, specifically timing
attacks and buffering attacks, resulting in a system model
that includes asynchronous processes and/or channels.
According to that, we model classes of processes and types
of communication among them, and give a novel definition
of the eventually perfect failure detector class for the
omission failure model. The new failure detector properties
are based on process connectedness rather than on process
correctness. We have proposed a failure detector algorithm
which assumes a majority of well-connected processes, and
a consensus algorithm using it. Interestingly, the consensus
algorithm is an adaptation of the classical o-based
Chandra-Toueg algorithm for the crash model.
Another relevant aspect of the redesign is the integration
of failure detection and consensus into the TrustedPals
framework. Since the failure detector follows a heartbeat-
based, all-to-all communication pattern, TrustedPals uses
heartbeats as the transport mechanism for consensus
messages. This approach ensures unobservability. Concep-
tually, the system is reasonably secure against almost all
practical attacks.
Our algorithms can be improved with respect to effi-
ciency. In particular, our implementation of T(om) can be
modified such that it results in by omitting the all-to-all
message exchange pattern, saving a substantial amount of
messages. However, the decision to choose T(om) was
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 623
deliberate since the integrationinto TrustedPals makes all-to-
all communication necessary anyway to protect against side
channel analysis that could endanger security. Therefore,
any such efficiency improvement would be futile in practical
systems. Nevertheless, we consider that determining the
weakest failure detector for solving consensus in this system
is an interesting open question which deserves further
research. Also, the space requirements of the failure detector
messages (mainly the matrix of bits) can be compressed
substantially by special encoding techniques in practice.
As future work, we intend to implement this approach
and perform practical experiments with the system. Within
the failure detector implementation, the size of the payload
field will be an interesting parameter to choose. It is
necessary to find an acceptable tradeoff between security
and performance such that a message size provides better
security in expense of worse performance.
On the theoretical side it would be interesting to study
the minimal storage and communication effort necessary
to solve consensus in our model, since we use unbounded
buffers in our implementation and the bit complexity of
the messages we use is also rather high.
ACKNOWLEDGMENTS
Work by the Spanish authors was supported by the
Spanish Research Council, under grant HA2005-0078.
Research partially supported by the Spanish Research
Council, under grant TIN2010-17170, and the Basque
Government, under grant IT395-10. Work by the German
authors was supported by DAAD PPP Programme Acciones
Integradas Hispano Alemanas. A preliminary version of this
paper appeared in the proceedings of the 9th International
Symposium on Safety, Security and Stabilization (SSS 2007),
Paris, France, 2007.
REFERENCES
[1] A.C.-C. Yao, Protocols for Secure Computations (Extended
Abstract), Proc. IEEE 23rd Symp. Foundations of Computer Science
(FOCS), pp. 160-164, 1982.
[2] M. Fort, F.C. Freiling, L.D. Penso, Z. Benenson, and D. Kesdogan,
Trustedpals: Secure Multiparty Computation Implemented with
Smart Cards, Proc. 11th European Symp. Research in Computer
Security (ESORICS), pp. 34-48, 2006.
[3] Z. Chen, Java Card Technology for Smart Cards: Architecture and
Programmers Guide. Addison-Wesley Longman Publishing Co.,
Inc., 2000.
[4] N. Leavitt, Will Proposed Standard Make Mobile Phones More
Secure?, Computer, vol. 38, no. 12, pp. 20-22, Dec. 2005.
[5] Certgate GmbH, Certgate Smart Card, http://www.certgate.
com/web_en/products/smartcardmmc.html, 2008.
[6] N.A. Lynch, Distributed Algorithms. Morgan Kaufmann Publishers,
Inc., 1996.
[7] T.D. Chandra and S. Toueg, Unreliable Failure Detectors for
Reliable Distributed Systems, J. ACM, vol. 43, no. 2, pp. 225-267,
1996.
[8] F.C. Freiling, R. Guerraoui, and P. Kuznetsov, The Failure
Detector Abstraction, ACM Computing Surveys, vol. 43, no. 2,
pp. 1-40, 2011.
[9] D. Malkhi and M.K. Reiter, Unreliable Intrusion Detection in
Distributed Computations, Proc. 10th Computer Security Founda-
tions Worksop (CSFW), pp. 116-125, 1997.
[10] K.P. Kihlstrom, L.E. Moser, and P.M. Melliar-Smith, Byzantine
Fault Detectors for Solving Consensus, Computing J., vol. 46,
no. 1, pp. 16-35, 2003.
[11] A. Doudou, B. Garbinato, R. Guerraoui, and A. Schiper, Mute-
ness Failure Detectors: Specification and Implementation, Proc.
Third European Dependable Computing Conf. Dependable Computing
(EDCC), pp. 71-87, 1999.
[12] A. Doudou, B. Garbinato, and R. Guerraoui, Encapsulating
Failure Detection: From Crash to Byzantine Failures, Proc.
Seventh Intl Conf. Reliable Software Technologies, pp. 24-50, 2002.
[13] A. Haeberlen, P. Kouznetsov, and P. Druschel, The Case for
Byzantine Fault Detection, Proc. Second Workshop Hot Topics in
System Dependability (HotDep), 2006.
[14] A. Haeberlen, P. Kouznetsov, and P. Druschel, Peerreview:
Practical Accountability for Distributed Systems, Proc. 21st ACM
SIGOPS Symp. Operating Systems Principles (SOSP), T.C. Bressoud
and M.F. Kaashoek, eds., pp. 175-188, 2007.
[15] A. Haeberlen and P. Kuznetsov, The Fault Detection Problem,
Proc. 13th Intl Conf. Principles of Distributed Systems (OPODIS),
pp. 99-114, 2009.
[16] M. Castro and B. Liskov, Practical Byzantine Fault Tolerance and
Proactive Recovery, ACM Trans. Computing Systems, vol. 20, no. 4,
pp. 398-461, 2002.
[17] A. Clement, E.L. Wong, L. Alvisi, M. Dahlin, and M. Marchetti,
Making Byzantine Fault Tolerant Systems Tolerate Byzantine
Faults, Proc. Sixth USENIX Symp. Networked Systems Design and
Implementation (NSDI), pp. 153-168, 2009.
[18] H. Moniz, N.F. Neves, and M. Correia, Turquois: Byzantine
Consensus in Wireless Ad Hoc Networks, Proc. IEEE/IFIP Intl
Conf. Dependable Systems and Networks (DSN), pp. 537-546, 2010.
[19] G. Bracha and S. Toueg, Asynchronous Consensus and Broadcast
Protocols, J. ACM, vol. 32, no. 4, pp. 824-840, 1985.
[20] M. Herlihy and J.D. Tygar, How to Make Replicated Data
Secure, Proc. Conf. Theory and Applications of Cryptographic
Techniques on Advances in Cryptology (CRYPTO), pp. 379-391, 1987.
[21] C. Delporte-Gallet, H. Fauconnier, and F.C. Freiling, Revisiting
Failure Detection and Consensus in Omission Failure Environ-
ments, Proc. Intl Colloquium Theoretical Aspects of Computing
(ICTAC), pp. 394-408, 2005.
[22] C. Delporte-Gallet, H. Fauconnier, A. Tielmann, F.C. Freiling, and
M. Kilic, Message-Efficient Omission-tolerant Consensus with
Limited Synchrony, Proc. IEEE Intl Symp. Parallel and Distributed
Processing (IPDPS), pp. 1-8, 2009.
[23] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, Failure
Detectors in Omission Failure Environments, Technical Report
TR96-1608, Cornell Univ., Computer Science Dept., 1996.
[24] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, Failure Detectors
in Omission Failure Environments, Proc. 16th Ann. ACM Symp.
Principles of Distributed Computing (PODC), p. 286, 1997.
[25] O

. Babaoglu, R. Davoli, and A. Montresor, Group Communica-


tion in Partitionable Systems: Specification and Algorithms, IEEE
Trans. Software Eng., vol. 27, no. 4, pp. 308-336, Apr. 2001.
[26] N. Santoro and P. Widmayer, Agreement in Synchronous
Networks with Ubiquitous Faults, Theoretical Computer Science,
vol. 384, nos. 2/3, pp. 232-249, 2007.
[27] H. Moniz, N.F. Neves, M. Correia, and P. Ver ssimo, Randomi-
zation Can Be a Healer: Consensus with Dynamic Omission
Failures, Proc. Intl Symp. Distributed Computing (DISC), pp. 63-77,
2009.
[28] P.D. MacKenzie, A. Oprea, and M.K. Reiter, Automatic Genera-
tion of Two-Party Computations, Proc. ACM Conf. Computer and
Comm. Security, pp. 210-219, 2003.
[29] D. Malkhi, N. Nisan, B. Pinkas, and Y. Sella, Fairplay - Secure
Two-Party Computation System, Proc. USENIX Security Symp.,
pp. 287-302, 2004.
[30] Y. Lindell and B. Pinkas, An Efficient Protocol for Secure Two-
Party Computation in the Presence of Malicious Adversaries,
Proc. 26th Ann. Intl Conf. Advances in Cryptology (EUROCRYPT),
pp. 52-78, 2007.
[31] V. Kolesnikov, Gate Evaluation Secret Sharing and Secure One-
Round Two-Party Computation, Proc. 11th Intl Conf. Theory and
Application of Cryptology and Information Security (ASIACRYPT),
pp. 136-155, 2005.
[32] L. Kruger, S. Jha, E.-J. Goh, and D. Boneh, Secure Function
Evaluation with Ordered Binary Decision Diagrams, Proc. ACM
Conf. Computer and Comm. Security, pp. 410-420, 2006.
[33] M. Burkhart, M. Strasser, D. Many, and X.A. Dimitropoulos,
Sepia: Privacy-Preserving Aggregation of Multi-Domain Net-
work Events and Statistics, Proc. USENIX Security Symp., pp. 223-
240, 2010.
624 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 4, JULY/AUGUST 2012
[34] G. Avoine and S. Vaudenay, Optimal Fair Exchange with
Guardian Angels, Proc. Intl Workshop Information Security
Applications (WISA), pp. 188-202, 2003.
[35] G. Avoine, F.C. Gartner, R. Guerraoui, and M. Vukolic, Grace-
fully Degrading Fair Exchange with Security Modules, Proc. Fifth
European Dependable Computing Conf. (EDCC), pp. 55-71, 2005.
[36] M. Ben-Or, R. Canetti, and O. Goldreich, Asynchronous Secure
Computation, Proc. Symp. Theory of Computing (STOC), pp. 52-61,
1993.
[37] O. Goldreich, S. Micali, and A. Wigderson, How to Play Any
Mental Game or a Completeness Theorem for Protocols with
Honest Majority, Proc. 19th Ann. Symp. Theory of Computing
(STOC), pp. 218-229, 1987.
[38] D. Chaum, I. Damgard, and J. van de Graaf, Multiparty
Computations Ensuring Privacy of Each Partys Input and
Correctness of the Result, Proc. Conf. Theory and Applications of
Cryptographic Techniques (CRYPTO), pp. 87-119, 1987.
[39] M. Ben-Or, B. Kelmer, and T. Rabin, Asynchronous Secure
Computations with Optimal Resilience (Extended Abstract),
Proc. Symp. Principles of Distributed Computing (PODC), pp. 183-
192, 1994.
[40] M. Correia, P. Ver ssimo, and N.F. Neves, The Design of a Cots-
Real-Time Distributed Security Kernel, Proc. Fourth European
Dependable Computing Conf. (EDCC), pp. 234-252, 2002.
[41] P. Sousa, N.F. Neves, and P. Ver ssimo, Proactive Resilience
through Architectural Hybridization, Proc. ACM Symp. Applied
Computing (SAC), pp. 686-690, 2006.
[42] O. Goldreich, Secure Multi-Party Computation, Internet:
http://www.wisdom.weizmann.ac.il/oded/pp.html, 2002.
[43] L. Lamport, R.E. Shostak, and M.C. Pease, The Byzantine
Generals Problem, ACM Trans. Programming Languages and
Systems, vol. 4, no. 3, pp. 382-401, 1982.
[44] C. Dwork, N.A. Lynch, and L.J. Stockmeyer, Consensus in the
Presence of Partial Synchrony, J. ACM, vol. 35, no. 2, pp. 288-323,
1988.
[45] V. Hadzilacos, Issues of Fault Tolerance in Concurrent Computa-
tions, PhD dissertation, Harvard Univ., Also Published as
Technical Report TR11-84, 1984.
[46] K.J. Perry and S. Toueg, Distributed Agreement in the Presence
of Processor and Communication Faults, IEEE Trans. Software
Eng., vol. SE-12, no. 3, pp. 477-482, Mar. 1986.
[47] P.R. Parvedy and M. Raynal, Optimal Early Stopping Uniform
Consensus in Synchronous Systems with Process Omission
Failures, Proc. 16th Ann. ACM Symp. Parallelism in Algorithms
and Architectures (SPAA), pp. 302-310, 2004.
[48] G. Danezis and C. Diaz, A Survey of Anonymous Communica-
tion Channels, Technical Report MSR-TR-2008-35, Microsoft
Research, 2008.
[49] A. Pfitzmann and M. Hansen, Anonymity, Unlinkability,
Undetectability, Unobservability, Pseudonymity, and Identity
Management - A Consolidated Proposal for Terminology,
technical report, TU Dresden, 2008.
[50] V. Hadzilacos and S. Toueg, Fault-Tolerant Broadcasts and
Related Problems, Distributed Systems, second ed., pp. 97-145,
ACM Press/Addison-Wesley Publishing Co., 1993.
[51] A.J. Menezes, P.C. van Oorschot, and S.A. Vanstone, Handbook of
Applied Cryptography. CRC Press, 1996.
Roberto Cortin as received the PhD degree in computer science from
the University of the Basque Country UPV/EHU, Spain. He is currently
an assistant professor of computer science at UPV/EHU. His research
interests include fault tolerance and distributed and ubiquitous systems.
Felix C. Freiling received the PhD degree in computer science from TU
Darmstadt, Germany. He is currently a full professor of computer science
at Friedrich-Alexander-University Erlangen-Nuremberg, Germany. His
research interests include the full space of theory and practice of
dependable and secure computing.
Marjan Ghajar-Azadanlou received the Diploma degree in computer
science from RWTH Aachen University, Germany. She currently works
for Bayer Business Services in Cologne, Germany.
Alberto Lafuente received the PhD degree in computer science from
the University of the Basque Country UPV/EHU, Spain. He is currently
an associate professor of computer science at the University of the
Basque Country. His research interests are distributed systems and
algorithms, dependable computing, and ubiquitous systems.
Mikel Larrea received the PhD degree in computer science from the
University of the Basque Country UPV/EHU, Spain. He is currently an
associate professor of computer science at the University of the Basque
Country. His research interests include dependable, distributed, and
ubiquitous computing.
Lucia Draque Penso After receiving the PhD degree at Brown
University in the United States, which comprised as well long stays at
RWTH Aachen University and the University of Mannheim in Germany,
she stayed at TU Ilmenau in Germany. She is currently at the University
of Ulm, Germany.
Iratxe Soraluze received the PhD degree in computer science from the
University of the Basque Country UPV/EHU. She is currently an
associate professor of computer science at the University of the Basque
Country. Her research interests include distributed algorithms and
systems, fault tolerance, and ubiquitous computing.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
CORTI
~
NAS ET AL.: SECURE FAILURE DETECTION AND CONSENSUS IN TRUSTEDPALS 625

You might also like