Distributed Execution with Remote Audit

1999

Fabian Monrose, New York University, New York, NY
Peter Wyckoff, Bellcore, Morristown, NJ
Aviel D. Rubin, AT&T Labs - Research, Florham Park, NJ

Abstract

Recently, there has been a rapidly expanding body of work with the vision of seamlessly integrating idle networked computers into virtual computing environments. This is enabled primarily by the success of research efforts promoting parallel and distributed computing on networks of workstations, and by the wide acceptance of Java. The proliferation of work in this area has provided new Internet-based infrastructures that harness the power of computing bases comprising hundreds of loosely connected volunteered machines (i.e., hosts). While many of these systems have proposed the use of non-altruistic, market-based schemes for promoting large-scale participation, mechanisms for ensuring that hosts participating in collaborative computing environments perform the work assigned to them have been largely ignored. This paper presents our implementation of one framework that layers a remote audit mechanism on top of an existing distributed computing model, and provides efficient methods for verifying, with a tunable level of certainty, whether a remote host performed the task it was assigned.

1 Introduction

The popularity of the World Wide Web and Java [5, 25] has promoted a new model for distributing code: downloadable active content. This new model has led to a proliferation of research in Metacomputing, that is, the transparent integration of multiple computer platforms, possibly across geographically dispersed areas, into a single virtual computing environment. The distinctive feature of this model is the exploitation of code mobility and of mechanisms developed for parallel computing on loosely coupled machines [4]. Platforms that support Metacomputing, such as Atlas [7], Charlotte [8], Javelin [13], and ParaWeb [10], provide programming models that support the single-program-multiple-data (SPMD) computational model in heterogeneous computing environments. These platforms address the issues of scalability and fault tolerance and, for the most part, make efficient use of resources for supporting parallel computation within the Java framework. They support the execution of coarse-grained parallel computations on numerous anonymous machines on the Internet, and rely on Java's security architecture [40] to ensure safety to hosts. However, the issue of detecting misbehavior by hosts has been largely ignored. Although it is possible to provide simple result checkers for specific SPMD-style applications such as matrix multiplication and factoring, providing result checkers for arbitrary SPMD programs is not easily accomplished. With the exception of redundant computation and voting schemes [35, 28], no additional mechanisms for verifying the work performed by remote hosts participating in coarse-grained parallel computations have yet been proposed.

Though the idea of computing with secrets in public is appealing for addressing this issue, detecting misbehavior through assertion mechanisms that rely on inserting secret keys within the active content is inadequate: since the content distributed across these platforms must be readable by a potentially large group of hosts, the keys are also readable, and thus can be easily recovered and reused [18]. Furthermore, the availability of efficient decompilation techniques [30, 39], even in light of code obfuscation [22], makes these approaches unsatisfactory for deterring hosts from misbehaving. Recently, Sander and Tschudin [34, 33] proposed the use of computing with encrypted functions [32, 1] to address the issues pertaining to publicly readable executable content. While the idea of computing with encrypted functions is appealing, numerous obstacles, both theoretical and practical, still need to be overcome before its application to mobile cryptography can be fully realized. An examination of some of the problems associated with computing with encrypted functions, as it applies to protecting against malicious hosts, is presented in Section 3.
We have designed and implemented an audit mechanism to detect misbehavior by hosts participating in Metacomputing environments. In particular, we show that a host claiming to perform work that it in fact did not do will be caught with high probability. Our implementation is Java-specific and makes use of the Java execution environment; however, the same principles and techniques could also be applied to other environments such as Safe-Tcl [29, 41].

Section 2 introduces a high-level overview of our model for verifying the work performed by remote hosts. Some preliminary work on mobile agent security, as well as other approaches within different contexts but with similar desirable goals to ours, is presented in Section 3. Our design and implementation are presented in Section 4. To illustrate proof of concept, we present an example application and provide some empirical performance results in Section 5. An analysis of the probability associated with catching cheating hosts is presented in Section 6, and limitations of the current design and implementation are examined in Section 7. We conclude with directions for future work.

2 Overview

A high-level overview of our approach to validating whether hosts participating in coarse-grain, SPMD-style parallel computations perform the work assigned to them is now presented. We consider the model where a "manager" has a large parallel computation to be performed and there exists a collection of "workers" willing to participate in the computation for some payment. The payment mechanism is outside the scope of this paper; we assume that there is some arrangement between the manager and the workers, and that payment is facilitated by some other mechanism (perhaps NetCash [27] or NetBill [36]). Our goal is to audit workers so that the manager can detect misbehavior.
There are three entities that participate in our computing model: managers, workers, and verifiers. The following terms are used throughout the paper:

- Computational components are tasks to be performed. They are characterized by executable content containing code and a set of variables {M} whose values are to be computed. The running time of a computational component is denoted T.

- Managers are processes requiring computing resources.

- Workers are processes offering computing resources, and represent execution environments. They host computational components and provide support for their execution. In its simplest form, a worker is a Java-enabled Web browser.

- Verifiers are processes that confirm the remote execution of a computational component. They represent processes working on behalf of managers.

The following subtle distinction is made between cheating and malicious workers:

- A cheating worker is a worker that does not execute a computational component in its entirety; a cheating worker may execute any part(s) of the component that it chooses. For instance, a cheating worker may execute only the first half of a computational component and return partially computed values for {M}. More formally, let 0 ≤ P < 1; then a cheating worker devotes P·T execution time to a component and returns an incorrect set of values for {M}. A cheating worker may use part of the execution time devoted to a component for techniques such as decompilation.

- A malicious worker is a worker that attempts to deliberately subvert an ongoing computation by providing an incorrect set of values for {M}, even at the cost of execution time greater than T. More formally, let P ≥ 0; then a malicious worker devotes P·T execution time to a computational component.

The practical difference between cheating and malicious workers is that a cheating worker's goal is to minimize resource expenditure (possibly to maximize profit), whereas a malicious worker's goal is to subvert the computation. This paper addresses the issue of detecting misbehavior by cheating hosts only.

Our current verification system is geared towards catching cheating workers and is based on instrumenting computational components, with the aid of compiler techniques, to produce proofs of execution. Once a worker completes the work it was assigned, a proof of execution is sent to the verifier, which examines parts of the proof to check (with some desired level of confidence) whether the component was executed correctly. Since our goal is to provide a framework upon which Metacomputing infrastructures can reside, transparency is important in our design. With regard to the proofs that are generated, there are two important issues that must be addressed. First, the proof must cover the execution of the entire component, not just parts of it; otherwise, a cheating worker could modify the component to execute only the parts that are checkable. Second, since a cheating worker will know the exact protocol for producing and verifying the proof a priori, knowledge of the verification protocol must have no implications for the security of the system.
In our model, a proof consists of the state of the computational component at various points in its execution. In essence, a computational component is transformed into small subcomponents whose post-execution state is the input to the next subcomponent. Each post-execution state consists of private data and information from the call stack, which we call a trace. Verification consists of repeatedly picking one of the traces, i, at random, and executing the computational component from the point of the execution that produced that state to the point in the execution that triggers the next trace. If the state of the call stack in the verifier's virtual machine after executing up to the next state point is the same as the (i+1)-st state point that the worker submitted, the worker produced the correct state transition between those two traces.

Figure 1 depicts a high-level view of our system architecture, illustrating the compile-time, run-time, and verification modules. We describe each of these modules in more detail in Sections 4.1, 4.2.1, and 4.2.2, respectively. As the figure shows, computational components are automatically instrumented (via the compile-time module) with the code needed to produce checkable state points. The augmented component is sent to the worker, which executes it, producing the proof of execution and the result(s). Once completed, the worker sends the verifier the results and the sequence of state points that comprise the proof. The verifier then samples a random set of these state points, checking the correctness of the transitions between them.

Figure 1: High-level overview of the system design illustrating the compile-time, run-time, and verification environments. The compile-time module automatically instruments computational components with the code needed to generate state points. These augmented components are sent to the worker, which executes them in its local interpreter. A remote agent, responsible for handling each component, creates a proof of execution and the results (final state). Once the worker completes the task assigned, the remote agent sends the verifier the results and the sequence of state points that comprise the proof. The verifier then samples a few of these state points, checking the correctness of the transitions between the states.

Intuitively, if a computational component is transformed into N − 1 pieces, then there are N traces (the initialization variables for the component form the extra trace). If the proof contains L incorrect traces, and the verifier chooses to verify K of the traces, then the probability that the verifier finds one of the incorrect traces is greater than 1 − ((N − L)/N)^K. In Section 4.1 we describe how computational components are transformed in such a way that producing the intermediate states requires that the components be executed.
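For concreteness, the short calculation below (an editorial sketch; the class and method names are ours, not part of the system described here) evaluates the bound 1 − ((N − L)/N)^K for a few sample parameters.

    // Editorial sketch: evaluates the detection bound from Section 2.
    // With N traces, L of them incorrect, and K checked at random,
    // a cheating worker is caught with probability > 1 - ((N-L)/N)^K.
    public class DetectionBound {
        static double catchProbability(int n, int l, int k) {
            return 1.0 - Math.pow((double) (n - l) / n, k);
        }

        public static void main(String[] args) {
            // e.g., 100 traces, 5 of them incorrect
            System.out.printf("N=100, L=5, K=10 -> %.3f%n", catchProbability(100, 5, 10)); // ~0.401
            System.out.printf("N=100, L=5, K=30 -> %.3f%n", catchProbability(100, 5, 30)); // ~0.785
        }
    }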
3 Related work

Although the research on program checkers, probabilistic interactive proofs, and their spin-offs is quite different from that presented in this paper, it is nonetheless particularly important to our work. The theory and ideas put forth in that research provide a sound formal background for work in this area. The work of Blum et al. [9] on program checking is concerned with verifying that a given program returns a correct answer on a given input rather than on all inputs, and, with the adaptation of interactive proof systems [19], provides the basic intuition for our work.

Recently, research on validating remote execution has been conducted within the mobile agent framework. Although the distributed SPMD-style computing model in which we are interested is quite different from the focus of classical mobile agent systems, there are some similar desirable security goals in both models. Mobile agents are defined as processes that can autonomously migrate to new hosts while executing their task. Approaches to the problem of malicious hosts within these environments have been proposed by Hohl [22], Sander and Tschudin [34, 33], and Vigna [38].

To address the problem of misbehavior by hosts, Hohl [22] proposes a combination of code mess-up (i.e., obfuscation) and time-critical restrictions placed on the mobile code. These restrictions are encapsulated as part of the code, with the intention that nodes which host agents comply with the restrictions placed on the code. An agent is therefore invalidated as it migrates from host to host if it is not executed within a certain time window. The timing constraints of the execution window are based on empirical results for the expected de-mangling time of the code (typically on the order of seconds). The goal is to restrict the lifetime of the agent by limiting the execution window to a fraction of the time it would take to de-mangle, understand, and change the code in a deliberate manner. Since the de-mangling time is much less than that of typical coarse-grain parallel computations, this approach is not applicable to our goals.

The approach of Sander and Tschudin [34, 33] for protecting mobile agents against malicious hosts is based on the idea of computing with encrypted functions [32, 1]. The intuition is as follows: given a function f(x) which can be represented by a program P, rather than distribute P, the encrypted function E(P) is distributed instead. The remote host then executes y = E(P)(x) and transmits y back to the originating host, at which point y is decrypted to retrieve f(x). P is expressed as a polynomial or rational function with certain mathematical properties. The caveat is that computing with encrypted functions for general programs is an open problem, and many strong arguments that largely discourage the idea of computing with encrypted data have already been put forth (see [2, 11, 32]). Furthermore, efficient encryption schemes with the desired homomorphic properties (see [34]) for arbitrary functions are not known, and since P is executed in its encrypted form, exactly how this approach will be realized in practice is unclear.

The solution put forth by Vigna [38] is to verify program execution by tracing the operations performed by a mobile agent during its lifetime. In Vigna's model, a roaming agent is composed of code and a state that is determined, at some specified point, by code execution. Every statement executed by the agent is logged, and before the agent migrates to a new host, a digitally signed message is sent to the originating host containing a checksum of the program's final state before migrating, a checksum of the execution trace, and a unique identifier. Traces are logs of the operations performed by the agent during its lifetime. The extended form of the execution trace is stored for some limited period of time. In the event that tampering with an agent's code is suspected, the tampering can be proven by verifying the agent program against a supposed history of its execution, i.e., by simulating the entire program locally. The system is implemented in a restricted subset of Safe-Tcl [29], and tracing is accomplished by adding new instructions to the language. It is assumed that the code is static; therefore, performance enhancements such as just-in-time compilation are not possible. The main caveat, however, is that the complexity of the validation process is linear with respect to the size of the execution traces, which can be significant. No implementation or empirical results based on system performance were presented.

4 Architecture

Our design and implementation can be roughly separated into a compile-time module and a run-time module. The compile-time module automatically instruments computational components with code to interact with the remote debugger, and the run-time module performs the verification. We discuss the design and implementation of these two modules in the following sections.
4.1 Compile-time module

In Section 2 we stated that the remote agent periodically saves information from the call stacks of executing computational components based on conditions specified by the manager at run-time. The mechanism by which these conditions are selected is now elaborated on. As mentioned earlier, the compile-time module transforms a computational component into checkable units, some of which the verifier runs from start to finish on its local machine, to check that the proof of execution provided by a worker corresponds to a correct execution of those units. A proof is represented by a sequence of traces, t_0, t_1, ..., t_n.

We now define some terms that are useful for describing the design and implementation of the system. A block of code is defined as execution capturable if, with high probability, the state of the job after its execution cannot be produced correctly by means other than executing the block. For example, a subcomputation that generates a single bit of information is not execution capturable, because an adversary may be able to guess the outcome and generate a correct state with probability 0.5.

During the transformation of a component, some parts of the program generate single units, while loops correspond to multiple units. We must ensure that the multiple units generated within a loop have some property that makes them unique; otherwise, the traces that correspond to their executions might be interchangeable. We define a unit generated from a part of the program that does not have a loop boundary (i.e., it is not contained in a loop, though it may contain loops) as distinct. We define the units generated from a loop, where the values computed in the loop depend on at least the loop index and the state of the program before the loop, as uniquely identifiable. We require that the units our transformation creates be either distinct or uniquely identifiable; otherwise, the trace transitions defined by the units may not correspond to only one unit, and a cut-and-paste style attack is possible. To guarantee that these criteria are met, we require that at least one variable within a computational component capture past history. That is, there is at least one object (for example, an array) whose values correspond to the computation and therefore reflect the computation between traces. This ensures that successive traces include some actual state that was computed.
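To make the uniquely identifiable requirement concrete, the fragment below is an editorial sketch (not code from the paper) of a loop whose per-iteration state depends on both the loop index and a history-capturing array; a trace captured at the loop back-edge therefore reflects all preceding iterations and cannot be interchanged with the trace of another iteration.

    // Editorial sketch: a loop whose units are uniquely identifiable.
    // Each iteration's state depends on the loop index i and on the
    // history array acc[], so a trace taken at the back-edge cannot be
    // forged or reused without executing the preceding iterations.
    public class LoopUnit {
        public static void main(String[] args) {
            int n = 1000;
            long[] acc = new long[n];   // history-capturing variable
            long state = 17;
            for (int i = 0; i < n; i++) {
                state = (state * state + i) % 9311; // depends on i and prior state
                acc[i] = state;                     // records the past computation
                // back-edge target: the compile-time module would place a
                // breakpoint candidate here so the remote agent can capture a trace
            }
            System.out.println(acc[n - 1]);
        }
    }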
4.1.1 Code transformation

Our compile-time module uses the control flow graph (CFG) [3] of a given computational component, P, to generate its corresponding checkable units. Intuitively, the key to the instrumentation process is that (1) the execution of the checkable units is captured in the traces, and (2) each trace corresponds to the output of exactly one checkable unit. The CFG is produced by recovering high-level structure from the Java bytecode of the untransformed computational component using our own back-end to the Java compiler. Java is used as our implementation platform because of its portability and heterogeneity, and the fixed format of Java class files makes it ideal for recovering the high-level structure [30] needed to build the CFGs.

In a manner similar to the work on compiler-assisted checkpointing [24], given the CFG, our compile-time module adds the target nodes, i, of any back-edges to a set of breakpoint candidates S. Additionally, every exit node in P is added to S. For each i ∈ S, a new object actionS_i, consisting of a single boolean method, is created. This method specifies when a computational component's state should be saved, and it triggers the remote agent whenever this condition is true. ActionS_i is created based on simple data analysis techniques and can be overridden by the programmer. Work on optimizing Java compilers [15, 12] utilizes significantly more powerful analysis techniques than those we currently use, and we hope to borrow some design and implementation ideas from that research to enhance our functionality. All objects are loaded on demand across the network by the remote agent, transparently to the worker.

Figure 2 depicts a transformation of a computational component into checkable units, each of whose "output" is the input to the next unit. If the entire component were not covered by checkable units, a malicious worker would be free to corrupt the uncovered parts without any danger of being caught.

Figure 2: High-level overview of the transformation of the computational component for the discrete log example (from the Appendix) into checkable units using its CFG. Given this CFG and its associated component, the compile-time module transforms the code by instrumenting it to trigger breakpoints at specific points. At run-time, when breakpoints are triggered, the remote agent associated with the current computational component collects internal state information on the call stack for the current thread. These traces are used later during verification.

Once S and actionS_i are created, for all i ∈ S, the compile-time module inserts calls to actionS_i at the location specified by S_i. The manager saves some metadata, including S, in persistent storage, and at run-time the remote agent responsible for collecting traces for the executing component is instructed (via a callback mechanism) where to set breakpoints. These breakpoints alert the remote agent when state must be saved. A trace represents the current state of a computation, including the values of all instance variables within scope, at the point in time when the breakpoint was triggered. Type information is not saved, since the verifier knows the type of each object that should appear on its local stack at any given point. We show that even if the condition under which state is saved is known a priori, it is of no advantage to a cheating worker.
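The paper does not show the code for these condition objects; the following editorial sketch illustrates one plausible shape for actionS_i, whose single boolean method decides at a breakpoint candidate whether the current state should be saved. The interface and class names here are our own assumptions, made only for illustration.

    // Editorial sketch: a possible shape for an actionS_i condition object.
    // The interface and names are assumptions, not the paper's actual API.
    interface SaveCondition {
        boolean shouldSave(int loopIndex);
    }

    // Example instance for a back-edge target: save state every `interval`
    // iterations (e.g., every 10% of the total work), triggering the
    // remote agent to capture a trace when it returns true.
    class ActionS7 implements SaveCondition {
        private final int interval;
        ActionS7(int interval) { this.interval = interval; }
        public boolean shouldSave(int loopIndex) {
            return loopIndex % interval == 0;
        }
    }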
4.2 Run-time module

The run-time component of our system must be interoperable with existing Metacomputing environments. Otherwise, remote monitoring and verification services would not be transparent to the programmer, who would need to write stack marshaling and unmarshaling routines for every breakpoint in a computational component. To avoid this, we chose the most common Metacomputing environment implementation language, Java, as our run-time system's implementation platform. While a number of inherent security flaws have been outlined [26, 16, 23], we provide a framework that assumes these concerns will be resolved as Java matures; moreover, our verification technique will not be compromised even if a worker tampers with its JVM.

4.2.1 Remote monitoring

Our implementation of remote auditing is based primarily on Java's Debugger API. This API is designed around the concept of remote debugging, which allows the debugger and its target (i.e., the debuggee) to execute on separate machines. The Java model for remote debugging is depicted in Figure 3.

Figure 3: Overview of the remote debugging process in Java. The architecture consists of a debugger client, a debugger server, and a TCP/IP socket-based communication protocol. The API allows us to connect and communicate with the JVM of the worker and obtain low-level information about the internal states of executing threads. A remote agent handles the actions initiated by the client, while a separate thread handles notification events from the server. The communication and notification events are transparent to the worker.

The model consists of a debugger client, a debugger server, and a TCP/IP socket-based communication protocol. The API allows a process to connect and communicate with the JVM [25] of the target and obtain low-level information about the internal states of executing threads. All communication to and from the debugger server is performed over two socket connections created when the RemoteDebugger class is instantiated. A remote agent thread handles synchronous actions initiated by the client and communicated over one of the sockets, while an agent input thread handles notification events from the server. These events are asynchronous to the client and are implemented via a callback mechanism: when the agent input thread receives a message from the debugger server over the socket, the message is interpreted and the appropriate method within the registered callback is invoked.

For ease of implementation and to reduce communication overhead, the remote agent and the debugger server reside on the target worker machine. This decision has little significance with regard to cheating workers, but it does influence the trust relationship between managers and workers. While the current architecture allows the functionality provided by these modules to be easily migrated to the manager, the network latency overhead would degrade system performance.

The remote agent's task of capturing the state of the computational object in a serialized form suitable for transmission is accomplished through the use of an object serialization mechanism known as pickling [31]. The complementary process of unpickling is used by the verifier to initialize frames in its local call stack to those transmitted by the remote agent.
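In Java, pickling in this sense is provided by the java.io object serialization machinery. The sketch below is an editorial illustration (the Trace class and its fields are invented for this example; the actual system serializes call-stack state captured by the remote agent) of how a trace might be flattened for transmission and reconstituted by the verifier.

    import java.io.*;

    // Editorial sketch: pickling/unpickling a captured trace with Java
    // object serialization, in the spirit of [31]. Trace is invented here.
    class Trace implements Serializable {
        int unitId;      // which checkable unit produced this state
        int loopIndex;   // loop index at the breakpoint
        long[] history;  // history-capturing variable (e.g., an array)
    }

    class Pickle {
        // Serialize a trace into bytes suitable for transmission.
        static byte[] pickle(Trace t) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(t);
            oos.flush();
            return bos.toByteArray();
        }

        // Rebuild the trace so the verifier can initialize its local image.
        static Trace unpickle(byte[] bytes) throws IOException, ClassNotFoundException {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
            return (Trace) in.readObject();
        }
    }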
4.2.2 Verification

The verification process is depicted in Figure 4. A traditional challenge-response style protocol is used by the verifier to validate the proofs of execution submitted by a remote agent working on behalf of a worker.

Figure 4: The verification process. The verifier selects a trace, i, at random, and reconstructs its state on its local call stack. The computational component is executed until state i+1 is reached, and the final state is compared with that in trace i+1 submitted by the remote agent. If the two states are equal, then the trace is accepted and the worker is rewarded. Otherwise, the trace is rejected; at that point, it might be prudent to penalize the worker or the organization it represents.

The protocol proceeds as follows. The remote agent commits to each trace t_0, ..., t_n by submitting a cryptographic hash (SHA-1 in our implementation) of each trace collected. The verifier then randomly chooses an index r < n and challenges the remote agent to present the trace t_r. The remote agent responds with the requested trace, which is cross-checked by the verifier. To validate trace t_{r+1}, the verifier reconstructs the image of t_r on its stack (checking the values of loop indices for correctness) and executes a local copy of P from the program counter immediately following the point that triggered the commitment of state t_r, until state t_{r+1}. The local execution represents running a fraction of the computational component, corresponding to the amount of work performed between successive traces. The verifier then compares the hash of its call stack with hash(t_{r+1}), previously committed by the remote agent. If the hashes are equal, t_r is accepted; otherwise it is rejected. The procedure may be repeated for any number of the remaining traces, depending on the level of assurance required by the manager. We show that the run-time associated with validating traces is significantly less than the time it would take if the manager executed the entire component locally (i.e., spot-checking [35]).
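An editorial sketch of the commit-and-verify core of this protocol follows, using the JDK's MessageDigest class for SHA-1; the pickled trace bytes are assumed to come from the serialization step of Section 4.2.1, and the class name here is hypothetical.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    // Editorial sketch: the commitment and comparison steps of the
    // challenge-response protocol. The remote agent first submits SHA-1
    // digests of all traces; when challenged with index r, it reveals the
    // pickled trace, which the verifier checks against the commitment.
    class TraceCommitment {
        // Commitment submitted by the remote agent for one trace.
        static byte[] commit(byte[] pickledTrace) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("SHA-1").digest(pickledTrace);
        }

        // After locally re-executing the unit and serializing its own call
        // stack, the verifier compares the digest with the committed one.
        static boolean verify(byte[] localStackBytes, byte[] committedDigest)
                throws NoSuchAlgorithmException {
            byte[] local = MessageDigest.getInstance("SHA-1").digest(localStackBytes);
            return Arrays.equals(local, committedDigest);
        }
    }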
5 Example: computing a discrete logarithm

As proof of concept, we consider an example where a party, Alice, uses the same secret exponent x for multiple Diffie-Hellman [17] key exchanges. Mallot, Alice's adversary, has knowledge of this fact and wants to impersonate Alice. However, he has only limited resources, and exhaustively searching for the secret exponent is the only method available to him. Mallot also knows that Alice uses the relatively small public parameters g = 17 and n = 9311. This particular example was chosen because it can be represented by an SPMD-style application that can be efficiently supported by current Metacomputing infrastructures.

To expedite the search for the secret exponent, Mallot creates an SPMD-style computational component utilizing a striping technique and submits the component to a manager, thereafter agreeing on some form of payment for workers. The Java code for the component is given in the Appendix. While it is trivial for Mallot to check that an exponent returned by a worker is in fact the secret exponent x he is seeking, payment must also be made to workers who pruned the search space requested of them but did not yield any candidate exponent(s). Furthermore, cheating workers must not receive any payment. For these reasons, the verification service provided by the manager is essential to the task at hand. This example is used as the basis for the performance results presented in the following section.

5.1 Performance

Performance results based on the discrete log example are now presented. Experiments were conducted within the Distributed Systems Laboratory at New York University on twelve 200 MHz Pentium Pro machines running Linux, each with 64 MB of main memory. The machines were interconnected by a 100 Mbit/sec Ethernet, and all entities resided on separate machines.

Our first set of experiments examined the performance and overhead associated with the remote monitoring environment. In these experiments, the elapsed execution time for exhaustively searching in parallel for the secret exponent, on a series of machines, was compared to the sequential execution of the same task in a non-monitored environment (i.e., the execution was performed on one machine outside the remote monitoring environment). The sequential execution time was 277 seconds. The elapsed time for performing the search within the remote monitoring environment, with only one worker participating in the computation and the remote agent logging state at every step during the computation, was approximately 1452 seconds. This represents a slowdown of more than 5 times over the non-monitored, sequential execution. However, when the remote agent was instructed to save the state at 10% intervals (i.e., the amount of work performed between successive traces corresponds to 10% of the total work to be performed), the elapsed time was 458 seconds, a slowdown of only 1.65. The overhead diminishes rapidly as more workers are added to the computation (see Figure 5). In fact, speedups over the sequential execution are achieved with 5 or more workers, regardless of how much state is saved.

Figure 5: Performance of the auditing environment. I represents the trace interval, that is, the amount of work performed between successive traces. As more workers are added to the computation, speedups over the sequential execution are realized. The relatively low overhead imposed on workers in these distributed settings makes such an auditing environment beneficial to managers, as it provides assurances with regard to a worker's execution of a computational component.

A second set of experiments was conducted to evaluate the performance of the verification system. The experiments involved starting one manager that spawned n = 1, 2, ..., 12 workers in successive experiments. For each worker participating in the discrete log computation, the verifier picked 10% of the submitted traces at random (in addition to the initialization and final states) and checked the correctness of each state transition against its locally reconstructed image. The results are depicted in Figure 6; the elapsed time for starting the remote n workers is not reflected there. Verifying 10% of the traces of one worker required significantly less time (109.30 seconds) than executing the computational component sequentially. As more workers were added, and hence more traces needed to be verified, the performance of the verification system steadily degraded due to the overhead associated with receiving multiple proofs of execution from remote agents simultaneously. Nonetheless, these results show that our approach represents an efficient method for catching cheating workers in a distributed environment.

Figure 6: Performance of the verification system. For each worker participating in the computation, the verifier picked I = 10% of the submitted traces at random (in addition to the initialization and final states) and checked the correctness of each state transition against its locally reconstructed image. These results show that our approach represents an efficient method for verifying a worker's supposed execution of a computational component.

To evaluate where possible improvements in performance could be achieved, we measured the overhead of the various monitoring and verification activities. We observed that while there was room for enhancement within the verification component, whose primary function is reconstructing and comparing stack states, a significant portion of the overhead, 37.2%, is associated with context switching, especially time spent pausing and resuming threads within the Java debugging environment. The same is true for the remote monitoring environment. Although we suspect that performance enhancements within the classes that provide the functionality for the Java debugging environment are unlikely (it has been unsupported since JDK 1.02), implementing a verification system that does not rely on the functionality provided by the remote debugging facilities can be accomplished with little difficulty. We are currently pursuing this approach.
6 Probability of catching a cheating worker

We now present an analysis of the probability associated with catching a cheating worker. Consider the case where the adversary is paid to execute 100 jobs, each of which is transformed into 100 units. Assume that the adversary wants to (or only has the resources to) execute 95% of the total work necessary to complete the 100 jobs in their entirety. We compare the minimum probability of catching such a cheating worker in our system with spot checking [35]. The minimum probability of catching the worker is computed by finding the probability of catching a worker that employs an optimal cheating strategy.

For our scheme, the probability of catching a cheating worker is multiplicative, and therefore the optimal cheating strategy is to execute 95 of the units for each job. If the verifier checks a single unit per job, the verifier has a 0.05 probability on each job of catching the adversary cheating. Over the 100 jobs, the probability of detecting misbehavior is approximately 0.994 ≈ 1 − (1 − 0.05)^100. This probability is achieved with the verifier checking only 1% of each job (in fact, the probability is slightly higher, because when the transition from r_i to r_{i+1} is checked for 1 ≤ i ≤ n − 1, the verification will catch the error if either of r_i and r_{i+1} is incorrect). If instead, as in spot checking, the verifier were to periodically execute a job locally in its entirety, then the optimal cheating strategy is to execute 95 of the jobs in their entirety, expending no effort on the other 5 jobs. The probability of detecting misbehavior is then only 0.05.
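The short calculation below (an editorial sketch, not from the paper) reproduces the two numbers in this comparison.

    // Editorial sketch: a cheater skips 5% of the work across 100 jobs of
    // 100 units each. Our scheme checks one unit per job; spot checking
    // executes one randomly chosen job in its entirety.
    public class CheatAnalysis {
        public static void main(String[] args) {
            int jobs = 100;
            double perJobCatch = 0.05;       // 5 skipped units out of 100
            double ourScheme = 1 - Math.pow(1 - perJobCatch, jobs);
            double spotCheck = 5.0 / jobs;   // chance the audited job is one of the 5 bad ones
            System.out.printf("our scheme:    %.3f%n", ourScheme);  // ~0.994
            System.out.printf("spot checking: %.3f%n", spotCheck);  // 0.050
        }
    }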
7 Limitations

Our approach to remote auditing of hosts is well suited to computational components that are SPMD-style programs. While our work addresses an important problem within the distributed and parallel research community, our techniques are not directly transferable to other fields, such as classical mobile computing. Some limitations of our current design and implementation, which will be addressed as this research matures, are outlined below:

- The verification techniques described herein assume the existence of good pseudo-random number generators; otherwise, if a worker can guess a priori, with high probability, which traces will be examined by the verifier, this knowledge significantly decreases its chances of getting caught. Given the current random number generators in Java, such attacks on our system are possible. However, this attack can be thwarted through the use of hardware random number generators.

- Our model assumes that computational components can be transformed into checkable units that have similar execution times. There are certainly applications where this is not the case. Therefore, in choosing which traces to check, the verifier must weight the traces with estimates of their relative execution times. Research on designing efficient weighting algorithms is currently underway.

- Programs that rely on external input or random data cannot be handled per se by the method outlined here; however, programmer assistance using pragmas, or more sophisticated compiler assistance, can be used to handle external input and random data.

- Performance enhancements within the remote debugging interface are unlikely, as it has been unsupported since JDK Release 1.02. In addition, the functionality provided by the remote debugging environment requires access to privileged system targets beyond the normal confines of the Java sandbox. In environments where the principals participating in distributed computations are represented by Java applications, as is the case with JavaPVM [14], object signing is not an issue. However, although the security policies that define the behavior of local versus remote code are being designed to be more flexible and extensible [21, 20], if one intends to support this framework entirely within Java-enabled browsers, then issues pertaining to object signing need to be taken into consideration.

8 Conclusion and future work

In this paper, we presented a technique for auditing the execution of SPMD tasks in a distributed environment based on transforming tasks into checkable units. Our work is geared towards auditing workers participating in coarse-grain parallel computations on numerous anonymous machines on the Internet. We presented a framework based on compile-time and run-time modules which includes a verifier capable of catching cheating workers with high probability by checking a small number of the state transitions between these units. Intuitively, this scheme has the advantage that for a worker to cheat, it must successfully corrupt at least one of the checkable units, which is much more difficult than corrupting an entire computation by simply returning an erroneous result. For our auditing technique to apply, we are restricted to the set of programs that are execution capturable and have uniquely identifiable units generated within loops. Although we show the probability of catching cheating workers given a proper transformation of a computation, we do not show what the probabilities are for programs outside our restricted set, which may be the case if we apply our techniques to non-SPMD programs.

Previous practical work in this area has relied on unreasonably strong assumptions about the adversarial model of cheating or malicious workers. For example, systems that rely on embedding keys within tasks and returning some function of these keys as assertions that the task was correctly executed assume that the code will not be decompiled. The availability of efficient decompilation techniques has rendered these approaches inadequate. Other approaches have proposed replicating the tasks on domain-disjoint machines and occasionally checking task outputs using a trusted machine. Our approach is in some sense orthogonal to these heuristics, and they can therefore be applied to our system; however, the advantage would be minimal. Our contribution is in providing a more efficient execution environment for detecting misbehavior by hosts participating in Metacomputing infrastructures.

A promising approach to improving the guarantees provided by our system involves applying techniques such as those of Arora and Safra [6] for devising probabilistically checkable proofs (PCPs) to the verification of the transitions between checkable units.
We plan to pursue this approach by creating redundancy in the traces, so that checking a reasonably small number of the augmented pieces of the proof of execution presented by a worker would find errors in the worker's computation with high probability. We are also considering incorporating more complex binary predicates, instead of only stack equalities, as checks, and we hope to achieve these goals without too high a cost in performance.

Acknowledgments

The authors would like to thank Saugata Basu, Matt Franklin, and Laxmi Parida for stimulating discussions about this research. We also thank the anonymous referees, Geritt Bleumer, Tom Bowen, Ian Jermyn, and Zvi Kedem for their suggestions on improving earlier drafts of the paper. This research was sponsored by the Defense Advanced Research Projects Agency and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-96-1-0320, and by the National Science Foundation under grant number CCR-94-11590. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency, Rome Laboratory, or the U.S. Government.

APPENDIX

The Java code that implements the discrete log example discussed in Section 5 is now given. The code is annotated with labels that represent the nodes in the control flow graph (CFG) of Figure 2; these labels are the actual offsets within the Java bytecode of the corresponding class file.

    // Discrete log example with relatively small integers
    import java.math.BigInteger;

    public class DLSearch extends Thread {
        private int whoami, window, result = 0;
        private final int prime = 9311, g = 17, z = 5653;
        private int[] zp; /* variable which captures past computations */

        /* B0 */ public static void main(String args[]) {
            DLSearch d = new DLSearch(Integer.parseInt(args[0]),
                                      Integer.parseInt(args[1]));
        }

        /* B1 */ DLSearch(int who, int slice) {
            this.whoami = who;
            this.window = slice;
            this.setPriority(6);
            zp = new int[prime / window];
            this.start();
        }

        /* divide work based on striping technique */
        /* B2 */ public void run() {
            int i = 0;
            while ((whoami + (i * window)) < prime) {
                /* B73 */ zp[i] = BigInteger.valueOf(g).pow(whoami + (i * window))
                                  .mod(BigInteger.valueOf(prime)).intValue();
                /* B5 */  if (zp[i] == this.z) {
                /* B55 */     result = whoami + (i * window);
                          }
                /* B70 */ i++;
            }
            /* B90 */ System.exit(0);
        }
    }

Figure 7: Labels B0 ... BN correspond to the labels in the control flow graph (CFG) in Figure 2. High-level structure is recovered from the Java bytecode; the labels represent the actual offsets within the Java bytecode of the corresponding DLSearch class file.
References

[1] M. Abadi and J. Feigenbaum. Secure Circuit Evaluation. Journal of Cryptology, 2(1):1-12, 1990.

[2] N. Ahituv, Y. Lapid, and S. Neumann. Processing Encrypted Data. Communications of the ACM, 30(9):777-780, 1987.

[3] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.

[4] T. E. Anderson, D. E. Culler, and D. A. Patterson. A Case for Networks of Workstations: NOW. IEEE Micro, February 1995.

[5] K. Arnold and J. Gosling. The Java Programming Language. Addison Wesley, 1996.

[6] S. Arora and S. Safra. Probabilistic Checking of Proofs. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science, pages 2-13. IEEE Computer Society Press, October 1992.

[7] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.

[8] A. Baratloo, M. Karaul, Z. M. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the Web. In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems, 1996.

[9] M. Blum and S. Kannan. Programs That Check Their Work. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, 1989.

[10] T. Brecht, H. Sandhu, M. Shan, and J. Talbot. ParaWeb: Towards World-Wide Supercomputing. In Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.

[11] E. F. Brickell and Y. Yacobi. On Privacy Homomorphisms (Extended Abstract). Eurocrypt, 1987. Abstracts: IV-7-IV-14.

[12] Z. Budlimic and K. Kennedy. Optimizing Java: Theory and Practice. Concurrency: Practice and Experience, 9(6):445-464, June 1997.

[13] P. Capello, B. Christiansen, M. Ionescu, M. Neary, K. Schauser, and D. Wu. Javelin: Internet-Based Parallel Computing Using Java. In ACM Workshop on Java for Science and Engineering Computation, 1997.

[14] Center for Human-Machine System Research. JavaPVM: The Java to PVM Interface, August 1997.

[15] M. Cierniak and W. Li. Optimizing Java Bytecodes. Concurrency: Practice and Experience, 9(6):427-444, June 1997.

[16] D. Dean, E. Felten, and D. Wallach. Java Security: From HotJava to Netscape and Beyond. In Proceedings of the IEEE Symposium on Security and Privacy, pages 190-200, 1996.

[17] W. Diffie and M. E. Hellman. New Directions in Cryptography. IEEE Transactions on Information Theory, 22(6):644-654, November 1976.

[18] W. Farmer, J. Guttman, and V. Swarup. Security for Mobile Agents: Issues and Requirements. In Proceedings of the 19th National Information Systems Security Conference, pages 591-597, 1996.

[19] S. Goldwasser, S. Micali, and C. Rackoff. The Knowledge Complexity of Interactive Proof Systems. In 17th ACM Symposium on Theory of Computing, pages 291-304, 1985.

[20] L. Gong, M. Mueller, H. Prafullchandra, and R. Schemers. Going Beyond the Sandbox: An Overview of the New Security Architecture in the Java Development Kit 1.2. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, December 1997.

[21] L. Gong and R. Schemers. Implementing Protection Domains in the Java Development Kit 1.2. In Proceedings of the Internet Society Symposium on Network and Distributed System Security, pages 125-134, March 1998.

[22] F. Hohl. An Approach to Solve the Problem of Malicious Hosts. Technical Report TR-1997-03, Universitat Stuttgart, Fakultat Informatik, Germany, March 1997.

[23] M. D. LaDue. Java Insecurity. Computer Society, Spring 1997.

[24] C. J. Li and W. K. Fuchs. CATCH: Compiler-Assisted Techniques for Checkpointing. In 20th International Symposium on Fault-Tolerant Computing, pages 74-81, 1990.

[25] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison-Wesley, Menlo Park, California, 1996.

[26] G. McGraw and E. Felten. Java Security: Hostile Applets, Holes and Antidotes. John Wiley and Sons, New York, 1996.

[27] G. Medvinsky and B. C. Neuman. NetCash: A Design for Practical Electronic Currency on the Internet. In Proceedings of the First ACM Conference on Computer and Communications Security, November 1993.

[28] Y. Minsky, R. van Renesse, F. B. Schneider, and S. D. Stoller. Cryptographic Support for Fault-Tolerant Distributed Computing. In Seventh ACM SIGOPS European Workshop, pages 109-114, Connemara, Ireland, 1996.

[29] J. Ousterhout, J. Levy, and B. Welch. The Safe-Tcl Security Model. Technical report, Sun Microsystems, November 1996.

[30] T. A. Proebsting and S. A. Watterson. Krakatoa: Decompilation in Java. In Proceedings of the 3rd USENIX Conference on Object-Oriented Technologies and Systems, pages 185-197, June 1997.

[31] R. Riggs, J. Waldo, and A. Wollrath. Pickling State in Java. In Proceedings of the 2nd Conference on Object-Oriented Technologies and Systems, pages 241-250, Toronto, June 1996.

[32] R. Rivest, L. Adleman, and M. Dertouzos. On Data Banks and Privacy Homomorphisms. In Foundations of Secure Computation, pages 169-177, 1978.

[33] T. Sander and C. Tschudin. Protecting Mobile Agents Against Malicious Hosts. In Mobile Agent Security, 1997.

[34] T. Sander and C. Tschudin. Towards Mobile Cryptography. In IEEE Symposium on Security and Privacy, 1998.

[35] L. F. G. Sarmenta. Bayanihan: Web-Based Volunteer Computing Using Java. In Proceedings of the 2nd International Conference on World-Wide Computing and its Applications, 1998.

[36] M. Sirbu and J. D. Tygar. NetBill: An Internet Commerce System Optimized for Network Delivered Services. IEEE Personal Communications, pages 34-39, August 1995.

[37] G. Vanderburg. Tricks of the Java Programming Gurus. Sams Net, 1997.

[38] G. Vigna. Protecting Mobile Agents through Tracing. In Proceedings of the 3rd Workshop on Mobile Object Systems, June 1997.

[39] H. van Vliet. The Mocha Decompiler, 1996.

[40] F. Yellin. Low Level Security in Java. In Proceedings of the 4th International World Wide Web Conference, Boston, Massachusetts, December 1995.

[41] X. N. Zhang. Secure Code Distribution. Computer, 30(6):76-79, June 1997.