
1 Introduction

New ways have to be found to handle the increasing complexity observed in embedded systems while simultaneously coping with component failures. One promising approach is to adapt self-organizational principles to computer systems.

The Artificial Hormone System (AHS) middleware [14] adapts the natural endocrine system in order to decentrally manage tasks in a distributed system: By exchanging digital messages, called hormones after their biological model, via a communication network, tasks can be distributed in the system in a flexible and self-organizing manner, exhibiting properties such as self-configuration and self-healing.

This paper deals with an extension to the AHS middleware that supports task assignment priorities, thus allowing graceful system degradation if the number of node failures is too large for the system to fully heal itself. Our contribution is three-fold:

  1. We describe a priority-based task decision strategy for the AHS.

  2. We derive hard time bounds for this strategy’s self-configuration and self-healing capabilities.

  3. We analyze the worst- and average-case time required to degrade a system in an overload situation.

The paper is structured as follows: We first present related work and the general AHS in Sects. 2 and 3. Section 4 describes our priority-based extensions to the AHS. Its worst-case time bounds are derived in Sect. 5. Section 6 proposes a simple degradation strategy to self-heal the system in overload situations and analyzes its worst- and average-case healing times. Finally, Sect. 7 concludes this paper.

2 Related Work

The approach presented in this paper enables a distributed system to recover from hardware failures by dynamically (re)configuring itself. A classical way to improve a system’s robustness against such types of failures is the duplication of functional units: By providing each unit with an identical, redundant unit in hot stand-by, a limited number of failures can be compensated. This pattern is frequently used for safety-critical control units in the automotive domain, although recent approaches like the AutoKonf project [10] try to reduce costs by sharing a single backup between multiple different control units so that a single failure can be compensated.

In contrast, our approach assigns tasks to a distributed system’s computing nodes in a self-organizing way, allowing more flexibility. Task priorities enable gradual system degradation once so many node failures have occurred that the system can no longer fully recover. This flexible mechanism reduces the number of required backup nodes while still tolerating a limited number of node failures.

Our approach is inspired by multiple general research trends: IBM’s Autonomic Computing initiative [8] introduced self-x properties such as self-configuration, self-optimization or self-healing. These properties can also be observed in systems based on Organic Computing principles [11]: Organic Computing can be characterized as a postponement of various decisions to the system’s run-time that were traditionally made at design time [9], allowing the system to dynamically adapt to changed operational conditions [13].

The resulting dynamism distinguishes our concept from approaches like [6] where an (offline) precomputed adaptation scenario is applied in case of node failures or overload situations. In contrast, our approach completely postpones the calculation of adaptation responses to the run-time and thus allows a more dynamic reaction.

Nevertheless, our concept is far from the only approach to assigning tasks in distributed systems.

Contract Net Protocols [12] can be employed to distribute tasks to agents in a multi-agent system. The approach presented in [17] consists of an improved Contract Net Protocol for task assignment and employs self-healing capabilities as well as task priorities. Yet, in contrast to our approach, it is not completely decentralized and does not guarantee hard real-time bounds.

Contract Net Protocols have also been used for task assignment in Wireless Sensor Networks (WSNs), e.g. in [4]. Other recent research in this domain utilizes self-organization principles [16], game theory [5] or particle swarm optimization [7, 15]. Many of these approaches employ some kind of task priority, e.g. deadline-based priorities or the number of dependent tasks in a task graph. Additionally, with WSN nodes typically having a limited energy budget, energy efficiency is one of the main goals of these approaches, rather than guaranteeing real-time behavior as our approach does. In addition, WSNs do not guarantee that any two nodes have a direct communication link, as required by our approach, so different methods are needed to distribute the tasks.

To the best of our knowledge, no self-organizing middleware for prioritized task allocation in distributed systems comparable to our approach exists.

3 The Artificial Hormone System

The Artificial Hormone System (AHS) is a completely decentralized middleware based on Organic Computing principles that allows the distribution of tasks in a self-organizing way. If a processing element (PE) fails, the remaining PEs will compensate this failure by re-configuring themselves.

The AHS works by realizing control loops based on hormones, which are short digital messages, on all PEs in the system: They exchange eager values for all tasks, indicating their suitability for each task. In every cycle of the hormone control loop (called hormone cycle), each PE tries to make a decision upon one task by comparing its own eager value with all received eager values. If it has sent the highest eager value for some task T in the current cycle, it is allowed to start executing T.

Once T has been started, its PE will start sending out a suppressor hormone for this task. Upon receiving this suppressor, the other PEs will reduce their eager values for T accordingly, preventing them from taking additional instances of T.

Accelerator hormones act antagonistically to suppressors by increasing a task’s eager value: Tasks that cooperate in some way or work on similar problems may be defined as related. When a task T is running on some \(\text {PE}_{\gamma }\), accelerators for all tasks related to T are spread in the vicinity of \(\text {PE}_{\gamma }\). This furthers the execution of those tasks on neighboring PEs, forming functional clusters of related tasks.
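The following minimal sketch illustrates this per-task handling in Python; the helper methods (static_eager_value, broadcast_suppressor, etc.) are purely illustrative placeholders and not the actual AHS API:

```python
def handle_task(pe, task):
    """Hormone handling for one task on one PE within a single hormone cycle
    (sketch of the loop shown in Fig. 1; helper methods are assumed)."""
    # Modified eager value: static suitability plus received accelerators
    # minus received suppressors for this task.
    em = (pe.static_eager_value(task)
          + sum(pe.received_accelerators(task))
          - sum(pe.received_suppressors(task)))
    pe.broadcast_eager_value(task, em)

    received = pe.received_eager_values(task)         # from all other PEs
    if em > 0 and all(em > other for other in received):
        pe.take(task)                                 # highest bid wins the task
        pe.broadcast_suppressor(task)                 # prevents further instances
        pe.broadcast_accelerators_for_related(task)   # clusters related tasks nearby
```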

Fig. 1. Hormone control loop executed on \(\text {PE}_{\gamma }\). For each task \(T_i\), a modified eager value \(Em_{i\gamma }\) (its static local eager value \(E_{i\gamma }\) plus all received accelerators \(A^{i\gamma }\) minus all received suppressors \(S^{i\gamma }\)) is sent to all PEs. If the sent modified eager value is positive and greater than all received modified eager values, \(T_i\) may be taken. Thus, suppressors for \(T_i\) will be sent to all other PEs (preventing infinite assignments of \(T_i\)) and accelerators for tasks related to \(T_i\) will be sent to neighboring PEs, forming clusters of related tasks on them.

Figure 1 shows the basic principle of the hormone control loop. It is completely decentralized and thus, the AHS has no single point of failure. In addition, the original AHS (without our priority extension) can guarantee hard time bounds, allowing its use in real-time systems:

  • The system’s initial self-configuration requires at most m hormone cycles, where m is the number of tasks to distribute.

  • In case a PE running \(m_f\) tasks fails, the AHS will automatically perform self-healing by reassigning those tasks among the healthy PEs. This takes at most \(m_f+a\) hormone cycles, where a is the number of hormone cycles required to notice the failure (by missing suppressors for the failed tasks).

The length of each hormone cycle can be chosen arbitrarily; a lower bound is imposed only by the communication bus’s latency and bandwidth.

For more detailed information on the AHS, please refer to [1,2,3, 14].

4 A Priority-Based Task Decision Strategy

4.1 Motivation

The AHS’ self-healing capabilities are insufficient if the remaining PEs do not have enough computational resources to execute all tasks. Since many systems consist of tasks with mixed criticality levels, this may easily lead to situations where low-criticality tasks are running, but high-criticality tasks previously executed on the failed PE cannot be reassigned. In fact, it is undefined which task subset will be running in such situations.

The AHS models system load by means of load suppressors that a task sends to the PE it is running on, thus limiting the number of tasks the PE can take. If the PE is fully loaded, it will send an eager value of 0 for all remaining tasks. Therefore, situations of system overload can in principle be recognized by examining the hormones broadcast in the system.

We thus implemented an AHS extension that allows each task to be given a priority. This priority is used to (a) control the order in which tasks are assigned during the initial self-configuration and (b) resolve system overload situations by stopping tasks of low priority, freeing capacities for high-priority tasks.

4.2 Conception

Our approach, the priority-based task decision strategy, is an extension of the AHS’ so-called aggressive task decision strategy [2]: Each PE may take at most one task per hormone cycle. If more than one task were taken per cycle and PE, all tasks could be assigned before accelerators had a chance to become effective, failing to cluster related tasks. Using the aggressive task decision strategy, each PE actively searches for a task it may take in each cycle. This allows at least one PE to take one task per cycle, resulting in the time bound of m hormone cycles to distribute m tasks mentioned in Sect. 3.

Our priority-based task decision strategy has the same operating principle with the following differences: Each task has an integer priority that is known to and equal on all PEs in the system. Each PE searches its task list for a task to be taken in the order of descending priorities. If the received hormones suggest that another PE has won some task T (and thus might take it in this cycle), no task \(T'\) having a lower priority than T’s may be taken in the current cycle, ensuring a correct order of task assignment. Figure 2a sketches this procedure: The variable lockPrio is used to stop deciding on tasks after a higher-priority task has been identified that may be taken on another PE in this cycle. The variable missingTask is used to track a high-priority task that is not running in the system because of an overload situation. Both variables are set by the sub-procedure \(\text {Decide}(T)\) as shown in Fig. 2b.

Fig. 2. Priority-based task decision strategy

The decide procedure works by comparing the local eager value sent with all received eager values. There are three possible outcomes of this comparison:

  a) No bidders: No PE sent a positive modified eager value for T. If suppressors were received for T instead, T is taken somewhere and the decision procedure may continue to the next task. Else, no PE has capacities to take T, so the system is in an overload situation. In this case, lockPrio is set to prevent tasks of lower priority from being taken and missingTask is used to track this situation.

  b) Loser: The local PE did not send the highest eager value for T. Thus, it sets lockPrio to prevent taking any task of lower priority.

  c) Winner: The local PE sent the highest eager value for T. Thus, it takes T (and will send a suppressor for T in the next cycle, preventing the other PEs from taking an additional instance of this task).

The current hormone cycle ends once a task has been taken or a task with lower priority than lockPrio has been reached. If, in the latter case, missingTask is set, some high-priority task is not running because of an overload situation. In order to resolve this situation, each PE will try to drop tasks of low priority in order to free up resources and send an eager value \({>}0\) for this task.
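The following minimal sketch summarizes the procedure of Fig. 2 in Python. All helper methods and attributes (sent_eager_value, received_suppressors, tasks_by_descending_priority, and so on) are illustrative assumptions rather than the actual AHS interface, and tie-breaking between equal eager values is omitted:

```python
from enum import Enum, auto

class Outcome(Enum):
    NO_BIDDERS = auto()
    LOSER = auto()
    WINNER = auto()

def decide(pe, task):
    """Compare the locally sent modified eager value for `task` with all
    received ones and classify the result (cf. Fig. 2b)."""
    sent = pe.sent_eager_value(task)
    received = pe.received_eager_values(task)             # from all other PEs
    if sent <= 0 and all(ev <= 0 for ev in received):
        # (a) No bidders. If suppressors arrived, the task already runs
        #     somewhere; otherwise no PE has free capacity -> overload.
        if not pe.received_suppressors(task):
            pe.lock_prio = task.priority
            if pe.missing_task is None:
                pe.missing_task = task
        return Outcome.NO_BIDDERS
    if sent <= 0 or max(received, default=float("-inf")) >= sent:
        # (b) Loser: another PE may take this task in the current cycle.
        pe.lock_prio = task.priority
        return Outcome.LOSER
    return Outcome.WINNER                                 # (c) strictly highest bid

def priority_based_cycle(pe):
    """One hormone cycle of the priority-based decision strategy (cf. Fig. 2a)."""
    pe.lock_prio = None          # priority of a task another PE may still win
    pe.missing_task = None       # high-priority task nobody can take (overload)
    for task in pe.tasks_by_descending_priority:
        if pe.lock_prio is not None and task.priority < pe.lock_prio:
            break                                         # pending higher-priority decision
        if decide(pe, task) is Outcome.WINNER:
            pe.take(task)                                 # at most one task per cycle
            return
    if pe.missing_task is not None:
        # Overload: let the task dropping strategy free capacity so that a
        # positive eager value can be sent for the missing task later on.
        pe.apply_task_dropping_strategy(pe.missing_task)
```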

Dropping tasks in this way is, however, based on the following fundamental assumption:

Assumption

Any task may be (temporarily) stopped at any time in order to free up resources for tasks of higher priority.

It is up to a task dropping strategy to decide which specific tasks will be stopped in an overload situation; more information on this matter as well as the priority-based task decision’s time bounds will be presented in the following sections.

5 Worst-Case Analysis

Our priority-based extension can still guarantee real-time behavior for self-configuration as well as self-healing if the remaining capacities are sufficient to redistribute all failed tasks:

Self-Configuration. It can be shown that it takes at most \(2m-1\) hormone cycles to assign m tasks during the system’s initial self-configuration. As our strategy is based on the aggressive strategy, a similar argument to the one given in [2] can be employed: Each PE actively searches for a task it has won; thus, at least one task is taken per cycle in the system. However, after a task T is taken by some \(\text {PE}_{\alpha }\) in cycle i, all other PEs will still send out an eager value \({>}0\) for T in cycle \(i+1\) before \(\text {PE}_{\alpha }\)’s suppressor for T finally becomes effective. This introduces one delay cycle in which no task allocation can happen after the last task of each priority level has been assigned. Thus, if all m tasks have different priorities, \(m-1\) delay cycles are introduced and self-configuration takes at most \(2m-1\) hormone cycles.
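As an illustration of this bound (a hypothetical schedule, not taken from the original analysis), consider \(m=3\) tasks \(T_1, T_2, T_3\) with strictly descending priorities and sufficient free capacity:

$$\begin{aligned} \text {cycle }1&: T_1 \text { taken},\\ \text {cycle }2&: \text {delay (stale eager values for } T_1 \text { still block lower priorities)},\\ \text {cycle }3&: T_2 \text { taken},\\ \text {cycle }4&: \text {delay},\\ \text {cycle }5&: T_3 \text { taken}, \end{aligned}$$

i.e., \(2\cdot 3-1=5\) hormone cycles in total.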

Self-Healing. If a PE fails, it will not send any more hormones. Thus, it is possible to detect this failure by missing hormones, which takes a constant number a of hormone cycles. Afterwards, the failed tasks are automatically re-assigned, with the time bound for self-configuration applying. If \(m_f\) tasks were running on the failed PE and the remaining PEs’ resources suffice to take all those tasks, self-healing thus takes at most \(2m_f-1+a\) hormone cycles.
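For instance (an illustrative plug-in of this bound with an assumed detection time of \(a=2\) cycles), a failed PE that was running \(m_f=5\) tasks is healed within at most

$$\begin{aligned} 2m_f-1+a = 2\cdot 5-1+2 = 11 \text { hormone cycles.} \end{aligned}$$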

If, however, the remaining capacities do not suffice to take all failed tasks, the system is considered to be in an overload situation. Since a premise of our priority-based task decision is to allow gracefully degrading the system in such situations, the next section will deal with overload situations.

6 Overload Situations

As mentioned in Sect. 4.2, a task dropping strategy is responsible for deciding on which specific tasks to stop in overload situations. However, a model is required to facilitate the analysis of such situations. Thus, let v be the number of PEs that remain operational. Let \(\text {PE}_\times \) denote the failed PE and \(\text {PE}_{1}\cdots \text {PE}_{v}\) denote the remaining PEs. Additionally, we will make the following assumptions:

  • All tasks have different priorities.

  • All tasks induce equal load to the PEs and each PE may execute any task.

  • At the instant \(\text {PE}_\times \) fails, \(\text {PE}_\times \) and \(\text {PE}_{1}\cdots \text {PE}_{v}\) are each completely utilized by executing exactly m tasks.

  • No additional PE failure occurs during self-healing.

The resulting model is visualized in Fig. 3.

Fig. 3. Overload model used during analysis

6.1 Task Dropping Strategy

We now propose the following task dropping strategy to resolve an overload situation:

Strategy

Upon noticing an overload situation, each PE shall stop all running tasks that have a priority lower than the priority of the highest-priority task that is currently not running.
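A minimal sketch of this rule, using the same kind of illustrative helper names as in the earlier sketches (not the original implementation):

```python
def naive_task_dropping(pe, missing_task):
    """Stop every locally running task whose priority is below that of the
    highest-priority task currently not running anywhere (sketch)."""
    for task in list(pe.running_tasks):
        if task.priority < missing_task.priority:
            pe.stop(task)   # frees capacity; the stopped task's suppressors
                            # cease, so it can be re-bid on by the hormone loop
```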

In the context of Fig. 2, this strategy basically stops all tasks having a priority lower than missingTask’s priority. After the tasks have been stopped, the system re-configures itself by assigning as many of the highest-priority non-running tasks as possible. Since the system is in overload, not all tasks can be assigned, but no more tasks will be stopped by the next invocation of the task dropping strategy. As this is arguably a very simple strategy, we called it naive task dropping. In addition, analyzing its worst case in the given model is straightforward:

Theorem 1

(Worst-Case Analysis for Overload Situations). Self-healing in an overload situation takes at most \(2mv+a\) hormone cycles when using the naive task dropping strategy.

Proof

It takes a constant amount a of hormone cycles to notice the failure of \(\text {PE}_\times \) due to missing suppressors. All v remaining PEs are fully utilized, so the system is in an overload situation.

Thus, in the next cycle, missingTask is set to the highest-priority task that was previously running on \(\text {PE}_\times \) and no task is taken in the system. The strategy will now stop all tasks with a priority lower than missingTask’s priority; if missingTask is the single highest-priority task, a total of mv tasks will be stopped.

Starting with the following cycle, the mv highest-priority tasks (of all \(mv + m\) non-running tasks) will be assigned, taking \(2mv - 1\) cycles at most.

As a result, no more than \(2mv - 1 +1 + a=2mv+a\) hormone cycles are required to complete the self-healing and reach a stable system state again.   \(\square \)
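For instance, with the parameters later used in Fig. 4 (\(m=4\), \(v=6\)) and an assumed detection time of \(a=2\) cycles, this bound evaluates to

$$\begin{aligned} 2mv+a = 2\cdot 4\cdot 6+2 = 50 \text { hormone cycles}, \end{aligned}$$

during which up to \(mv=24\) tasks are stopped and the mv highest-priority non-running tasks are assigned.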

6.2 Average-Case Analysis

In this paper, we additionally want to analyze the time required for self-healing when using this strategy in the average case. Although an average-case analysis is not relevant in the context of hard real-time systems, we nevertheless provide one: the expected self-healing duration may be of particular interest for scenarios that do not require hard real-time bounds.

Preparations. In order to facilitate this analysis, some arrangements have to be made:

Definition 1

Let \(n, k \in \mathbb {N}\). Then, the rising factorial \(n^{\overline{k}}\) shall be defined as

$$\begin{aligned} n^{\overline{k}} := \underbrace{n \cdot (n+1)\cdots (n+k-1)}_{k \text { factors}}=\prod _{i=n}^{n+k-1}i. \end{aligned}$$
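For example, the definition yields

$$\begin{aligned} 3^{\overline{4}} = 3\cdot 4\cdot 5\cdot 6 = 360, \qquad n^{\overline{1}} = n, \qquad 1^{\overline{m}} = m!. \end{aligned}$$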

Lemma 1

For \(k,m \in \mathbb {N}\) with \(k \ge 1\),

$$\begin{aligned} \sum _{i=1}^{k}i^{\overline{m}}=k\cdot \frac{(k+1)^{\overline{m}}}{1+m}\quad \text {holds.} \end{aligned}$$

Proof

Multiplying by (m!/m!) allows the summands to be converted into binomial coefficients:

$$\begin{aligned} \sum _{i=1}^{k}i^{\overline{m}}&= m!\cdot \sum _{i=1}^{k}\frac{i^{\overline{m}}}{m!} = m! \cdot \sum _{i=1}^{k} \left( {\begin{array}{c}i+m-1\\ m\end{array}}\right) = m! \cdot \sum _{i=m}^{k+m-1}\left( {\begin{array}{c}i\\ m\end{array}}\right) \end{aligned}$$

This sum can now be simplified using the identity \(\sum _{i=r}^{n}\left( {\begin{array}{c}i\\ r\end{array}}\right) =\left( {\begin{array}{c}n+1\\ r+1\end{array}}\right) \):

$$\begin{aligned}&= m! \cdot \left( {\begin{array}{c}k+m\\ m+1\end{array}}\right) = m!\cdot \frac{k^{\overline{m+1}}}{(m+1)!}=k\cdot \frac{(k+1)^{\overline{m}}}{1+m}. \end{aligned}$$

   \(\square \)
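As a quick numeric check of the lemma (not part of the original proof), take \(k=3\) and \(m=2\):

$$\begin{aligned} \sum _{i=1}^{3}i^{\overline{2}} = 1\cdot 2 + 2\cdot 3 + 3\cdot 4 = 20 = 3\cdot \frac{4\cdot 5}{3} = k\cdot \frac{(k+1)^{\overline{m}}}{1+m}. \end{aligned}$$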

Analysis. We can now analyze the naive task dropping strategy’s average case within our overload situation model, assuming all tasks are distributed randomly to the available PEs. To this end, we quantify the number of tasks stopped on average:

Theorem 2

Let X be a random variable representing the number of tasks stopped by the naive task dropping strategy and \(\mathbf {E}\left[ X\right] \) its expected value. Then, \(\mathbf {E}\left[ X\right] = (m^2v) / (1+m)\) holds.

Proof

We will first calculate the probability distribution of X by assuming that all \((v+1)\cdot m\) tasks are distributed in sequence of descending priorities to mv positions on \(\text {PE}_{1}\cdots \text {PE}_{v}\) and m positions on \(\text {PE}_\times \).

Obviously, \(0 \le X \le mv\) holds: In case the m lowest-priority tasks are running on \(\text {PE}_\times \), no tasks have to be stopped. If the highest-priority task is running on \(\text {PE}_\times \), all tasks on \(\text {PE}_{1}\cdots \text {PE}_{v}\) are stopped.

Generalizing this argument yields

  • \(X=mv \iff \) The highest-priority task is set to one of \(\text {PE}_\times \)’s m positions.

  • \(X=mv - 1 \iff \) The second-highest-priority task is the first task to be set to one of \(\text {PE}_\times \)’s m positions.

  • ⋮

  • \(X=0 \iff \) The \((mv+1)\)-highest-priority task (which is the m-lowest-priority task) is the first task to be set to one of \(\text {PE}_\times \)’s m positions.

Thus, the probability distribution of X is given by

$$\begin{aligned} \mathbf {P}\left( X=j\right) = \underbrace{\frac{mv}{mv+m}\cdot \frac{mv-1}{mv+m-1}\cdots \frac{j+1}{m+j+1}}_{mv-j \text { factors}}\cdot \frac{m}{m+j}, \end{aligned}$$

i.e., the \(mv-j\) highest-priority tasks are placed on positions of \(\text {PE}_{1}\cdots \text {PE}_{v}\) before the next task is placed on one of \(\text {PE}_\times \)’s m positions. This is equivalent to

$$\begin{aligned} \mathbf {P}\left( X=j\right) = \frac{\prod \limits _{i=j+1}^{mv} i}{\prod \limits _{i=m+j+1}^{mv + m} i} \cdot \frac{m}{m+j} = m \cdot \frac{\prod \limits _{i=j+1}^{mv} i}{\prod \limits _{i=m+j}^{mv + m} i} \end{aligned}$$
(1)

for all \(0 \le j \le mv\). Figure 4 plots the probability distribution as given by this equation for arbitrarily chosen values of m and v.

Fig. 4. Probability distribution of number of tasks stopped by naive task dropping strategy for \(m=4\) and \(v=6\)

Equation 1 now allows us to calculate \(\mathbf {E}\left[ X\right] \):

$$\begin{aligned}\begin{array}{lll} \mathbf {E}\left[ X\right] &{}= \sum _{j=0}^{mv}j\cdot \mathbf {P}\left( X=j\right) ~&{}=~ \sum _{j=1}^{mv} j \cdot m \cdot \frac{\prod \limits _{i=j+1}^{mv} i}{\prod \limits _{i=m+j}^{mv + m} i} \\ &{}= m \cdot \sum _{j=1}^{mv} j\cdot \frac{(mv)!/j!}{(mv + m)!/(m + j - 1)!} ~~&{}=~ m \cdot \sum _{j=1}^{mv} \frac{(m+j-1)!/(j-1)!}{(mv +m)!/(mv)!} \\ &{}= \frac{m}{\prod \limits _{i=mv+1}^{mv + m}i} \cdot \sum _{j=1}^{mv}\left( \prod _{i=j}^{m+j-1} i\right) &{}=~ \frac{m}{(mv + 1)^{\overline{m}}}\cdot \sum _{j=1}^{mv} j^{\overline{m}} \end{array} \end{aligned}$$

This expression can be simplified using Lemma 1:

$$\begin{aligned} \begin{array}{lll}&= \frac{m}{(mv+1)^{\overline{m}}}\cdot \frac{mv \cdot (mv+1)^{\overline{m}}}{1+m}&=~ \frac{m^2v}{1+m}. \end{array} \end{aligned}$$

   \(\square \)
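The result can also be checked numerically against Eq. (1). The following short script (an illustrative verification, not part of the original analysis) reproduces the expected value for the parameters of Fig. 4:

```python
from math import prod

def p_stopped(j, m, v):
    """P(X = j) according to Eq. (1)."""
    num = prod(range(j + 1, m * v + 1))        # prod_{i=j+1}^{mv} i
    den = prod(range(m + j, m * v + m + 1))    # prod_{i=m+j}^{mv+m} i
    return m * num / den

m, v = 4, 6
dist = [p_stopped(j, m, v) for j in range(m * v + 1)]
print(sum(dist))                               # ~= 1.0 (valid distribution)
print(sum(j * p for j, p in enumerate(dist)))  # ~= 19.2
print(m * m * v / (1 + m))                     # m^2*v / (1+m) = 19.2
```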

Discussion. When examining the number of tasks dropped by the naive task dropping strategy, it becomes obvious that its worst case is not substantially worse than the average case, especially for large values of m:

$$\begin{aligned} \lim \limits _{m\rightarrow \infty } \frac{\text {worst case}}{\text {average case}} = \lim \limits _{m\rightarrow \infty } \frac{mv}{\frac{m^2v}{1+m}}=\lim \limits _{m\rightarrow \infty } \left( 1+\frac{1}{m}\right) = 1. \end{aligned}$$

As a result, no significant outliers from the average case are to be expected when utilizing this task dropping strategy.

In addition, initial self-configuration for mv tasks requires at most \(2mv-1\) hormone cycles, while self-healing in an overload situation takes at most \(2mv+a\) hormone cycles: both bounds are linear in the number of tasks, with different (and small) additive constants.

7 Conclusion

This paper described a priority-based extension to the AHS middleware. We analyzed its worst-case time bounds for self-configuration and self-healing. Additionally, a strategy was proposed allowing to degrade the system in case of an overload situation. Its worst and average cases were analyzed. The results show that this strategy does not perform substantially worse in its worst case than it does on average.

Future work will deal with thorough evaluations of our concept as well as further research on degrading the system in overload situations, especially the development of more elaborate task dropping strategies that may guarantee even better self-healing time bounds.

Additionally, our current analyses assume a reliable communication network. Although empirical experiments suggest that the AHS can handle a limited degree of communication failures quite well, we plan to shift our research focus to guaranteeing time bounds and the system’s overall consistency even in the presence of such failures.