
research-article
Open access

Dynamic Adaptation Using Deep Reinforcement Learning for Digital Microfluidic Biochips

Published: 15 January 2024

Abstract

We describe an exciting new application domain for deep reinforcement learning (RL): droplet routing on digital microfluidic biochips (DMFBs). A DMFB consists of a two-dimensional electrode array, and it manipulates droplets of liquid to automatically execute biochemical protocols for clinical chemistry. However, a major problem with DMFBs is that electrodes can degrade over time. Droplet transportation over these degraded electrodes can fail, thereby adversely impacting the integrity of the bioassay outcome. We demonstrate that the formulation of droplet transportation as an RL problem enables the training of deep neural network policies that can adapt to the underlying health conditions of electrodes and ensure reliable fluidic operations. We describe an RL-based droplet routing solution that can be used for various sizes of DMFBs. We highlight the reliable execution of an epigenetic bioassay with the RL droplet router on a fabricated DMFB. We show that the use of the RL approach on a simple micro-computer (Raspberry Pi 4) leads to acceptable performance for time-critical bioassays. We present a simulation environment based on the OpenAI Gym interface for RL-guided droplet routing problems on DMFBs. Finally, we present results from our study of electrode degradation using fabricated DMFBs; this study supports the degradation model used in the simulator.

1 Introduction

In recent years, we have seen progress on the use of deep Reinforcement Learning (RL) to solve sequential decision-making problems, such as games [2, 67, 77], robotics [18], autonomous driving [57, 61, 78], quantitative trading strategies [37], and healthcare systems [46]. Systems assisted by RL have shown tremendous promise in games [50, 67], robotics [18], and natural language processing [22, 51]. This can be attributed to the fact that RL systems in dynamic environments can learn from history and adapt to changes in the environment. In this article, we show that because the health of an electrode in a Digital Microfluidic Biochip (DMFB) dynamically changes over time, we can utilize innovations in RL to ensure more reliable droplet transportation in DMFBs.

1.1 Digital Microfluidic Biochips

The rapid worldwide spread and impact of the COVID-19 virus has created an urgent need for reliable, accurate, and affordable testing on a massive scale. For example, the National Institutes of Health (NIH) has launched the Rapid Acceleration of Diagnostics (RADx) initiative to develop and implement technologies for COVID-19 testing [54]. One of the most promising technologies for realizing this goal is digital microfluidics. A digital microfluidic biochip (DMFB) manipulates tiny amounts of fluids to automatically execute biochemical protocols for point-of-care clinical diagnosis with high efficiency and fast sample-to-result turnaround [16, 65, 74]. Because of these characteristics, the RADx initiative has awarded grants to several biomedical diagnostic companies to develop microfluidic technologies that could dramatically increase testing capacity and throughput [53, 56]. Other applications of DMFBs include screening of newborn infants [33, 69], drug discovery [38], and clinical diagnostics [8, 62].
A DMFB consists of an electrode array in two dimensions that controls the movement of discrete liquid droplets. Upon actuation by a sequence of control voltages, the electrode array can perform a variety of fluidic operations, such as dispensing, mixing, and splitting [7, 25]. Figure 1(a) shows a DMFB in which two droplets are present on a patterned electrode array. Nanoliter droplets on this platform are transported using the principle of Electrowetting-on-Dielectric (EWOD) [60]. This principle refers to the modulation of the interfacial tension between a conductive fluid and a solid electrode coated with a dielectric layer through the application of an electric field between them. See Figure 1(b).
Fig. 1. (a) Top view of a DMFB. Two droplets are present on the biochip. (b) Illustration of the side view of a DMFB. The droplet is moved to the right using EWOD.
Illumina commercialized digital microfluidics for sample preparation in 2015 through NeoPrep—a nearly $40K instrument that automates the preparation of up to 16 sequencing libraries at a time [31]. Genmark has also deployed the microfluidic technology for infectious disease testing [59], and Baebies uses this technology to detect lysosomal storage diseases in newborns [26].
However, reliability remains a major concern in DMFB systems. Illumina halted the sale of NeoPrep in February 2017. In its letter to customers, Illumina cited reliability issues observed in-house and, to an even greater extent, in the field. Even though biochips are tested after production, defects such as electrode degradation can occur during the system lifetime [13, 71]. As the electrodes are actuated over time, two types of electrode degradation might occur: charge residual and charge trapping. Charge residual is caused by accumulated charges and can be mitigated by inserting grounding vectors [58]. Charge trapping occurs when charges become trapped in the dielectric insulator; this phenomenon is irreversible [5]. A consequence of electrode degradation is that droplet movement is impeded [72]. An example of electrode degradation is shown in Figure 2. The figure shows two droplets on the biochip, one of which is located on a degraded electrode. Two electrodes are actuated to move these droplets. However, one of these operations fails because the degraded electrode exerts additional surface-tension force. Detailed analyses of the relationship between electrode defects and fluidic operations can be found in the work of Drygiannakis et al. [14].
Fig. 2. Droplet transportation failure due to electrode degradation. (a) Two droplets on the electrode array. Two electrodes are actuated to move these droplets. (b) After electrode actuation, the upper droplet cannot be moved completely because it was present over a degraded electrode; the lower droplet is correctly moved to the desired electrode.

1.2 Motivating RL-Guided Droplet Routing

In a typical use model for DMFBs [70], a bioassay protocol with fluidic operations is obtained from biologists. Next, a synthesis technique maps these operations to groups of electrodes, referred to as fluidic modules, of a biochip to perform the required operations [4]. A droplet has to be transported from one module to the next. The problem of determining droplet transportation paths between modules is referred to as droplet routing. A number of droplet routing techniques have been proposed in the literature for bioassay applications [73, 82, 86]. Su et al. [73] proposed the first systematic droplet routing approach, which adopted the Lee algorithm and minimized the number of electrodes used for droplet routing. Xu and Chakrabarty [82] proposed a droplet-routing-aware synthesis tool based on parallel recombinative simulated annealing. Zhao and Chakrabarty [86] proposed an integer linear programming-based method to co-optimize droplet routing and pin mapping.
However, these methods overlook the fact that droplet transportation may fail if the electrodes on the routing path degrade over time.
Example. Figure 3(a) shows a pre-computed routing path. We can see that this route is the shortest path between the start and the destination points. Droplet transportation can be successful because the biochip is healthy (i.e., no electrode degradation has occurred). Conversely, Figure 3(b) shows that droplet transportation to the destination fails because degraded electrodes exist in the associated path. If an online droplet router knows the locations of the degraded electrodes, it can generate another route that involves only healthy electrodes. An alternative route is shown in Figure 3(c); note that this is a shortest path, and it avoids electrodes that are degraded.
Fig. 3. Droplet routing paths from a start point to an end point (gray: healthy electrodes; brown: degraded electrodes). (a) A pre-computed path for a healthy DMFB. (b) The DMFB has aged, and some electrodes have degraded. Some degraded electrodes are involved in the pre-computed path. Droplet transportation may hence fail. (c) A more reliable path for the aged DMFB.
In Figure 3, a different color is used to indicate the degraded electrodes. However, in reality, we cannot identify degraded electrodes by simple examination; this is because the degradation process results from charge trapped in the insulator. When routing errors occur, simply replacing the degraded DMFB with a new one will not only increase the cost but also lead to undesirable wastage of biosamples. Droplets that are in the middle of an unfinished operation, such as mixing or diluting, need to be abandoned. The wastage of droplets is particularly undesirable in some applications, such as newborn screening [32] and forensic analysis [83], since the biosamples are limited in volume and availability. For example, in the newborn screening test provided by Baebies Inc., the entire screening test contains 10 to 20 different assays and each assay needs 100 nl of dried blood spot extract [32]. Thus, a newborn screening test needs at least 1,000 nl of dried blood spot extract, which requires 200 to 300 \(\mu\)L (4–6 drops) of whole blood [3]. Prior work has led to synthesis methods that prevent excessive usage of a few electrodes by evenly distributing fluidic operations to multiple electrodes [5, 88]. However, these methods can only postpone the occurrence of electrode degradation, which still happens as electrodes are actuated over time. If such electrode degradation happens during bioassay execution and a route is associated with degraded electrodes, bioassay execution will fail, and it will need to be re-executed on a new biochip [29]. Furthermore, the locations of degraded electrodes may vary from biochip to biochip because the electrode degradation process is affected by geometric variations and different electrode actuation times [24].
Several methods have been proposed to perform error recovery when routing tasks fail [1, 40, 66]. However, these methods are focused on recovery after routing failures, and they do not proactively alleviate the occurrence of erroneous behaviors caused by electrode degradation. Recently, an RL-based routing framework was developed to identify degradation-aware routing strategies for Micro-Electrode-Dot-Array (MEDA) biochips [15]. However, this method cannot be used for DMFBs due to an inherent difference between DMFBs and MEDA biochips: MEDA biochips provide the real-time degradation status of each electrode using built-in sensing circuits, which is not the case for conventional (non-MEDA) DMFBs. In this work, we adopt RL techniques to respond to dynamically degrading environments, which is not possible with existing offline routing methods.
Numerous papers have been published in recent years to advance applications that leverage RL theory [9, 11, 49]. Our work aims to introduce RL to a new application, namely the droplet routing problem on DMFBs. We target an RL formulation for the droplet routing problem to address the dynamic degradation of electrodes. An RL-based droplet router addresses the electrode degradation problem and ensures reliable bioassay executions in three ways. First, it provides real-time decisions for droplet routing. Second, it can “learn” from the prior experience associated with electrodes that start malfunctioning. Therefore, the droplet router can generate routing paths that include only healthy electrodes. Third, even though the degradation processes may differ for two DMFBs, the router can generate different, yet reliable, routing paths on distinct DMFBs for the same routing objective.

1.3 Article Contributions

This article represents one of the first attempts to map RL to clinical microfluidic systems. The main contributions of this work are as follows:
We describe a new framework for RL-based droplet routing on DMFBs. We discuss the challenges inherent in formulating droplet routing as an RL task.
We describe an experiment using fabricated PCB-based DMFBs to gain insights into electrode degradation. The insights derived in this manner support our degradation model in the simulator.
We present an online droplet routing framework, which uses deep RL to generate a policy that uses real-time observations of a DMFB to choose droplet paths dynamically. Training is first carried out in a simulated DMFB. Next, the pre-trained policy is loaded on the controller for the DMFB, and the policy generates routing paths in a real-time manner.
We consider a parallel droplet routing scenario where multiple droplets are transported concurrently on a DMFB. We formulate a Multi-Agent Reinforcement Learning (MARL) framework for parallel droplet routing on DMFBs. Experimental results show that the MARL framework outperforms the single-agent RL framework in parallel routing scenarios.
We evaluate the proposed solution by executing an epigenetic bio-protocol on a fabricated DMFB. Our experiment shows that the online router can learn the degradation behavior of electrodes and generate reliable routes.
We identify the timing constraints associated with the use of the RL approach on a simple, GPU-less micro-computer (Raspberry Pi 4). The results show that the timing constraints arising from the RL approach do not impede the fluidic operations in a bioassay.

2 Problem Formulation

The problem formulation for the droplet routing problem on DMFBs is as follows.
Consider a DMFB consisting of a two-dimensional array of electrodes of size \(N \times M\), and let \(e_{i, j}\) represent the electrode in the \(i^{\text{th}}\) row and the \(j^{\text{th}}\) column of the DMFB, where \(1 \le i \le N, 1 \le j \le M\). The main objective of the droplet routing problem is to minimize the time required to transport the droplet from the source \(e_{x, y}\) to the destination \(e_{k, m}\). In a single-droplet routing problem without electrode degradation, the problem reduces to finding the shortest path between \(e_{x, y}\) and \(e_{k, m}\).
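For the degradation-free single-droplet case, the shortest path can be found with a standard breadth-first search over the electrode grid. The following is a minimal sketch under that assumption; the function name, grid representation, and blocked-electrode handling are illustrative choices rather than part of the routing methods discussed later.

```python
from collections import deque

def shortest_path(n_rows, n_cols, src, dst, blocked=frozenset()):
    """Breadth-first search on an N x M electrode grid.

    src, dst: (row, col) tuples with 1-based indices, as in the formulation.
    blocked: electrodes occupied by other concurrent operations.
    Returns the list of electrodes from src to dst, or None if unreachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        (i, j) = queue.popleft()
        if (i, j) == dst:
            path, cur = [], (i, j)
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # N, S, W, E
            nxt = (i + di, j + dj)
            if (1 <= nxt[0] <= n_rows and 1 <= nxt[1] <= n_cols
                    and nxt not in blocked and nxt not in parent):
                parent[nxt] = (i, j)
                queue.append(nxt)
    return None
```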

2.1 Routing with Multiple Droplets

Multiple droplet routing tasks can be executed in parallel on a DMFB. Assume that there are n droplets in total, where \(n \ge 2\). For any two droplets \(d_{i}\) and \(d_{j}\), where \(1 \le i,j \le n\) and \(i \ne j\), assume that their positions at timestep t are \(e_{x^{t}, y^{t}}\) and \(e_{k^{t}, m^{t}}\), respectively. The following fluidic constraints should be satisfied [6]:
(1) \(|x^t-k^t| \gt 1\) or \(|y^t-m^t| \gt 1\);
(2) \(|x^{t+1}-k^t| \gt 1\) or \(|y^{t+1}-m^t| \gt 1\) or \(|x^t-k^{t+1}| \gt 1\) or \(|y^t-m^{t+1}| \gt 1\).
With the preceding constraints, the objective of the multi-droplet routing problem is to minimize the maximal routing timestep among all the n routing tasks.
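As a concrete reading of constraints (1) and (2), the sketch below checks whether the positions of two droplets at consecutive timesteps satisfy the static and dynamic fluidic constraints exactly as stated above; the helper name and position encoding are illustrative.

```python
def fluidic_constraints_ok(p_i_t, p_j_t, p_i_t1, p_j_t1):
    """Check constraints (1) and (2) for droplets d_i and d_j.

    p_i_t, p_j_t: (row, col) positions of d_i and d_j at timestep t.
    p_i_t1, p_j_t1: their positions at timestep t+1.
    """
    def far(a, b):
        # True when the two electrodes differ by more than 1 in row or column.
        return abs(a[0] - b[0]) > 1 or abs(a[1] - b[1]) > 1

    static_ok = far(p_i_t, p_j_t)                           # constraint (1)
    dynamic_ok = far(p_i_t1, p_j_t) or far(p_i_t, p_j_t1)   # constraint (2)
    return static_ok and dynamic_ok
```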

2.2 Routing with Electrode Degradation

For the droplet routing problem that considers electrode degradation, we define a function \(d(e_{i, j})\) to describe the degradation status of an electrode, where \(0\le d(e_{i, j})\le 1\); \(d(e_{i, j})\) is 1 when the electrode \(e_{i, j}\) is completely healthy. The modeling of the degradation status function is explained in Section 5.3. Note that the value of the electrode status function cannot be observed by the user during execution since the degradation status is not directly measurable. As an electrode degrades, the success rate of droplet transition decreases; a failed transition causes the droplet to stay in the same position. For an electrode with degradation status \(d(e_{i, j})\), we assume that the success rate of a transition is \(d(e_{i, j})\) and the expected number of steps for a successful transition is \(1/d(e_{i, j})\). Therefore, the objective of the droplet routing problem with electrode degradation can be formulated as follows: find a path \(\lbrace e_{x_1, y_1}, e_{x_2, y_2}, \ldots , e_{x_T, y_T}\rbrace\) that minimizes \(\sum _{i=1}^{T} 1/d(e_{x_i, y_i})\), where \(e_{x_1, y_1}\) is the source and \(e_{x_T, y_T}\) is the destination.
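The routing objective under degradation can be evaluated directly from a candidate path and the ground-truth degradation statuses; a minimal sketch follows, assuming a dictionary d that maps electrodes to their degradation values (in practice these values are hidden from the router and are known only to the simulator).

```python
def expected_routing_cost(path, d):
    """Expected number of actuation steps for a candidate path.

    path: list of (row, col) electrodes from source to destination.
    d: dict mapping (row, col) -> degradation status in (0, 1],
       where 1.0 means a fully healthy electrode.
    """
    # Each visited electrode e contributes 1/d(e) expected steps,
    # because the per-step success probability of the transition is d(e).
    return sum(1.0 / d[e] for e in path)
```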

3 Electrode Degradation in DMFBs

Previous work has shown that charge trapping in a dielectric layer follows an exponential model [13, 44, 47, 84]. To independently validate this claim, we design an experiment where we monitor electrode degradation in the fabricated PCB-based DMFB.
The electrode size of the DMFB is \(2 \times 2\text{ mm}^2\) (Figure 4(a)). Four reservoir modules are placed on two sides of the biochip; these modules are used to dispense droplets of reagents. Every electrode can be individually controlled; the control signals are provided by a control board placed below the DMFB. The activation/de-activation status of each electrode is controlled by a high-voltage relay (part no. Panasonic AQW212). A high-voltage relay in our setup is controlled by a configuration bit; the configuration bits are stored in a register (part no. Texas Instruments SN74AHC595). The details of the control hardware are shown in Figure 4(b). The Raspberry Pi 4 on the left generates control signals. We used a voltage source of 1.5 kHz and 200 Vpp for electrode actuation. To avoid introducing excessive current, a resistor \(R = 1\) M\(\Omega\) is placed in series between each electrode and the high-voltage source.
Fig. 4. (a) The fabricated DMFB. (b) The experimental setup.
We developed an actuation sequence for the electrodes that leads to repeated fluidic operations on the biochip. When we execute the actuation sequence on the DMFB, each electrode is actuated hundreds of times, for 1 second each time. After executing the actuation sequence, we actuated an electrode and measured the charging time using an oscilloscope. Because the electrode and the top plate form a capacitor, and a resistor is placed in series with the electrode, the charging path is a simple RC circuit. The effective capacitance of an electrode can be derived using the equation
\begin{equation*} V_{C}(t) = V_{pp} \left(1 - e^{-t/RC} \right), \end{equation*}
where C is the effective capacitance of the electrode, \(V_C\) is the voltage across the electrode, \(V_{pp}\) is the peak-to-peak voltage of the source, and t is time. The degradation results are shown in Figure 5. The results show that the capacitance of an electrode grows linearly as we repeatedly actuate the electrode.
Fig. 5. Capacitance increase (top) and EWOD force degradation (bottom).
The EWOD force of a droplet is given by Zhong et al. [87]:
\begin{equation} F_{EWOD}=\frac{C_{unit}(V_C-V_T)^2}{2}L_{eff}, \end{equation}
(1)
where \(V_T\) is the threshold voltage, \(C_{unit}\) is the structural capacitance per unit area in the dielectric layer, and \(L_{eff}\) is the length of the contact line. Therefore, the EWOD force exerted by an electrode (relative to the same EWOD force at full health) can be estimated as
\begin{equation} \bar{F}^{(n)} \approx (V^{(n)} / V_a)^2, \end{equation}
(2)
where n is the number of actuations of the electrode, \(V^{(n)}\) is the actuation voltage on the electrode after n actuations (potentially affected by electrode degradation), and \(V_a\) is the nominal actuation voltage. By plugging our experimental results into (2), we obtain the relationship between the number of electrode actuations and the relative EWOD force, shown in Figure 5. Because the EWOD force decays approximately exponentially with the number of actuations, we fit an exponential model to the measured data by minimizing the least-squared error. The model fitting results show that the relationship between the number of actuations n and the relative EWOD force \(\bar{F}^{(n)}\) can be modeled as
\begin{equation} \bar{F}^{(n)} \approx \tau ^{2n/c}, \end{equation}
(3)
where \(\tau \in [0,1]\) and \(c \in \mathbb {R}\) are constants capturing the degradation rate. The degradation parameters are estimated as \(\tau \in [0.5,0.7]\) and \(c \in [500,800]\).
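The exponential model in (3) can be fit to the measured relative EWOD force by nonlinear least squares. The sketch below uses SciPy with placeholder arrays standing in for the measurements; the variable names, initial guesses, and bounds are illustrative, not the exact fitting script used for Figure 5.

```python
import numpy as np
from scipy.optimize import curve_fit

def ewod_model(n, tau, c):
    # Relative EWOD force after n actuations, Eq. (3): F(n) ~ tau^(2n/c).
    return tau ** (2.0 * n / c)

# Placeholder data: number of actuations and measured relative EWOD force.
n_meas = np.array([0, 250, 500, 750, 1000], dtype=float)
f_meas = np.array([1.00, 0.83, 0.69, 0.57, 0.48])

# Least-squares fit; tau is constrained to [0, 1] and c to a positive range.
# Note: only the ratio log(tau)/c is identifiable from a single decay curve,
# so the two estimates should be interpreted jointly.
(tau_hat, c_hat), _ = curve_fit(
    ewod_model, n_meas, f_meas, p0=(0.6, 650.0),
    bounds=([0.0, 1.0], [1.0, 5000.0]))
print(f"tau = {tau_hat:.2f}, c = {c_hat:.0f}")
```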
For realistic bioassays, the value of n varies between applications. For instance, the total number of operations ranges from 18 (PCR) to 1,920 (ProteinSplit7) in the work of Grissom and Brisk [17], where each operation needs several steps to be completed. In addition, extra operations might be performed when error recovery is required, increasing the total number of actuations. Thus, the total number of actuations ranges from hundreds to thousands.

4 Background On RL

4.1 Deep RL

An agent in an RL formulation is placed within an environment. The agent’s goal is to accomplish a given task with the best possible performance using a small set of actions. At each step, the agent takes one of these actions, and it receives an observation and a reward from the environment [75].
RL problems can be formally stated using Markov decision processes. A Markov decision process contains two sets (S and A), a transition probability function P, a reward model R, and a discount factor \(\gamma\). The observations made by the agent are included in the set S, and an observation is also referred to as a state. An element \(s_t\in S\) is the observation made by the agent at time t. We use A to denote the set of actions available to the agent. An action \(a_t \in A\) denotes the action taken by the agent at time t. Note that \(P(s_{t+1}|a_t, s_t)\) refers to the transition model; it describes what the next state \(s_{t+1}\) will be after the agent takes action \(a_t\) while in state \(s_t\). The reward model is denoted by \(R(s_t)\); it describes the agent’s reward when it enters the state \(s_t\). The discount factor \(\gamma\), where \(0\le \gamma \le 1\) and \(\gamma \in \mathbb {R}\), represents the relative importance of immediate and future rewards. The agent’s goal is to select the best policy \(\pi\) that maximizes the total reward received from the environment from the start state to an end state. The expected cumulative discounted reward is expressed as \(U(t) = \mathbb {E}[\sum _t \gamma ^t \cdot R(s_t)]\).

4.2 RL Algorithms

We briefly describe three deep RL algorithms that we use to evaluate our RL framework; these algorithms represent Temporal-Difference (TD) learning, on-policy policy gradient, and off-policy actor-critic approaches, respectively.

4.2.1 Double Deep Q-Network.

The Deep Q-Network (DQN) algorithm [50] is a TD method that uses a neural network to approximate the state-action value function
\begin{equation*} Q(s, a) = \underset{\pi }{\text{max}}\, \mathbb {E}\left[\sum _{i=0}^\infty \gamma ^i r_{t+i}\,\Big|\,s_t = s, a_t = a, \pi \right]. \end{equation*}
DQN relies on an experience replay dataset \(\mathcal {D}_t = \lbrace m_1, \ldots , m_t\rbrace\), which stores the agent’s experiences \(m_t = (s_t; a_t; r_t; s_{t+1})\) to reduce correlations between observations. The experience consists of the current state \(s_t\), the action the agent took \(a_t\), the reward it received \(r_t\), and the next state after transition \(s_{t+1}\). The learning update at each iteration j uses a loss function based on the TD update:
\begin{equation*} L_j(\theta _j) = \mathbb {E}_{m_k\sim \mathcal {D}}[(r+\gamma \text{max}_{a^{\prime }}Q(s^{\prime }, a^{\prime };\theta ^-) - Q(s, a; \theta _j))^2], \end{equation*}
where \(\theta _j\) and \(\theta ^-\) are the parameters of the online Q-networks and the target network, respectively, and the experiences \(m_k\) are sampled uniformly from \(\mathcal {D}\). The parameters of the target network are fixed for a number of iterations while the online network \(Q(s, a; \theta _j)\) is updated by gradient descent. In partially observable environments, an agent can only observe \(o_t\) instead of the entire state \(s_t\). The experience replay is therefore updated as \(m_t = (o_t; a_t; r_t; o_{t+1})\).
In DQN, the max operator uses the same values to select an action and evaluate an action, which can lead to overoptimistic value estimation [21]. An improved method named double DQN was proposed to mitigate this problem [76]. In double DQN, the loss function at iteration j is updated as
\begin{equation*} L_j(\theta _j) = \mathbb {E}_{m_k\sim \mathcal {D}}[(r + \gamma Q(s^{\prime }, \text{argmax}_{a^{\prime }}Q(s^{\prime }, a^{\prime };\theta _j);\theta ^-) - Q(s, a; \theta _j))^2]. \end{equation*}
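A minimal PyTorch sketch of the double DQN loss described above follows; the network objects and replay-batch format are assumptions made for illustration, not the implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(online_q, target_q, batch, gamma=0.99):
    """TD loss for double DQN.

    online_q, target_q: networks mapping observations to per-action Q-values.
    batch: tuple of tensors (obs, actions, rewards, next_obs, dones)
           sampled uniformly from the replay dataset D.
    """
    obs, actions, rewards, next_obs, dones = batch
    # Q(s, a; theta_j) for the actions actually taken.
    q_sa = online_q(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Action selection with the online network ...
        next_actions = online_q(next_obs).argmax(dim=1, keepdim=True)
        # ... and action evaluation with the target network (theta^-).
        next_q = target_q(next_obs).gather(1, next_actions).squeeze(1)
        target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_sa, target)
```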

4.2.2 Proximal Policy Optimization Algorithm.

Proximal Policy Optimization (PPO) is an on-policy method that improves the stability of policy gradient training by preventing excessively large policy updates that can cause performance collapse [64]. It updates policies using the following equation:
\begin{equation*} \theta _{k+1} = \underset{\theta }{\text{argmax}} \underset{s, a \sim \pi _{\theta _k}}{{\bf E}}[L(s,a,\theta _k, \theta)]. \end{equation*}
The update usually takes several steps of stochastic gradient descent (SGD) to maximize the objective. Here, the loss function L is defined as
\begin{equation*} L(s, a, \theta _k, \theta) = \text{min}(\frac{\pi _\theta (a|s)}{\pi _{\theta _k}(a|s)}A^{\pi _{\theta _k}}(s, a), g(\epsilon , A^{\pi _{\theta _k}}(s, a))), \end{equation*}
where A is an estimator of the advantage function, \(\epsilon\) is a hyperparameter, and
\begin{equation*} g(\epsilon , A) = {\left\lbrace \begin{array}{ll} (1+\epsilon)A & \text{if } A\ge 0\\ (1-\epsilon)A & \text{if } A \lt 0. \end{array}\right.} \end{equation*}
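For reference, a minimal sketch of the clipped surrogate objective in PyTorch; the tensor names are ours, and the negation turns the maximization objective into a loss suitable for gradient descent.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (negated objective, to be minimized).

    logp_new: log pi_theta(a|s) under the current policy.
    logp_old: log pi_theta_k(a|s) under the policy that collected the data.
    advantages: estimates of A^{pi_theta_k}(s, a).
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), averaged over the batch.
    return -torch.min(ratio * advantages, clipped).mean()
```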

4.2.3 Actor-Critic with Experience Replay.

Actor-Critic with Experience Replay (ACER) is an off-policy actor-critic model that increases sample efficiency and reduces data correlation [80]. Similar to asynchronous advantage actor-critic (A3C) [48], ACER learns the value function by training multiple actors in parallel. To stabilize the off-policy estimator, ACER adopts the Retrace Q-value estimation:
\begin{equation*} \Delta Q^{ret}(S_t, A_t) = \gamma ^t \underset{1\le \tau \le t}{\prod }\text{min} \left(c, \frac{\pi (A_\tau |S_\tau)}{\beta (A_\tau |S_\tau)} \right)\delta _t, \end{equation*}
where \((\pi , \beta)\) is the target and behavior policy pair, \(\delta _t\) is the TD error, and c is a constant. In addition to the Retrace Q-value estimation, ACER uses importance sampling and trust region policy optimization [63].

4.3 MARL Training Schemes

We consider three widely used training schemes for our MARL framework: centralized, concurrent, and parameter sharing [20]. We briefly describe how each approach can be used with MARL.
Centralized. The centralized learning approach assumes a joint model that receives all the observations and generates the joint actions for all the agents. A drawback of this approach is that the observation and action spaces grow exponentially with the number of agents.
Concurrent. In concurrent learning, each agent learns its own individual policy. Each independent policy maps an agent’s private observation to an action. In the policy gradient approach, this means optimizing multiple policies simultaneously from the joint reward signal.
Parameter Sharing. Similar to concurrent learning, each agent is assigned a neural network policy. However, in the parameter sharing approach, all the agents share the parameters of a single policy. This allows the policy to be trained with the experiences of all agents simultaneously. Nevertheless, each agent is still able to act differently based on the observation it receives.
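The sketch below contrasts the concurrent and parameter-sharing schemes at action-selection time: in the concurrent scheme every agent queries its own policy, whereas with parameter sharing all agents query one shared network on their private observations. The policy objects and the predict method are placeholders, not a specific library API.

```python
def act_concurrent(policies, observations):
    # Concurrent scheme: one independent policy per agent.
    return {agent: policies[agent].predict(obs)
            for agent, obs in observations.items()}

def act_parameter_sharing(shared_policy, observations):
    # Parameter sharing: a single policy serves every agent; agents still
    # act differently because each one feeds in its own private observation.
    return {agent: shared_policy.predict(obs)
            for agent, obs in observations.items()}
```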

5 RL Approach to Droplet Router On DMFBs

We consider a bioassay that is executed on a cyberphysical DMFB. The droplet location is determined in real time using a CCD camera [45, 81]. A controller, connected to the DMFB, is loaded with all the droplet routing tasks needed to complete the bioassay [72]. Figure 6 illustrates the overall system.
Fig. 6. The RL framework for droplet routing on DMFBs. (a) Real-time images are captured using the CCD camera. (b) The droplet locations are computed by the controller. The information is mapped to an array as input for the RL agent. (c) An action is chosen by the RL agent. (d) Electrodes are actuated by the controller based on the action. (e) The RL agent receives a reward.

5.1 Droplet Routing as an RL Problem

We formulate droplet routing as a sequential decision-making problem within the RL framework. We utilize a droplet routing agent that makes real-time observations of the DMFB and can move a droplet to an adjacent electrode at each timestep; the agent’s goal is to transport the droplet from a given start electrode to a given destination electrode. The agent is rewarded or punished based on the state transition that results after it takes an action.
Actions. At any timestep, a droplet can be transported in one of four directions: north, south, east, and west. Therefore, we define the action set as \(A=\lbrace a_n, a_s, a_e, a_w\rbrace\); each element denotes a direction along which the droplet can be moved.
States. A state \(s_t\) consists of the location of the transported droplet, the droplet destination, and electrodes that are concurrently utilized by other fluidic operations. During a bioassay, multiple operations may be carried out concurrently to achieve high throughput. If a droplet is moved while a mixing operation is also being carried out, the set of electrodes used for the mixing operation cannot be used for droplet transportation in order to prevent undesirable contamination.
At any given timestep, the observation made on the DMFB is processed as an RGB image. Control software is used to determine the locations of on-chip droplets [45]. The resolution of the RGB image is given by the number of electrodes on the DMFB. An electrode with a droplet on it is interpreted as a blue pixel. The destination electrode is interpreted as a green pixel. The electrodes occupied by all the other concurrent operations are interpreted as red pixels (see Figure 6(b)).
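A minimal sketch of this state encoding follows, producing an N x M x 3 array whose channels mark the routed droplet (blue), the destination (green), and electrodes occupied by concurrent operations (red); the array layout and channel ordering are our assumptions for illustration.

```python
import numpy as np

def encode_observation(n_rows, n_cols, droplet, destination, occupied):
    """Map the DMFB status to an RGB image with one pixel per electrode.

    droplet, destination: (row, col) tuples (1-based indices).
    occupied: iterable of (row, col) electrodes used by concurrent operations.
    """
    obs = np.zeros((n_rows, n_cols, 3), dtype=np.uint8)
    for (i, j) in occupied:
        obs[i - 1, j - 1] = (255, 0, 0)                        # red pixels
    obs[destination[0] - 1, destination[1] - 1] = (0, 255, 0)  # green pixel
    obs[droplet[0] - 1, droplet[1] - 1] = (0, 0, 255)          # blue pixel
    return obs
```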
Rewards. The agent is rewarded if the droplet is transported to its destination. Let \(e_{i, j}\) be the \(i^{\text{th}}\) row and the \(j^{\text{th}}\) column electrode of the DMFB. Suppose that in state \(s_t\), a droplet is present at \(e_{i, j}\), and its destination is \(e_{k, m}\). We define \(D(s_t)\) as the Manhattan distance of the droplet from the destination at state \(s_t\); \(D(s_t) = |i - k| + |j - m|\). After an action \(a_t\) is taken, if \(D(s_{t+1}) = 0\), the agent receives a positive reward of \(+1.0\). Otherwise, the reward is computed as follows:
\begin{equation*} R_t = {\left\lbrace \begin{array}{ll} +0.5 & \text{if } D(s_{t+1}) \lt D(s_t)\\ -0.3 & \text{if } D(s_{t+1}) = D(s_t)\\ -0.8 & \text{if } D(s_{t+1}) \gt D(s_t). \end{array}\right.} \end{equation*}
In the first case, the action leads to a state in which the droplet is closer to the destination, so the reward is positive; any positive value encourages progress toward the destination because the agent maximizes the total reward. In the second case, the agent is punished because the action does not result in a better state. In the third case, the agent is punished with a negative value of larger magnitude because the action leads to a worse state.
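The reward shaping above can be written directly as a function of the Manhattan distances before and after the action; a minimal sketch:

```python
def routing_reward(dist_before, dist_after):
    """Reward for a single-agent routing step.

    dist_before, dist_after: Manhattan distances D(s_t) and D(s_{t+1})
    between the droplet and its destination electrode.
    """
    if dist_after == 0:
        return 1.0      # destination reached
    if dist_after < dist_before:
        return 0.5      # moved closer to the destination
    if dist_after == dist_before:
        return -0.3     # no progress (e.g., blocked or failed transition)
    return -0.8         # moved farther away
```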

5.2 Formulation of Parallel Droplet Routing as MARL

We formulate parallel droplet routing in the MARL framework, where agents are fully cooperative. The action space and state for each agent are similar to those of the single-agent formulation.
Rewards. We consider the cooperative setting for the MARL framework [12, 19, 39, 85] because the agents should not compete with each other to transport droplets. We first compute an assessment value \(r^i\) for an agent i after a state transition. Similar to the earlier definition, let \(D^i(t)\) be the Manhattan distance of the droplet \(d_i\) from its destination at timestep t. After an action \(a^i_t\) is taken, if \(D^i(t+1) = 0\), the assessment value \(r^i\) is assigned a positive value of \(+1.0\) because the droplet has reached the destination. Otherwise, the assessment value is computed as follows:
\begin{equation*} r^i = {\left\lbrace \begin{array}{ll} -0.05 & \text{if } D^i(t+1) \lt D^i(t)\\ -0.1 & \text{if } D^i(t+1) \ge D^i(t). \end{array}\right.} \end{equation*}
In the first case, the action leads to a state in which the droplet is closer to the destination. In the second case, the action results in the same state or even a worse state. Therefore, we use a smaller value as the assessment value. In this reward setting, to gain the maximum value in a game, the agent is encouraged to take as few steps as possible to reach the destination.
As all the agents take a combination of actions, a possible resultant state is that droplets may get too close to each other, which can lead to unintended merging and sample/reagent contamination. To prevent this scenario, we also adjust the assessment values for droplets that are too close to each other. Assume that, after a joint set of actions is taken, the resultant locations of two droplets \(d^i\) and \(d^j\) are \(e_{a, b}^{d^i}\) and \(e_{c, d}^{d^j}\), respectively. The distance of the two droplets is computed as \(D(d^i, d^j)=|a-c| + |b-d|\). If \(D(d^i, d^j) \le 2\), the assessment values are adjusted as \(r^i = r^i - 0.8\) and \(r^j = r^j - 0.8\). In decentralized learning, each agent i is rewarded by its own assessment value \(r^i\); in centralized learning, we give each agent a team-average reward \(R_{avg}=\frac{\sum _{i=1}^N r^i}{N}\).
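A minimal sketch of this multi-agent reward computation follows, including the proximity penalty and the team-average reward used in centralized learning; the data structures and function name are ours, not part of the training environment.

```python
def marl_rewards(dist_before, dist_after, positions, centralized=True):
    """Per-agent assessment values for one joint step.

    dist_before, dist_after: dicts agent -> Manhattan distance to its
    destination before and after the joint action.
    positions: dict agent -> (row, col) location after the joint action.
    """
    r = {}
    for agent in dist_after:
        if dist_after[agent] == 0:
            r[agent] = 1.0                       # destination reached
        elif dist_after[agent] < dist_before[agent]:
            r[agent] = -0.05                     # moved closer
        else:
            r[agent] = -0.1                      # no progress or worse
    # Penalize droplet pairs that end up too close (risk of unintended merging).
    agents = list(positions)
    for a in range(len(agents)):
        for b in range(a + 1, len(agents)):
            (x1, y1), (x2, y2) = positions[agents[a]], positions[agents[b]]
            if abs(x1 - x2) + abs(y1 - y2) <= 2:
                r[agents[a]] -= 0.8
                r[agents[b]] -= 0.8
    if centralized:
        avg = sum(r.values()) / len(r)           # team-average reward
        return {agent: avg for agent in r}
    return r
```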

5.3 DMFB Simulator: Training of RL Agents

We next describe an online droplet router, incorporated as an RL agent, that can execute all the droplet routing tasks. To train the agent, we developed an OpenAI-Gym environment named DMFB-Env. The DMFB matrix consists of \(N\times M\) electrodes, where N and M are inputs to DMFB-Env.
Transition Model. DMFB-Env operates in two modes: healthy and degrading. Recall that \(e_{i, j}\) denotes an electrode at the \(i^{\text{th}}\) row and the \(j^{\text{th}}\) column of the DMFB. The transition function is defined as
\begin{equation*} T(e_{i, j}, a_t) = {\left\lbrace \begin{array}{ll} e_{i-1, j} & \text{if } a_t = a_n\\ e_{i+1, j} & \text{if } a_t = a_s\\ e_{i, j+1} & \text{if } a_t = a_e\\ e_{i, j-1} & \text{if } a_t = a_w, \end{array}\right.} \end{equation*}
where \(1\lt i\lt N\) and \(1\lt j\lt M\). If the droplet is present at the boundary of the electrode array and the action points off the biochip, the droplet remains at the same location. For example, if the droplet is present at \(e_{1,1}\) and the action is either \(a_n\) or \(a_w\), the droplet remains at \(e_{1, 1}\). Similarly, if the next location of the droplet is an electrode that is used by another concurrent fluidic operation, the droplet stays at the same electrode.
For the degrading mode, we introduce a function \(d(e_{i, j})\) that describes the degradation status of an electrode, where \(0\le d(e_{i, j})\le 1\); \(d(e_{i, j}) = 1\) when the electrode \(e_{i, j}\) is completely healthy, and \(d(e_{i, j})\) approaches 0 as the electrode becomes completely degraded. The study in the work of Dong et al. [13] showed that an electrode can only be actuated up to 200 times before it is completely degraded. Therefore, we define a degradation factor \(\tau\), where \(0.5\le \tau \le 0.7\), and the degradation function \(d(e_{i, j})\) is defined as
\begin{equation*} d(e_{i, j}) = \tau ^{\lfloor n/250 \rfloor }, \end{equation*}
where n is the number of actuations. Each electrode is randomly assigned a different value of \(\tau\) to simulate the geometric variance of the electrode array.
A Bernoulli random variable \(X_{i, j}\) is defined as the transition outcome when the droplet is present at \(e_{i, j}\): when \(X_{i, j}=1\), the transition is successful as \(T(e_{i, j}, a_t)\); when \(X_{i, j}=0\), the transition fails, and the droplet remains at the same electrode. The probability mass function of \(X_{i, j}\) is defined as
\begin{equation*} {\left\lbrace \begin{array}{ll} P(X_{i, j}=1) = d(e_{i, j})\\ P(X_{i, j}=0) = 1 - d(e_{i, j}). \end{array}\right.} \end{equation*}
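A minimal sketch of the degrading-mode transition described above follows: each electrode's degradation status is \(\tau^{\lfloor n/250 \rfloor}\), and each attempted move succeeds with probability equal to that status. The class and attribute names are ours and do not correspond to the actual DMFB-Env implementation.

```python
import random

MOVES = {"n": (-1, 0), "s": (1, 0), "e": (0, 1), "w": (0, -1)}

class DegradingGrid:
    def __init__(self, n_rows, n_cols, tau_range=(0.5, 0.7)):
        self.n_rows, self.n_cols = n_rows, n_cols
        # Each electrode gets its own degradation factor to mimic
        # geometric variation across the array.
        self.tau = {(i, j): random.uniform(*tau_range)
                    for i in range(1, n_rows + 1)
                    for j in range(1, n_cols + 1)}
        self.actuations = {e: 0 for e in self.tau}

    def degradation(self, e):
        # d(e) = tau^floor(n / 250), where n is the actuation count of e.
        return self.tau[e] ** (self.actuations[e] // 250)

    def step(self, pos, action):
        """Attempt to move the droplet at pos; return its new position."""
        di, dj = MOVES[action]
        nxt = (pos[0] + di, pos[1] + dj)
        # Moves off the array leave the droplet where it is.
        if not (1 <= nxt[0] <= self.n_rows and 1 <= nxt[1] <= self.n_cols):
            return pos
        self.actuations[nxt] += 1
        # Bernoulli outcome: succeed with probability d(e), else stay put.
        return nxt if random.random() < self.degradation(nxt) else pos
```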
RL Agent. The RL agent is a deep neural network (see Figure 6). It observes images and chooses an action \(a_t \in A\). It receives a reward value based on the outcome of the previous action.
Over the past few years, many neural network architectures have been proposed [27, 36, 68]. Because DMFBs commercially available today typically include a few hundred electrodes [86], we evaluate the effectiveness of RL-based adaptation using DMFBs of size \(N\times M\), where \(25\le N\times M\le 1,225\). While fully connected neural networks are adequate for small DMFB instances (fewer than 100 electrodes), we found that they do not converge for large DMFBs. Our evaluation showed that Convolutional Neural Networks (CNNs) are effective for the preceding DMFB instances. However, because the network needs to be loaded on a DMFB, the computational resources on the associated controller may be limited compared to a server. For example, in the work of Willsey et al. [81], the DMFB includes only a quad-core 1.2-GHz ARMv7 processor with 1 GB of RAM, and it does not contain a GPU; therefore, large CNNs are not feasible in this application scenario. We tested several options for the number of hidden layers and the number of neurons per layer. Our results show that a simple CNN, as described in Table 1, can solve the droplet routing problem for large DMFBs with more than 1,000 electrodes.
Table 1. CNN Configuration

Layer | Type            | Depth | Activation | Stride | Padding
1     | Convolution     | 32    | ReLU       | 3      | 1
2     | Convolution     | 32    | ReLU       | 3      | 1
3     | Max Pool        | N/A   | N/A        | 2      | 1
4     | Convolution     | 64    | ReLU       | 3      | 1
5     | Convolution     | 64    | ReLU       | 3      | 1
6     | Max Pool        | N/A   | N/A        | 2      | 1
7     | Convolution     | 128   | ReLU       | 3      | 1
8     | Convolution     | 128   | ReLU       | 3      | 1
9     | Max Pool        | N/A   | N/A        | 2      | 1
10    | Fully Connected | 8     | ReLU       | N/A    | N/A
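For reference, a PyTorch sketch of a feature extractor that follows Table 1 layer by layer. The table does not list kernel sizes, so 3 x 3 convolution and 2 x 2 max-pool kernels are assumed, and a lazily initialized fully connected layer stands in for layer 10 so that the sketch works for any input resolution. This is an illustrative reconstruction, not the exact network used in our experiments.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 kernels assumed; stride and padding values are taken from Table 1.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=3, padding=1),
        nn.ReLU(),
    )

cnn = nn.Sequential(
    conv_block(3, 32), conv_block(32, 32),                 # layers 1-2
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),      # layer 3
    conv_block(32, 64), conv_block(64, 64),                # layers 4-5
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),      # layer 6
    conv_block(64, 128), conv_block(128, 128),             # layers 7-8
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),      # layer 9
    nn.Flatten(),
    nn.LazyLinear(8), nn.ReLU(),                           # layer 10: FC, 8 units
)
```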

5.4 RL Training

We consider fabricated DMFBs as test cases and evaluate the effectiveness of RL-based adaptation using arrays of size \(N\times N\). N is set as \(10\le N\le 35\) since the total number of electrodes on recent commercial microfluidic biochips is around 500 [32]. For each training game of DMFB-Env, a random routing task is generated. In addition, DMFB-Env generates some random concurrent modules to simulate high-throughput bioassay execution during droplet routing. We evaluated the three RL algorithms (i.e., double DQN, PPO, and ACER) described in Section 4 in the healthy mode of DMFB-Env. We used the default parameter settings from the work of Hill et al. [23] for the three algorithms. The training was executed on a Linux platform integrated with an 11-GB-memory GPU (NVIDIA GeForce RTX 2080 Ti). The training processes using PPO take nearly 2 hours to converge, which is the fastest among the three algorithms. Although it takes several hours to train a model to perform as well as the offline method, training needs to be carried out only once, and the trained model can subsequently be used for all fabricated DMFBs. We compare the RL approaches with an offline optimization method [86].
The training processes for different sizes of DMFBs are shown in Figure 7. For each RL algorithm, we ran 18 simulations with random seeds; the average performance of each algorithm is plotted as a solid line, and the shaded region of the same color shows the interval between its best and worst performance. A training epoch contains 20,000 timesteps. We observe that double DQN does not converge in any training setting. In some cases, double DQN learned sub-optimal policies first, and the policy then learned from lower-reward experiences, which resulted in convergence to more passive policies. The results are similar to RL training in other environments [34, 43]. We observe that PPO performs well in all training settings, but it sometimes takes more training epochs to converge. This is because PPO is sensitive to initialization [28, 35, 79]. In addition, the update rule of PPO encourages the policy to exploit rewards that it has already found over the training course. Therefore, if an initial network policy is far from the global optimum, the policy can easily be trapped in local minima. We also observe that ACER does not perform well in some training settings. As the action space and observation space grow exponentially, the experiences stored in the limited replay buffer become increasingly important for ACER training.
Fig. 7. Training process corresponding to different RL algorithms. Score is the total reward that the RL agents receive in a game. The performance is compared with an offline optimization method [86]. (a) \(10\times 10\) DMFB. (b) \(15\times 15\) DMFB. (c) \(20\times 20\) DMFB. (d) \(25\times 25\) DMFB. (e) \(30\times 30\) DMFB. (f) \(35\times 35\) DMFB.
Our training results show that in all training settings, PPO outperforms the other two RL algorithms. To fine-tune the RL approach using PPO, we tuned two significant PPO hyperparameters for different sizes of DMFBs: the number of concurrent environments and the number of steps for each update.
Figure 9 shows the training rewards for agents with varying numbers of concurrent environments and steps for each update. Here, we show the training rewards for a \(10\times 10\) DMFB, a \(20\times 20\) DMFB, and a \(30\times 30\) DMFB. The training is not stable when there are only a few concurrent environments. For example, when there are four environments, we found that the performance of the training model (updated every 16 steps) drops significantly after a few training epochs. We also observed that for eight environments, irrespective of the update step interval, the model’s performance is consistently better. Similar trends are observed in training for other sizes of DMFBs. Therefore, we chose eight concurrent environments as the PPO setting for model training.
Fig. 8. Evaluation of the trained models in degrading mode of DMFB-Env. The performance, expressed as the required number of actuation (clock) cycles, is compared with the static routing method from Zhao and Chakrabarty [86]. (a) \(5\times 5\) DMFB. (b) \(10\times 10\) DMFB. (c) \(15\times 15\) DMFB. (d) \(20\times 20\) DMFB. (e) \(25\times 25\) DMFB. (f) \(30\times 30\) DMFB.
Fig. 9. Training rewards for agents with different hyper-parameter settings. Score is the total reward that the RL agent receives in a game. (a) Training rewards for DMFBs of size \(10\times 10\) electrodes. (b) Training rewards for DMFBs of size \(20\times 20\) electrodes. (c) Training rewards for DMFBs of size \(30\times 30\) electrodes.
We produced a video recording of droplet routing for a \(5\times 8\) DMFB during training (see [41]). From the video, we see that, at first, the agent moved the droplet randomly without knowing the policy needed to reach the destination. After 200K timesteps, the agent started to “learn” from past experience; after 400K timesteps, it could transport the droplet to the destination using the shortest path, but only for a few of the routing tasks. However, after 800K timesteps, the agent was able to complete all the routing tasks using the shortest paths.

5.5 MARL Training

To train the agents, we developed a PettingZoo-Gym environment to simulate the parallel droplet routing scenarios. For each training game, \(n_{rt}\) random routing tasks are generated, where \(n_{rt} \in \lbrace 2, 3\rbrace\). Each routing task is performed concurrently by one of the agents. The size of the DMFB is \(N\times N\), where \(10 \le N\le 30\). We first performed the agent training using three RL algorithms (PPO, double DQN, and ACER). We also used three different MARL training schemes: centralized, concurrent, and parameter sharing.
Figure 10 shows the training processes for two and three concurrent routing tasks in the healthy mode. A training epoch contains 20,000 timesteps. The performance of different algorithms is compared with the offline optimization method (Baseline) and the RL agents that are trained under single routing task environments (Single). The results show that the concurrent scheme is the most effective and efficient scheme to train the MARL routing models for DMFBs. We observed that PPO and ACER have similar performance, whereas DQN fails to converge in all the training settings. In some of the settings, such as the concurrent training with DMFBs of size \(20 \times 20\), the ACER algorithm converges faster than the PPO algorithm.
Fig. 10. Training process corresponding to different RL algorithms and training schemes with two concurrent routing tasks (a) and three concurrent routing tasks (b).
The figure also shows that single agents can achieve performance comparable to PPO and ACER when the size of the DMFB is small. However, as the size of the DMFB grows and the number of concurrent routing tasks increases, the performance of single-agent models rapidly decreases because the single-agent models did not learn the coordination between droplets. The results illustrate the importance of MARL models for concurrent routing scenarios.

6 Evaluation

To evaluate our RL framework, we considered DMFBs with the number of electrodes ranging from 25 to 900. For each DMFB, we first trained three models with the same network architecture (as described in Table 1) using DMFB-Env, and the models were trained in the healthy mode to achieve the same performance as that of the baseline [86]. After training, we evaluated the performance of the models in the degrading mode of DMFB-Env. We also evaluated the RL framework by executing an epigenetic bioassay on a fabricated biochip.

6.1 Single-Agent Simulation Results

We compared the performance of the agent with the work of Zhao and Chakrabarty [86]. We set \(50\%\) of the electrodes of each DMFB to be degrading, and the results are shown in Figure 8. Here, we use the number of actuation cycles required in a game as the performance metric; the fewer actuation cycles required in a game, the better the performance. We observe that the agent performs similarly to the static (offline) method when the DMFBs start to degrade. This is because the RL agent has been trained to perform as well as the baseline in the healthy mode of DMFB-Env. After a small number of training games, the RL agent sometimes performs slightly worse because the agent may explore alternative routes to avoid the degraded electrodes, and the alternative solutions may be worse than the original route. However, as DMFBs degrade further, the agent outperforms the baseline. We also observe that the proposed solution is more effective for smaller DMFBs. This is because, in our experimental setting, the DMFB with 25 electrodes is the most dynamic environment. The performance of the baseline method decreases if electrode degradation occurs in a DMFB. We see that the performance of the baseline method decreases significantly in the \(5\times 5\) DMFB. The experimental results show that the agent can adapt to all sizes of DMFBs, including the most dynamic environment (i.e., the \(5\times 5\) DMFB).
We recorded a video of droplet transportation in a simulated degraded environment; the video, called Simulation.mp4, can be found in the work of Liang et al. [41]. When some electrodes started to degrade, the agent could still use them to transport the droplet. In the simulated environments, sets of faults with different sizes were injected. Nevertheless, the agent is able to learn the changing health conditions of these electrodes. For subsequent tasks, the agent transports the droplet without using these degraded electrodes.

6.2 MARL Simulation Results

In the degrading mode of MARL, we set 10% of the electrodes to be degrading, and the degradation level of these electrodes increases as they are actuated over time. We compared the performance of the MARL models with the baseline method. The results are shown in Figure 11. We used the concurrent method to train the MARL models since concurrent is the most effective training method, as discussed in Section 5. For DMFBs of size \(10 \times 10\) and \(20 \times 20\), we used PPO as the training algorithm since PPO and ACER achieve similar performance while the training processes of PPO are faster. For DMFBs of size \(30 \times 30\), we used ACER as the training algorithm since ACER achieves the best performance among the three algorithms.
Fig. 11. Training results for MARL agents under degrading mode with two concurrent routing tasks (a) and three concurrent routing tasks (b).
The degradation processes are shown in Figure 11, where the performance is evaluated using the number of cycles needed to transport all the droplets to their destinations. Figure 11 shows that as the electrodes start to degrade, the MARL agents perform slightly worse than the baseline method since the agents are learning to avoid degraded electrodes and are exploring alternative routes, which are longer than the routes taken by the baseline method. After several training epochs, the MARL agents outperform the baseline method as the DMFBs degrade further and the MARL agents have learned from the previous training games. As shown in Figure 11, for DMFBs of size \(10 \times 10, 20 \times 20\), and \(30 \times 30\), the number of training epochs that the models need to adapt to the degrading environments is around 5 to 10, 10 to 15, and 20, respectively. The results show that the MARL agents can adapt to dynamically degraded environments for different sizes of DMFBs and provide more reliable routing strategies than the baseline method.

6.3 RL Runtime on a Micro-Computer

As the RL router learns to adapt to a degrading biochip, the RL agent needs to be repeatedly trained and referenced (i.e., queried for inference) on the micro-computer of the DMFB system during bioassay execution. We profiled the runtime of PPO training and referencing for each timestep on a micro-computer (Raspberry Pi 4) for various sizes of DMFBs (Table 2). Although the micro-computer includes a modest 1.5-GHz quad-core processor and only 4 GB of memory, one training timestep takes only about 0.04 seconds, and one referencing timestep takes only about 0.06 seconds. In our DMFB design, the actuation time required to move one droplet from an electrode to an adjacent electrode is 1 second. Therefore, the training step can be carried out concurrently while the fluidic operation occurs. The additional referencing time for the RL agent to determine the next fluidic operation is 0.06 seconds. The timing overhead of using the RL framework is therefore \(6\%\) when compared with the original DMFB system, which is negligible in practice.
Table 2. Runtime (s) for RL Training and Referencing on a Micro-Computer

Biochip Size | \(5\times 5\) | \(10\times 10\) | \(15\times 15\) | \(20\times 20\) | \(25\times 25\) | \(30\times 30\) | \(35\times 35\)
Training     | 0.01 | 0.02 | 0.03 | 0.04 | 0.06 | 0.07 | 0.1
Referencing  | 0.02 | 0.01 | 0.01 | 0.15 | 0.01 | 0.14 | 0.02

6.4 Bioassay Execution on a Fabricated Biochip

In this section, we show the feasibility of deploying our RL model on a fabricated chip. The model deployment is general regardless of the size of the biochip. In addition, the proposed RL framework can be used for any bioassay. As a specific case study, we designed and executed an epigenetic bioassay on a fabricated DMFB because benchtop epigenetic bioassays require large sample volumes and long execution times, and are labor intensive. Previous work has shown the effectiveness of epigenetic bioassays on DMFBs [30]. This epigenetic bioassay includes 19 routing tasks. We used the trained RL droplet router to transport droplets.

6.4.1 Epigenetic Bioassay.

Even though all cells in the human body have the same DNA, or genotype, considerable differences in cell type and function, or phenotype, arise from the selective expression and suppression of certain genes. This phenotypic control can be attributed to various epigenetic mechanisms. These are processes and environmental factors that alter genomic behavior and its subsequent expression without any changes to the actual DNA. Epigenetics is the study of these factors and mechanisms of control in healthy and diseased populations. Chromatin Immunoprecipitation (ChIP) is used to study the epigenetic relationship between DNA and its supporting proteins [10]. Running a full ChIP protocol on a single sample requires a large starting volume of cells (which are not always available) and several days to run the assay, and is highly labor intensive. We consider Nucleosome Immunoprecipitation (NuIP) on magnetic beads in order to translate ChIP from the benchtop to automated DMFBs to reduce sample sizes, decrease runtimes, and increase throughput.
The NuIP protocol modifies the traditional ChIP assay [10, 52] by first functionalizing a magnetic bead off-chip with an antibody that targets one of the histone proteins in the nucleosome of interest. This is the capture complex as shown in Figure 12. The nucleosome-containing sample is then mixed and incubated with the capture complex followed by magnetic splitting and washing steps. In the meantime, off-chip, an antibody specific to a different histone protein in the nucleosome is incubated with a fluorescent secondary antibody. This forms the detection complex reagent. Next, the beads are incubated with the detection complex. If there are nucleosomes attached to the beads, these will bind with the detection complex. After the excess detection complex is washed away, ensuring that there are no false positives, the beads are resuspended in a droplet and routed to the detection region. An LED tuned to the excitation wavelength of the fluorescent antibody shines on the beads which are imaged using a CCD camera outfitted with the appropriate emission wavelength filter. A fluorescing sample confirms the presence of the nucleosome of interest.
Fig. 12. The steps involved in a NuIP assay.

6.4.2 Experimental Setup.

Fabricated DMFB. For our experiment, we designed a PCB-based DMFB and fabricated it using OSH Park [55]. The DMFB contains a \(6\times 6\) electrode array (Figure 13(a)). A reservoir module is placed on each side of the array, and the modules can dispense different reagent droplets. Each electrode can be controlled individually. The control signals come from the pin headers that are soldered on the board boundary.
Fig. 13. (a) The fabricated DMFB. (b) The control board for the DMFB. (c) The experimental setup.
Control Board. For the fabricated DMFB, the activation/de-activation status of each electrode is controlled by a high voltage relay (part no. Panasonic AQW212). A total of 44 relay ICs are soldered on the control board (36 for electrode array and 8 for reservoir modules) (see Figure 13(b)). Each high-voltage relay IC is controlled by a configuration bit, and these configuration bits are stored in the register ICs (part no. Texas Instruments SN74AHC595). In addition to these ICs, four pin-header modules (shown within the red rectangles) are used as the DMFB socket, which allows DMFB replacement on the control board.
Overall System. Figure 13(c) shows the hardware setup used to operate the DMFB. The DMFB is installed above the control board using the pin-header socket. A micro-computer (Raspberry Pi 4) on the left is used to generate control signals, and the RL agent is installed on the micro-computer. An amplifier board together with a function generator is used to generate a voltage source of 1 kHz and 200 Vpp, which provides actuation signals for the electrodes. A camera placed on top of the DMFB captures the droplet locations. The images are then utilized by the micro-computer for making real-time decisions.

6.4.3 Experimental Results.

We performed the droplet routing tasks of the bioassay using our fabricated DMFB, where we simulated degradation on the electrode at location (3, 4). The degradation was simulated by applying a lower voltage of 150 Vpp to the electrode. During the third routing task, the degraded electrode was involved in the droplet transportation path, and thus a failure occurred. In the following routing task, the RL agent successfully learned from this experience and adopted an alternative path that avoids the degraded electrode. Examples of routing tasks on the fabricated DMFB can be seen in previous work [41]. In the recorded video DMFBExperiment.mp4 [41], intuitive routing cases are presented to show the effectiveness of our RL routing model.

7 Conclusion

We presented a novel framework for RL-based droplet routing on DMFBs. We also developed an OpenAI-Gym environment that can be used to train the RL droplet router for various DMFB sizes. The simulation is based on a study of electrode degradation using fabricated DMFBs. The experimental results showed that even though electrodes on a DMFB degrade over time, the RL droplet router can learn the degradation behavior and transport droplets using only healthy electrodes.
We also formulated a MARL framework for parallel droplet routing on DMFBs. We introduced a PettingZoo-Gym environment for DMFBs to perform the training of MARL agents. Experimental results showed that the MARL framework can learn from degrading environments and provide superior routing strategies, which results in fewer re-routes after failures; thus, bioassays can be completed faster and with a smaller volume of biosamples.
We identified the timing constraint associated with running the RL approach on a micro-computer that does not contain a GPU. The results showed that the proposed RL approach does not impede the fluidic operations in time-critical bioassays. A failure on the DMFB results in costly sample and reagent loss; however, the proposed RL framework minimizes the need to discard biochips with degraded electrodes and to abort bioassay protocols. This extends the useful lifespan of a biochip and allows a plethora of immunoprecipitation assays to be adapted onto the DMFB platform.

References

[1]
Mirela Alistar, Paul Pop, and Jan Madsen. 2016. Synthesis of application-specific fault-tolerant digital microfluidic biochip architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 5 (2016), 764–777.
[2]
Noam Brown and Tuomas Sandholm. 2019. Superhuman AI for multiplayer poker. Science 365, 6456 (2019), 885–890.
[3]
Donald H. Chace, Victor R. De Jesús, and Alan R. Spitzer. 2014. Clinical chemistry and dried blood spots: Increasing laboratory utilization by improved understanding of quantitative challenges. Bioanalysis 6, 21 (2014), 2791–2794.
[4]
Krishnendu Chakrabarty, Richard B. Fair, and Jun Zeng. 2010. Design tools for digital microfluidic biochips: Toward functional diversification and more than Moore. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29, 7 (2010), 1001–1017.
[5]
Ying-Han Chen, Chung-Lun Hsu, Li-Chen Tsai, Tsung-Wei Huang, and Tsung-Yi Ho. 2013. A reliability-oriented placement algorithm for reconfigurable digital microfluidic biochips using 3-D deferred decision making technique. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32, 8 (2013), 1151–1162.
[6]
Minsik Cho and David Z. Pan. 2008. A high-performance droplet routing algorithm for digital microfluidic biochips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27, 10 (2008), 1714–1724.
[7]
Kihwan Choi, Alphonsus H. C. Ng, Ryan Fobel, and Aaron R. Wheeler. 2012. Digital microfluidics. Annual Review of Analytical Chemistry 5 (2012), 413–440.
[8]
Wei-Lung Chou, Pee-Yew Lee, Cing-Long Yang, Wen-Ying Huang, and Yung-Sheng Lin. 2015. Recent advances in applications of droplet microfluidics. Micromachines 6, 9 (2015), 1249–1271.
[9]
Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. 2019. Quantifying generalization in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). Proceedings of Machine Learning Research, Vol. 97. PMLR, 1282–1289. https://proceedings.mlr.press/v97/cobbe19a.html
[10]
Philippe Collas. 2010. The current state of chromatin immunoprecipitation. Molecular Biotechnology 45, 1 (2010), 87–100.
[11]
Peter Dayan and Geoffrey E. Hinton. 1992. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, S. Hanson, J. Cowan, and C. Giles (Eds.), Vol. 5. Morgan-Kaufmann, 1–8.
[12]
Thinh T. Doan, Siva Theja Maguluri, and Justin Romberg. 2019. Finite-time analysis of distributed TD(0) with linear function approximation for multi-agent reinforcement learning. arXiv preprint arXiv:1902.07393 (2019).
[13]
Cheng Dong, Tianlan Chen, Jie Gao, Yanwei Jia, Pui-In Mak, Mang-I. Vai, and Rui P. Martins. 2015. On the droplet velocity and electrode lifetime of digital microfluidics: Voltage actuation techniques and comparison. Microfluidics and Nanofluidics 18, 4 (2015), 673–683.
[14]
Antonis I. Drygiannakis, Athanasios G. Papathanasiou, and Andreas G. Boudouvis. 2008. On the connection between dielectric breakdown strength, trapping of charge, and contact angle saturation in electrowetting. Langmuir 25, 1 (2008), 147–152.
[15]
Mahmoud Elfar, Yi-Chen Chang, Harrison Hao-Yu Ku, Tung-Che Liang, Krishnendu Chakrabarty, and Miroslav Pajic. 2023. Deep reinforcement learning-based approach for efficient and reliable droplet routing on MEDA biochips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 4 (2023), 1212–1222.
[16]
Anurup Ganguli, Ariana Mostafa, Jacob Berger, Mehmet Y. Aydin, Fu Sun, Sarah A. Stewart de Ramirez, Enrique Valera, Brian T. Cunningham, William P. King, and Rashid Bashir. 2020. Rapid isothermal amplification and portable detection system for SARS-CoV-2. Proceedings of the National Academy of Sciences 117, 37 (2020), 22727–22735.
[17]
Daniel T. Grissom and Philip Brisk. 2014. Fast online synthesis of digital microfluidic biochips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 33, 3 (2014), 356–369.
[18]
Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of the International Conference on Robotics and Automation (ICRA’17). IEEE, Los Alamitos, CA, 3389–3396.
[19]
Maxime Guériau, Romain Billot, Nour-Eddin El Faouzi, Salima Hassas, and Frédéric Armetta. 2015. Multi-agent dynamic coupling for cooperative vehicles modeling. In Proceedings of the AAAI Conference on Artificial Intelligence.
[20]
Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. 66–83.
[21]
Hado van Hasselt. 2010. Double Q-learning. In Proceedings of the 23rd International Conference on Neural Information Processing Systems. 2613–2621.
[22]
Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 820–828.
[23]
Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. 2018. Stable baselines. GitHub. Retrieved November 27, 2023 from https://github.com/hill-a/stable-baselines
[24]
Tsung-Yi Ho, Krishnendu Chakrabarty, and Paul Pop. 2011. Digital microfluidic biochips: Recent research and emerging challenges. In Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. 335–344.
[25]
Tsung-Yi Ho, Jun Zeng, and Krishnendu Chakrabarty. 2010. Digital microfluidic biochips: A vision for functional diversity and more than Moore. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’10). IEEE, Los Alamitos, CA, 578–585.
[26]
Patrick V. Hopkins, Carlene Campbell, Tracy Klug, Sharmini Rogers, Julie Raburn-Miller, and Jami Kiesling. 2015. Lysosomal storage disorder screening implementation: Findings from the first six months of full population pilot testing in Missouri. Journal of Pediatrics 166, 1 (2015), 172–177.
[27]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[28]
Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. 2020. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897 (2020).
[29]
Tsung-Wei Huang, Tsung-Yi Ho, and Krishnendu Chakrabarty. 2011. Reliability-oriented broadcast electrode-addressing for pin-constrained digital microfluidic biochips. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’11). IEEE, Los Alamitos, CA, 448–455.
[30]
Mohamed Ibrahim, Craig Boswell, Krishnendu Chakrabarty, Kristin Scott, and Miroslav Pajic. 2016. A real-time digital-microfluidic platform for epigenetics. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. 1–10.
[31]
Illumina. 2015. Illumina NeoPrep Library Prep System. Retrieved January 14, 2020 from https://emea.illumina.com/company/news-center/press-releases/2015/2018793.html
[32]
Baebies Inc. 2020. Versatility of Digital Microfluidics for Screening and Clinical Testing in Newborns. Retrieved December 20, 2022 from https://baebies.com/versatility-of-digital-microfluidics-for-screening-and-clinical-testing-in-newborns/
[33]
Baebies Inc. 2021. Baebies Official Website. Retrieved December 20, 2022 from https://baebies.com
[34]
Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. 2020. Graph convolutional reinforcement learning. In Proceedings of the International Conference on Learning Representations.
[35]
Aristotelis Lazaridis, Anestis Fachantidis, and Ioannis Vlahavas. 2020. Deep reinforcement learning: A state-of-the-art walkthrough. Journal of Artificial Intelligence Research 69 (2020), 1421–1471.
[36]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[37]
Jinho Lee, Raehyun Kim, Seok Won Yi, and Jaewoo Kang. 2020. MAPS: Multi-agent reinforcement learning-based portfolio management system. In Proceedings of the International Joint Conference on Artificial Intelligence. 4520–4526.
[38]
Jia Li and Chang-Jin “CJ” Kim. 2020. Current commercialization status of electrowetting-on-dielectric (EWOD) digital microfluidics. Lab on a Chip 20, 10 (2020), 1705–1712.
[39]
Yuqian Li and Vincent Conitzer. 2015. Cooperative game solution concepts that maximize stability under noise. In Proceedings of the AAAI Conference on Artificial Intelligence. 979–985.
[40]
Zipeng Li, Kelvin Yi-Tse Lai, Po-Hsien Yu, Krishnendu Chakrabarty, Miroslav Pajic, Tsung-Yi Ho, and Chen-Yi Lee. 2016. Error recovery in a micro-electrode-dot-array digital microfluidic biochip. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’16). IEEE, Los Alamitos, CA, 1–8.
[41]
Tung-Che Liang, Yi-Chen Chang, Zhanwei Zhong, Yaas Bigdeli, Tsung-Yi Ho, Krishnendu Chakrabarty, and Richard Fair. 2022. Recorded Videos during Training and Evaluation. Retrieved December 15, 2022 from https://duke.is/vc86a
[42]
Tung-Che Liang, Zhanwei Zhong, Yaas Bigdeli, Tsung-Yi Ho, Krishnendu Chakrabarty, and Richard Fair. 2020. Adaptive droplet routing in digital microfluidic biochips using deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’20).
[43]
Tung-Che Liang, Jin Zhou, Yun-Sheng Chan, Tsung-Yi Ho, Krishnendu Chakrabarty, and Chen-Yi Lee. 2021. Parallel droplet control in MEDA biochips using multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML’21).
[44]
Yang Liu, Ajit Shanware, Luigi Colombo, and Robert Dutton. 2006. Modeling of charge trapping induced threshold-voltage instability in high-\(\kappa\) gate dielectric FETs. IEEE Electron Device Letters 27, 6 (2006), 489–491.
[45]
Yan Luo, Krishnendu Chakrabarty, and Tsung-Yi Ho. 2012. Error recovery in cyberphysical digital microfluidic biochips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32, 1 (2012), 59–72.
[46]
Michael S. Manak, Jonathan S. Varsanik, Brad J. Hogan, Matt J. Whitfield, Wendell R. Su, Nikhil Joshi, Nicolai Steinke, Andrew Min, Delaney Berger, Robert J. Saphirstein, Gauri Dixit, Thiagarajan Meyyappan, Hui-May Chu, Kevin B. Knopf, David M. Albala, Grannum R. Sant, and Ashok C. Chander. 2018. Live-cell phenotypic-biomarker microfluidic assay for the risk stratification of cancer patients via machine learning. Nature Biomedical Engineering 2, 10 (2018), 761–772.
[47]
William K. Meyer and Dwight L. Crook. 1983. Model for oxide wearout due to charge trapping. In Proceedings of the 21st International Reliability Physics Symposium (IRPS’83). IEEE, Los Alamitos, CA, 242–247.
[48]
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. 1928–1937.
[49]
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783 (2016).
[50]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[51]
Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941 (2015).
[52]
Joel D. Nelson, Oleg Denisenko, and Karol Bomsztyk. 2006. Protocol for the fast chromatin immunoprecipitation (ChIP) method. Nature Protocols 1, 1 (2006), 179.
[53]
NIH. 2021. NIH Delivering New COVID-19 Testing Technologies to Meet U.S. Demand. Retrieved January 30, 2021 from https://www.nih.gov/news-events/news-releases/nih-delivering-new-covid-19-testing-technologies-meet-us-demand
[54]
NIH. 2021. Rapid Acceleration of Diagnostics (RADX). Retrieved January 30, 2021 from https://www.nih.gov/research-training/medical-research-initiatives/radx
[55]
OSH Park. 2020. PCB Fabrication Company. Retrieved January 31, 2020 from https://oshpark.com
[56]
Yabo Ouyang, Jiming Yin, Wenjing Wang, Hongbo Shi, Ying Shi, Bin Xu, Luxin Qiao, Yingmei Feng, Lijun Pang, Feili Wei, Xianghua Guo, Ronghua Jin, and Dexi Chen. 2020. Down-regulated gene expression spectrum and immune responses changed during the disease progression in COVID-19 patients. Clinical Infectious Diseases 71, 16 (2020), 2052–2060.
[57]
P. Palanisamy. 2020. Multi-agent connected autonomous driving using deep reinforcement learning. In Proceedings of the International Joint Conference on Neural Networks. 1–7.
[58]
Jun Park, Seung Lee, and Hyoung Kang. 2010. Fast and reliable droplet transport on single-plate electrowetting on dielectrics using nonfloating switching method. Biomicrofluidics 4 (2010), 024102.
[59]
Virginia M. Pierce and Richard L. Hodinka. 2012. Comparison of the GenMark diagnostics eSensor respiratory viral panel to real-time PCR for detection of respiratory viruses in children. Journal of Clinical Microbiology 50, 11 (2012), 3458–3465.
[60]
Michael G. Pollack, Richard B. Fair, and Alexander D. Shenderov. 2000. Electrowetting-based actuation of liquid droplets for microfluidic applications. Applied Physics Letters 77, 11 (2000), 1725–1726.
[61]
Ahmad E. L. Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. 2017. Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017, 19 (2017), 70–76.
[62]
Jonathan E. Schmitz and Yi-Wei Tang. 2018. The GenMark ePlex®: Another weapon in the syndromic arsenal for infection diagnosis. Future Microbiology 13, 16 (2018), 1697–1708.
[63]
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning. 1889–1897.
[64]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[65]
Cormac Sheridan. 2020. COVID-19 spurs wave of innovative diagnostics. Nature Biotechnology 38, 7 (2020), 769–772.
[66]
Vineeta Shukla, Fawnizu Azmadi Hussin, Nor Hisham Hamid, and Noohul Basheer Zain Ali. 2017. Advances in testing techniques for digital microfluidic biochips. Sensors 17, 8 (2017), 1719.
[67]
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of GO without human knowledge. Nature 550, 7676 (2017), 354–359.
[68]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[69]
Rama S. Sista, Rainer Ng, Miriam Nuffer, Michael Basmajian, Jacob Coyne, Jennifer Elderbroom, Daniel Hull, Kathryn Kay, Maithri Krishnamurthy, Christopher Roberts, Daniel Wu, Adam D. Kennedy, Rajendra Singh, Vijay Srinivasan, and Vamsee K. Pamula. 2020. Digital microfluidic platform to maximize diagnostic tests with low sample volumes from newborns and pediatric patients. Diagnostics 10, 1 (2020), 21.
[70]
Fei Su and Krishnendu Chakrabarty. 2004. Architectural-level synthesis of digital microfluidics-based biochips. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD’04). 223–228.
[71]
Fei Su and Krishnendu Chakrabarty. 2006. Yield enhancement of reconfigurable microfluidics-based biochips using interstitial redundancy. ACM Journal on Emerging Technologies in Computing Systems 2, 2 (2006), 104–128.
[72]
Fei Su, Krishnendu Chakrabarty, and Richard B. Fair. 2006. Microfluidics-based biochips: Technology issues, implementation platforms, and design-automation challenges. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 25, 2 (2006), 211–223.
[73]
Fei Su, William Hwang, and Krishnendu Chakrabarty. 2006. Droplet routing in the synthesis of digital microfluidic biochips. In Proceedings of the Design, Automation, and Test in Europe Conference, Vol. 1. 1–6.
[74]
Fu Sun, Anurup Ganguli, Judy Nguyen, Ryan Brisbin, Krithika Shanmugam, David L. Hirschberg, Matthew B. Wheeler, Rashid Bashir, David M. Nash, and Brian T. Cunningham. 2020. Smartphone-based multiplex 30-minute nucleic acid test of live virus from nasal swab extract. Lab on a Chip 20, 9 (2020), 1621–1627.
[75]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[76]
Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
[77]
Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Remi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wunsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 7782 (2019), 350–354.
[78]
Akifumi Wachi. 2019. Failure-scenario maker for rule-based agent using multi-agent adversarial reinforcement learning and its application to autonomous driving. In Proceedings of the International Joint Conference on Artificial Intelligence.
[79]
Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. 2019. Trust region-guided proximal policy optimization. In Proceedings of the 33rd Conference on Neural Information Processing Systems. 1–11.
[80]
Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. 2017. Sample efficient actor-critic with experience replay. In Proceedings of the International Conference on Learning Representations.
[81]
Max Willsey, Ashley P. Stephenson, Chris Takahashi, Pranav Vaid, Bichlien H. Nguyen, Michal Piszczek, Christine Betts, Sharon Newman, Sarang Joshi, Karin Strauss, and Luis Ceze. 2019. Puddle: A dynamic, error-correcting, full-stack microfluidics platform. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). 183–197.
[82]
Tao Xu and Krishnendu Chakrabarty. 2007. Integrated droplet routing in the synthesis of microfluidic biochips. In Proceedings of the Design Automation Conference (DAC’07). 948–953.
[83]
Paloma Yáñez-Sedeño, María Chicharro, Reynaldo Villalonga, and José Pingarrón. 2014. Biosensors in forensic analysis. A review. Analytica Chimica Acta 823C (2014), 1–19.
[84]
Sufi Zafar, Alessandro Callegari, Evgeni Gusev, and Massimo V. Fischetti. 2003. Charge trapping related threshold voltage instabilities in high permittivity gate dielectric stacks. Journal of Applied Physics 93, 11 (2003), 9298–9303.
[85]
Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. arXiv preprint arXiv:1802.08757 (2018).
[86]
Yang Zhao and Krishnendu Chakrabarty. 2012. Simultaneous optimization of droplet routing and control-pin mapping to electrodes in digital microfluidic biochips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 2 (2012), 242–254.
[87]
Zhanwei Zhong, Zipeng Li, Krishnendu Chakrabarty, Tsung-Yi Ho, and Chen-Yi Lee. 2018. Micro-electrode-dot-array digital microfluidic biochips: Technology, design automation, and test techniques. IEEE Transactions on Biomedical Circuits and Systems 13, 2 (2018), 292–313.
[88]
Zhanwei Zhong, Tung-Che Liang, and Krishnendu Chakrabarty. 2020. Reliability-oriented IEEE Std. 1687 network design and block-aware high-level synthesis for MEDA biochips. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC’20). IEEE, Los Alamitos, CA, 544–549.
