
1 Introduction

Over the last decade, the residue number system (RNS) has been increasingly proposed to speed up arithmetic computations on large numbers in asymmetric cryptography. This representation allows additions, subtractions and multiplications to be performed quickly thanks to a high degree of internal parallelism (see the state of the art in Sect. 3). This nice property has been used to implement fast cryptographic primitives in both software and hardware systems. For some years, RNS has been becoming a popular representation in implementations of elliptic curve cryptography (ECC) over \(\mathbb {F}_P\) [8, 11, 14, 31]. RNS has also been proposed for implementing other asymmetric cryptosystems: RSA (e.g. [4, 15, 21, 23]), pairings (e.g. [10, 32]) and, very recently, lattice-based cryptography (e.g. [5]). In this paper, we only deal with hardware implementations of ECC.

RNS is a non-positional number system: there is no implicit weight associated with each digit as in the standard representation. Consequently, comparison, sign determination, division and modular reduction are very costly in RNS. Modular multiplication, one of the most important arithmetic operations in asymmetric cryptography, is significantly more costly than a simple multiplication. Thus, many algorithms and optimizations have been proposed for RNS modular multiplication, see [1, 2, 9, 12, 14, 18, 24, 26, 27].

In RNS, a number is represented by its residues (or remainders) modulo a set of moduli called the base. The base bit width denotes the sum of the bit sizes of all moduli. To represent \(\mathbb {F}_P\) elements, the RNS base must be large enough: the bit width of the base must be greater than the bit width of the field elements. For instance, [14] uses 8 moduli of 33 bits for \(\mathbb {F}_P\) of 256 bits.

Up to now, every RNS modular multiplication algorithm in the literature requires doubling the bit width of the base to represent the full product before modular reduction. This is not the case in the standard representation, where efficient algorithms merge the internal operations of product and reduction using only intermediate values on the field bit width plus a few additional guard bits. Another current weak point of RNS is the lack of an efficient modular reduction algorithm for specific characteristics, analogous to (pseudo-)Mersenne primes (\(P=2^\ell -1\) or \(2^\ell -c\) with \(c < 2^{\ell /2}\) for some \(\ell \)) in the standard binary system.

In this paper, we propose a new RNS modular multiplication algorithm which only requires a single base bit width instead of a double one. Our algorithm uses field characteristics P specifically selected for very efficient computations in RNS. As far as we know, this new algorithm is the first one to perform an RNS modular multiplication without intermediate values larger than the field bit width (plus a few guard bits). It requires up to 2 times fewer operations and 4 times fewer pre-computations than state-of-the-art algorithms. In ECC standards, the field characteristics were selected for fast computations in the standard binary representation (e.g. \(P_{521} = 2^{521}-1\) in the NIST standard [22]). In this work, we propose a new direction for selecting parameters for very efficient RNS implementations of ECC. We expect scalar multiplications in RNS to be up to 2 times faster for a similar area, or to require half the area for the same computation time (other trade-offs are also possible).

The outline of the paper is as follows. Sections 2 and 3 introduce notations and the state of the art, respectively. The new modular multiplication algorithm is presented in Sect. 4. The theoretical cost of our algorithm is analyzed in Sect. 5. Section 6 presents and comments on some FPGA implementation results. Section 7 provides an estimation of the impact of our proposition on complete ECC scalar multiplications in RNS. Finally, Sect. 8 concludes the paper.

2 Notations and Definitions

 

  • Capital letters, e.g. X, are \(\mathbb {F}_P\) elements or large integers

  • \(|X|_P\) is \(X \,\text {mod}\,P\) and P is an \(\ell \)-bit prime

  • \(n = \lceil \ell /w \rceil \), i.e. the minimal number of moduli to represent an \(\ell \)-bit value

  • The RNS base \(\mathcal {B}_a = (m_{a,1},\ldots , m_{a,n_a})\) is composed of \(n_a\) moduli, where all \(m_{a,i}\) are pairwise coprime and of the form \(m_{a,i}=2^w - h_{a,i}\) with \(h_{a,i} < 2^{\lfloor w/2 \rfloor }\)

  • \(\overrightarrow{(X)_a}\) is X in the RNS base \(\mathcal {B}_a\), abridged \(\overrightarrow{X_a}\) when no confusion is possible, and is defined by:

    $$\begin{aligned} \overrightarrow{(X)_a} =(x_{a,1}, \ldots ,x_{a,n_a}) \qquad \text {where} \qquad x_{a,i} = |X|_{m_{a,i}} \end{aligned}$$
    (1)
  • \(M_a = \prod ^{n_a}_{i=1} m_{a,i}\),   \(M_{a,i}= \frac{M_a}{m_{a,i}}\),   \(\overrightarrow{T_a}= \left( |M_{a,1}|_{m_{a,1}} , \ldots , |M_{a,n_a}|_{m_{a,n_a}} \right) \)

  • Similar definitions and notations stand for \(\mathcal {B}_b\), an RNS base coprime to \(\mathcal {B}_a\)

  • \({\mathtt {EMM}}\) is a w-bit elementary modular multiplication (e.g. \(|x_i \cdot y_i|_m\))

  • \({\mathtt {EMW}}\) is a w-bit elementary memory word (for storage)

  • \(\overrightarrow{(X)_{a|b}}\) is X in the RNS base \(\mathcal {B}_{a|b} = (m_{a,1},\ldots , m_{a,n_a},m_{b,1},\ldots , m_{b,n_b})\) i.e. the concatenation of \(\mathcal {B}_a\) and \(\mathcal {B}_b\)

  • MSBs are the most significant bits of a value

  • \(\mathtt{MM }\) denotes the state-of-the-art RNS modular multiplication

3 State of the Art

RNS was proposed independently in [13, 29] for signal processing applications in the 1950s. RNS is a representation where large numbers are split into small independent chunks. The RNS base \(\mathcal {B}\) is the set of moduli \((m_{1}, \ldots , m_{n})\) where all \(m_{i}\) are (small) pairwise coprime. The representation of the integer X in RNS is \(\overrightarrow{X}\), the set of residues \(x_{i}=X \,\text {mod}\,m_{i}, \; \forall m_i \in \mathcal {B}\). In ECC applications, field elements of hundreds of bits are usually split into 16 to 64-bit chunks. Addition/subtraction and multiplication of 2 integers are fast operations in RNS:

$$ \overrightarrow{X} \diamond \; \overrightarrow{Y} \; = \; \left( |x_{1} \diamond y_{1}|_{m_{1}}, \ldots , |x_{n} \diamond y_{n}|_{m_{n}} \right) \quad \forall \, \diamond \in \{+,-,\times \} ,$$

where the internal computations are performed independently over the channels (the i-th channel computes \(|x_{i} \diamond y_{i}|_{m_{i}}\)). There is no carry propagation between the channels. This leads to efficient parallel implementations [8, 11, 14]. This property was also used to randomize the computations over the channels (in time or space) as a protection against some side-channel attacks [6, 15, 23]. Another RNS advantage is its flexibility: the required number of moduli n and the number of physically implemented channels can be different. A physical channel can support several moduli and store the corresponding pre-computations. For instance, [21] presents an RSA implementation over 1024 to 4096 bits.

Conversion from the standard representation to RNS is straightforward: one computes all residues of X modulo \(m_{i}\). To convert back, one must use the Chinese remainder theorem (CRT). The CRT states that any integer X can be represented by its residues \(x_i=|X|_{m_{i}}\), provided \(X < M\) (the product of all moduli) and all moduli are pairwise coprime. The conversion uses the CRT relation:

$$\begin{aligned} X = |X|_{M} =\left| \; \sum ^{n}_{i=1} \; \left| x_{i} \cdot M_{i}^{-1}\right| _{m_{i} } \times M_{i} \; \right| _{M} . \end{aligned}$$
(2)

As one can observe, the result of the CRT is reduced modulo M, i.e. all integers greater than M are automatically reduced modulo M by the CRT relation. In addition to \(\{+,-,\times \}\), some divisions can be performed in parallel in RNS: the exact division by a constant c coprime with M is computed by multiplying each residue \(x_i\) by the inverse \(|c^{-1}|_{m_{i}}\) on each channel.
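For illustration, the following minimal Python (3.8+) model captures these operations (forward conversion, carry-free channel-wise multiplication, CRT reconstruction and exact division). The toy base and all names are chosen here for illustration only; this is a sketch, not our hardware implementation.

```python
from math import prod

def to_rns(X, base):
    """Forward conversion: residue of X on each channel."""
    return [X % m for m in base]

def rns_mul(x, y, base):
    """Channel-wise product: no carry propagates between channels."""
    return [(xi * yi) % m for xi, yi, m in zip(x, y, base)]

def from_rns(x, base):
    """Backward conversion with the CRT relation (Eq. 2);
    exact only when the represented value is < M = prod(base)."""
    M = prod(base)
    s = 0
    for xi, m in zip(x, base):
        M_i = M // m
        s += (xi * pow(M_i, -1, m) % m) * M_i  # |x_i * M_i^{-1}|_{m_i} * M_i
    return s % M

def rns_exact_div(x, c, base):
    """Exact division by a constant c coprime with M: one EMM per channel."""
    return [(xi * pow(c, -1, m)) % m for xi, m in zip(x, base)]

# toy base: four pairwise coprime moduli of the form 2^16 - h
base = (2**16 - 15, 2**16 - 17, 2**16 - 39, 2**16 - 57)
X, Y = 123456789, 987654321
assert from_rns(rns_mul(to_rns(X, base), to_rns(Y, base), base), base) == X * Y
```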

3.1 Base Extension

All fast RNS modular multiplication algorithms in the state of the art use the base extension (\(\text {BE}\)) introduced in [30]. It converts \(\overrightarrow{X_a}\) in base \(\mathcal {B}_a\) into \(\overrightarrow{X_b}\) in base \(\mathcal {B}_b\). After a \(\text {BE}\), X is represented in the concatenation of the two bases \(\mathcal {B}_a\) and \(\mathcal {B}_b\), denoted \(\mathcal {B}_{a|b}\). There are mainly two types of \(\text {BE}\) algorithms: those based on the CRT relation (Eq. 2) [18, 25, 28], and those using an intermediate representation called the mixed radix system (MRS) [3, 7, 30]. In hardware, the most efficient \(\text {BE}\) implementations in the state of the art use the CRT-based solution [12, 14]. Our proposed modular multiplication algorithm can be used with both types of \(\text {BE}\) algorithms. In this paper, we focus on the CRT-based solution since it has fewer data dependencies and leads to faster hardware implementations.

The state-of-the-art \(\text {BE}\) from [18] (Algorithm 1) computes an approximation of Eq. 2 and directly reduces it modulo each \(m_{b,i}\) of the new base \(\mathcal {B}_b\). One can rewrite Eq. 2 as \(X=\sum ^{n_a}_{i=1} \left( \left| x_{a,i} \cdot M_{a,i}^{-1} \right| _{m_{a,i}} M_{a,i} \right) - q \, M_a\) in base \(\mathcal {B}_a\), where q is the quotient of the sum part by \(M_a\). Then one has:

$$\begin{aligned} q=\left\lfloor \sum _{i=1}^{n_a} \frac{\left| x_{a,i} \cdot M_{a,i}^{-1} \right| _{m_{a,i}}}{m_{a,i}} \right\rfloor = \left\lfloor \sum _{i=1}^{n_a} \frac{ \xi _{a,i} }{m_{a,i}} \right\rfloor . \end{aligned}$$
(3)

In order to get the value of q, [18] proposed to approximate \(\frac{ \xi _{a,i} }{m_{a,i}}\) by \(\frac{ \text {trunc}(\xi _{a,i}) }{2^w}\) (i.e. using a few MSBs of \(\xi _{a,i}\)). In Algorithm 1, this approximation is corrected using a selected parameter \(\sigma _0\). If X is small enough, the approximation gives the exact value. Otherwise the \(\text {BE}\) returns either \(\overrightarrow{X_b}\) or \(\overrightarrow{(X+M_a)_b}\), and this is easily managed in \(\mathtt{MM }\) algorithms (see details in [18]).

Algorithm 1. Approximated CRT-based base extension \(\text {BE}\) from [18] (listing not reproduced).
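A functional Python model of this approximated \(\text {BE}\) is sketched below, building on the helpers of the previous sketch. For simplicity it evaluates \(\xi _{a,i}/2^w\) in full floating-point precision instead of truncating to a few MSBs as the hardware version does; the correction parameter \(\sigma _0\) plays the same role in both cases.

```python
def base_extension(x_a, base_a, base_b, w, sigma0=0.5):
    """Model of the approximated CRT-based BE of [18] (Algorithm 1).
    Converts residues x_a over base_a into residues over base_b.
    Depending on X and sigma0, the result represents either X or
    X + M_a, which the calling MM algorithm must tolerate."""
    M_a = prod(base_a)
    # xi_i = |x_i * M_{a,i}^{-1}|_{m_{a,i}}  with  M_{a,i} = M_a / m_{a,i}
    xi = [(x * pow(M_a // m, -1, m)) % m for x, m in zip(x_a, base_a)]
    # approximation of q (Eq. 3): each m_{a,i} is close to 2^w,
    # so xi/m_i is approximated by xi/2^w (floats suffice for this model)
    q = int(sigma0 + sum(v / 2**w for v in xi))
    # CRT sum (Eq. 2) evaluated directly modulo every modulus of base_b
    return [(sum(v * ((M_a // m) % mb) for v, m in zip(xi, base_a))
             - q * (M_a % mb)) % mb
            for mb in base_b]
```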

\(\text {BE}\) algorithms are far more expensive than a simple RNS multiplication. For instance, in Algorithm 1, the CRT relation is computed on each channel of the second base. This \(\text {BE}\) costs \((n_a \,n_b+n_a)\) \({\mathtt {EMM}}\)s. In the usual case \(n_a=n_b=n\), the \(\text {BE}\) cost is \(n^2+n\) \({\mathtt {EMM}}\)s, against n \({\mathtt {EMM}}\)s for a simple multiplication.

3.2 RNS Montgomery Modular Multiplication

As far as we know, the best state-of-the-art RNS modular multiplication was proposed in [26] (presented as Algorithm 2). It is an RNS adaptation of the Montgomery modular multiplication [20], originally in radix 2. Various optimizations have been proposed in [12, 14] (factorization of some products by constants).

Algorithm 2. State-of-the-art RNS Montgomery modular multiplication \(\mathtt{MM }\) from [26] (listing not reproduced).

The first step of Algorithm 2 computes X Y on \(\mathcal {B}_a\) and \(\mathcal {B}_b\). This full product requires a total bit width of 2 n w bits. Then, we need to perform a multiplication modulo \(M_a\) and an exact division by \(M_a\), which is analogous to the modular reduction and division by \(2^r\) in the classic Montgomery multiplication. However, modular reduction by \(M_a\) is only easy in \(\mathcal {B}_a\) (thanks to Eq. (2)) and the exact division by \(M_a\) is only easy in \(\mathcal {B}_b\) (which is coprime with \(\mathcal {B}_a\)). Thus 2 \(\text {BE}\)s are required, at lines 3 and 6. The result is given in the 2 bases, and is less than 3 P (with the \(\text {BE}\) from [18]). Using optimizations from [12] and [14], \(\mathtt{MM }\) costs \(2 \,n_a \, n_b + 2 n_a+2 n_b\) \({\mathtt {EMM}}\)s (or \(2 \,n^2 + 4 \, n\) when \(n_a=n_b=n\)).
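Assuming the base_extension() model above, Algorithm 2 can be sketched as follows; this is a functional model, not the hardware datapath, and the line numbers in the comments refer to Algorithm 2.

```python
def mm_rns(x_a, x_b, y_a, y_b, base_a, base_b, P, w):
    """Sketch of the RNS Montgomery multiplication MM (Algorithm 2).
    Inputs are the residues of X and Y in both bases; returns the
    residues of Z = X*Y*M_a^{-1} mod P in both bases (Z < 3P with
    the approximated BE)."""
    M_a = prod(base_a)
    # line 1: full product s = X*Y in both bases (2nw bits in total)
    s_a = [(xi * yi) % m for xi, yi, m in zip(x_a, y_a, base_a)]
    s_b = [(xi * yi) % m for xi, yi, m in zip(x_b, y_b, base_b)]
    # line 2: q = |s * (-P^{-1})|_{M_a}, channel-wise in base a
    q_a = [(si * pow(-P, -1, m)) % m for si, m in zip(s_a, base_a)]
    # line 3: first BE, q extended to base b (may yield q + M_a,
    # which only adds a multiple of P to the result)
    q_b = base_extension(q_a, base_a, base_b, w)
    # lines 4-5: r = (s + q*P)/M_a, an exact division easy in base b only
    r_b = [((si + qi * P) * pow(M_a, -1, m)) % m
           for si, qi, m in zip(s_b, q_b, base_b)]
    # line 6: second BE, result extended back to base a
    return base_extension(r_b, base_b, base_a, w), r_b
```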

4 Proposed RNS Modular Multiplication Algorithm

Our proposed RNS modular multiplication algorithm, called single base modular multiplication (\(\mathtt{SBMM }\)), relies on two main ideas. First, instead of working on intermediate values represented on two “full” n-moduli RNS bases, it works with only two half-bases \(\mathcal {B}_a\) and \(\mathcal {B}_b\) of \(n_a = n_b = n/2\) moduli. This reduces the leading term of the cost of a \(\text {BE}\) from \(n^2\) to \(\frac{n^2}{4}\) \({\mathtt {EMM}}\)s. Second, we select the field characteristic \(P=M_a^2 - 2\) in order to use these half-bases efficiently, reducing the cost of the RNS modular multiplication from \(2n^2\) to \(n^2\) \({\mathtt {EMM}}\)s. This type of P can be seen as an analogue of Mersenne primes (\(2^\ell -1\)) for the binary representation.

4.1 Decomposition of the Operands

In order to decompose the operands, we use a method similar to the one presented in [9]. Algorithm 3 decomposes the integer X represented by \(\overrightarrow{X_{a|b}}\) on the concatenation of both half-bases. The \(\mathtt{Split }\) function returns \(\overrightarrow{(K_x)_{a|b}}\) and \(\overrightarrow{(R_x)_{a|b}}\) such that \(\overrightarrow{X_{a|b}} = \overrightarrow{(K_x)_{a|b}} \times \overrightarrow{\left( M_a\right) _{a|b}} + \overrightarrow{(R_x)_{a|b}}\), i.e. the quotient and the remainder of X by \(M_a\). Using \(P=M_a^2 - 2\), we have \(M_a=\left\lceil \sqrt{P} \right\rceil \) and \(\mathtt{Split }\) divides X of nw bits into two integers \(K_x\) and \(R_x\) of \(\frac{n}{2}w\) bits.

Algorithm 3. \(\mathtt{Split }\) decomposition of X into \((K_x, R_x)\) (listing not reproduced).

At line 1 of Algorithm 3, a \(\text {BE}\) computes \(R_x=|X|_{M_a}\) in \(\mathcal {B}_b\) (thanks to the CRT). Then, line 2 computes the quotient \(K_x\) of the decomposition by dividing by \(M_a\) (in base \(\mathcal {B}_b\)). Finally \(K_x\) is converted from \(\mathcal {B}_b\) to \(\mathcal {B}_a\) and the algorithm returns \((K_x,R_x)\) in both bases.

Lines 3 to 5 in Algorithm 3 are required due to the approximation in the \(\text {BE}\) algorithm from [18] chosen for efficient hardware implementation. As seen in Sect. 3.1, \(\overrightarrow{(R_x)_b}\) is either \(R_x\) or \(R_x+M_a\) in base \(\mathcal {B}_b\). If an approximation error occurs, \((K_x',R_x') = (K_x-1,R_x+M_a)\) is actually computed. It still satisfies \(X=K_x' M_a + R_x'\), but adds one bit to \(R_x\) (i.e., \(R_x'<2M_a\)). Line 3 checks whether \(K_x=-1\) in base \(\mathcal {B}_b\) since this case is not compatible with the \(\text {BE}\) from [18]. This test can easily be performed in \(\mathcal {B}_b\), but an ambiguity remains between \(K_x=-1\) and \(K_x=M_b -1\). To avoid this ambiguity, we select \(M_b>M_a\): then \((M_b -1)M_a \ge M_a^2 > P\) ensures \(K_x<M_b-1\). Lines 4–5 then set \(K_x=0\) and subtract \(M_a\) from \(R_x\).
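A functional sketch of \(\mathtt{Split }\), again on top of the base_extension() model above; the correction of lines 3–5 is omitted for brevity, and names are ours.

```python
def split_rns(x_a, x_b, base_a, base_b, w):
    """Sketch of Split (Algorithm 3): X = K*M_a + R in both half-bases.
    With the approximated BE, the result may be (K-1, R+M_a); the
    K = -1 corner case of lines 3-5 is not handled in this model."""
    M_a = prod(base_a)
    # line 1: R = |X|_{M_a}, obtained in base b by a small BE
    r_b = base_extension(x_a, base_a, base_b, w)
    # line 2: K = (X - R)/M_a, an exact division done in base b
    k_b = [((xi - ri) * pow(M_a, -1, m)) % m
           for xi, ri, m in zip(x_b, r_b, base_b)]
    # K converted back to base a by a second small BE; R in base a
    # is simply x_a itself, since M_a = 0 mod m_{a,i} for every channel
    k_a = base_extension(k_b, base_b, base_a, w)
    return (k_a, k_b), (x_a, r_b)
```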

A random \(\mathbb {F}_P\) element has probability \(2^{-\ell /2}\) of being less than \(M_a\) (i.e. less than \(\sqrt{P}\)). For the minimum field size in the standards [22], \(\ell =160\) bits, the probability of performing the correction at lines 4–5 is \(2^{-80}\). The implementation of this test can therefore be highly optimized, with a very low probability of pipeline stall.

The theoretical cost of Algorithm 3 is analyzed in Sect. 5. \(\mathtt{Split }\) mainly costs 2 “small” \(\text {BE}\)s between half-bases, i.e. \(\frac{n^2}{2}\) \({\mathtt {EMM}}\)s.

4.2 Proposed RNS Modular Multiplication \(\mathtt{SBMM }\)

Our \(\mathtt{SBMM }\) algorithm is presented as Algorithm 4. We decompose \(X \in \mathbb {F}_P\) into the pair \((K_x,R_x)\) directly from \(\overrightarrow{X}\) using \(\mathtt{Split }\). To recover X from \((K_x,R_x)\), it is sufficient to compute \(K_x M_a + R_x\) in both half-bases \(\mathcal {B}_a\) and \(\mathcal {B}_b\). The product of X by Y, decomposed as \((K_x,R_x)\) and \((K_y,R_y)\), gives:

$$\begin{aligned} X Y \equiv&\quad \left( K_x M_a + R_x \right) \cdot \left( K_y M_a + R_y \right) \, \,\text {mod}\,P\\ \equiv&\quad K_x K_y M_a^2 + (K_x R_y + K_y R_x) M_a + R_x R_y \, \,\text {mod}\,P\\ \equiv&\quad 2 \, K_x K_y + R_x R_y + (K_x R_y + K_y R_x ) M_a \, \,\text {mod}\,P \\ \equiv&\quad 2 \, K_x K_y + R_x R_y + (K_x K_y + R_x R_y -(K_x - R_x) (K_y - R_y)) M_a \, \,\text {mod}\,P \\ \equiv&\quad U + V M_a \, \,\text {mod}\,P \; . \end{aligned}$$

The first line of the previous equation is the definition of the \((K_x,R_x)\) decomposition. The next line uses \(M_a^2=2 \,\text {mod}\,P\). Then we reduce the number of multiplications thanks to the Karatsuba-Ofman trick [17]. Finally we define \(U= 2 K_x K_y + R_x R_y\) and \(V = K_x R_y + K_y R_x\). By definition \(K_x\), \(K_y\), \(R_x\) and \(R_y\) are less than \(M_a\), thus \(U,V < 3 M_a^2\). The values U and V must be representable in the base \(\mathcal {B}_{a|b}\), so we need at least \(3 M_a^2 < M_a M_b\).

However \(U+V M_a\) is too large for \(\mathcal {B}_{a|b}\), thus \(\mathtt{Split }\) is used a second time. It gives the decompositions \((K_u,R_u)\) and \((K_v,R_v)\), then:

$$\begin{aligned} X Y \equiv&\quad U + V M_a \, \,\text {mod}\,P \\ \equiv&\quad K_u M_a + R_u + K_v M_a^2 + R_v M_a\, \,\text {mod}\,P \\ \equiv&\quad (K_u + R_v) M_a + R_u + 2 K_v \, \,\text {mod}\,P \\ \equiv&\quad K_z M_a + R_z \, \,\text {mod}\,P .\\ \end{aligned}$$
Algorithm 4. Proposed single base modular multiplication \(\mathtt{SBMM }\) (listing not reproduced).

Using the property \(M_a^2 \equiv 2 \,\text {mod}\,P\), we can compute \(K_z\) and \(R_z\), the decomposition of \(XY \,\text {mod}\,P\). From \(U < 3 M_a^2\) and \(V < 2 M_a^2\), one can see that \(K_z < 4 M_a\) and \(R_z < 5M_a\). The decomposition of \(Z=XY \,\text {mod}\,P\) satisfies \(K_z M_a + R_z < 5 P\) (i.e., this is equivalent to having \(Z \in [0,5P[\)).
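The algebra above is easy to check with a plain-integer model of \(\mathtt{SBMM }\), where the two \(\text {BE}\)s of each \(\mathtt{Split }\) are abstracted into an exact quotient/remainder computation. This is a sketch with names of our choosing:

```python
def split_int(X, M_a):
    """Integer model of Split: X = K*M_a + R (2 small BEs in RNS)."""
    return X // M_a, X % M_a

def sbmm(Kx, Rx, Ky, Ry, M_a):
    """Integer model of SBMM (Algorithm 4) for P = M_a^2 - 2.
    Inputs are the decompositions of X and Y; the returned pair
    (Kz, Rz) satisfies Kz*M_a + Rz = X*Y mod P (up to a multiple
    of P, with Kz < 4*M_a and Rz < 5*M_a for inputs < M_a)."""
    KK = Kx * Ky
    RR = Rx * Ry
    U = 2 * KK + RR                       # uses M_a^2 = 2 mod P
    V = KK + RR - (Kx - Rx) * (Ky - Ry)   # Karatsuba: Kx*Ry + Ky*Rx
    Ku, Ru = split_int(U, M_a)            # lines 3-4: the 2 Splits
    Kv, Rv = split_int(V, M_a)
    return Ku + Rv, Ru + 2 * Kv           # line 5
```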

Our proposition is similar to the use of Mersenne primes in the binary system, where the modular reduction is performed by adding the “high” and “low” parts of the operand. In our case, \(K_x\) behaves as the “high” part and \(R_x\) as the “low” part. In our algorithm, the split dominates the total cost. In the standard binary system, the multiplication part is costly while the split is free (constant shifts). The complete method is described in Algorithm 4, where U and V are computed at lines 1–2, the decompositions at lines 3–4, and finally \(K_z, R_z\) are returned at line 5.

Using the approximated \(\text {BE}\) from [18], we have \(K_z < 5 M_a\) and \(R_z < 6 M_a\). We therefore choose \(M_b > 6M_a\) in Algorithm 4 (instead of \(M_b > 5M_a\)).

In Algorithm 4, the two decompositions using \(\mathtt{Split }\) dominate the total cost, which is equivalent to only one classical \(\text {BE}\). Thus our algorithm requires about half the operations of the state-of-the-art one (see details in Sect. 5).

Our algorithm has been tested over millions of random modular multiplications, for \(\ell = 160, 192, 256, 384\) and 512 bits.
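A test of this kind is easy to reproduce on the integer model above with a toy prime (here \(M_a = 5\), \(P = 23\), chosen for illustration; our actual tests run the full RNS version at cryptographic sizes):

```python
import random

M_a = 5                  # toy size: P = M_a^2 - 2 = 23 is prime
P = M_a**2 - 2
for _ in range(100_000):
    X, Y = random.randrange(P), random.randrange(P)
    Kz, Rz = sbmm(*split_int(X, M_a), *split_int(Y, M_a), M_a)
    assert (Kz * M_a + Rz) % P == (X * Y) % P
```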

4.3 Selecting \(P=M_a^2-2\) for RNS Efficient Implementations

We choose specific forms of the characteristic in order to significantly speed up computations. For instance, \(K_x K_y M_a^2 \,\text {mod}\,P\) becomes \(2 K_x K_y \,\text {mod}\,P\) using \(P=M_a^2-2\). We also tried forms such as \(P=M_a^2-c\) and \(P=dM_a^2-c\) where c and d are small integers. The constraint for selection is that P must be prime. For instance, using \(c=0\) or 1 and \(d=1\), P is never prime (in those cases P is a multiple of \(M_a\) or \(M_a+1\)). The specific form \(P=M_a^2-2\) seems to be the best solution at the implementation level.

We wrote a Maple program to find P for fixed parameters n and w. We randomly choose odd moduli of the form \(m_{a,i}=2^w - h_{a,i}\) with \(h_{a,i} < 2^{\lfloor w/2 \rfloor }\). Each new \(m_{a,i}\) must be coprime with the previously chosen ones. In practice, this step is very fast and easy as soon as w is large enough (i.e. \(w\ge 16\) bits for ECC sizes). Using several \(M_a\) candidates from the previous step, we first check the primality of \(P=M_a^2-2\) using a probabilistic test. This very quickly gives a large set of P candidates. For the final selection in a real cryptographic implementation, a deterministic primality test must be used. As an example, for \(\ell =512\) bits it took us 15 s to generate 10,000 different candidate couples \((P,M_a)\) on a simple laptop (2.2 GHz Core i7 processor with 4 GB RAM). For the second base, \(M_b\) just needs to be coprime with \(M_a\) and to verify \(M_b > 6 M_a\). As far as we know, there is no specific theoretical attack on ECC over \(\mathbb {F}_P\) based on the form of the prime characteristic. In the current state of the art, the theoretical security of ECC is directly related to the subgroup of curve points in which one works, and in particular to its order. This is why very specific P (Mersenne and pseudo-Mersenne primes) were chosen in the NIST standards [22]. We propose to follow the same approach for our RNS-friendly primes.
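A Python analogue of this search could look as follows (a sketch; the primality test is passed in as a parameter, e.g. a probabilistic test such as sympy.isprime for candidate generation, with a deterministic test reserved for the final selection):

```python
import math
import random

def gen_half_base(n_a, w):
    """Pick n_a pairwise coprime odd moduli m = 2^w - h, h < 2^(w//2)."""
    base = []
    while len(base) < n_a:
        h = random.randrange(1, 2**(w // 2), 2)   # odd h gives an odd modulus
        m = 2**w - h
        if all(math.gcd(m, prev) == 1 for prev in base):
            base.append(m)
    return base

def find_candidate(n_a, w, is_probable_prime):
    """Search for an RNS-friendly prime P = M_a^2 - 2."""
    while True:
        base_a = gen_half_base(n_a, w)
        M_a = math.prod(base_a)
        P = M_a * M_a - 2
        if is_probable_prime(P):
            return P, M_a, base_a
```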

4.4 Controlling the Size of \(\mathtt{SBMM }\) Outputs

Our \(\mathtt{SBMM }\) (Algorithm 4) returns values in a slightly wider domain than that of its inputs (i.e., \(K_x, K_y, R_x, R_y < M_a\) lead to \(K_z < 5 M_a\) and \(R_z < 6 M_a\)). In case of successive \(\mathtt{SBMM }\) calls, one has to manage intermediate values of increasing size. For instance, computing \(|X^8|_P\) (i.e. 3 successive squarings), one gets \(K_{X^8} < 30374 M_a\) and \(R_{X^8} < 42946 M_a\). The architecture parameters must therefore be selected to support this expansion: in practice, the bit width of the RNS base must be large enough for the worst-case intermediate value, which can be done by selecting an adequate \(\mathcal {B}_b\). As the number of modular multiplications in ECC is very large, we must find a way to compress some intermediate values.

A first way to limit this expansion is to compute \(|X \cdot 1|_P\) using \(\mathtt{SBMM }\). The decomposition of 1 is just (0, 1), then \(U=R_x\), \(V=K_x\), and the compressed outputs satisfy \(K_z,R_z < 3 M_a\). This is a simple but not very efficient solution.

Algorithm 5. \(\mathtt{Compress }\) (listing not reproduced).

A faster method involving extra hardware is proposed in Algorithm 5 (named \(\mathtt{Compress }\)). It requires a dedicated small extra modulus \(m_\gamma \), typically \(m_\gamma =2^6\), and the inputs are assumed to satisfy \(K,R< (m_\gamma -1) M_a\). To compress a pair (K, R), one converts \(R_k = |K|_{M_a}\) from \(\mathcal {B}_a\) to \(m_\gamma \) thanks to a \(\text {BE}\) at line 1 (this \(\text {BE}\) onto a single modulus only requires \(n_a\) operations modulo \(m_\gamma \)). One can now compute \(K_k = \left\lfloor \frac{K}{M_a} \right\rfloor \) modulo \(m_\gamma \). Since \(K_k\) is less than \(m_\gamma -1\), which is less than all the moduli, it can be used directly in \(\mathcal {B}_a\) and \(\mathcal {B}_b\). This enables computing \(R_k\) in \(\mathcal {B}_b\) at line 3 without a \(\text {BE}\). Now \((K_k,R_k)\) satisfies \(K=K_k M_a + R_k\) in \(\mathcal {B}_{a|b|m_\gamma }\). Lines 4–6 perform the same computations to get \((K_r,R_r)\). Finally, one can use the property \(M_a^2 = 2 \,\text {mod}\,P\) because (K, R) is the decomposition of \(X\in \mathbb {F}_P\); then we have:

$$\begin{aligned} X \equiv&\quad K M_a + R \,\text {mod}\,P \\ \equiv&\quad K_k M_a^2 + R_k M_a + K_r M_a + R_r \,\text {mod}\,P \\ \equiv&\quad (R_k + K_r) M_a + 2 K_k +R_r \,\text {mod}\,P \\ \equiv&\quad K_c M_a + R_c \,\text {mod}\,P \, . \end{aligned}$$

Using the approximated \(\text {BE}\) from [18], we have \(K_c < 2 M_a + m_\gamma < 3 M_a\) and \(R_c < 2 M_a + 2 m_\gamma < 3 M_a\) (using an exact \(\text {BE}\), one gets \(K_c < 2 M_a\) and \(R_c < 2 M_a \)). Using \(\mathtt{SBMM }\) on inputs in the domain \(K <3 M_a\) and \(R < 3 M_a\) gives outputs \(K_z < 29 M_a\) and \(R_z < 38 M_a\), so choosing \(m_\gamma = 2^6\) is sufficient to compress the outputs back into the domain \([0,3M_a[\). The main parts of the \(\mathtt{Compress }\) algorithm can be performed in parallel, on a channel dedicated to \(m_{\gamma }\). The cost of Algorithm 5 is evaluated in Sect. 5 and examples of applications are presented in Sect. 7.
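As for \(\mathtt{SBMM }\), the correctness of \(\mathtt{Compress }\) can be checked on a plain-integer model (a sketch abstracting the RNS machinery; in RNS, the quotients below are obtained on the small \(m_\gamma \) channel):

```python
def compress_int(K, R, M_a):
    """Integer model of Compress (Algorithm 5) for P = M_a^2 - 2.
    Inputs are assumed < (m_gamma - 1)*M_a. Returns (Kc, Rc) with
    Kc*M_a + Rc = K*M_a + R mod P and Kc, Rc < 2*M_a (exact BE)."""
    Kk, Rk = K // M_a, K % M_a    # lines 1-3: via a one-modulus BE
    Kr, Rr = R // M_a, R % M_a    # lines 4-6: same computation for R
    return Rk + Kr, 2 * Kk + Rr   # uses M_a^2 = 2 mod P
```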

5 Theoretical Cost Analysis

As usually done in state-of-the-art references, we evaluate the cost of our algorithms (\(\mathtt{Split }\), Algorithm 3; \(\mathtt{SBMM }\), Algorithm 4; \(\mathtt{Compress }\), Algorithm 5) by counting the number of \({\mathtt {EMM}}\)s; modular additions and other very cheap operations are neglected. Below, we use the case \(n_a=n_b=n/2\) since it is the most interesting one.

First, \(\mathtt{Split }\) (Algorithm 3) mainly consists of 2 \(\text {BE}\)s. The multiplication by the constant \(\overrightarrow{(M_a^{-1})_b}\) at line 2 can be combined with the multiplication by the constant \(\overrightarrow{T_a^{-1}}\) at line 1 of \(\text {BE}\) Algorithm 1, which saves n / 2 \({\mathtt {EMM}}\)s. The test at line 3 is neglected because the probability of performing lines 4–5 is very low and they do not contain any \({\mathtt {EMM}}\). Then \(\mathtt{Split }\) costs 2 \(\text {BE}\)s on two half-bit-width bases, i.e. \(\frac{n^2}{2} + n\) \({\mathtt {EMM}}\)s.

In \(\mathtt{SBMM }\) (Algorithm 4), lines 1, 2 and 5 cost 3n \({\mathtt {EMM}}\)s: the multiplication by 2 is performed using an addition, and the 4 products \(K_x K_y\), \(R_x R_y\), \(K_x R_y\) and \(K_y R_x\) on \(\mathcal {B}_{a|b}\) are computed with only 3 multiplications thanks to the Karatsuba-Ofman trick [17]. The 2 \(\mathtt{Split }\)s at lines 3–4 cost \(2\left( \frac{n^2}{2} + n\right) \) \({\mathtt {EMM}}\)s. In total, our \(\mathtt{SBMM }\) algorithm requires \(n^2+5n\) \({\mathtt {EMM}}\)s, against \(2n^2 + 4n\) for the state-of-the-art algorithm from [12].

\(\mathtt{Compress }\) (Algorithm 5) performs 2 \(\text {BE}\)s from \(\mathcal {B}_a\) to \(m_\gamma \), which cost n / 2 \({\mathtt {EMM}}\)s on \(\mathcal {B}_a\) and n / 2 very cheap multiplications modulo \(m_\gamma \) each, typically on 6 bits (this type of small multiplication modulo \(m_\gamma \) is denoted \({\mathtt {GMM}}\)). Lines 2 and 5 require two more \({\mathtt {GMM}}\)s. Finally, 2 RNS multiplications on \(\mathcal {B}_b\) are required at lines 3 and 6, for a cost of 2(n / 2) \({\mathtt {EMM}}\)s. Thus \(\mathtt{Compress }\) costs 2 n \({\mathtt {EMM}}\)s and \((n+2)\) \({\mathtt {GMM}}\)s.

Table 1 sums up the precomputations required for \(\mathtt{SBMM }\) (including \(\mathtt{Split }\) and its \(\text {BE}\)) and \(\mathtt{Compress }\). Globally, our proposition requires 4 times fewer \({\mathtt {EMW}}\)s than the state-of-the-art algorithm: halving the bit widths in a quadratic cost leads to the 1/4 factor. Table 2 compares the costs analyzed in this section for the state-of-the-art algorithm and ours.

Table 1. Precomputations details for \(\mathtt{SBMM }\) and \(\mathtt{Compress }\) in \({\mathtt {EMW}}\)s (* denotes modulo \(m_\gamma \) values).
Table 2. Cost summary for main operations and precomputations for our solution and the state-of-art one.

6 Hardware Implementation

6.1 Proposed Architecture

Our \(\mathtt{SBMM }\) architecture, depicted in Fig. 1, uses the \(\mathtt{{Cox-Rower} }\) architecture from [18] (as do state-of-the-art implementations). A \(\mathtt{{Rower} }\) unit is dedicated to each channel for computations modulo \(m_{a,i}\) and \(m_{b,i}\). The \(\mathtt{{Cox} }\) is a small unit required to compute fast \(\text {BE}\)s. Our architecture implements n / 2 \(\mathtt{{Rower} }\)s based on the ones presented in [14]. A small 6-bit \(\mathtt{{Rower} }\) is implemented for computations on the \(m_\gamma \) channel. This small additional channel has 2 different roles. First, for computations over \(\mathcal {B}_b\), it adds the extra modulus \(m_{\gamma }\) to \(\mathcal {B}_b\) and enables \(M_b>6 M_a\). Second, it is used to compute modulo \(m_\gamma \) in \(\mathtt{Compress }\) from Sect. 4.4. In Fig. 1, the white circles are control signals (clock and reset are not represented). The small squares (in red) are just wire selections extracting the 6 MSBs of a w-bit word. The bottom right rectangle (in blue), between the extra \(\mathtt{{Rower} }\) and the multiplexer, just pads \(w-6\) zeros onto the MSBs.

Fig. 1. Proposed architecture for our \(\mathtt{SBMM }\) algorithm with an extra 6-bit channel (color figure online; figure not reproduced).

Our \(\mathtt{{Rower} }\)s are implemented as a 6-stage pipeline like the one proposed in [14]. We therefore also designed our extra \(\mathtt{{Rower} }\) for computations modulo \(m_{\gamma }\) with 6 stages, to simplify synchronization. We select \(m_{\gamma }=2^6\) and all other moduli odd to make this unit very small and simple.

Our architecture, depicted in Fig. 1, is close to the state-of-the-art one presented in [14]. The 2 differences are the number of \(\mathtt{{Rower} }\)s and the presence of the extra 6-bit channel. We only have n / 2 \(\mathtt{{Rower} }\)s in our architecture instead of n in the state-of-the-art one; the control in both architectures is similar. We chose to implement only n / 2 \(\mathtt{{Rower} }\)s in this paper to reduce the silicon area. In the future, we will implement another version with more \(\mathtt{{Rower} }\)s to speed up computations. Our \(\mathtt{SBMM }\) algorithm allows using n \(\mathtt{{Rower} }\)s (instead of n / 2 in the current version) and then leads to computations up to 2 times faster, while the state-of-the-art algorithm does not achieve a doubled speed when using 2n moduli (due to dependencies). In our algorithm, the 2 independent \(\mathtt{Split }\)s and the other operations over \(\mathcal {B}_{a|b}\) can be performed in parallel over n \(\mathtt{{Rower} }\)s.

Both architectures have been validated using numerous VHDL simulations for several sets of parameters (see Table 3).

Table 3. Selected (n, w) parameter couples for the 3 evaluated field sizes \(\ell \).

6.2 Implementation Results on Various FPGAs

We have completely implemented, optimized and validated both the state-of-the-art algorithm from [12] and our \(\mathtt{SBMM }\) algorithm on various FPGAs. The results are given for 3 FPGA families: two low-cost Spartan-6 devices (XC6SLX45, denoted S6, or XC6SLX100, denoted S6*, when the LX45 is too small), a high-performance Virtex-5 (XC5VLX220, denoted V5) and a recent mid-range Kintex-7 (XC7K70T, denoted K7), all with average speed grade. We used ISE 14.6 tools with medium efforts. Below, we report results for a single \(\mathtt{MM }\) and a single \(\mathtt{SBMM }\) without \(\mathtt{Compress }\). The \(\mathtt{Compress }\) control is not yet implemented; we will add it and evaluate complete ECC scalar multiplication algorithms and architectures in the future.

The selected parameters are given in Table 3. In order to compute an \(\mathtt{MM }\) over \(\ell \) bits using the state-of-the-art algorithm, one needs an RNS base with a bit width slightly larger than \(\ell \) bits. Among state-of-the-art architectures, the fastest are obtained with moduli of 1 additional bit (instead of an additional w-bit channel).

Both architectures have been implemented with and without DSP blocks; the corresponding results are reported in Tables 4 and 5, respectively. The parameter n impacts the clock cycle count while w impacts the clock period (frequency).

Table 4. FPGA implementation results of state-of-art \(\mathtt{MM }\) and \(\mathtt{SBMM }\) algorithms with DSP blocks and BRAMs.
Table 5. FPGA implementation results of state-of-art \(\mathtt{MM }\) and \(\mathtt{SBMM }\) algorithms without DSP blocks and BRAMs.
Fig. 2. Summary of obtained area reductions (in number of slices) (figure not reproduced).

These results are compared graphically in Figs. 2 and 3 for area and computation time, respectively. Our \(\mathtt{SBMM }\) algorithm leads to a 26 to 46 % area reduction (in slices) for at most a 10 % computation time increase. When using DSP blocks, the reduction in the number of slices is in the range 26–46 % and the number of DSP blocks is halved. Without DSP blocks, \(\mathtt{SBMM }\) leads to a 40–46 % area reduction w.r.t. the state-of-the-art \(\mathtt{MM }\) (except for \(\ell =192\) on K7).

Using DSP blocks (bottom of Fig. 3), the computation time required for \(\mathtt{SBMM }\) is very close to the state-of-the-art one (overhead from 4 % to 10 %). Without DSP blocks (top of Fig. 3), our solution is as fast as the state of the art or slightly faster. For the widest fields on the low-cost Spartan-6 devices, the \(\mathtt{MM }\) area is so large that it brings down the frequency.

Fig. 3. Time for a single modular multiplication with \(\mathtt{SBMM }\) and \(\mathtt{MM }\), with (bottom) and without (top) DSP blocks activated (figure not reproduced).

Table 6. Formulas for the Weierstrass form (\(y^2=x^3+ax+b\)), RNS optimizations from [3], (X, Z) coordinates and the Montgomery ladder from [16].
Table 7. Operation costs (in \({\mathtt {EMM}}\)s) for \(\mathtt{MM }\) and \(\mathtt{SBMM }\) algorithms, various curve-level operations, formulas from Table 6 and Montgomery ladder (ML).

7 Examples of ECC Computations

We evaluate the gain of our \(\mathtt{SBMM }\) method on a full ECC scalar multiplication example. This example is used in [14] and [8], the two best state-of-the-art RNS implementations of ECC on FPGA with a protection against SPA (simple power analysis [19]). Both papers use the formulas presented in Table 6, originally proposed in [3], with (X, Z) coordinates and the Montgomery ladder from [16]. These formulas use the lazy reduction from [3]: \((A B + C D) \,\text {mod}\,P\) is computed instead of \((A B) \,\text {mod}\,P\) and \((C D) \,\text {mod}\,P\) separately.

Figure 4 describes a computation flow example for the formulas of Table 6. It shows that each intermediate value can be compressed in parallel with a new \(\mathtt{SBMM }\) call, except for \(Z_3\) and I. In these cases, one has two choices: wait for the compression function, or add a few bits to \(m_\gamma \) to be able to compress more bits. To compress the result of 2 successive multiplications, \(m_\gamma \) must be at least \(2^{10}\). Moreover, 2 additional bits are required due to the lazy reduction. For this example, we consider \(m_\gamma = 2^{12}\).

Fig. 4. Example of execution flow using \(\mathtt{SBMM }\) and \(\mathtt{Compress }\) in parallel, using the formulas of Table 6 (figure not reproduced).

Point addition (ADD) from Table 6 requires 6 modular reductions and 11 multiplications (5 of them without reduction). Multiplications by 2 are performed using additions. Point doubling (DBL) requires 7 modular reductions and 9 multiplications. Each multiplication without reduction costs 2n \({\mathtt {EMM}}\)s using the state-of-the-art algorithm and 3n \({\mathtt {EMM}}\)s using \(\mathtt{SBMM }\) (to compute U and V). ADD requires 3 compressions, for D, \(Z_3\) and \(X_3\), and DBL requires 4 compressions, for H, I, \(X_3\) and \(Z_3\), since \(m_\gamma \) is defined to support compression after 2 successive multiplications. The total costs of ADD and DBL are reported in Table 7 for both algorithms.

The reduction in the number of \({\mathtt {EMM}}\)s for various practical values of n (from [14] and [8]) is: 25 % for \(n=8\), 33 % for \(n=12\), 37 % for \(n=16\), 41 % for \(n=24\) and 43 % for \(n=32\). Under the pessimistic cost assumption \({\mathtt {GMM}}= {\mathtt {EMM}}\), the reduction lies in the range 23–42 % (instead of 25–43 %).

8 Conclusion

In this paper, we proposed a new method to efficiently compute modular multiplications in RNS for ECC computations over \(\mathbb {F}_P\). This method relies on the selection of pseudo-Mersenne-like primes as the field characteristic of \(\mathbb {F}_P\). These specific primes lead to a very efficient RNS modular multiplication with only one single RNS base, instead of two for state-of-the-art algorithms. Our new algorithm theoretically costs about 2 times fewer operations and 4 times fewer precomputations than the state-of-the-art RNS modular multiplication. Our FPGA implementations lead to up to 46 % area reduction for a time overhead of less than 10 %.

In the future, we plan to implement a complete ECC accelerator in RNS with this new technique, and we expect important improvements at the protocol level. In this paper, we designed an operator with reduced area but without speed improvement. We will design other versions with non-reduced area but a significant speed-up (or other trade-offs).