
Open access

CSAIL2019 Crypto-Puzzle Solver Architecture

Published: 17 September 2024

Abstract

The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computation, with each iteration performing a very large (3,072-bit for CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than that of known field-programmable gate array (FPGA) implementations, and the number of iterations has been increased by about 1,000x compared with LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone, versions of the puzzle have been specified. In this article, we present several FPGA architectures for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We introduce a class of multi-cycle squarer-based architectures that allow for better resource and area trade-offs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 23 out of 28 milestone solutions of the puzzle, which are the first reported results for this problem.

1 Introduction

This article presents several field-programmable gate array (FPGA)–based solver architectures for the CSAIL2019 crypto-puzzle [12]. The CSAIL2019 puzzle is a “refreshed” version of the LCS35 crypto-puzzle [13] published in 1999. This area of work is no longer just an academic exercise; rather, it is relevant to current industry requirements with the increasing importance of blockchain.
The stated problem is the computation of \(2^{2^t} \bmod N\) for specified values of \(t\) and \(N\) . LCS35 defines \(t = 79685186856218\) , and \(N\) is 2048 bits in length. The original estimate was that LCS35 would need 35 CPU years to solve, assuming that the yearly milestone (i.e., intermediate) value would be transferred to the latest CPU technology available. In 2019, Bernard Fabrot solved LCS35 using 3 years of continuous CPU runtime [16]. Shortly thereafter, the Cryptophage team used an FPGA to solve the puzzle in only 2 months [3].
The CSAIL2019 puzzle defines \(t=2^{56}=72057594037927936\) , which is nearly 1,000 times the LCS35 value. The word size of \(N\) is now 3072 bits, 1.5 times that of LCS35. Thus, each iteration is at least 2.25 \(\times\) more computationally demanding than before (based solely on the multiplier dimension increase, \(2.25 = 1.5^2\) ). Attempting to solve CSAIL2019 with a design that had the same compute capability as the Cryptophage approach and the same hardware would result in an expected runtime of about 375 years. While solving the full puzzle requires a solution for \(t = 2 ^ {56}\) , CSAIL2019 is also interested in solutions for \(t = 2 ^ k\) for \(56/2 \le k \lt 56\) . These intermediate solutions are called "milestone versions of the puzzle".
The strength of this type of puzzle comes from its intrinsically sequential nature. This means that computing the solution cannot be parallelized given a polynomial quantity of resources in \(N\) and \(t\) . The solution is computed as a modular exponentiation, which is implemented as an iterative algorithm that has modular multiplication (squaring) at its kernel. Each modular multiplication follows the completion of the previous one. One way to accelerate the solution is to complete a single modular multiplication instance faster.
In this article, we present several multi-cycle-based implementations that target currently available mid-grade FPGA devices. A significantly larger FPGA die, having more DSP Blocks, could potentially reduce the latency by as much as 10 \(\times\) . Later architectures, built on newer process-node technologies that could provide this level of density, would likely also run faster, although this would be offset by the increased combinatorial delay through the single core. Likewise, spreading this problem out over multiple FPGAs would also increase combinatorial delay due to inter-chip communication times, ultimately leading to slower modular multiplications and thus slower modular exponentiations.
The main contributions of this work are the following:
We show some modified approaches to high-precision modular multiplication that can be used to implement a more efficient FPGA solution.
We show a multi-cycle implementation that can also map to more modest-sized FPGAs, and also develop a left-to-right calculation with which we can perform the multiplication and reduction steps at the same time.
We extend the multi-cycle approach to tackle modular squarings — as required by the modular exponentiation for solving the puzzle. We show that these new architectures outperform multiplier-based ones in both area and latency.
We introduce a robust error-checking mechanism to independently verify running results of a long computation run. We show how this can be used to safely overclock large complex systems such as the CSAIL2019 solver.
We describe the directed internal pipelining of a large combinatorial circuit to improve fitting and timing closure.
This article is organized as follows. We first describe the CSAIL2019 problem statement in Section 2. Next, Section 3 reviews several prior FPGA implementations of modular multiplication, which is the core function of these puzzles. We then describe in Section 4 a breakthrough approach in detail for large-precision FPGA modular multiplications [39], followed by an improvement to create more efficient designs for current FPGA devices [29]. We then describe in Section 5 two classes of CSAIL2019 flexible solver architectures: (a) multiplier based and (b) squarer based —both of which employ a novel left-to-right calculation method. We then describe our directed pipelining approach that allows for reduced overall latency (Section 5.6), as well as our error-detection mechanism (Section 5.5) that allows us to safely overclock the FPGA. The proposed architectures are then benchmarked in Section 6 both in terms of place-and-route results and in terms of overclocking potential. We propose several possible leads for future optimizations in Section 7. Finally, we conclude in Section 8.
In this work, we use the term latency to denote the total time from the input to output of a circuit, independent of how many clock cycles are taken.

2 CSAIL2019 Problem Statement

CSAIL2019 is specified as the computation of \(2^{2^t}\bmod N\) , with \(t = 2 ^ {56} = 72057594037927936\) , and N = 3072 bits.
N=4748097547272012866175034130616773885051260744920056444867106196360710424558147654252707604941012311775892012567579064620536874633385055919001167621577710311366072057029421705135684303934811390137937802096433163959216892351184826691180016055198866796536230085523200683549066995672155839042282955591568494603061113292039044753843846484807112228389204239581712931108919820250218586352043897306238872025378193141111507426311444613498736315614218304761735541626997839036517728000688394015610618179768868342070395100147620295616695834440894241147905565567808298149024668527045239650145862092904119412874007763041042314287604772876861294417664020832796209135587181826458235580003825823724235800850160284850809737200983703552179354691863876044443377822439834079313578029085658078575731290244778595615229472411326831502667425768520006371752963274296294506063182258064362048788338392528266351511304921847854750642192694541125065873977.
While the full puzzle requires a solution for \(t = 2 ^ {56}\) , CSAIL is also interested in solutions for \(t = 2 ^ k\) for \(56/2 \le k \lt 56\) . These intermediate solutions are called "milestone versions of the puzzle".
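The statement above can be modeled at toy scale: \(2^{2^t} \bmod N\) is obtained by \(t\) successive modular squarings of 2, each depending on the previous result. The sketch below uses made-up small parameters (the real puzzle uses \(t = 2^{56}\) and the 3072-bit \(N\) above); the cross-check against Python's built-in `pow` is only feasible because the toy \(t\) is tiny.

```python
# Toy-scale model of the problem statement: 2^(2^t) mod N via t
# sequential modular squarings. Parameters are made-up stand-ins;
# the real puzzle uses t = 2^56 and the 3072-bit N given above.

def solve(t, N):
    x = 2
    for _ in range(t):          # t intrinsically sequential squarings
        x = (x * x) % N         # each step needs the previous result
    return x                    # x = 2^(2^t) mod N

N_toy = 1_000_003 * 999_983     # small RSA-like toy modulus
t_toy = 20
# cross-check against direct exponentiation (feasible only for tiny t)
assert solve(t_toy, N_toy) == pow(2, 2**t_toy, N_toy)
```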

3 Prior Work

The current interest in high-performance large-precision modular multiplication in FPGAs is driven by blockchain, with VDF (Verifiable Delay Functions) as a key motivation. One inflection point was around the VDF Alliance Contest of 2019 [6], which was initiated shortly after the Cryptophage solution of LCS35 was disclosed.
The breakthrough that made this level of modular multiplication performance possible in FPGAs is due to Ozturk [39], which will be explored in more detail in the next section.
The VDF Alliance FPGA contest consisted of the computation of \(x^{2^t}\bmod N\) with \(t = 2 ^ {33}\) and \(N\) = 1024 bits. Therefore, each iteration is about an order of magnitude smaller than the modular multiplication used by CSAIL2019. Moreover, the number of iterations is also significantly smaller. The FPGA target was the Amazon F1 card [11], which used Xilinx Virtex Ultrascale+ VU9P devices [17]. These devices contain 1182K 6-LUTs and 6840 DSP (DSP48E2) blocks [10]. Each DSP Block can support a 27x18 multiplier instance, with the ability to cascade and sum multiple DSP Blocks together. The VDF competition was held in 2 rounds, with the winner of the first round [8] performing each modular squaring in 28.6 ns. The second round of the contest was run, with the goal of improving on the latencies of the successful architectures of the first round. Pearson [7] was the winner again, with the latency of each iteration reduced to 25.2 ns.
A third round was run in order to find the fastest solution that was not based on Ozturk’s method. This round was won by Ben Devlin [25] using a Montgomery multiplier-based approach [32]. Performance was approximately half the speed of the Ozturk multiplier, at 46 ns. Devlin also reported that he investigated and rejected the other non-Ozturk approaches suggested by the VDF contest, such as the Chinese Remainder Theorem [20] (no reason given) and Barrett’s reduction [19] (because of the impact of the final adjustment subtractor, presumably as it requires a full carry propagation as opposed to the redundant bit value of the Ozturk method). Devlin also investigated alternate multiplier constructions, such as Toom-Cook, Karatsuba [27], and Booth’s recoding [22], but stated that these were not suitable, without explanation. We will analyze the effect of the Karatsuba approach in detail later in this section. This round also solicited algorithms in addition to implementation. The winner of this section described nested reductions in an RNS (Residue Number System) [37].
More recently, the VU9P target was used to build a DSP Block-based Montgomery multiplication using HLS [36], in order to further reduce latency. This design achieved 22.91 ns latency for a 512-bit modular multiplication, but at the cost of 203K 6-LUTs and 2367 DSP Blocks. This technique cannot be scaled up to the VDF Alliance contest problem as the number of single-cycle resources would exceed the DSP resources available on any current device.
Both Montgomery’s and Barrett’s approaches have been previously used in building high-precision FPGA modular multipliers. The smaller resource counts of earlier FPGA devices are the reason that most of the previously reported designs are based on folded (multi-cycle or resource-shared) architectures. Devlin’s architecture is also multi-cycle but uses predictive branching to improve throughput (i.e., reduce the overall number of cycles required). Before the advent of VDF, in which latency is the most important metric, resource reduction, especially of DSP Blocks, appeared to be the key goal. As an example, an earlier Montgomery multiplier [26] used 3 iterations of a 4-level Karatsuba decomposition to perform a 256-bit modular multiply. This design used 81 18x18 multipliers. We can compare this to Devlin’s design, which has a 16 \(\times\) higher arithmetic complexity (1024-bit vs. 256-bit) but a 28 \(\times\) increase in multiplier count. Iterative Montgomery-style modular multipliers of up to 1024 bits are also studied in [33] for FPGA devices. Barrett’s method requires even deeper pipelining. One recent design [35] distributed six 48-bit multipliers over the calculation of a 512-bit modular multiplication. This calculation required 79 cycles. The automatic generation of folded modular multipliers is studied in [21]. Results are reported up to 256 bits but could clearly be extended to wider bit widths.
A large modular multiplier based on an updated Barrett’s approach was recently reported [30]. We will analyze this design because it reports details on different combinations of multi-radix Karatsuba decomposition, FPGA mapping, and analysis of Barrett’s algorithm. All of these were mentioned by Devlin; thus, we can now examine why the latest approaches used for VDF-type calculations (long iterations of modular exponentiations) are better.
If we look at the history of large multiplier implementations in FPGAs [24, 28, 31, 38], we can see that Karatsuba methods feature prominently. Latency (or more usually, pipeline depth) will generally not be a consideration; rather, the reduction in the number of resources will be considered, especially DSP Blocks.
Barrett’s algorithm requires three chained multiplications: an initial full multiplication, followed by two multiplications to create an approximate estimate of the modular value. A fine adjustment then takes place to calculate the exact result. The two multipliers in the estimation phase do not have to be full multipliers; we need the most significant bits (MSBs) from one of them and the least significant bits (LSBs) from the other, which means that we do not have to construct the full parallel multipliers for this. The reported method used a looser error bound to reduce the required accuracy of two of the three multiplier operations (i.e., allowing a multi-bit error in both of the subset multipliers) in the algorithm, thereby reducing DSP and logic resource requirements. Although the Karatsuba decomposition still requires that we need to generate all of the partial products, we do not need to sum all of the combinations of partial products together, which saves a significant amount of soft logic. Additionally, perhaps of even greater impact is the simplification of the place and route problem.
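The three-multiplication structure just described can be illustrated with the textbook form of Barrett modular multiplication (not the looser-error-bound variant of [30]); the variable names and toy modulus below are ours:

```python
# Hedged sketch of classic Barrett modular multiplication: one full
# product, two truncated multiplications estimating floor(P/N), and a
# small final adjustment. Toy parameters; not the FPGA implementation.

def barrett_modmul(a, b, N, k, mu):
    P = a * b                              # multiplication 1: full product
    q = ((P >> (k - 1)) * mu) >> (k + 1)   # mult 2: MSB-truncated estimate
    r = P - q * N                          # mult 3: q * N, then subtract
    while r >= N:                          # fine adjustment: at most a few
        r -= N                             #   full-width subtractions
    return r

N = 0xF123_4567_89AB_CDEF                  # arbitrary toy odd modulus
k = N.bit_length()
mu = (1 << (2 * k)) // N                   # precomputed once per modulus
a = 0xDEAD_BEEF_0BAD_F00D % N
b = 0xCAFE_F00D_1234_5678 % N
assert barrett_modmul(a, b, N, k, mu) == (a * b) % N
```

Note how `q` underestimates `floor(P / N)` by at most a small constant, which is why the final adjustment needs only a couple of full-carry subtractions; this subtraction is exactly the step Devlin cited as the latency drawback versus the redundant-form Ozturk method.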
The aggregate cost of the modified Barrett’s approach is 255K ALMs, 1183 DSP Blocks, and requires 143 cycles per modular multiplication. A key point is that this is the depth of the modular multiplier —once the pipeline is full, a new result can be output per clock cycle. However, a VDF function depends on round-trip latency, not the number of modular multiplications per unit time. The 500-MHz target performance is achieved on the older Stratix 10 [1] FPGA. It is likely that there would be a 35% increase in performance if recompiled into the newer Agilex [23] device family. We can therefore estimate the expected 1-Kb squaring time at 212 ns. We can now compare the three approaches: Ozturk, predictive Montgomery, and modified Barrett’s, as shown in Table 1.
Table 1.
Architecture   LUTs         DSP          Latency (ns)
Ozturk         464K (39%)   2212 (32%)   25.2
Montgomery     201K (17%)   2272 (33%)   46.0
Barrett        255K (21%)   1183 (17%)   est. 212
Table 1. Performance of 1024-bit Modular Multiplication Methods (Xilinx Virtex Ultrascale+ VU9P)
We need to look at this data with many qualifiers. We estimated the performance improvement of Barrett’s design when ported to a newer and faster FPGA. We treat the cost and capability of the Xilinx/AMD and Altera/Intel devices for soft logic and DSP resources the same, although there are differences. While the Ozturk and Montgomery architectures have been optimized for squaring (which cuts the resource usage approximately in half compared with a full multiplication), Barrett’s architecture uses a full A \(\times\) B multiplication. If Barrett’s design were optimized for squaring, the area reduction would not be as high, as the squaring tuning would only apply to the first multiplier. A 20% area reduction of the three-multiplier system would be more likely in this case.
One of Devlin’s goals was to implement a design that fits into one of the super logic regions (SLRs) [9] of the VU9P to improve timing closure. An SLR is an FPGA die, and multiple SLRs are combined using an interposer to create a larger FPGA device. Crossing SLR boundaries reduces performance because of the interface latency. In contrast, the Pearson design is spread out over the entire FPGA. This may become more significant as we try to support the much larger (3072-bit vs. 1024-bit) word size for CSAIL2019. None of these designs reports routing stress, which will become more pronounced for this new puzzle, as the arithmetic complexity of each iteration increases by roughly an order of magnitude.
At the 1024-bit word size, we can see that the Ozturk method is clearly the winner when the important metric is latency. Efficiency only becomes an issue if it prevents the design from running in a single FPGA, although with the approximately 9 \(\times\) increase in multiplier size, we can see this is a possibility. We have a number of ways to manage this. First, we will explore high-performance multi-cycle approaches, which may fit into current FPGAs more readily. Second, we need to find more efficient ways of implementing the Ozturk algorithm. DSP blocks are intrinsically more efficient than soft logic, as the functionality is already in ASIC form. This gives us a strong motivation to restate the Ozturk approach from a table-based to an arithmetic-based reduction operation.
Finally, it is important to remember that although the Ozturk approach is the best for round-trip latency, a modified Barrett’s design is still both the most efficient (fewest resources per computation) and the fastest for throughput (for example, when many independent RSA cryptographic operations run simultaneously), although its latency is many times that of the Ozturk or predictive Montgomery methods.

4 Background

In this section, we start by briefly presenting the right-to-left modular exponentiation algorithm as used to obtain the solution to the CSAIL2019 puzzle. Having reduced the computation to a succession of modular squarings, we then present techniques behind current FPGA modular multiplication methods. We start with a review of Ozturk’s approach and then present a newer method [29] that uses the embedded FPGA DSP resources for a more efficient and higher performance result. In the next section, we will further develop the DSP-based approach into an efficient multi-cycle version.
The solution to the puzzle is computed as the result of the modular exponentiation:
\begin{align*} R = 2^{2^t}\bmod N. \end{align*}

4.1 Modular Exponentiation

An approach often used to compute modular exponentiations is the right-to-left algorithm, which is based on the binary expansion of the exponent. Let us write the general form for modular exponentiation below, where we denote by \(x\) the base, and by \(e\) the generic exponent:
\begin{align*} R = x^{e}\bmod N. \end{align*}
Expanding \(e\) into its binary representation (and assuming that \(e \ge 0\) ), we have that
\begin{align*} e=\sum _{i=0}^{w-1}e_i2^i, \end{align*}
with \(e_i\) being the binary digits of the exponent \(e\) . Then, we can write
\begin{align} R &= x^{\sum _{i=0}^{w-1}e_i2^i }\bmod N \nonumber \nonumber\\ &= \prod _{i=0}^{w-1}x^{e_i2^i}\bmod N. \end{align}
(1)
Equation (1) shows how the modular exponentiation can be computed as a product of terms of the form \(x^{e_i2^i}\) . The value of the term is
\begin{equation*} x^{e_i2^i}= \left\lbrace \begin{array}{ll} x^{2^i} & \textrm {when $e_i$ is 1 } \\ 1 & \textrm {otherwise.} \end{array} \right. \end{equation*}
Moreover, using the identity
\begin{align*} x^{2^i} \equiv x^{2^{i-1}} x^{2^{i-1}} \bmod N, \end{align*}
it can be observed that the term \(x^{2^{i}}\) can be computed starting from the previously computed term \(x^{2^{i-1}}\) .
A straightforward implementation of the generic right-to-left algorithm uses two running values, one that holds the base and that is updated at each iteration and another holding the running value of the exponentiation, updated only if the corresponding exponent bit is high. As previously shown, both base and running exponentiation value updates consist of a modular multiplication or modular squaring in the case of the base update. The exponentiation update is computed as the modular multiplication between the current base and current exponentiation value, whereas the base update is computed as the modular squaring of the current base value.
In the case of CSAIL2019, the binary form of \(e=2^t\) can be exploited to simplify this exponentiation algorithm. In particular, since \(t=2^{56}\) , we know that bits \(e_0\) to \(e_{2^{56}-2}\) will all be 0, whereas bit \(e_{2^{56}-1}\) is 1. Consequently, we now know that at each iteration only the modular base squaring is required. Moreover, since a single update of the running exponentiation value \(R\) is performed throughout the algorithm, no actual modular multiplication is required. This simplified version of the approach is presented in Algorithm 1.
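A minimal software model of the generic right-to-left algorithm, with the two running values described above, might look as follows (an illustrative sketch, not the FPGA datapath):

```python
# Generic right-to-left modular exponentiation with two running values:
# the base (squared every iteration) and the result (updated only when
# the corresponding exponent bit is set).

def right_to_left_pow(x, e, N):
    base, result = x % N, 1
    while e:
        if e & 1:                          # exponent bit e_i is set
            result = (result * base) % N   # exponentiation update
        base = (base * base) % N           # base update: modular squaring
        e >>= 1
    return result

N = 2**61 - 1
assert right_to_left_pow(3, 2**10, N) == pow(3, 2**10, N)
```

For \(e = 2^t\) , only the final exponent bit is set, so the loop degenerates to \(t\) base squarings and a single result update, which is exactly the simplification captured by Algorithm 1.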

4.2 Integer Multiplication

In this section, we discuss large integer multiplications in the context of low-latency implementations. For this, we implement large integer multiplication as a polynomial multiplication. The inputs \(A\) and \(B\) are unsigned integers represented in polynomial form as degree \(d\) , radix \(R=2^{w+1}\) digits. The polynomial is constructed with a 1-bit overlap between consecutive digits. This redundancy feature allows reduction of the latency of modular multiplication, as will be made clear in this section. However, we note that for the initial conversion from unsigned integers to polynomial form, the top bit in all coefficients \(a_i\) and \(b_i\) will be forced to "0".
\begin{align*} A=\sum _{i=0}^{d}a_{i}2^{wi}, B=\sum _{i=0}^{d}b_{i}2^{wi} \end{align*}
From the radix- \(R\) digit notation, the polynomial notation ( \(x=2^{w}\) ) follows naturally:
\begin{align*} A(x)=\sum _{i=0}^{d}a_{i}x^i, B(x)=\sum _{i=0}^{d}b_{i}x^i. \end{align*}
Here, \(a_{i}, b_{i}\) are the coefficients of the polynomials and correspond to the radix- \(R\) digits from the original representation. This is highlighted in Figure 1.
Fig. 1.
Fig. 1. Integer viewed as a degree- \(d\) polynomial with overlapping coefficients leading to a redundant representation.
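The digit decomposition of Figure 1 can be modeled in a few lines. Here \(w=26\) is chosen to match the 27-bit DSP multipliers discussed later in the article, and the helper names `to_poly`/`from_poly` are ours:

```python
# Sketch of the digit decomposition of Figure 1. Assumption: w = 26
# (matching 27-bit DSP multipliers); helper names are ours.

w = 26

def to_poly(A, d):
    # initial conversion: plain w-bit digits, so the redundancy
    # (top) bit of each radix-2^(w+1) coefficient is forced to "0"
    return [(A >> (w * i)) & ((1 << w) - 1) for i in range(d + 1)]

def from_poly(coeffs):
    # evaluate at x = 2^w; also valid for redundant (w+1)-bit digits,
    # where overlapping top bits simply carry into the next position
    return sum(c << (w * i) for i, c in enumerate(coeffs))

A = 0x1234_5678_9ABC_DEF0_1357_9BDF
d = (A.bit_length() - 1) // w                 # smallest sufficient degree
assert from_poly(to_poly(A, d)) == A
```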
The product \(P\) of two degree- \(d\) polynomials \(A\) and \(B\) is a degree \(2d\) polynomial. Figure 2 highlights the partial products for \(d=3\) . The subproducts \(a_ib_j\) are \(2w+2\) -bit wide values, and can be written in terms of two \(w\) -bit values and one 2-bit value ( \(P_{i,j}^\textrm {H}\) ). Note that \(L\) , \(M\) , and \(H\) indicate the low, medium, and high partial-product limbs, respectively:
\begin{align*} a_ib_j&=P_{i,j} =P_{i,j}^\textrm {H}2^{2w} + P_{i,j}^\textrm {M}2^w + P_{i,j}^\textrm {L}. \end{align*}
Fig. 2.
Fig. 2. Subproduct alignments, column-based summations, and short carry-propagation stage for a degree-3 polynomial multiplication (corresponding to 4-digit radix \(R\) inputs), where partial products are split into three parts.
Knowing that \(x=2^w\) , the subproduct alignments are such that
\(P_{i,j}^\textrm {H}\) overlaps over \(P_{k,l}^\textrm {M}\) where \(k+l=i+j+1\) ,
\(P_{i,j}^\textrm {M}\) overlaps over \(P_{k,l}^\textrm {L}\) where \(k+l=i+j+1\) ,
\(P_{i,j}^\textrm {H}\) overlaps over \(P_{k,l}^\textrm {L}\) where \(k+l=i+j+2\) .
These alignments can be observed in Figure 2. The columns of subproduct sections aligned at weights \(2^{wi}\) correspond to non-evaluated sums that, when evaluated, correspond to coefficients of the output polynomial. Therefore, each column is summed together (columns have between 1 and 10 subproduct components for the degree-3 polynomial product in Figure 2) to generate intermediary coefficients \(D_i\) , having widths ranging from \(w\) bits (for \(D_0\) ) to \(w+4\) bits (for \(D_4\) ).
\begin{align*} D_k = \sum _{i+j+2=k} P_{i,j}^H + \sum _{i+j+1=k} P_{i,j}^M + \sum _{i+j=k} P_{i,j}^L \end{align*}
Alternatively, the subproducts \(a_ib_j\) ( \(2w+2\) -bit wide values) can be written in terms of a \(w\) -bit ( \(P_{i,j}^\textrm {L}\) ) value and one \(w+2\) -bit value ( \(P_{i,j}^\textrm {H}\) ):
\begin{align*} a_ib_j&=P_{i,j} =P_{i,j}^\textrm {H}2^{w} + P_{i,j}^\textrm {L}. \end{align*}
This increases the column-wide adders by up to two bits but reduces the number of terms to be summed up, which ultimately reduces latency. The subproduct alignments for the same degree-3 polynomial multiplication but for a 2-way split of the subproducts is depicted in Figure 3. We can observe that compared with Figure 2, the tallest column ( \(D_4\) ) is now reduced from 10 to 7 terms. This reduces the adder tree depth by one level, thus reducing latency.
Fig. 3.
Fig. 3. Subproduct alignments, column-based summations and short carry-propagation stage for a degree-3 polynomial multiplication (corresponding to 4-digit radix \(R\) inputs) where partial products are split into two parts.
In both cases, a set of \(w\) -bit wide additions, aligned on the column outputs, are used for creating modified polynomial coefficients such that their maximum widths do not exceed \(w+1\) bits. These short adders sum the lower \(w\) bits of \(D_i\) ( \(D_i \textrm { mod } 2^{w}\) ) with the bits having weights larger than \(2^{w-1}\) from \(D_{i-1}\) ( \(D_{i-1}\gg w\) ). This propagation is only required for \(i\ge 2\) : the \(i=0\) subproduct column has only one term, so \(D_0\) holds on \(w\) bits and contributes no carry to \(C_1\) .
\begin{align*} C_i = (D_i \textrm { mod } 2^{w}) + (D_{i-1}\gg w), i\in [2, 2d+2] \end{align*}
This level of adders is depicted at the bottom of Figures 2 and 3.
Finally, the product \(P\) can be written in polynomial form:
\begin{align*} P=\sum _{i=0}^{2d+2}C_ix^i, \end{align*}
with \(C_i\) holding on \({w+1}\) bits.
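The 2-way-split column summation and short carry propagation described above can be modeled at toy scale as follows. This is an illustrative sketch with our own names; Python integer adds stand in for the hardware adder trees:

```python
# Toy model of the 2-way-split column summation: subproducts a_i*b_j
# are split into a w-bit low part and a high part, summed per column
# into D_k, then short adders produce (w+1)-bit coefficients C_i.
import random

w, d = 26, 3

def poly_mul_columns(a, b):
    D = [0] * (2 * d + 3)                     # intermediary coefficients D_k
    for i in range(d + 1):
        for j in range(d + 1):
            p = a[i] * b[j]                   # subproduct P_{i,j}
            lo, hi = p & ((1 << w) - 1), p >> w   # 2-way split: P^L, P^H
            D[i + j] += lo                    # P^L lands in column i+j
            D[i + j + 1] += hi                # P^H lands in column i+j+1
    # short carry propagation: keep w bits per column, push the rest up
    C, carry = [], 0
    for Dk in D:
        C.append((Dk & ((1 << w) - 1)) + carry)   # (w+1)-bit coefficient
        carry = Dk >> w
    return C

random.seed(1)
a = [random.getrandbits(w) for _ in range(d + 1)]  # standard-form digits
b = [random.getrandbits(w) for _ in range(d + 1)]
C = poly_mul_columns(a, b)
A = sum(x << (w * i) for i, x in enumerate(a))
B = sum(x << (w * i) for i, x in enumerate(b))
assert sum(c << (w * i) for i, c in enumerate(C)) == A * B
```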

4.3 Integer Squaring

The multiplication implementation can be simplified if inputs \(A\) and \(B\) match. In this case, the problem is reduced to that of squaring a single input. In our case, in which we operate on polynomials, we note that some of the cross products are identical, \(P_{i,j} = P_{j,i}\) . Moreover, these products also have the same weights in the final additions:
\begin{align*} P_{i,j}+P_{j,i} &= a_ia_j2^{(i+j)w} + a_ja_i2^{(j+i)w} \\ &= 2a_ia_j2^{(i+j)w} \\ &= 2P_{i,j} \end{align*}
Therefore, the computation of the two subproducts and their sum can be replaced with the computation of a single subproduct, and the sum is replaced with a simple 1-position left shift, which in our case only amounts to wiring. The number of addends in the expression for the intermediary coefficients \(D_k\) is therefore reduced to nearly half (7 down to 4 for the tallest column), as can be observed in Figure 4. This reduces the number of levels of the corresponding adder tree, thus improving latency. For all products \(P_{i,j}\) for which \(i\ne j\) , the corresponding \(P_{i,j}^{H}\) terms will be \(w+3\) -bits wide (due to the 1-bit left shift), whereas for \(P_{i,i}\) the corresponding high term \(P_{i,i}^{H}\) will be 2 bits wide. Conversely, for all products \(P_{i,j}\) for which \(i\ne j\) , the corresponding \(P_{i,j}^{L}\) term will be shifted left by one position when compared with the \(2^{kw}\) alignments.
Fig. 4.
Fig. 4. Subproduct alignments, column-based summations and short carry-propagation stage for a degree-3 polynomial squaring (corresponding to 4-digit radix \(R\) inputs) where partial products are split into two parts.
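A corresponding toy model of the squaring datapath shows the halving of column heights: each symmetric cross product is computed once and doubled by a 1-bit left shift (wiring only in hardware). As before, the names are ours and Python integer adds stand in for the adder trees:

```python
# Toy model of column-based squaring: cross products P_{i,j} with
# i != j are computed once and doubled by a 1-bit shift, nearly
# halving the number of column addends versus full multiplication.
import random

w, d = 26, 3

def poly_sqr_columns(a):
    D = [0] * (2 * d + 3)
    for i in range(d + 1):
        for j in range(i, d + 1):             # each symmetric pair once
            p = a[i] * a[j]
            if i != j:
                p <<= 1                       # doubling: 1-bit shift (wiring)
            lo, hi = p & ((1 << w) - 1), p >> w
            D[i + j] += lo
            D[i + j + 1] += hi
    C, carry = [], 0
    for Dk in D:                              # short carry propagation
        C.append((Dk & ((1 << w) - 1)) + carry)
        carry = Dk >> w
    return C

random.seed(42)
a = [random.getrandbits(w) for _ in range(d + 1)]
A = sum(x << (w * i) for i, x in enumerate(a))
C = poly_sqr_columns(a)
assert sum(c << (w * i) for i, c in enumerate(C)) == A * A
```

In this model the tallest column receives 4 addends instead of the 7 of the 2-way-split multiplier, matching the reduction visible in Figure 4.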

4.4 Modular Reduction

The second part of the modular multiplication involves reducing \(P\) \((\textrm {mod } N)\) :
\begin{align*} M = P \bmod N, \end{align*}
where \(P\) is the polynomial previously generated on the output of the multiplier.
In the context of modular exponentiation, we do not require an exact (non-reducible) \(M\) ; rather, any equivalent \(M\) is sufficient as long as the output has the same form as the input of the polynomial multiplication. Moreover, we favor solutions that are cost-efficient to obtain.
The following identity is used for obtaining \(M\) :
\begin{align*} (A+B) \textrm { mod } N &\equiv ((A \textrm { mod } N) + (B \textrm { mod } N)) \textrm { mod } N \\ &\equiv ((A \textrm { mod } N) + B) \textrm { mod } N. \end{align*}
Note that with respect to the degree \(d\) of the input polynomials \(A\) and \(B\) , the degree of the modulus \(N\) is \(d-1\) .
We split \(P\) in two parts:
\begin{align*} P=\sum _{i=d}^{2d+2}C_ix^i+\sum _{i=0}^{d-1}C_ix^i. \end{align*}
Next, the high part is composed of \(d+3\) radix \(2^{w+1}\) coefficients. For each coefficient and for a constant value of \(N\) , the reduced value mod \(N\) can be pre-computed and tabulated:
\begin{align*} M_i = C_ix^i \textrm { mod } N, i \in [d, 2d+2]. \end{align*}
Additionally, each \(M_i\) can be viewed as a polynomial of degree \(d-1\) , with coefficients \(M_{i,j}\) radix \(2^{w}\) digits. This allows for the following rewrite:
\begin{align*} M=\sum _{i=0}^{d-1}\left(C_i+\sum _{j=d}^{2d+2}M_{j,i}\right)x^i \textrm { mod } N. \end{align*}
This results again in column-based summations, as shown in Figure 5.
Fig. 5.
Fig. 5. ROM-based polynomial reduction modulo a degree-2 modulus \(N\) , resulting in a degree-3 output polynomial.
A final reduction consisting of \(d-1\) parallel \(w\) -bit adders (similar to the structure at the bottom of Figure 2 for the case of integer multiplier) is implemented in order to obtain \(w+1\) -bit wide coefficients for the output polynomial. This is shown on the bottom of Figure 5.
Note that the previous reduction phase produces an output that is in redundant form. An alternative implementation based on a full-width adder would produce an output in standard form (non-overlapping coefficients). This would have a higher area (associated with pipelining) and longer latency ( \(\approx (d-1)\times\) higher for a ripple-carry adder implementation). Fortunately, this is not required, since the polynomial multiplier described in Section 4.2 is specifically designed to accept coefficients in this redundant form. Moreover, the \(w+1\) bitwidth used for the coefficient radix \(R\) is selected to match the DSP Block multiplier size. For instance, Intel FPGA devices contain 27-bit multipliers, making \(w=26\) a good choice.
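The congruence underlying this reduction scheme can be checked numerically. In the sketch below, Python's `%` operator stands in for the tabulated values \(M_i\) , and all parameters are arbitrary toy choices of ours:

```python
# Numerical check of the polynomial reduction identity: folding the
# reduced high-part terms M_i = C_i * x^i mod N into the low part
# preserves the value mod N. Toy parameters; % stands in for the ROMs.
import random

w, d = 26, 3
x = 1 << w                                   # polynomial variable x = 2^w
random.seed(7)
N = random.getrandbits(w * d) | 1            # toy odd modulus of degree d-1
C = [random.getrandbits(w + 1) for _ in range(2 * d + 3)]  # redundant digits

low = sum(C[i] << (w * i) for i in range(d))               # low part, kept
high_reduced = sum((C[i] * pow(x, i, N)) % N               # M_i terms
                   for i in range(d, 2 * d + 3))
P = sum(c << (w * i) for i, c in enumerate(C))             # full value of P
assert (low + high_reduced) % N == P % N                   # congruence holds
```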

4.5 Multiplicative Reduction

Lookup-based reduction combined with redundant polynomial-based multiplication is a significant improvement for low-latency modular multiplication in FPGAs. Nonetheless, tackling very large word sizes exposes a significant limitation of the approach —as the partial reduction values \(M\) are stored in lookup tables (LUTs), the amount of logic required to store the tables can become prohibitively large (Section 5.2 presents a resource estimation in the context of the CSAIL2019 3072-bit multiplier size). In addition to high logic utilization, place and route will also become an increasingly difficult problem as multiplier size increases. The partial reduction terms \(M_{j,i}\) need to be summed with the result of the multiplicative expansion ( \(C_i\) ), with many long wires inevitably being necessary for routing operands to the corresponding adder trees.
A recent solution presented in [29] reduces the LUT reduction cost (and alleviates some of the routing problems) by implementing a multiplier-based reduction (therefore, replacing the LUTs in Ozturk’s algorithm with DSP Blocks).
The DSP-based solution is based on the following rewrite:
\begin{align*} M_i = C_i (x^i \textrm { mod } N), i \in [d+2, 2d+2]. \end{align*}
Here, ( \(x^i \textrm { mod } N\) ) is a constant (for constant \(N\) ); therefore, it can be pre-computed. Then, \(M_i\) is calculated by simply multiplying \(C_i\) by that constant. This can be implemented by using a layer of DSP blocks, as shown in Figure 6. Note that \(M_i\) is now \(w+1\) bits wider than \(N\) . To account for this, the input polynomial degree needs to be increased by 1.
Fig. 6.
Fig. 6. Simplified diagram showing the use of DSP blocks for performing modular reduction in polynomial form.
The multiplicative modular-reduction scheme presented here significantly reduces the number of adder-tree terms compared with the lookup-based solution, reducing the overall latency of the implementation as shown in [29].
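The rewrite above can be sketched at the integer level. The following Python model is an illustration of the idea only (the function name and the digit sizes are ours, not the authors' RTL): each out-of-range digit \(C_i\) is folded back by multiplying it with the precomputed constant ( \(x^i \bmod N\) ).

```python
def fold_high_digits(C, N, w, d):
    """Return a value congruent to C (mod N) in which every digit above
       index d has been replaced by its precomputed residue M_i."""
    R = 1 << w                       # digit radix, x = 2^w
    acc = C & ((1 << (w * d)) - 1)   # digits 0..d-1 pass through unchanged
    i, high = d, C >> (w * d)
    while high:
        Ci = high & (R - 1)
        acc += Ci * pow(R, i, N)     # M_i = C_i * (x^i mod N); constant for fixed N
        high >>= w
        i += 1
    return acc                       # congruent to C mod N, only slightly wider than N
```

In hardware, each product \(C_i \cdot (x^i \bmod N)\) maps onto a row of DSP blocks and the final summation onto an adder tree.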

5 CSAIL2019 Solver Architectures

5.1 Iterative Modular Multiplication

CSAIL2019 uses a much larger value of the modulus \(N\) (3072-bit) compared with VDF (1024-bit) or LCS35 (2048-bit). A fully parallel [29, 39] 3072-bit modular squaring operation does not fit into even the largest FPGAs available today.
An iterative approach saves FPGA area, but it also increases latency. Therefore, it reduces overall design performance. A straightforward iterative modular multiplication mapping uses an iterative multiplication block followed by an iterative modular reduction block. Therefore, the overall operation latency is a sum of the multiplication latency and the modular reduction latency.
In order to reduce latency, we developed an optimized iterative modular multiplication algorithm in which the iterative multiplication and the iterative modular reduction operate in parallel, with a 1-iteration offset. The first modular reduction iteration, operating on a partial multiplication result, starts immediately after the first multiplication iteration. Consequently, the overall modular multiplication latency is only one modular reduction iteration latency higher than the latency of a regular (non-modular) iterative multiplication.
The parametrized multi-cycle high-level modular multiplication approach is shown in Algorithm 2. The implementation splits the multiplication and reduction operations over \(n\) cycles. For this, the \(W\) -bit wide inputs \(A\) and \(B\) are subdivided into \(n\) limbs each. On every iteration (loop index \(i\) is decremented from \(n-1\) down to 0), \(A\) is multiplied by \(B_i\) by means of a rectangular multiplier to produce an \(n+1\) -limb product. The lower \(n\) limbs of the product are stored in variable \(M\) for use in the next iteration. The upper limb of that product ( \(P_n\) ) is sent to the multiplier-based modular reduction circuit, where it is reduced modulo \(N\) . The reduced value is then fed into the running accumulator \(S\) . Upon completion of the loop, one modular reduction is required in order to reduce \(M \bmod N\) , before constructing the final result \(Z\) .
Using this algorithm, we can calculate ( \(AB \bmod N\) ) in \(n+1\) cycles (assuming multiplication and reduction each take 1 cycle) using \(n\) times fewer resources when compared with a fully parallel implementation. Note that in order to be able to start the modular reduction on cycle 2, we need to perform the multiplication starting with the most significant limb of \(B\) — which corresponds to a left-to-right approach — as opposed to the classical (pen-and-paper) approach, which is right-to-left.
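A functional Python model of this flow may help make the dataflow concrete. This is our sketch: exact integer limbs stand in for the hardware's redundant polynomial coefficients, and each loop iteration corresponds to one hardware cycle.

```python
def mod_mul(A, B, N, n, w):
    """Left-to-right iterative modular multiplication in the style of
       Algorithm 2: n limbs of w bits, reduction interleaved with multiplication."""
    L = 1 << w                            # limb radix
    low_mask = (1 << (n * w)) - 1         # selects the low n limbs
    B_limbs = [(B >> (i * w)) & (L - 1) for i in range(n)]
    M, S = 0, 0
    for i in range(n - 1, -1, -1):        # most significant limb of B first
        P = (M << w) + A * B_limbs[i]     # (n+1)-limb partial product
        M = P & low_mask                  # low n limbs recirculate
        P_n = P >> (n * w)                # overhanging limb
        # fold P_n back using the iteration-dependent constant (2^((n+i)w) mod N)
        S = (S + P_n * ((1 << ((n + i) * w)) % N)) % N
    return (S + M) % N                    # final reduction of the running M
```

The per-iteration constant \(2^{(n+i)w} \bmod N\) reflects the changing weight of the overhanging limb; this is what the iteration-indexed constant storage of the reduction circuit (Section 5.2) corresponds to.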

5.2 Hardware Architecture and FPGA Mapping for the Multiplier-Based Approach

The hardware implementation of this algorithm contains two main modules: the iterative modular multiplier and a customized modular reduction, as shown in Figure 7.
Fig. 7.
Fig. 7. High-level architecture of the iterative modular multiplier used for the CSAIL2019 puzzle.
The iterative multiplier operates on polynomials \(A\) and \(B\) of degree 120 (121 terms) — with \(x\) corresponding to \(2^{26}\) and coefficient radix \(R=2^{27}\) (denoted by [121x27] in Figure 7). This allows for integer inputs of up to \(3146=26\cdot 121\) bits to be represented. This bit-width is sufficient to handle the \(3072+32=3104\) bit \(N^{\prime }\) (the additional 32 bits are required for the error detection mechanism described in Section 5.5). Note that due to the redundant polynomial representation (1-bit overlap between consecutive coefficients), the total number of bits used to manipulate the polynomials is \(27\cdot 121=3267\) .
The polynomial \(B\) is split into 8 limbs, with each limb having 16 coefficients (the most significant limb has only the 9 least significant coefficients populated, with the rest tied to zero). The Polynomial Multiplication component iteratively multiplies \(A\) by the limbs of \(B\) , starting from the most significant one, as previously explained in Algorithm 2. For the first iteration, the product flows through the polynomial adder unaltered and gets split into an upper part \(P_n\) and a lower part \(\lbrace P_{n-1}, \ldots ,P_0\rbrace\) . Subsequent iterations recirculate the aligned low part ( \(\lbrace P_{n-1}, \ldots ,P_0\rbrace \ll 16\) ) back on the second input of the adder, to be summed with the next partial product \(AB_{i-1}\) .
For each iteration, the high part of the sum \(P_n\) is propagated to the Modular Reduction component. The DSP-based modular reduction outputs a 121-coefficient result that is fed into the polynomial accumulator \(S\) . On the last iteration, all but the most significant bit of the most significant limb of \(\lbrace P_{n-1}, \ldots ,P_0\rbrace\) also gets added into \(S\) . The most significant bit of \(P_{n-1}\) is passed through a LUT-based modular reduction and gets added into \(S\) as well. The 3120-bit range offered by the 121-coefficient polynomials ensures that, at the output of the DSP-based modular reduction, no overflow can occur in the most significant coefficient when summing 17 terms of 3104 bits each. Even considering the 8 iterations required for performing the full modular multiplication would not grow the most significant limb contribution above 3111 bits, which is lower than 3120.
Focusing on the modular reduction component, the standard Ozturk modular multiplication [39] implements the modular reduction using an array of read-only memory (ROM), as shown in Figure 5. The multiplicative reduction method [29] (Figure 6) instead uses an array of DSP blocks configured as multipliers by constants. However, neither of these approaches can be directly applied to the iterative modular reduction design because the relative weight of the \(P_n\) term changes with every iteration, thus changing the value that needs to be reduced.
The Ozturk multiplier therefore offers minimal savings in an iterative implementation because all values outside of the \(N\) bit range still need to be tabulated. The expansions in both multiplier and reduction tree are smaller, but most of the logic is in the ROM tables. The total amount of storage is unchanged. The ROM cost for the 3072-bit case (CSAIL2019 bitwidth) can be calculated easily. We split the out-of-band expansion into 6-bit segments, with each segment containing a 3072-bit modulus value. We therefore have 512 ROMs of 3K LUTs each, or over 1.5M 6-LUTs. This is as much logic as some of the largest FPGAs available, which in itself will create a structure that will be difficult to place and route. As a comparison, this calculation shows that the 1024-bit modulus of the VDF Alliance competition would require about 175K 6-LUTs, which appears correct in the context of the reported solution size of 464K 6-LUTs [7].
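The back-of-envelope arithmetic behind these figures can be stated in a few lines (the helper name is ours; the counting assumes, as in the text, one ROM per 6-bit segment and roughly one 6-LUT per ROM output bit):

```python
# LUT cost of tabulated (Ozturk-style) reduction: one ROM per 6-bit segment of
# the out-of-band expansion, each ROM producing a modulus-wide value.
def tabulated_reduction_luts(modulus_bits, segment_bits=6):
    segments = -(-modulus_bits // segment_bits)   # ceil(modulus/segment)
    return segments, segments * modulus_bits      # (ROM count, total 6-LUTs)

assert tabulated_reduction_luts(3072) == (512, 1572864)  # CSAIL2019: ~1.5M 6-LUTs
assert tabulated_reduction_luts(1024) == (171, 175104)   # VDF case: ~175K 6-LUTs
```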
We have a similar problem for the multiplicative reduction method: every iteration needs a different constant. The difference is that the cost of tabulation is much less and in one case can also be absorbed by the FPGA DSP Blocks themselves. The read address for the table is the index of the current iteration (see Figure 8).
Fig. 8.
Fig. 8. Iterative modular reduction using DSPs and ROMs.
Here, the “State” register contains the current iteration index and is used to select the correct constant from the ROMs at every iteration. In some FPGAs [5], a built-in coefficient storage (originally designed to support multi-channel FIR filters [2]) can be repurposed for the \(C_i\) storage. These internal ROMs are 8 elements deep; as long as the number of iterations per modular multiplication does not exceed 8, we are able to absorb the entire coefficient storage into these embedded blocks. The soft logic cost of this method is, therefore, zero, as opposed to 1.5M 6-LUTs in the tabular reduction case. Alternatively, it is also possible to map our iterative ROM tables to soft logic at a cost of 14 half-ALMs per DSP block. One table would be required per DSP Block in the reduction section, or about an additional 27K ALMs.

5.3 Iterative Modular Squaring

As discussed in Section 4.3, we can further optimize the iterative polynomial multiplier if we take into account the fact that inputs \(A\) and \(B\) match. In this section, we demonstrate how to apply this optimization to the iterative polynomial multiplier implementation.
In Section 5.1, we showed that our proposed iterative polynomial multiplication obtains the result by multiplying the input \(A\) iteratively by 8 successive limbs of input \(B\) , starting from the most significant one. The modular reduction part of the modular multiplier then reduces the overhanging 1-limb-wide part of this product. After 8 multiplication and reduction iterations, the final modular multiplication result is obtained at the output of the architecture in Figure 7.
However, in the case of a squarer, all partial products of the form \(P_{i,j}\) for which \(i\ne j\) are computed twice. These duplicate partial products are highlighted in light gray in Figure 9 (note that Figure 9 uses the notation a[i]a[j] to indicate the partial products \(P_{i,j}\) ). Consequently, in order to reduce the compute overhead, we can ensure that these duplicated partial products are computed only once and are accounted for twice.
Fig. 9.
Fig. 9. Iterative squaring: subproduct group identification for a 4-iteration implementation.
By skipping the calculation of the duplicated partial products, we can reduce the number of computed partial products from \(n^2\) (where \(n\) is the number of limbs that the input \(A\) is split into) to \((n^2+n)/2\) . The iterative implementation strategy of the squarer drives the splitting of \(A\) into limbs. For instance, in the case of a 4-iteration squarer ( \(K=4\) ), the number of limbs that \(A\) is split into is 8 ( \(n=8\) ) and the number of subproducts is reduced from 64 to 36, a reduction of about 44%. In the case of an 8-iteration squarer, \(n=16\) and the number of subproducts is reduced from 256 to 136, a reduction of about 47%.
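The identity behind these savings is the classical squaring decomposition: diagonal products appear once, and each off-diagonal product is computed once and weighted by 2. A small sketch (our illustration, with an exact limb split rather than the redundant hardware representation):

```python
def square_via_limbs(A, n, w):
    """Compute A*A from n limbs of w bits using only (n^2+n)/2 subproducts."""
    L = 1 << w
    a = [(A >> (i * w)) & (L - 1) for i in range(n)]
    total = sum(a[i] * a[i] << (2 * i * w) for i in range(n))    # n squarer tiles
    total += sum(2 * a[i] * a[j] << ((i + j) * w)                # (n^2-n)/2 doubled
                 for i in range(n) for j in range(i + 1, n))     # cross products
    return total                                                 # equals A*A

assert (8 * 8 + 8) // 2 == 36 and (16 * 16 + 16) // 2 == 136    # counts from the text
```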
Having identified the duplicate partial products, as depicted in light gray in Figure 9 for \(n=8\) , the next step is to find an efficient way of exploiting the fact that we are skipping these computations. In the case of the iterative multiplier, we use an approach in which the multiplication is split into ‘horizontal’ stripes ( \(AB[i]\) ), where each stripe is implemented by a rectangular multiplier. Analyzing the rectangular stripes at the top of Figure 9, we can see that the number of active subproducts varies among horizontal stripes, making it non-obvious how to exploit the subproduct count reduction in the context of an iterative implementation.
In order to continue operating on stripe-like multiplication structures, we propose the subproduct regrouping depicted at the bottom of Figure 9 for a 4-iteration implementation. The main feature of this regrouping is that the slice ‘shape’ (described here by the total number of subproduct terms and their relative alignment with respect to each other) remains constant for all stripes. Moreover, the number of computed subproducts is minimal (no subproduct is computed twice). In the case of the 4-iteration implementation, the number of subproduct terms computed each iteration is 9 ( \(n+1\) in general), whereas the total number of computed subproducts is 36 ( \((n^2+n)/2\) in general). Similarly, in the case of an 8-iteration implementation (for which \(n=16\) ), the number of subproducts computed during each iteration is 17, whereas the total number of computed subproducts is 136.
Focusing on the subproducts highlighted at the bottom of Figure 9, we distinguish two types of tiles. First, the blue subproducts of the form \(a[i]a[i]\) correspond to squarers. They have a regular weight in the final summation and internally exploit the fact that the inputs match, reducing the number of subproduct terms (during the polynomial multiplication), as shown in Section 4.3. Second, the black subproducts correspond to duplicated tiles. Their weight in the final summation will need to account for this; thus, their contribution to the final sum will be multiplied by 2 (implemented by a 1-bit left shift).

5.3.1 Slice Multipliers.

The custom slice multipliers compute the entire set of partial products required for squaring the input \(A\) over \(K\) iterations. At each iteration, they compute \(n+1\) of the \((n^2+n)/2\) partial products. The partial groups computed at each iteration (index 1 to \(K\) ) are highlighted in Figure 9 for \(K=4\) . The custom slice multipliers operate on two inputs; each input is composed of a set of \(n+1\) limbs. The limbs are obtained from the input \(A\) , which is split into \(n\) consecutive, non-overlapping limbs (having indices \(\lbrace 0..n-1\rbrace\) ). Let \(L=\lbrace l_n, \ldots , l_0\rbrace\) represent the list of limb indices of the left input and let \(R=\lbrace r_n, \ldots , r_0\rbrace\) represent the list of limb indices of the right input. These indices relate back to the partial product groups identified in Figure 9: e.g., \(L_\textrm {Iteration[1]}=\lbrace 7,7,7,7,7,7,7,7,3\rbrace\) and \(R_\textrm {Iteration[1]}=\lbrace 7, 6, 5, 4, 3, 2, 1, 0, 3\rbrace\) .
The following equality holds for the left and right indices:
\begin{align*} r_i + l_i + 1 = r_{i+1} + l_{i+1}, \forall i\in \lbrace 0..n-1\rbrace . \end{align*}
Additionally, since the end partial products of the partial product groups correspond to squarers, the indices of these corresponding multipliers match. Therefore, \(r_n= l_n\) and \(r_0= l_0\) .
Since the \(W\) -bit input \(A\) is split into \(n\) limbs, the weight of each limb is \(2^{iW/n}, i\in \lbrace 0..n-1\rbrace\) . The input \(A\) can then be expressed in terms of its composing limbs:
\begin{align*} A = 2^{(n-1)W/n}A[n-1] + 2^{(n-2)W/n}A[n-2] + \cdots + 2^{2W/n}A[2] + 2^{W/n}A[1] + A[0]. \end{align*}
Since the product of two limbs will produce a two-limb-wide product, the \(n+1\) -limb slice multiplier will produce an \(n+2\) -limb product:
\begin{align} P_\textrm {slice} &= \textrm {Mult}\_\textrm {Slice}(L, R) \nonumber \\ &= \sum _{i=0}^{n}{2^{iW/n}\, 2^{l_i\ne r_i}\, A[l_{i}] A[r_{i}] } \nonumber \\ &= \sum _{i=0}^{n}{\left(2^{iW/n}\, 2^{l_i\ne r_i} \left(2^{W/n} P_i^\textrm {H} + P_i^\textrm {L}\right)\right)} \nonumber \\ &= 2^{(n+1)W/n}P_n^\textrm {H} + \sum _{i=1}^{n}{\left(2^{iW/n}\left(2^{i\gt 1} P_{i-1}^\textrm {H} + 2^{i\lt n} P_{i}^\textrm {L}\right)\right)} + P_0^\textrm {L}. \end{align}
(2)
The partial product limb alignment for the slice multiplier is depicted in Figure 10. Here, conditional exponents such as \(2^{l_i\ne r_i}\) denote a factor of 2 when the condition holds and 1 otherwise, capturing the doubling of the duplicated tiles. The limb-by-limb products \(A[l_i]A[r_i]\) overlap neighboring multiplier products ( \(A[l_{i-1}]A[r_{i-1}]\) and/or \(A[l_{i+1}]A[r_{i+1}]\) ) by 1 limb. The lowest partial-product limb \(P_0^\textrm {L}\) and the highest partial-product limb \(P_{n}^\textrm {H}\) don’t overlap any other limbs. The remaining \(n\) product limbs are obtained by summing the high partial-product limb from index \(i-1\) with the low partial-product limb from index \(i\) , as shown in Equation (2). We note that the black limbs (as depicted in Figure 10) will have a weight of \(2\times\) in the multiplier partial-product summation. Thus, they are depicted as shifted left from the dotted vertical lines by 1 bit.
Fig. 10.
Fig. 10. Slice multiplier: partial-product limb alignments.
Returning to the inputs of the multiplier slice component, these are assembled out of limbs of the input \(A\) . In the case of a 4-iteration design, the input \(A\) is split into \(n=8\) limbs, whereas in the case of an 8-iteration design, the input \(A\) is split into \(n=16\) limbs. In general, for a \(K\) -iteration design, the input \(A\) is split into \(n=2K\) limbs. The slice multiplier has two inputs of \(n+1\) limbs each, which we will refer to as the “left” and “right” operands. For a \(K\) -iteration design, \(K\) sets of indices need to be generated for each of the two operands. Table 2 lists the set of limb indices required for the “left” and “right” operands for both a 4-iteration implementation and an 8-iteration one. We note that only 9 indices are generated for \(K=4\) , whereas for \(K=8,\) 17 indices are generated.
Table 2.
Iteration  Iteration           Multiplier Slice Limb Indices
  Count      Index    Operand  (positions 16 … 0; the 4-iteration design uses positions 8 … 0)
    4          1       Left     7  7  7  7  7  7  7  7  3
    4          2       Left     6  6  6  6  6  6  2  2  2
    4          3       Left     5  5  5  5  1  1  1  1  1
    4          4       Left     4  4  0  0  0  0  0  0  0
    4          1       Right    7  6  5  4  3  2  1  0  3
    4          2       Right    6  5  4  3  2  1  4  3  2
    4          3       Right    5  4  3  2  5  4  3  2  1
    4          4       Right    4  3  6  5  4  3  2  1  0
    8          1       Left    15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15  7
    8          2       Left    14 14 14 14 14 14 14 14 14 14 14 14 14 14  6  6  6
    8          3       Left    13 13 13 13 13 13 13 13 13 13 13 13  5  5  5  5  5
    8          4       Left    12 12 12 12 12 12 12 12 12 12  4  4  4  4  4  4  4
    8          5       Left    11 11 11 11 11 11 11 11  3  3  3  3  3  3  3  3  3
    8          6       Left    10 10 10 10 10 10  2  2  2  2  2  2  2  2  2  2  2
    8          7       Left     9  9  9  9  1  1  1  1  1  1  1  1  1  1  1  1  1
    8          8       Left     8  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    8          1       Right   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0  7
    8          2       Right   14 13 12 11 10  9  8  7  6  5  4  3  2  1  8  7  6
    8          3       Right   13 12 11 10  9  8  7  6  5  4  3  2  9  8  7  6  5
    8          4       Right   12 11 10  9  8  7  6  5  4  3 10  9  8  7  6  5  4
    8          5       Right   11 10  9  8  7  6  5  4 11 10  9  8  7  6  5  4  3
    8          6       Right   10  9  8  7  6  5 12 11 10  9  8  7  6  5  4  3  2
    8          7       Right    9  8  7  6 13 12 11 10  9  8  7  6  5  4  3  2  1
    8          8       Right    8  7 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Table 2. Partial Product Operand Index
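The limb-index lists of Table 2 follow a regular per-iteration pattern. The sketch below regenerates them (the pattern is inferred by us from the table itself, not taken from the authors' RTL) and verifies the key property: across the \(K\) slices, every unordered limb pair occurs exactly once.

```python
from collections import Counter

def slice_index_sets(K):
    """Yield (left, right) limb-index lists for iterations 1..K, n = 2K limbs,
       most significant position first (as laid out in Table 2)."""
    n = 2 * K
    for j in range(1, K + 1):
        left = [n - j] * (n - 2 * j + 2) + [K - j] * (2 * j - 1)
        right = (list(range(n - j, j - 2, -1)) +
                 list(range(K + j - 2, K - j - 1, -1)))
        yield left, right

for K in (4, 8):
    n = 2 * K
    pairs = Counter(tuple(sorted((l, r)))
                    for left, right in slice_index_sets(K)
                    for l, r in zip(left, right))
    assert len(pairs) == (n * n + n) // 2   # 36 pairs for K=4, 136 for K=8
    assert set(pairs.values()) == {1}       # no subproduct computed twice
```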

5.3.2 Limb-Selection Logic.

The next goal is to find an area and routing-efficient implementation of the limb selection logic. Figure 11(a) depicts the circuitry used to select the \(n+1\) limbs for the “left” multiplier slice operand for the case \(K=4\) . Note that this circuitry can easily be generalized for any value of \(K\) . The circuitry is based on a set of \(n-2\) multiplexers fed from two \(K\) -element shift registers. In particular, the “left” shift register (connected to multiplexer data input “1”) is initialized with the top \(K\) limbs of A. The most significant limb of \(A\) initializes the shift register so that it will be popped out in the next clock cycle. The “right” shift register is initialized with the bottom \(K\) limbs of \(A\) , with element index \(K-1\) closest to the shift register output. The output \(A[l_0]\) only selects among elements of index \(\lbrace 3, 2, 1, 0\rbrace\) , which are all present in the “right” shift register. Similarly, outputs \(A[l_{n-1}]\) and \(A[l_{n}]\) only select among the elements of index \(\lbrace 7, 6, 5, 4\rbrace ,\) which are all present in the “left” shift register. Consequently, for these 3 outputs, the multiplexers are not necessary as outputs are read directly from the corresponding shift register output. A network of \(n-1\) control registers (each one bit wide) is used to control these multiplexers. These registers are initialized to the values shown in Figure 11(a) upon reset and for each modular squaring operation. Then, at each clock cycle, control register values are transmitted to the neighboring control registers to the left (by sets of two) such that appropriate control sequences are generated to select the limb indices shown in Table 2.
Fig. 11.
Fig. 11. Slice multiplier: Operand Limb Selection Logic.
The circuitry used for the right operand limb selection is depicted in Figure 11(b). The architecture is based on an \(n\) -limb circular shift register used for shifting the limbs of \(A\) over \(K\) cycles. This shift register is initialized at the start of a modular squaring operation with the value of input \(A\) (as depicted). Then, over \(K\) cycles, it shifts data to the left as indicated by the arrows in the figure. A set of \(n-2\) multiplexers select one of two limbs that are found \(K\) -elements away in the circular list. For instance, a multiplexer used for selecting \(A[r_1]\) selects among register “0” of the shift register chain (initialized with \(A[0]\) ) and register “4” of the shift register chain (initialized with \(A[4]\) ). Another set of control registers organized as a skip-one shift register is used for controlling the multiplexer outputs. The set of registers is organized as shown in Figure 11(b). After one clock cycle, the control registers will read “1111000”, after the second clock cycle “1100000”, and, finally, the control registers will read “0000000” after the third clock cycle.

5.3.3 Algorithm.

The multi-cycle high-level modular squaring approach is shown in Algorithm 3. The algorithm takes as inputs \(A\) (the value to be squared) and \(K\) , the number of iterations. The input \(A\) is split into \(n\) limbs \(\lbrace A_{n-1}, \ldots , A_0\rbrace\) . The algorithm returns \(Z\) , an \(n\) -limb representation of the modular reduction of \(A^2\) by the modulus \(N\) :
\begin{align*} Z = A^2 \bmod N. \end{align*}
The algorithm computes the modular squaring and reduction as an iterative process over \(K\) steps. It is similar to Algorithm 2 in that each step consists of a partial multiplication and a partial reduction. The partial multiplication operates on two sets of \(n+1\) limbs of the input \(A\) . For each iteration, the indices for the left and right slice multiplier inputs are populated in the \(L\) and \(R\) lists (as exemplified in Table 2 for \(K=4\) and \(K=8\) ). The \(\textrm {Mult}\_\textrm {Slice}\) primitive operates on the limbs of \(A\) indexed by the \(L\) and \(R\) index sets. The \(n+2\) -limb output of the \(\textrm {Mult}\_\textrm {Slice}\) primitive is obtained as explained in Equation (2). As part of the iterative multiplication and reduction process, the \(\textrm {Mult}\_\textrm {Slice}\) primitive output is summed with the bottom \(n\) limbs of the accumulated partial product ( \(M\) ), shifted left by two limbs in order to align with the current weight of the product. A second accumulation value, \(S\) , is used to hold the partial reduction result. The two top limbs of \(P\) are reduced \(\pmod N\) , then are accumulated onto \(S\) . Once the \(K\) iterations are completed, a final reduction of the running \(M\) (an \(n\) -limb value) is summed to \(S\) and is returned as the output \(Z\) of the algorithm.
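As a cross-check of the index pattern and the limb weights, the following Python model is our functional reading of this \(K\) -iteration squaring: exact limbs stand in for the redundant coefficients, and the slice index lists are regenerated from the pattern we infer from Table 2.

```python
def mod_sqr(A, N, K, w):
    """Iterative modular squaring over K iterations, n = 2K limbs of w bits."""
    n = 2 * K
    L = 1 << w
    low_mask = (1 << (n * w)) - 1
    a = [(A >> (i * w)) & (L - 1) for i in range(n)]
    M, S = 0, 0
    for j in range(1, K + 1):
        left = [n - j] * (n - 2 * j + 2) + [K - j] * (2 * j - 1)
        right = (list(range(n - j, j - 2, -1)) +
                 list(range(K + j - 2, K - j - 1, -1)))
        # Mult_Slice: off-diagonal tiles doubled; since l+r = 2(K-j)+i, the
        # i-th term carries internal weight L^i = L^(l+r-2(K-j))
        p_slice = sum(((2 if l != r else 1) * a[l] * a[r])
                      << ((l + r - 2 * (K - j)) * w)
                      for l, r in zip(left, right))
        P = (M << (2 * w)) + p_slice        # accumulate, shifted left by two limbs
        M = P & low_mask
        top = P >> (n * w)                  # two overhanging limbs
        S = (S + top * pow(L, n + 2 * (K - j), N)) % N
    return (S + M) % N                      # == A*A mod N
```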

5.4 Hardware Architecture and FPGA Mapping for the Squarer-Based Approach

The architecture for the proposed iterative modular squarer is depicted in Figure 12. Similar to the iterative modular multiplier architecture depicted in Figure 7, we input and output 121-coefficient polynomials \(A\) and \(Z\) , respectively, with coefficients represented using 27 bits and \(x=2^{26}\) . In contrast to the static description of the iterative multiplier architecture, we use a description of the iterative squarer implementation that is parametrized on \(K\) , the number of iterations used for implementing the squaring. The total number of limbs that the input \(A\) is split into is \(n=2K\) . We denote by \(\textrm {L}_\textrm {W}=\lceil 121/n \rceil\) the limb width, expressed as the number of polynomial coefficients per limb. We note that since \(n\) is even, the final limb will always need to be padded with zeros up to width \(\textrm {L}_\textrm {W}\) .
Fig. 12.
Fig. 12. High-level architecture of the iterative modular squarer.
The start signal is asserted at the first of the \(K\) iterations. This drives the left and right operand selection logic (detailed individually and correspondingly in Figures 11(a) and 11(b)), which has its own internal control state machine. The limb selection logic returns operands that are \(n+1\) limbs wide and that feed into the \(\textrm {Mult}\_\textrm {Slice}\) primitive.
At each iteration, and for each new set of limbs returned by the operand limb-selection logic, \(\textrm {Mult}\_\textrm {Slice}\) implements one of the \(K\) partial-product computations associated with the squarer implementation. The compute requirement of the primitive is similar to that of the Polynomial Multiplier in Figure 7 (which computes the partial products \(AB_i\) ), except that the first operand is \(n+1\) limbs wide, that is, one limb wider than \(A\) . Consequently, the number of multiplier tiles used for the same value of \(n\) increases by a factor of \((n+1)/n\) . The architecture of the \(\textrm {Mult}\_\textrm {Slice}\) primitive is depicted in Figure 10.
The output of the \(\textrm {Mult}\_\textrm {Slice}\) primitive feeds into an \(n+2\) -limb Polynomial Adder. The adder is used to accumulate the appropriately shifted \(\textrm {Mult}\_\textrm {Slice}\) outputs over \(K\) cycles. During the first of the \(K\) iterations, the output of the Polynomial Adder is simply the output of \(\textrm {Mult}\_\textrm {Slice}\) , which is the highest-weight partial-product regrouping (see bottom rows of Figure 9). This output is split into two parts. First, the lower \(n\) limbs \(\lbrace P_{n-1}, \ldots , P_{0}\rbrace\) are shifted left by two limbs (creating an \(n+2\) -limb value: \(\lbrace P_{n-1}, \ldots , P_{0}, 0, 0\rbrace\) ) and are fed back to the right input of the Polynomial Adder. Second, the upper two limbs \(\lbrace P_{n+1}, P_{n}\rbrace\) are forwarded to the DSP-based Modular Reduction circuit, where they are reduced mod \(N\) to a 121-coefficient polynomial form. The same process is repeated for the remaining \(K-1\) iterations, with the only exception that \(\lbrace P_{n+1},\ldots , P_{0}\rbrace\) are the result of successive accumulations of appropriately aligned partial-product terms. This is similar to the multiplier-based approach but is different in that the relative alignment of successive \(\textrm {Mult}\_\textrm {Slice}\) outputs is at 2 limbs as opposed to 1-limb in the case of the multiplier-based approach.
During the final iteration, the MSB of the \(P_{n-1}\) limb is also reduced using a LUT-based reduction circuitry. This reduction produces a 120-coefficient degree polynomial. The output of the DSP-based reduction circuitry, the LUT-based reduction circuitry, and the limbs \(\lbrace P_{n-1}, \ldots , P_{0}\rbrace\) (except the MSB of \(P_{n-1}\) ) all feed into the second Polynomial Adder to produce the output \(Z\) .

5.5 Error Detection

For very long-running computations, such as those required to unlock this puzzle, it is very important to be able to detect errors early on. If an error goes undetected, then all computations performed after the error (which may amount to many years' worth of compute) will be useless. The error-detection mechanism needs to be combined with a checkpointing mechanism. At regular time intervals, intermediary results are checked using the error-detection mechanism. If no errors are detected, then the results are stored offline as a valid system state (checkpoint).
Having an error-detection mechanism in place, we can safely overclock the hardware (run it using a clock frequency larger than what is reported by the Quartus Timing Analyzer) and rely on the error detection mechanism to catch errors caused by overclocking. In the unlikely event of an error, the system will only need to revert to a starting point (checkpoint state) several minutes old. This is insignificant in the case of what can amount to multi-month or multi-year runs.
To be able to detect an error, we use the following approach: instead of performing calculations modulo \(N\) , we perform calculations modulo \(N^{\prime }=N \xi\) , where \(\xi = 4294963787\) is a 32-bit prime that produces the longest possible cycle \(\lambda = (\xi -3)/2 = 2147481892\) (see [34]). Converting a value modulo \(N^{\prime }\) into a value modulo \(N\) simply requires taking the remainder modulo \(N\) of that value. Operating \(\textrm {mod }N^{\prime }\) provides a way to check for errors in the calculations at any moment in time. The process involves comparing the result modulo \(\xi\) with the expected value, as shown in Algorithm 4. Note that \(J\) in Algorithm 4 represents the total number of modular multiplications (squarings) done so far. We reported our method to the CSAIL team to check that it was correct [Ron Rivest, Personal communication, 2022-04-19].
Running the error detection algorithm takes just a couple of seconds on a central processing unit (CPU). The probability of an undetected error is \(1/2147481892\) , which is extremely small.
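A sketch of the residue check follows. This is our reading of the scheme (the paper's Algorithm 4 is not reproduced verbatim), and the demo modulus \(N\) below is a small placeholder, not the CSAIL2019 modulus. The expected residue \(2^{2^J} \bmod \xi\) is computed in \(O(\log J)\) time by reducing the exponent modulo \(\xi - 1\), which is valid by Fermat's little theorem.

```python
XI = 4294963787              # the 32-bit prime used by the solver
P1 = (XI - 1) // 2           # 2147481893; note lambda = P1 - 1 = 2147481892

def expected_residue(J):
    """2^(2^J) mod XI for J >= 1."""
    e = 2 * pow(2, J - 1, P1)   # e == 2^J (mod XI - 1), and ord(2) divides XI - 1
    return pow(2, e, XI)

def check_state(state, J):
    """state should be 2^(2^J) mod N', so its residue mod XI is predictable."""
    return state % XI == expected_residue(J)

# demo with a small placeholder modulus (NOT the CSAIL2019 N)
N = 2**61 - 1
state, J = 2, 0              # state = 2^(2^0)
for _ in range(1000):        # 1000 squarings mod N' = N * XI
    state = state * state % (N * XI)
    J += 1
assert check_state(state, J)
```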

5.6 Directed Pipelining

Recent designs for large modular multiplication [29, 39] contain a datapath organized as a “simple loop” as shown in Figure 13(a). Adding additional clock stages into a simple loop architecture does not improve the overall speed of computations. Even though the additional pipeline stage allows the computations to run at a higher clock frequency, the trade-off is that it also increases the number of clock cycles to go through the loop, marginally reducing the overall performance (in our case, loop completion latency). We see this trade-off in recently reported designs. Some of these designs [3, 4] use just 1 or 2 clock cycles per iteration and, therefore, contain very deep combinatorial datapaths with very slow clock frequencies (20–40 MHz). If the designs are pipelined slightly deeper, the clock frequency increases, but this is almost perfectly offset by the increase in the number of cycles per iteration, again yielding an iteration rate of about 40 MHz [7].
Fig. 13.
Fig. 13. Loop Construction Development.
We now describe an improved way to introduce pipelining into a selected class of very deep combinatorial designs for FPGAs. We use the iterative modular multiplier architecture from Figure 7 as a working example. A high-level view of the loop organization of this design is shown in Figure 13(b). The design is organized as a nested two-level loop structure. Both inner loops iterate 8 times (synchronously) to produce a modular multiplication result while the outer loop iterates over the modular multiplication \(t=2^{56}\) times to complete the modular exponentiation. The total time to finalize the modular exponentiation therefore equals \(2^{56}T\) , where \(T\) denotes the time to complete one modular multiplication.
Denote by \(X\) the inner loop iteration count (the execution stays in the inner loop for \(X=8\) clock cycles), while completing one iteration of the outer loop takes an additional \(Y\) clock cycles. The total number of clock cycles per outer-loop iteration is therefore \(X+Y\) . Adding an extra pipeline stage to the outer loop ( \(Y\rightarrow Y+1\) ) increases the total number of clock cycles per iteration by 1 ( \(X+Y \rightarrow X+Y+1\) ). The relative increase in clock cycles required to compute one modular multiplication (outer loop) can be expressed as \(C_1=(X+Y+1)/(X+Y)\) .
Adding the additional pipelining stage in the outer loop decreases the maximum logic depth by a factor of \(C_2=(Y+1)/Y\) (assuming that the pipeline stages are distributed evenly). Since \(C_1 \lt C_2\) , the overall design performance will increase as long as the logic depth in the inner loop is smaller than the outer loop logic depth. Therefore, we can improve the overall performance by adding pipeline stages to the outer loop until the outer loop maximum logic depth matches the inner loop logic depth, as shown in Figure 13(c). In our case, the inner loop logic depth corresponds to a polynomial adder, which has a relatively short delay. This suggests that the outer loop latency, which mostly translates into the number of pipeline stages in the corresponding adder trees, can be increased while still improving the performance of the design.
Figure 14 shows an example for \(X=8\) in which by increasing the pipeline depth of the outer loop from \(Y=2\) to \(Y=3\) the critical path delay (which was assumed to be in the outer loop) has decreased by one-third, from 3 period units to 2 period units. Consequently, the total delay of the deeper pipelined design has decreased from 30 to 22 period units.
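The arithmetic of this example can be restated in a few lines. The inner-loop depth of 1 unit and the outer-loop combinatorial depth of 6 period units are our assumptions, chosen to reproduce the quoted numbers:

```python
def mod_mul_time(X, Y, inner_depth, outer_depth):
    """Time for one outer-loop iteration: (X + Y) cycles at the clock period
       set by the deepest logic between registers (outer depth split over Y)."""
    period = max(inner_depth, outer_depth / Y)
    return (X + Y) * period

assert mod_mul_time(8, 2, 1, 6) == 30.0   # Y=2: clock period 3 units, 10 cycles
assert mod_mul_time(8, 3, 1, 6) == 22.0   # Y=3: clock period 2 units, 11 cycles
```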
Fig. 14.
Fig. 14. Directed Pipelining Example.
We applied this approach to the two-level loop datapath shown in Figure 15. A total of 9 additional pipeline stages were added to the outer loop, bringing the overall number of pipeline stages to 12. Since the accumulation loops (inner loops) run for 8 cycles, the overall number of clock cycles per iteration (of the outer loop) is \(12+8-1=19\) . The maximum logic depth of the pipelined design corresponds to 2 consecutive adders, which is the minimum possible depth of the accumulation loop.
Fig. 15.
Fig. 15. Pipelined iterative modular multiplier.

6 Results and Discussion

6.1 Implementation

We implemented several solver architectures using an Intel Agilex F-Series FPGA Development Kit [14] based on the AGFB014R24A2E3VR0 FPGA device, as this was the most recent FPGA available to us in development board form. This is a mid-size, slowest-speed grade (–3) Agilex device, which motivates our development of an area-efficient solution. Although this device has less than half the logic of the Xilinx VU9 devices on the Amazon F1 instances (as used by the VDF Alliance competition and the Cryptophage LCS35 solution), our DSP-based method is very logic efficient, as we do not need any soft logic for the coefficient storage. The number of DSP resources on our selected Agilex device is also much lower than on the VU9 device (4,510 vs. 6,840). Nonetheless, the Agilex DSP Blocks can support 27 \(\times\) 27 multipliers directly, which is 50% more arithmetically dense for this application than the individual Xilinx VU9 DSP Blocks (which support 27 \(\times\) 18 multipliers).
Our goal was to create regularly placed designs. We achieved this by a combination of explicit (placing DSP blocks in column groups) and implicit (the directed pipelining method introduced in the earlier section) methods. We did not use any logic floor planning for any of our architectures; rather, we let the DSP placement drive the place-and-route of the solver.
We evaluated four different solver architectures: an 8-iteration multiplier-based architecture and three squarer-based architectures with 6, 7, and 8 iterations. The performance results for these architectures are presented in Table 3. For each architecture, we list the number of pipelining stages used in the two adder trees associated with the polynomial multiplier (PM) and polynomial reduction (PR) parts of the architecture. The total number of clock cycles for each architecture is computed using the expression \(7+PM+PR+K\) and is also reported in the table. For each of the four architectures, we also list the resource utilization and FMax data as reported by Quartus. We note that, for each architecture, the reported resource utilization and frequency values are extracted for the best of a 30-design seed sweep. The total squaring latency is computed as the clock period derived from the Quartus-reported FMax multiplied by the latency in clock cycles of the modular squaring operation. Additionally, based on extensive experimentation with the specific development kit board, we list the maximum stable overclocking frequency for each design: the highest frequency at which we were able to run the designs for several days without any errors being produced. Based on these overclocked frequency values, we also list the corresponding latencies of the modular squaring architectures.
Table 3.
Architecture                        Multiplier,8    Squarer,6      Squarer,7      Squarer,8
Iteration Count                     8               6              7              8
Adder Tree Latency (PM, PR)         (2, 2)          (2, 2)         (1, 1)         (1, 1)
Total Clock Cycles                  19              17             16             17
ALMs                                161,269 (33%)   171,410 (35%)  141,422 (29%)  128,392 (26%)
DSPs                                3,891 (86%)     4,218 (94%)    3,400 (75%)    3,043 (67%)
FMax (MHz), Agilex (–3)             326.37          313.09         284.09         297.27
FMax (MHz), Agilex (–3), Overclock  420.00          400.00         370.00         390.00
Squaring Latency (ns)               58.21           54.29          56.32          57.18
Squaring Latency (ns), Overclock    45.24           42.50          43.24          43.58
Overclocking Improvement (ns)       12.97           11.79          13.08          13.60
Overclocking Improvement (%)        22%             21%            23%            23%
Table 3. Performance Results for the Proposed Architectures
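The cycle counts and latencies in Table 3 follow directly from the \(7+PM+PR+K\) expression and the reported FMax values; a quick sanity check (all numbers taken from the table):

```python
def total_cycles(pm, pr, k):
    # fixed 7-cycle overhead plus the two adder-tree depths and K iterations
    return 7 + pm + pr + k

def squaring_latency_ns(cycles, fmax_mhz):
    # one clock period is 1000/FMax nanoseconds
    return cycles * 1000.0 / fmax_mhz

assert total_cycles(2, 2, 8) == 19  # Multiplier,8
assert total_cycles(2, 2, 6) == 17  # Squarer,6
assert total_cycles(1, 1, 7) == 16  # Squarer,7
assert total_cycles(1, 1, 8) == 17  # Squarer,8

# Squarer,6 at the overclocked 400 MHz: 17 cycles -> 42.5 ns
assert abs(squaring_latency_ns(17, 400.0) - 42.5) < 0.01
```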
In the case of the multiplier-based modular squarer architecture, we found that 8 iterations was an optimal value for the FPGA device that we were targeting. A smaller number of iterations leads to larger FPGA resource utilization and more congested placement. This results in FMax degradation that is not compensated by the decreased cycle count. On the other hand, a higher iteration count increases latency but does not improve FMax sufficiently, leading to overall performance degradation compared with the 8-iteration design. Additionally, a larger number of iterations also prevents efficient usage of the DSP coefficient ROM feature. Nonetheless, as discussed in Section 5.2, for \(K\gt 8,\) the cost of lookup-based coefficient storage in the DSP-based reduction will be significantly less than the LUT cost in the Ozturk reduction technique.
Table 4 presents the total resource utilization for the 8-iteration multiplier-based solver architecture (the Solver column), combined with a resource breakdown for our two main units: Multiplication and Modular Reduction. As can be observed from the table, this architecture balances well the resources between the multiplication and reduction components (both ALMs and DSPs). This can also be observed from the floor plan shown in Figure 16(a), where the multiplier is shown in purple and the modular reduction in red. Figure 16(e) shows the routing heatmap for this design. From this routing heatmap, we can deduce that the highest routing pressure is in the modular reduction component.
Table 4.
Component      Solver ALMs (%)  Solver DSPs (%)  Mult. ALMs (%)  Mult. DSPs (%)  Red. ALMs (%)  Red. DSPs (%)
Multiplier, 8  161,269 (33)     3,891 (86)       72,256 (15)     1,936 (43)      82,240 (18)    1,955 (43)
Squarer, 6     171,410 (35)     4,218 (93)       61,756 (12)     1,573 (35)      101,542 (21)   2,645 (59)
Squarer, 7     142,315 (29)     3,400 (75)       49,982 (10)     1,215 (27)      84,210 (17)    2,185 (48)
Squarer, 8     128,392 (26)     3,043 (67)       43,400 (8)      1,088 (24)      76,946 (15)    1,955 (43)
Table 4. Total Resource Utilization and Main Component Resource Breakdown Report for the 4 Proposed Architectures
Fig. 16.
Fig. 16. Floor plan and routing heatmap of multiplier (purple) and reduction modules (red) for the 8-iteration multiplier architecture, as well as squarer architectures for 6, 7, and 8 iterations.
Selecting the best squarer-based modular squarer architecture is less obvious, as multiple parameters ultimately influence performance. First, and most obviously, the reduced number of iterations ( \(K\) ) directly determines the number of modular squarer cycles and may thus reduce the latency. At the same time, a lower value of \(K\) yields a larger modular reduction component, which potentially results in deeper adder trees and a harder place-and-route problem. The pipelining of the two major adder trees (PM and PR) also influences performance: a deeper pipeline increases the cycle count but can also increase the achievable frequency.
Out of the set of possible squarer-based modular squarer architectures, we selected three that were better fits for the current FPGA device. First, we selected a 6-iteration implementation that uses 2 adder tree pipeline stages for both PM and PR. This matches the adder-tree pipelining of the multiplier-based architecture; thus, the total latency is reduced from 19 to 17 cycles. The resource utilization is slightly increased compared with the multiplier-based design, with the DSP Block count increasing by 327 to 94% utilization. Combined with the deeper logic stages in the reduction part, this resulted in a reported frequency of 313 MHz compared with the 326 MHz reported for the multiplier-based architecture. Despite the reduced frequency, the combined effect of reducing the number of iterations by 2 (from 8 to 6) reduces the total latency by nearly 4 ns compared with the 8-iteration multiplier-based solution.
Second, we selected a 7-iteration squarer-based architecture with shallow-pipelined adder trees (1 cycle per adder tree). For this architecture, the latency is decreased by 3 cycles from 19 to 16: 1 cycle for each adder tree and 1 cycle for the reduced iteration count. The reported frequency of 284.09 MHz is lower than both the multiplier-based design and the Squarer,6 design. Despite having the lowest number of cycles, the total latency, at 56.32 ns, is actually larger than both Squarer,6 and Multiplier,8. We note that this design also has slightly lower resource utilization than the previous two designs, which makes the place-and-route task slightly simpler. Nonetheless, the 1-cycle adder trees used to reduce the total cycle count end up degrading the frequency too much.
Finally, we selected an 8-iteration squarer-based architecture with the same shallow-pipelined adder trees (1 cycle per adder tree). This architecture uses the same number of iterations (8) as the multiplier-based architecture. Nonetheless, its complexity is much reduced compared with the multiplier-based architecture, particularly in the multiplier part. This allows for shallow pipelining of the adder trees, which ultimately reduces the total modular squarer latency to 17 cycles. The reported FMax for this architecture is 297 MHz, which results in a total squaring latency of 57.18 ns. Although the slowest of the squarer architectures, it still improves on the latency of the multiplier-based implementation while also significantly reducing resources.
Table 4 presents the resource utilization breakdown between the polynomial multiplier and reduction components across the four architectures, showing how the resource ratio between the two components shifts when moving from the multiplier-based to the squarer-based architectures. We complement the information in Tables 3 and 4 with Figure 16, which visually presents the resource breakdown between the multiplication (or, correspondingly, squaring) part (in purple) and the reduction part (in red) of the solver, as well as the corresponding routing heatmap for each architecture. We note that whereas for the 8-iteration multiplier-based architecture the resource utilization of the two major components was similar (see the breakdown in Table 4), for the squarer-based architectures the reduction part of the circuitry dominates. Out of the four architectures, the 6-iteration squarer-based architecture has the lowest modular squaring latency at 54.29 ns, which is 3.92 ns faster than the 8-iteration multiplier-based implementation.
When overclocked, the latency of all architectures improves by more than 20%, with only minor differences between them. Although the latency improvements differ slightly from design to design, the latency ordering is maintained between the default and overclocked results. The overall best value is obtained by the Squarer,6 architecture at 42.5 ns.
We evaluated the expected performance improvement when switching from the slowest (–3) to the fastest (–1) speed grade Agilex device. For this, we compiled the same 8-iteration multiplier-based modular squarer design on the same Agilex FPGA variant (same ALM, DSP, and Memory block count), but with the fastest speed grade. On a similar 30-design seed sweep to the one used for the (–3) part, the best seed reported a frequency of 379.36 MHz on the (–1) part compared with 326.37 MHz on the (–3) part, a frequency increase of approximately 16%. We expect similar scaling factors to apply to the frequencies of the squarer architectures. Nonetheless, any statement regarding overclocking scalability would need to be confirmed empirically.
We note that for all architectures, roughly 6,500 ALMs and 14 M20K memory blocks (not reported in the table) are needed for the control circuitry that completes the entire solver implementation.

6.2 Power Analysis and Performance Comparison

We measured the actual dynamic power of our FPGA to be 29.5 W for a 400-MHz 6-iteration squarer-based solver. We also ran the Agilex Power Analyzer [15] tool for comparison. The default settings estimated the dynamic power consumption at 12.8 W. In our experience, the default settings are not appropriate for arithmetically dense designs, as the default toggle rate of 12.5% is too low. At a 50% toggle rate, the tool returned 26.9 W, which is in line with our measured value. This observation is important, as the reported (but only simulated) power numbers from the VDF Alliance contest are likely incorrect. Devlin [25] reports that his design consumes 4.9 W and the Pearson design 18.3 W (both numbers from the Xilinx Power Estimator tool). We ran the reported area for Devlin's design through the Power Estimator [18] and obtained similar numbers. Like the Intel tool, the Xilinx default toggle rate is 12.5%. Again, we believe that the correct toggle rate for these applications is 50%, which would approximately quadruple the estimated power consumption reported.
We normalized the performance (latency/(bits/1024) \(^2\) ) to compare the arithmetic efficiency of all three approaches despite their operating on different bit widths. We acknowledge that this is not a perfect metric but found it sufficiently accurate for this comparison. We used the Normalized Latency proxy, which approximates the latency of an equivalent 1024-bit multiplier based on the absolute latency of our 3072-bit multiplier. Selecting Devlin's multiplier as the baseline, we report the normalized performance of our Squarer,6 implementation alongside the latency of Pearson's winning VDF modular multiplier. Table 5 indicates that the normalized performance of our approach is roughly 1 order of magnitude higher than Devlin's modular multiplier and over 5 times higher than Pearson's multiplier.
We also evaluated the power efficiency of these designs. Using Devlin’s multiplier again as the baseline, we computed the power efficiency by dividing the normalized performance by the estimated power (all at 50% toggle rate). For instance, compared with Devlin’s multiplier, which has a power efficiency of 1, our proposed Squarer,6 implementation has a power efficiency of \(4.2 = 9.7 / (27.2 / 12.0)\) . The results are shown in Table 5.
Table 5.
Parameter               Ours       Devlin              Pearson
Device                  Agilex-7   Virtex UltraScale+  Virtex UltraScale+
Process                 10 nm      16 nm               16 nm
Bits                    3072       1024                1024
Latency                 42.5 ns    46 ns               25.2 ns
Normalized Latency      4.72 ns    46 ns               25.2 ns
Normalized Performance  9.7        1                   1.8
Power (est.)            27.2 W     12.0 W              65 W
Power Efficiency        4.2        1                   0.32
Table 5. Power Efficiency Analysis
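The normalized figures in Table 5 can be reproduced from the raw numbers. The sketch below applies the latency/(bits/1024)^2 normalization and the power-efficiency division described above; note that the table's final values are rounded, so the recomputed efficiencies differ in the last digit.

```python
def normalized_latency_ns(latency_ns, bits):
    # approximate the latency of an equivalent 1024-bit multiplier
    # (multiplication cost is assumed to grow quadratically with width)
    return latency_ns / (bits / 1024) ** 2

def power_efficiency(norm_perf, power_w, base_power_w=12.0):
    # normalized performance divided by power relative to Devlin's 12.0 W
    return norm_perf / (power_w / base_power_w)

norm_ours = normalized_latency_ns(42.5, 3072)  # ~4.72 ns
assert abs(norm_ours - 4.72) < 0.01

perf_ours = 46.0 / norm_ours   # vs. Devlin's 46-ns baseline -> ~9.7
perf_pearson = 46.0 / 25.2     # -> ~1.8
assert abs(perf_ours - 9.7) < 0.05 and abs(perf_pearson - 1.8) < 0.05

# ~4.3 and ~0.33 here; Table 5 lists the rounded 4.2 and 0.32
assert abs(power_efficiency(perf_ours, 27.2) - 4.2) < 0.15
assert abs(power_efficiency(perf_pearson, 65.0) - 0.32) < 0.05
```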
As with our earlier latency comparison, some caution is needed. Our efficiency should be discounted somewhat, as the Agilex devices are built on a 10-nm FinFET process, whereas the VU9P devices use 16-nm FinFET. Our multiplicative reduction also requires more DSP Blocks, although for a VDF application the latency metric greatly outweighs cost or power considerations.

6.3 Solver

The solver started running in February 2022. In the first 12 months, it found 22 milestone solutions (from \(t=2^{28}\) to \(t=2^{49}\) ) of the puzzle. For these 22 milestones, the solver used the 19-cycle multiplier-based modular squaring implementation running at a 350-MHz clock target. Following improvements to the multiplier-based architecture, which now runs at an overclocked frequency of 420 MHz (as reported in Table 3), and the development of the squarer-based family of architectures, which now outperforms the multiplier-based solution, the solver was migrated to the Squarer,6 architecture running at 400 MHz, resulting in a modular squarer latency of 42.5 ns. The actual runtimes for the completed milestones and the estimated runtimes for future milestones, together with the completion dates of the achieved milestones, are given in Table 6. Note that the new estimated times are based on the performance of the proposed Squarer,6 architecture.
Table 6.
Milestone        Runtime      Obtained      Status
\(t=2^{28}\) 14.57 sFeb 23, 2022Done
\(t=2^{29}\) 29.14 sFeb 23, 2022Done
\(t=2^{30}\) 58.29 sFeb 23, 2022Done
\(t=2^{31}\) 1.94 mFeb 23, 2022Done
\(t=2^{32}\) 3.89 mFeb 23, 2022Done
\(t=2^{33}\) 7.77 mFeb 23, 2022Done
\(t=2^{34}\) 15.54 mFeb 23, 2022Done
\(t=2^{35}\) 31.09 mFeb 23, 2022Done
\(t=2^{36}\) 1.04 hFeb 23, 2022Done
\(t=2^{37}\) 2.07 hFeb 23, 2022Done
\(t=2^{38}\) 4.14 hFeb 24, 2022Done
\(t=2^{39}\) 8.29 hFeb 24, 2022Done
\(t=2^{40}\) 16.58 hFeb 27, 2022Done
\(t=2^{41}\) 1.38 dFeb 28, 2022Done
\(t=2^{42}\) 2.76 dMar 2, 2022Done
\(t=2^{43}\) 5.53 dMar 7, 2022Done
\(t=2^{44}\) 11.05 dMar 14, 2022Done
\(t=2^{45}\) 22.11 dMar 25, 2022Done
\(t=2^{46}\) 44.21 dApr 15, 2022Done
\(t=2^{47}\) 88.43 dMay 31, 2022Done
\(t=2^{48}\) 176.85 dAug 26, 2022Done
\(t=2^{49}\) 353.71 dFeb 22, 2023Done
\(t=2^{50}\) 1.87 yJan 8, 2024Done
\(t=2^{51}\) \(est.\) 3.03 y In progress
\(t=2^{52}\) \(est.\) 6.06 y Not started
\(t=2^{53}\) \(est.\) 12.13 y Not started
\(t=2^{54}\) \(est.\) 24.27 y Not started
\(t=2^{55}\) \(est.\) 48.55 y Not started
\(t=2^{56}\) \(est.\) 97.10 y Not started
Table 6. Solver Milestone Status as of January 2024
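The runtimes in Table 6 follow from the squaring latency: milestone \(t=2^{m}\) requires \(2^{m}\) sequential modular squarings. A sanity check, using the 19-cycle Multiplier,8 latency at 350 MHz for the early completed milestones and the 42.5-ns Squarer,6 latency for the new estimates, as stated above:

```python
def milestone_runtime_s(m, latency_ns):
    # 2**m sequential modular squarings, each taking latency_ns
    return (2 ** m) * latency_ns * 1e-9

# Multiplier,8 at 350 MHz: 19 cycles -> ~54.29 ns per squaring
lat_mult = 19 * 1000.0 / 350.0
assert abs(milestone_runtime_s(28, lat_mult) - 14.57) < 0.01  # t = 2**28

# Squarer,6 at 400 MHz (42.5 ns): t = 2**51 estimate in years
years = milestone_runtime_s(51, 42.5) / (365.25 * 24 * 3600)
assert abs(years - 3.03) < 0.01
```

Each subsequent milestone doubles the runtime, which is why the table's runtime column doubles row by row.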
We reported our results to the MIT CSAIL team. Our milestone solutions have been recorded and are at the top of the leader-board [Ron Rivest, Personal Communications, 2022-04-19, and 2024-01-08].
To meet the 2033 deadline of the CSAIL2019 puzzle, we need roughly a 10 \(\times\) improvement: our 42.5-ns iteration time must be reduced to 4.25 ns. Our best architecture uses a 6-iteration modular squarer implementation, where the iteration count is driven primarily by the number of resources (principally, the 4,510 DSP Blocks) on the mid-size device we are using (this solution uses 94% of the available DSP Blocks on the device). Larger FPGAs are currently available [4], with over 12K DSP Blocks on some of the larger devices; the same FPGA family (Agilex) also has members with 3x the DSP Blocks of our device. Although this would not evenly divide into our iteration granularity, we can also investigate a mixed (multiplicative and table-based) approach. Since both types of reduction calculation remap portions of values outside the \(N\) modulus width back into that space, they are compatible with each other. We use a relatively small amount of soft logic compared with the amount of DSP, so it is likely that a reasonably routable solution could be realized. One caveat is that reducing the number of iterations would increase the critical path, especially in the summation of the partial product columns (comprising both the multiplication and reduction portions of the implementation), which may negatively impact the operating frequency.
While the performance of the presented solver is not sufficient to fully solve the puzzle before the deadline in 2033, we hope that our work will inspire more FPGA researchers to work on the problem and will serve as a base architecture to solve other time-locked puzzles with future generations of FPGAs and more architectural improvements.

7 Future Work

We believe that the clock rate of our solver implementation may be further improved. We are currently overclocking the FPGA for this application by over 30% (for the Squarer,8 architecture), and we are confident that it is possible to push this even further. Given that the design has low logic usage and almost no memory blocks, the power consumption is lower than is typical for a full-chip design of this size. We know that we are not near the thermal limits of this device, and we have a robust error-checking method to continuously verify our results. There are several possibilities for developing overclocking further: both tuning the existing methods and developing new ones. We believe that we can boost the operating frequency by another few percentage points by monitoring power (which will increase with frequency, thereby increasing temperature and, in turn, reducing our thermal margin) and by using a better device cooling solution. The power-based performance optimization will be a slowly varying parameter, and the continued correct operation of the circuit can be monitored by our error-checking methodology. We have also identified possible data-based clocking improvements, which we may be able to apply on an iteration-by-iteration basis.
The combination of using larger existing devices and both the slowly and quickly varying clock rates will likely not achieve a 10x increase in performance, although a 5x to 6x improvement is possible.

8 Conclusions

In this article, we analyzed the CSAIL2019 problem and compared the computational scale of the problem with known approaches to solving it. It appears that a standard CPU is out of the running and that the current FPGA methods will have difficulty scaling.
We described a new FPGA-based algorithm that shifts the computation from tables to DSP Blocks, which allows for a higher-density solution, with higher throughput and more predictable timing closure. This also reduces the routing problem for soft logic, allowing more scalability with higher performance. We then developed multi-cycle implementations, both multiplier and squarer based, using FIR filter coefficient DSP Block storage, which lets us directly calculate milestone solutions of the puzzle even using modest-sized FPGAs. We also introduced a new way of pipelining large systems, with repeatable and predictable performance. This allows for modifications to the design to be made with a high degree of confidence in the new FPGA fitting.
As part of our discussion on FPGA performance, we described how we can confidently run the design somewhat faster than the reported timing closure and how this might be applicable to many other types of designs.
We compared normalized metrics of our architecture to similar designs used in the VDF Alliance contest and found that our approach (which uses methods such as multiplicative reduction as well as a left-to-right multiplication calculation that allows a simultaneous multiplication expansion and reduction operation) provides significantly better arithmetic and power efficiencies. More importantly for VDF applications, our method will provide the lowest overall latency and is also scalable to larger bit widths.
Finally, we look to the future. While the performance and computation density of our new work is a significant improvement over previously reported work, the presented design will not meet the full CSAIL2019 puzzle–solving deadline. However, we may be able to get close assuming that we can continue to develop more ways of increasing performance, either by improving our implementation parallelism, our frequency tuning, or a combination of both. In any case, the CSAIL2019 puzzle is a very challenging problem for FPGAs.

Appendix

The latest milestone in hexadecimal: \(2^{2^{50}}\bmod N\) =7E4C8A50E7464A80161CDEA2E42A057DE3750B1974DB35B4BC704E94D91451980755B884853A69833DE9D0D914D1098CB3197B1A27B12019AC587B228925E752BAE3408D0C1353D884D96D7BCC1F3CFAEB2FEA85650CF4B45186EFD0FAA6B56D0888BAC2C9477F2C726128EA984AA511822827685A6CDA9F5E88B52FE3EE27948867D01ED7C59FCF415EFBE51F9E181DD213ECC4A9B018696C78FD3D66D75A756896945AFC9015634DBA4A451E0094BF29F2966A42FA6F803A0D12DDE3583E39530C3DA960921BCD628173930E983D24C8A477CCAAE8D3F0888305EB7CE2E53E126AD390937A5A95EB671AB79C9E2B074D6BA663CBE25D1BEDF468E0EDE866EE3018C1F273F2BEEF9356D6A13588EBD7B2815500D5F0D84B20BC45130E645A0F9D2C662028280B135AD1DECA6CC927A04A5317A0A001EAE8705E6B2D14FB6AFF248A014318489A5E88C036873D41B1B3234BB07F526C7B0FDDB73857D7AE165D4DDFE37A342FF8E34AB323E20D491236FEA07BFF70D168063A0595431742A7F1

References

[3]
2019. Cryptophage LCS35 Solver. https://github.com/supranational/lcs35/blob/master/README.md/ (2019). Accessed: 2022-09-14.
[7]
2019. VDF Alliance FPGA Competition - Round 2 - Eric Pearson. https://github.com/supranational/vdf-fpga-round2-results/tree/master/eric_pearson_2. (2019). Accessed: 2022-09-14.
[8]
2019. VDF Alliance FPGA Competition Round 1 Results and Announcements. https://www.vdfalliance.org/news/fpga-competition-round-1-results (2019).
[9]
2021. UltraFast Design Methodology Guide for Xilinx FPGAs and SoCs. https://docs.xilinx.com/r/2021.2-English/ug949-vivado-design-methodology/Super-Logic-Region-SLR (2021). Accessed: 2021-11-19.
[10]
2021. UltraScale Architecture DSP Slice. https://docs.xilinx.com/v/u/en-US/ug579-ultrascale-dsp (2021). Accessed: 2021-08-30.
[11]
2022. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/ (2022). Accessed: 2022-09-14.
[12]
2022. Description of the CSAIL2019 Time Capsule Crypto-Puzzle. https://people.csail.mit.edu/rivest/pubs/Riv19f.new-puzzle.txt (2022).
[13]
2022. Description of the LCS35 Time Capsule Crypto-Puzzle. http://people.csail.mit.edu/rivest/pubs/Riv99b.lcs35-puzzle-description.txt (2022).
[14]
2022. Intel® Agilex™ F-Series FPGA Development Kit. Accessed: 2022-09-14.
[16]
2022. Programmers Solve MIT's 20-Year-Old Cryptographic Puzzle. https://www.csail.mit.edu/news/programmers-solve-mits-20-year-old-cryptographic-puzzle (2022). Accessed: 2022-09-14.
[17]
2022. Virtex UltraScale+ Product Tables. https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus.html (2022). Accessed: 2022-09-14.
[19]
Paul Barrett. 1987. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Advances in Cryptology — CRYPTO'86, Andrew M. Odlyzko (Ed.). Springer, Berlin, 311–323.
[20]
Daniel J. Bernstein and Jonathan P. Sorenson. 2007. Modular exponentiation via the explicit Chinese remainder theorem. Math. Comp. 76, 257 (2007), 443–454.
[21]
J. Beuchat and J. Muller. 2008. Automatic generation of modular multipliers for FPGA applications. IEEE Trans. Comput. 57, 12 (Dec 2008), 1600–1613.
[22]
Andrew D. Booth. 1951. A signed binary multiplication technique. The Quarterly Journal of Mechanics and Applied Mathematics 4, 2 (1951), 236–240.
[23]
Jeffrey Chromczak, Mark Wheeler, Charles Chiasson, Dana How, Martin Langhammer, Tim Vanderhoek, Grace Zgheib, and Ilya Ganusov. 2020. Architectural enhancements in Intel® Agilex™ FPGAs. In FPGA’20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, February 23-25, 2020, Stephen Neuendorffer and Lesley Shannon (Eds.). ACM, 140–149.
[24]
Florent de Dinechin and Bogdan Pasca. 2009. Large multipliers with fewer DSP blocks. In International Conference on Field Programmable Logic and Applications. IEEE.
[25]
Benjamin Devlin. 2020. Really low latency multipliers and cryptographic puzzles. https://blog.janestreet.com/really-low-latency-multipliers-and-cryptographic-puzzles. (2020).
[26]
Y. Gong and S. Li. 2010. High-throughput FPGA implementation of 256-bit Montgomery modular multiplier. In 2010 2nd International Workshop on Education Technology and Computer Science, Vol. 3. 173–176.
[27]
A. Karatsuba and Y. Ofman. 1962. Multiplication of multidigit numbers on automata. USSR Academy of Sciences 145 (1962), 293–294.
[28]
Martin Kumm, Oscar Gustafsson, Florent de Dinechin, Johannes Kappauf, and Peter Zipf. 2018. Karatsuba with rectangular multipliers for FPGAs. In 25th IEEE Symposium on Computer Arithmetic, ARITH 2018, Amherst, MA, USA, June 25-27, 2018. IEEE, 13–20.
[29]
M. Langhammer, S. Gribok, and B. Pasca. 2022. Low-latency modular exponentiation for FPGAs. IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (2022).
[30]
Martin Langhammer and Bogdan Pasca. 2021. Efficient FPGA modular multiplication implementation. In FPGA’21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28 - March 2, 2021, Lesley Shannon and Michael Adler (Eds.). ACM, 217–223.
[31]
Martin Langhammer and Bogdan Pasca. 2021. Folded integer multiplication for FPGAs. In FPGA’21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28 - March 2, 2021, Lesley Shannon and Michael Adler (Eds.). ACM, 160–170.
[32]
Peter L. Montgomery. 1985. Modular multiplication without trial division. Math. Comp. 44, 170 (1985), 519–521.
[33]
M. Morales-Sandoval and A. Diaz-Perez. 2016. Scalable GF(p) Montgomery multiplier based on a digit-digit computation approach. IET Computers Digital Techniques 10, 3 (2016), 102–109.
[34]
T. D. Noe. 2008. The On-Line Encyclopedia of Integer Sequences: Sequence A141305. (2008). https://oeis.org/A141305 Accessed: 2022-08-16.
[35]
E. Ozcan and S. S. Erdem. 2019. A high performance full-word Barrett multiplier designed for FPGAs with DSP resources. In 2019 15th Conference on PhD Research in Microelectronics and Electronics (PRIME). 73–76.
[36]
Ismail San. 2021. LLMonPro: Low-latency Montgomery modular multiplication suitable for verifiable delay functions. IACR Cryptol. ePrint Arch. (2021), 4. https://eprint.iacr.org/2021/004
[37]
Phil Sun. 2020. Modular Exponentiation via Nested Reduction in a Residue Number System. https://github.com/supranational/vdf-fpga-round3-results/blob/master/papers/phil_sun_rns.pdf. (2020).
[38]
Emanuele Vitali, Davide Gadioli, Fabrizio Ferrandi, and Gianluca Palermo. 2021. Parametric throughput oriented large integer multipliers for high level synthesis. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). 38–41.
[39]
E. Öztürk. 2020. Design and implementation of a low-latency modular multiplication algorithm. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 6 (2020), 1902–1911.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 3, September 2024, 51 pages. EISSN: 1936-7414. DOI: 10.1145/3613592.
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 17 September 2024
Online AM: 29 December 2023
Accepted: 14 December 2023
Revised: 15 September 2023
Received: 15 September 2023
Published in TRETS Volume 17, Issue 3

Author Tags

  1. Crypto-puzzle
  2. FPGA
  3. modular exponentiation
  4. modular multiplication
  5. modular squaring
  6. iterative algorithm
  7. DSP-based modular reduction

