This manuscript presents our computation of the \(9\)th Dedekind number, D(9). This was achieved by accelerating the core operation of the process with an efficient FPGA design that outperforms an optimized 64-core CPU reference by 95\(\times\). The FPGA execution was parallelized on the Noctua 2 supercomputer at Paderborn University. The resulting value for D(9) is 286386577668298411128469151667598498812366. This value can be verified in two steps: we have made the data file containing the 490 M intermediate results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value. The paper explains the mathematical approach in the first part, before focusing on a deep dive into the FPGA accelerator implementation, followed by a performance analysis. The FPGA implementation was done at the Register-Transfer Level using a dual-clock architecture and shows how we achieved an impressive FMax of 450 MHz on the targeted Stratix 10 GX 2800 FPGAs. The total compute time used was 47,000 FPGA hours.
1 Introduction
Calculating Dedekind numbers (On-Line Encyclopedia of Integer Sequences (OEIS) series A000372 [11]) has always followed progress in both the mathematical domain and the computational implementation domain. Computing subsequent numbers is known as “Dedekind’s Problem.” From 1991 up until now, the highest known term in this sequence was D(8), computed by Doug Wiedemann [22] on a Cray 2 supercomputer. In this paper, we present our computation of the ninth instance in this sequence, and explain the steps it took, both the mathematics and our field-programmable gate array (FPGA) implementation. This computational endeavor originally started as the master’s thesis of the main author [8], and through collaboration with the people at the Paderborn Center for Parallel Computing we were able to make it a reality in the subsequent two years.
Fundamentally, the Dedekind sequence grows at a rate of \(O(2^{2^{n}})\), and with all algorithms known to date, the fundamental complexity is bounded by \(O(2^{2^{n-1}})\). In this work, we do not break this complexity barrier, but by efficiently exploiting the symmetries inherent in the problem and building an optimized FPGA accelerator for it, we have been able to perform the required amount of work. As with the computations of the Dedekind numbers 5–8 over the previous century, the present computation of Dedekind number 9 hence demonstrates the power of contemporary computing hardware, in our case FPGAs, while at the same time building on mathematical and algorithmic progress.
In the first half, this paper goes through the mathematics of monotone Boolean functions (MBFs) and the core formula used in the computation, followed by an in-depth explanation of the FPGA accelerator we built to accelerate it. Finally, we describe the issues we faced during the computation, and the steps we took to mitigate them.
2 Mathematical Background
Richard Dedekind first defined the numbers \(D(n)\) in 1897 [6]. Over the previous century, Dedekind numbers have been a challenge for computational power in the evolving domain of computer science. Computing the numbers proved exceptionally hard, and so far only formulas with a double exponential time complexity are known. Until recently, the largest known Dedekind number was \(D(8)\). In this paper, we report on a computation of \(D(9)\). Table 1 shows the known numbers, including the result of our computation. As we explain in Section 9, some uncertainty about the correctness of the number existed at the time of the first computation and we planned a verification run. This was due to the real possibility of a bit error having occurred due to Cosmic Radiation. In the meantime however, results of an independent computation were reported [10], confirming our result. Since computational methods as well as hardware implementation differ significantly between the two computations, we can conclude that the result is correct with a probability very close to 1.
D(n) | Value | Source
D(0) | 2 | Dedekind (1897)
D(1) | 3 | Dedekind (1897)
D(2) | 6 | Dedekind (1897)
D(3) | 20 | Dedekind (1897)
D(4) | 168 | Dedekind (1897)
D(5) | 7581 | Church (1940)
D(6) | 7828354 | Ward (1946)
D(7) | 2414682040998 | Church (1965)
D(8) | 56130437228687557907788 | Wiedemann (1991)
D(9) | 286386577668298411128469151667598498812366 | Our result (2023)
Table 1. Known Dedekind Numbers, OEIS Series A000372 [11, 22], Including Our Result
A Boolean Function in \(n\) variables is defined as a map from \(n\) Boolean inputs to a single Boolean output. Since the input space is finite, we can fully represent any Boolean Function as a lookup table (LUT) over all possible input combinations. In the case of an \(n\)-input Boolean Function, this becomes a \(2^{n}\)-entry table, where each output has two options. We shall name the set of all \(2^{n}\) possible inputs \(B_{n}\). The total number of Boolean Functions is therefore \(2^{2^{n}}\). Throughout the implementation, we shall use \(2^{n}\)-bit bitvectors to represent all Boolean Functions, be they monotone or not. The binary representation of the position of each bit indicates the input values that correspond to this output. As an example, take the (Monotone) Boolean Function \(a\& b\). The truth table for this MBF is \(ab\Rightarrow 1,b\Rightarrow 0,a\Rightarrow 0,\emptyset\Rightarrow 0\). Translating to the bitset gives 1000.
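To make this encoding concrete, the following minimal C++ sketch (our own illustration, not code from the actual implementation) shows how bit position \(p\) of the truth table holds the output for the input combination whose binary representation is \(p\):

```cpp
#include <cstdint>

// Bit p of the truth table holds the output for the input assignment whose
// binary encoding is p. For n = 2, the function a & b is true only for the
// input 0b11, so its bitset is 0b1000 (decimal 8), matching the example above.
static bool evaluate(uint64_t truthTable, uint32_t inputs) {
    return (truthTable >> inputs) & 1;  // look up the output bit for this input combination
}

int main() {
    uint64_t aAndB = uint64_t(1) << 0b11;  // only the "a and b both true" entry is set
    return (evaluate(aAndB, 0b11) && !evaluate(aAndB, 0b01)) ? 0 : 1;
}
```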
To arrive at the set of MBFs, we add a constraint to regular Boolean Functions, namely that of monotonicity. A Boolean Function is monotone if and only if for every possible configuration of inputs, when we flip any false input to true, the output will never go from false to true. An example illustrates this well, as shown in Figure 4.
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Most of the mathematics in this paper derives from Patrick De Causmaecker et al.’s work [2, 5]. In those manuscripts however, instead of MBFs the authors use Anti-Chains, which are more compact from a set theory point of view. These representations produce equivalent structures though, and because operations on MBFs are easier to understand and more efficient to implement, we opted to use the latter. Knowledge of Anti-Chains is not necessary to read this paper.
The set of all possible MBFs in \(n\) variables is denoted as \(D_{n}\). The sizes of these sets are named Dedekind numbers (Usually written as \(|D_{n}|\) or \(D(n)\)1) and the known values in this sequence are listed in Table 1.
Table 2 shows the known numbers \(R(n)\) of equivalence classes of MBFs under permutation of the input variables. Note that the last result dates from 2023. These Equivalence Class counts are a sequence closely linked to the Dedekind numbers themselves, closely bounded as \(R(n){\gt}\frac{D(n)}{n!}\). These numbers are used to estimate the time methods in this space usually take, as a common optimization will use the symmetries in permuting the input variables.
MBFs are partially ordered pointwise: \(\alpha\leq\beta\) if and only if \(\beta\) is true on every input for which \(\alpha\) is true. This partial ordering defines a complete lattice on \(D_{n}\). We denote by \(\bot\) and \(\top\) the smallest, respectively the largest, element of \(D_{n}\).
Finally, in the formulas below, a number defined for each pair \(\alpha\leq\beta\in D_{n}\) plays an important role. We refer to this number as the connector number \(C_{\alpha,\beta}\) of \(\alpha\) and \(\beta\). It counts the number of connected components of true nodes in the resulting graph when the true nodes in bottom \(\beta\) are removed from \(\alpha\). An example of this is shown in Figure 5.
The value \(2^{C_{\alpha,\beta}}\), defined for a pair \(\alpha\leq\beta\in D_{n}\), is called the P-Coefficient, also denoted as PCoeff [4, 5].
Furthermore, we define several more operators that are needed further on; a bit-level sketch of their implementation follows the list:
–
The dual of an MBF is obtained by inverting both its inputs and its output. The dual of \(\alpha\) is written with an overbar: \(\overline{\alpha}\). In practice this is implemented by inverting the whole bitset and reversing the order of all the bits (Figure 2).
–
varSwap swaps two variables, which we’ll call \(a\) and \(b\), producing a new Boolean Function that behaves as if the inputs \(a\) and \(b\) were swapped. Full permutations of all variables are built from repeated application of this operation. In practice (assuming \(a{\gt}b\)) this means shifting all bits whose position contains \(a\) but not \(b\) by \(2^{a}-2^{b}\) bits to the right, and those containing \(b\) but not \(a\) by \(2^{a}-2^{b}\) bits to the left. All other bits remain the same (Figure 1).
–
monotonizeUp/down: These functions take a Boolean Function and, for any true value in it, activate all nodes above, respectively below, it. monotonizeDown creates an MBF, and monotonizeUp creates an inverse MBF. They are implemented by successively shifting right/left by \(2^{0},2^{1},2^{2}\) up to \(2^{n-1}\) bits and OR-ing the bitset with the original (Figure 3).
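For illustration, the following C++ sketch renders these operators on a 64-bit truth table, which suffices for up to six variables; the real implementation uses 128-bit vectors for the 7-variable case. This is our own simplified rendering of the operations described above, not the project’s actual code.

```cpp
#include <cstdint>

// Mask of all truth-table positions whose index has variable i set,
// e.g. varMask(0) = 0xAAAA...A, varMask(1) = 0xCCCC...C.
static uint64_t varMask(int i) {
    uint64_t m = 0;
    for (int p = 0; p < 64; ++p)
        if (p & (1 << i)) m |= uint64_t(1) << p;
    return m;
}

// Dual: invert the output and all inputs, i.e. complement the truth table
// and reverse the order of its 2^n bits.
static uint64_t dual(uint64_t f, int n) {
    uint64_t size = uint64_t(1) << n;
    uint64_t d = 0;
    for (uint64_t p = 0; p < size; ++p)
        if (!((f >> p) & 1)) d |= uint64_t(1) << (size - 1 - p);
    return d;
}

// varSwap: swap input variables a and b (a > b). Bits whose position contains
// a but not b move 2^a - 2^b places down, bits containing b but not a move up.
static uint64_t varSwap(uint64_t f, int a, int b) {
    uint64_t maskA = varMask(a), maskB = varMask(b);
    uint64_t shift = (uint64_t(1) << a) - (uint64_t(1) << b);
    uint64_t keep  = f & ~(maskA ^ maskB);  // both or neither variable: stays put
    return keep | ((f & maskA & ~maskB) >> shift) | ((f & maskB & ~maskA) << shift);
}

// monotonizeUp/Down: for every active node, also activate all nodes above
// (respectively below) it, using one masked shift-and-OR per variable.
static uint64_t monotonizeUp(uint64_t v, int n) {
    for (int i = 0; i < n; ++i)
        v |= (v & ~varMask(i)) << (1u << i);  // add variable i where it is absent
    return v;
}

static uint64_t monotonizeDown(uint64_t v, int n) {
    for (int i = 0; i < n; ++i)
        v |= (v & varMask(i)) >> (1u << i);   // remove variable i where it is present
    return v;
}
```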
3 Related Work
3.1 Mathematical Related Work
The work in this article is built upon three papers by Patrick De Causmaecker et al. [3–5]. Another author who has recently made big strides in this field is Bartłomiej Pawelski, who has, among other things, extended the sequence of Inequivalent MBF counts by two terms in the last few years [16].
Finally, of course, we must mention the work of Christian Jäkel, who computed the 9th Dedekind number at about the same time as us, but using different formulas and GPUs instead of FPGAs [10]. The underlying method employed in his paper is in the same family of “Jumping Formulas” as ours. While we use a 2-jump formula (Equation (9)), Jäkel used a 4-jump (Theorem 8 in [10]). In practice, there is not that much to gain from differing jump sizes, as the exponential decrease in \(|D_{n}|\) due to shrinking \(n\) is almost equally balanced out by an exponential increase in the number of recursive summations that must be made over this smaller set. To illustrate, for a jump of 1, we need to iterate through all \(D(n-1)\) MBFs once; for a jump of 2, we need to iterate through all 2-tuples of the \(D(n-2)\) MBFs; a jump of 4 implies iterating through all 8-tuples of the \(D(n-4)\) MBFs.
Jäkel’s implementation used Nvidia A100 GPUs and took 5,311 hours in GPU time, compared to 47,000 FPGA hours used for our project. Despite the higher power consumption per GPU, this makes Jäkel’s project more efficient in terms of energy consumed and probably also in terms of hardware investments, even though the utilized FPGAs were first deployed in 2018 and thus might already be completely written off in a commercial setting. However, it is hard to attribute this difference to any specific aspect of the two projects, since the hardware involved, the algorithms employed, and the platform advantages capitalized on are so different. Jäkel was able to leverage the GPU hardware so efficiently because he was able to reformulate this eight-way tuple summation in terms of 4,257,682,565 large matrix multiplications. This way, the GPUs were executing a task they were specifically designed for on the architecture level. Our approach on the other hand employed the reconfigurability of FPGAs to implement fixed-structure Boolean operations—a task for which they are uniquely well suited, but not specialized—for example our design leaves the DSP blocks, a specific strength of the Stratix 10 architecture, entirely unused. One might be tempted to attribute the difference in execution time to the much lower clock frequency of FPGAs, or more generally to the perspective that GPUs in some regards resemble an ASIC specialized for matrix multiplication, in contrast to the larger versatility of the FPGA architecture. However, that would unjustly ignore the differences on multiple levels of algorithmic optimizations. All in all, each approach used its respective platform to the fullest, and achieved an impressive result that had not been reached for over thirty years.
3.2 FPGA Accelerators with Similar Characteristics
FPGA accelerators are still a relatively niche platform in the high performance computing (HPC) space. Their high purchase cost, steep development difficulty, and relatively low clock frequency make them a challenging target compared to traditional HPC computing platforms such as CPU and GPU computing. Nevertheless, FPGAs excel at certain specific types of tasks. Particularly high acceleration can be expected for tasks in which bit-level operations are key, or which employ an unpredictable branching structure, both of which are characteristics of this work. A domain where FPGAs are often employed due to their ability to implement bit-level operations efficiently is cryptography, for example for the calculation of message-digest algorithm hashes [7], or for searching [18] and reconstructing [17, 19] Advanced Encryption Standard keys from decayed memory. Another recent example of FPGA acceleration with a focus on bit-level operations is mutation tree reconstruction [13] in bioinformatics.
4 Workload Characterization and Approach
We start from the original PCoeff Formula as taken from [4]
In the master’s thesis of the first author of the current paper, Lennart Van Hirtum [8], this formula was reworked into a form making use of equivalence classes to reduce the total number of terms
The \(Permut_{\beta}\) term is the collection of all \(n!\) equivalents of \(\beta\) under permutation of the input variables. \(E_{\beta}\) is the number of distinct equivalents, and hence, \(Permut_{\beta}\) contains duplicates iff \(E_{\beta}{\lt}n!\). The number of duplicates of any given permutation is \(\frac{n!}{E_{\beta}}\), and they are divided out by the \(\frac{E_{\beta}}{n!}\) factor. Technically the condition isn’t required on the sum over all \(\beta\), since any \(\beta\) that doesn’t satisfy it will have zero valid \(\gamma\) terms. It is included here however, because the implementation efficiently enumerates these \(\beta\)s by iterating through the Equivalence Class graph shown in Figure 12.
For D(9), this means iterating through \(D_{7}\). That would require iterating over an estimated \(4.59\times 10^{16}\) \(\alpha,\beta\) pairs [8, Section 9.1]. A lower bound for the total number of P-Coefficients (\(2^{C_{\alpha,\gamma}}\)) that need to be computed can be calculated exactly by counting the number of terms in Equation (9). This is still a lot of work though, so we reworked the formula to Equation (11), giving us \(11483553838459169213=1.148\times 10^{19}\). This is only a lower bound, as iterating over all permutations as Equation (10) does introduces some redundancy to allow for a simpler implementation. However, it is a pretty tight bound, as the fraction of Equivalence Classes that do not have the full 5,040 distinct permutations is only 4.3% (see Table 2).
We were able to improve on this further using the process of “deduplication,” which halves the total amount of work again, by noticing that pairs \(\alpha,\beta\) give identical results to their dual pair \(\overline{\beta},\overline{\alpha}\),2 as per Equation (12). This allowed us to halve the total amount of work to \(5.574\times 10^{18}\) P-Coefficients.3
From now on, when we refer to a “top” or a “bottom,” we will be referring to the \(\alpha\), respectively \(\beta\), in Equation (10). When referring to “bottom permutations” we are referring to \(\gamma\).
Table 3. MBF LUT By Equivalence Class Size for MBFs of 7 Variables
The second column sums to \(R(7)\). The sum of the products of the columns gives \(D(7)\).
4.1 Computing P-Coefficients on FPGA
Computing P-Coefficients requires solving the problem of counting the number of disjoint connected components within the activated nodes of an \((n-2)\)-dimensional hypercube. An example of such a graph with its disjoint connected components colored is shown in Figure 5 for five variables. For calculating D(9), corresponding graphs for seven variables are used, which contain 128 vertices and 448 edges. Even though the graph structure itself is fixed and only the set of active vertices changes for every P-Coefficient, a typical CPU implementation will contain many branches, and it is hard to make use of vectorization based on the Single Instruction, Multiple Data (SIMD) paradigm. This observation motivated the use of FPGAs in this work, since counting connected components involves only plain Boolean operations along the graph edges, which can be mapped extremely well to FPGA technology. Because the overall structure of the graph is fixed, the wires and gates resemble the edge and node structure of the graph. A simple schematic implementation of a Count Connected Core is shown in Figure 6. A first implementation of the Count Connected Core, with estimates of resource consumption and possible parallelism, was proposed in [8]. In the thesis, two preprocessing steps—Singleton Elimination and Leaf Elimination [8, Section 9.3]—are derived that bring the average number of iterations (and thus clock cycles) down to 4.061. These are described further in Section 5.3. In this manuscript, we expand upon this and describe a complete FPGA accelerator in Section 5, with further details about counting connected components on FPGA and with optimized CPU code in Section 5.3, and further performance analysis in Section 7.
Fig. 6.
5 Accelerator Implementation
Since the Count Connected Core is the element that enables the large speedup of our FPGA computation, our goal is to fit as many of these on a single FPGA as possible, and to run them at as high a clock frequency as possible. At this rate however, it also becomes an important concern to create a design that can keep up with the Count Connected Cores’ data consumption and result production rate. Furthermore, the peripheral component interconnect express (PCIe) communication bandwidth has to be taken into account to deliver work items fast enough.
To parallelize a total of 300 Count Connected Cores per FPGA while staying aware of these constraints, we divide them into a hierarchy. Groups of 30 Count Connected Cores form work units called full permutation pipelines (FPPs), and further parallelization and load balancing happen between these 30-core groups.
In the subsequent parts of this section, we will go down the hierarchy. We start at the top-level OpenCL glue code in Section 5.1, which deals with input and output buffering, access to the DDR (Double Data Rate) external memory provided on the FPGA cards, and reordering the bottoms to balance the throughput on each component. Section 5.2 deals with the top-level Verilog module. It handles the load balancing over the 300 Count Connected Cores: at the top level it load-balances the bottoms stream over the 10 FPPs, and each of these 10 FPPs consists of 30 Count Connected Cores, each responsible for 1/30th of the 5,040 permutations of the bottom. Finally, in Section 5.3 we delve into the actual implementation of these Count Connected Cores, how the iteration structure works, and how they are optimized for speed. Results of actually synthesizing the design are presented in Section 5.4.
A pseudocode overview of the algorithm is given in Listing 1.
5.1 OpenCL Interface
The OpenCL code is actually a rather simple wrapper around the Verilog code, as shown in Listing 2.
We iterate over the given 32-bit MBF indices, look them up in our MBF LUT, and run the PCoeff processor on them. Because the interface to the Verilog code can’t be called in a different way for the top than for the bottom, we pass the top to the module first, with the startNewTop bit set to “1.” To further simplify the hardware we store this bit as part of the index: we just use a free bit from the index which is “1” for the first element, and “0” for all others. We took care that shuffleClusters doesn’t reorder this first element, so that it always gets passed to pcoeffProcessor first.
Each result is then stored in the output buffer. The reason we use a LUT instead of sending the MBFs over directly is to limit PCIe bandwidth use: if we only send indices, we only need to send 32-bit indices instead of the full 128-bit MBFs. We use two copies of the LUT to parallelize our accesses over 2 of the 4 memory banks. The results are returned in 64-bit buffers, of which the first 48 bits contain the sums for each \(\beta\) (Equation (13)). The next 13 bits are used for the number of valid permutations (Equation (14)); this turned out to be very useful in Section 6. Finally, one more bit is used to flag detected bit errors in the FPGA Intel Stratix M20k Block RAMs.
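As an illustration of this packing, the sketch below decodes one such 64-bit result word in C++. The field widths follow the description above, but the exact bit position of the error flag (assumed here to be bit 61) is our assumption, not taken from the actual code.

```cpp
#include <cstdint>

// Hypothetical decoder for one 64-bit result word: bits 0..47 carry the
// P-coefficient sum for this bottom, bits 48..60 the number of valid
// permutations (at most 5,040, so 13 bits suffice), and one further bit
// (assumed here to be bit 61) flags a detected M20k bit error.
struct PcoeffResult {
    uint64_t sum;         // sum over all valid permutations of this bottom
    uint32_t validCount;  // how many of the 5,040 permutations were valid
    bool     eccError;    // true if the FPGA reported a memory bit error
};

static PcoeffResult decodeResult(uint64_t word) {
    PcoeffResult r;
    r.sum        = word & ((uint64_t(1) << 48) - 1);
    r.validCount = static_cast<uint32_t>((word >> 48) & 0x1FFF);
    r.eccError   = ((word >> 61) & 1) != 0;
    return r;
}
```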
We run the function on two MBFs at a time, to double our bandwidth, as OpenCL will compile this to a module processing one iteration per clock cycle.
The shuffleClusters call shuffles the order in which we iterate through the MBF list to keep the computational difficulty reasonably even. That is because the elements at the front of the buffer tend to be high MBFs, of which only a small fraction of the 5,040 permutations is valid, whereas the end of the buffer holds the bottom-most MBFs, for which pretty much all 5,040 permutations are valid. A consequence is that the computational difficulty generally rises from front to back. If implemented straightforwardly, the computation module would at first be underutilized, with the bottleneck in permutation checking, whereas toward the end of the list the computation module would be overutilized. This is alleviated by alternating easy and difficult bottoms, leading to a relatively even difficulty across the whole run. The implementation we used for it is shown in Listing 3.
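The sketch below illustrates the idea with a simple front/back interleaving of the index buffer; the actual shuffleClusters in Listing 3 works on clusters and may differ in detail, so treat this as a conceptual reconstruction. It keeps element 0 (the top) in place, as required above.

```cpp
#include <cstdint>
#include <vector>

// Interleave cheap bottoms from the front of the buffer with expensive ones
// from the back, so that computational difficulty stays roughly even over the
// whole run. Element 0 is the top and must remain first.
static std::vector<uint32_t> interleaveByDifficulty(const std::vector<uint32_t>& indices) {
    std::vector<uint32_t> out;
    out.reserve(indices.size());
    out.push_back(indices.at(0));                    // keep the top in front
    std::size_t lo = 1, hi = indices.size() - 1;
    while (lo <= hi) {
        out.push_back(indices[lo++]);                // one easy bottom...
        if (lo <= hi) out.push_back(indices[hi--]);  // ...then one hard bottom
    }
    return out;
}
```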
5.2 PCoeff Computation Module
The outermost module divides the two incoming MBF streams across 10 “FPP” module copies we were able to fit on the FPGA. Each module takes at most one input every 7 clock cycles. The outputs from the computation modules are then ordered again and returned as two output streams. This is shown in Figure 7.
Fig. 7.
Each of these modules then does the actual computational work. The actual work is to try all \(5{,}040=7!\) permutations of each bottom, and for the ones that are dominated by the top MBF, compute and sum their P-Coefficients. Iterating these permutations is done using “varswap” operations. As shown in [8] a swap operation in hardware just swaps \(2\times 32\) wires out of the 128 wires. In fact, any static permutation can be done at zero cost in hardware, as it simply reorders the wires.
The strategy for iterating all permutations is recursively swapping every possible pair of variables: variable 0 is swapped with each of variables 0–6; for each of these, variable 1 is swapped with variables 1–6, variable 2 with variables 2–6, and so forth. Doing this for all variables gives us the \(5{,}040\) permutations we need.
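As a software analogue of this recursion (our own sketch, reusing the varSwap helper from the sketch in Section 2, and shown on the 64-bit representation for up to six variables; the real pipeline works on 128-bit MBFs and distributes the recursion across hardware as described next):

```cpp
#include <cstdint>

// Visit all n! permuted versions of the truth table `bot` by recursively
// swapping variable `firstVar` with every variable at or above it,
// mirroring the swap schedule described above.
template <typename Visitor>
static void forEachPermutation(uint64_t bot, int firstVar, int n, Visitor&& visit) {
    if (firstVar >= n - 1) { visit(bot); return; }
    for (int v = firstVar; v < n; ++v) {
        // swapping a variable with itself is a no-op
        uint64_t swapped = (v == firstVar) ? bot : varSwap(bot, v, firstVar);
        forEachPermutation(swapped, firstVar + 1, n, visit);
    }
}
```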
These permutation factors are split across the length of the pipeline as follows: The first 7 var swaps are iterated at the front. Permutation factors 6 and 5 are then implemented in wires and split across \(6*5=30\) computation modules. Each of these then iterates through the remaining \(4*3*2*1=24\) permutations, checks which of these are valid, and sends only the valid ones to the Count Connected Core for processing. This is graphically shown in Figure 8. Do note that there are two filtering stages: if all \(4*3*2*1=24\) permutations of a bundle are invalid, the bundle is not stored in the FIFO. Then, during iteration over these permutations in each module, only the valid permutations are iterated through; this saves the clock cycles that would otherwise be wasted iterating through invalid ones. Checking the validity of the various permutations can actually reuse a lot of hardware, which is why the validity of all \(6*5*4*3*2*1=720\) permutations is checked in one large module at the front, and these validity wires are then passed to the filters of the computation pipelines.
Fig. 8.
The results from each Count Connected Core are summed separately, and finally all 30 pipeline results are added together to produce the total P-Coefficient sum for this top-bottom combination.
5.3 Count Connected Core
Firstly, the final four variables are permuted, outputting only the valid permutations (those where the permuted bot \(P(\beta)\) is a proper subset of \(\alpha\)). In that case, the graph for which the number of connected components is to be computed is given by \(\alpha\& P(\beta)\).
Finally we arrive at the part of the system that does actual computation, the count-connected-core. It implements a version of the Depth First Search algorithm for finding the connected components [21] which we will call the “FloodFill” algorithm as described in [8]. It is shown in Listing 4.
The basic principle is: so long as the remaining graph isn’t empty (see Figure 5), grab an arbitrary new seed and, from there, expand the component frontier into its connected nodes until there are no new nodes to find. Due to the structure of Boolean Functions, this exploration can actually be split up efficiently into a MonotonizeUp and a MonotonizeDown step. Once we can explore no further, the explored region is removed from the remaining graph, the connection count is incremented, and a new seed is chosen. This is repeated until the graph is entirely empty. The fact that the graph only consists of 128 nodes that may be activated or not allows us to represent the whole graph as just 128 wires. The operations act on all nodes simultaneously, which allows us to explore many nodes in a single clock cycle.
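A software rendering of this loop follows; it is our own sketch on a 64-bit graph for up to six variables, reusing the monotonizeUp/monotonizeDown helpers from the sketch in Section 2. The intersection with the graph after each closure step is what keeps the exploration inside one component; as noted above, the split into up/down closures relies on the special structure of these graphs.

```cpp
#include <cstdint>

// FloodFill-style counting of connected components among the active nodes of
// the n-dimensional hypercube, assuming monotonizeUp/monotonizeDown from the
// earlier sketch.
static int countConnected(uint64_t graph, int n) {
    int count = 0;
    while (graph != 0) {
        uint64_t frontier = graph & (~graph + 1);  // lowest active node becomes the new seed
        for (;;) {
            uint64_t up   = monotonizeUp(frontier, n) & graph;  // explore upward, stay inside the graph
            uint64_t down = monotonizeDown(up, n) & graph;      // then downward, again inside the graph
            if (down == frontier) break;                        // no new nodes found: component complete
            frontier = down;
        }
        graph &= ~frontier;                        // remove the finished component from the graph
        ++count;
    }
    return count;
}
```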
For the CPU reference, we took great care to produce code as optimal as possible for the 7-variable case (required for computing D(9)). We store the vector in a 128-bit SIMD register; this way we can work on the whole vector at once. An excerpt of the monotonizeUp routine is given in Listing 5. It is basically a sequence of masked shifts and ORs: first we mask out every second bit, shift over, and OR with the original; then we mask out 2 out of every 4 bits, 4 out of every 8, and so forth.
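For concreteness, the following is our own SSE2 reconstruction of such a routine, following the mask-shift-OR structure just described; the mask constants and intrinsic choices are ours, and the actual Listing 5 may differ.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// monotonizeUp for 7 variables on a 128-bit truth table: for each variable,
// keep only the bits where that variable is absent, shift them up by 2^i to
// set the variable, and OR with the original. Variable 6 crosses the 64-bit
// lane boundary, so it uses a whole-register byte shift instead of a mask.
static __m128i monotonizeUp7(__m128i v) {
    const __m128i noVar0 = _mm_set1_epi64x(0x5555555555555555LL);
    const __m128i noVar1 = _mm_set1_epi64x(0x3333333333333333LL);
    const __m128i noVar2 = _mm_set1_epi64x(0x0F0F0F0F0F0F0F0FLL);
    const __m128i noVar3 = _mm_set1_epi64x(0x00FF00FF00FF00FFLL);
    const __m128i noVar4 = _mm_set1_epi64x(0x0000FFFF0000FFFFLL);
    const __m128i noVar5 = _mm_set1_epi64x(0x00000000FFFFFFFFLL);

    v = _mm_or_si128(v, _mm_slli_epi64(_mm_and_si128(v, noVar0), 1));
    v = _mm_or_si128(v, _mm_slli_epi64(_mm_and_si128(v, noVar1), 2));
    v = _mm_or_si128(v, _mm_slli_epi64(_mm_and_si128(v, noVar2), 4));
    v = _mm_or_si128(v, _mm_slli_epi64(_mm_and_si128(v, noVar3), 8));
    v = _mm_or_si128(v, _mm_slli_epi64(_mm_and_si128(v, noVar4), 16));
    v = _mm_or_si128(v, _mm_slli_epi64(_mm_and_si128(v, noVar5), 32));
    v = _mm_or_si128(v, _mm_slli_si128(v, 8));  // variable 6: move the low 64-bit half up
    return v;
}
```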
This masking and shifting is one area where the FPGA saves resources over CPU, because on the CPU we are spending valuable instructions and registers masking out bits and shifting them, whereas on FPGA this masking and shifting is done implicitly in the connecting wires. Furthermore, the FPGA can save logic blocks by merging adjacent ORs into single 4-input OR gates.
Perhaps further instruction-level parallelism could be achieved with 256- or 512-bit Advanced Vector Extension (AVX) registers by monotonizing multiple vectors at once, but that would be difficult due to the number of iterations not being constant and usually being small (1–5). We didn’t expect much gain from following this path.
In [8] the author described two preprocessing steps that reduce the number of iterations of the FloodFill algorithm. The first is “Leaf Elimination.” This algorithm removes any nodes which are connected to only a single other node. Of course, that might accidentally remove a whole component if it only consists of 2 nodes. To solve that, leaf elimination comes in two kinds: eliminating only leaves whose single neighbor lies below them, respectively above them. It turns out that, because in the FPGA implementation the termination condition is only checked after monotonizeDown (see Figure 6), only leafEliminationDown has a significant effect on the iteration count.
The second preprocessing step is SingletonElimination. This step counts and removes any nodes that have no neighbors. Since on average 2.670 components are singletons, this is a significant improvement. As shown in [8, Table 9.3], these operations together significantly reduce the number of FloodFill iterations, from 6.967 down to 4.061.
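A sketch of how singleton elimination can be expressed with the same mask-and-shift primitives follows (again our 64-bit illustration, reusing varMask from the sketch in Section 2 and std::popcount from C++20; this is not the project’s actual code):

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// Count and remove active nodes that have no active hypercube neighbour.
// Each such singleton is a connected component of its own, so it can be
// counted up front without spending FloodFill iterations on it.
static int eliminateSingletons(uint64_t& graph, int n) {
    uint64_t covered = 0;                              // nodes adjacent to at least one active node
    for (int i = 0; i < n; ++i) {
        covered |= (graph & varMask(i)) >> (1u << i);  // neighbour reached by clearing variable i
        covered |= (graph & ~varMask(i)) << (1u << i); // neighbour reached by setting variable i
    }
    uint64_t singletons = graph & ~covered;            // active nodes with no active neighbour
    graph &= ~singletons;
    return std::popcount(singletons);
}
```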
Both preprocessing steps have been implemented on CPU and on FPGA. Next we consider their further integration into the FPGA design. Because we’ll be pipelining these modules significantly, multiple work items will be in flight at any time, and due to the variable iteration time, they won’t all finish at the same time. For this reason we include a reordering buffer at the output of the Count Connected Core. This means that every graph that goes into the core gets an ID, which is then used to index into the output memory block to store the resulting count. This memory block is then continuously read in ascending order, which puts the results back into the correct order.
You’ll notice two shaded regions in Figure 9. In order to improve the processing speed of the most performance critical part of the design, we doubled the clock speed in the green region to 450 MHz. This of course further extended the pipelining demands of the core pipeline, but overall was worth the required resources for the higher compute speed. The boundaries of this clock domain are the input FIFO, and the reordering memory module, which handle the clock domain crossing at their inputs and outputs natively.
Fig. 9.
A major design consideration for the reorder buffer is to make the output memory large enough that, by the time a valid index is reached, its computation must have finished. In our case, this was proven by computing the maximum number of cycles that a new element could take from the moment it is assigned an ID and placed in the input queue, to the moment its result is written to the reorder buffer. The extreme case is that every MBF that needs to be processed is the worst-case MBF shown in Figure 10, which takes 20 iterations to compute. The total pipeline depth is 20 stages. Let’s say for the sake of argument that the whole pipeline and the input FIFO are filled with fresh worst-case MBFs, and our target MBF is fed into the FIFO at the 63rd spot (the input FIFO holds 64 elements). It then takes 20*20 = 400 cycles to resolve all MBFs in the pipeline, at which point 20 new MBFs from the queue are fed in. We need to process five such batches to get to and process our target MBF. This would take 20*20*5 = 2,000 clock2\(\times\) cycles. This would be cutting it close with a reorder buffer of size 2,048, were it not for the fact that the indices are generated by the regular clock, which runs half as fast. At 1,000 regular clock cycles max, our 2,048-element reorder buffer is more than comfortable, and it happened to be a convenient size to use as this is the default depth of Stratix M20k blocks.
Fig. 10.
Figure 11 shows the time-forward combinatorial logic of the Count Connected Core. You can clearly see the three main components of the CountConnected algorithm there: Exploring new nodes is done with MonotonizeUp and MonotonizeDown on the left, selecting new seed nodes is done on the top right, and the iteration management hardware is in the middle. The advantage of representing it like this is that pipelining this design becomes trivial. Just draw vertical lines through the diagram, and place registers wherever one crosses the data lines.
Fig. 11.
To reach the 450 MHz frequency we need, we pipelined this module in this way quite extensively. We settled on 20 pipeline stages in the end: about 10 stages for the exploration pipeline, 5 stages for the new-seed pipeline, and 5 stages of input FIFO read latency. To save on resources, as some of the signals started using many registers for their pipeline (\(\sim\)1,500 registers for leftover\(\_\)graph and another 500 for connection\(\_\)count and output\(\_\)index together), we replaced those with Block RAM shift registers.
5.4 Synthesis
The hardware accelerator was implemented on Bittware 520N cards, each containing an Intel Stratix 10 GX 2800 FPGA and 16 GB of DDR4 external RAM. It was synthesized by combining AOCL’s build flow with extra Logic Lock regions to improve the generated design’s clock speed. Synthesis output is summarized in Table 4.
Name | Value | Utilization
ALUTs | 1,067,787 |
Registers | 1,492,343 |
Logic utilization | 772,483 / 933,120 | 83%
I/O pins | 561 / 912 | 62%
DSP blocks | 2 / 5,760 | <1%
Memory bits | 52,971,640 / 240,046,080 | 22%
RAM blocks | 7,489 / 11,721 | 64%
Actual clock freq | 225.0 MHz |
Kernel fmax | 225.425 MHz |
1\(\times\) clock fmax | 228.67 MHz |
2\(\times\) clock fmax | 450.85 MHz |
Highest non-global fanout | 2,332 |
Table 4. Output of the Synthesis Process Using AOCL
Reported values include resources used by the Board Support Package and are reported relative to the total resources of the FPGA. Within the kernel region, more than 90% logic density is achieved. ALUTs, Adaptive Lookup Tables.
We also performed power measurements on the synthesized FPGA kernel with the help of the onboard INA219 power monitors that can be queried using the bittware\(\_\)power command in the PC2 infrastructure. The reported power consumption of the FPGA card fluctuates between 96 W and 111 W. The average power use is 102 W. In contrast, a single AMD Epyc 7763 CPU as installed in Noctua 2 [1] already consumes 250 W (not including main memory power) when using all 64 cores for the same workload—at much lower performance, as discussed in Section 7.1.
5.5 Single Node System Integration
Finally, putting it all together, we now have a hardware accelerator module that can process on average 33 billion P-Coefficients per second, equivalent to about 275 million \(\alpha,\beta\) pairs per second. We now need the capacity to feed it data rapidly enough.
We have two 520N FPGA cards per FPGA node on Noctua 2, so we need the capacity to generate 550 million \(\beta\) elements per second for processing. To save on bandwidth and FPGA processing time, we want to only send over \(\alpha,\beta\) pairs for which we know they will have at least one valid \(P(\beta)\leq\alpha\) permutation. To find these, we use a data structure called the “downlink structure,” as explained in the thesis [8], Chapter 7. It divides all MBF Equivalence Classes into layers determined by their size, and then links elements from these layers together when they can be turned into each other by adding/removing a single node and re-canonizing. An example of this is shown in Figure 12. To use it for iterating over all MBFs which are lower than some given MBF \(\alpha\), for example, one can simply start with the canonization of \(\alpha\), and then iterate over all Equivalence Classes below it in a layer-by-layer fashion. Since we already have an MBF LUT on the FPGA DDR itself, we actually only need to send the indices. That saves on CPU bandwidth as well.
Fig. 12.
Finally, to get enough bandwidth, we need enough processing capacity to generate these buffers and process the results. Each node of Noctua 2 [1] is equipped with two AMD Epyc (codename Milan) CPUs with a total of 128 Zen3 cores: Epyc 7763 CPUs in the compute nodes and slightly lower-clocked Epyc 7713 CPUs in the FPGA nodes. Figure 13 shows the structure of tasks and data structures running on these CPUs. We chose 16 threads for the Input Producers, 1 thread to operate both FPGAs, and 8 threads for processing the results; the remaining cores were used for performing CPU validation of the FPGA results. These threads were divided equally over the 16 core complexes and 8 Non-Uniform Memory Access (NUMA) nodes to optimally access the system’s memory.
Fig. 13.
The memory layout is carefully balanced to fit into the 512 GB of RAM available in each node. The InputProducer threads share 2 copies of the DownLink structure (one in each socket’s memory) of 45 GB each, though each InputProducer also has a private 2 GB for operating its Swapper for iterating through the DownLink structure. InputProducers always produce eight buffers at once, to amortize the memory overhead of reading the DownLink structure.
The processor threads upload the 8 GB MBF LUT to the FPGAs, and the validation threads store this same buffer, also in one copy for each socket. The result aggregators have another LUT for the Interval and class size infos, also replicated per socket. But then each result aggregator also has a private 8 GB Validation Buffer, the results of which are summed in the end.
All remaining memory is used for storing Input Index and Result buffers, of 1 GB and 2 GB each respectively. It is important to have enough buffers circulating to ensure modules don’t waste time waiting for new empty buffers.
6 Challenges
There were a few major challenges we encountered in developing for this FPGA platform that we deem interesting enough to share. The subtleties they exposed, and the solutions we had to develop, are directly linked to the major challenges of this computational project.
During development, the major issue was achieving the proper clock speed for our design. Commonly, OpenCL designs for Stratix 10 GX FPGAs reach clock frequencies of 200 to 300 MHz. Putting the core part of the computation in the clock2\(\times\) clock domain already improved the performance a lot: for a single copy of the FullPermutationPipeline we could go up to 550 MHz, but once we started filling up more of the FPGA area, performance quickly dropped to 150–200 MHz. It seemed that the compiler was intermixing the hardware of the different modules. By adding Logic Lock regions to forcibly separate the hardware of the 10 pipelines, as shown in Figure 14, we were able to reach our target frequency of 450 MHz.
Fig. 14.
Another issue we encountered was that very rarely a large error would occur during buffer transfer from FPGA memory to main memory. It appeared that multiple pages of data were skipped over, or taken from a different buffer. This turned out to be an issue in the Bittware-provided library for FPGA communication.
In the end we were able to retroactively eliminate all of the instances in which this problem occurred by checking the extra data that was saved along with the main data, namely the counts. This error type, which altered whole pages of these buffers, was nearly guaranteed to affect the resulting count, and so by checking the count we can check whether a subresult was likely to have had an error.
These counts can in turn also be used to check the validity of the whole computation, by using them to compute \(D(n+1)\) instead of \(D(n+2)\), as shown in Equation (15). Again, deduplication has to be taken into account for this computation.
In order to quickly recompute these term counts, we made use of the Fast Interval Counting algorithm developed by Bartłomiej Pawelski [15].
7 Computation on Noctua 2
We implemented this hardware accelerator on the 520N cards operated in Paderborn University’s Noctua 2 [1] supercomputer. We were able to fit 300 of these CountConnected Cores on a single FPGA die, running at 450 MHz. This gives us a throughput of about 33 billion CountConnected operations per second. At this rate, a pair of 520N FPGA cards processes about 5.8 \(\alpha\) values per second, taking 47,000 FPGA hours to compute D(9) on Noctua 2, or about 3 months real-time on up to 32 FPGAs used concurrently. The full hierarchy is shown in Figure 15.
Fig. 15.
The computation is split across the system along the lines of Equation (10). \(\alpha\) values (also named tops) are divided at the job level. There are 490 M tops to be processed for D(9), which we split into 15,000 jobs of about 32,700 tops each. The \(\beta\) values per top (also named bottoms) are placed in large buffers of 46 M bots on average, and sent over PCIe to the FPGA. The FPGA then computes all 5,040 permutations (\(\gamma\)) of each bottom, computes and adds up their P-Coefficients, and stores the result in an output buffer of the same size.
Table 5 summarizes the components and factors that contribute to the overall scale of the project. The upper half of the table designates the huge amount of computational work that was actually performed, processing as its central operation a total of around \(5.574\times 10^{18}\) P-Coefficients (that is, 5.574 Exa-P-Coefficients) with the approach described in Section 4.1. In contrast, the lower half of the table illustrates the amount of computational work that has been avoided by the mathematical approach (Sections 2 and 4) by providing the scale of the factors from Equation (10). Note that the individual factors vary widely within the summation formula and thus cannot be multiplied step by step as in the upper part of the table, but it is evident that both algorithmic and computational efficiency were of paramount importance for the overall result. Table 6 then dives deeper into the cycle-wise utilization of the components along the FPGA accelerator. It emphasizes how more headroom was created in the data generation and filtering steps, to firmly place the computation bottleneck on the most expensive part of the pipeline: the FloodFill iterations.
Factor | Value | per FPGA h
Tops \(\alpha\) | \(4.90\times 10^{8}\) | \(1.04\times 10^{4}\)
Bottoms \(\beta\) per Top | up to \(4.90\times 10^{8}\) |
\((\alpha,\beta)\) Pairs | \(4.59\times 10^{16}\) |
Symmetry Factor | \(2\) |
\((\alpha,\beta)\) Pairs incl symmetry | \(2.29\times 10^{16}\) | \(4.883\times 10^{11}\)
Possible permutations per \((\alpha,\beta)\) | \(5040\) |
Total Permutation # | \(1.157\times 10^{20}\) | \(2.461\times 10^{15}\)
Valid Permutation Fraction | \(4.96\%\) |
Valid Permutations # | \(5.574\times 10^{18}\) | \(1.186\times 10^{14}\)
P-Coefficients to Compute | \(5.574\times 10^{18}\) | \(1.186\times 10^{14}\)
Connected Components/PCoeff | \(4.074\) |
FloodFill Iterations/PCoeff | \(4.061\) |
Total FloodFill Iterations | \(2.264\times 10^{19}\) | \(4.816\times 10^{14}\)
\(|[\bot,\alpha]|\) | up to \(2.4\times 10^{12}\) |
\(E_{\alpha}\) | up to \(5040\) |
\(|[\beta,\top]|\) | up to \(2.4\times 10^{12}\) |
\(E_{\beta}/n!\) | up to \(1.0\) |
\(2^{C_{\alpha,\gamma}}\) | up to \(2^{35}=3.4\times 10^{10}\) |
D(9) | \(2.86\times 10^{42}\) |
Table 5. Table Illustrating the Factors That Bring the Computation of D(9) Together
The first section shows the total work size; the second section shows how many iterations it takes to process a single connected component. (Iteration count is not strongly related to the number of Connected Components; their similarity is merely coincidental, due to it taking multiple cycles per component and the algorithm optimizations employed, see [8] Table 9.3.) The third section bridges the gap between the number of terms and D(9) by illustrating the factors involved in the computation. These factors do not combine neatly though, since they are highly correlated; they are merely provided as illustration. Refer to Equation (10).
Quantity | per FPGA h | per FPGA s | max per FPGA clock | Clock | Utilization
\((\alpha,\beta)\) Pairs incl symmetry | \(4.883\times 10^{11}\) | \(1.356\times 10^{8}\) | 10 * 1/7 | 225 MHz | 0.422
Total Permutation # | \(2.461\times 10^{15}\) | \(6.836\times 10^{11}\) | 10 * 30 * 24 | 225 MHz | 0.422
P-Coefficients to Compute | \(1.186\times 10^{14}\) | \(3.392\times 10^{10}\) | 10 * 30 | 225 MHz | 0.503
FloodFill Iterations | \(4.816\times 10^{14}\) | \(1.337\times 10^{11}\) | 10 * 30 | 450 MHz | 0.990
Table 6. Table Illustrating the Average Utilization of Components along the FPGA Accelerator
It gives more information on the cycle-wise performance of Table 5. A utilization of \(\sim\)50% for the data production and filtering components was desired, ensuring that no part other than the FloodFill Iteration component would become the bottleneck, irrespective of fluctuations in processing difficulty.
The artifact of this computation is a dataset with an intermediary result for each of the 490 M \(\alpha\) values. Each of these can be checked separately,5 and the whole file sums to the final value D(9) = 286386577668298411128469151667598498812366.
In order to assess the FPGA design not only in absolute terms, but also in comparison to a CPU reference, we ran a benchmark for the CPU code on one of the jobs. We used highly optimized code built around manually vectorized components as described in Section 5.3 and Listing 4. Calculating tops is an embarrassingly parallel task, which would make running each top on a single core optimal in terms of minimizing communication between cores. But because running a single top on a single core can take several days, we opted to strike a balance between parallelization within a top and running several \(\alpha\)s in parallel, to avoid performance penalties from load balancing. We divide the 128 cores of a Noctua 2 FPGA node into 16 blocks of 8 cores, corresponding to the core complex size of the AMD Epyc CPUs. Each complex is then responsible for calculating a top, where all cores in the complex work together. This allows for parallelization while still keeping communication overhead negligible. Running one job on 128 AMD Epyc CPU cores took 150 hours, in which time it had processed 32,664 \(\alpha\)s. This gives us a computation rate of 0.0604 \(\alpha\)s per second, which is 95\(\times\) slower per node compared to a node of the FPGA implementation.6 In other words, one Intel Stratix 10 GX 2800 FPGA outperforms 64 AMD Epyc Milan 7713 cores by a factor of 95\(\times\). To calculate D(9) on a CPU cluster in the same time as on 32 FPGAs, 3,040 AMD Epyc Milan 7713 CPUs with a total of 194,560 cores would have been required.
8 Correctness
As much of the code as possible is written generically. This means the same system is used for computing D(3)–D(8). All of these yield the correct results. Of course, the FPGA kernel is written specifically for D(9) computation, so its correctness was verified by comparing its results with the CPU results for a small sample. In effect, both methods verified each other’s correctness.
However, we do have several checks to increase our confidence in the result.
–
The most direct is the \(D(9)\equiv 6\ (\mathrm{mod}\ 210)\) check provided by Pawelski and Szepietowski [16]. Our result passes this check. Sadly, due to the structure of our computation, nearly all terms are divisible by 210, strongly weakening the check. One thing that this check does give us is that no integer overflow has occurred, which was an important concern given that we were working with integers 128 and 192 bits wide.
–
As mentioned earlier, our computation was plagued by one issue in particular. We encountered a bug in the Bittware-provided library for communication over PCIe, wherein, occasionally and at a low incidence rate, full 4 K pages of FPGA data are not copied properly from FPGA memory to host memory. This results in large blocks of incorrect bottoms for some tops. We encountered this issue in about 2,300 tops. We were able to mitigate this issue by including extra data from the FPGA to host memory, namely the “valid permutation count.” By checking these values, we could determine if a bottom buffer had been corrupted. Additionally, adding all of these counts yields the value for D(8), which shows that the correct number of terms has been added. This issue has at this point been fixed on Noctua 2.
–
Finally, there is an estimation formula that comes relatively close to our result. The Korshunov estimation formula estimates D(9) = \(1.15\times 10^{41}\) which is off by about a factor 2.7
9 Danger of Cosmic Radiation
The one way our result could still have been wrong was due to a single event upset (SEU), such as a bitflip in the FPGA fabric during processing, or a bitflip during data transfer from FPGA DDR memory to Main Memory.
It is difficult to characterize the odds of these SEU events. Expected numbers of occurrences for the FPGAs we used are, to the best of our knowledge, not available, but example values shown on Intel’s website pin the error rate at around 5,000 SEU events per billion FPGA hours. In that case, given our 47,000 FPGA hours, we expect to see 0.235 Poisson-distributed errors, giving us a roughly 20% chance of at least one hit. Of course, this is just an example and the real odds might have been higher than that.
We did work to harden the accelerator against such bit errors to the extent that this was feasible. Intel Stratix’ M20k blocks come equipped with built-in error detection and correction, if enabled. We enabled the error detection, and thereby rejected any run that encountered such a bit error. Not a single one happened during our run, which gives us more confidence that the error rate is relatively low. As for the program Configuration RAM (CRAM), it is difficult to tell whether it came pre-hardened out of the box. The Intel Stratix chips all provide an automatic round-robin test of the CRAM contents, and while this information is available on a package pin, Bittware’s documentation does not mention it, so we don’t know whether they implement it. Finally, the contents of registers in the logic are near-impossible to harden, so we didn’t try.
In the end, the most powerful confirmation that no bit errors occurred was that our result was identical to Jäkel’s [10].
10 Conclusion
In this paper we tackled the problem of computing the 9th Dedekind number. We first provided a mathematical framework describing MBFs and the operations that can be performed on them. Then we went into detail on how the implementation works, starting from the key advantage that made FPGAs a strong choice for the algorithm we wished to use. We built up the FPGA accelerator from this building block, up the hierarchy for data flow management and parallelization. Once we had a full FPGA accelerator, we described the steps that needed to be taken to produce the data at the rate the accelerator consumes it. Finally, we gave the results of actually executing the compute project, and showed the observed advantage over what it would have been had we implemented the computation on CPU.
So what does the future hold for the 10th Dedekind number? Computing that will require a change in methods, due to the sheer increase in computational labor that would need to happen. The fundamental complexity of our algorithm was on the order of \(O(R(n-2)\times D(n-2))\). For our implementation this amounted to \(4.59\times 10^{16}\) terms; for D(10), that would transform into something in the ballpark of \(10^{34}\) terms, far exceeding current computational ability. Jäkel’s approach, too, would crumble under this demand; the computational complexity is simply too great.
Our result was originally uncertain due to the possibility of bit errors caused by cosmic radiation. We held off on publication until we had completed the second run, although we did register the value we got in the corresponding GitHub commit from 9 March 2023:
On 4 April 2023, a preprint claiming D(9), “A computation of the ninth Dedekind number” by Christian Jäkel [10], was published, which arrived at exactly the same number we had. Since this confirmed the correctness of our own result, we decided to publish our own preprint as well [9]. All code is available at https://github.com/VonTum/Dedekind. All data is available at https://hirtum.com/dedekind.
Acknowledgments
The authors gratefully acknowledge the funding of this project by computing time provided by the Paderborn Center for Parallel Computing (PC2).
Footnotes
1
Other sources sometimes also use \(M(n)\).
2
This follows immediately from the definition of the PCoef in Equation (7).
3
We made sure not to deduplicate pairs that were their own dual, i.e., when \(\beta=\overline{\alpha}\).
5
It takes about 10–2,000 seconds to compute a single \(\alpha\) result on 128 AMD Epyc CPU cores.
6
Earlier texts on this work had mentioned speedups of 500\(\times\), but it turned out there was a memory bottleneck giving the CPU an unfair disadvantage. This was fixed for a fairer comparison.
7
This isn’t too unusual though, as the results for odd values are off by quite a lot. Estimation for D(3) overestimates by a factor 2, D(5) also overestimates by a factor 2, and D(7) overestimates roughly 10%.
References
[1] Carsten Bauer, Tobias Kenter, Michael Lass, Lukas Mazur, Marius Meyer, Holger Nitsche, Heinrich Riebler, Robert Schade, Michael Schwarz, Nils Winnwa, Alex Wiens, Xin Wu, Christian Plessl, and Jens Simon. 2024. Noctua 2 Supercomputer. Journal of Large-Scale Research Facilities (JLSRF) 9 (2024). https://doi.org/10.17815/jlsrf-8-187
Patrick De Causmaecker and Lennart Van Hirtum. 2024. Solving systems of equations on antichains for the computation of the ninth Dedekind number. arXiv:2405.20904. Retrieved from
Patrick De Causmaecker, Stefan De Wannemacker, and Jay Yellen. 2016. Intervals of Antichains and Their Decompositions. arXiv:1602.04675. Retrieved from
J. Deepakumara, H.M. Heys, and R. Venkatesan. 2001. FPGA implementation of MD5 hash algorithm. In Proceedings of the Canadian Conference on Electrical and Computer Engineering (CCECE ’01), Vol. 2. 919–924.
Lennart Van Hirtum. 2021. A Path to Compute the 9th Dedekind Number using FPGA Supercomputing. KU Leuven, Masters Thesis. Retrieved from https://hirtum.com/thesis.pdf.
Lennart Van Hirtum, Patrick De Causmaecker, Jens Goemaere, Tobias Kenter, Heinrich Riebler, Michael Lass, and Christian Plessl. 2023. A computation of D(9) using FPGA supercomputing. arXiv:2304.03039. Retrieved from
The On-Line Encyclopedia of Integer Sequences® (OEIS®). 1964. A000372, Dedekind Numbers or Dedekind’s Problem. Retrieved October 16, 2023 from https://oeis.org/A000372
The On-Line Encyclopedia of Integer Sequences® (OEIS®). 1964. A003182, Dedekind Numbers: Inequivalent Monotone Boolean Functions of N or Fewer Variables. Retrieved October 16, 2023 from https://oeis.org/A003182
Jan-Oliver Opdenhövel, Christian Plessl, and Tobias Kenter. 2023. Mutation tree reconstruction of tumor cells on FPGAs Using a bit-level matrix representation. In Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART ’23). ACM, New York, NY, 27–34. DOI:
Heinrich Riebler, Tobias Kenter, Christian Plessl, and Christoph Sorge. 2014. Reconstructing AES key schedules from decayed memory with FPGAs. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ’14). IEEE Computer Society, Washington, DC, 222–229.
Heinrich Riebler, Tobias Kenter, Christoph Sorge, and Christian Plessl. 2013. FPGA-accelerated key search for cold-boot attacks against AES. In Proceedings of the International Conference on Field Programmable Technology (ICFPT ’13). IEEE Computer Society, 386–389.
Heinrich Riebler, Michael Lass, Robert Mittendorf, Thomas Löcke, and Christian Plessl. 2017. Efficient branch and bound on FPGAs using work stealing and instance-specific designs. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 10, 3, Article 24 (Jun. 2017), 23 pages. DOI: