Fast Falcon Signature Generation and Verification pqc2022
Fast Falcon Signature Generation and Verification pqc2022
Fast Falcon Signature Generation and Verification pqc2022
Abstract. We present our speed records for Falcon signature generation and verifi-
cation on ARMv8-A architecture. Our implementations are benchmarked on Apple
M1 ’Firestorm’ and Raspberry Pi 4 Cortex-A72 chips. Compared with lattice-based
CRYSTALS-Dilithium, our optimized signature generation is 2× slower, but signature
verification is 3–3.9× faster than the state-of-the-art CRYSTALS-Dilithium imple-
mentation on the same platforms. Compared with stateful and stateless hash-based
digital signatures, XMSS and SPHINCS+ , our optimized Falcon implementation has
10–950× faster signature generation and 23–31× faster signature verification. Faster
signature verification may be particularly useful for the client side on constrained
devices. Our Falcon implementation outperforms the previous work in both Sign
and Verify operations. We achieve improvement in Falcon signature generation by
supporting all possible parameters N of FFT-related functions and applying our
compressed twiddle-factor table to reduce memory usage. We also demonstrate that
the recently proposed signature scheme Hawk, sharing functionality with Falcon,
offers 17% smaller signature sizes, 3.3× faster signature generation, and 1.6–1.9×
slower signature verification when implemented on the same ARMv8 processors as
Falcon.
Keywords: Number Theoretic Transform · Fast Fourier Transform · Falcon ·
lattice-based cryptography · Post-Quantum Cryptography · ARMv8 · NEON
1 Introduction
When large quantum computers arrive, Shor’s algorithm [Sho94] will break almost all
currently deployed public-key cryptography in polynomial time [CJL+ 16] due to its
capability to obliterate two cryptographic bastions: the integer factorization and discrete
logarithm problems. While there is no known quantum computer capable of running Shor’s
algorithm with parameters required to break current public-key standards, the process
of selecting, standardizing, and deploying new cryptographic algorithms has always been
taking years, if not decades.
In 2016, NIST announced the Post-Quantum Cryptography (PQC) standardization
process aimed at developing new public-key standards resistant to quantum computers. In
July 2022, NIST announced the choice of three digital signature algorithms [AAC+ 22]:
CRYSTALS-Dilithium [BDK+ 21], Falcon [FHK+ 20] and SPHINCS+ [BHK+ 19] for a likely
standardization within the next two years. Additionally, NIST has already standardized
two stateful signature schemes XMSS [HBG+ 18] and LMS [MCF19].
Compared to Elliptic Curve Cryptography and RSA, PQC digital signatures have
imposed additional implementation constraints, such as bigger key and signature sizes (and
thus increased bandwidth), higher memory usage, support for floating-point operations,
etc. In many common applications, such as the distribution of software updates and the
2 Fast Falcon Signature Generation and Verification Using ARMv8 NEON Instructions
use of digital certificates, a signature is generated once by the server but verified over and
over again by clients forming the network.
In this paper, we examine the PQC digital signatures’ speed on ARMv8 platforms.
ARMv8 (a.k.a. ARMv8-A) supports two instruction set architectures AArch64 (a.k.a.
arm64) and AArch32 (a.k.a. armeabi). The associated instruction sets are referred to as
A64 and A32, respectively. AArch32 and A32 are compatible with an older version of
ARM called ARMv7-A. AArch64 and A64 support operations on 64-bit operands.
NEON is an alternative name for Advanced Single Instruction Multiple Data (ASIMD)
extension, mandatory since ARMv7. NEON includes additional instructions that can
perform arithmetic operations in parallel on multiple data streams. It also provides a
developer with 32 128-bit vector registers. Each register can store two 64-bit, four 32-bit,
eight 16-bit, or sixteen 8-bit integer data elements. NEON instructions can perform the
same arithmetic operation simultaneously on the corresponding elements of two 128-bit
registers and store the results in the respective fields of a third register. Thus, an ideal
speed-up vs. traditional single-instruction single-data (SISD) ARM instructions varies
between 2 (for 64-bit operands) and 16 (for 8-bit operands).
In today’s market, there is a wide range of ARMv8 processors supporting NEON. They
are developed by manufacturers such as Apple Inc., ARM Holdings, Cavium, Fujitsu,
Nvidia, Marvell, Qualcomm, and Samsung. They power the majority of smartphones
and tablets available on the market. The ARM Cortex-A72 is a core implementing the
ARMv8-A 64-bit instruction set. It was designed by ARM Holdings’ Austin design center.
This core is used in Raspberry Pi 4 - a small single-board computer developed in the
United Kingdom by the Raspberry Pi Foundation in association with Broadcom. The
corresponding SoC is called BCM2711. This SoC contains four Cortex-A72 cores. Apple
M1 is an ARMv8-based system on a chip (SoC) designed by Apple Inc. for multiple Apple
devices, such as MacBook Air, MacBook Pro, Mac Mini, iMac, and iPad Pro. The M1 has
four high-performance ’Firestorm’ and four energy-efficient ’Icestorm’ cores supporting
an extension of ARMv8 called ARMv8.4-A. Each of them supports NEON. Starting from
version Armv8.3-A, and thus included in version Armv8.4-A, there is NEON support for
operations on complex numbers.
In this work, we developed an optimized implementation of Falcon targeting ARMv8
cores. We then reused a significant portion of our Falcon code to implement Hawk – a new
lattice-based signature scheme proposed in Sep. 2022 [DPPvW22]. Although this scheme
is not a candidate in the NIST PQC standardization process yet, it may be potentially
still submitted for consideration in response to the new NIST call, with the deadline in
June 2023.
We then benchmarked our implementation and existing implementations of Falcon,
Hawk, CRYSTALS-Dilithium, SPHINCS+, and XMSS using the ’Firestorm’ core of Apple
M1 (being a part of MacBook Air) and the Cortex-A72 core (being a part of Raspberry
Pi 4), as these platforms are widely available for benchmarking. However, we expect
that similar rankings of candidates can be achieved using other ARMv8 cores (a.k.a.
microarchitectures of ARMv8).
Contributions. In this paper, we overcome the high complexity of the Falcon implemen-
tation and present a speed record for its Signature generation and Verification on two
different ARMv8 processors.
In a signature generation, we constructed vectorized scalable FFT implementation that
can be applied to any FFT level greater than five. We compressed the twiddle factor table
in our FFT implementation, inspired by the complex conjugate root of FFT. In particular,
we reduced the size of this table from 16 Kilobytes in the reference implementation down to
4 Kilobytes in our new ref and neon implementations. The modified FFT implementation
with 4× smaller twiddle factor table is not specific to any processor. Thus, it can be used
Nguyen et al. 3
2 Previous Work
The paper by Streit et al. [SDS18] was the first work about a NEON-based ARMv8 im-
plementation of the lattice-based public-key encryption scheme New Hope Simple. Other
works implement Kyber [ZZH+ 21], SIKE [JAMK+ 19], and Frodo [KJK+ 21] on ARMv8.
The most recent works on the lattice-based finalists NTRU, Saber, and CRYSTALS-Kyber
are reported by Nguyen et al. [NG21a,NG21b]. The paper improved polynomial multiplica-
tion and compared the performance of vectorized Toom-Cook and NTT implementations.
Notably, the work by Becker et al. [BHK+ 21] showed a vectorized NEON NTT implemen-
tation superior to Toom-Cook, introduced fast Barrett multiplication, and special use of
multiply-return-high-only sq[r]dmulh instructions. The SIMD implementation of Falcon
was reported by Pornin [Por19] and Kim et al. [KSS22]. On the application side, Falcon is
the only viable option in hybrid, partial, and pure PQC V2V design described in the work
of Bindel et al. [BMTR22]
In the area of low-power implementations, most previous works targeted the ARM
Cortex-M4 [KRSS19]. In particular, Botros et al. [BKS19], and Alkim et al. [AABCG20]
developed Cortex-M4 implementations of Kyber. Karmakar et al. [KBMSRV18] reported
results for Saber. Chung et al. [CHK+ 21] on Saber and NTRU, and later work by Becker et
al. [BMK+ 22] improved Saber by a large margin on Cortex-M4/M55. The latest work by
Abdulrahman et al. [AHKS22] improved Kyber and Dilithium performance on Cortex-M4.
The most comprehensive Fast Fourier Transform (FFT) work is by Becoulet et
al. [BV21]1 . FFTW and FFTS by Frigo et al. The publications [FJ12] and Blake [BWC13]
describe the SIMD FFT implementations. Impressive results have been reported for
AutoFFT on both ARM and x86 CPUs by Li et al. [LJZ+ 20]. However, we were unable to
locate the corresponding source code.
3 Background
Table 1 summarizes values of parameters n and q for various NIST security levels. n is a
parameter in the cyclotomic polynomial ϕ = (xn + 1), and q is a prime defining a ring
Zq [x]/(ϕ). The sizes of the public key and signature in bytes (B) are denoted with |pk|
and |sig|. The signature ratio |sig| ratio is the result of dividing the signature size of other
schemes by the signature size of falcon512, dilithium3, and falcon1024.
3.1 Falcon
Falcon is a lattice-based signature scheme utilizing the ’hash-and-sign’ paradigm. The
security of Falcon is based on the hardness of the Short Integer Solution problem over
1 https://github.com/diaxen/fft-garden
4 Fast Falcon Signature Generation and Verification Using ARMv8 NEON Instructions
Table 1: Parameter sets, key sizes, and signature sizes for falcon, hawk, dilithium,
xmss, and sphincs+ in comparison with falcon512, dilithium3, falcon1024 according
to three security levels
NIST
level n q |pk| |sig| |pk + sig| |sig| ratio
falcon512 12,289 897 652 1,549 1.00
I 512
hawk512 65,537 1,006 542 1,548 0.83
dilithium2 256 8,380,417 1,312 2,420 3,732 3.71
II
xmss16 -sha256 - - 64 2,692 2,756 4.12
sphincs+ 128s - - 32 7,856 7,888 12.05
I
sphincs+ 128f - - 32 17,088 17,120 26.21
dilithium3 256 8,380,417 1,952 3,293 5,245 1.00
sphincs+ 192s III - - 48 16,224 16,272 4.93
sphincs+ 192f - - 48 35,664 35,712 10.83
falcon1024 12,289 1,793 1,261 3,054 1.00
1024
hawk1024 65,537 2,329 1,195 3,524 0.95
dilithium5 V 256 8,380,417 2,592 4,595 7,187 3.64
sphincs+ 256s - - 64 29,792 29,856 23.62
sphincs+ 256f - - 64 49,856 49,920 39.53
NTRU lattices, and the security proofs are given in the random oracle model with tight
reduction. Falcon is difficult to implement, requiring tree data structures, extensive floating-
point operations, and random sampling from several discrete Gaussian distributions. The
upsides of Falcon are its small public keys and signatures as compared to Dilithium. As
shown in Table 1, the signature size of Falcon at the highest NIST security level is still
smaller than that of the lowest security level of Dilithium, XMSS, and SPHINCS+ . Key
generation in Falcon is expensive. However, a key can be generated once and reused later.
The signature generation (Algorithm 17 in Appendix A) of Falcon first computes
hash value c from message m and salt r. Then, it uses (f, g, F, G) from the secret key
components to compute two short values s1 , s2 such that s1 + s2 h = c mod (ϕ, q). Falcon
relies extensively on floating point computation during signature generation, used in Fast
Fourier Transform (FFT) over the ring Q[x]/(ϕ), and Gaussian and Fast Fourier Sampling
(ffSampling) for Falcon tree T.
The signature verification (Algorithm 18 in Appendix A) checks if two short values
(s1 , s2 ) are in acceptance bound ⌊β 2 ⌋ using the knowledge from public key pk, and signature
(r, s). If the condition at line 5 is satisfied, then the signature is valid; otherwise it is
rejected. As opposed to signature generation, Falcon Verify operates only over integers.
Falcon only supports NIST security levels 1 and 5. A more detailed description of the
underlying operations can be found in the Falcon specification [FHK+ 20].
3.2 Dilithium
Dilithium is a member of the Cryptographic Suite for Algebraic Lattices (CRYSTALS)
along with the key encapsulation mechanism (KEM) Kyber. The core operations of
Dilithium are the arithmetic of polynomial matrices and vectors. Unlike ’hash-and-sign’
used in Falcon, Dilithium applied Fiat-Shamir with Aborts [DFG+ , Lyu09] style signature
scheme and bases its security upon the Module Learning with Errors (M-LWE) and Module
Short Integer Solution (M-SIS) problems.
Compared with Falcon, Dilithium only operate over the integer ring Zq [x]/(ϕ) with
Nguyen et al. 5
3.3 XMSS
XMSS [HBG+ 18] (eXtended Merkle Signature Scheme) is a stateful hash-based signature
scheme based on Winternitz One-Time Signature Plus (WOTS+) [Hül13]. XMSS requires
state tracking because the private key is updated every time a signature is generated.
Hence, the key management of XMSS is considered difficult. Consequently, XMSS should
only be used in highly controlled environments [CAD+ 20]. The advantages of XMSS over
SPHINCS+ are smaller signature sizes and better performance. XMSS is a single-tree
scheme, with a multi-tree variant XMSSM T also included in the specification. The security
of XMSS relies on the security of collision search of an underlying hashing algorithm.
Single-tree XMSS has faster signature generation and verification than the multi-tree
XMSSM T and comes with three tree heights: h = [10, 16, 20], which can produce up to
2h signatures. In this paper, we select a single-tree variant of XMSS with a reasonable
number of signatures, 216 , and choose sha256 as underlying hash functions. This variant
is denoted as xmss16 -sha256 in Table 1.
3.4 SPHINCS+
SPHINCS+ [BHK+ 19] is a stateless hash-based signature scheme that avoids the complexi-
ties of state management associated with using stateful hash-based signatures. SPHINCS+
security also relies on hash algorithms. The algorithm is considered a conservative choice,
preventing any future attacks on lattice-based signatures. SPHINCS+ provides ’simple’
and ’robust’ construction. The ’robust’ construction affects the security proof and runtime.
In addition, small (’s’) and fast parameters (’f ’) influence execution time. These parameter
set variants are over 128, 192, and 256 quantum security bits.
Based on the performance provided in the specification of SPHINCS+ , we select the
’simple’ construction, and both ’s’ and ’f ’ parameters for NIST security levels 1, 3, and 5,
as shown in Table 1. Unlike XMSS, we select shake as the underlying hash function.
3.5 Hawk
Hawk is a recent signature algorithm proposed by Ducas et al. [DPPvW22] based on the
Lattice Isomorphism Problem (LIP). Hawk avoids the complexities of the floating-point
discrete Gaussian sampling, which is a bottleneck in our optimized Falcon implementation.
Hawk chooses to sample in a simple lattice Zn [BGPS21, DvW22] with a hidden rotation.
An avx2 implementation of hawk1024 is faster than the equivalent implementation of
falcon1024 by 3.9× and 2.2×. With our optimized neon implementation of Falcon, we
decided to port our optimized Falcon code to Hawk and investigated if such performance
gaps between Hawk and Falcon still hold. In this work, we select hawk512, hawk1024 at
NIST security levels 1 and 5.
Complete NTT is similar to traditional FFT (Fast Fourier Transform) but uses the
root of unity in the discrete field rather than in a set of real numbers. NTT and NTT−1
are forward and inverse operations, where N T T −1 (N T T (f )) = f for all f ∈ Rq .
N T T (A) ∗ N T T (B) denotes pointwise multiplication. Polynomial multiplication using
NTT is shown in Equation 1.
′
1 t ← sqrdmulh(a, b ) ▷ hi(round((2 · a · b′ ))
2 c ← mul(a, b) ▷ lo(a · b)
3 c ← mls(c, t, q) ▷ lo(c − t · q)
The basic idea of Montgomery multiplication is summarized in Appendix B. Its use in our
implementation is explained below.
First, Falcon Verify computes only one polynomial multiplication (as in line 4 of
Algorithm 18). Two polynomials (s2 , h) are converted to NTT domain. Then, we per-
form pointwise multiplication between two polynomials in NTT domain. To efficiently
compute modular reduction for two unknown factors during pointwise multiplication, the
conventional way is to convert one polynomial to the Montgomery domain and perform
Montgomery multiplication. Eventually, the multiply with a constant factor n−1 is applied
at the end of Inverse NTT. We apply small tweak by embedding n−1 into Montgomery
conversion during pointwise multiplication (ai bi n−1 instead of ai bi for i ∈ [0, . . . , n − 1])
to avoid multiplications with n−1 in Inverse NTT.
The Montgomery n−1 conversion uses Barrett multiplication with a known factor
amont = barrett_mul(a, b, b′ ), with b = R · n−1 mod q. Furthermore, it is beneficial at
the instruction level to embed n−1 when n is a power of 2 in Montgomery conversion. In
particular, when (R, n) = (216 , 210 ) then b = R · n−1 = 216 · 2−10 = 26 mod q. Hence, line
2 of Algorithm 1 can be replaced by a shift left shl instruction.
Secondly, we apply Algorithm 2 for pointwise multiplication. Another variant of
Montgomery multiplication via rounding using only 4 instructions is considered, but it
only works if one factor is odd (Section 3 in Becker et al. [BHK+ 21]). Thus, Montgomery
multiplication via doubling is selected since it does not assume the parity of coefficients.
Nguyen et al. 7
Details of the proof can be found in Becker et al. [BHK 21]. Given that q = 12289, R = 216 ,
+
R
and center around 0, the maximum bound of signed arithmetic center around 0 is 2.6q ≈ 2q
instead of 5.3q ≈ Rq in unsigned arithmetic.
During Forward and Inverse NTT, we carefully control the bound of each coefficient
by applying our strict Barrett multiplication bound analysis. The naive 2.6q bound
assumption will lead to performing Barrett reduction after every NTT level in CT butter-
flies (Algorithm 4). To minimize the number of Barrett reductions, we validate the range of
c = barrett_mul(b, ω, ω ′ ) for all unknown values of |b| < R2 and ω ∈ ωi and ω ′ ∈ ωi′ table
according to each NTT level by exhaustive search. The bound output c of barrett_mul
is increasing if |b| < q and decreasing if |b| ≥ q. For example, if qb ≈ (0.5, 1.0, 2.0, 2.5) then
after Barrett multiplication, the obtained bounds are qc ≈ (0.69, 0.87, 1.25, 1.44).
As a result, we were able to minimize the number of reduction points in the Forward
and Inverse NTT from after every one NTT level to every two NTT levels. In our case,
an exhaustive search works in an acceptable time for the 16-bit space. However, it takes
much longer for the 32- and 64-bit spaces. A formal, strict bound analysis instead of an
exhaustive search approach is considered as future work.
Algorithm 3: Signed Barrett reduction [BHK+ 21] for prime q = 12289
Input: Any |a| < R = 2w , constants (q = 12289, w = 16, v = 5461, i = 11)
Output: c = barrett_mod(a, q) ≡ a mod q, and − 2q ≤ c < 2q
1 t ← sqdmulh(a, v) ▷ hi(2 · a · v)
2 t ← srshr(t, i) ▷ round(t ≫ i)
3 c ← mls(a, t, q) ▷ lo(a − t · q)
The discrete Fourier transform in Equation 2 has time complexity of O(n2 ). FFT
improves the transformation with the time complexity of O(n log n) [CT65].
The advantage of FFT over NTT for polynomial multiplication is that the root of
unity ei2π/N always exists for arbitrary N . Additionally, FFT suffers precision loss caused
by rounding in the floating-point-number computations, while polynomial multiplication
using NTT guarantees correctness due to NTT operating in the integer domain.
Nguyen et al. 9
Complex number arithmetic Let a, b = (are , aim ), (bre , bim ) then a±b = (are ±bre , aim ±
bim ) and a ∗ b = (are bre − aim bim , are bim + aim bre ). A counterclockwise rotation by 90 and
270 degree of a are (ia) = (−aim , are ) and (−ia) = (aim , −are ), respectively. Conjugate of
a complex number a = (are , aim ) is â = conjugate(a) = (are , −aim ).
cache spatial and temporal locality with constant access patterns when load and store
instructions are executed. All the butterflies in Algorithm 8 are computed in-place. Note
that the two for loops from line 10 to 17 can be executed in parallel, which may be of
interest when developing a hardware accelerator.
In a signature generation, at line 5 in Algorithm 17, Fast Fourier sampling (ffSampling)
builds an FFT tree by traveling from top level l = log2 (N ) to lower level l − 1, l − 2, . . . 1,
where N is a total number of real and imaginary points. Hence, FFT implementation
must support all FFT levels from log2 (N ) to 1.
Our C FFT implementation supports all levels of Forward and Inverse FFT trivially.
However, our vectorized FFT must be tailored to support all FFT levels. Instead of
vectorizing l − 1 FFT implementations, first, we determined the maximum number of
coefficients that can be computed using 32 vector registers. Then, we select l = 5 as the
baseline to compute FFT that minimizes load and store. In case l < 5, we unroll the FFT
loop completely, and save instructions overhead. When l ≥ 5, we use the base loop with
l = 5 with 32 64-bit coefficients and implement additional two FFT loops. The second
loop computes two FFT levels per iteration to save load and store instructions. The third
loop is an unrolled version of a single FFT level per iteration. In short, using three FFT
loops, we can construct arbitrary FFT level l ≥ 5 by using the base loop with 5 levels,
then a multiple of two FFT levels by the second loop. Finally, the remaining FFT levels
are handled by the third loop, as shown in Table 2.
6 Results
ARMv8 Intrinsics are used for ease of implementation and to take advantage of the
compiler optimizers. The optimizers know how intrinsics behave and tune performance
toward the processor features such as aligning buffers, scheduling pipeline operations, and
instruction ordering 4 . In our implementation, we always keep vector register usage under
32 and examine assembly language code obtained during our development process. We
acknowledge that the compiler occasionally spills data from registers to memory and hides
load/store latency through the instructions supporting pipelining.
Falcon and Hawk The reference implementation of Hawk 5 uses fixed-point data type by
default. Since our choice of processors supports floating-point arithmetic, we convert all
fixed-point arithmetic to floating-point arithmetic and achieve a significant performance
boost in reference implementation as compared to the default setting. Notably, this choice
disables NTT implementation in Hawk Sign and Verify, while Falcon Verify explicitly
uses integer arithmetic for NTT implementation by default. Although it is possible to
vectorize NTT implementation in Hawk Verify, we consider this as future work. In both
implementations, we measure the Sign operations in the dynamic signing - the secret key
is expanded before signing, and we do not use floating-point emulation options.
Results for FFT and NTT are summarized in Table 4 with FMA enable. When N = 4,
we realized that our vectorized code runs slower than the C implementation due to the
FFT vectorized code being too short. Therefore, we use the same ref implementation.
Starting from N ≥ 8, our FFT vectorized code achieved speed-up from 1.2× to 2.3× and
1.4× to 2.4× compared to the ref implementation of the Forward and Inverse FFT on
Apple M1, and from 1.8× to 1.9× and 1.9× to 2.3× for both Forward and Inverse FFT on
Cortex-A72. These speed-ups are due to our unrolled vectorized implementation of FFT,
which supports multiple FFT levels and twiddle-factor sharing.
In Falcon Verify, our benchmark of Inverse NTT is without multiplication by n−1
because we already embed this multiplication during Montgomery conversion. For both FFT
and NTT, our neon implementation in both Cortex-A72 and Apple M1 achieve better speed-
up ratio than the avx2 implementation, as shown in Table 4. There is no avx2 optimized
implementation of Falcon Verify, it is the same as pure ref implementation. Overall, our
neon NTT implementation is greatly improved compared to ref implementation, which
determines the overall speed-up in Falcon Verify.
The increasing and then decreasing trend in Cortex-A72 shows the limit of the cache
on low-end devices, while high-end neon and avx2 processors show a declining trend at
N = 1024 for both Forward and Inverse FFT.
Comparison with previous work In Table 5, by comparing the cycles of ref implementa-
tions, we can see that our implementation outperforms the work by Kim et al. [KSS22]
conducted on Jetson AGX Xavier. On Jetson AGX Xavier, the implementation improved
from 1.17× to 1.69× as compared to the ref implementation in both Sign and Verify
operations at security level 5. The corresponding improvements on Cortex-A72 and Apple
M1 are from 1.51× to 2.10× and 1.49× to 2.06× as compared to the C implementation.
In a signature generation, as compared to the Kim et al. [KSS22] approach, we decided
against replacing macro functions FPC_MUL, FPC_DIV,.... Our manual work of unrolled
versions of the Forward and Inverse FFT, as well as splitfft, mergefft, mulfft,...,
8 mitigations=off https://make-linux-fast-again.com/
9 https://github.com/mupq/pqax/tree/main/enable_ccr
10 https://github.com/dougallj
11 https://github.com/GMUCERG/PQC_NEON/blob/main/neon/kyber/m1cycles.c
Nguyen et al. 17
Table 4: Cycle counts for the implementation of FFT (with FMA enabled) and NTT with
the size of N coefficients on Apple M1 and Cortex-A72 - neon vs. ref. On Intel i7-1165G7
- ref vs. avx2
Apple M1 Forward FFT(cycles) Inverse FFT(cycles)
N ref neon ref/neon ref neon ref/neon
8 115 91 1.26 127 88 1.44
16 167 110 1.52 178 103 1.73
32 245 147 1.67 261 144 1.81
64 408 232 1.76 424 228 1.86
128 759 404 1.88 847 401 2.11
256 1,633 789 2.07 1,810 794 2.28
512 3,640 1,577 2.31 3,930 1,609 2.44
1024 7,998 3,489 2.29 8,541 3,547 2.41
Table 5: Speed-up ratio comparison (with FMA enabled) with previous work from Kim et
al. [KSS22] on Raspberry Pi 4, Apple M1, and Jetson AGX Xavier - kc-kilocycles
contribute to greatly improving the performance of Falcon. We also modify the code
logic to reduce calling memcpy functions during operation and sharing twiddle factors,
which greatly reduces memory load and store overhead. In signature verification, our NTT
speed-up is 6.2× and 6.0× with respect to ref implementation for Forward and Inverse
NTT, as shown in Table 4, while Kim et al. [KSS22] only achieve less than 3× speed-up.
The significant speed-up is accomplished due to our signed-integer implementation of
NTT, with values centered around 0, while previous work used unsigned integer NTT. Our
signed integer NTT implementation utilized fast Barrett multiplication in 3 instructions
by Becker et al. [BHK+ 21], which greatly reduces the number of multiply instructions,
creating a shorter dependency chain and eliminating zip and unzip instructions.
Falcon and Dilithium In Table 6, we rank our implementations in respect to the state-
of-the-art CRYSTALS-Dilithium implementation from Becker et al. [BHK+ 21] across all
security levels. Please note that in the top rows, only dilithium2, xmss16 -sha256 have
security level 2, while the rest algorithms have security level 1. For all security levels of
Falcon and Dilithium, when executed over messages of the size of 59 bytes, for Signature
generation, Falcon is comparable with Dilithium in ref implementations. However, the
landscape drastically changes in the optimized neon implementations. Dilithium has an
execution time 2× smaller than Falcon at the lowest and highest security levels.
The speed-up ratio of Dilithium in ARMv8 is 3.3× as compared to ref implementation.
This result is due to the size of operands in the vectorized implementation. Dilithium
uses 32-bit integers and parallel hashing shake128/256. This leads to a higher speed-up
ratio as compared to 64-bit floating-point operations, serial hashing using shake256, and
serial sampling in Falcon. Additionally, the computation cost of floating-point operations
is higher than for integers. Hence, we only achieve 1.42× speed-up (with FMA disable)
compared to the ref implementation for the signature generation operation in Falcon.
Although there are no floating-point computations in Falcon Verify, our speed-up is smaller
than for Dilithium due to the serial hashing using shake256. We believe that if Falcon
adopts parallel hashing and parallel sampling algorithms, its performance could be further
improved.
In Verification across all security levels, Falcon is consistently faster than Dilithium by
3.0× to 3.9× in both ref and neon implementations.
Hawk vs. Falcon and Dilithium At security levels 1 and 5, Hawk outperforms both
Falcon and Dilithium in ref and neon implementations. The exception is the neon
Nguyen et al. 19
Table 6: Signature generation and Verification speed comparison (with FMA disable)
over three security levels, signed 59 bytes message. ref and neon results for Apple M1.
kc-kilocycles.
Table 7: Signature generation and Verification speed comparison (with FMA disabled)
over three security levels, signing a 59-byte message. Ranking over the Verification speed
ratio. ref and neon results for Cortex-A72. kc-kilocycles.
1.9× faster and in Verify are 0.8× and 2.5× faster for the neon implementation at security
level 1. Our neon implementation of Hawk achieves a similar speed-up ratio as in the case
of avx2 implementations reported in Ducas et al. [DPPvW22] (Table 1).
7 Conclusions
Falcon is the only PQC digital signature scheme selected by NIST for standardization using
floating-point operations. Unless significant changes are introduced to Falcon, floating-
point instructions required in Key generation and Sign operations will continue to be a key
limitation of Falcon deployment. Additionally, the complexity of serial sampling and serial
hashing significantly reduces the performance of Falcon Key and Signature generation. We
demonstrate that Hawk, outperforms Dilithium and has faster Signature generation than
Falcon. Its performance and bandwidth may be interesting to the community.
In summary, we report the new speed record for Falcon Sign and Verify operations using
NEON-based instruction on Cortex-A72 and Apple M1 ARMv8 devices. We present a
comprehensive comparison in terms of performance and bandwidth for Falcon, CRYSTALS-
Dilithium, XMSS, and SPHINCS+ on both aforementioned devices. We believe that in
some constrained protocol scenarios, where bandwidth and verification performance matter,
Falcon is the better option than Dilithium, and lattice-based signatures are a far better
choice than hash-based signatures in terms of key size and efficiency.
Lastly, we present a 7% mismatch between the Fuse Multiply-Add instructions on
ARMv8 platforms. We recommend disabling the Fuse Multiply-Add instruction to guaran-
tee implementation correctness. Further security analysis of this behavior is needed.
Nguyen et al. 21
References
[AABCG20] Erdem Alkim, Yusuf Alper Bilgin, Murat Cenk, and François Gérard. Cortex-
M4 optimizations for {R,M} LWE schemes. TCHES, 2020(3):336–357, June
2020.
[AAC+ 22] Gorjan Alagic, Daniel Apon, David Cooper, Quynh Dang, Thinh Dang,
John Kelsey, Jacob Lichtinger, Yi-Kai Liu, Carl Miller, Dustin Moody, et al.
Status report on the third round of the nist post-quantum cryptography
standardization process. 2022.
[AHKS22] Amin Abdulrahman, Vincent Hwang, Matthias J. Kannwischer, and Daan
Sprenkels. Faster kyber and dilithium on the cortex-M4. pages 853–871,
2022.
[ANB+ 18] Marc Andrysco, Andres Nötzli, Fraser Brown, Ranjit Jhala, and Deian Ste-
fan. Towards verified, constant-time floating point operations. In Proceedings
of the 2018 ACM SIGSAC Conference on Computer and Communications
Security, CCS ’18, page 1369–1382, New York, NY, USA, 2018. Association
for Computing Machinery.
[BDH11] Johannes Buchmann, Erik Dahmen, and Andreas Hülsing. XMSS - A
Practical Forward Secure Signature Scheme Based on Minimal Security
Assumptions. In Bo-Yin Yang, editor, Post-Quantum Cryptography, volume
7071, pages 117–129. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[BDK+ 21] Shi Bai, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky,
Peter Schwabe, Gregor Seiler, and Damien Stehlé. CRYSTALS-Dilithium:
Algorithm Specifications and Supporting Documentation (Version 3.1). 2021.
[BGPS21] Huck Bennett, Atul Ganju, Pura Peetathawatchai, and Noah Stephens-
Davidowitz. Just how hard are rotations of Zn ? algorithms and cryptography
with the simplest lattice. Cryptology ePrint Archive, Report 2021/1548,
2021. https://eprint.iacr.org/2021/1548.
[BHK+ 19] Daniel J. Bernstein, Andreas Hülsing, Stefan Kölbl, Ruben Niederhagen,
Joost Rijneveld, and Peter Schwabe. The SPHINCS+ Signature Framework.
In Proceedings of the 2019 ACM SIGSAC Conference on Computer and
Communications Security, November 2019.
[BHK+ 21] Hanno Becker, Vincent Hwang, Matthias J. Kannwischer, Bo-Yin Yang,
and Shang-Yi Yang. Neon ntt: Faster dilithium, kyber, and saber on
cortex-a72 and apple m1. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2022(1):221–244, Nov. 2021.
[BK22] Hanno Becker and Matthias J. Kannwischer. Hybrid scalar/vector imple-
mentations of keccak and SPHINCS+ on AArch64. Cryptology ePrint
Archive, Report 2022/1243, 2022. https://eprint.iacr.org/2022/1243.
[BKS19] Leon Botros, Matthias J. Kannwischer, and Peter Schwabe. Memory-
Efficient High-Speed Implementation of Kyber on Cortex-M4. Technical
Report 489, 2019.
[BMK+ 22] Hanno Becker, Jose Maria Bermudo Mera, Angshuman Karmakar, Joseph
Yiu, and Ingrid Verbauwhede. Polynomial multiplication on embedded
vector architectures. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2022.
22 Fast Falcon Signature Generation and Verification Using ARMv8 NEON Instructions
[BMTR22] Nina Bindel, Sarah McCarthy, Geoff Twardokus, and Hanif Rahbari. Drive
(quantum) safe! – towards post-quantum security for v2v communications.
Cryptology ePrint Archive, Paper 2022/483, 2022.
[BWC13] Anthony M Blake, Ian H Witten, and Michael J Cree. The fastest fourier
transform in the south. IEEE transactions on signal processing, 2013.
[CAD+ 20] David A Cooper, Daniel C Apon, Quynh H Dang, Michael S Davidson,
Morris J Dworkin, Carl A Miller, et al. Recommendation for stateful
hash-based signature schemes. NIST Special Publication, 800:208, 2020.
[CHK+ 21] Chi-Ming Marvin Chung, Vincent Hwang, Matthias J. Kannwischer, Gregor
Seiler, Cheng-Jhih Shih, and Bo-Yin Yang. NTT Multiplication for NTT-
unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4
and AVX2. TCHES, pages 159–188, February 2021.
[CJL+ 16] Lily Chen, Stephen Jordan, Yi-Kai Liu, Dustin Moody, Rene Peralta, Ray
Perlner, and Daniel Smith-Tone. Report on Post-Quantum Cryptogra-
phy. Technical Report NIST IR 8105, National Institute of Standards and
Technology, April 2016.
[CT65] James W Cooley and John W Tukey. An Algorithm for the Machine
Calculation of Complex Fourier Series. Mathematics of computation, 1965.
[DPPvW22] Léo Ducas, Eamonn W. Postlethwaite, Ludo N. Pulles, and Wessel van
Woerden. Hawk: Module LIP makes lattice signatures fast, compact and
simple. Cryptology ePrint Archive, Report 2022/1155, 2022. https://
eprint.iacr.org/2022/1155.
[DvW22] Léo Ducas and Wessel P. J. van Woerden. On the lattice isomorphism
problem, quadratic forms, remarkable lattices, and cryptography. pages
643–673, 2022.
[FHK+ 20] Pierre-Alain Fouque, Jeffrey Hoffstein, Paul Kirchner, Vadim Lyubashevsky,
Thomas Pornin, Thomas Prest, Thomas Ricosset, Gregor Seiler, William
Whyte, and Zhenfei Zhang. Falcon: Fast-Fourier Lattice-based Compact
Signatures over NTRU: Specifications v1.2, 2020.
[FJ12] Matteo Frigo and Steven G Johnson. Fftw: Fastest fourier transform in the
west. Astrophysics Source Code Library, pages ascl–1201, 2012.
[HBG+ 18] Andreas Huelsing, Denis Butin, Stefan-Lukas Gazdag, Joost Rijneveld, and
Aziz Mohaisen. XMSS: eXtended Merkle Signature Scheme. RFC 8391,
May 2018.
[HW22] James Howe and Bas Westerbaan. Benchmarking and analysing the nist pqc
finalist lattice-based signature schemes on the arm cortex m7. Cryptology
ePrint Archive, Paper 2022/405, 2022.
[JAMK+ 19] Amir Jalali, Reza Azarderakhsh, Mehran Mozaffari Kermani, Matthew
Campagna, and David Jao. Armv8 sike: Optimized supersingular isogeny
key encapsulation on armv8 processors. IEEE Transactions on Circuits and
Systems I: Regular Papers, 66(11):4209–4218, 2019.
[KBMSRV18] Angshuman Karmakar, Jose Maria Bermudo Mera, Sujoy Sinha Roy, and
Ingrid Verbauwhede. Saber on ARM. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2018(3):243–266, August 2018.
[KJK+ 21] Hyeokdong Kwon, Kyungbae Jang, Hyunjun Kim, Hyunji Kim, Minjoo
Sim, Siwoo Eum, Wai-Kong Lee, and Hwajeong Seo. Armed frodo. In
Information Security Applications, pages 206–217, Cham, 2021. Springer
International Publishing.
[KSS22] Youngbeom Kim, Jingyo Song, and Seog Chung Seo. Accelerating falcon
on armv8. IEEE Access, 10:44446–44460, 2022.
[LJZ+ 20] Zhihao Li, Haipeng Jia, Yunquan Zhang, Tun Chen, Liang Yuan, and
Richard Vuduc. Automatic generation of high-performance fft kernels on
arm and x86 cpus. IEEE Transactions on Parallel and Distributed Systems,
31(8):1925–1941, 2020.
[MCF19] David McGrew, Michael Curcio, and Scott Fluhrer. Leighton-micali hash-
based signatures. Technical report, 2019.
[NG21a] Duc Tri Nguyen and Kris Gaj. Fast neon-based multiplication for lattice-
based nist post-quantum cryptography finalists. In Post-Quantum Cryptog-
raphy, pages 234–254, Cham, 2021. Springer International Publishing.
[NG21b] Duc Tri Nguyen and Kris Gaj. Optimized software implementations of
crystals-kyber, ntru, and saber using neon-based special instructions of
armv8. In Proceedings of the NIST 3rd PQC Standardization Conference
(NIST PQC 2021), 2021.
A Pseudocode of Falcon
B Modular Multiplication
x · t′
′ (x · y) x (y · R)
z = (x · y) − q · =z−q· · =z−q· (4)
q R q R
where hi and lo are lower and upper part of 2w bits, each is w-bit respectively.
26 Fast Falcon Signature Generation and Verification Using ARMv8 NEON Instructions
re im re im re im
re im re im
re re re re
im im im im
im re im re
re im re im
Figure 1: Single pair complex multiplication using fmul, fcmla. Real and imagine points
are stored adjacently.
Figure 2: Two pairs complex multiplication using fmul, fmls, fmla. Real and imagine
points are stored separately.
Nguyen et al. 27
Forward FFT
Fold by 2 Fold by 4
Inverse FFT
applied
applied
conjugate rotation
Figure 3: Deriving full twiddle factor table by applying complex conjugate and rotation.
G Benchmark Keccak-f1600
Table 11: thash-sha256 and thash-shake speed comparison between the C, sha2
Crypto instruction and 2-ways sha3 instructions by Bas Westerbaan [Wes21] on Apple
M1, measured with 64 bytes input and 32 bytes output. Results are in cycles.