Parallel Accelerating Number Theoretic Transform for Bootstrapping on a Graphics Processing Unit
<p>CMux gate.</p> "> Figure 2
<p>Overview of the accelerating NTT on a GPU in a whole gate bootstrapping process.</p> "> Figure 3
<p>The running time of small-point NTT by three different methods.</p> "> Figure 4
<p>The running time of a 1024-point NTT achieved using different decomposition methods.</p> ">
Abstract
:1. Introduction
1.1. Motivation
1.2. Prior Work
1.3. Our Contributions
2. Materials and Methods
2.1. Preliminary
2.1.1. Notations
2.1.2. Number Theoretic Transform
Algorithm 1. Cooley–Tukey NTT |
Algorithm 2. Butterfly operation |
|
2.1.3. Ciphertexts in TFHE
2.1.4. Graphical Processing Unit (GPU)
- Streaming Multiprocessor (SM): It is the core component of the GPU hardware and serves as the basic control instruction unit with an independent instruction scheduling circuit. Multiple streaming processors exist within a single streaming multiprocessor, and all streaming processors share the same set of control instructions;
- Streaming Processor (SP): It is the fundamental processing unit where specific instructions and tasks are processed. A GPU performs parallel computation, which means numerous streaming processors are processing simultaneously.
- Kernel: In CUDA, you will execute the calculations you want to run on a GPU in a form similar to C/C++ functions, which are called kernels;
- Grid, Block, Thread: This is a hierarchical structure for thread allocation. When programming with CUDA, a grid is divided into multiple blocks, while a block is divided into multiple threads. The division is based on the btask characteristics and the hardware features of the GPU.
- Registers: They are the smallest storage units in a GPU, but they offer the fastest execution speed. Each thread is allocated private registers during execution, and other threads cannot read or write to these registers;
- Constant Memory: it is used to cache constant data utilized in the code executed on the streaming multiprocessor. We need to explicitly declare objects as constants in the code so that a GPU can cache and store them in the constant cache;
- Shared Memory: Each streaming multiprocessor also has a block of shared memory. It is a small, fast and low-latency on-chip programmable static random access memory (SRAM) used for sharing data among the thread blocks running on the SM;
- Global Memory: A GPU also has an off-chip global memory, which is a large-capacity and high-bandwidth dynamic random access memory (DRAM).
2.2. Parallel Accelerating NTT for Bootstrapping on a GPU
2.2.1. Initialization
2.2.2. Gadget Decomposition
2.2.3. Decomposition of an N-Point NTT
2.2.4. Thread Assignment
2.2.5. n-Point NTT
- 1
- Merging
- 2
- Expansion and logical left shift
- 3
- Turn back to 64-bit version
2.2.6. N-Point NTT
2.2.7. Multiplication of Polynomials
2.2.8. The Result of Bootstrapping
3. Results
4. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Abbreviations | Extension |
---|---|
TFHE | Fully homomorphic encryption over the Torus |
BGV | Brakerski–Gentry–Vaikuntanathan homomorphic encryption scheme [2] |
BFV | Brakerski–Fan Vercauteren homomorphic encryption scheme [3] |
CKKS | Cheon–Kim–Kim–Song homomorphic encryption scheme [4] |
NTT | Number theoretic transform |
INTT | Inverse number theoretic transform |
DFT | Discrete Fourier transform |
FFT | Fast Fourier transform |
GPU | Graphics processing unit |
CPU | Central processing unit |
CUDA | Compute unified device architecture |
FPGA | Field programmable gate array |
TRLWE * | A ciphertext for the bootstrapping input |
TLWE * | The scalar form of the TRLWE |
TRGSW * | A ciphertext for the bootstrapping key |
TGSW * | The scalar form TRGSW |
References
- Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing; Association for Computing Machinery: New York, NY, USA, 2009; Volume 9, pp. 169–178. [Google Scholar]
- Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) Fully Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory 2014, 6, 1–36. [Google Scholar] [CrossRef]
- Fan, J.; Vercauteren, F. Somewhat Practical Fully Homomorphic Encryption. Iacr Cryptology Eprint Archive. 2012, pp. 1–19. Available online: https://eprint.iacr.org/2012/144 (accessed on 11 October 2023).
- Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Proceedings of the 23rd International Conference on the Theory and Application of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Springer: Cham, Switzerland, 2017; pp. 409–437. [Google Scholar]
- Chillotti, I.; Gama, N.; Georgieva, M.; Izabachène, M. Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds. In Proceedings of the Advances in Cryptology–ASIACRYPT 2016: 22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, 4–8 December 2016; pp. 3–33. [Google Scholar]
- Bianchi, T.; Piva, A.; Barni, M. On the implementation of the discrete Fourier transform in the encrypted domain. IEEE Trans. Inf. Forensics Secur. 2009, 4, 86–97. [Google Scholar] [CrossRef]
- Wang, W.; Hu, Y.; Chen, L.; Huang, X.; Sunar, B. Accelerating fully homomorphic encryption using GPU. In Proceedings of the 2012 IEEE Conference on High Performance Extreme Computing, Waltham, MA, USA, 10–12 September 2012; pp. 1–5. [Google Scholar]
- Lee, Y.; Micciancio, D.; Kim, A.; Choi, R.; Deryabin, M.; Eom, J.; Yoo, D. Efficient fhew bootstrapping with small evaluation keys, and applications to threshold homomorphic encryption. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, Lyon, France, 23–27 April 2023; pp. 227–256. [Google Scholar]
- Xiang, B.; Zhang, J.; Deng, Y.; Dai, Y.; Feng, D. Fast blind rotation for bootstrapping FHEs. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 20–24 August 2023; pp. 3–36. [Google Scholar]
- Xie, X.; Sun, L.; Huang, X.M.; Han, S.F. Design and implementation of finite field NTT algorithm based on FPGA. Mod. Electron. Tech. 2019, 41, 1855–1860. [Google Scholar]
- Mert, A.C.; Öztürk, E.; Savaş, E. FPGA implementation of a run-time configurable NTT-based polynomial multiplication hardware. Microprocess. Microsyst. 2020, 78, 103219. [Google Scholar] [CrossRef]
- Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Kyber: A CCA-Secure Module-Lattice-Based KEM. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; pp. 353–367. [Google Scholar]
- Zhou, S.; Xue, H.; Zhang, D.; Wang, K.; Lu, X.; Li, B.; He, J. Preprocess-then-NTT technique and its applications to Kyberand NewHope. In Proceedings of the Information Security and Cryptology: 14th International Conference, Fuzhou, China, 14–17 December 2018; Inscrypt 2018. Springer: Cham, Switzerland, 2019; Volume 11449, pp. 117–137. [Google Scholar]
- Zhu, Y.; Liu, Z.; Pan, Y. When NTT Meets Karatsuba: Preprocess-then-NTT Technique Revisited. In Proceedings of the International Conference on Information and Communications Security, Chongqing, China, 17–19 September 2021; pp. 249–264. [Google Scholar]
- Wang, W.; Hu, Y.; Chen, L.; Huang, X.; Sunar, B. Exploring the Feasibility of Fully Homomorphic Encryption. IEEE Trans. Comput. 2015, 64, 698–706. [Google Scholar] [CrossRef]
- Schönhage, A.; Strassen, V. Schnelle multiplication grosser Zahlen. Computing 1971, 7, 281–292. [Google Scholar] [CrossRef]
- Dai, W.; Sunar, B. cuHE: A homomorphic encryption accelerator library. In Proceedings of the Cryptography and Information Security in the Balkans: Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, 3–4 September 2015; pp. 169–186. [Google Scholar]
- Goey, J.Z.; Lee, W.K.; Goi, B.M.; Yap, W.S. Accelerating number theoretic transform in GPU platform for fully homomorphic encryption. J. Supercomput. 2021, 77, 1455–1474. [Google Scholar] [CrossRef]
- Lee, W.-K.; Akleylek, S.; Wong, D.C.-K.; Yap, W.-S.; Goi, B.-M.; Hwang, S.-O. Parallel implementation of Nussbaumer algorithm and number theoretic transform on a GPU platform: Application to qTESLA. J. Supercomput. 2021, 77, 3289–3314. [Google Scholar] [CrossRef]
- Nussbaumer, H. Fast polynomial transform algorithms for digital convolution. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 205–215. [Google Scholar] [CrossRef]
- Kim, S.; Jung, W.; Park, J.; Ahn, J.H. Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs. In Proceedings of the 2020 IEEE International Symposium on Workload Characterization (IISWC), Beijing, China, 27–30 October 2020; pp. 264–275. [Google Scholar]
- Özerk, Ö.; Elgezen, C.; Mert, A.C.; Öztürk, E.; Savaş, E. Efficient number theoretic transform implementation on GPU for homomorphic encryption. J. Supercomput. 2022, 78, 2840–2872. [Google Scholar] [CrossRef]
- Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
- CUDA-Accelerated Fully Homomorphic Encryption Library. Available online: https://github.com/vernamlab/cuFHE (accessed on 21 June 2023).
- Owens, J.D.; Houston, M.; Luebke, D.; Green, S.; Stone, J.E.; Phillips, J.C. GPU computing. Proc. IEEE 2008, 96, 879–899. [Google Scholar] [CrossRef]
- Sanders, J.; Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming; Addison-Wesley Professional: Boston, MA, USA, 2010. [Google Scholar]
- Solinas, J. Generalized Mersenne Numbers; Tech. Rep. 06/MI/006; Blekinge College Technology: Karlskrona, Sweden, 1999. [Google Scholar]
- Wang, W.; Huang, X.; Emmart, N.; Weems, C. VLSI design of a large-number Multiplier for fully homomorphic encryption. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2014, 22, 1879–1887. [Google Scholar] [CrossRef]
4-Point NTT | 8-Point NTT | 16-Point NTT | 32-Point NTT | |
---|---|---|---|---|
Our scheme | 0.72 | 1.37 | 2.93 | 5.28 |
Our scheme (no merging) | 1.18 | 2.33 | 4.78 | 9.42 |
Butterfly algorithm [27] | 0.98 | 2.49 | 7.11 | 19.52 |
Decomposition Method | 4 × 4 × 4 × 4 × 4 | 8 × 8 × 16 | 16 × 16 × 4 | 32 × 32 |
---|---|---|---|---|
Running time (ms) | 7.60 | 6.03 | 7.24 | 11.45 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, H.; Pan, D.; Li, J.; Wang, H. Parallel Accelerating Number Theoretic Transform for Bootstrapping on a Graphics Processing Unit. Mathematics 2024, 12, 458. https://doi.org/10.3390/math12030458
Li H, Pan D, Li J, Wang H. Parallel Accelerating Number Theoretic Transform for Bootstrapping on a Graphics Processing Unit. Mathematics. 2024; 12(3):458. https://doi.org/10.3390/math12030458
Chicago/Turabian StyleLi, Huixian, Deng Pan, Jinglei Li, and Hao Wang. 2024. "Parallel Accelerating Number Theoretic Transform for Bootstrapping on a Graphics Processing Unit" Mathematics 12, no. 3: 458. https://doi.org/10.3390/math12030458