A scalable multi-TeraOPS deep learning processor core for AI training and inference
A multi-TOPS AI core is presented for acceleration of deep learning training and inference in
systems from edge devices to data centers. With a programmable architecture and custom …
RaPiD: AI accelerator for ultra-low precision training and inference
…, M Scheuermann, J Silberman… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
The growing prevalence and computational demands of Artificial Intelligence (AI) workloads
have led to widespread use of hardware accelerators in their execution. Scaling the …
9.1 A 7nm 4-core AI chip with 25.6 TFLOPS hybrid FP8 training, 102.4 TOPS INT4 inference and workload-aware throttling
Low-precision computation is the key enabling factor to achieve high compute densities (TOPS/W
and TOPS/mm²) in AI hardware accelerators across cloud and edge platforms. …
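The INT4 inference figure in the title refers to running matmuls on 4-bit integer codes. As a rough sketch of the general technique (symmetric per-tensor quantization; an assumption for illustration, not the chip's actual scheme):

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Symmetric per-tensor INT4 quantization (illustrative only).

    Maps floats onto the 16 integer codes [-8, 7]; matmuls then run on the
    4-bit codes, which is what raises TOPS relative to wider datapaths.
    Assumes x is not all zeros.
    """
    scale = float(np.max(np.abs(x))) / 7.0   # 7 = largest positive INT4 code
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_int4(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

x = np.array([0.31, -1.42, 0.07, 0.95], dtype=np.float32)
codes, scale = quantize_int4(x)
print(codes, dequantize_int4(codes, scale))
```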
Efficient AI system design with cross-layer approximate computing
…, M Schaal, M Serrano, J Silberman… - Proceedings of the …, 2020 - ieeexplore.ieee.org
Advances in deep neural networks (DNNs) and the availability of massive real-world data
have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive …
A 1.0-GHz single-issue 64-bit PowerPC integer processor
J Silberman, N Aoki, D Boerstler… - IEEE Journal of Solid …, 1998 - ieeexplore.ieee.org
The organization and circuit design of a 1.0 GHz integer processor built in 0.25 µm
CMOS technology are presented. A microarchitecture emphasizing parallel computation with a …
A 3.0 TFLOPS 0.62 V scalable processor core for high compute utilization AI training and inference
A processor core is presented for AI training and inference products. Leading-edge compute
efficiency is achieved for robust fp16 training via efficient heterogeneous 2-D systolic array-…
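The "2-D systolic array" named in the snippet is a grid of multiply-accumulate elements through which operands are streamed. The toy model below illustrates the general dataflow, skewed operand streaming into an output-stationary PE grid; it is a sketch of the technique, not a description of this core's design:

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle toy model of an output-stationary 2-D systolic array.

    An M x N grid of processing elements each owns one C[i, j] accumulator.
    Rows of A stream in from the left and columns of B from the top, each
    skewed by one cycle, so a[i, k] and b[k, j] meet at PE (i, j) on cycle
    i + j + k and are multiply-accumulated there.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for t in range(M + N + K - 2):           # cycles until the array drains
        for i in range(M):
            for j in range(N):
                k = t - i - j                # operand pair arriving this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6, dtype=np.float32).reshape(2, 3)
B = np.arange(12, dtype=np.float32).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```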
A scalable multi-TeraOPS core for AI training and inference
…, B Fleischer, M Ziegler, J Silberman… - IEEE Solid-State …, 2019 - ieeexplore.ieee.org
This letter presents a multi-TOPS AI accelerator core for deep learning training and inference.
With a programmable architecture and custom ISA, this engine achieves >90% sustained …
A 7-nm four-core mixed-precision AI chip with 26.2-TFLOPS hybrid-FP8 training, 104.9-TOPS INT4 inference, and workload-aware throttling
Reduced precision computation is a key enabling factor for energy-efficient acceleration of
deep learning (DL) applications. This article presents a 7-nm four-core mixed-precision …
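"Hybrid FP8" pairs two 8-bit floating-point encodings, commonly a wider-mantissa format for weights and activations and a wider-exponent format for gradients; the exact bit splits below are an assumption for illustration, not taken from the article. A minimal round-to-nearest quantizer for a parameterized FP8 format:

```python
import math

def quantize_fp8(x: float, exp_bits: int = 4, man_bits: int = 3, bias: int = 7) -> float:
    """Round a float to the nearest value of a generic 8-bit FP format.

    exp_bits/man_bits/bias are parameters of an assumed IEEE-like layout,
    not the article's exact encodings. Handles normals, subnormals, and
    clamping to the format's largest finite value.
    """
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    sign = -1.0 if x < 0.0 else 1.0
    mag = abs(x)
    e_min = 1 - bias                          # lowest normal exponent
    e_max = (1 << exp_bits) - 2 - bias        # highest normal exponent
    e = max(min(math.floor(math.log2(mag)), e_max), e_min)
    step = 2.0 ** (e - man_bits)              # spacing of the mantissa grid
    q = round(mag / step) * step              # round to nearest, ties to even
    largest = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    return sign * min(q, largest)

# hybrid use: wider mantissa for weights/activations, wider exponent range
# (and thus a shorter mantissa) for gradients
w8 = quantize_fp8(0.1375)                                  # -> 0.140625
g8 = quantize_fp8(3.2e-5, exp_bits=5, man_bits=2, bias=15) # -> ~3.05e-5
```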
470 ps 64-bit parallel binary adder [for CPU chip]
J Park, HC Ngo, JA Silberman… - 2000 Symposium on …, 2000 - ieeexplore.ieee.org
This paper presents a fast 64-bit parallel carry look-ahead binary adder implemented in a 1
GHz research prototype 64-bit PowerPC microprocessor. Efficient use of dynamic compound …
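Carry lookahead replaces the serial carry ripple with carries computed directly from per-bit generate/propagate signals. The software model below illustrates the arithmetic idea only; the paper's contribution is the dynamic compound-gate circuit implementation, which a bit-level model cannot capture:

```python
def cla_add(a: int, b: int, width: int = 64) -> int:
    """64-bit addition with explicitly flattened carry lookahead.

    g[i] = a[i] & b[i]: bit i generates a carry.
    p[i] = a[i] ^ b[i]: bit i propagates an incoming carry.
    Unrolling c[i+1] = g[i] | (p[i] & c[i]) expresses each carry as a
    two-level AND-OR of g/p terms, so hardware can form all carries in
    parallel rather than rippling them bit by bit.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a ^ b) >> i) & 1 for i in range(width)]
    carry = [0] * (width + 1)               # carry[0] = carry-in = 0
    for i in range(width):
        c = g[i]
        for j in range(i):                  # g[j] propagated through p[j+1..i]
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        carry[i + 1] = c
    return sum((p[i] ^ carry[i]) << i for i in range(width)) & mask

assert cla_add(0x0123456789ABCDEF, 0x1111111111111111) == \
       (0x0123456789ABCDEF + 0x1111111111111111) & ((1 << 64) - 1)
assert cla_add((1 << 64) - 1, 1) == 0       # wraps around at 64 bits
```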
A 3D system prototype of an eDRAM cache stacked over processor-like logic using through-silicon vias
M Wordeman, J Silberman, G Maier… - … Solid-State Circuits …, 2012 - ieeexplore.ieee.org
3D integration (3DI) holds promise for improved performance of integrated systems by
increasing interconnect bandwidth [1]. A processor stacked with cache memory is one potential …