11 TOPS Photonic Convolutional Accelerator For Optical Neural Networks - RM - Nature
https://doi.org/10.1038/s41586-020-03063-0

Received: 15 April 2020
Accepted: 20 October 2020

Xingyuan Xu1,9, Mengxi Tan1, Bill Corcoran2, Jiayang Wu1, Andreas Boes3, Thach G. Nguyen3, Sai T. Chu4, Brent E. Little5, Damien G. Hicks1,6, Roberto Morandotti7,8, Arnan Mitchell3 & David J. Moss1 ✉
Artificial neural networks are collections of nodes with weighted connections that, with proper feedback to adjust the network parameters, can 'learn' and perform complex operations for facial recognition, speech translation, playing strategy games and medical diagnosis1–4. Whereas classical fully connected feedforward networks face challenges in processing extremely high-dimensional data, convolutional neural networks (CNNs), inspired by the (biological) behaviour of the visual cortex system, can abstract the representations of input data in their raw form, and then predict their properties with both unprecedented accuracy and greatly reduced parametric complexity5. CNNs have been widely applied to computer vision, natural language processing and other areas6,7.

The capability of neural networks is dictated by the computing power of the underlying neuromorphic hardware. Optical neural networks (ONNs)8–12 are promising candidates for next-generation neuromorphic computation, because they have the potential to overcome some of the bandwidth bottlenecks of their electrical counterparts6,13–15, such as for interconnections16, and achieve ultrahigh computing speeds enabled by the >10-THz-wide optical telecommunications band8. Operating in analogue frameworks, ONNs avoid the limitations imposed by the energy and time consumed during reading and moving data back and forth for storage, known as the von Neumann bottleneck13. Important progress has been made in highly parallel, high-speed and trainable ONNs8–12,17–22, including approaches that have the potential for full integration on a single photonic chip8,12, in turn offering an ultrahigh computational density. However, there remain opportunities for substantial improvements in ONNs. Processing large-scale data, as needed for practical real-life computer vision tasks, remains challenging for ONNs because they are primarily fully connected structures and their input scale is determined solely by hardware parallelism. This leads to tradeoffs between the network scale and footprint. Moreover, ONNs have not achieved the extreme computing speeds that analogue photonics is capable of, given the very wide optical bandwidths that they can exploit.

Recently22, the concept of time–wavelength multiplexing for ONNs was introduced and applied to a single perceptron operating at 11 billion (10⁹) operations per second (giga-ops per second). Here, we demonstrate an optical convolutional accelerator (CA) to process and extract features from large-scale data, generating convolutions with multiple, simultaneous, parallel kernels. By interleaving wavelength, temporal and spatial dimensions using an integrated Kerr microcomb source23–32, we achieve a vector computing speed as high as 11.322 TOPS. We then use it to process 250,000-pixel images, at a matrix processing speed of 3.8 TOPS.

The CA is scalable and dynamically reconfigurable. We use the same hardware to form both a CA front end and a fully connected neuron layer, and combine them to form an optical CNN. The CNN performs
1Optical Sciences Centre, Swinburne University of Technology, Hawthorn, Victoria, Australia. 2Department of Electrical and Computer Systems Engineering, Monash University, Clayton, Victoria, Australia. 3School of Engineering, RMIT University, Melbourne, Victoria, Australia. 4Department of Physics, City University of Hong Kong, Tat Chee Avenue, Hong Kong, China. 5Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China. 6Bioinformatics Division, Walter & Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia. 7INRS-Énergie, Matériaux et Télécommunications, Varennes, Québec, Canada. 8Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. 9Present address: Electro-Photonics Laboratory, Department of Electrical and Computer Systems Engineering, Monash University, Clayton, Victoria, Australia. ✉e-mail: dmoss@swin.edu.au
simultaneous recognition of images from the MNIST handwritten digit dataset33, achieving an accuracy of 88%. Our ONN represents a major step towards realizing monolithically integrated ONNs and is enabled by our use of an integrated microcomb chip. Moreover, the scheme is stand-alone and universal, fully compatible with either electrical or optical interfaces. Hence, it can serve as a universal ultrahigh-bandwidth front end that extracts data features for any neuromorphic hardware (optical or electronic-based), bringing massive-data machine learning for both real-time and ultrahigh-bandwidth data within reach.

Principle of operation

The photonic vector convolutional accelerator (VCA, Fig. 1) features high-speed electrical signal ports for data input and output. The input data vector X is encoded as the intensity of temporal symbols in a serial electrical waveform at a symbol rate 1/τ (baud), where τ is the symbol period. The convolutional kernel is represented by a weight vector W of length R that is encoded in the optical power of the microcomb lines via spectral shaping by a waveshaper (see Methods). The temporal waveform X is then multi-cast onto the kernel wavelength channels via electro-optical modulation, generating the replicas weighted by W. The optical waveform is then transmitted through a dispersive delay with a delay step (between adjacent wavelengths) equal to the symbol duration of X, effectively achieving time and wavelength interleaving. Finally, the delayed and weighted replicas are summed via high-speed photodetection so that each time slot yields a convolution between X and W for a given convolution window, or receptive field.

As such, the convolution window effectively slides at the modulation speed matching the baud rate of X. Each output symbol is the result of R multiply-accumulate (MAC) operations, with the computing speed given by 2R/τ TOPS. Since the speed of this process scales with both the baud rate and the number of wavelengths, the massively parallel number of wavelengths from the microcomb yields speeds of many TOPS. Moreover, the length of the input data X is theoretically unlimited, so the CA can process data with an arbitrarily large scale; the only practical limitation is the external electronics.

Simultaneous convolution with multiple kernels is achieved by adding sub-bands of R wavelengths for each kernel. Following multicasting and dispersive delay, the sub-bands (kernels) are demultiplexed and detected separately, generating electronic waveforms for each kernel. The VCA is fully reconfigurable and scalable: the number and

Fig. 1 | Operation principle of the TOPS photonic CA. EOM, electro-optical Mach–Zehnder modulator; SMF, standard single mode fibre for telecommunications; PD, photodetector.

The CA processes vectors, which is extremely useful for human speech recognition or radio-frequency signal processing, for example. However, it can easily be applied to matrices for image processing by flattening the matrix into a vector. The precise way that this is performed is governed by the kernel size, which determines both the sliding convolution window's stride and the equivalent matrix computing speed. In our case the 3 × 3 kernel reduces the speed by a factor of 3, but we outline straightforward methods to avoid this (see Supplementary Information, including Supplementary Figs. 1–30 and Supplementary Tables 1, 2).

Fig. 2 shows the matrix version used to process a classic 500 × 500 image, adopted from the University of Southern California-Signal and Image Processing Institute (USC-SIPI) database (http://sipi.usc.edu/database/). The system performs simultaneous image convolutions with ten 3 × 3 kernels. The weight matrices for all kernels were flattened into a composite kernel vector W containing all 90 weights (10 kernels with 3 × 3 = 9 weights each), which were then encoded onto the optical power of 90 microcomb lines by an optical spectral shaper (the waveshaper), with each kernel occupying its own band of 9 wavelengths. The wavelengths were supplied by a soliton crystal microcomb with a spacing of about 48.9 GHz (refs. 22–24,30,32), with the 90 wavelengths occupying 36 nm across the C-band (see Extended Data Fig. 2 and Methods).

Figure 3 shows the image processing results. The 500 × 500 input image was flattened electronically into a vector X and encoded as the intensities of 250,000 temporal symbols, with a resolution of 8 bits per symbol (see Supplementary Information for a discussion on the effective number of bits), to form the electrical input waveform via a high-speed electrical digital-to-analogue converter, at a data rate of 62.9 gigabaud (with time slot τ = 15.9 ps; Fig. 3b). The duration of each image for all 10 kernels (3.975 μs) equates to a processing rate of 1/3.975 μs, or 0.25 million ultralarge-scale images per second.

The input waveform X was then multi-cast onto 90 shaped comb lines via electro-optical modulation, yielding replicas weighted by the kernel vector W. The waveform was then transmitted through around 2.2 km of standard single mode fibre (dispersion about 17 ps nm⁻¹ km⁻¹) such that the relative temporal shift between the adjacent weighted wavelength replicas had a progressive delay of 15.9 ps, matching the data symbol duration τ. This resulted in time and wavelength interleaving for all 10 kernels. The 90 wavelengths were then de-multiplexed into 10 sub-bands of 9 wavelengths, with each sub-band corresponding to a kernel, and separately detected by 10 high-speed photodetectors. The detection process effectively summed the aligned symbols of the replicas (the electrical output waveform of kernel 4 is shown in Fig. 3c). The 10 electrical waveforms were converted into digital signals via analogue-to-digital converters and resampled so that each time slot of each individual waveform (wavelength) corresponded to a dot product between one of the convolutional kernel matrices and the input image within a sliding window (that is, receptive field). This effectively achieved convolutions between the 10 kernels and the raw input image. The resulting waveforms thus yielded the 10 feature maps (convolutional matrix outputs) containing the extracted hierarchical features of the input image (Fig. 3d, Supplementary Information).

The VCA makes full use of time, wavelength and spatial multiplexing, where the convolution window effectively slides across the input vector X at a speed equal to the modulation baud rate of 62.9 billion symbols per second. Each output symbol is the result of 9 (the length of each kernel) MAC operations, and so the core vector computing speed (that is, the throughput) of each kernel is 2 × 9 × 62.9 = 1.13 TOPS. For 10 kernels the total computing speed of the VCA is therefore 10 × 1.13 = 11.3 TOPS.
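The delay-and-sum principle described above can be checked numerically. The sketch below (our NumPy illustration, not the authors' code) models each comb line as a kernel-weighted replica of X delayed by one symbol slot per line, sums the replicas as a photodetector would, and confirms that the result equals a discrete convolution. It also evaluates the 2R/τ operation count for the 9-tap, 62.9-gigabaud case discussed in the text.

```python
import numpy as np

def vca_convolve(X, W):
    """Delay-and-sum model of the photonic vector convolutional accelerator:
    wavelength k carries a replica of X weighted by W[k], delayed k slots;
    the photodetector sums all wavelengths within each time slot."""
    R, L = len(W), len(X)
    out = np.zeros(L + R - 1)
    for k, w in enumerate(W):        # one comb line per kernel tap
        out[k:k + L] += w * X        # weighted replica, delayed by k slots
    return out

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # serial input symbols
W = np.array([0.5, -1.0, 0.25])           # kernel encoded on 3 comb lines

# Each output slot is a full sliding-window dot product, i.e. a convolution
assert np.allclose(vca_convolve(X, W), np.convolve(W, X))

# Computing speed: R multiply-accumulates (2R operations) per symbol slot tau
R, tau = 9, 15.9e-12                      # 9 taps at 62.9 gigabaud
print(2 * R / tau / 1e12)                 # ~1.13 TOPS per kernel
```

The same delay-and-sum loop is reused in the multi-kernel and fully connected sketches later in the text, since the hardware is identical in all three cases.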
Fig. 2 | Image processing. The experimental setup (right panel), the optical and electronic control and signal flow (middle panel), and the corresponding processing flow of the raw input image (left panel) are shown. PUMP, continuous-wave pump laser; EDFA, erbium-doped fibre amplifier; MRR, micro-ring resonator. DAC, digital-to-analogue converter.
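The sub-band arrangement above (90 comb lines carrying 10 kernels of 9 taps each, all sharing one input waveform and one dispersive delay, with one photodetector per sub-band) can be sketched numerically as follows. This is our illustration of the scheme, not the authors' implementation:

```python
import numpy as np

def delay_and_sum(X, W):
    """Sum of kernel-weighted replicas of X, replica k delayed by k slots."""
    out = np.zeros(len(X) + len(W) - 1)
    for k, w in enumerate(W):
        out[k:k + len(X)] += w * X
    return out

rng = np.random.default_rng(1)
X = rng.random(64)               # shared serial input waveform
kernels = rng.random((10, 9))    # 10 kernels x 9 taps = 90 comb lines

# After the shared dispersive delay, each 9-wavelength sub-band is
# demultiplexed and detected on its own photodetector, so a single pass
# of the input yields 10 convolutions in parallel.
outputs = [delay_and_sum(X, Wk) for Wk in kernels]
for Wk, Yk in zip(kernels, outputs):
    assert np.allclose(Yk, np.convolve(Wk, X))
```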
Fig. 3 | Experimental results of the image processing. a, The kernel weights (tap weights) and the shaped microcomb's optical spectrum. b, The input electrical waveform of the image (the grey and blue lines show the ideal and experimentally generated waveforms, respectively). c, The convolved results of the fourth kernel that performs a top Sobel image processing function, a gradient-based method that looks for strong changes in the first derivative of an image (the grey and red lines show the ideal and experimentally generated waveforms, respectively). d, The weight matrices of the kernels and corresponding recovered images. RMSE, root-mean-square error.

Kernel weight matrices (Fig. 3d):
Kernel 1: [0 0 0; 0 1 0; 0 0 0]      Kernel 2: [1 2 1; 2 4 2; 1 2 1]
Kernel 3: [-1 -2 -1; 0 0 0; 1 2 1]   Kernel 4: [1 2 1; 0 0 0; -1 -2 -1]
Kernel 5: [1 0 -1; 2 0 -2; 1 0 -1]   Kernel 6: [-1 0 1; -2 0 2; -1 0 1]
Kernel 7: [-1 -1 0; -1 0 1; 0 1 1]   Kernel 8: [-1 -1 -1; -1 8 -1; -1 -1 -1]
Kernel 9: [0 -1 0; -1 5 -1; 0 -1 0]  Kernel 10: [1 0 0; 0 1 0; 0 0 1]
(Measured RMSE for the kernel 4 output waveform: 0.037689.)
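One way to see how a 1D convolution over a flattened image realizes a 3 × 3 kernel, with the vertical stride of 3 and the 82,668-pixel feature maps quoted in the text, is to flatten each 3-row band of the image column by column, so that every 9 consecutive symbols form a complete 3 × 3 receptive field. The exact slicing order used in the experiment is not spelled out here, so the band-wise, column-major flattening below is our assumption:

```python
import numpy as np

H = W_img = 500
K = 3
rng = np.random.default_rng(0)
img = rng.random((H, W_img))
kernel = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])  # top Sobel (kernel 4)

rows_out = H // K                 # 166 bands -> vertical stride of 3
cols_out = W_img - K + 1          # 498 horizontal window positions
print(rows_out * cols_out)        # -> 82668 pixels per feature map

# Check one band: the 1D sliding dot product over the flattened band
# equals the 2D correlation of that band with the kernel.
band = img[:K, :]                             # first 3-row band
x = band.flatten(order='F')                   # column-major flattening
w = kernel.flatten(order='F')
out_1d = np.array([x[3*j:3*j + 9] @ w for j in range(cols_out)])
out_2d = np.array([(band[:, j:j + K] * kernel).sum() for j in range(cols_out)])
assert np.allclose(out_1d, out_2d)
```

Because only every third vertical position yields a valid 2D receptive field under this flattening, the equivalent matrix speed drops by the kernel size, consistent with the "overhead of 3" discussed in the text.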
Fig. 4 | Experimental schematic of the optical CNN. Left side is the input front-end CA while the right side is the fully connected layer, both of which form the deep learning optical CNN. The microcomb source supplies the wavelengths for both the TOPS photonic CA as well as the fully connected layer systems. The electronic digital signal processing (DSP) module used for sampling and pooling and so on is external to this structure.
(30 × 30 greyscale matrices with 8-bit resolution) were flattened into vectors and multiplexed in the time domain at 11.9 gigabaud (time slot τ = 84 ps). Three 5 × 5 kernels were used, requiring 75 microcomb lines (Fig. 5) and resulting in a vertical convolution stride of 5. The dispersive delay was achieved with around 13 km of standard single mode fibre for telecommunications to match the data baud rate. The wavelengths were de-multiplexed into the three kernels, which were detected by high-speed photodetectors and then sampled and nonlinearly scaled with digital electronics. This recovered the hierarchical feature maps, which were then pooled electronically and flattened into a vector XFC (72 × 1) for each image, forming the input data to the fully connected layer.

The fully connected layer had 10 neurons, one for each of the 10 categories of handwritten digits (0 to 9), with the synaptic weights of the lth neuron (l ∈ [1, 10]) represented by a 72 × 1 weight matrix WFC(l). The number of comb lines (72) matched the length of the flattened feature map vector XFC. The shaped optical spectrum at the lth port had an optical power distribution proportional to the weight vector WFC(l), serving as the optical input for the lth neuron. After being multicast onto 72 wavelengths and progressively delayed, the optical signal was weighted and demultiplexed with a single waveshaper into 10 spatial output ports, each corresponding to a neuron. Since this part of the network involved linear processing, the kernel wavelength weighting could be implemented either before electro-optical modulation or later, that is, just before photodetection. The advantage of the latter is that both demultiplexing and weighting can be achieved with a single waveshaper. Finally, the different node/neuron outputs were obtained by sampling the 73rd symbol of the convolved results. The final output of the optical CNN was represented by the intensities of the output neurons (Extended Data Fig. 4), where the highest intensity for each tested image corresponded to the predicted category. The peripheral systems, including signal sampling, the nonlinear function and pooling, were implemented electronically with digital signal processing hardware, although many of these functions (for example, pooling) can be achieved optically (for example, with the VCA). Supervised network training was performed offline electronically (see Supplementary Information).

We first experimentally tested 50 images of the handwritten digit MNIST dataset33, followed by more extensive testing on 500 images (see Supplementary Information for the 500-image results). The confusion matrix for 50 images (Fig. 6) shows an accuracy of 88% for the generated predictions, in contrast to 90% for the numerical results calculated on an electrical digital computer. The corresponding results for 500 images are essentially the same: 89.6% for theory versus 87.6% for experiment (Supplementary Fig. 25). The fact that the CNN achieved close to the theoretical accuracy indicates that the impact of effects that could limit the network performance and reduce the effective number of bits (see Supplementary Information), such as electrical and optical noise or optical distortion (owing to high-order dispersion), is small.

The computing speed of the VCA front end of the optical CNN was 2 × 75 × 11.9 = 1.785 TOPS. For processing the image matrices with 5 × 5 kernels, the convolutional layer had a matrix-flattening overhead of 5, yielding an image computing speed of 1.785/5 = 357 billion operations per second. The computing speed of the fully connected layer was 119.8 billion operations per second (see Supplementary Information). The waveform duration was 30 × 30 × 84 ps = 75.6 ns for each image, and so the convolutional layer processed images at a rate of 1/75.6 ns = 13.2 million handwritten digit images per second. The optical CNN supports online training, given that the dynamic reconfiguration response time of the optical spectral shaper used to establish the synapses is <500 ms, and even faster with integrated optical spectral shapers34.
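The fully connected layer reuses the same delay-and-sum hardware: each neuron's 72 weights sit on 72 comb lines, and the complete dot product WFC(l)·XFC appears at the complete-overlap symbol of the convolved output (the "73rd symbol" in the text's counting). The sketch below is our model of this; in particular, writing the weights in time-reversed order so that the overlap symbol equals a plain weighted sum, and the 0-indexed position of that symbol, are assumed implementation details rather than statements from the paper:

```python
import numpy as np

def delay_and_sum(X, W):
    """Sum of weighted replicas of X, replica k delayed by k slots."""
    out = np.zeros(len(X) + len(W) - 1)
    for k, w in enumerate(W):
        out[k:k + len(X)] += w * X
    return out

rng = np.random.default_rng(2)
X_fc = rng.random(72)                    # flattened, pooled feature maps
neurons = rng.standard_normal((10, 72))  # 10 neurons x 72 synaptic weights

scores = []
for W_fc in neurons:
    y = delay_and_sum(X_fc, W_fc[::-1])  # weights written in reversed order
    scores.append(y[71])                 # complete-overlap output symbol
# The sampled symbol of each neuron equals its full weighted sum
assert np.allclose(scores, neurons @ X_fc)
print(int(np.argmax(scores)))            # index of the predicted category
```

The highest-intensity neuron then gives the predicted digit, matching the read-out described in the text.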
Fig. 5 | Convolutional layer. Architecture and experimental results. The left panel shows the experimental setup. The right panel shows the experimental results of one of the convolutional kernels, showing the shaped microcomb's optical spectrum and the corresponding kernel weights (the blue and red lines denote the positive and negative synaptic weights, respectively), the input electrical waveform for the digit 3 (middle: the grey and yellow lines show the ideal and experimentally generated waveforms, respectively), the convolved results and the corresponding feature maps. CW pump, continuous-wave pump laser. OSS, optical spectral shaper. WDM, wavelength de-multiplexer. FC, fully connected.
Although handwritten digit recognition is a common benchmark for digital hardware, it is still largely beyond current analogue reconfigurable ONNs. Digit recognition requires many physical parallel paths for fully connected networks (for example, a hidden layer with 10 neurons requires 9,000 physical paths), which represents a huge challenge for nanofabrication. Our CNN represents the first reconfigurable and integrable ONN capable not only of performing high-level complex tasks such as full handwritten digit recognition, but of doing so at many TOPS.

Discussion

Although the performance of ONNs is not yet competitive with leading-edge electronic processors at >200 TOPS (for example, the Google TPU15 and other chips13,14,35), there are straightforward approaches towards increasing our performance in both scale and speed (see Supplementary Information). Further, with a single processor speed of 11.3 TOPS, our VCA is approaching this range. The CA is fundamentally limited in data size only by the electrical digital-to-analogue converter memory, and processing 4K-resolution (4,096 × 2,160 pixels) images at >7,000 frames per second is possible.

The 720 synapses of the CNN (72 wavelengths, or synapses, per neuron; 10 neurons), a substantial increase for optical networks12, enabled us to classify the MNIST dataset33. Nonetheless, further scaling is needed to increase the theoretical prediction accuracy from 90% to that of state-of-the-art electronics, typically substantially greater than 95% (see Supplementary Information). Both the CA and CNN can be scaled substantially in size and speed using only off-the-shelf telecommunications components. The full S, C and L telecommunications bands (1,460–1,620 nm, >20 THz) would allow more than 400 channels (at a 50-GHz spacing), with the further ability to use polarization and spatial dimensions, ultimately enabling speeds beyond a quadrillion (10¹⁵) operations per second (peta-ops per second) with more than 24,000 synapses for the CNN (see Supplementary Information). Supplementary Fig. 30 shows theoretical results for a scaled network that achieves an accuracy of 94.3%. This can, in principle, be further increased to achieve accuracies comparable to state-of-the-art electronic chips for the tasks performed here.
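The scaling claims above reduce to short arithmetic, restated below as a back-of-envelope check. This is our restatement; in particular, the frame-rate line assumes one symbol per pixel at the 15.9-ps slot of the 62.9-gigabaud vector CA demonstration:

```python
# Quick arithmetic behind the scaling claims in the discussion.

tau = 15.9e-12                       # symbol slot at 62.9 gigabaud
pixels_4k = 4096 * 2160              # 4K-resolution frame
fps = 1 / (pixels_4k * tau)
print(fps)                           # ~7,100 frames per second (>7,000)

band_hz = 20e12                      # S + C + L bands, 1,460-1,620 nm
spacing_hz = 50e9                    # 50-GHz comb spacing
print(band_hz / spacing_hz)          # -> 400.0 wavelength channels
```

Polarization and spatial multiplexing multiply the channel count further, which is how the projection reaches peta-ops-per-second speeds and >24,000 synapses.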
Fig. 6 | Experimental and theoretically calculated results for image recognition. The upper figures show the sampled intensities of the 10 output neurons at the fully connected layer, while the lower figures show the confusion matrices (see Supplementary Information), with the darker colours indicating a higher recognition score.
Extended Data Fig. 1 | VCA for processing one-dimensional data. It consists of the experimental setup (right panel) and the optical and electronic control and signal flow (left panel). ADC, analogue-to-digital converter. 1D, one-dimensional.
Extended Data Fig. 2 | Generation of soliton crystal microcombs. a, Schematic diagram of the soliton crystal microcomb, generated by pumping an on-chip high-Q (quality factor >1 million) nonlinear micro-ring resonator with a continuous-wave laser. b, Image of the MRR (upper inset) and a scanning electron microscope image of the MRR's waveguide cross-section (lower inset). c, Measured dispersion Dint of the MRR showing the mode crossing at about 1,552 nm. d, Measured soliton crystal step of the intra-cavity power. e, Optical spectrum of the microcomb when sweeping the pump wavelength. f, Optical spectrum of the generated coherent microcomb at different pump detunings at a fixed power. FSR, free spectral range.
Extended Data Fig. 3 | The architecture of the optical CNN. The architecture includes a convolutional layer, a pooling layer and a fully connected layer.
Extended Data Fig. 4 | Fully connected layers. Architecture and experimental results. The left panel depicts the experimental setup, similar to the convolutional layer. The right panel shows the experimental results for one output neuron, including the shaped comb spectrum (top); the pooled feature maps of the digit 3 and the corresponding input electrical waveform (the grey and red lines illustrate the ideal and experimentally generated waveforms, respectively; middle); and the output waveform of the neuron and sampled intensities (bottom). Conv layer, convolutional layer. CW pump, continuous-wave pump laser.