Abstract
We apply object detection techniques based on deep convolutional blocks to end-to-end jet identification and reconstruction tasks encountered at the CERN Large Hadron Collider (LHC). Collision events produced at the LHC and represented as an image composed of calorimeter and tracker cells are given as input to a Single Shot Detection network. The algorithm, named PFJet-SSD, performs simultaneous localization, classification and regression tasks to cluster jets and reconstruct their features. This all-in-one single feed-forward pass gives advantages in terms of execution time and improved accuracy with respect to traditional rule-based methods. A further gain is obtained from network slimming, homogeneous quantization, and an optimized runtime, meeting the memory and latency constraints of a typical real-time processing environment. We experiment with 8-bit and ternary quantization, benchmarking their accuracy and inference latency against the single-precision floating-point model. We show that the ternary network closely matches the performance of its full-precision equivalent and outperforms the state-of-the-art rule-based algorithm. Finally, we report the inference latency on different hardware platforms and discuss future applications.
1. Introduction
The world's largest and most powerful particle accelerator, the CERN Large Hadron Collider (LHC) [1], operates at a nominal proton-proton collision rate of 40 MHz. Due to storage constraints and technological limitations (e.g. the speed of the read-out electronics), the volume of recorded data must be significantly reduced by the experiments operating around the accelerator ring. For this purpose, a set of algorithms collectively referred to as the trigger system is typically used to filter the incoming data stream. Trigger algorithms are designed to reduce the rate of recorded collision events (i.e. the collections of sensor readouts at each bunch crossing) while preserving the physics reach of the experiments. For example, at the Compact Muon Solenoid (CMS) experiment, the trigger system [2, 3] is structured in two stages using increasingly complex information and more refined algorithms:
- the Level 1 (L1) trigger, implemented on custom-designed electronics, reduces the 40 MHz input to a 100 kHz rate within 10 µs;
- the high level trigger (HLT), collision-reconstruction software running on a computer farm, reduces the 100 kHz output of the L1 trigger to 1 kHz within 150 ms.
With the planned LHC high-luminosity upgrade [4], the number of proton-proton collisions per second will surge approximately four-fold. The latency of legacy reconstruction algorithms will increase by more than a factor of three, as they may suffer from execution times that scale worse than linearly [5]. Along with the computing infrastructure upgrades, it is worth investigating solutions that could execute many tasks at once, while retaining accuracy and benefiting from the additional speedup offered by parallel computing architectures. Deep neural networks, such as those used for computer vision tasks, are an obvious candidate in this endeavour.
The majority of particles produced in LHC events are unstable and immediately decay to lighter particles, which can themselves decay to others in a so-called decay chain. Such a process terminates when the decay products are stable particles, e.g. charged pions. This collimated shower of particles with adjacent trajectories is called a jet. Jets are central to many physics studies at the LHC experiments [6–9]. In particular, a successful physics program requires aggregating particles into jets (jet clustering), an accurate determination of the jet momentum (momentum measurement) and the identification of which kind of particle started the shower (jet tagging) [10–13].
In this work, we show how jet clustering, momentum measurement, and tagging could all be handled simultaneously on parallel computing architectures. Besides the practical advantages of our approach, one can benefit from multitask learning when accomplishing several tasks at once [14]. For instance, a classifier and a regressor running together can learn that calibration constants depend on the nature of the jet, an issue which is currently handled with ad-hoc post-processing [15]; when the reconstruction problem is factorized into energy regression and tagging, the overall performance may drop for both. Our main contributions are as follows:
- We introduce the PFJet-SSD algorithm to perform localization, classification and additional regression tasks on jets in a single feed-forward pass (concurrently, or single-shot). We combine ideas from different fields of deep learning, i.e. object detection, attention mechanisms, network slimming and quantization.
- We report acceleration on different computing architectures.
- We generate and publicly share a dataset of simulated LHC collisions, pre-processed to be suited for computer vision applications similar to those discussed in this work, as well as for point-cloud end-to-end reconstruction. The dataset is available on Zenodo [16] and it is accompanied by annotated jet labels, to be used as ground truth during training.
We use the CMS detector and trigger system as an illustrative example. One could apply the same approach to other detectors, adapting the architecture to the detector granularity and latency constraints. The dataset, instructions, and code to fully reproduce our results are available at https://github.com/AdrianAlan/PF-Jet-SSD.
The remainder of this paper is structured as follows. In section 2 we review the key building blocks for this work, i.e. jet images, single-shot detection, attention mechanisms, and efficient model design. In section 3 we introduce the PFJet-SSD model and its quantized variants. In section 4 we describe the dataset and the training procedure. Finally, in sections 5 and 6 we discuss the results and future directions, respectively.
2. Techniques
In this section, we review the background techniques for this work, i.e. jet images, single-shot detection, attention mechanisms and the design of efficient inference networks with pruning and quantization. We examined architecture suggestions from [17], which lists methods for designing efficient networks that achieve state-of-the-art results on computer vision tasks. Some of these methods, e.g. the GELU activation layer [18], are currently unsupported by SensPro, our target hardware (see section 5.2). We thus excluded them from this work, but suggest they be examined in further optimization studies.
2.1. Jet images
Traditional approaches to jet tagging rely on features, such as jet substructure, designed by experts to detect characteristic energy-deposit patterns [19–27]. In recent years, several studies applied computer vision to event reconstruction at particle colliders, e.g. [28–43]. This was obtained by projecting the lower-level detector measurements of the emanating particles onto a cylindrical detector and then unwrapping the inner surface of the calorimeter onto a rectangle. This information was then interpreted as an image with calorimeter cells as pixels, where the pixel intensity maps the energy deposit of the cell, i.e. a jet image. This approach was also applied to end-to-end reconstruction, considering not just the individual jet but the whole event [44, 45]. Building on these works, we extend end-to-end reconstruction to include a localization task, merging the jet clustering and classification tasks into a single operation. Centralized computing environments are the only viable option for this: end-to-end approaches require a raw data representation as input, which is not available in reduced analysis data formats. For this reason, we also consider how the model could be compressed to reduce its computing footprint, having in mind an approach optimized for a trigger application.
2.2. Single shot detection
Object detection is a fundamental task in computer vision. It is defined as the classification of objects from predefined categories in the image along with their precise spatial locations. The spatial location and extent of an object can be defined coarsely using a bounding box, which is an axis-aligned rectangle tightly bounding the object. Modern object detection focuses on using primarily convolutional neural networks (CNNs) as the building block. Deep learning object detection achieved state-of-the-art results in tasks such as face [46] or pedestrian detection [47]. For a general survey on this subject, see [48, 49].
Deep-learning-based object detection models are typically divided into one- [50–54] or two-stage [55–59] detectors. Two-stage detectors generate a sparse set of regions with a high probability of an object being present first (region proposals), followed by a simple classification step. This two-step process is inefficient for real-time applications, due to task serialization. Single-step approaches classify and regress object locations concurrently (in a single feed-forward pass) and as such tend to achieve lower accuracy than two-stage detectors but are simpler and significantly more latency and memory efficient, hence having greater applicability to online problems.
The single-shot multibox detector (SSD) [60] is a simple one-stage, anchor-based detector. First, a set of default regions of fixed shape and size, called anchors, is predefined to discretize the output space of bounding boxes. These anchors have a diverse set of shapes to detect objects with different dimensions, i.e. multiple scales and aspect ratios. Based on the ground truth, the object locations are matched with the most appropriate anchors to obtain the supervision signal for the anchor estimation. At inference, each anchor is refined by four box coordinates (width, height, x and y offsets) and predicts the categorical probabilities. To avoid a huge number of negative proposals dominating the training gradients, hard negative mining is used to train the network, which fixes the foreground-to-background ratio [61] 6 . Alternatively, a focal loss [52] could be used; in this case, the price to pay would be more hyperparameters to tune. The SSD architecture is fully convolutional, with initial layers based on a pre-trained backbone, such as VGG-16 [62], followed by extra convolutional and pooling layers which progressively decrease the image size and thus increase the receptive field. The information in the last layer may be too coarse spatially to allow precise localization, while at the same time detecting large objects in shallow layers is non-optimal without large enough receptive fields. As a countermeasure, the SSD performs detection over multiple scales by operating on multiple feature maps, i.e. at different depths of the network. Each of these feature maps is responsible for detecting objects according to its receptive field; extra convolutional feature maps are added to the backbone architecture to detect large objects and increase receptive fields. The final prediction is made by merging all detection results from the different feature maps, followed by a non-maximum suppression (NMS) [60] step that produces the final detection information. NMS removes duplicate predictions originating from multiple anchors.
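As an illustration of this last step, below is a minimal sketch of SSD-style post-processing using the `nms` routine from `torchvision`; the boxes, scores and the 0.5 IoU threshold are illustrative values, not the configuration used in this paper.

```python
import torch
from torchvision.ops import nms

# Candidate boxes in (x1, y1, x2, y2) format, merged from all feature maps.
boxes = torch.tensor([[10., 10., 40., 40.],
                      [12., 11., 41., 42.],     # near-duplicate of the first box
                      [100., 80., 130., 110.]])
scores = torch.tensor([0.90, 0.75, 0.60])       # per-box confidence

# Suppress lower-scoring boxes that overlap a kept box above the threshold.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)                                      # tensor([0, 2])
```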
2.3. Attention mechanisms
Visual attention gates (AGs), e.g. [63–65], learn to suppress feature activations in irrelevant regions of an input image without additional supervision. At inference, the gates generate soft region proposals to highlight salient features useful for a specific task. Recently, the performance of deep CNNs on visual tasks was improved with scale-aware [66, 67], spatial-aware [68, 69] and channel-wise [70, 71] attention. However, most attention modules inevitably increase model complexity. The efficient channel attention (ECA) gate [72] is a soft attention mechanism that addresses this issue: it avoids dimensionality reduction and captures cross-channel interaction efficiently. The ECA gate ω is given by

$$\omega = \sigma\left(W \ast g(\chi)\right),$$

where χ is the feature map activation with C channels, g is channel-wise global average pooling, σ is the sigmoid function and W is the weight tensor of a 1D convolution with filter size k.
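A minimal PyTorch sketch of such an ECA gate, matching the equation above, could look as follows; the module name and the default k = 3 are our own choices.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP -> 1D conv of size k -> sigmoid gate."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # g: channel-wise global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # W: cross-channel 1D convolution
        w = torch.sigmoid(y)                       # sigma: per-channel gate omega
        return x * w.view(x.size(0), -1, 1, 1)     # rescale the feature map
```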
2.4. Quantization
Optimizing deep neural networks for efficient inference is an essential task in modern machine learning pipelines due to limitations presented by edge devices. Models should provide high accuracy with a minimum of computing time and resources. Apart from accelerating inference online, e.g. through parallelization or hardware optimizations, models can be optimized offline, through compression [73].
Network compression [74] is a common technique to reduce the number of operations and model size, energy consumption, and over-training of deep neural networks. As neural network synapses and neurons can be redundant, compression techniques attempt to reduce the total number of them, effectively reducing multipliers. Several approaches have been successfully deployed without much loss in accuracy, including selective removal of parameters based on a particular ranking and regularization, i.e. parameter pruning [75–77], compact network architectures [78–80], and reducing the precision of operations and operands, i.e. quantization [81–88].
It has been observed that reducing the precision of the calculations, i.e. of the weights and biases, has little impact on performance compared to the speedup and resource-usage gains. This includes moving away from 32-bit floating-point calculations (or full precision, FP) to fixed point, reducing bit-width and weight sharing. An example of a very aggressive strategy is reducing the weight precision to ternary values restricted to $\{-1, 0, +1\}$ only, called a ternary weight network (TWN) [89]. The quantization is performed during training, using a straight-through estimator [81], where ternary weights are used during the forward and backward propagation but not during the parameter update. To quantize the full-precision weights $W$ to ternary ones $W^t$, TWN uses a threshold value Δ:

$$W^t_i = \begin{cases} +1 & \text{if } W_i > \Delta, \\ 0 & \text{if } |W_i| \leqslant \Delta, \\ -1 & \text{if } W_i < -\Delta, \end{cases}$$

with the approximated solution $\Delta^* \approx 0.7 \cdot \mathbb{E}(|W|)$, where $\mathbb{E}$ is the expectation value. To make the network perform well, TWN minimizes the Euclidean distance between $W$ and $W^t$ along a non-negative scaling factor α that can be implemented with per-network, per-layer or per-channel granularity, transforming the weights to $\alpha W^t$. For any Δ the optimal α is computed as:

$$\alpha^* = \frac{1}{|I_\Delta|} \sum_{i \in I_\Delta} |W_i|,$$

where $I_\Delta = \{i : |W_i| > \Delta\}$ and $|I_\Delta|$ denotes the number of elements in $I_\Delta$.
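A minimal PyTorch sketch of this ternarization rule is given below; it follows the formulas above, while the per-tensor granularity and the function name are our own choices.

```python
import torch

def ternarize(w: torch.Tensor):
    """Ternarize a full-precision weight tensor following TWN [89].

    Returns w_t in {-1, 0, +1} and the scaling factor alpha, so that
    alpha * w_t approximates w in the Euclidean sense. During training this
    would sit inside a straight-through estimator: ternary weights in the
    forward/backward pass, full-precision weights in the parameter update.
    """
    delta = 0.7 * w.abs().mean()                 # approximate threshold Delta*
    mask = w.abs() > delta                       # the index set I_Delta
    w_t = torch.zeros_like(w)
    w_t[mask] = torch.sign(w[mask])              # +1 / -1 above the threshold
    alpha = w[mask].abs().mean() if mask.any() else w.new_tensor(0.0)
    return w_t, alpha
```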
3. Methodology
The PFJet-SSD architecture is shown in figure 1. We modify the original SSD architecture [60] and the Jet-SSD architecture proposed in [90]. Having in mind an HLT application with a typical latency of ≈150 ms, we extend the event image representation to include information from the charged-particle reconstruction, adding a tracker channel to the image in front of the calorimeter channels already introduced in [90]. We use a lightweight MobileNet architecture [78] as the backbone of our detector; it replaces each convolution operation with a combination of its depthwise and pointwise versions. Each convolution is followed by batch normalization [91, 92] and parametric rectified linear unit (PReLU) [93] activation layers. We use an AveragePool layer to decrease the size of the feature map. The extra convolutional layers proposed by the original SSD do not contribute to accurate detection (recall the remark about the increasing receptive field from section 2.2); this is due to the size of the jets. As done in [90], we remove these layers already at training time. Retaining the deeper layers of the backbone, i.e. Block10 and Block11, does not improve inference but is necessary during training due to the additional signals during back-propagation. Hence, these deeper layers are only purged after training, i.e. the concatenation layer ignores them only at inference. This alone substantially reduces the number of parameters in the final model.
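To make the backbone structure concrete, below is a minimal sketch of a MobileNet-style depthwise-separable convolution block with the batch normalization and PReLU layers described above; the kernel size and padding are our own assumptions, as the text does not specify them.

```python
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depthwise 3x3 conv + pointwise 1x1 conv, each followed by
    batch normalization and PReLU, as in the MobileNet backbone."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),  # depthwise
            nn.BatchNorm2d(c_in), nn.PReLU(c_in),
            nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
            nn.BatchNorm2d(c_out), nn.PReLU(c_out),
        )

    def forward(self, x):
        return self.block(x)
```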
Figure 1. PFJet-SSD architecture. The convolution block (a convolution followed by batch normalization and PReLU activation) is shown in yellow, the average pooling kernel in red, and the detection head (the output layer) in blue. ω denotes the attention module; the feature maps are merged in a concatenation layer. The numbers indicate the number of output channels in each block. Block10 and Block11 are removed at inference.
We add two new modules to the network. First, the initial convolutional layer is now followed by spatial dropout [94] (with p = 0.1). Second, we attach the ECA gate [72] (with k = 3) before the localization-classification-regression (LCR) layer.
The outputs of the detection head, which is the concatenation of LCR layers, correspond to the jet class, localization (η and φ offsets) and pT value (see the definitions in section 4.1). One might easily extend this output to include jet mass regression as well (we leave this out for simplicity). Each row in the detection head corresponds to an anchor box, i.e. a fixed position in the image. For localization, we regress only the centre of the jet, as we can determine its size from its class: we assume one fixed cone size for wide jets and a smaller one for narrow jets. This allows us to set only one scale and one aspect ratio for the anchors in each feature map, which reduces the complexity of the network. The detection head is the input to the NMS layer.
We use magnitude pruning [95] during training to find the optimal allocation of resources between layers. Unstructured pruning generally leads to a higher compression rate and/or higher accuracy than the structured version, but it requires special software or hardware accelerators to fully benefit from it: since the outcome of unstructured pruning is a sparse tensor, one needs a dedicated way to handle sparse memory access on hardware to turn pruning compression into a computational advantage at inference time. We use an alternative, a version of structured pruning that removes whole channels in a convolutional block, slimming the network without increasing sparsity. We target a hardware implementation that benefits from fusing batch normalization and convolution parameters at runtime; the fused filter weights of block l are the convolution weights scaled by γ, the scale parameter of the affine transformation of the subsequent batch normalization layer. We thus add a regularizer that pushes the influence of filters down through an L1 penalty on the batch normalization γ, similarly to [77, 96], scaling this penalty by the number of operations in each layer. We mark channels to prune using a threshold rule on the per-layer γ distribution, where the number of channels is counted per layer. When this rule is not sufficient to remove the specified number of channels, we simply select the remaining ones based on ascending magnitudes of γ.
Also during training, we quantize the network to homogeneous 8-bit fixed-point precision for both weights and activations, and to a 2-bit TWN with layer- and channel-dependent scaling factors. For the latter, we experimented with a grace period of frozen quantization during which the Δ and α parameters remain unchanged. Training the TWN in this manner may offer greater stability, i.e. the weights have time to adjust to the new parameters, but in our case the final results did not improve.
4. Experiments
In this section, we describe the experimental dataset and the training procedure used for the experiments.
4.1. Dataset
The input dataset consists of 13 TeV proton-proton collision events in which Randall-Sundrum (RS) gravitons with a 3.5 TeV mass are produced. This serves as a proxy for a sample giving jets of various kinds while populating a large spectrum of pT ranges: the RS gravitons decay to $t\bar{t}$, gg, $q\bar{q}$, HH, WW, or ZZ final states. The choice of this particular process is motivated by the possibility of creating well-defined jet pairs belonging to specific jet classes and with the same kinematic properties across classes. In addition to the hard collision, parasitic pileup collisions are also simulated by overlapping minimum bias events; the number of pileup collisions is sampled from a Poisson distribution.
The detector effects and hadronization have an important effect on a jet substructure. Events are generated with Pythia [97]. We use the CMS Delphes [98] description to mimic the effect of detector reconstruction. To apply this algorithm to another detector (e.g. ATLAS), one would have to modify the geometry of the input layer to match the detector geometry. In addition, one would have to repeat the training. Other effects (e.g. theoretical uncertainties related to hadronization models) would be detector independent. These kinds of uncertainties also affect rule-based algorithms and are usually neglected at the trigger stage, where they are subdominant. These uncertainties are measured with data control samples at the analysis stage and, usually, they are mitigated by applying a selection on the offline object so that the trigger behaviour is stable. The same set of state-of-the-art procedures could be applied to the algorithm we present. Being all this part of a standard data analysis workflow (and beyond the scope of this paper), we do not comment on this further.
The core of the CMS detector is a multi-layer silicon tracking device, operating in a 4 T magnetic field. Two calorimeter layers surround the tracker: the lead tungstate crystal ECAL is designed to stop particles whose main interaction is electromagnetic (photons and electrons); the brass and scintillator HCAL is designed to stop hadrons. They give a measurement of the energy of particles (charged and neutrals). Each of them is composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity (η) coverage provided by the barrel and endcap detectors. The calorimeter cells (towers) in the barrel region together with tracker cells are arranged in a fixed discrete space with fine segmentation in η and φ, where φ is the translated azimuthal angle. A more detailed description of the CMS detector, together with a definition of the coordinate system used and the relevant kinematic variables, can be found in [99].
Before the LHC, jets were usually reconstructed from their calorimeter deposits (known as CaloJet). With the start of the LHC, the CMS particle flow (PF) algorithm [100] demonstrated that the additional information from track reconstruction could increase the accuracy of jet reconstruction. In CMS, this was crucial to compensate for the poor energy resolution of the HCAL. In the long term, this strategy was found to be effective beyond jet momentum measurement, since the angular resolution of the tracking algorithm provided valuable information for jet tagging and substructure algorithms.
The PF algorithm for jet reconstruction was eventually adopted also by the ATLAS experiment [101]. Taking this as our starting point, we build our event image starting from the PF jet constituents (as returned by the Delphes PF algorithm), arranging the particles in three groups: charged particles, used to create the tracker channel; photons and electrons, used for the ECAL channel; neutral hadrons, used for the HCAL channel. In a real-life application, one could use the same approach or build the channels from the raw detector hits in the tracker, ECAL, and HCAL. The best approach to follow depends on the context of the application (e.g. online vs offline).
We unwrap the cylindrical detector to compose the final image, which is formed by translating the calorimeter energy deposits and tracker momenta into pixels at ECAL granularity, resulting in fixed-size pixel samples. An example is shown in figure 2. Some previous studies on jet images implemented data pre-processing steps such as translation, rotation, re-pixelation, or inversion. In our study, however, we only limit the input pseudorapidity range and standardize the pixel intensities.
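For illustration, a minimal sketch of forming a single input channel by binning particle (η, φ, pT) triplets onto a fixed grid is shown below; the grid shape and η range are placeholder assumptions, not the exact granularity used here.

```python
import numpy as np

def to_channel(eta, phi, pt, shape=(340, 360), eta_range=(-3.0, 3.0)):
    """Bin particle (eta, phi, pT) triplets onto a fixed grid to form one
    image channel; the pixel intensity is the summed pT."""
    img, _, _ = np.histogram2d(eta, phi, bins=shape,
                               range=[eta_range, (-np.pi, np.pi)],
                               weights=pt)
    return img
```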
Figure 2. An example input to the PFJet-SSD network: tracker information and energy deposits in CMS electromagnetic calorimeter (ECAL) and hadronic calorimeter (HCAL) translated to a two-dimensional image. The white bounding boxes correspond to ground truth with target label and momentum.
Jet labels are obtained using generator-level information. We assign the jet η (pseudorapidity, and not rapidity as is normally done in L1 trigger reconstruction), φ and pT (transverse momentum) measurements from the properties of the corresponding generator-level particle. The minimum jet pT in the dataset is 7 GeV. Details on the dataset profile are given in table 1, which reports the jet statistics across datasets. Figure 3 shows the pT, η and φ distributions.
Figure 3. Dataset profile as a function of pT (left), η (middle) and φ (right).
Table 1. Number of samples in the datasets.

| | Train | Validation | Test |
|---|---|---|---|
| t jets | 59 388 | 23 802 | 59 392 |
| W/Z jets | 118 701 | 47 493 | 118 832 |
| H jets | 59 967 | 23 997 | 59 978 |
| Total | 238 056 (41.6%) | 95 192 (16.7%) | 238 202 (41.6%) |
4.2. Training procedure
The PFJet-SSD network is implemented with PyTorch [102] and trained on NVIDIA Tesla GPUs. For training, we use stochastic gradient descent with an initial learning rate of 10⁻³, momentum set to 0.9 and weight decay to 0.0005. We train the network for 100 epochs with a batch size of 25, decreasing the learning rate by a factor of 2 every 10 epochs after the 20th epoch. We use 90k and 36k samples for training and validation, respectively. The training is performed in mixed precision to speed up computation and is distributed across 3 GPUs; we therefore replace the standard batch normalization layer with the SyncBatchNorm layer provided by PyTorch to synchronize statistics across the machines while training.
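For concreteness, a sketch of the optimizer and schedule described above is given below; the placeholder `model` and the exact milestone epochs (one reading of "every 10 epochs after the 20th") are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the PFJet-SSD network.
model = nn.Conv2d(3, 8, kernel_size=3)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
# Halve the learning rate every 10 epochs once the 20th epoch has passed.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=list(range(30, 100, 10)), gamma=0.5)
```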
We minimize the following cost function:

$$\mathcal{L} = \mathcal{L}_{\mathrm{class}} + \mathcal{L}_{\mathrm{loc}} + \mathcal{L}_{\mathrm{reg}},$$

where $\mathcal{L}_{\mathrm{class}}$ is the classification loss, $\mathcal{L}_{\mathrm{loc}}$ is the localization loss and $\mathcal{L}_{\mathrm{reg}}$ is the regression loss. We use cross-entropy with smooth labels (α = 0.1) for classification [103], and the Huber loss [57] (δ = 1) for localization and regression.
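A minimal sketch of this three-term objective is shown below; the loss choices follow the text, while the unweighted sum of the terms and the upstream anchor matching are assumptions.

```python
import torch.nn as nn

# Classification: cross-entropy with label smoothing (alpha = 0.1);
# localization and pT regression: Huber loss with delta = 1.
class_loss = nn.CrossEntropyLoss(label_smoothing=0.1)
huber = nn.HuberLoss(delta=1.0)

def detection_loss(cls_logits, cls_targets, loc_pred, loc_true,
                   pt_pred, pt_true):
    l_class = class_loss(cls_logits, cls_targets)  # jet class term
    l_loc = huber(loc_pred, loc_true)              # eta/phi offset term
    l_reg = huber(pt_pred, pt_true)                # pT regression term
    return l_class + l_loc + l_reg                 # unweighted sum (assumed)
```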
A common challenge when training object detection models from scratch is an insufficient amount of training data, which may lead to overfitting 7 . It is thus common to see practitioners pre-loading weights from classification models pre-trained on the real-world ImageNet [104] dataset. We found that such a procedure slows down our learning, as real-world images have little relation to our calorimeter images. The full-precision network (FPN) learns faster using Xavier uniform initialization [105] (which helps with the sparsity of the input). We also augment the training dataset with random flips along the η and φ dimensions, which we find greatly stabilizes the training. We did not experiment with other augmentation techniques such as changing brightness, contrast, saturation and hue, as jets are not invariant to such transformations. Experiments with other commonly used techniques, such as Mix-Up [106] or Mosaic [107], again yielded subpar results, likely because of the different nature of our input.
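A sketch of the flip augmentation is given below, assuming a (C, H, W) image with H along η and W along φ and box centres stored as (η, φ) pixel indices; this layout is our assumption.

```python
import torch

def random_flip(image: torch.Tensor, centers: torch.Tensor):
    """Randomly flip the event image along eta and/or phi; the jet box
    centres are mirrored accordingly."""
    if torch.rand(1) < 0.5:
        image = torch.flip(image, dims=[1])                  # flip along eta
        centers[:, 0] = image.size(1) - 1 - centers[:, 0]
    if torch.rand(1) < 0.5:
        image = torch.flip(image, dims=[2])                  # flip along phi
        centers[:, 1] = image.size(2) - 1 - centers[:, 1]
    return image, centers
```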
We perform five steps of iterative pruning, each with 20 epochs of retraining and a gradually decreasing number of channels in each block. We then retrain the network one last time for 100 epochs. We found that pre-loading the FPN weights when training the quantized versions, i.e. the TWN and the 8-bit fixed-precision (INT8) network, greatly speeds up convergence.
5. Results
In this section, we present the detection and latency performance of PFJet-SSD.
5.1. Detection performance
As a proof of concept, we investigate the tagging of top-quark (t), W and Z boson (V) and Higgs boson (H) jets. An example of the PFJet-SSD output is shown in figure 4. PFJet-SSD outputs the predicted categorical label, the prediction confidence and the centre coordinates of the object. In object detection, a true positive is defined as a prediction whose category equals the ground-truth label and whose intersection over union (IoU) with the ground-truth box is above a predefined threshold, usually 0.5. A successful prediction meets both criteria; otherwise, it is considered a missed detection. In our case we substitute the IoU requirement with a distance metric in pixels, as we regress only the centre of the box and the box dimensions are universal across target classes.
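A sketch of this matching rule is given below; the pixel threshold `max_dist` is a placeholder, as the exact value is not quoted here.

```python
import numpy as np

def is_true_positive(pred_center, pred_label, true_center, true_label,
                     max_dist=2.0):
    """A prediction counts as a true positive when its class matches the
    ground truth and its centre lies within max_dist pixels of the true
    centre (replacing the usual IoU criterion)."""
    dist = np.hypot(pred_center[0] - true_center[0],
                    pred_center[1] - true_center[1])
    return (pred_label == true_label) and (dist <= max_dist)
```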
Figure 4. Two examples of PFJet-SSD inference for two events, with the input image and highlighted true labels (left) and predicted bounding boxes (right). The overlapping boxes in the second event correspond to a decay in which two jets, t and W, are very close.
Our investigation into inference does not find any systematic issues. Occlusion, such as in decays where jets are near each other, is not an obstacle to correct detection. Also, jets close to the image edges are generally classified correctly.
To evaluate the model we use the precision (or positive predictive value, PPV = TP/(TP + FP)) and recall (true positive rate, TPR = TP/(TP + FN)) curve, and the average precision (AP) metric, see figure 5. Intuitively, precision measures how accurate the predictions are, while recall measures the quality of the positive predictions. Collectively, they determine how well the found set of jets corresponds to the set we expect to find. To draw a precision-recall (PR) curve, the predictions are first sorted in order of confidence, followed by the calculation of the PPV and TPR for each confidence threshold. We held out 90k samples as our test dataset. The TWN results closely match those of the FPN; the TWN benefits from the long retraining period, as it yields a marginally better AP. For performance details across target jet classes, see table 2.
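For reference, a minimal sketch of drawing such a curve from sorted confidences is shown below; the function and argument names are illustrative.

```python
import numpy as np

def precision_recall(confidences, matched, n_truth):
    """Sort predictions by confidence, then accumulate true/false positives
    at each threshold. `matched` flags whether each prediction is a true
    positive; `n_truth` is the total number of ground-truth jets."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    flags = np.asarray(matched, dtype=bool)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    precision = tp / (tp + fp)            # PPV at each confidence threshold
    recall = tp / n_truth                 # TPR relative to all true jets
    return precision, recall
```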
Figure 5. Precision-Recall curves for the baseline algorithm and three flavours of the PFJet-SSD model. The inference is performed on original, non-mixed samples (top), two mixed events in superposition (middle), three mixed events in superposition (bottom).
Table 2. Detection performance for the baseline and PFJet-SSD algorithms, reporting the number of parameters (NoP), the number of operations (NoOps), the precision of weights/activations (W/A), the average precision (AP) and the precision at 0.3 (P@R = .3) and 0.5 (P@R = .5) recall. The table does not report parameters and bit precision for the baseline as it is a non-parametric method: not applicable (N/A). The baseline is also unable to reach a recall of 0.3 or 0.5 in several cases: no statistics (N/S).
| | Physics baseline | PFJet-SSD FPN | PFJet-SSD TWN | PFJet-SSD INT8 |
|---|---|---|---|---|
| NoP | N/A | 111 228 | 111 228 | 111 228 |
| NoOps | N/A | 1.095G | 1.095G | 1.095G |
| W/A | N/A | 32/32 | 2/32 | 8/8 |
| AP (all jets) | .161 | .848 | .857 | .566 |
| t jet AP | .420 | .865 | .872 | .473 |
| t jet P@R = .3 | .736 | .985 | .988 | .531 |
| t jet P@R = .5 | .627 | .975 | .980 | .453 |
| W/Z jet AP | .245 | .847 | .859 | .629 |
| W/Z jet P@R = .3 | .584 | .944 | .955 | .673 |
| W/Z jet P@R = .5 | N/S | .929 | .943 | .653 |
| H jet AP | .107 | .860 | .872 | .335 |
| H jet P@R = .3 | N/S | .992 | .996 | .453 |
| H jet P@R = .5 | N/S | .978 | .986 | .400 |
For the non-mixed samples in figure 5, the TWN and FPN agree remarkably well and yield an appealing precision for any given recall. The fixed-point INT8 network drops in detection precision, hinting that the network is sensitive to activation but not to weight quantization. Hence, mixed-precision quantization should be explored in the future, as it is likely that not all layers contribute equally to this reduced performance. All flavours of PFJet-SSD outperform the physics baseline. We also experimented with two and three events overlaid as the input to the network. This creates a much noisier input and results in visibly reduced performance of PFJet-SSD; however, the network was not trained on such samples, so such a drop is expected. Besides, the difference between two and three mixed events is minor. In the future, we suggest training the network with Mix-Up [106] or Mosaic [107] augmentations (techniques for mixing multiple samples), which could improve performance on noisier inputs.
Throughout, we compare PFJet-SSD to a baseline, a physics-based algorithm combining a selection on the jet soft-drop mass [108], m (under a specific mass hypothesis), and a threshold requirement on the appropriate ratio of N-subjettiness [109] variables, τ. In particular, we require m to fall within a window (in GeV) around the top-quark mass for t jets and use the τ3/τ2 N-subjettiness ratio to define a score; with this score we make a performance assessment that we can directly compare to that obtained with the PFJet-SSD algorithm. Similarly, we require mass windows (in GeV) for V jets and for H jets, using the τ2/τ1 N-subjettiness ratio as a score for these baseline taggers. This physics-motivated baseline has performance that is typical of a rule-based state-of-the-art substructure jet tagger, with a typical recall of 0.3 at a precision of 0.6.
Figure 6 shows the dependence of the precision at fixed recall across different jet classes. The precision is rather flat in all cases. The TWN results closely match the FPN ones, while an overall drop in performance (approximately constant across η, φ, and pT) is observed for the INT8 network. A drop is observed at the boundaries of the η region, as a consequence of jets leaking out of acceptance at the edge of the endcaps (missing information on a part of the shower). Such a drop is not observed in the φ dimension, suggesting that the network can handle the periodicity of the image. The precision across pT stays relatively flat; the sudden drop in the high-pT region for V jets is due to the low number of samples in that region, see the details in section 4. Notice that the drop in TPR at low jet pT for top, W/Z, and H tagging is induced by the transition from a boosted-jet to a resolved-jet regime. Although there has so far been little use of boosted jets from heavy particles in this low-pT regime, it is interesting to notice that the drop in efficiency of PFJet-SSD in the low-pT regime is less pronounced than for the baseline algorithm, which could be exploited to increase the reconstruction efficiency in the transition between boosted and resolved topologies.
Figure 6. Precision at recall 0.3 (PPV@R = 0.3) for t (top), V (centre) and H (bottom) jets as a function of η (left), φ (middle) and pT (right). For each block of figures (t, V, and H), we show results for the FPN, TWN, and INT8 models.
Figure 7 shows the residual in the determination of η and φ and the ratio of the reconstructed-to-true jet pT, as a function of the jet pT for the different classes.
Figure 7. Displacement in η and φ, and relative pT regression error, for top (top), V (centre) and H (bottom) jets, as a function of generator-level η (left), φ (middle) and pT (right). For each block of figures (top, V, and Higgs), we show results for the FPN, TWN, and INT8 models.
Finally, we visualize the most frequently recurring filters of the TWN in figure 8. Remarkably, the network optimizes to use a set very similar to commonly used filters, e.g. smoothing, corner detection or edge detection.
Figure 8. Most common filters of the PFJet-SSD TWN.
5.2. Latency and power measurements
We investigate the latency and throughput of the proposed algorithm on architectures where parallel computing is more adequate. We compare the baseline, running native PyTorch inference on an Intel Xeon Silver 4114 CPU, with an ONNX-accelerated version and a TensorRT-optimized version running on an NVIDIA Tesla V100. Results are given in figure 9, separately for CPUs and GPUs. Having in mind an offline application, one could maximize the throughput by running the network at once across batches of events, e.g. implementing the inference-as-a-service concept discussed in [110].
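As an illustration of the ONNX path, a minimal export sketch is shown below; the placeholder `model` and the input image size are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the trained PFJet-SSD network.
model = nn.Conv2d(3, 8, kernel_size=3).eval()
# Three-channel (tracker, ECAL, HCAL) event image; the spatial size here
# is an assumption, not the exact granularity used in the paper.
dummy = torch.randn(1, 3, 340, 360)
torch.onnx.export(model, dummy, "pfjet-ssd.onnx", opset_version=13)
```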
Figure 9. Comparison of inference latency and throughput for different versions of PFJet-SSD running on different platforms.
While the inference-as-a-service paradigm could also be implemented online, the current design of HLT farms foresees that processing parallelization is achieved by sending different events to different computing units. In this context, the batch size is constrained to one, since the inference of the proposed SSD model happens per event. In this case, execution on a CPU would be borderline, within the average event processing latency but consuming most of it. On the other hand, moving the execution to a GPU would reduce the execution time to negligible levels. This could be particularly interesting under the assumption that GPUs would be used to run the local reconstruction [111–113] and the creation of PF candidates [114].
Deep learning inference at scale requires high power consumption, especially with the use of GPUs and CPUs. It is possible to keep the power and die area at more manageable levels by deploying an AI-specific hardware platform as used in edge devices. Since edge devices usually operate on batteries where power is a limited resource, AI-specific hardware platforms for edge devices are highly power efficient. With smaller die areas, manufacturing costs and power consumption can be reduced.
SensPro is a family of ultra-light AI DSPs that can perform efficient inference while consuming only a fraction of the power and area used by GPUs and CPUs. CEVA's hardware platform for jet detection consists of a stack of ten SensPro (SP) DSP cores, each delivering 2 TOPS, with an additional SP core serving as a controller. This solution delivers 20 TOPS and can run the TWN natively, reaching a latency comparable to a GPU running an 8-bit network. The proposed layout has orders of magnitude lower area and power consumption than the GPU and CPU, see table 3. The SP ultra-light solution can also be synthesized on an FPGA and used in collision detection.
Table 3. Die area, power and latency measurements for different hardware architectures. The latency is measured for inference on a single input. The reason for this is that collision detection is done sequentially in real-time. The input data is fed to the network directly from the sensor without storing it.
| | Die area (mm²) | Power (W) | Latency (ms) |
|---|---|---|---|
| DSP CEVA SP1000, 2×8 | 0.77 | 0.75 | 8.5 |
| DSP CEVA 10×SP1000 + Controller, 2×8 | 8.47 | 8.25 | 0.9 |
| GPU NVIDIA Tesla V100, 8×8 | 815 | 250 | 1.1 |
| CPU Intel Xeon Silver 4114, 32 float | 4294 | 85 | 134 |
6. Conclusions
We propose a fast and lightweight detection algorithm for jet tagging and reconstruction based on computer vision techniques. High precision and generalization are naturally required, but nuisance factors of variation can break the algorithm, which makes this problem hard. Intra-class variations, such as perspective distortion (e.g. rotation), densely arranged jets (occlusion), or blurred signatures (the detector response may not be clear), are common challenges. Besides, as jets are small objects, a recurring issue for object detectors, background pileup may further disturb their visual appearance. Thus, robustness to detector effects, imperfections and failures is required.
Even after a successful proof of concept, deployment to production will still pose challenges, as many of the problems lie outside of the simulation. More importantly, real-time detection requirements force further investigation into optimizations of the algorithm and the hardware runtime.
The PFJet-SSD paves the way for solving these issues. The algorithm did not experience accuracy drops during pruning, suggesting that the depth of the network is more important than its width; the number of channels can likely be reduced further, speeding up computations. We observed a gap between the TWN and INT8 performance, which suggests that the optimal quantization level could be achieved through mixed precision, a possible direction for future studies.
From the physics point of view, the algorithm manifests an interesting behaviour in low-momentum regions out of reach for the baseline model, see the high-precision results in figure 6, which could help increase the reconstruction efficiency for all-hadronic decays of heavy particles in the transition regime between boosted and resolved topologies.
Acknowledgments
We thank Loukas Gouskos and Huilin Qu for useful discussions and suggestions. A A P, M P, S S and V L are supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 772369). A A P is supported by CEVA under the CERN Knowledge Transfer Group.
Data availability statement
The data that support the findings of this study are openly available at https://doi.org/10.5281/zenodo.4883651.
Footnotes
6. By background we refer to the areas without target objects, i.e. jets.
7. That is not a problem in our case, as we can generate more events at low cost.