An Encoder–Decoder Architecture within a Classical Signal-Processing Framework for Real-Time Barcode Segmentation
"> Figure 1
<p>Topology of the VGG16 model. ReLU stands for the type of activation function used: rectified linear unit.</p> "> Figure 2
<p>SegNet topology. It uses VGG16 as the encoder and adds a specularly symmetric decoder. Note that the indices of the max pooling operations are used when upsampling.</p> "> Figure 3
<p>A series of rotated image tiles containing bars, surrounded by plots of their pixel intensity sums per row (red) and per column (blue). The variance difference among them is maximal when the bars are fully vertically or horizontally oriented.</p> "> Figure 4
<p>Illustration of how adjacent pixels and subsequently line segments are combined in the partial stages of DRT. In this example, <math display="inline"><semantics><mrow><mn>16</mn><mo>×</mo><mn>16</mn></mrow></semantics></math> pixels (represented by circles) are combined initially, to the left, two by two; in the next stage, four by four; and finally, eight by eight. The final tile spacing between solutions (arrows, at the bottom) is eight. In each partial-stage solution, pixels are combined to consider each displacement and slope, but in non-overlapping vertical and horizontal bands, separated by dashed lines. Eminently horizontal lines are rendered in red, whilst vertical ones are rendered in blue.</p> "> Figure 5
<p>Dilemma of how to choose the right tile size to study a barcode region (in blue). From left to right, the tile subdivisions are shown for three different tile sizes, each of which quadruples the previous one in area. The tiles are depicted as rectangles with a circle at their centre and define the spacing of the output grid. The tiles whose centres are within the barcode region are depicted in a coarser line.</p> "> Figure 6
<p>From left to right, from top to bottom: the output of the method by Oliva-García et al. [<a href="#B33-sensors-23-06109" class="html-bibr">33</a>] given the zoomed example image in the top-left corner at scales ranging from <tt>tile_size = 2</tt> to <tt>tile_size = 32</tt>. Note that as the <tt>tile_size</tt> is doubled, the areas are quadrupled in size, but their number is reduced to a quarter. Each result was upsampled with nearest-neighbour interpolation so that all are shown to be the same size. The hue encodes the angle of the bars, and the intensity encodes the intensity of detection.</p> "> Figure 7
<p>On the left, large tile size with smaller tile spacing, implying tile overlapping. On the right, the merging of tiles at various scales to optimise barcode coverage.</p> "> Figure 8
<p>A depiction of how non-overlapping bands fuse at scales two and four to obtain a final output spacing of eight pixels per dimension. The depiction combines, in one picture, the evolution of vertical bands as they are shown at the bottom of <a href="#sensors-23-06109-f004" class="html-fig">Figure 4</a>.</p> "> Figure 9
<p>Illustration of how the final tile spacing determines the bands of computations in partial stages. Both represent <math display="inline"><semantics><mrow><mi>N</mi><mo>=</mo><mn>16</mn></mrow></semantics></math> and <tt>tile_size</tt><math display="inline"><semantics><mrow><mo>=</mo><mn>8</mn></mrow></semantics></math>. They differ in the <tt>tile_spacing</tt>, which is 4 on the left and 2 on the right.</p> "> Figure 10
<p>Architecture of the detector based on the partial DRT, as described in [<a href="#B33-sensors-23-06109" class="html-bibr">33</a>] but taking advantage of all the calculated stages up to <math display="inline"><semantics><mrow><mi mathvariant="monospace">tile</mi><mo>_</mo><mi mathvariant="monospace">size</mi><mo>=</mo><mn>32</mn></mrow></semantics></math>.</p> "> Figure 11
<p>New encoder topology, based on the work by Oliva-García et al. [<a href="#B33-sensors-23-06109" class="html-bibr">33</a>] and modified as described in this section. Note that for simplicity, the partial DRT data structure was made smaller in scale for display purposes.</p> "> Figure 12
<p>From <b>left</b> to <b>right</b>: input image and activation maps at scale = 2, slicing for slopes 0 to 12, which correspond to angles in the range <math display="inline"><semantics><mrow><mo>[</mo><mo>−</mo><mi>π</mi><mo>/</mo><mn>4</mn><mo>,</mo><mn>3</mn><mi>π</mi><mo>/</mo><mn>4</mn><mo>]</mo></mrow></semantics></math>. A colormap is used to represent intensity, ranging from a darker blue to a lighter blue. The upper part of the image contains lines at an angle of <math display="inline"><semantics><mrow><mi>π</mi><mo>/</mo><mn>6</mn></mrow></semantics></math>, and the lower part of the image contains lines at an angle of <math display="inline"><semantics><mrow><mi>π</mi><mo>/</mo><mn>3</mn></mrow></semantics></math>. Note that the activation map correctly activates more in the corresponding slice of each half.</p> "> Figure 13
<p>From <b>left</b> to <b>right</b>: input image and activation maps for the central angle of each scale, from finer (first) to coarser (last) scales. A colormap is used to represent intensity, ranging from a darker blue to a lighter blue. It is clear that finer lines (lower part) produce more intensity at the second and third scales, while coarser lines (upper part) are better detected at the last scale.</p> "> Figure 14
<p>Example of unpooling operation. The pooling operation reduces dimensionality, and the unpooling operation increases it. Note that there are “gaps” where no indices refer to the array on the right.</p> "> Figure 15
<p>From <b>left</b> to <b>right</b>: cropped part of input image; coarse activation map for horizontal angles; fine activation map for horizontal angles; the result of performing the unpooling described in Algorithm 3. Note that there are “gaps” and that each activation function that got propagated used the coarse angle information, and the correct spot was calculated by finding the argument of the maxima within the corresponding area at a finer scale.</p> "> Figure 16
<p>Full encoder–decoder architecture, where the DRT-based encoder is matched with the newly designed decoder integrating the operations described in this section.</p> "> Figure 17
<p>Average execution times of the Halide implementation of the four algorithms. These measurements were performed on an i9 9900 desktop CPU and on the CPU of the Qualcomm Snapdragon 888 mobile SoC. The red dashed line emphasises the 1/30 s time limit.</p> "> Figure 18
<p><b>Left</b>: unprocessed output of the <tt>MDD DRT</tt> segmenter. <b>Right</b>: superimposed on the input image, connected regions whose aspect ratio between the major and minor oriented axes is consistent with a barcode and whose area exceeds a minimum threshold.</p> "> Figure 19
<p>The <tt>MDD DRT</tt> algorithm was executed for thresholds ranging from 10 to 38 on both the WWU Münster dataset and the Arte-lab Rotated dataset. The selected threshold value is marked with a dashed red line.</p> "> Figure 20
<p>The IoU of a barcode that was scaled to different sizes.</p> "> Figure 21
<p>The IoU of a barcode that was rotated to angles ranging from 0 to 180 degrees.</p> "> Figure 22
<p>Columns, from left to right: input image and output of algorithms <tt>PDRT 2</tt>, <tt>PDRT 32</tt>, <tt>PS DRT</tt>, and <tt>MDD DRT</tt>. Rows, challenging scenarios, from top to bottom: elongated, cluttered, mild lens blur, strong lens blur, motion blur, low contrast, and shear.</p> "> Figure 22 Cont.
<p>Columns, from left to right: input image and output of algorithms <tt>PDRT 2</tt>, <tt>PDRT 32</tt>, <tt>PS DRT</tt>, and <tt>MDD DRT</tt>. Rows, challenging scenarios, from top to bottom: elongated, cluttered, mild lens blur, strong lens blur, motion blur, low contrast, and shear.</p> "> Figure 23
<p>Frame of the video provided in the repository documentation. Available at <a href="https://raw.githubusercontent.com/DoMondo/an_encoder_decoder_architecture/master/readme_data/cover.gif" target="_blank">https://raw.githubusercontent.com/DoMondo/an_encoder_decoder_architecture/master/readme_data/cover.gif</a>, accessed on 25 June 2023.</p> ">
Abstract
1. Introduction
1.1. Barcode Detection Summary
1.2. Semantic Segmentation
1.3. Review of Selected Works
- Robust angle-invariant, 1D barcode detection [16]
- Real-time barcode detection in the wild [10]
- Low-computation egocentric barcode detector for the blind [11]
- Real-time barcode detection and classification using deep learning [18]
- Universal barcode detector using semantic segmentation [17]
- One-dimensional barcode detection: novel benchmark datasets and comprehensive comparison of deep convolutional neural network approaches [24]
1.4. Remaining Challenges
1.5. Purpose of This Work
1. To propose an adapted version of the above method that allows overlapping among local detection zones, so that tile sizes and the spacing between detection processes are decoupled.
2. To propose a second adapted version that works by merging the multiple scales of the original method, i.e., the different tile sizes, in an automatic and optimal way.
3. To compare the original method with the two cited innovations, and to compare these with existing methods for barcode segmentation.
2. Discrete Radon Transform as a Bar Detector
2.1. Discrete Radon Transform
1. A method to better expose the parallelism of the DRT [39], which obtains, in one pass, the eminently horizontal lines of the form y = sx + d, with slope |s| ≤ 1, and, in another pass, the eminently vertical ones of the form x = sy + d, thus changing from four quadrants to two groups of 90 degrees (around the horizontal axis and the vertical axis). Additionally, this reformulation eliminates the need for the prior zero padding that the conventional DRT requires.
2. A bar detector method, built upon the aforementioned modified DRT, that acts locally [33], giving an estimate of the presence of bars and their inclination for each block of size tile_size × tile_size into which an image can be subdivided without overlapping; this local size is denoted by tile_size (see the sketch below).
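As a rough illustration of the pair-combining recursion behind these two building blocks, the following Python sketch computes the eminently horizontal pass of a partial DRT up to a given tile_size. It is a didactic approximation under our own naming and border-handling assumptions, showing only the non-negative discrete slopes of one pass; the actual implementation differs, notably in how it organises bands and avoids padding.

```python
import numpy as np

def partial_drt_horizontal(img: np.ndarray, tile_size: int) -> np.ndarray:
    """Eminently horizontal pass of a partial DRT, stopped at segments
    spanning `tile_size` columns (a power of two).

    data[y, x, s] holds the sum along a discrete segment that starts at
    pixel (x, y), spans the current width in columns, and rises s rows in
    total. Each stage doubles the width by combining two half-segments,
    so only log2(tile_size) stages are needed.
    """
    n_rows, n_cols = img.shape
    data = img.astype(np.float64)[:, :, np.newaxis]  # width 1, slope 0
    width = 1
    while width < tile_size:
        width *= 2
        prev = data
        data = np.zeros((n_rows, n_cols - width + 1, width))
        for s in range(width):              # total rise over `width` columns
            for x in range(n_cols - width + 1):
                for y in range(n_rows):
                    # The right half-segment starts width/2 columns later
                    # and ceil(s/2) rows further down; both halves reuse
                    # the rise-(s // 2) solutions of the previous stage.
                    y_right = y + (s + 1) // 2
                    left = prev[y, x, s // 2]
                    right = (prev[y_right, x + width // 2, s // 2]
                             if y_right < n_rows else 0.0)
                    data[y, x, s] = left + right
    # Pure-Python loops for clarity; real implementations vectorise or
    # compile the recursion (e.g., with Halide), as this work does.
    return data
```

The full method mirrors this pass for the remaining 45-degree ranges and for the eminently vertical lines, which is what allows the two-group formulation described in item 1.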
2.2. From a Line Detector to a Local Bar Detector
2.2.1. Disadvantages of Working at a Single Scale
1. At smaller scales, the angle cannot be determined with high precision. However, at the smallest scale, the method still discriminates between vertical bars (colours in the range of the blues) and horizontal bars (colours in the range of the oranges).
2. If the scale is so small that a tile observes only a uniform area (e.g., the interiors of bars and spaces at scales of two and four), no activation is generated. Although correct, this represents a problem when grouping all the bars and spaces of varying widths belonging to the same barcode.
3. On the other hand, at larger scales, there may be tiles that simultaneously observe part of the barcode and the background. This poses a problem in accurately determining the boundaries of the code.
4. In the previous case, i.e., tiles containing both bars and background, but also for tiles covering a zone with only a few thick bars compared with a tile covering a zone with thin bars, the activation intensity decreases. This explains the variability in detection intensity while the hue, where the angle is encoded, remains uniform in areas with barcodes at the greatest scale.
5. At large scales, a drop in detection intensity can also occur when there are differently angled structures within a single tile, as in zones with alphanumeric characters. In these cases, unlike the previous two, smaller-scale sub-tiles may disagree in angle, and there may even be greater detection intensity in the parts than in the whole.
2.3. Initial Ideas for the New Methods
3. Overlapping Tiles for Increased Spatial Resolution
3.1. Computational Complexity without Overlapping
Algorithm 1 Computing the partial DRT of an image with tile overlapping.
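Conceptually, decoupling tile_size from tile_spacing amounts to evaluating the local detector on a denser grid of overlapping tiles. The Python sketch below shows this naive equivalent under our own naming (tile_detector is a hypothetical placeholder for any per-tile measure); Algorithm 1 achieves the same output far more cheaply by restructuring the non-overlapping bands of the partial DRT itself, so that intermediate sums are shared between overlapping tiles, as Figure 9 illustrates.

```python
import numpy as np

def detect_with_overlap(img, tile_size, tile_spacing, tile_detector):
    """Run a local bar detector over a grid of (possibly) overlapping tiles.

    With tile_spacing == tile_size this reproduces the non-overlapping
    scheme of [33]; with tile_spacing < tile_size the output grid becomes
    denser and tiles overlap. `tile_detector` stands in for any per-tile
    measure (e.g., dominant slope and its detection intensity).
    """
    n_rows, n_cols = img.shape
    grid = [
        [tile_detector(img[y:y + tile_size, x:x + tile_size])
         for x in range(0, n_cols - tile_size + 1, tile_spacing)]
        for y in range(0, n_rows - tile_size + 1, tile_spacing)
    ]
    return np.array(grid)
```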
3.2. Computational Complexity with Overlapping
4. Multiscale Domain-Based Segmentation
4.1. Modifying DRT Output for Multiscale Analysis
4.2. Finding Similarities with Machine Learning-Based Algorithms
4.3. Adapting the Existing Algorithm to Resemble an Encoder
Algorithm 2 Creating the activation maps for the CNN-like encoder from partial DRTs.
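To give a flavour of the reshaping involved, the following sketch assembles an activation volume (rows × columns × angles) from the two partial-DRT passes at one scale, in the spirit of the slope-sliced maps of Figure 12. The layout, the pass-flipping, and the normalisation shown here are our own illustrative assumptions, not the exact steps of Algorithm 2.

```python
import numpy as np

def activation_maps(drt_h: np.ndarray, drt_v: np.ndarray) -> np.ndarray:
    """Build a CNN-like activation volume from the two partial-DRT passes.

    `drt_h` and `drt_v` are assumed to hold, per output tile, the response
    for each discrete slope of the eminently horizontal and eminently
    vertical passes (shape: rows x cols x slopes). Concatenating the two
    90-degree groups along the slope axis yields one channel per discrete
    angle covering [-pi/4, 3*pi/4), so each depth slice of the result acts
    as an activation map for one angle (cf. Figure 12).
    """
    # Flip one pass so the angle axis increases monotonically when the
    # two 90-degree groups are concatenated.
    maps = np.concatenate([drt_h, drt_v[:, :, ::-1]], axis=2)
    # Remove the per-tile mean so uniform regions produce no activation.
    maps = maps - maps.mean(axis=2, keepdims=True)
    return np.maximum(maps, 0.0)  # ReLU-like: keep positive evidence only
```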
4.4. Testing the Encoder
4.5. Designing a Decoder
An Upsampling Operation for a DRT-Based Encoder
Algorithm 3 Unpooling operation for a DRT-based encoder.
Input: fine activations; coarse activations.
Output: new activations, at the size of the fine scale.
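A minimal sketch of this unpooling, reconstructed from the behaviour described in Figures 14 and 15: each coarse activation is propagated to the finer scale at the position of the maximum of the corresponding 2 × 2 fine-scale area, leaving “gaps” (zeros) elsewhere. The shapes and the simplified handling of the angle (depth) axis are our assumptions.

```python
import numpy as np

def drt_unpool(fine: np.ndarray, coarse: np.ndarray) -> np.ndarray:
    """Unpooling for the DRT-based decoder.

    fine:   (2H, 2W, A) activations at the finer scale
    coarse: (H, W, A) activations at the coarser scale
    returns (2H, 2W, A) activations carrying the coarse values
    """
    H, W, A = coarse.shape
    out = np.zeros_like(fine)
    for i in range(H):
        for j in range(W):
            for a in range(A):
                block = fine[2 * i:2 * i + 2, 2 * j:2 * j + 2, a]
                # The argmax within the corresponding fine-scale area
                # decides where the coarse activation lands (cf. Figure 15).
                di, dj = np.unravel_index(np.argmax(block), block.shape)
                out[2 * i + di, 2 * j + dj, a] = coarse[i, j, a]
    return out
```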
1. Propagating features that are prominent at a coarser scale into a finer scale. This is achieved with the unpooling in Algorithm 3.
2. Propagating features to the neighbouring vicinity. In this problem, vicinity is defined in the x- and y-dimensions but also in the angle (depth) dimension.
3. Allowing features that are prominent at any scale to be considered. As Figure 13 suggests, there can be bars that are only detected at finer scales, so coarser detection should not overthrow these if they are more prominent (see the sketch after this list).
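Requirement 3 can be met by fusing the scales with an element-wise maximum once the unpooling steps have brought every scale to a common resolution. A minimal sketch of that fusion; the actual decoder interleaves it with the unpooling and convolution-like smoothing stages:

```python
import numpy as np

def fuse_scales(activations: list[np.ndarray]) -> np.ndarray:
    """Keep the most prominent evidence across scales.

    `activations` holds one (rows x cols x angles) volume per scale, all
    already upsampled to the same shape. A finer-scale detection survives
    whenever it is stronger than the coarser one at the same position.
    """
    fused = activations[0]
    for act in activations[1:]:
        fused = np.maximum(fused, act)
    return fused
```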
5. Implementation
5.1. Pruning and Optimisation
- PDRT 2: the original algorithm of the partial discrete Radon transform as described in [33], executed with tile_size = 2. This means that a single step of the DRT was calculated.
- PDRT 32: the original algorithm of the partial discrete Radon transform as described in [33], executed with tile_size = 32. This means that five steps of the DRT were calculated.
- PS DRT: the partially strided DRT-based detector as described in Section 3, executed with the tile_size and tile_spacing values chosen there.
- MDD DRT: the multiscale domain detector based on the DRT, as described in Section 4.
5.2. Choosing a Fast Approximation of a Gaussian Filter
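A standard way to approximate a Gaussian cheaply, in line with the cascaded uniform filters of Wells [52] and the real-time practice of [53,54], is to apply a small box filter several times: by the central limit theorem, the cascade converges quickly to a Gaussian. A minimal one-dimensional sketch; the radius and the number of passes are illustrative, not the values used in our implementation:

```python
import numpy as np

def box_blur_1d(signal: np.ndarray, radius: int) -> np.ndarray:
    """Uniform (box) filter via a running sum: O(n) regardless of radius."""
    window = 2 * radius + 1
    padded = np.pad(signal, radius, mode="edge")
    csum = np.cumsum(np.insert(padded, 0, 0.0))
    return (csum[window:] - csum[:-window]) / window

def approx_gaussian_1d(signal: np.ndarray, radius: int, passes: int = 3) -> np.ndarray:
    """Approximate Gaussian smoothing by cascading `passes` box filters.

    Three passes are usually a good speed/quality trade-off; a 2D blur is
    obtained by filtering rows and columns separately (separability).
    """
    for _ in range(passes):
        signal = box_blur_1d(signal, radius)
    return signal
```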
6. Results and Conclusions
6.1. Time Results
6.2. Runtime Comparison
6.3. Accuracy Comparison
Accuracy Metric and Benchmarking Datasets
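The accuracy metric used throughout is the intersection over union (IoU, or Jaccard index) between the predicted and ground-truth masks. For reference, a minimal implementation; treating two empty masks as a perfect match is a common convention, not a detail taken from the benchmarks:

```python
import numpy as np

def intersection_over_union(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two boolean segmentation masks: |A intersect B| / |A union B|."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / float(union) if union > 0 else 1.0
```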
6.4. Synthetic Parametric Tests
6.5. Qualitative Analysis
- The first row shows a barcode that could be considered unproblematic. Although it contains some contrast variation due to paper bending and quite separated bars, its main difficulty, and only for some neural network-based methods, is that its aspect ratio is approximately 1:8. This was not a problem for the PS DRT algorithm, nor for MDD DRT. The fixed-scale algorithms, on the other hand, had problems in properly contouring the code boundaries. This was consistently observed in the rest of the examples; therefore, these algorithms, whose outputs are of interest mainly because they are the base and apex of the data pyramid on which the PS DRT and MDD DRT methods operate, are not discussed further.
- The second row shows an image containing a multitude of codes. Some are out of focus; some are rotated; and several dividing lines and alphabetic characters make it difficult to separate them correctly. In addition, a couple of codes are incomplete at the lower end. The MDD DRT method contoured the codes better than the PS DRT method. It was also more sensitive to the angle of the lines and tended to make them more uniform, although this does not translate into definite advantages in a real-world application. As seen in the previous subsection, the focus from now on is exclusively on the MDD DRT method, because the PS DRT method simply does not fit the time constraints. The segmentation was accurate, and even the small, trimmed code on the border was detected. The main disadvantage of the MDD DRT method is that it joins characters close to a barcode when they are made up of strokes with the same slope, as is the case for characters such as 0, 1, 7, O, I, etc. Fortunately, these “overflows” can be counteracted in the post-processing phase, which is not covered in this article (a minimal sketch of the region-filtering criterion is given after this list). Bars or outlines close and parallel to the ends of the codes are also problematic, since the measures taken to preserve the internal low-texture areas belonging to the codes force their undesired inclusion. This image is the same as the one used in Figure 18 to illustrate how to get rid of false positives with a simple criterion.
- The third and fourth rows illustrate barcodes affected by strong and extreme lens blur, respectively. The proposed methods still managed to adequately contour the input of the third row, but not that of the fourth row. The final scale of the algorithm, with analysis zones of 32 × 32 pixels, allowed bars to be detected even when, at a smaller scale, the blur merged them with other bars. Depending also on the loss of contrast, at some point the detector stops triggering. In an example such as this, the detector shows its performance limits; but regardless of whether it can be further improved as a future line of work, it is already a major advance over previous methods, which simply do not work at all when facing heavily defocused barcodes.
- The next row presents a case of severe motion blur in the direction that affects the code bars the most. Both types of blur, lens blur and motion blur, were treated in the same way, and in cases such as this, the detector behaved as expected.
- The image in the sixth row combines very low contrast with glare and smudges on the codes. As can be seen, the detection is satisfactory, but in this extreme case, the combined effect of a wide bar, low contrast, and glare oriented against the bars broke the code into two disconnected regions. This is another limitation of our method. It does not occur with PS DRT, since it is the voting scheme of MDD DRT, which penalises regions with sharp slope changes nearby (typical of characters), that works against correct detection here. Fortunately, on the rare occasions when this happens, it can be solved in post-processing, since angle labels per region are available.
- The last example is a case of extreme perspective that makes the observed slope of the bars of the same code vary. The MDD DRT method solved it very well: the bars change smoothly enough to be grouped into a single region. The problem of joining nearby characters is once again evident. The lack of precision of the boundaries in the upper-right corner of the code is due to another factor. Unlike the other sample images in this figure, which were taken by directly accessing the camera, i.e., without compression and with edge enhancements disabled, this one is a JPEG image containing compression artefacts. Those artefacts, at a low scale, make the MDD DRT method choose not to trigger. This is not exactly a disadvantage, but it has to be noted that the method works best on raw camera frames rather than compressed images or video.
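As mentioned above, many false positives can be removed in post-processing with the simple criterion of Figure 18: keep only connected regions whose oriented aspect ratio and area are plausible for a barcode. A minimal sketch follows; the thresholds, the use of SciPy's connected-component labelling, and the PCA estimate of the oriented axes are our own illustrative choices:

```python
import numpy as np
from scipy import ndimage

def filter_barcode_regions(mask: np.ndarray, min_area: int = 500,
                           min_aspect: float = 2.0) -> np.ndarray:
    """Keep connected regions of a boolean mask that look like barcodes."""
    labels, n = ndimage.label(mask)
    keep = np.zeros_like(mask, dtype=bool)
    for region in range(1, n + 1):
        ys, xs = np.nonzero(labels == region)
        if ys.size < min_area:
            continue  # too small to be a barcode
        # Oriented major/minor axes via PCA of the pixel coordinates.
        coords = np.stack([ys, xs], axis=1).astype(np.float64)
        coords -= coords.mean(axis=0)
        eigvals = np.linalg.eigvalsh(np.cov(coords.T))
        minor, major = np.sqrt(np.maximum(eigvals, 1e-12))
        if major / minor >= min_aspect:
            keep |= labels == region
    return keep
```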
6.6. Disadvantages
- Certain alphanumeric characters above or below barcodes, and lines and outlines to the left and right, tend to merge with the barcodes, producing false positives.
- When several adverse circumstances concur, a single instance of a barcode can be split into two connected regions.
6.7. Conclusions
6.8. Future Lines of Research
- First, the scheduling of the algorithm in Halide can be improved: the reported measurements used an auto-scheduler, which produces a schedule that manual tuning can improve upon. In addition, the GPU target of the SD 888 can also be used.
- Second, more pruning can be applied to MDD DRT by modifying two constants: the input size and the halting stage. For example, starting from a smaller input size instead of the full 1024 × 1024 and stopping at tile_size = 16 would reduce the execution time to 40% of the original time. How much pruning can be applied without losing too much quality is an assessment left for follow-up research.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
AI: Artificial intelligence
AR: Augmented reality
CNN: Convolutional neural network
CPU: Central processing unit
DRT: Discrete Radon transform
FCN: Fully convolutional network
GPU: Graphics processing unit
SoC: System on a chip
SD: Snapdragon
IoU: Intersection over union
References
- GS 22423-2023; GS1 General Specifications. GS1: London, UK, 2022; pp. 5.1–5.5. Available online: https://www.gs1.org/docs/barcodes/GS1_General_Specifications.pdf (accessed on 5 June 2023).
- Azuma, R.; Baillot, Y.; Behringer, R.; Feiner, S.; Julier, S.; MacIntyre, B. Recent advances in augmented reality. IEEE Comput. Graph. Appl. 2001, 21, 34–47.
- Muniz, R.; Junco, L.; Otero, A. A robust software barcode reader using the Hough transform. In Proceedings of the 1999 International Conference on Information Intelligence and Systems, Bethesda, MD, USA, 31 October–3 November 1999; pp. 313–319.
- Wachenfeld, S.; Terlunen, S.; Jiang, X. Robust 1-D barcode recognition on camera phones and mobile product information display. In Mobile Multimedia Processing: Fundamentals, Methods, and Applications; Springer: Berlin, Germany, 2010; pp. 53–69.
- Gallo, O.; Manduchi, R. Reading 1D Barcodes with Mobile Phones Using Deformable Templates. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1834–1843.
- Lin, D.T.; Lin, M.C.; Huang, K.Y. Real-time automatic recognition of omnidirectional multiple barcodes and DSP implementation. Mach. Vis. Appl. 2011, 22, 409–419.
- Katona, M.; Nyúl, L.G. A Novel Method for Accurate and Efficient Barcode Detection with Morphological Operations. In Proceedings of the 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, Naples, Italy, 25–29 November 2012; pp. 307–314.
- Bodnár, P.; Nyúl, L.G. Improving Barcode Detection with Combination of Simple Detectors. In Proceedings of the 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, Naples, Italy, 25–29 November 2012; pp. 300–306.
- Sörös, G.; Flörkemeier, C. Blur-resistant joint 1D and 2D barcode localization for smartphones. In Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, Luleå, Sweden, 2–5 December 2013; pp. 1–8.
- Creusot, C.; Munawar, A. Real-Time Barcode Detection in the Wild. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2015; pp. 239–245.
- Creusot, C.; Munawar, A. Low-computation egocentric barcode detector for the blind. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2856–2860.
- Namane, A.; Arezki, M. Fast real time 1D barcode detection from webcam images using the bars detection method. In Proceedings of the World Congress on Engineering (WCE), London, UK, 5–7 July 2017; Volume 1.
- Chen, C.; He, B.; Zhang, L.; Yan, P.Q. Autonomous Recognition System for Barcode Detection in Complex Scenes. In Proceedings of the 4th Annual International Conference on Information Technology and Applications (ITA 2017), Guangzhou, China, 26–28 May 2017; p. 04016.
- Fernandez, W.P.; Xian, Y.; Tian, Y. Image-Based Barcode Detection and Recognition to Assist Visually Impaired Persons. In Proceedings of the 2017 IEEE 7th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Honolulu, HI, USA, 31 July–4 August 2017; pp. 1241–1245.
- Xiao, Y.; Ming, Z. 1D Barcode Detection via Integrated Deep-Learning and Geometric Approach. Appl. Sci. 2019, 9, 3268.
- Zamberletti, A.; Gallo, I.; Albertini, S. Robust Angle Invariant 1D Barcode Detection. In Proceedings of the 2nd IAPR Asian Conference on Pattern Recognition, Okinawa, Japan, 5–8 November 2013; pp. 160–164.
- Zharkov, A.; Zagaynov, I. Universal Barcode Detector via Semantic Segmentation. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 837–843.
- Hansen, D.K.; Nasrollahi, K.; Rasmusen, C.B.; Moeslund, T.B. Real-Time Barcode Detection and Classification using Deep Learning. In Proceedings of the 9th International Joint Conference on Computational Intelligence (IJCCI), INSTICC, Madeira, Portugal, 1–3 November 2017; pp. 321–327.
- Zhang, L.; Sui, Y.; Zhu, F.; Zhu, M.; He, B.; Deng, Z. Fast Barcode Detection Method Based on ThinYOLOv4. In Proceedings of the International Conference on Cognitive Systems and Signal Processing, Zhuhai, China, 25–27 December 2020.
- Wudhikarn, R.; Charoenkwan, P.; Malang, K. Deep Learning in Barcode Recognition: A Systematic Literature Review. IEEE Access 2022, 10, 8049–8072.
- Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 1–74.
- Hao, S.; Zhou, Y.; Guo, Y. A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing 2020, 406, 302–321.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder–Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Kamnardsiri, T.; Charoenkwan, P.; Malang, K.C.; Wudhikarn, R. 1D Barcode Detection: Novel Benchmark Datasets and Comprehensive Comparison of Deep Convolutional Neural Network Approaches. Sensors 2022, 22, 8788.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS ’15), Cambridge, MA, USA, 7–12 December 2015; pp. 91–99.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 14–19 June 2020; pp. 10778–10787.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; TaoXie; Fang, J.; imyhxy; Michael, K.; et al. ultralytics/yolov5: v6.1—TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference. Available online: https://zenodo.org/record/6222936 (accessed on 25 June 2023).
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
- Zhao, Q.; Ni, F.; Song, Y.; Wang, Y.; Tang, Z. Deep dual pyramid network for barcode segmentation using Barcode-30k database. arXiv 2018, arXiv:1807.11886.
- Zharkov, A.; Vavilin, A.; Zagaynov, I. New Benchmarks for Barcode Detection Using Both Synthetic and Real Data. In Proceedings of the International Workshop on Document Analysis Systems, Wuhan, China, 26–29 July 2020.
- Kaggle. Kaggle Competitions. Available online: https://www.kaggle.com/docs/competitions (accessed on 26 April 2023).
- Oliva-García, R.; Gómez-Cárdenes, Ó.; Marichal-Hernández, J.G.; Martín-Hernández, J.; Phillip-Lüke, J.; Rodríguez-Ramos, J.M. A local real-time bar detector based on the multiscale Radon transform. In Proceedings of Real-Time Image Processing and Deep Learning, Orlando, FL, USA, 6–12 June 2022; Kehtarnavaz, N., Carlsohn, M.F., Eds.; SPIE: Bellingham, WA, USA, 2022; Volume 12102, p. 121020A.
- Lindeberg, T. Scale-Space Theory in Computer Vision; Springer Science & Business Media: Berlin, Germany, 2013; Volume 256.
- Radon, J. Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Akad. Wiss. 1917, 69, 262–277.
- Götz, W.; Druckmüller, H. A fast digital Radon transform—An efficient means for evaluating the Hough transform. Pattern Recognit. 1996, 29, 711–718.
- Brady, M.L. A fast discrete approximation algorithm for the Radon transform. SIAM J. Comput. 1998, 27, 107–119.
- Brandt, A.; Dym, J. Fast calculation of multiple line integrals. SIAM J. Sci. Comput. 1999, 20, 1417–1429.
- Gómez-Cárdenes, O.; Oliva-García, R.; Rodríguez-Abreu, G.A.; Marichal-Hernández, J.G. Exposing Parallelism of Discrete Radon Transform. In Proceedings of the 3rd International Conference on Telecommunications and Communication Engineering (ICTCE ’19), New York, NY, USA, 19 February 2019; pp. 136–140.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2014, arXiv:1411.4038.
- Van Rossum, G. Python Tutorial; Technical Report CS-R9526; CWI: Amsterdam, The Netherlands, 1995. Available online: https://ir.cwi.nl/pub/5007/05007D.pdf (accessed on 25 June 2023).
- Lam, S.K.; Pitrou, A.; Seibert, S. Numba: A LLVM-Based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (LLVM ’15), New York, NY, USA, 15 November 2015.
- ISO. ISO/IEC 14882:2017 Information Technology—Programming Languages—C++, 5th ed.; ISO: Geneva, Switzerland, 2017; p. 1605.
- Ragan-Kelley, J.; Barnes, C.; Adams, A.; Paris, S.; Durand, F.; Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not. 2013, 48, 519–530.
- Adams, A.; Ma, K.; Anderson, L.; Baghdadi, R.; Li, T.M.; Gharbi, M.; Steiner, B.; Johnson, S.; Fatahalian, K.; Durand, F.; et al. Learning to optimize Halide with tree search and random programs. ACM Trans. Graph. 2019, 38, 1–12.
- Mullapudi, R.T.; Adams, A.; Sharlet, D.; Ragan-Kelley, J.; Fatahalian, K. Automatically scheduling Halide image processing pipelines. ACM Trans. Graph. 2016, 35, 1–11.
- Li, T.M.; Gharbi, M.; Adams, A.; Durand, F.; Ragan-Kelley, J. Differentiable programming for image processing and deep learning in Halide. ACM Trans. Graph. 2018, 37, 1–13.
- Source Code. Available online: https://github.com/DoMondo/an_encoder_decoder_architecture (accessed on 28 April 2023).
- Wells, W.M. Efficient Synthesis of Gaussian Filters by Cascaded Uniform Filters. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 234–239.
- Kawase, M. Frame buffer postprocessing effects in Double-S.T.E.A.L (Wreckless). In Proceedings of the Game Developers Conference 2003, San Jose, CA, USA, 4–8 March 2003.
- Martin, S.; Garrard, A.; Gruber, A.; Bjørge, M.; Zioma, R.; Benge, S.; Nummelin, N. Moving mobile graphics. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference (SIGGRAPH ’15), Los Angeles, CA, USA, 9–13 August 2015.
- Nayar, S.; Nakagawa, Y. Shape from focus. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 824–831.
- Gómez-Cárdenes, Ó.; Marichal-Hernández, J.G.; Trujillo-Sevilla, J.M.; Carmona-Ballester, D.; Rodríguez-Ramos, J.M. Focus measurement in 3D focal stack using direct and inverse discrete Radon transform. In Proceedings of Three-Dimensional Imaging, Visualization, and Display, Anaheim, CA, USA, 10–11 April 2017; Javidi, B., Son, J.Y., Matoba, O., Eds.; SPIE: Bellingham, WA, USA, 2017; Volume 10219, p. 102190R.
| tile_size | Partial DRT | Bar Detector |
|---|---|---|
| 1 | - | - |
| 2 |  |  |
| 4 |  |  |
| 8 |  |  |
| 16 |  |  |
| 32 |  |  |
| Block | Operation | Input Size | Output Size |
|---|---|---|---|
| Encoder | mdd_drt_v |  |  |
| Encoder | mdd_drt_h |  |  |
| Encoder | mdd_bar_detector_0 |  |  |
| Encoder | mdd_bar_detector_1 |  |  |
| Encoder | mdd_bar_detector_2 |  |  |
| Encoder | mdd_bar_detector_3 |  |  |
| Encoder | mdd_bar_detector_4 |  |  |
| Decoder | unpool_3 |  |  |
| Decoder | convolutions_3 |  |  |
| Decoder | unpool_2 |  |  |
| Decoder | convolutions_2 |  |  |
| Decoder | unpool_1 |  |  |
| Decoder | convolutions_1 |  |  |
| Decoder | unpool_0 |  |  |
| Decoder | convolutions_0 |  |  |
| Decoder | argmatxth |  |  |
| Method | Resolution | Time (ms) |
|---|---|---|
| YOLO v5 [28] | 1024 × 1024 | 70 |
|  | 512 × 512 | 61 |
|  | 640 × 480 | 45 |
| Zamberletti et al. (2013) [16] | 640 × 480 | 130 |
| Creusot and Munawar (2015) [10] | 640 × 480 | 115 |
| Creusot and Munawar (2016) [11] | 640 × 480 | 42 |
|  | 1080 × 960 | 116 |
| Zharkov and Zagaynov (2019) [17] | 512 × 512 | 44 |
| MDD DRT | 1024 × 1024 | 21 |
| PS DRT | 1024 × 1024 | 107 |
| Method | Arte-Lab Rotated (Detection) | Arte-Lab Rotated (IoU) | WWU Münster (Detection) | WWU Münster (IoU) |
|---|---|---|---|---|
| EfficientDet [26] | 1.000 | 0.855 | 0.999 | 0.782 |
| Faster R-CNN [25] | 1.000 | 0.859 | 1.000 | 0.792 |
| RetinaNet [27] | 1.000 | 0.876 | 1.000 | 0.809 |
| YOLO v5 [28] | 0.996 | 0.935 | 0.998 | 0.896 |
| YOLOX [29] | 0.970 | 0.848 | 1.000 | 0.813 |
| Zamberletti et al. (2013) [16] | 0.805 | - | 0.829 | - |
| Creusot and Munawar (2015) [10] | 0.893 | - | 0.963 | - |
| Creusot and Munawar (2016) [11] | 0.989 | - | 0.982 | - |
| Hansen et al. (2017) [18] | 0.926 | - | 0.991 | - |
| Zharkov and Zagaynov (2019) [17] | 0.989 | - | 0.980 | - |
| PS DRT | 0.886 | 0.700 | 0.944 | 0.732 |
| MDD DRT | 0.901 | 0.783 | 0.958 | 0.827 |