Jaques Blanc-Talon, Patrice Delmas, Wilfried Philips, Paul Scheunders (Eds.)

Advanced Concepts for Intelligent Vision Systems

21st International Conference, ACIVS 2023
Kumamoto, Japan, August 21–23, 2023
Proceedings

Lecture Notes in Computer Science (LNCS) 14124
Founding Editors
Gerhard Goos
Juris Hartmanis
Editors

Jaques Blanc-Talon
DGA TA
Toulouse, France

Patrice Delmas
University of Auckland
Auckland, New Zealand
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2023
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
These proceedings gather the papers of the Advanced Concepts for Intelligent Vision Systems (ACIVS) conference, which was held in Kumamoto, Japan, from August 21 to August 23, 2023.
This event was the 21st ACIVS. Since the very first event, held in Germany in 1999, ACIVS has grown into a larger, independent scientific conference. However, the seminal distinctive governance rules have been maintained:
– To update the conference scope on a yearly basis. While keeping a technical backbone (the classical low-level image processing techniques), we have introduced topics of interest such as - chronologically - image and video compression, 3D, security and forensics, and evaluation methodologies, in order to fit the conference scope to our scientific community’s needs. In addition, speakers usually give invited talks on hot issues;
– To remain a single-track conference in order to promote scientific exchanges within
the audience;
– To grant oral presentations a duration of 25 minutes and published papers a length of
12 pages, which is significantly different from most other conferences.
The second and third items imply a complex management of the conference; in
particular, the number of time slots is smaller than in larger conferences. Although the
selection between the two presentation formats is primarily determined by the need to
compose a well-balanced program, papers presented during plenary and poster sessions
enjoy the same importance and publication format.
The first item is strengthened by the fame of ACIVS, which has been growing over the years: official Springer records show a cumulative number of about 1,200,000 downloads as of January 1, 2019 (for ACIVS 2005–2018 only). Due to the COVID-19 pandemic, the conference activity was suspended in 2021 and 2022.
This year’s event also included invited talks of leading scientists from IROAST
(International Research Organization for Advanced Science and Technology), based
in Kumamoto, Japan. We would like to sincerely thank all of them for enhancing the
technical program with their presentations.
ACIVS attracted submissions from many different countries, mostly from Asia, but
also from the rest of the world: Belgium, Canada, China, France, Germany, Greece, India,
Israel, Japan, Mexico, New Zealand, Poland, Singapore, Slovenia, Spain, South Africa,
Sweden, Taiwan, Tunisia, Vietnam and the USA. From 49 submissions, 29 were selected
for oral presentation and 8 as posters. The paper submission and review procedure was
carried out electronically and a minimum of three reviewers were assigned to each paper.
A large and energetic Program Committee (61 people), helped by additional referees,
completed the long and demanding reviewing process. We would like to thank all of
them for their timely and high-quality reviews, achieved in quite a short time and during
the summer holidays.
Also, we would like to thank our sponsors (in alphabetical order): IROAST and the city of Kumamoto in Japan, and Springer, for their valuable support.
Finally, we would like to thank all the participants who trusted in our ability to organize this conference for the 21st time. We hope they experienced a distinctive and stimulating scientific event and that they enjoyed the atmosphere of the ACIVS social events in the city of Kumamoto.
A conference like ACIVS would not be feasible without the concerted effort of many
people and the support of various institutions. We are indebted to the local organizers
for having smoothed all the harsh practical details of an event venue, and we hope to
return in the near future.
ACIVS 2023 was organized by the University of Auckland, New Zealand.
Steering Committee
Organizing Committee
Program Committee
Reviewers
1 Introduction
[29]. Despite being extensively researched and even considering the most recent
advances using deep learning-based strategies [25], which are computationally
expensive, CV labeling is still considered an open problem with no optimal solu-
tion [16], and researchers have always been looking for alternatives to tackle the
problem regarding accuracy and speed.
With the advent of quantum computation, which promises potentially lower time complexity on certain problems than the best classical counterparts [32], recent studies have focused on leveraging quantum properties to overcome classically intractable problems using Quantum Annealing [20]. D-Wave Systems was the first company to build a Quantum Processing Unit (QPU) that naturally approximates the ground state of a particular problem representation, namely the Ising model [22]. In 2016, a Google study [9] compared the D-Wave 2X QPU with two classical algorithms (Simulated Annealing and Quantum Monte Carlo) run on a single-core classical processor, showing the D-Wave QPU to be up to 100 million times faster. A later study used state-of-the-art GPU implementations of these algorithms (on an NVIDIA GeForce GTX 1080) for a more complex problem and showed that the D-Wave 2000Q QPU was 2600 times faster in finding the ground state [17]. However, the scarcity of available quantum bits (qubits) on a D-Wave QPU has always been a challenge, from the 128-qubit D-Wave One built in 2011 to the newly released 5000-qubit D-Wave Advantage. Therefore, large CV problems involving highly non-convex functions over search spaces of many thousands of dimensions have yet to be widely studied on D-Wave quantum computers. To the best of our knowledge, the first CV minimization problem implemented on a D-Wave QPU was a specific Stereo Matching problem on simplistic synthetic images [8]. An efficiency improvement in terms of the number of variables in the quantum model, applied to natural gray-scale images, was proposed in [14]. Both quantum models solve the minimum cut problem on a specific graph with two terminal vertices, which can be solved efficiently in polynomial time on classical processors.
sors. To show the real advantage of using D-Wave for CV problems, we need to
model a computationally intractable problem. Motivated by a CV application
(Image Restoration), we recently proposed an efficient quantum model for the
minimum multi-way cut problem [13], which is an NP-hard problem.
Here, we show how a specific Stereo Matching problem (as a significant CV
labeling problem) can be solved based on Quantum Annealing. Due to the lim-
itations of the current D-Wave quantum processors, it is impossible to directly
solve this problem on real-world full-sized images. Therefore, we present a hybrid
quantum-classical segment-based Stereo Matching method. We use the disparity
plane concept to represent the disparities of the input pixels. In the classical
part, we first partition the input image into small segments and estimate the
best disparity planes. Next, we label each segment by a disparity plane using
an optimization approach, which is an NP-Hard problem. Quantum Annealing
carries out the minimization.
The rest of the paper is organized as follows. Section 2 explains Quantum
Annealing and how an optimization problem can be solved using a D-Wave QPU.
A Hybrid Quantum-Classical Segment-Based Stereo Matching Algorithm 3
2 Quantum Annealing
The diagonal terms Q_{i,i} are the linear coefficients, acting as the external forces, and the off-diagonal terms Q_{i,j} are the quadratic coefficients for the internal forces [22].
4 S. Heidari and P. Delmas
3 Stereo Matching
One of CV’s oldest yet unsolved problems is Stereo Matching, the most computationally intensive part of 3D reconstruction from digital images. In analogy to human depth perception using two eyes, a stereo vision system typically has two cameras placed horizontally, one on the left and the other on the right, forming a binocular setup. Each camera captures the same scene with some displacement. This displacement, called the disparity, is the difference between the coordinates of the projections of a 3D world point in the left and right images. A rectification process ensures that corresponding pixels in the left and right images lie on the same row of pixels, so that only horizontal disparities remain. The disparity is inversely proportional to the distance between the cameras and the object in the real world: if a 3D point is closer to (respectively, further from) the cameras, the disparity of its projections in the images is larger (respectively, smaller). When the disparities of all pixels are visualized (known as a disparity map), closer objects with larger disparity values appear lighter than further-away objects with lower disparity values.
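The inverse depth–disparity relation described above is d = f·B/Z for a rectified pair with focal length f (in pixels) and baseline B. A minimal sketch with illustrative values (not taken from the paper):

```python
def disparity_px(focal_px, baseline_m, depth_m):
    """For a rectified stereo pair, disparity d = f * B / Z (in pixels):
    inversely proportional to depth, as described above."""
    return focal_px * baseline_m / depth_m

# Halving the depth doubles the disparity.
d_far = disparity_px(700.0, 0.1, 7.0)   # 10.0 px
d_near = disparity_px(700.0, 0.1, 3.5)  # 20.0 px
```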
1. Color image segmentation: The left image is first partitioned into homo-
geneous color segments.
2. Initial disparity estimation: A local Stereo Matching algorithm is used to
initially estimate the disparities for both left and right stereo images.
3. Disparity plane fitting: An iterative plane fitting algorithm fits disparity
planes into color segments based on step 2’s estimated initial disparity values.
4. Segment and disparity plane refinement: Color segments are combined
according to a similarity measurement from the disparity map obtained from
step 3. Next, the disparity planes are updated by a plane fitting algorithm
on the new combined color segments. The final outputs are a set of color
segments and a set of disparity planes.
5. Optimization by Quantum Annealing and D-Wave QPU: An objec-
tive function labels each segment with a disparity plane. Then, we model an
equivalent QUBO to this objective function that a D-Wave QPU can mini-
mize.
The first four steps are performed on a classical computer, while a D-Wave QPU
performs the last step.
where Cost(x, y, d) is the cost of allocating the disparity value d to the pixel (x, y). We define the matching cost function as a blend of two similarity functions found in the literature [5,21]: the truncated absolute difference on the color intensities (TAD_c) and the truncated absolute difference on the image gradient values (TAD_g). Let I_l and I_r be the left and right images, respectively.

\[
\mathrm{Cost}(x, y, d) = \frac{1}{|R_{xy}|} \sum_{(x', y') \in R_{xy}} \Big[ (1 - \alpha)\,\mathrm{TAD}_c(x', y', d) + \alpha\,\mathrm{TAD}_g(x', y', d) \Big], \tag{3}
\]
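A minimal single-channel sketch of the matching cost of Eq. (3). The window size, truncation thresholds, and gradient operator here are our own illustrative choices, not values specified by the paper:

```python
import numpy as np

def tad(a, b, trunc):
    """Truncated absolute difference: min(|a - b|, trunc)."""
    return np.minimum(np.abs(a - b), trunc)

def matching_cost(Il, Ir, x, y, d, alpha, win, tc, tg):
    """Eq. (3): mean over the window R_xy of the blended truncated
    absolute differences of intensity (TAD_c) and horizontal
    gradient (TAD_g). Grayscale sketch for rectified images."""
    gx_l = np.gradient(Il, axis=1)
    gx_r = np.gradient(Ir, axis=1)
    costs = []
    for yy in range(max(0, y - win), min(Il.shape[0], y + win + 1)):
        for xx in range(max(0, x - win), min(Il.shape[1], x + win + 1)):
            if xx - d < 0:
                continue  # matching pixel falls outside the right image
            c = tad(Il[yy, xx], Ir[yy, xx - d], tc)
            g = tad(gx_l[yy, xx], gx_r[yy, xx - d], tg)
            costs.append((1 - alpha) * c + alpha * g)
    return float(np.mean(costs))
```

For identical images and d = 0, both truncated differences vanish, so the cost is zero.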
Disparity Plane Fitting: The next step is to fit a plane to each color segment, based on the corresponding initial disparity values, so that the disparities of all pixels in a segment can be represented by a single plane. Take the segmented left stereo image of the Bull dataset (from the 2001 Middlebury stereo datasets [23]) as an example. To illustrate disparity plane fitting, we select one of the color segments (see Fig. 1a) and fit a plane to its pixels based on the corresponding initial disparity values computed in the previous step. Figure 1b shows
a top view of the selected pixels in a 3D plot where the third axis gives the
corresponding initial disparity values. The red points, called outliers, are the
inaccurately estimated disparities caused by texture-less and occluded regions
in the left stereo image, and the blue points, called inliers, are the points with
accurate disparities to which we want to fit a plane. Once a plane is fitted to the
blue points (Fig. 1e), the plane parameters could be used to obtain the disparity
of each pixel inside the segment. For the disparity plane fitting step, we use
an iterative algorithm proposed in [26] and widely used in later studies [21,31].
Let S = {S_1, S_2, ..., S_{n_s}} be the set of color segments computed in the color image segmentation step, where n_s is the number of segments. The disparities of each segment can be modeled by the function D(x, y) = a_i x + b_i y + c_i, where (a_i, b_i, c_i) are the fitted plane parameters, (x, y) ∈ S_i for 1 ≤ i ≤ n_s, and D(x, y) is the computed disparity for the pixel (x, y) inside segment S_i. This step aims to capture a set of disparity planes (each defined by three plane parameters). This set of disparity planes will then be used to label each color segment in the left image with a disparity plane based on a cost function.
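The inner least-squares step of the plane fitting described above can be sketched as follows; the iterative scheme of [26] would rerun this fit on the inliers only after rejecting outliers.

```python
import numpy as np

def fit_disparity_plane(xs, ys, ds):
    """Least-squares fit of D(x, y) = a*x + b*y + c to a segment's
    initial disparities. Single-shot inner step of the iterative,
    outlier-rejecting fitting described above (our own sketch)."""
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    (a, b, c), *_ = np.linalg.lstsq(A, ds, rcond=None)
    return a, b, c

# Points lying exactly on the plane d = 2x + 3y + 1 are recovered exactly.
xs = np.array([0.0, 1.0, 0.0, 1.0])
ys = np.array([0.0, 0.0, 1.0, 1.0])
ds = 2 * xs + 3 * ys + 1
a, b, c = fit_disparity_plane(xs, ys, ds)
```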
Fig. 1. A disparity-plane-fitting example. (a) The selected segment of the left stereo
image of the Bull dataset shown in a 2D plot, (b) the selected pixels in a 3D plot
(top view) where the third axis shows the corresponding initial disparity values, (c)
the selected pixels in the 3D plot (corner view), (d) the selected pixels in the 3D plot
(front view), (e) and the fitted plane to the selected pixels.
left image pixels. However, the main purpose is not to have the best disparity
planes for the segments but to extract all possible disparity planes to represent
the scene structure accurately. An additional refinement step combines the color
segments and fits new disparity planes to the updated segments. Let G(V, E, C)
be an undirected weighted graph where V = {1, 2, . . . , ns } is the set of vertices
representing the segment numbers, E is the set of the edges connecting the
corresponding adjacent segments, and C is a function that allocates weights to
the edges. A weight between two adjacent vertices (segments) represents how
similar or dissimilar the two segments are. To weight the edges, we first define the mean value of each segment with respect to its computed disparity values from the estimated planes (i.e., we use disparity values to compute the mean value of each segment). Let Γ be a function that computes this mean; we weight a given edge (u, v) ∈ E as C(u, v) = |Γ(u) − Γ(v)|. Next, we combine the segments corresponding to u and v if C(u, v) is less than a threshold. Once
the segments are combined, we rerun the disparity plane fitting algorithm on the
combined segments to estimate new disparity planes.
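The merging rule above (combine adjacent segments whose weight C(u, v) = |Γ(u) − Γ(v)| falls below a threshold) can be sketched with a small union-find; `means` stands in for Γ, and the values are illustrative:

```python
def merge_segments(means, edges, threshold):
    """Union-find merging: combine adjacent segments u, v whenever
    C(u, v) = |means[u] - means[v]| < threshold, as in the segment
    refinement step described above (our own sketch)."""
    parent = list(range(len(means)))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u
    for u, v in edges:
        if abs(means[u] - means[v]) < threshold:
            parent[find(u)] = find(v)
    return [find(u) for u in range(len(means))]

# Segments 0 and 1 are similar enough to merge; segment 2 stays apart.
labels = merge_segments([1.0, 1.2, 5.0], [(0, 1), (1, 2)], 0.5)
```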
x = {x_{1,1}, x_{1,2}, ..., x_{1,n_p}, x_{2,1}, x_{2,2}, ..., x_{2,n_p}, ..., x_{n_s,1}, x_{n_s,2}, ..., x_{n_s,n_p}}

Let our QUBO model be defined as (9), where β > Σ_{u∈V} max{C^u_seg(l) | l ∈ L} + λ|E|:

\[
H_{qubo}(x) = \beta \sum_{u \in V} \Big(1 - \sum_{l \in L} x_{u,l}\Big)^2 + \sum_{u \in V} \sum_{l \in L} C^u_{seg}(l)\, x_{u,l} + \lambda \sum_{(u,v) \in E} \sum_{l_1 \in L} \sum_{l_2 \in L} \xi(l_1, l_2)\, x_{u,l_1} x_{v,l_2}. \tag{9}
\]
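The coefficients of Eq. (9) can be assembled into a QUBO dictionary as sketched below (our own code; a real run would hand such a dictionary to a D-Wave sampler). The constant β·n_s from expanding the one-hot penalty is dropped, which does not change the minimizer.

```python
import itertools

def build_qubo(n_labels, cseg, edges, xi, beta, lam):
    """Assemble the QUBO coefficients of Eq. (9) as {(i, j): coeff}
    over flattened variables x_{u,l}. cseg[u][l] is the unary labeling
    cost C^u_seg(l); xi(l1, l2) is the pairwise penalty. Names follow
    the paper, but this assembly is our illustrative sketch."""
    n_seg = len(cseg)
    idx = lambda u, l: u * n_labels + l
    Q = {}
    def add(i, j, c):
        key = (min(i, j), max(i, j))
        Q[key] = Q.get(key, 0.0) + c
    for u in range(n_seg):
        for l in range(n_labels):
            # one-hot penalty beta*(1 - sum_l x)^2: since x^2 = x,
            # the linear part contributes -beta per variable
            add(idx(u, l), idx(u, l), -beta)
            add(idx(u, l), idx(u, l), cseg[u][l])  # unary cost
        for l1, l2 in itertools.combinations(range(n_labels), 2):
            add(idx(u, l1), idx(u, l2), 2 * beta)  # one-hot cross terms
    for (u, v) in edges:
        for l1 in range(n_labels):
            for l2 in range(n_labels):
                add(idx(u, l1), idx(v, l2), lam * xi(l1, l2))
    return Q

Q = build_qubo(2, [[1.0, 2.0], [3.0, 4.0]], [(0, 1)],
               lambda a, b: 1.0 if a != b else 0.0, 10.0, 2.0)
# Energy of the one-hot assignment u0 -> label 0, u1 -> label 1:
x = [1, 0, 0, 1]
e = sum(c * x[i] * x[j] for (i, j), c in Q.items())
```

For this assignment the energy equals the unary costs plus the pairwise penalty minus the dropped constant 2β.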
We set x* = arg min_x H_qubo(x) and define a vector of n_s natural values as w* = (w*_u)_{u∈V}, where w*_u = l if x*_{u,l} = 1. Then, w* = arg min_w F(w).
Once we obtain the optimal solution of minimizing F, we can compute the final disparity map as follows. Given the set of color segments S, the set of disparity planes ρ, and w* as the allocated vector of labels, we have disp(x, y) = ax + by + c, where ρ_{w*_u} = (a, b, c), for (x, y) ∈ S_u and u ∈ V.
Figure 2 shows the result of each step in the proposed hybrid quantum-classical
segment-based Stereo Matching method for the left image of the Bull dataset.
Fig. 2. From left to right: Color image segmentation, initial disparity estimation, dis-
parity plane estimation, segment and disparity plane refinement, and D-Wave mini-
mization result.
\[
\mathrm{rmse} = \sqrt{\frac{1}{N} \sum_{(x,y) \in P} \big(\mathrm{disp}(x, y) - \mathrm{truth}(x, y)\big)^2}, \tag{10}
\]

\[
\text{bad-}B = \bigg(\frac{1}{N} \sum_{(x,y) \in P} \big[\,|\mathrm{disp}(x, y) - \mathrm{truth}(x, y)| > B\,\big]\bigg) \times 100, \tag{11}
\]
Fig. 3. The experimental results for the 2001-Middlebury stereo datasets: (a) Bull, (b)
Venus, (c) Sawtooth, (d) Barn. For each dataset from left to right: the segmented left
image, the computed disparity map, the corresponding ground truth, and the disparity
variation. Given (x, y) ∈ P, a pixel is shown in yellow if ϕ_{xy} > 2.0, and in red if ϕ_{xy} > 4.0.
where B is the disparity error tolerance and N is the number of pixels. In our evaluation, B has been set to three values, 0.5, 1.0, and 2.0, namely bad-0.5, bad-1.0, and bad-2.0, respectively (Table 1).
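Eqs. (10) and (11) translate directly into code; a sketch, assuming `disp` and `truth` are arrays over the evaluated pixels P:

```python
import numpy as np

def rmse(disp, truth):
    """Root-mean-square disparity error of Eq. (10)."""
    return float(np.sqrt(np.mean((disp - truth) ** 2)))

def bad_b(disp, truth, b):
    """Eq. (11): percentage of pixels whose disparity error exceeds B."""
    return float(np.mean(np.abs(disp - truth) > b) * 100.0)

disp = np.array([1.0, 2.0, 3.0, 5.0])
truth = np.array([1.0, 2.0, 3.0, 1.0])
err_rmse = rmse(disp, truth)      # one pixel off by 4 -> rmse = 2.0
err_bad1 = bad_b(disp, truth, 1.0)  # 1 of 4 pixels exceeds B -> 25.0
```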
Table 1. rmse and bad-B results for the computed disparity maps.
6 Conclusion
Our study presents a novel approach to Stereo Matching as a hybrid quantum-
classical segment-based pipeline that leverages Quantum Annealing for mini-
mization. The classical components of our method initially divide the left image
into small, homogeneous color segments, followed by estimating initial disparities
using a local Stereo Matching method and disparity plane fitting. The quantum
component then assigns labels to each segment based on the estimated dispar-
ity planes and an objective function that can be minimized using a quantum
model and Quantum Annealing. It is worth noting that such a labeling problem
is classically intractable. Due to the limitations of current D-Wave quantum pro-
cessors, we employed a D-Wave hybrid solver for the minimization part. Despite
accurate results on a Middlebury dataset, our method is sensitive to the initial segmentation, since the initial disparity estimation relies on the segment boundaries. Any inaccuracies in the initial segmentation can lead to inaccurate disparity planes for the minimization part, ultimately impacting the overall accuracy of our approach. In future work, one could address this issue by ignoring the initial disparity values near the boundaries of each segment, so as to fit more accurate disparity planes to the segment.
References
1. Discrete quadratic models (2023). https://docs.ocean.dwavesys.com/en/stable/
concepts/dqm.html
2. Aharonov, D., Van Dam, W., Kempe, J., Landau, Z., Lloyd, S., Regev, O.: Adia-
batic quantum computation is equivalent to standard quantum computation. SIAM
Rev. 50(4), 755–787 (2008)
3. Besag, J.: On the statistical analysis of dirty pictures. J. Roy. Stat. Soc.: Ser. B
(Methodol.) 48(3), 259–279 (1986)
4. Bleyer, M., Breiteneder, C.: Stereo matching state-of-the-art and research chal-
lenges. In: Farinella, G., Battiato, S., Cipolla, R. (eds.) Advanced Topics in Com-
puter Vision. Advances in Computer Vision and Pattern Recognition, pp. 143–179.
Springer, London (2013). https://doi.org/10.1007/978-1-4471-5520-1_6
5. Bleyer, M., Rhemann, C., Rother, C.: PatchMatch stereo-stereo matching with
slanted support windows. In: BMVC, vol. 11, pp. 1–11 (2011)
6. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via
graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001)
7. Černỳ, V.: Thermodynamical approach to the traveling salesman problem: an effi-
cient simulation algorithm. J. Optim. Theory Appl. 45(1), 41–51 (1985)
8. Cruz-Santos, W., Venegas-Andraca, S.E., Lanzagorta, M.: A QUBO formulation
of the stereo matching problem for D-Wave quantum annealers. Entropy 20(10),
786 (2018)
9. Denchev, V.S., et al.: What is the computational value of finite-range tunneling?
Phys. Rev. X 6(3), 031015 (2016)
10. Farhi, E., Goldstone, J., Gutmann, S., Sipser, M.: Quantum computation by adi-
abatic evolution. arXiv preprint quant-ph/0001106 (2000)
11. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision.
Int. J. Comput. Vision 70(1), 41–54 (2006)
12. Geiger, D., Girosi, F.: Parallel and deterministic algorithms from MRFs: surface
reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 13(05), 401–412 (1991)
13. Heidari, S., Dinneen, M.J., Delmas, P.: An equivalent QUBO model to the mini-
mum multi-way cut problem. Technical report, Department of Computer Science,
The University of Auckland, New Zealand (2022)
14. Heidari, S., Rogers, M., Delmas, P.: An improved quantum solution for the stereo
matching problem. In: 2021 36th International Conference on Image and Vision
Computing New Zealand (IVCNZ), pp. 1–6. IEEE (2021)
15. Hosni, A., Rhemann, C., Bleyer, M., Rother, C., Gelautz, M.: Fast cost-volume
filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach.
Intell. 35(2), 504–511 (2012)
16. Kappes, J.H., et al.: A comparative study of modern inference techniques for struc-
tured discrete energy minimization problems. Int. J. Comput. Vision 115(2), 155–
184 (2015)
17. King, J., et al.: Quantum annealing amid local ruggedness and global frustration.
J. Phys. Soc. Jpn. 88(6), 061007 (2019)
18. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Tech-
niques. MIT Press, Cambridge (2009)
19. Li, S.Z.: Markov Random Field Modeling in Computer Vision. Springer, Heidelberg
(1995). https://doi.org/10.1007/978-4-431-66933-3
20. Lucas, A.: Ising formulations of many NP problems. Front. Phys. 2, 5 (2014)
21. Ma, N., Men, Y., Men, C., Li, X.: Accurate dense stereo matching based on image
segmentation using an adaptive multi-cost approach. Symmetry 8(12), 159 (2016)
22. McGeoch, C.C.: Adiabatic quantum computation and quantum annealing: theory
and practice. Synthesis Lect. Quantum Comput. 5(2), 1–93 (2014)
23. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. Int. J. Comput. Vision 47(1), 7–42 (2002)
24. Szeliski, R., et al.: A comparative study of energy minimization methods for Markov
random fields with smoothness-based priors. IEEE Trans. Pattern Anal. Mach.
Intell. 30(6), 1068–1080 (2008)
25. Tankovich, V., Hane, C., Zhang, Y., Kowdle, A., Fanello, S., Bouaziz, S.: HIT-
Net: hierarchical iterative tile refinement network for real-time stereo matching.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 14362–14372 (2021)
26. Tao, H., Sawhney, H.S., Kumar, R.: A global matching framework for stereo compu-
tation. In: Proceedings Eighth IEEE International Conference on Computer Vision,
ICCV 2001, vol. 1, pp. 532–539. IEEE (2001)
27. Vedaldi, A., Soatto, S.: Quick shift and kernel methods for mode seeking. In:
Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5305, pp.
705–718. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88693-
8_52
28. Veksler, O.: Efficient graph-based energy minimization methods in computer vision.
Cornell University (1999)
29. Wainwright, M., Jaakkola, T., Willsky, A.: Tree consistency and bounds on the
performance of the max-product algorithm and its generalizations. Stat. Comput.
14(2), 143–166 (2004)
30. Wang, C., Komodakis, N., Paragios, N.: Markov random field modeling, inference
& learning in computer vision & image understanding: a survey. Comput. Vis.
Image Underst. 117(11), 1610–1627 (2013)
31. Xiao, J., Yang, L., Zhou, J., Li, H., Li, B., Ding, L.: An improved energy segmen-
tation based stereo matching algorithm. ISPRS Ann. Photogram. Remote Sens.
Spat. Inf. Sci. 1, 93–100 (2022)
32. Yaacoby, R., Schaar, N., Kellerhals, L., Raz, O., Hermelin, D., Pugatch, R.:
A comparison between D-Wave and a classical approximation algorithm and a
heuristic for computing the ground state of an Ising spin glass. arXiv preprint
arXiv:2105.00537 (2021)
33. Zhang, K., Lu, J., Lafruit, G.: Cross-based local stereo matching using orthogonal
integral images. IEEE Trans. Circuits Syst. Video Technol. 19(7), 1073–1079 (2009)
Adaptive Enhancement of Extreme Low-Light Images
1 Introduction
Images captured in low light are characterized by low photon counts, which
results in a low signal-to-noise ratio (SNR). Setting the exposure level while
capturing an image can be done by the user in manual mode, or automatically
by the camera in auto exposure (AE) mode. In manual mode, the user can
adjust the ISO, f-number, and exposure time. In auto exposure (AE) mode, the
camera measures the incoming light based on through-the-lens (TTL) metering
and adjusts the exposure values (EVs), which refer to configurations of the above parameters.
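For reference, the exposure value mentioned above follows the standard photographic formula EV = log2(N²/t), optionally shifted by the ISO setting. This is general photography practice rather than a formula from the paper:

```python
import math

def exposure_value(f_number, shutter_s, iso=100):
    """Standard exposure value EV = log2(N^2 / t), with the common
    ISO adjustment EV_iso = EV + log2(ISO / 100). Illustrates the
    exposure parameters (f-number, exposure time, ISO) discussed
    above; values below are illustrative."""
    return math.log2(f_number ** 2 / shutter_s) + math.log2(iso / 100)

ev_base = exposure_value(1.0, 1.0)        # f/1, 1 s, ISO 100 -> EV 0
ev_stopped = exposure_value(2.0, 1.0)     # two stops less light -> EV 2
ev_gain = exposure_value(1.0, 1.0, 200)   # doubling ISO adds one EV
```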
We consider the problem of enhancing a dark image captured in an extremely
low-light environment, based on a single image [7]. In a dark environment, adjust-
ing the parameters to increase the SNR has its own limitations. For example, high
ISO increases the noise as well, and lengthening the exposure time might intro-
duce blur. Various approaches have been proposed as post-processing enhance-
ments in low-light image processing [6,14,15,19,35,37]. In extreme low light
conditions, such methods often fail to produce satisfactory results.
Each of the first two authors contributed equally.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 14–26, 2023.
https://doi.org/10.1007/978-3-031-45382-3_2
Fig. 1. At runtime, the intensity level for optimal restoration of a given dark image
might be different from the trained one and can lead to dark areas or low contrast. For
a wide range of output image’s intensity levels, our model optimizes the enhancement
of the input image. (a) left: The input, (b) center left: Ground truth, (c) center right:
SID [7] enhances the image to a fixed intensity level, which is not optimal for the input
image, and as a result, there are noticeable artifacts. (d) right: Our approach adapts the enhancement operation to optimally match any selected intensity level, thereby reducing the presence of artifacts.
parameter adjusts the operation of the image signal processing (ISP) unit to
enhance the degradations that are the result of the increase in the intensity,
conditioned on the intensity level.
Contribution. In pursuit of advancing research in the field and facilitating the
development of adaptive models, we have curated a dataset containing 1,500
raw images captured in extremely low-light conditions, comprising indoor and
outdoor scenes with diverse exposure levels. We propose and train a model that
can produce compelling results for restoring dark images with a wide range of
optimal intensity levels, including ones that were not available during training.
Our experimental results, which incorporate both qualitative and quantitative
measures, demonstrate that our model along with our dataset improves the
enhancement quality of dark images.
2 Related Work
Datasets. A key contribution of our work is a dataset of real-world images that enables training and evaluating multi-exposure models in extreme low light. Unlike existing datasets, we provide a long-exposure reference image with multiple shorter exposure times for each scene, for both indoor and outdoor scenes, and directly provide the raw sensor data. Our dataset fills this gap and allows the training of an adaptive model in extreme low-light conditions by combining multiple exposures. Our dataset is compared with other datasets in Table 1.
Dataset          Format  # Images  Publicly Available  Multi Exposure  Extreme Low Light
DND [22]         RAW     100       yes                 no              no
SIDD [1]         RAW     30000     yes                 yes             no
LLNet [21]       RGB     169       yes                 no              no
MSR-Net [25]     RGB     10000     no                  no              no
SID [7]          RAW     5094      yes                 no              yes
SICE [5]         RGB     4413      yes                 no              no
RENOIR [3]       RAW     1500      yes                 no              no
LOL [8]          RGB     500       yes                 no              no
DeepUPE [31]     RGB     3000      no                  no              no
VE-LOL-L [20]    RGB     2500      yes                 no              no
DarkVision [36]  RAW     13455     yes                 no              no
Ours             RAW     1500      yes                 yes             yes
training for an additional objective. CFSNet [28] uses branches, each one tar-
geted for a different objective. AdaFM [12] adds modulation filters after each
convolution layer. Deep Network Interpolation (DNI) [29] trains the same net-
work architecture on different objectives and interpolates all parameters. These
methods are optimized for well-lit images and, as we demonstrate in the experiments, struggle to enhance images captured in extreme low-light conditions.
Low-light Image Enhancement. Widely used enhancement methods are his-
togram equalization, which globally balances the histogram of the image; and
gamma correction, which increases the brightness of dark pixels. More advanced
methods include illumination map estimation [11], semantic map enhancement
[33], bilateral learning [10], multi-exposure [2,5,34], Retinex model [4,9,30,38]
and unpaired enhancement [17]. In contrast to these methods, we consider an
extreme low-light environment with very low SNR, where the scene is barely
visible to the human eye. Chen [7] has introduced an approach to extreme low-
light imaging by replacing the traditional image processing pipeline with a deep
learning model based on raw sensor data. Wang [27] introduced a neural network
for enhancing underexposed photos by incorporating an illumination map into
their model, while Xu [32] presented a model for low-light image enhancement
based on frequency-based decomposition. These methods are optimized to out-
put an enhanced image with a fixed exposure. In cases where the user requires
a change in the exposure (intensity level) of the output image, these methods
require retraining the models, typically on additional sets of images. In contrast,
we introduce an approach that enables continuous setting of the desired exposure
at inference time.
Fig. 2. The multi-exposure dataset. The top two rows are images of outdoor scenes,
and the bottom two rows are images of indoor scenes. From left to right are exposure
times of 0.1 s, 0.5 s, 1 s, 5 s, and 10 s.
18 E. Hershkovitch Neiterman et al.
3 Our Approach
3.1 Multi-exposure Extreme Low-Light Dataset (ME2L)
Fig. 3. The dashed red rectangle is the modulation module. The enhancement parameter α2 represents a weighted sum between the feature map of the initial and final exposure levels. The blue dashed line is to emphasize that the operation of the modulation module is also affected by the α1 parameter, which controls the brightness of the image. (Color figure online)

Fig. 4. The architecture of our network. There are two input parameters, α1 (brightness) and α2 (enhancement). α1 controls the brightness of the raw input data. α2 modulates the weights of the filters and tunes the network, which operates as an Image Signal Processing (ISP) unit. We train the model for an initial and final exposure level, where for each value of α1 there is a single value of α2. At inference time, each parameter can be set independently of the other.
Adaptive Enhancement of Extreme Low-Light Images 19
where y ∈ Y is the observed (raw) intensity at a pixel in the raw data space Y,
x is the original (unknown) signal, βshot is proportional to the analog gain (g_a)
and digital gain (g_d), and βread is proportional to the sensor readout variance
(σ_r²) and digital gain: βread = g_d² σ_r², βshot = g_d g_a.
It is therefore evident from Eq. (1) that, unlike previous methods assume,
adding a single noise source (e.g., Gaussian) or simply multiplying the image
intensity is not equivalent to acquiring an image at that intensity. We propose
an alternative approach that accounts for both read and shot noise by employing
two input parameters, each contributing differently to Eq. (1); a modulation
layer [12]; and a mapping of a single data point from raw data space to multiple
points in sRGB, each with a different output intensity level.
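The definitions of βshot and βread above imply the usual heteroscedastic noise model with variance βshot·x + βread. The sketch below simulates one raw observation under the common Gaussian approximation of the shot noise; the function name and parameter values are ours, not the paper's:

```python
import numpy as np

def simulate_raw_noise(x, analog_gain, digital_gain, read_std, rng=None):
    """Sample a noisy raw observation y for a clean signal x under the
    shot/read noise model implied by the definitions above:
    Var(y) = beta_shot * x + beta_read, with beta_shot = gd * ga and
    beta_read = gd^2 * sigma_r^2 (Gaussian approximation)."""
    rng = np.random.default_rng(rng)
    beta_shot = digital_gain * analog_gain
    beta_read = (digital_gain ** 2) * (read_std ** 2)
    variance = beta_shot * np.asarray(x, dtype=float) + beta_read
    return x + rng.normal(0.0, np.sqrt(variance))
```

Because the variance grows with the signal x, simply multiplying a dark image by a constant cannot reproduce the noise statistics of a genuinely brighter capture, which is the point made above.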
Our Raw-to-sRGB pipeline is formulated as a function f : Y ×R×R → Yrgb ,
yrgb = f (y, α1 , α2 ; θ), where α1 is a scalar that sets the mean of the signal
in Eq. (1) to the desired level by multiplication of the raw data, α2 controls the
enhancement level of the Raw-to-sRGB pipeline, θ represents the parameters of
f (·) and yrgb ∈ Yrgb is the signal of the sRGB image. The function f is realized by
a deep network with modulation layers. To obtain θ, we train our network in two
steps. First, the base model is trained to fit the enhanced image with an initial
intensity level, without any additional modifications to the existing architecture.
Then we freeze the weights of the base model, and each modulation layer (g)
is inserted after each existing convolutional kernel g(w, b) ◦ X, where X is the
output feature map of existing convolutional kernels in the base network and
w, b are weights and bias of the modulation layer’s convolutional filter kernel.
The network is then fine-tuned to fit the enhanced image with a final intensity
level by learning the weights of the additional convolutional kernels. Thus, in
our formulation, θ includes the parameters of both the base network and the
modulation layers. During runtime, assuming w1 is the base convolution kernel,
w and b are the weights of filter and bias in each modulation layer, the output
of the modulation layer is:
w1 + α2 w1 ∗ w + α2 b, (2)
for the given scalar 0 ≤ α2 ≤ 1 representing the enhancement parameter (Fig. 3).
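Eq. (2) can be read as interpolating between the frozen base kernel (α2 = 0) and the fully fine-tuned behaviour (α2 = 1). A minimal 1-D sketch of the modulated kernel; the paper's filters are 2-D, and the 1-D form and names here are our simplification:

```python
import numpy as np

def modulated_kernel(w1, w, b, alpha2):
    """Eq. (2): effective kernel of a modulated layer. alpha2 = 0 recovers
    the frozen base kernel w1; alpha2 = 1 gives w1 + w1*w + b, the kernel
    learned during fine-tuning. np.convolve stands in for convolving the
    base filter with the small learned modulation filter w."""
    return w1 + alpha2 * np.convolve(w1, w, mode="same") + alpha2 * b
```

Because w and b are only learned in the fine-tuning step while w1 stays frozen, a single network can realize the whole continuum of enhancement levels by varying α2 at inference time.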
To control both noise sources, we set α2 ∈ [0, 1] such that it linearly cor-
responds to α1 and α2 = 1 corresponds to the maximum value of α1 . Our key
intuition is that for α1, α2 → 0, it is the trained base network (before fine-tuning)
that produces the most significant output, and it enhances the read noise
(Eq. (1) and (2)). During training, both parameters are adjusted according to the
ground-truth image. The input arrays' values are multiplied by the α1 parameter,
which represents the ratio between the required output image's exposure
time and the input image's exposure time, effectively setting the intensity and
noise levels of the output. The overall architecture of our network is presented
in Fig. 4.
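Concretely, α1 is the factor that scales the raw input toward the target exposure; the one-line helper below is ours, and the 50× value matches the 0.1 s input / 5 s target case discussed later in the experiments:

```python
def brightness_parameter(t_in, t_out):
    """alpha1: ratio between the required output exposure time and the
    input exposure time. The raw input is multiplied by this factor before
    the network (e.g., a 0.1 s input enhanced toward a 5 s target gives 50)."""
    return t_out / t_in
```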
Unlike existing adaptive methods, we do not operate in the sRGB domain on noisy
images, as it limits the representation power of the architecture [1]. Instead, we
operate in the raw domain and employ a U-Net [24] as our base architecture (f ).
It replaces the entire image signal processing (ISP) pipeline. The input is a
short-exposure raw image from Bayer sensor data and the output is an sRGB image.
The raw Bayer sensor data is packed into four channels, reducing the spatial
resolution by a factor of two in each dimension, and the black level is subtracted.
The output is a 12-channel image processed to recover the original resolution of
the input image.
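The packing step can be sketched as follows. The black level, white level, and RGGB channel order below are assumptions that depend on the specific sensor, not values given in the text:

```python
import numpy as np

def pack_bayer(raw, black_level=512, white_level=16383):
    """Pack a single-channel Bayer mosaic (H, W) into four half-resolution
    channels (H/2, W/2, 4), subtracting the black level and normalizing.
    The 2x2 phase order (R, G1, G2, B for an assumed RGGB layout) depends
    on the sensor's color filter array."""
    raw = (raw.astype(np.float32) - black_level) / float(white_level - black_level)
    raw = np.clip(raw, 0.0, 1.0)
    h, w = raw.shape
    return np.stack([raw[0:h:2, 0:w:2],   # R
                     raw[0:h:2, 1:w:2],   # G1
                     raw[1:h:2, 0:w:2],   # G2
                     raw[1:h:2, 1:w:2]],  # B
                    axis=-1)
```

The network's 12-channel half-resolution output is then rearranged (depth-to-space by a factor of two) into the 3-channel full-resolution sRGB image described above.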
For testing, we set the intensity level (α1 ) and the enhancement (α2 ) param-
eters of the network to the desired exposure and ISP configuration. The input
image is multiplied according to the intensity level parameter, resulting in a
noisy, brighter image. The weights of the filter and bias in the modulation module
after the fine-tuning phase are adjusted according to the value of the enhance-
ment parameter.
We train the model using L1 loss and the Adam optimizer. The inputs are
random 512 × 512 patches with standard augmentation. The learning rate is
10−4 for 1000 epochs and then 10−5 for an additional 1000 epochs, a total of
2000 epochs for the training phase. Fine-tuning the model for the final exposure
level requires an additional 1000 epochs.
4 Experiments
Baselines. We compare our results with state-of-the-art adaptive methods
[12,18]. Using our dataset, we train them in accordance with their authors’
instructions. The inputs of the compared models were modified to operate on
raw images in order to ensure fair comparisons. SID [7] is the baseline
model for extreme low light enhancement, and it enhances dark images to a
fixed intensity level.
Evaluation Metrics. We use 70%, 10%, and 20% of the images for training,
validation, and testing, respectively, with uniform sampling and equal represen-
tation for indoor and outdoor scenes in each set. The ground truth images are
the corresponding long-exposure images processed by LibRaw¹ to sRGB format.
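The 70/10/20 split with equal indoor/outdoor representation can be sketched as follows, assuming separate lists of indoor and outdoor image paths; the function name and seed are ours:

```python
import random

def stratified_split(indoor, outdoor, seed=0):
    """70/10/20 train/val/test split with uniform sampling and equal
    indoor/outdoor representation in each set, as described in the text.
    Inputs are lists of image paths (or any identifiers)."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for group in (indoor, outdoor):
        group = group[:]              # copy so the caller's list is untouched
        rng.shuffle(group)
        n = len(group)
        n_train, n_val = int(0.7 * n), int(0.1 * n)
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]
    return splits
```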
Table 2. For all methods, the input exposure for both training and testing is 0.1 s. ⇒
denotes the ground-truth images used for training. Bold marks the two best results.
As can be seen, our model outperforms all other methods. See text for more details.
Train/Test                       1 s            5 s            10 s
                                 PSNR   SSIM    PSNR   SSIM    PSNR   SSIM
A - Single Exposure Baseline
SID [7] ⇒ 1                      38.17  0.95    30.70  0.87    27.70  0.84
SID [7] ⇒ 5                      36.82  0.94    33.35  0.91    28.00  0.86
SID [7] ⇒ 10                     34.88  0.90    30.52  0.88    30.00  0.88
B - Multi Exposure Baseline
SID [7] ⇒ 1,5,10                 35.77  0.92    29.55  0.86    26.25  0.82
Retinex [30] ⇒ 1,5,10            16.29  0.08    15.15  0.12    13.67  0.16
C - Two Exposure Interpolation
AdaFM [12] ⇒ 1,10                37.86  0.85    30.51  0.73    26.95  0.72
CResMD [18] ⇒ 1,10               36.37  0.80    21.63  0.46    26.52  0.64
Ours ⇒ 1,10                      38.17  0.95    32.35  0.89    29.67  0.87
D - Two Exposure Extrapolation
AdaFM [12] ⇒ 1,5                 37.86  0.85    31.12  0.76    25.98  0.70
CResMD [18] ⇒ 1,5                34.97  0.73    23.73  0.59    16.17  0.17
Ours ⇒ 1,5                       38.17  0.95    31.78  0.89    28.65  0.86
¹ www.libraw.org
Fig. 5. The restoration effect of enhancing images to exposure level within the trained
range. The first column is obtained by directly adjusting the brightness level to the
optimal exposure by multiplication.
Figure 5 shows the effect of adjusting the exposure time for a value within the
trained range, 5 s. The model is trained using input images with an exposure
time of 0.1 s and ground truth images with exposure times of 1 s and 10 s. SID
was trained on all possible output exposure times. The enhanced images after
adjusting the brightness and enhancement parameters are shown. The left col-
umn shows the effect of multiplying the intensity of the input images by 50,
which is the ratio between the ground truth exposure of the input (0.1 s) and
the ground truth output (5 s). It can be seen that our model successfully removes
the artifacts produced by the other approaches.
Filter Size. We evaluate different filter sizes in the modulation module.
We consider filter sizes of 1 × 1, 3 × 3, 5 × 5, and 7 × 7. We train our base
model with an exposure of 0.1 s and an output of 1 s, then fine-tune it to an
output of 10 s. The test images have an exposure of 5 s.
Table 3 shows our comparisons. It can be seen that the most significant gain
is achieved when using a filter size of 3 × 3.
Table 3. Filter size comparisons. The model is trained from 0.1 s to 1 s and fine-tuned
to 10 s, and tested for an unseen exposure level of 5 s.
Tuning Direction. We evaluate the optimal direction for the tuning. We com-
pare two models. The first one is trained from 0.1 s to 1 s and fine-tuned for
10 s. The second one is trained from 0.1 s to 10 s and fine-tuned for 1 s. We com-
pare the results with respect to unseen output images with an exposure time
of 5 s. The forward direction from 0.1 s to 10 s achieved better results than the
backward one, with a PSNR of 32.35 vs. 28.2.
5 Conclusion
Extreme low-light imaging is challenging and has recently gained growing interest.
Current methods can enhance dark images, but they assume that the input
exposure and the optimal output exposure are known at inference time, which
prevents their adoption in practical scenarios. We collected a dataset of 1500
images with multiple exposure levels for extreme low-light imaging. We presented
an approach that enables continuous control of the output exposure level of the
images at runtime, without the need to retrain the model, and showed that our
model produces promising results on a wide range of both indoor and outdoor
images. We believe that our dataset as well as our model will support further
research in the field of extreme low-light imaging, making a step forward towards
its widespread adoption.
References
1. Abdelhamed, A., Lin, S., Brown, M.S.: A high-quality denoising dataset for smart-
phone cameras. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2018)
2. Afifi, M., Derpanis, K.G., Ommer, B., Brown, M.S.: Learning multi-scale photo
exposure correction. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 9157–9167 (2021)
3. Anaya, J., Barbu, A.: RENOIR - a dataset for real low-light image noise reduction.
J. Vis. Commun. Image Represent. 51, 144–154 (2018)
4. Cai, B., Xu, X., Guo, K., Jia, K., Hu, B., Tao, D.: A joint intrinsic-extrinsic prior
model for retinex. In: Proceedings of the IEEE International Conference on Com-
puter Vision, pp. 4000–4009 (2017)
5. Cai, J., Gu, S., Zhang, L.: Learning a deep single image contrast enhancer from
multi-exposure images. IEEE Trans. Image Process. 27(4), 2049–2062 (2018)
6. Celik, T., Tjahjadi, T.: Contextual and variational contrast enhancement. IEEE
Trans. Image Process. 20(12), 3431–3441 (2011)
7. Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–
3300 (2018)
8. Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light
enhancement. In: British Machine Vision Conference (2018)
9. Fu, X., Zeng, D., Huang, Y., Zhang, X.P., Ding, X.: A weighted variational model
for simultaneous reflectance and illumination estimation. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 2782–2790
(2016)
10. Gharbi, M., Chen, J., Barron, J.T., Hasinoff, S.W., Durand, F.: Deep bilateral
learning for real-time image enhancement. ACM Trans. Graph. (TOG) 36(4), 1–
12 (2017)
11. Guo, X., Li, Y., Ling, H.: LIME: low-light image enhancement via illumination
map estimation. IEEE Trans. Image Process. 26(2), 982–993 (2016)
12. He, J., Dong, C., Qiao, Y.: Modulating image restoration with continual levels via
adaptive feature modification layers. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 11056–11064 (2019)
13. He, J., Dong, C., Qiao, Y.: Multi-dimension modulation for image restoration with
dynamic controllable residual learning. arXiv preprint arXiv:1912.05293 (2019)
14. Hu, Z., Cho, S., Wang, J., Yang, M.H.: Deblurring low-light images with light
streaks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3382–3389 (2014)
15. Hwang, S.J., Kapoor, A., Kang, S.B.: Context-based automatic local image
enhancement. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C.
(eds.) ECCV 2012. LNCS, vol. 7572, pp. 569–582. Springer, Heidelberg (2012).
https://doi.org/10.1007/978-3-642-33718-5_41
16. Ignatov, A., Kobyshev, N., Timofte, R., Vanhoey, K., Van Gool, L.: DSLR-quality
photos on mobile devices with deep convolutional networks. In: Proceedings of the
IEEE International Conference on Computer Vision, pp. 3277–3285 (2017)
17. Jiang, Y., et al.: EnlightenGAN: deep light enhancement without paired supervi-
sion (2021)
18. He, J., Dong, C., Qiao, Y.: Interactive multi-dimension modulation with dynamic
controllable residual learning for image restoration. In: Vedaldi, A., Bischof, H.,
Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 53–68. Springer,
Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_4
19. Lee, C., Lee, C., Kim, C.S.: Contrast enhancement based on layered difference
representation of 2D histograms. IEEE Trans. Image Process. 22(12), 5372–5384
(2013)
20. Liu, J., Xu, D., Yang, W., Fan, M., Huang, H.: Benchmarking low-light image
enhancement and beyond. Int. J. Comput. Vision 129, 1153–1184 (2021)
21. Lore, K.G., Akintayo, A., Sarkar, S.: LLNet: a deep autoencoder approach to nat-
ural low-light image enhancement (2016)
22. Plotz, T., Roth, S.: Benchmarking denoising algorithms with real photographs. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1586–1595 (2017)
23. Remez, T., Litany, O., Giryes, R., Bronstein, A.M.: Deep convolutional denoising
of low-light images. arXiv preprint arXiv:1701.01687 (2017)
24. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
25. Shen, L., Yue, Z., Feng, F., Chen, Q., Liu, S., Ma, J.: MSR-net: low-light image
enhancement using deep convolutional network (2017)
26. Shoshan, A., Mechrez, R., Zelnik-Manor, L.: Dynamic-net: tuning the objective
without re-training for synthesis tasks. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 3215–3223 (2019)
27. Wang, R., Zhang, Q., Fu, C.W., Shen, X., Zheng, W.S., Jia, J.: Underexposed
photo enhancement using deep illumination estimation. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 6849–6857
(2019)
28. Wang, W., Guo, R., Tian, Y., Yang, W.: CFSNet: toward a controllable feature
space for image restoration. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 4140–4149 (2019)
29. Wang, X., Yu, K., Dong, C., Tang, X., Loy, C.C.: Deep network interpolation for
continuous imagery effect transition. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1692–1701 (2019)
30. Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light
enhancement. arXiv preprint arXiv:1808.04560 (2018)
31. Xu, K., Yang, X., Yin, B., Lau, R.W.: Learning to restore low-light images via
decomposition-and-enhancement (supplementary material) (2020)
32. Xu, K., Yang, X., Yin, B., Lau, R.W.: Learning to restore low-light images via
decomposition-and-enhancement. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 2281–2290 (2020)
33. Yan, Z., Zhang, H., Wang, B., Paris, S., Yu, Y.: Automatic photo adjustment using
deep neural networks. ACM Trans. Graph. (TOG) 35(2), 1–15 (2016)
34. Ying, Z., Li, G., Gao, W.: A bio-inspired multi-exposure fusion framework for
low-light image enhancement. arXiv preprint arXiv:1711.00591 (2017)
35. Yuan, L., Sun, J.: Automatic exposure correction of consumer photographs. In:
Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012.
LNCS, vol. 7575, pp. 771–785. Springer, Heidelberg (2012).
https://doi.org/10.1007/978-3-642-33765-9_55
36. Zhang, B., et al.: DarkVision: a benchmark for low-light image/video perception.
arXiv preprint arXiv:2301.06269 (2023)
37. Zhang, X., Shen, P., Luo, L., Zhang, L., Song, J.: Enhancement and noise reduction
of very low light level images. In: Proceedings of the 21st International Conference
on Pattern Recognition (ICPR2012), pp. 2034–2037. IEEE (2012)
38. Zhang, Y., Zhang, J., Guo, X.: Kindling the darkness: a practical low-light image
enhancer. In: Proceedings of the 27th ACM International Conference on Multime-
dia, pp. 1632–1640 (2019)
Semi-supervised Classification
and Segmentation of Forest Fire Using
Autoencoders
1 Introduction
Forests are one of the most important commodities in the world. Apart from
providing natural habitat to numerous species, they also provide us with resources
such as wood, resin, and herbs. Forest fires are a huge threat to these vast expanses
of forest cover. In recent years, we have seen uncontrollable forest fires around the
world, some of which are still burning to date. Every year, over 200,000 fires occur
across a total area of about 3.5–4.5 million km², destroying hundreds of millions of
hectares of land [1,2]. The increase in forest fires around the world has resulted in
increased motivation for developing fire warning systems for the early detection of
wildfires. Such early detection systems can act as deterrents and can prevent the
excessive damage caused to flora and fauna by wildfires.
Computer Vision and Machine Learning have been utilized to develop fire
detection systems [3]. For instance, classical approaches such as Support Vector
Machines have been used to classify image processing features for detecting forest
fire regions [4].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 27–39, 2023.
https://doi.org/10.1007/978-3-031-45382-3_3
28 A. Koottungal et al.
network-based architectures, including Convolutional Neural Networks, Vision
Transformers, and U-net, have also been employed [3]. Although supervised
learning-based approaches have shown promise, they require laborious and time-
consuming annotations. For instance, the task of semantic segmentation requires
humans to provide strong pixel-level annotations for millions of images. To
address this issue, semi-supervised learning has been used, which reduces the
amount of human effort required for data labeling by incorporating informa-
tion from a large set of unlabeled data [11,12]. Semi-supervised learning results
in improved accuracy through the incorporation of additional information from
unlabeled data thus providing enhanced generalization capabilities.
In this work, semi-supervised methods for forest fire classification and segmentation
using autoencoders are proposed. In particular, a convolutional autoencoder
(CAE) model is leveraged to reconstruct the input image in an unsupervised
manner, by which a rich representation is learned in the latent space.
The encoder part is then used for the classification task in a semi-supervised
manner using the annotations. On top of this, a state-of-the-art weakly
supervised method, Class Activation Mapping (CAM) [5], is incorporated
for the visualization of the fire region. Further, for semi-supervised segmentation,
the model is trained via patch-wise extraction of images and their corresponding
masks.
Extensive analysis of classification and segmentation is carried out on the
FLAME dataset and Corsican database, respectively. Both classification and
segmentation models outperform state-of-the-art semi-supervised approaches
and are quite competitive with fully supervised approaches, even with a
much lower annotation effort. The key contributions of the paper are as
follows:
– A novel approach for semi-supervised segmentation and classification of
forest fire using autoencoders, one of the first of its kind.
– Visualization of the classification results via Class Activation Mapping (CAM)
saliency heat maps and CAM binary masks to depict the relevant frame regions,
thus making our approach interpretable (Explainable AI).
– Semi-supervised learning for segmentation by training an autoencoder on
patches drawn from a limited number of forest fire images.
The remainder of the paper is organized as follows: Sect. 2 provides a litera-
ture review of the detection and segmentation of forest fires. Section 3 describes
the methodology used for the classification and segmentation task, focusing on
the use of autoencoders in a semi-supervised manner. Section 4 presents the
experimental setup and result analysis. Finally, the paper concludes with a sum-
mary of the findings and suggestions for future research in Sect. 5.
2 Related Works
A variety of fire-sensing technologies, involving gas, flame, heat, and smoke
detectors, were utilized in early forest fire detection systems. While these systems
Semi-supervised Classification and Segmentation 29
have managed to detect fire, they have faced limitations related to coverage
area, false alarms, and slow response times [4]. Tlig et al. [10] proposed a color
image segmentation method based on principal component analysis (PCA)
and Gabor filter responses on various color images, and it was found to be
insensitive to added noise. Using the YCbCr color space, Mahmoud et al. [9] proposed
a forest fire algorithm able to separate high-temperature fire centre pixels. The
method had good detection rates and fewer false alarms.
Recently, various deep learning techniques have been used for fire detection. Perrolas
et al. [6] proposed a scalable method to detect and segment regions of fire
and smoke in an image using a quad-tree search algorithm. The method was
capable of localizing very small regions of fire incidence in the images. In another
work, Ghali et al. [7] presented the use of vision-based Transformers on visible-spectrum
images to perform segmentation of forest fire.
In addition to the supervised approaches, unsupervised and semi-supervised
approaches have also been reported. For example, Ajith et al. [8] developed an
unsupervised segmentation system that can be used for early detection of fire in
real time using spatial, temporal and motion information from video frames. The
use of Class Activation Mapping (CAM) for weakly supervised fire and smoke
segmentation is addressed by the recent semi-supervised works of
Amaral et al. [11] and Niknejad et al. [12]. In [11], Conditional Random Fields
(CRF) are incorporated along with CAM to accurately detect fire/smoke masks
at the pixel level.
In our proposed work, we leverage an autoencoder-based semi-supervised
approach for fire detection. Some applications of autoencoders for semi-supervised
learning have been reported in the medical field, for medical imaging and diagnostics.
For example, Kucharski et al. [13] developed a semi-supervised convolutional
autoencoder architecture for segmenting nests of melanocytes in two stages. Roy et
al. [14] developed a similar architecture for segmenting viable tumour regions in
liver whole-slide images. Alex et al. [15] utilized stacked denoising autoencoders
in a semi-supervised learning approach to develop a model capable of
detecting and segmenting brain lesions using less data. To the best of our
knowledge, our work represents one of the first studies of an autoencoder-based
semi-supervised learning approach in the context of fire detection and segmentation
tasks.
3 Methodology
3.1 Autoencoders
The encoder function (φ) maps the original data X to a latent space Z, which
resides in the bottleneck layer. The decoder function (θ) maps the latent space Z
back to a reconstruction X′ that should match the input. Thus, the algorithm
tries to recreate the original image after a generalized nonlinear compression.
φ : X → Z
θ : Z → X′
φ, θ = arg min_{φ,θ} ||X − (θ ∘ φ)(X)||²   (1)
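A toy linear instance of these maps illustrates the objective; the paper's model is a convolutional autoencoder, and the dimensions and random weights below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder for Eq. (1): phi maps X to the latent space Z
# (bottleneck, k < d) and theta maps Z back to a reconstruction X'.
# Training would minimize ||X - (theta . phi)(X)||^2 over W_enc, W_dec.
d, k = 8, 3
W_enc = 0.1 * rng.normal(size=(k, d))   # parameters of the encoder phi
W_dec = 0.1 * rng.normal(size=(d, k))   # parameters of the decoder theta

def phi(x):
    return W_enc @ x                    # encoder: X -> Z

def theta(z):
    return W_dec @ z                    # decoder: Z -> X'

def recon_loss(x):
    r = x - theta(phi(x))
    return float(r @ r)                 # squared reconstruction error
```

The bottleneck (k < d) is what forces the latent code to be a compressed, informative representation rather than a copy of the input.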
4 Experiments
4.1 Datasets
In this work, images from two forest fire datasets are extensively used, i.e., the
Corsican dataset and the FLAME dataset.
(i) Corsican Fire Dataset [19] is an open fire database that includes pixel-
level segmented images of wildfires and controlled fires. It consists of 1135
RGB images of forest fires captured at a close range, including some with
sequences of frames of different fires. The database also contains 635 Near
Infrared (NIR) images, but they are not used in this study. The images have
dimensions of 1024 × 768.
(ii) FLAME Dataset [20] consists of 25018 images that contain fire and
14357 images that do not have fire. Each image is a frame from a video which
covers different forest regions that contain fires. The images are shot from an
aerial view with varying heights using a drone. The dimensions of the images
are 256 × 256.
MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²   (3)
where Y_i are the actual values of the variable being predicted and Ŷ_i are the
predicted values. In a later stage of training, the encoder is extracted
from the trained autoencoder for the classification task. Here, the training loss for
classification is calculated using the Binary Cross-Entropy (BCE) loss.
BCE = −(1/N) Σ_{i=1}^{N} [y_i log(p_i) + (1 − y_i) log(1 − p_i)]   (4)
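Both losses are straightforward to compute; a NumPy sketch of Eqs. (3) and (4), where the eps clamp is our addition to guard against log(0):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error, Eq. (3): average squared reconstruction error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def bce(y, p, eps=1e-12):
    """Binary cross-entropy, Eq. (4): y are 0/1 labels and p are predicted
    fire probabilities (output of the sigmoid). eps guards log(0)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```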
Sigmoid activation function. Refer to Table 1 for the CAE architecture used for
semi-supervised classification. The entire model was then trained for an addi-
tional 30 epochs. We evaluate the fire classification model’s performance not only
through accuracy but also by examining the fire localization with Class Activation
Mapping (CAM), as explained in Sect. 3.2. A global average pooling layer is
employed after the last convolutional layer, resulting in the activation mapping.
The ADAM optimizer is used to train the model for 20 epochs on labeled data,
achieving 0.97 accuracy.
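With global average pooling after the last convolutional layer, the class score becomes a weighted sum of channel averages, so CAM [5] reduces to applying the same weighted sum per spatial location. A sketch, where the array shapes and the normalization to [0, 1] are our assumptions:

```python
import numpy as np

def class_activation_map(features, fc_weights):
    """CAM: localization map for one class from last-conv activations.
    features: (C, H, W) feature maps; fc_weights: (C,) weights connecting
    the pooled channels to the class score. Returns an (H, W) map scaled
    to [0, 1], which can be thresholded into a binary fire mask."""
    cam = np.tensordot(fc_weights, features, axes=(0, 0))  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

Upsampling the (H, W) map to the input resolution and thresholding it yields the CAM binary masks shown in Fig. 3.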
Model Training for Semi-supervised Segmentation: The segmentation
model is trained by feeding patches of images, instead of whole images, taken
from the Corsican database. The model uses 500 images for training. Each image is
divided into patches, which are fed into the network to be reconstructed. We
experimented with patch sizes of 5 × 5, 9 × 9 and 16 × 16, training the model
with 1000 patches from each image. The whole autoencoder is first trained to
reconstruct the RGB patches in an unsupervised manner for 1000 epochs. In a
later stage of training, the autoencoder is adapted to output the segmented image
patch by patch; the patches are later rejoined to obtain the whole predicted mask.
At this stage, the model is trained for 500 epochs using the binary masks that
correspond to the input images.
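The patch-wise pipeline can be sketched as below, assuming non-overlapping patches and image sides divisible by the patch size; the paper's actual sampling of 1000 patches per image may be random rather than a full tiling:

```python
import numpy as np

def split_into_patches(img, p):
    """Tile an (H, W, C) image into non-overlapping p x p patches,
    assuming H and W are divisible by p (a simplification)."""
    h, w, _ = img.shape
    return [img[i:i + p, j:j + p]
            for i in range(0, h, p)
            for j in range(0, w, p)]

def rejoin_patches(patches, h, w, p):
    """Inverse of split_into_patches: stitch predicted patch masks back
    into a full (h, w, ...) prediction in the same row-major order."""
    out = np.zeros((h, w) + patches[0].shape[2:], dtype=patches[0].dtype)
    k = 0
    for i in range(0, h, p):
        for j in range(0, w, p):
            out[i:i + p, j:j + p] = patches[k]
            k += 1
    return out
```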
Table 1. The architecture of the Convolutional Autoencoder (CAE) designed for semi-
supervised classification
Fig. 3. Visualization of the Class activation map (CAM) heatmap and binary mask,
conducted upon the CAE-based classification results.
From the results (Case studies 1–3) in Table 3, it is observed that the models
trained with patches of size 5 × 5 and 9 × 9 produce much better mIoU scores
(0.756 and 0.742, respectively) than the large patch size of 16 × 16 (mIoU of
0.723). Further, it was found in Case study 4 that fusing the small-patch results
with those of the large patch size 16 × 16 increases the mIoU from 0.723 to 0.749.
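The mIoU score used above can be computed per mask pair and averaged; treating two empty masks as perfect agreement is our convention, not one stated in the text:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union for binary masks (fire vs. background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                       # both masks empty: treat as a match
    return np.logical_and(pred, gt).sum() / union

def mean_iou(preds, gts):
    """Mean IoU over a set of predicted/ground-truth mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```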
State-of-the-Art Comparison: The proposed approach is compared against
some of the recent state-of-the-art works, Amaral et al. [11] and Niknejad et al.
Table 4. Comparison of mean IoU (mIoU) on the test set for our proposed method
compared to other state-of-the-art weakly-supervised segmentation methods.
[12], that use weakly-supervised approaches to segment forest fire, as well as
against our CAM-based fire localization approach. Note that, since ground-truth
segmentation images are unavailable in the FLAME dataset, the segmentation task
and its comparison studies are carried out on the Corsican dataset. The
comparison results are reported in Table 4. From the results, it is observed that the
CAM-based fire localization produced a mIoU of 0.547, which could be ascribed
to non-discriminative parts being missed while training for classification.
Whereas, our proposed CAE based patch-wise segmentation model (as in Case
References
1. Martinez-de Dios, J.R., Arrue, B.C., Ollero, A., Merino, L., Gómez-Rodríguez, F.:
Computer vision techniques for forest fire perception. Image Vis. Comput. 26(4),
550–562 (2008)
2. Meng, Y., Deng, Y., Shi, P.: Mapping forest wildfire risk of the world. In: Shi, P.,
Kasperson, R. (eds.) World Atlas of Natural Disaster Risk. IERGPS, pp. 261–275.
Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-45430-5_14
3. Abid, F.: A survey of machine learning algorithms based forest fires prediction and
detection systems. Fire Technol. 57(2), 559–590 (2021)
4. Chen, T.H., Wu, P.H., Chiou, Y.C.: An early fire-detection method based on image
processing. In: 2004 International Conference on Image Processing, ICIP 2004,
Singapore, vol. 3, pp. 1707–1710 (2004)
5. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features
for discriminative localization. In: 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 2921–2929 (2016)
6. Perrolas, G., Niknejad, M., Ribeiro, R., Bernardino, A.: Scalable fire and smoke
segmentation from aerial images using convolutional neural networks and quad-tree
search. Sensors 22(5), 1701 (2022)
7. Ghali, R., Akhloufi, M.A., Jmal, M., Souidene Mseddi, W., Attia, R.: Wildfire
segmentation using deep vision transformers. Remote Sens. 13(17), 3527 (2021)
8. Ajith, M., Martínez-Ramón, M.: Unsupervised segmentation of fire and smoke from
infra-red videos. IEEE Access 7, 182381–182394 (2019)
9. Mahmoud, M.A., Ren, H.: Forest fire detection using a rule-based image processing
algorithm and temporal variation. Math. Probl. Eng. (2018)
10. Tlig, L., Bouchouicha, M., Tlig, M., Sayadi, M., Moreau, E.: A fast segmentation
method for fire forest images based on multiscale transform and PCA. Sensors
20(22), 6429 (2020)
11. Amaral, B., Niknejad, M., Barata, C., Bernardino, A.: Weakly supervised fire and
smoke segmentation in forest images with CAM and CRF. In: 26th International
Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, pp. 442–448
(2022)
12. Niknejad, M., Bernardino, A.: Weakly-supervised fire segmentation by visualizing
intermediate CNN layers. arXiv, abs/2111.08401 (2021)
13. Kucharski, D., Kleczek, P., Jaworek-Korjakowska, J., Dyduch, G., Gorgon, M.:
Semi-supervised nests of melanocytes segmentation method using convolutional
autoencoders. Sensors 20(6), 1546 (2020)
14. Roy, M., et al.: Convolutional autoencoder based model HistoCAE for segmentation
of viable tumor regions in liver whole-slide images. Sci. Rep. 11, 139 (2021)
15. Alex, V., Vaidhya, K., Thirunavukkarasu, S., Kesavadas, C., Krishnamurthi, G.:
Semisupervised learning using denoising autoencoders for brain lesion detection
and segmentation. J. Med. Imaging 4(4), 041311 (2017)
16. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neu-
ral networks. Science 313(5786), 504–507 (2006)
17. Gondara, L.: Medical image denoising using convolutional denoising autoencoders.
In: IEEE 16th International Conference on Data Mining Workshops (ICDMW),
Barcelona, Spain, pp. 241–246 (2016)
18. Liu, Y., Li, C., Zhao, Y., Xu, J.: Unified image restoration with convolutional
autoencoder. In: 2022 2nd International Conference on Networking, Communica-
tions and Information Technology (NetCIT), pp. 143–146 (2022)
19. Toulouse, T., Rossi, L., Campana, A., Celik, T., Akhloufi, M.A.: Computer vision
for wildfire research: an evolving image dataset for processing and analysis. Fire
Saf. J. 92, 188–194 (2017)
20. Shamsoshoara, A., Afghah, F., Razi, A., Zheng, L., Fulé, P., Blasch, E.: The FLAME dataset: aerial imagery pile burn detection using drones (UAVs). IEEE Dataport (2020)
21. Zhou, D., Fang, J., Song, X., Guan, C., Yin, J., Dai, Y., Yang, R.: IoU loss for
2D/3D object detection. In: 2019 International Conference on 3D Vision (3DV),
pp. 85–94 (2019)
Descriptive and Coherent Paragraph
Generation for Image Paragraph
Captioning Using Vision Transformer
and Post-processing
1 Introduction
Image captioning is the task of generating a textual description of an image. Early methods for image captioning used encoder-decoder models, where the encoder extracted features from the image and the decoder generated the caption. These methods have difficulty capturing the nuances and complexities of the image content.
The advent of transformer-based models revolutionized the field of image
captioning. The transformer architecture [1], originally developed for natural
language processing tasks, allows for capturing the long-range dependencies and
relationships in the image and text. Transformer-based models have significantly
improved the quality of generated captions, allowing for more descriptive and
coherent descriptions of the image content.
Image paragraph captioning involves generating a paragraph of descriptive
text for an input image. Image paragraph captioning is a more challenging task
as it requires the model to not only generate a description of the image content
but also ensure coherence and consistency within the paragraph.
To address this challenge, the current approaches have leveraged the tech-
niques used in image captioning, such as transformer-based encoder-decoder
models and post-processing steps, to generate the paragraph. However, there
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 40–52, 2023.
https://doi.org/10.1007/978-3-031-45382-3_4
Image Paragraph Captioning Using ViT and Post-processing 41
[Figure: overall model architecture — stacked encoder blocks (×Le) and a GPT-2 based decoder (×Ld), each with Add & LN sublayers, followed by a Softmax output layer.]
3.1 Encoder
As proposed by [9] in vision transformers (ViT), we use patches of images as input to the encoder. The input image is denoted by I ∈ R^{H×W×C}, where H represents the number of pixels along the height of the image, W represents the number of pixels along the width of the image, and C represents the number of channels. First, the image is resized to a fixed size (H, W) and then converted into N patches of size (k, k), where N = HW/k². The 2-D patches are flattened into 1-D patches denoted by (x_1, x_2, ..., x_N), where x_p ∈ R^{k²C×1}. They are converted into embeddings (e_1, e_2, ..., e_N) using the embedding matrix E ∈ R^{d×k²C}, where e_p = E x_p. The 1-D positional embeddings are added to these embeddings to retain the position information. The position embedding matrix [1] is denoted by E_pos ∈ R^{d×N}. The following is the sequence of operations performed in the encoder:
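The patch-embedding step described above can be sketched in NumPy as follows; the image size (224 × 224 × 3), patch size k = 16, and embedding dimension d = 768 are illustrative choices for this sketch, not the paper's exact settings:

```python
import numpy as np

def patch_embed(image, k, d, rng):
    """Split an (H, W, C) image into non-overlapping k x k patches,
    flatten each to a k*k*C vector, project with E (d x k^2 C),
    and add 1-D positional embeddings E_pos (d x N)."""
    H, W, C = image.shape
    assert H % k == 0 and W % k == 0
    N = (H * W) // (k * k)                         # number of patches
    patches = (image.reshape(H // k, k, W // k, k, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(N, k * k * C))        # (N, k^2 C)
    E = rng.standard_normal((d, k * k * C))        # embedding matrix
    E_pos = rng.standard_normal((d, N))            # positional embeddings
    return E @ patches.T + E_pos                   # (d, N): e_p = E x_p + pos

rng = np.random.default_rng(0)
emb = patch_embed(rng.standard_normal((224, 224, 3)), k=16, d=768, rng=rng)
print(emb.shape)  # (768, 196): 196 = 224*224 / 16^2
```

With a 224 × 224 image and 16 × 16 patches, N = 224 · 224 / 16² = 196 patch embeddings are produced, each of dimension d.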
3.2 Decoder
adapted to the specific task, its ability to fine-tune to the target data distribu-
tion, and its computational efficiency.
3.3 Post-processing
Dissimilarity Score: For the dissimilarity score, we use the word mover’s
distance (W M D), a measure of the dissimilarity between two text documents
that takes into account both the semantic meaning and the distance between
words in a vector space representation. Let Fi denote the i-th sentence in the
final caption generated denoted by F . Let T be the number of sentences in the
final caption. Then the dissimilarity score for a sentence s is calculated as shown
in Eq. (5)
D(s, F) = ( Σ_{i=1}^{T} WMD(s, F_i) ) / T,  T > 0    (5)
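As a sketch, Eq. (5) is an average of WMD values between the candidate sentence and the sentences already in the final caption. The `toy_wmd` function below is a hypothetical stand-in for a real word mover's distance implementation (e.g. one built on word2vec embeddings), used only to keep the example self-contained:

```python
def dissimilarity(s, final_caption, wmd):
    """Eq. (5): mean word mover's distance between candidate sentence s
    and the T sentences already placed in the final caption F."""
    T = len(final_caption)
    assert T > 0
    return sum(wmd(s, f_i) for f_i in final_caption) / T

def toy_wmd(a, b):
    """Hypothetical stand-in for WMD: fraction of non-shared words."""
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / len(wa | wb)

score = dissimilarity("a man rides a horse",
                      ["a man rides a bike", "the sky is blue"], toy_wmd)
print(round(score, 2))  # 0.7
```

A high score means the candidate adds information not yet present in the caption, which is why dissimilar sentences are preferred during selection.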
Length Penalty: To avoid selecting overly brief sentences for the final caption,
a length penalty has been applied to sentences of short length. The length of a
candidate sentence, s, is represented by |s|. The median length of all candidate
46 N. Vakada and C. Chandra Sekhar
RC(K, W_s) = cosine_similarity( (Σ_{i=1}^{10} word2vec(k_i)) / 10 , (Σ_{i=1}^{|s|} word2vec(w_i)) / |s| )    (7)
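Eq. (7) compares the mean word2vec [12] embedding of the ten related-class names against the mean embedding of the candidate sentence's words. A minimal sketch with random stand-in vectors (real word2vec embeddings would be used in practice; the 300-dimensional size is an illustrative assumption):

```python
import numpy as np

def related_classes_score(class_vecs, word_vecs):
    """Eq. (7): cosine similarity between the averaged embeddings of the
    related-class names and of the candidate sentence's words."""
    k = np.mean(class_vecs, axis=0)
    w = np.mean(word_vecs, axis=0)
    return float(k @ w / (np.linalg.norm(k) * np.linalg.norm(w)))

rng = np.random.default_rng(1)
class_vecs = rng.standard_normal((10, 300))   # 10 related classes, 300-d vectors
word_vecs = rng.standard_normal((7, 300))     # |s| = 7 words in the sentence
rc = related_classes_score(class_vecs, word_vecs)
```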
The process to generate the final caption based on the scores described is
given below:
For the first sentence of the final paragraph, we pick the candidate sentence
with the highest language score and the image-text similarity score, as in Eq. 8.
The next sentence (NS) is picked from the list of available candidate sentences (S) as shown in Eq. (9). After the next sentence is chosen, it is concatenated to the final caption.
Model METEOR
Ours (Base model) 17.59
Ours + post-processing by [10] 18.86
Ours + post-processing with image-text score only 19.02
Ours + post-processing with related-classes similarity score only 19.07
Ours + post-processing using all scores combined 19.16
Ft = concatenate(Ft−1 , N S) (10)
where t = 2, 3, ..., T
It is important to note that once a sentence is chosen for the final caption it
is removed from the candidate sentences list. We stop adding new sentences to
the final caption when there are no sentences with a score above 0.5 as suggested
in [10].
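Putting the selection rules together, the post-processing loop can be sketched as below. The `score` callable is a hypothetical placeholder for the combined language, image-text, dissimilarity, and related-classes scores described above (here a toy score that simply favours longer sentences):

```python
def build_caption(candidates, score, threshold=0.5):
    """Greedily move the best-scoring candidate into the final caption;
    chosen sentences are removed from the pool, and selection stops once
    no remaining sentence scores above the threshold."""
    final = []
    candidates = list(candidates)      # work on a copy of the pool
    while candidates:
        best = max(candidates, key=lambda s: score(s, final))
        if score(best, final) <= threshold:
            break
        final.append(best)
        candidates.remove(best)
    return " ".join(final)

caption = build_caption(["a man rides a brown horse", "ok",
                         "the field is green and wide"],
                        lambda s, final: len(s) / 30)
print(caption)  # the two long sentences are kept, "ok" falls below 0.5
```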
4 Experiments
4.1 Dataset
The number of layers in the encoder and decoder is denoted by L and it is set to
12. The dimension of the embeddings in both the encoder and decoder, denoted
by d, is set to 762. The dropout parameter is set to 0.1 for the decoder. The model
is trained for 24 epochs with the cross-entropy loss function. The AdamW optimizer is used with a learning rate of 5e-6. Beam search with a beam size of 3 is used to sample the captions during inference. The encoder is initialized with the weights of the vision transformer and the decoder is initialized with the weights of the GPT-2 model.
In our experiments, we use the METEOR [13] score to evaluate our model. The METEOR score measures whether the hypothesis and the reference are semantically close. A higher METEOR score indicates a higher similarity between
Table 3. Anaphora rate and Flesch reading ease score for the proposed approach,
HRNN and humans
We compare our model with HSGED [8]. From Fig. 3, we can see that our model tends to be more descriptive of the image when compared to the HSGED model.
Our model gives a more detailed description of the person's shirt or hair. This shows that our model tends to give better descriptions of the objects than the other model.
Figure 3b shows the output paragraphs generated by our model and compares them with [5] and the ground truth. The example shows that our model uses anaphora very well to maintain the coherence of the paragraphs. The global
attention of the transformer models is the reason for the high coherence of the
paragraphs. The examples also indicate that the paragraphs generated by the
proposed model are easy to read and understand when compared to those gen-
erated by HRNN.
6 Conclusion
References
1. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems, vol. 30 (2017)
2. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image
caption generator. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3156–3164 (2015)
3. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and
visual question answering. In: Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition, pp. 6077–6086 (2018)
4. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer
for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 10578–10587 (2020)
5. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for gen-
erating descriptive image paragraphs. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 317–325 (2017)
6. Liang, X., Hu, Z., Zhang, H., Gan, C., Xing, P.: Recurrent topic-transition GAN for
visual paragraph generation. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 3362–3371 (2017)
7. Chatterjee, M., Schwing, A.G.: Diverse and coherent paragraph generation from images. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 729–744 (2018)
8. Yang, X., Gao, C., Zhang, H., Cai, J.: Hierarchical scene graph encoder-decoder
for image paragraph captioning. In: Proceedings of the 28th ACM International
Conference on Multimedia, pp. 4181–4189 (2020)
9. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image
recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
10. Kanani, S., Saha, S., Bhattacharyya, P.: Improving diversity and reducing redun-
dancy in paragraph captions. In: 2020 International Joint Conference on Neural
Networks (IJCNN), pp. 1–8. IEEE (2020)
11. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: International Conference on Machine Learning, ICML, pp. 8748–
8763 (2021)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with
improved correlation with human judgments. In: Proceedings of the ACL Workshop
on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or
Summarization, pp. 65–72 (2005)
14. Paulus, R., Xiong, C., Socher, R.: A deep reinforced model for abstractive summa-
rization. arXiv preprint arXiv:1705.04304 (2017)
Pyramid Swin Transformer
for Multi-task: Expanding to More
Computer Vision Tasks
1 Introduction
Transformers have gained considerable attention in the computer vision commu-
nity due to their ability to capture long-range dependencies and model complex
interactions between visual elements more effectively than CNNs. The Vision
Transformer (ViT) by Dosovitskiy et al. [8] was one of the first successful adap-
tations of transformers for vision tasks, demonstrating competitive performance
in image classification tasks. Following the ViT, several other transformer-based
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 53–65, 2023.
https://doi.org/10.1007/978-3-031-45382-3_5
54 C. Wang et al.
architectures have been proposed for various computer vision tasks. Despite their
remarkable results, these architectures often suffer from scalability issues and
high computational complexity, which has led to the development of more effi-
cient vision transformer architectures. One such architecture is the Swin Trans-
former, introduced by Liu et al. [16]. It is a hierarchical vision transformer that
addresses the limitations of traditional transformers by incorporating a local
window-based self-attention mechanism and a shifted window strategy. This
innovative approach has resulted in state-of-the-art performance in various com-
puter vision benchmarks, including image classification, object detection, seman-
tic segmentation, and video recognition. Furthermore, the Swin Transformer has
demonstrated strong scalability and adaptability to a wide range of vision tasks,
making it a highly promising architecture for future research and applications.
Fig. 1. Pyramid Swin Transformer: (a) image classification, (b) object detection, (c) semantic segmentation, (d) video recognition
In our previous work [21], we proposed the Pyramid Swin Transformer, which
employed different-size windows in the Swin Transformer architecture to enhance
its performance in image classification and object detection tasks. Building upon
the success of our original Pyramid Swin Transformer, we extend its capabilities
in this paper to address two additional vision tasks: semantic segmentation and
video recognition. Our Pyramid Swin Transformer addresses the lack of connections among large-scale windows in the Swin Transformer. By
implementing multiple windows of varying sizes on an extensive feature map, we
construct a layered hierarchy of windows. In this arrangement, a single window in
Pyramid Swin Transformer for Multi-task 55
the upper layer encapsulates the features of four windows in the immediate lower
layer. This progression significantly strengthens the interconnections among the
windows, thereby enhancing the overall representational capability of the model.
This paper presents the Pyramid Swin Transformer for Multi-task, detail-
ing our changes and their rationale. We also provide extensive experimental
results, demonstrating the improved performance achieved by our new architec-
ture across all four vision tasks: image classification on ImageNet [19], object
detection on COCO [14], semantic segmentation on ADE20K [26], and video
recognition on Kinetics-400 [11], as shown in Fig. 1. In this paper, our objective
is to further test our improved Pyramid Swin Transformer on a wider range of
computer vision tasks beyond image classification and object detection, demon-
strating its potential for various vision applications and its superior performance.
We aim to provide valuable resources for researchers and practitioners interested
in leveraging the enhanced Swin Transformer, as well as to encourage further
exploration and innovation in the field of computer vision.
2 Related Work
In this section, we will briefly describe the studies that are relevant to our
research.
on the same scale, thereby enhancing the connections among large-scale windows
and improving the model’s ability to capture both local and global information.
The Pyramid Swin Transformer for Multi-task is based on the Swin Trans-
former architecture [16] and incorporates multi-scale windows on the same scale
to address the lack of connections among large-scale windows. The architecture
consists of four stages, each with multiple layers of Swin Transformer blocks.
The input to the architecture is divided into non-overlapping patches, which are
then linearly embedded and processed through the hierarchical transformer lay-
ers. Each layer consists of two steps: one for window multi-head self-attention
and the other for shifted-window multi-head self-attention. We split the feature map into smaller blocks, and the number of different-sized windows in each layer follows a hierarchical progression from more to fewer, facilitating global connections. In each
stage, except for the fourth stage, the last layer has a 2 × 2 window, enhancing
window-to-window information interaction and increasing global relevance (Fig.
3).
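The layered window hierarchy can be illustrated with a simple partition function. This is a shape-level sketch only (the window attention itself is omitted), assuming an 8 × 8 feature map for brevity:

```python
import numpy as np

def window_partition(feat, win):
    """Partition an (H, W, C) feature map into non-overlapping
    win x win windows, returned as (num_windows, win, win, C)."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0
    return (feat.reshape(H // win, win, W // win, win, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, win, win, C))

feat = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
lower = window_partition(feat, 2)   # 16 small windows in the lower layer
upper = window_partition(feat, 4)   # 4 large windows in the upper layer
# each 4x4 upper-layer window covers the area of four 2x2 lower-layer windows
print(lower.shape, upper.shape)  # (16, 2, 2, 3) (4, 4, 4, 3)
```

Because the upper-layer window spans exactly four lower-layer windows, attention computed inside it connects features that the smaller windows keep separate, which is the cross-window link the pyramid provides.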
4 Result
We conduct experiments on ImageNet-1K image classification [7], COCO object
detection [14], ADE20K semantic segmentation [26] and Kinetics-400 video
recognition [11]. In the following sections, we will compare the suggested Pyra-
mid Swin Transformer architecture to the prior state-of-the-art on these tasks.
We use four NVIDIA Tesla V100 GPUs for training and testing.
Our proposed design, the Pyramid Swin Transformer, has been shown to
outperform several Transformer systems, even when utilizing a small model and
regular model (Pyramid Swin-S and Pyramid Swin-R) on ImageNet. However,
our design does not exhibit significant advantages over Transformer systems in
image classification. Compared to the original Swin Transformer and Swin Trans-
former V2 [15], our improved version achieves greater accuracy while utilizing
fewer parameters. For example, Pyramid Swin-R achieves the same accuracy as
SwinV2-B while utilizing fewer parameters. For bigger models, Pyramid Swin-
B and Pyramid Swin-L perform better than SwinV2 [15]. On the regular-size
5 Conclusion
The Multi-Task Pyramid Swin Transformer is a versatile and efficient architec-
ture for object detection, image classification, semantic segmentation, and video
recognition tasks. The architecture adeptly captures local and global contex-
tual information by employing more shift window operations and integrating
diverse window sizes on the same scale. The structure of the Multi-Task Pyra-
mid Swin Transformer is divided into four stages, each consisting of layers with
varying window sizes, facilitating a robust hierarchical representation. Differ-
ent numbers of layers with distinct windows and window sizes are utilized at
the same scale. Extensive evaluations across various tasks, including image clas-
sification on ImageNet, object detection on COCO, semantic segmentation on
ADE20K, and video recognition on Kinetics-400, demonstrate that the Multi-
Task Pyramid Swin Transformer exhibits exceptional detection and recognition
performance. This demonstrates its effectiveness, adaptability, and scalability
across various vision tasks. In conclusion, the Multi-Task Pyramid Swin Trans-
former is a promising architecture for multi-task learning in computer vision.
Its ability to capture local and global contextual information through diverse
window sizes makes it highly adaptable to various tasks. The exceptional per-
formance demonstrated in various benchmarks paves the way for future research
and applications in this field.
References
1. Ali, A., et al.: XCiT: cross-covariance image transformers. Adv. Neural. Inf. Pro-
cess. Syst. 34, 20014–20027 (2021)
2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a
video vision transformer. In: Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision, pp. 6836–6846 (2021)
3. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for
video understanding? In: ICML, vol. 2, p. 4 (2021)
4. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detec-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 6154–6162 (2018)
5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-
to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T.,
Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham
(2020). https://doi.org/10.1007/978-3-030-58452-8 13
6. MMSegmentation Contributors: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020). https://github.com/open-mmlab/mmsegmentation
7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale
hierarchical image database. In: 2009 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 248–255. IEEE (2009)
8. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image
recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
9. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the
IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
11. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint
arXiv:1705.06950 (2017)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
13. Li, Y., et al.: MViTv2: improved multiscale vision transformers for classification
and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 4804–4814 (2022)
14. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D.,
Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp.
740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48
15. Liu, Z., et al.: Swin transformer v2: scaling up capacity and resolution. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 12009–12019 (2022)
16. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted win-
dows. arXiv preprint arXiv:2103.14030 (2021)
17. Liu, Z., et al.: Video swin transformer. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 3202–3211 (2022)
18. Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
3163–3172 (2021)
19. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J.
Comput. Vision 115, 211–252 (2015)
20. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training
data-efficient image transformers & distillation through attention. In: International
conference on machine learning, pp. 10347–10357. PMLR (2021)
21. Wang, C., Endo, T., Hirofuchi, T., Ikegami, T.: Pyramid swin transformer:
different-size windows swin transformer for image classification and object detec-
tion. In: Proceedings of the 18th International Joint Conference on Computer
Vision, Imaging and Computer Graphics Theory and Applications, pp. 583–590
(2023). https://doi.org/10.5220/0011675800003417
22. Wang, W., et al.: Pyramid vision transformer: A versatile backbone for dense
prediction without convolutions. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 568–578 (2021)
23. Wang, W., et al.: PVT v2: improved baselines with pyramid vision transformer.
Comput. Vis. Media 8(3), 415–424 (2022)
24. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene
understanding. In: Proceedings of the European Conference on Computer Vision
(ECCV), pp. 418–434 (2018)
25. Zhang, P., et al.: Multi-scale vision longformer: a new vision transformer for high-
resolution image encoding. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pp. 2998–3008 (2021)
26. Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset.
Int. J. Comput. Vision 127, 302–321 (2019)
Person Activity Classification
from an Aerial Sensor Based
on a Multi-level Deep Features
1 Introduction
Nowadays, surveillance is seen as an essential step to preserve personal security. In fact, monitoring is mandatory in a number of locations, including malls, banks, airports, and protests. As a result, there is a growing need to automate surveillance tasks so as to support security guards and enable them to do their work more effectively. Indeed, abnormal person activity analysis is a practical illustration of an intelligent surveillance system. In fact, it entails identifying the performed activity label and thereafter recognizing the abnormal activity. Hence, significant interest has been devoted to person activity classification by the scientific community [8]. Particularly, person activity classification from an
aerial sensor allows monitoring wide regions and surveilling limited access areas
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 66–75, 2023.
https://doi.org/10.1007/978-3-031-45382-3_6
Person Activity Classification 67
2 Related Works
Referring to the literature [6], person activity classification methods can be categorized into handcrafted methods and deep learning methods.
In the context of handcrafted methods, Moussa et al. [9] introduced a spatial-features person activity classification method. Indeed, the SIFT (Scale Invariant Feature Transform) technique was applied to extract the points of interest. Then, the K-means technique was used to generate a BoW (Bag of Words), which matches each obtained feature vector to the closest visual word. Subsequently, the visual word frequency histogram was calculated. Finally, the person activity class was determined through an SVM (Support Vector Machine) classifier. In contrast to [9], Sabri et al. [11] proposed a spatio-temporal-features person activity classification method. Hence, the HOG (Histograms of Oriented Gradients) and the HOF (Histograms of Optical Flow) were combined to generate the spatio-temporal feature vectors. Finally, each feature vector was assigned to the closest visual word using the K-means algorithm. Within
the same framework, Burghouts et al. [4] used STIP (Spatio-Temporal Inter-
est Points) technique since it exhibited higher discrimination in the context of
person activity classification. In [6], the authors introduced a method based on
the skeleton representation of spatio-temporal features. Indeed, they extracted
relative and temporal derivatives of joint positions features which refer to spa-
tial and temporal features. The temporal features include joint acceleration and
joint velocity. Despite the fact that these methods asserted their performance
in the person activity classification field, they rely heavily on the choice of the
features extraction technique [2].
With regard to deep learning methods, Baccouche et al. [2] introduced a two-
part deep learning model for person activity classification. The first part consists
of a 3D-CNN (Convolutional Neural Networks) network that aims to extract and
learn spatio-temporal features. In fact, the 3D-CNN is an extension of the CNN
68 F. Bouhlel et al.
3 Proposed Method
3b, inception 4a, inception 4b, inception 4c, inception 4d, inception 4e, inception
5a, and inception 5b (cf. Fig. 1).
These inception modules integrate different-sized convolutions, enabling the
learning of features at varying scales. Figure 2 shows an illustrative example of
an inception module.
pooling layer applying the Global Average Pooling (GAP) strategy to reduce the size of the feature maps from (n × n × nc) to (1 × 1 × nc), where nc refers to the size of the third dimension of the feature maps. Indeed, the GAP not only makes it possible to extract discriminating information and avoid the problem of overfitting, but also reduces the total number of parameters and consequently the computation time. Next, we apply the dropout technique on the fully-connected layer FC which follows the GAP, to further mitigate the overfitting problem. In fact, the dropout technique is a regularization technique that addresses the over-learning constraint by temporarily deactivating some neurons (cf. Fig. 3).
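The GAP and dropout steps above can be sketched as follows; this is a NumPy illustration in which the map size (7 × 7) and nc = 1024 are example values, and the inverted-dropout scaling shown is one common variant:

```python
import numpy as np

def global_average_pool(fmap):
    """Reduce an (n, n, nc) feature map to (1, 1, nc): one mean per channel."""
    return fmap.mean(axis=(0, 1), keepdims=True)

def dropout(x, rate, rng):
    """Inverted dropout: randomly zero activations at train time and
    rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((7, 7, 1024))
pooled = global_average_pool(fmap)          # (1, 1, 1024)
dropped = dropout(pooled, rate=0.1, rng=rng)
```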
classification of the video sequence. The instant classification of the video frames
allows recognizing the activity in every frame of the sequence. This scenario of
classification is sought in the context of intelligent video surveillance systems
since an instant alert is set off in the case of distrustful activities. Figure 4 illus-
trates the proposed instant classification of a person’s activity.
As for the entire classification, it provides an activity label to the entire video
sequence. Thus, by resorting to instant classification scenario, we determine the
activity label. The average of each activity label throughout the video sequence
is then determined. Finally, we assign to the video sequence the label of the
activity class which has the highest average (cf. Fig. 5).
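The entire-classification scenario described above reduces to averaging the per-frame class scores over the sequence and taking the class with the highest average. A minimal sketch (the frame scores are illustrative values, not dataset outputs):

```python
import numpy as np

def classify_sequence(frame_scores):
    """Average each activity class's score over all frames of the video
    sequence and return the index of the class with the highest average."""
    return int(np.argmax(np.mean(frame_scores, axis=0)))

# 4 frames, 3 activity classes (illustrative per-frame scores)
frame_scores = np.array([[0.7, 0.2, 0.1],
                         [0.6, 0.3, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.8, 0.1, 0.1]])
label = classify_sequence(frame_scores)
print(label)  # 0: class 0 has the highest average score (0.575)
```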
4 Experimental Study
In order to evaluate the proposed method, we first describe the used dataset.
Subsequently, we compare and discuss the obtained results.
5 Conclusion
Person activity classification from an aerial sensor is a promising research area. In this context, we proposed a method that involves offline and inference phases. The offline phase uses convolutional neural networks to generate a person activity model. The inference phase makes use of the generated model to perform the person activity classification. Multi-level deep features constitute the main contribution of the proposed method, which aims to deal with the inter- and intra-class variation. Furthermore, we introduced two person activity classification scenarios: an instant and an entire classification. By means of a comparative study, conducted on the UCF-ARG dataset, we demonstrated the performance and the contribution of our method compared to state-of-the-art works. As future perspectives, we foresee introducing a bimodal person re-identification method which combines the person's appearance with their activity to enrich the semantic description of persons in the context of intelligent video surveillance systems.
References
1. AlDahoul, N., Md Sabri, A.Q., Mansoor, A.M.: Real-time human detection for
aerial captured video sequences via deep models. Comput. Intell. Neurosci. 2018
(2018)
2. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep
learning for human action recognition. In: Salah, A.A., Lepri, B. (eds.) HBU 2011.
LNCS, vol. 7065, pp. 29–39. Springer, Heidelberg (2011). https://doi.org/10.1007/
978-3-642-25446-8 4
3. Bouhlel, F., Mliki, H., Hammami, M.: Crowd behavior analysis based on convo-
lutional neural network: Social distancing control COVID-19. In: VISIGRAPP (5:
VISAPP), pp. 273–280 (2021)
4. Burghouts, G.J., Schutte, K.: Spatio-temporal layout of human actions for
improved bag-of-words action detection. Pattern Recogn. Lett. 34(15), 1861–1869
(2013)
5. Burghouts, G., van Eekeren, A., Dijk, J.: Focus-of-attention for human activity
recognition from UAVs. In: Electro-Optical and Infrared Systems: Technology and
Applications XI, vol. 9249, pp. 256–267. SPIE (2014)
6. Dang, L.M., Min, K., Wang, H., Piran, M.J., Lee, C.H., Moon, H.: Sensor-based and
vision-based human activity recognition: a comprehensive survey. Pattern Recogn.
108, 107561 (2020)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
8. Mliki, H., Bouhlel, F., Hammami, M.: Human activity recognition from UAV-
captured video sequences. Pattern Recogn. 100, 107140 (2020)
9. Moussa, M.M., Hamayed, E., Fayek, M.B., El Nemr, H.A.: An enhanced method
for human action recognition. J. Adv. Res. 6(2), 163–169 (2015)
10. Nagendran, A., Harper, D., Shah, M.: UCF-ARG dataset, university of central
Florida (2010). http://crcv.ucf.edu/data/UCF-ARG.php
11. Sabri, A., Boonaert, J., Lecoeuche, S., Mouaddib, E.: Caractérisation spatio-temporelle des co-occurrences par ACP à noyau pour la classification des actions humaines [Spatio-temporal characterization of co-occurrences by kernel PCA for human action classification]. In: GRETSI 2013 (2013)
12. Sargano, A.B., Wang, X., Angelov, P., Habib, Z.: Human action recognition using
transfer learning with deep representations. In: 2017 International Joint Conference
on Neural Networks (IJCNN), pp. 463–469. IEEE (2017)
13. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
14. Wang, L., Xu, Y., Cheng, J., Xia, H., Yin, J., Wu, J.: Human action recognition
by learning spatio-temporal features with deep neural networks. IEEE Access 6,
17913–17922 (2018)
Person Quick-Search Approach Based
on a Facial Semantic Attributes
Description
1 Introduction
Person search in real-world scenarios is a challenging computer vision task that attracts the interest of several researchers. The main objective is to locate a suspect or to find a missing person in public areas such as airports, shopping malls,
parks, among others. Traditional person search methods [9,15] are based on
appearance characteristics covering the whole human body which may present
limited capacity for automated surveillance solutions. Since the face remains the
most informative and accessible source about a person, the recent advancement
of research in facial semantic attributes detection has provided the way for per-
son search using a facial semantic attributes description (i.e. gender, age and
ethnicity).
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 76–87, 2023.
https://doi.org/10.1007/978-3-031-45382-3_7
Person Quick-Search Approach 77
In the literature, several studies have addressed facial semantic attribute
detection. Early methods [8,12,17], known as handcrafted methods, are based
on a standard descriptor followed by a statistical classifier. The recent success
of Convolutional Neural Network (CNN) based methods has encouraged the
research community to adapt CNNs to facial semantic attribute detection. For
gender classification, Aslam et al. [3] applied the Cascaded Deformable Shape
model to extract the facial feature regions, namely the mouth, the nose, the eyes,
and the foggy faces. A four-dimensional (4-D) representation is then constructed
from these facial feature regions. Finally, the pre-trained VGG-16 model [21] is
fine-tuned on the 4-D array for the final gender decision. In addition, Serna
et al. [20] proposed a gender classification method that provides a preliminary
investigation of how biased data impact the learning process of deep neural
network architectures in terms of activation level. Two gender detection models,
based on VGG-16 and ResNet, are used to assess the impact of bias on the
learning process.
Regarding age group classification, Chen et al. [4] proposed to fine-tune the
VGG-Face model and adapt it to the age classification task. The activations of
the penultimate and last layers of VGG-Face are then used as separate local
features to feed the maximum joint probability classifier (MJPC). In the same
context, a survey of different CNN architectures for facial age classification is
presented in [1]. In addition, several studies proposed joint gender and age
classification methods. In this context, Duan et al. [7] combined an Extreme
Learning Machine (ELM) [11] with a Convolutional Neural Network (CNN) in
one network and used the interaction of the two classifiers to handle gender and
age group classification. The convolutional layers of the CNN were applied to
extract features; the ELM structure was then combined with the fully connected
layers of the CNN model to generate a hybrid gender and age group classification
model.
As for the ethnicity classification, Ahmed et al. [2] proposed a new nine-layer
CNN architecture called “RNet”. The first six layers constitute the convolution
layers, while the final three layers are fully connected (FC) layers followed by
the softmax loss function. In addition, the Dropout regularization technique is
used to avoid overfitting. Finally, the trained model is optimized using stochastic
gradient descent. In the same context, Luong et al. [16] improved face ethnicity,
age, and gender classification performance by using contrastive-loss-based metric
learning: a multi-output supervised contrastive loss was applied to the ResNet-50
model. Kärkkäinen et al. [13] used the ResNet model with the softmax loss to
perform face ethnicity, age, and gender classification. The proposed model was
optimized using the Adaptive Moment estimation (ADAM) algorithm.
Referring to the studied related works, we noticed that CNN-based methods
achieve significant results thanks to their capability to capture complex visual
variations from large amounts of training data, without the need for any specific
descriptor. Therefore, the proposed approach relies on CNN-based models to
tackle the facial semantic attributes detection.
2 Proposed Approach
The proposed approach introduces a new person search solution based on a
facial semantic attributes description provided by an eyewitness in the form of a
query. The main objective is to determine a list of people matching the query
using facial semantic attribute prediction models. As illustrated in Fig. 1, the
proposed approach consists of two main steps: (1) facial semantic attributes
detection, and (2) multi-attributes score fusion.
Fig. 1. Proposed approach for person search based on facial semantic attributes
description.
where A_i denotes one of the semantic attributes (gender, age, and ethnicity)
and S_i its status. The S_i values depend on the attribute. For the gender
attribute, we deal with two values (male/female). For the age attribute, we
handle four values corresponding to the four age groups: 0–9, 10–29, 30–49, and
more than 50. Regarding the ethnicity attribute, we address four ethnicity
groups: Asian, Black, Indian, and White. A query may not specify the values of
all three attributes; unselected attributes are not considered during the search
process.
Gender Classification. For the first model, the deep features are extracted using
the fine-tuned VGG-16 model [21] and the classification is achieved using the
softmax classifier. For the second model, the handcrafted features are computed
using MB-LBP (Multi-Block LBP) [26], an extension of the LBP descriptor [18],
and the classification is performed using an SVM classifier.
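As an illustration of the MB-LBP idea (LBP computed over block averages rather than single pixels), the following sketch computes one 8-bit code; the block size, scan pattern, and bit ordering are illustrative choices, not necessarily the exact configuration of [26]:

```python
def block_mean(img, r0, c0, s):
    # Mean intensity of the s-by-s block whose top-left corner is (r0, c0).
    vals = [img[r][c] for r in range(r0, r0 + s) for c in range(c0, c0 + s)]
    return sum(vals) / len(vals)

def mb_lbp_code(img, r0, c0, s):
    """8-bit MB-LBP code for the 3x3 grid of s-by-s blocks at (r0, c0):
    each surrounding block's mean is thresholded against the central block's mean."""
    center = block_mean(img, r0 + s, c0 + s, s)
    # Clockwise order of the eight neighbouring blocks (top-left first).
    offsets = [(0, 0), (0, s), (0, 2 * s), (s, 2 * s),
               (2 * s, 2 * s), (2 * s, s), (2 * s, 0), (s, 0)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if block_mean(img, r0 + dr, c0 + dc, s) >= center:
            code |= 1 << bit
    return code

# 6x6 toy image with 2x2 blocks: left half dark, right half bright.
img = [[0, 0, 0, 9, 9, 9]] * 6
code = mb_lbp_code(img, 0, 0, 2)
```

A histogram of such codes over the face region would then form the handcrafted feature vector fed to the SVM.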
The scores output by the softmax and the SVM classifiers indicate the
probabilities of the gender classes. In the testing phase, we fused the scores
extracted from the softmax and the SVM classifiers into a final score using the
maximum rule defined as follows:

S = max(S_CNN, S_MB-LBP)

where S_CNN is the obtained score from the deep learned features based model
and S_MB-LBP is the obtained score from the handcrafted features based model.
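Concretely, the maximum rule amounts to an element-wise maximum over the two models' class probabilities; a minimal Python sketch (function and variable names are ours, not from the paper):

```python
def fuse_max(s_cnn, s_mblbp):
    """Max-rule fusion of per-class scores from the two gender models.

    s_cnn   : class probabilities from the fine-tuned VGG-16 softmax
    s_mblbp : class probabilities from the MB-LBP + SVM model
    """
    return [max(a, b) for a, b in zip(s_cnn, s_mblbp)]

# Softmax model says (male 0.7, female 0.3); SVM model says (0.4, 0.6).
fused = fuse_max([0.7, 0.3], [0.4, 0.6])
decision = "male" if fused[0] >= fused[1] else "female"
```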
Age Groups Classification. Several studies [22,23] show that the aging process
differs from one gender to another. Therefore, we propose to improve the age
groups classification by exploring the correlation between age and gender
information [5]. In addition, in order to reduce the confusion between intra- and
inter-age groups, we proceed with a two-level age classification strategy, based
on deep learning, which consists of the age models generation and the age
classification process. The proposed method for age groups classification is
illustrated in Fig. 3.
For the age group prediction models, we adapted the FaceNet model [19] to the
context of each classification level. First, the dedicated first-level CNN model
classifies the age groups into two categories: children and adults. Second, the
second-level model of the detected class refines the classification. Thus, a face
classified as “adult” is further classified into one of three age groups: “young
adult”, “middle-aged adult”, or “old adult”.
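The two-level decision logic can be sketched as follows; the `predict_*` functions are stubs standing in for the fine-tuned FaceNet-based models, and the exact label strings are illustrative:

```python
def predict_level1(face):
    # Stub for the first-level model: children vs. adults.
    return "adult"

def predict_adult_level2(face):
    # Stub for the second-level model refining the "adult" class.
    return "middle-aged adult"

def classify_age(face):
    """Two-level age classification: coarse split, then refinement."""
    if predict_level1(face) == "child":
        return "child"                    # youngest age group
    return predict_adult_level2(face)     # young / middle-aged / old adult

group = classify_age(None)
```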
Ethnicity Classification. For the ethnicity classification, the softmax loss is
jointly optimized with the center loss:

L = L_s + µL_c  (3)

where L_s is the traditional softmax loss, µ is the penalty term weighting the
center loss [24], and L_c is defined in Eq. 4:

L_c = (1/2) Σ_{i=1}^{N} ||x_i − c_{y_i}||²_2  (4)

where x_i is the deep feature of the i-th sample and c_{y_i} denotes the y_i-th
class center of the deep features.
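For illustration, Eq. 4 can be computed directly from a batch of features, labels, and class centers (a plain-Python sketch; during training the centers are updated alongside the network, which is not shown here):

```python
def center_loss(features, labels, centers):
    """L_c = 1/2 * sum_i ||x_i - c_{y_i}||^2 over the batch (Eq. 4)."""
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 0.5 * total

feats = [[1.0, 0.0], [0.0, 1.0]]          # two deep features
labs = [0, 1]                             # their class labels
cents = {0: [1.0, 0.0], 1: [0.0, 0.0]}    # current class centers
lc = center_loss(feats, labs, cents)      # only the second sample deviates
```

The joint objective of Eq. 3 is then the softmax loss plus µ times this value.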
Multi-attributes Score Fusion. In this step, we compute the final score that
describes how well a candidate matches the query. This involves a function F
that fuses the scores of the semantic attribute prediction models involved in the
query Q to obtain a final score FS, F : Q → FS. Formally, we have a set of N
attributes A = {A_1, ..., A_N} whose prediction scores are given as probabilities
P = {P_1, ..., P_N}, where P_i ∈ [0, 1]. To merge the scores, we apply the sum
of the probabilities (cf. Eq. 5) [14]:

FS = Σ_{i=1}^{N} P_i  (5)
Then, the faces are sorted by these scores in descending order; the first rank is
given to the person most similar to the query. This system helps security agents
locate a suspect in a video surveillance network by reducing the list of possible
suspects. Identifying the target person from a reduced list thus saves execution
time and increases the chances of finding the requested person.
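The sum-rule fusion and descending sort can be sketched as follows (names are illustrative; each candidate face carries one probability per attribute selected in the query):

```python
def rank_candidates(candidates):
    """Sum-rule fusion (Eq. 5) followed by a descending sort.

    `candidates` maps a face id to its list of per-attribute probabilities P_i
    (unselected attributes are simply omitted from the list).
    Returns (face_id, final_score) pairs, best match first.
    """
    scored = {face: sum(probs) for face, probs in candidates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Query "male, 30-49, White" -> three per-attribute probabilities per face.
ranking = rank_candidates({
    "face_A": [0.9, 0.6, 0.8],
    "face_B": [0.5, 0.7, 0.4],
})
```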
3 Experimental Study
The proposed approach was evaluated on the FairFace dataset, which contains
108,501 images gathered from the YFCC-100M Flickr dataset and labeled with
gender, age group, and ethnicity. It defines 7 ethnicity groups (White, East
Asian, Southeast Asian, Middle East, Indian, Black, and Latino) aged from 0 to
more than 70. For a fair comparison with [13,16], we selected 4 ethnicity groups
(White, Black, Asian, and Indian) from among the 7. The images were collected
under real-world conditions (e.g. low resolution, pose variation, occlusion, and
illumination variation).
Referring to Table 2, the proposed methods for age group and ethnicity
classification recorded accuracy rates of 68.45% and 85.50%, respectively, on the
FairFace dataset, outperforming the state-of-the-art methods. Such performance
confirms the effectiveness of the proposed methods. As for the gender
classification, the proposed method achieved competitive results. Compared to
Luong et al. [16], we achieved a gain of 1.01%, which validates the efficiency of
combining deep-learned and handcrafted features. However, the method
proposed in [13], which is based on the ResNet model, slightly outperforms ours.
Although it provides good results, its deep architecture leads to high
computational cost, making it an expensive option for real-time applications.
Table 3. Qualitative results of the proposed person search approach based on a seman-
tic description in terms of attribute queries.
4 Conclusion
In this paper, we introduced a new people search approach, “Quick-Search”,
based on a facial semantic attributes description provided by eyewitnesses. First,
we proposed facial semantic attribute detection methods, generating three
prediction models: a gender classification model, an age groups classification
model, and an ethnicity classification model. The gender classification method
is based on a hybrid architecture that combines deep-learned and handcrafted
features through information fusion at the score level. For the age groups
classification, we proposed to explore the correlation between age and gender
information and, in order to reduce the confusion between intra- and inter-age
groups, we proceeded with a two-level age classification strategy. Regarding the
ethnicity classification, we proposed to jointly optimize the softmax loss function
with the center loss function. We then evaluated the performance of the
multi-attributes search approach based on the learned attribute classifiers.
Experimental results illustrate the effectiveness of the proposed approach. The
present study provides novel insights into integrating a soft facial semantic
attributes description with person appearance features to enhance the
effectiveness of person search in the context of video surveillance.
References
1. Agbo-Ajala, O., Viriri, S.: Deep learning approach for facial age classification: a
survey of the state-of-the-art. Artif. Intell. Rev. 54, 1–35 (2020)
2. Ahmed, M.A., Choudhury, R.D., Kashyap, K.: Race estimation with deep networks.
J. King Saud Univ. Comput. Inf. Sci. 34, 4579–4591 (2020)
3. Aslam, A., Hussain, B., Cetin, A.E., Umar, A.I., Ansari, R.: Gender classifica-
tion based on isolated facial features and foggy faces using jointly trained deep
convolutional neural network. J. Electron. Imaging 27(5), 053023 (2018)
4. Chen, L., Fan, C., Yang, H., Hu, S., Zou, L., Deng, D.: Face age classification based
on a deep hybrid model. SIViP 12(8), 1531–1539 (2018)
5. Dammak, S., Mliki, H., Fendri, E.: Gender effect on age classification in an uncon-
strained environment. Multimedia Tools Appl. 80(18), 28001–28014 (2021)
6. Dammak, S., Mliki, H., Fendri, E.: Gender estimation based on deep learned and
handcrafted features in an uncontrolled environment. Multimedia Syst. 29, 1–13
(2022)
7. Duan, M., Li, K., Yang, C., Li, K.: A hybrid deep learning CNN-ELM for age and
gender classification. Neurocomputing 275, 448–461 (2018)
8. Eidinger, E., Enbar, R., Hassner, T.: Age and gender estimation of unfiltered faces.
IEEE Trans. Inf. Forensics Secur. 9(12), 2170–2179 (2014)
9. Frikha, M., Fendri, E., Hammami, M.: People search based on attributes descrip-
tion provided by an eyewitness for video surveillance applications. Multimedia
Tools Appl. 78, 2045–2072 (2019)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
11. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and appli-
cations. Neurocomputing 70(1–3), 489–501 (2006)
12. Jagtap, J., Kokare, M.: Human age classification using facial skin aging features
and artificial neural network. Cogn. Syst. Res. 40, 116–128 (2016)
13. Kärkkäinen, K., Joo, J.: FairFace: face attribute dataset for balanced race, gender,
and age. arXiv preprint arXiv:1908.04913 (2019)
14. Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining classifiers. IEEE Trans.
Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
15. Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., Wang, X.: Person search with natural
language description. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 1970–1979 (2017)
16. Luong, T.K., Hsiung, P.A., Han, Y.T.: Improve gender, race, and age classifica-
tion with supervised contrastive learning (2021). https://doi.org/10.13140/RG.2.
2.14680.01286
17. Mohamed, S., Nour, N., Viriri, S.: Gender identification from facial images using
global features. In: Conference on Information Communications Technology and
Society (ICTAS), pp. 1–6. IEEE (2018)
18. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures
with classification based on featured distributions. Pattern Recogn. 29(1), 51–59
(1996)
19. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face
recognition and clustering. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 815–823 (2015)
20. Serna, I., Pena, A., Morales, A., Fierrez, J.: InsideBias: measuring bias in deep
networks and application to face gender biometrics. In: 2020 25th International
Conference on Pattern Recognition (ICPR), pp. 3720–3727. IEEE (2021)
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: arXiv preprint arXiv:1409.1556, pp. 1–14 (2014)
22. Smulyan, H., Asmar, R.G., Rudnicki, A., London, G.M., Safar, M.E.: Comparative
effects of aging in men and women on the properties of the arterial tree. J. Am.
Coll. Cardiol. 37(5), 1374–1380 (2001)
23. Sveikata, K., Balciuniene, I., Tutkuviene, J.: Factors influencing face aging.
Literature review. Stomatologija 13(4), 113–116 (2011)
24. Wang, J., Feng, S., Cheng, Y., Al-Nabhan, N.: Survey on the loss function of deep
learning in face recognition. J. Inf. Hiding Priv. Prot. 3(1), 29–47 (2021)
25. Wen, Y., Zhang, K., Li, Z., Qiao, Yu.: A discriminative feature learning approach
for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.)
ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.
org/10.1007/978-3-319-46478-7 31
26. Zhang, L., Chu, R., Xiang, S., Liao, S., Li, S.Z.: Face detection based on multi-
block LBP representation. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol.
4642, pp. 11–18. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-
74549-5 2
Age-Invariant Face Recognition Using
Face Feature Vectors and Embedded
Prototype Subspace Classifiers
Anders Hast(B)
1 Introduction
The study explores various models for face recognition (FR) and their perfor-
mance based on Face Feature Vectors (FFV). It also examines whether Embed-
ded Prototype Subspace Classification (EPSC) can improve the accuracy in FR
with age variations. The overall aim is to achieve age-invariant face recognition
(AIFR).
The reliability of face recognition (FR) performed by Intelligent Vision Sys-
tems is constantly improving. Facial features are captured through images or
videos and used as biometric markers. Face recognition methods involve ana-
lyzing these features and encoding them as Face Feature Vectors (FFV). These
vectors enable the identification and verification of a face by matching it with
a known identity in a face database. Hence, FFV serves as a valuable tool for
identification and verification purposes.
However, the impact of age progression on facial features poses a considerable
challenge for FR systems. As individuals age, their faces undergo changes that
include geometric alterations, changes in facial hair, and the use of glasses, among
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 88–99, 2023.
https://doi.org/10.1007/978-3-031-45382-3_8
other factors. Although biometric markers, such as computed FFVs, are intended
to be unaffected by such factors, FR systems become less reliable as the age
range expands. This study investigates how well different models for FR and
their respective FFV perform and whether EPSC could enhance the accuracy
of FR in the presence of such age variations, with the goal of achieving AIFR.
EPSC has already proven to be able to classify datasets of various kinds, such
as digits, words and objects [9–12].
2 Previous Work
In their paper, Sawant and Bhurchandi [33] examine the difficulties encountered
by practical AIFR systems concerning appearance variations and resemblances
between subjects. They classify AIFR techniques into three categories: genera-
tive, discriminative, and deep learning, with each approach addressing the prob-
lem from a distinct standpoint.
The performance of generative approaches depends on aging models and age
estimation. The main idea is to transform the recognition problem into general
FR by simulating the aging process and synthesising the face images [2,5,6,21,
31,32], or the extracted features [27], of the target age into the same age group.
Enhancements in AIFR performance have been achieved using deep learning
techniques; however, these methods require considerable training data, posing a
challenge when working with smaller databases.
Discriminative techniques, on the other hand, depend on learning methods
and local features. They strive to identify robust features for identity recognition
that remain unaffected by age. The goal of both discriminative and generative
approaches is to reduce the impact of age variation on FR systems. This study
concentrates on discriminative methods that employ FFVs obtained from any FR
pipeline. However, relatively little has been written about discriminative
methods for AIFR compared to other areas of FR. Nevertheless, an overview of
the more recent papers will be given before showing how EPSC can improve
classification.
Several works propose to decompose the features into age- and identity-related
features. For example, Gong et al. [7] proposed Hidden Factor Analysis, using
a probabilistic model on these two kinds of features. An Expectation
Maximisation learning algorithm was used to estimate the latent factors and
model parameters, and improvements over the then state-of-the-art algorithms
were reported on two public face aging datasets. Similarly, Huang et al. [13,14]
introduced MTLFace, a multi-task learning framework for AIFR and
identity-level face age synthesis. They showed how improved performance can
be achieved by a selective fine-tuning strategy. Experiments demonstrate the
superiority of MTLFace, and the authors suggest a newly collected dataset could
further advance development in this area. Xu and Ye [40] introduced a solution
to the AIFR problem using coupled auto-encoder networks (CAN) and nonlinear
factor analysis. By utilising CAN, the identity feature can be separated
non-linearly to become age invariant in a given face image.
Zheng et al. [43] developed a novel deep face recognition network called AE-
CNN, which uses age estimation to separate age-related variations from stable
person-specific features. The CNN model learns age-invariant features for FR,
and the proposed approach was tested on two public datasets, showing good
results. Wang et al. [37] introduced a similar approach that enhances AIFR by
separating deep face features into two components: age-related and
identity-related. This enabled the extraction of highly discriminative
age-invariant features using a multi-task deep CNN model.
Li et al. [23] proposed a discriminative model for age-invariant face
recognition. The approach uses a patch-based local feature representation scheme
and a multi-feature discriminant analysis (MFDA) method to refine the feature
space for enhanced recognition performance. Ling et al. [24,25] proposed a robust
face descriptor called the gradient orientation pyramid, which captures hierar-
chical facial information. The new approach was compared to several techniques
and demonstrated promising results on challenging passport databases, even
with large age differences. Yan et al. [41] propose an AIFR approach called MFD,
which combines feature decomposition with fusion based on the face time series
to effectively represent identity information that is not sensitive to the aging
process. Self-attention is also employed to capture global information from facial
feature series, and a weighted cross-entropy loss is adopted to address imbalanced
age distribution in training data. Gong et al. [8] proposed a new approach for
AIFR using a maximum entropy feature descriptor and identity factor analysis.
The maximum entropy feature descriptor encodes the microstructure of facial
images and extracts discriminatory and expressive information. Li et al. [22] pro-
posed a two-level hierarchical learning model for AIFR. The first level involves
extracting features by selecting local patterns that optimize common informa-
tion. The second level involves refining these features using a scalable high-level
feature refinement framework to create a powerful face representation. Experi-
ments demonstrated improvement over existing methods on one public face aging
dataset.
This work takes the discriminative approach described above; therefore, the
FFVs resulting from face detection on face images and subsequent feature
extraction are used as biometric markers. Several FR models are investigated
and compared. Two freely available major FR pipelines were used and compared,
as described below.
3.1 InsightFace
The InsightFace [15] pipeline is an integrated Python library for 2D and 3D face
analysis, mainly based on PyTorch and MXNet. InsightFace efficiently
implements a rich variety of state-of-the-art algorithms for face detection, face
alignment, and face recognition, such as RetinaFace [4]. It allows automatic
extraction of highly discriminative FFVs for each face, based on the Additive
Angular Margin Loss (ArcFace) approach [3]. The models used are buffalo_l,
antelopev2, buffalo_m, and buffalo_s, all provided by InsightFace.
4 Datasets
In the automatic FR process, images in which faces were not properly detected,
or which contained more than one face, were removed. Furthermore, only person
identities with several face images covering several age decades were selected.
Some mislabeled faces and corrupt FFVs were also removed.
4.1 AgeDB
The AgeDB dataset [29] contains 16,516 images. Of those, 9,826 face images were
extracted such that each person depicted has about 36 face images on average,
covering at least three different age decades (0–10, 11–20, 21–30, 31–40, 41–50,
51–60, 61–70, and 70+). In addition, each included person was required to have
at least 30 face images, to ensure that several face images of the same person
exist at different ages and decades.
4.2 CASIA
The original CASIA-WebFace dataset [42] is rather large, containing around
500k images. A selection process similar to the one used for AgeDB resulted in
a much smaller and more feasible dataset of 65,579 face images, which is
nonetheless considerably larger than the AgeDB selection, with more than 50
face images per person on average.
EPSC, a shallow model that utilizes PCA and dimensionality reduction tech-
niques such as t-SNE, UMAP, or SOM [12], has advantages over various deep
learning methods, including not requiring powerful GPU resources for training
due to the absence of hidden layers. EPSC creates subspaces that specialize in
identifying class variations from feature vectors. Although EPSC may not always
achieve higher accuracy than deep learning methods, it offers faster learning and
inference, interpretability, explainability, and ease of visualisation.
Every detected face to be classified is represented by an FFV x with m
real-valued elements, x = {x_1, x_2, ..., x_m}, x_j ∈ R, so that the operations
take place in an m-dimensional vector space R^m. Here, m is the FFV length,
which depends on the model and is usually 128 or 512. Any set of n linearly
independent basis vectors {u_1, u_2, ..., u_n}, where u_i = {w_{1,i}, w_{2,i},
..., w_{m,i}}, w_{j,i} ∈ R, which can be combined into an m × n matrix
U ∈ R^{m×n}, spans a subspace L_U:

L_U = {x̂ | x̂ = Σ_{i=1}^{n} ρ_i u_i, ρ_i ∈ R}  (1)

where

ρ_i = x^T u_i = Σ_{j=1}^{m} x_j w_{j,i}  (2)

Therefore, the feature vector x that is most similar to the feature vectors used
to construct the subspace in question, L_{U_k}, will have the largest norm
||x̂||_2.
The subspace classification explained above can be regarded as a two-layer
neural network [11,20,30], where the weights are all defined mathematically
through Principal Component Analysis (PCA) [20] instead of time-consuming
backpropagation.
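A toy illustration of subspace classification per Eqs. 1 and 2: each class is represented by an orthonormal basis, and a query FFV is assigned to the class whose projection has the largest norm. The hand-picked bases below are for brevity only; in EPSC they would come from PCA over the class's FFVs:

```python
def projection_norm_sq(x, basis):
    # ||x_hat||^2 = sum_i rho_i^2 for an orthonormal basis, with
    # rho_i = x . u_i (Eq. 2); no explicit reconstruction is needed.
    return sum(sum(xj * wj for xj, wj in zip(x, u)) ** 2 for u in basis)

def classify(x, subspaces):
    """Assign x to the class whose subspace projection has the largest norm."""
    return max(subspaces, key=lambda k: projection_norm_sq(x, subspaces[k]))

subspaces = {
    "person_A": [[1.0, 0.0, 0.0]],                   # spans the x-axis
    "person_B": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # spans the yz-plane
}
who = classify([0.9, 0.1, 0.2], subspaces)
```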
Generally, a group of prototypes within each class needs to be chosen for the
construction of each subspace; this is done via the embedding obtained from a
dimensionality reduction method such as t-SNE [26], UMAP [28], or SOM [19].
Here, however, one subspace per person was constructed, as explained later,
regardless of age, since relatively few FFVs are available per person.
6 Experiments
In order to verify the efficiency of EPSC, experiments were conducted on each
dataset as follows. Each FFV was classified using the EPSC approach: subspaces
were created for all class labels using all the FFVs of each class, except the FFV
under test, which was temporarily removed so that its class subspace was created
from the remaining FFVs. In this way, the FFV to be classified was never used
to create the subspaces. Therefore, the class to be classified always has a slight
disadvantage compared to the others. The result nevertheless indicates how well
AIFR can be performed using EPSC.
This approach is compared with a nearest neighbour classifier (NNC), where
each face image, with its corresponding FFV, is given the same label as its
closest neighbour in FFV space using the cosine similarity, which is simply the
dot product between two normalised FFVs.
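A minimal sketch of the NNC baseline (pure Python; since the FFVs are normalised, the cosine reduces to a dot product, but the general form is shown):

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def nnc_label(query, gallery):
    """Label of the gallery (FFV, label) pair most similar to the query."""
    return max(gallery, key=lambda pair: cosine(query, pair[0]))[1]

gallery = [([1.0, 0.0], "alice"), ([0.0, 1.0], "bob")]
label = nnc_label([0.9, 0.1], gallery)
```

Note that every query is compared against the whole gallery, which is the source of the quadratic cost discussed below.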
The projection depth variable n is set to 3, which was experimentally found to
be a good value. This means that, for each class, a dot product is computed
three times, whereas for NNC it is computed as many times as the size of the
dataset. NNC generally gives very good classification but is too costly to use in
practice, as it has quadratic time complexity, even if there are methods to reduce
the search space. EPSC, on the other hand, is a very fast and efficient classifier
with linear time complexity.
Fig. 1. Histograms of intra similarities (blue bars = same class label) and inter
similarities (red bars = different class labels) for the FFV (left) and EPSC
(right). The means of both similarities are indicated with vertical lines. (Color
figure online)

Histograms are shown in Fig. 1, comprising both the distribution of the intra
similarities, i.e. the cosine similarity between different FFVs of the same class
(blue bars), and the distribution of the inter similarities, i.e. the similarity
between FFVs of one class and FFVs of all other classes (red bars). There is an
overlap between the two distributions in Fig. 1a, which shows where
misclassifications generally occur. When using EPSC, the overlap becomes
smaller. The Bhattacharyya distance and the related Bhattacharyya coefficients
[1] are reported in Table 1.
The Bhattacharyya distance and coefficient have various applications in
statistics, pattern recognition, and image processing. They are used to compare
image similarity and to cluster data points based on their probability
distributions, and machine learning algorithms such as neural networks and
support vector machines also use them. The coefficient is bounded between 0
and 1, and both measures are symmetric.
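For two discrete histograms p and q, the Bhattacharyya coefficient is BC = Σ_i √(p_i q_i) and the distance is D_B = −ln BC. A small sketch (the example histograms are illustrative, not the paper's measured distributions):

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya coefficient and distance for two discrete distributions."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return bc, -math.log(bc)

# Toy intra-/inter-similarity histograms: the less they overlap,
# the smaller the coefficient and the larger the distance.
intra = [0.1, 0.4, 0.5]
inter = [0.5, 0.4, 0.1]
bc, db = bhattacharyya(intra, inter)
```

Identical distributions give BC = 1 and distance 0, so a smaller coefficient between the intra and inter histograms indicates better class separation.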
Fig. 2. Similarity matrices showing the accuracy for classifying a person from one age
group (per column) using images from another age group (per row). EPSC is used on
the left and the FFV on the right. The Macro Average Arithmetic is given and shows
that EPSC yields an overall improvement.
7 Discussion
The selected images from both AgeDB and CASIA contain a rather large age
span per person, which makes FR quite challenging. They also contain some
very hard-to-recognise images, where faces are semi-occluded by hands and
glasses. Furthermore, some images are of low resolution or generally blurry, and
others exhibit strange grimaces, etc. Nevertheless, it can be noted from Table 1
that the models provided by InsightFace generally performed better than those
of DeepFace. The reason most probably lies in the fact that the latter is a
lightweight FR framework employing fast and simple alignment. It therefore
fails more often on such hard cases; in particular, it fails to capture profile faces
and to create good FFVs for such difficult face postures, hence the smaller
number of produced FFVs.
Since the model buffalo_l generally performed best, it was chosen for the
subsequent experiments. The FFVs result in well-separated intra- and inter-class
similarities, which are further enhanced by EPSC, as demonstrated in Fig. 2.
This is also confirmed by the Bhattacharyya distances and coefficients in Table 1.
It should be added that no prototype embedding was used for the EPSC,
which means that only one subspace per class was created. By using more than
one, the accuracies might increase, especially when many images per class are
available. Another option would be to use smaller age spans in the computation
of the similarity matrices in Fig. 2. This is proposed for future research.
It is not claimed that EPSC is more accurate than the other machine learning
or deep learning based approaches covered in Sect. 2. However, due to its
simplicity and the fact that it uses the FFVs only, it should be a viable
alternative to more complex methods.
8 Conclusion
It can be concluded that, among the two tested pipelines for FR, the FFVs
from InsightFace using the model buffalo_l produced the best results for the
goal of achieving AIFR using the discriminative approach explained previously.
NNC performs very well but is impractical due to its quadratic time complexity.
EPSC, with its linear time complexity, increased the accuracies and is therefore
both faster and more efficient.
Acknowledgments. This work has been partially supported by the Swedish Research
Council (Dnr 2020-04652; Dnr 2022-02056) in the projects The City’s Faces. Visual
culture and social structure in Stockholm 1880-1930 and The International Centre for
Evidence-Based Criminal Law (EB-CRIME). The computations were performed on
resources provided by SNIC through UPPMAX under project SNIC 2021/22-918.
References
1. Bhattacharyya, A.: On a measure of divergence between two statistical populations
defined by their probability distribution. Bull. Calcutta Math. Soc. 35, 99–110
(1943)
2. Deb, D., Aggarwal, D., Jain, A.K.: Identifying missing children: Face age-
progression via deep feature aging. In: 2020 25th International Conference on
Pattern Recognition (ICPR), pp. 10540–10547 (2021). https://doi.org/10.1109/
ICPR48806.2021.9411913
3. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for
deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4690–4699 (2019)
4. Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., Zafeiriou, S.: Retinaface: single-stage
dense face localisation in the wild (2019). https://doi.org/10.48550/ARXIV.1905.
00641, https://arxiv.org/abs/1905.00641
5. Duong, C., Quach, K., Luu, K., Le, T., Savvides, M.: Temporal non-volume pre-
serving approach to facial age-progression and age-invariant face recognition. In:
2017 IEEE International Conference on Computer Vision (ICCV), pp. 3755–3763.
IEEE Computer Society, Los Alamitos, CA, USA (2017). https://doi.org/10.1109/
ICCV.2017.403, https://doi.ieeecomputersociety.org/10.1109/ICCV.2017.403
6. Geng, X., Zhou, Z., Smith-Miles, K.: Automatic age estimation based on facial
aging patterns. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2234–2240 (2007).
https://doi.org/10.1109/TPAMI.2007.70733
7. Gong, D., Li, Z., Lin, D., Liu, J., Tang, X.: Hidden factor analysis for age invariant
face recognition. In: 2013 IEEE International Conference on Computer Vision, pp.
2872–2879 (2013). https://doi.org/10.1109/ICCV.2013.357
8. Gong, D., Li, Z., Tao, D., Liu, J., Li, X.: A maximum entropy feature descriptor
for age invariant face recognition. In: 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 5289–5297 (2015). https://doi.org/10.1109/
CVPR.2015.7299166
9. Hast, A.: Magnitude of semicircle tiles in Fourier-space : a handcrafted feature
descriptor for word recognition using embedded prototype subspace classifiers. J.
WSCG 30(1–2), 82–90 (2022). https://doi.org/10.24132/JWSCG.2022.10
10. Hast, A., Lind, M.: Ensembles and cascading of embedded prototype subspace clas-
sifiers. J. WSCG 28(1/2), 89–95 (2020). https://doi.org/10.24132/JWSCG.2020.
28.11
11. Hast, A., Lind, M., Vats, E.: Embedded prototype subspace classification: a sub-
space learning framework. In: Vento, M., Percannella, G. (eds.) CAIP 2019. LNCS,
vol. 11679, pp. 581–592. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29891-3_51
12. Hast, A., Vats, E.: Word recognition using embedded prototype subspace classifiers
on a new imbalanced dataset. J. WSCG 29(1–2), 39–47 (2021). https://wscg.zcu.
cz/WSCG2021/2021-J-WSCG-1-2.pdf
13. Huang, Z., Zhang, J., Shan, H.: When age-invariant face recognition meets face
age synthesis: a multi-task learning framework. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 7282–7291 (2021)
14. Huang, Z., Zhang, J., Shan, H.: When age-invariant face recognition meets face
age synthesis: a multi-task learning framework and a new benchmark. IEEE Trans.
Pattern Anal. Mach. Intell. 45, 7917–7932 (2022)
15. InsightFace: Insightface (2023). https://insightface.ai. Accessed 30 Feb 2023
98 A. Hast
16. Kohonen, T., Lehtiö, P., Rovamo, J., Hyvärinen, J., Bry, K., Vainio, L.: A principle
of neural associative memory. Neuroscience 2(6), 1065–1076 (1977)
17. Kohonen, T., Oja, E.: Fast adaptive formation of orthogonalizing filters and asso-
ciative memory in recurrent networks of neuron-like elements. Biol. Cybern. 21(2),
85–95 (1976)
18. Kohonen, T., Reuhkala, E., Mäkisara, K., Vainio, L.: Associative recall of images.
Biol. Cybern. 22(3), 159–168 (1976)
19. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol.
Cybern. 43(1), 59–69 (1982)
20. Laaksonen, J.: Subspace classifiers in recognition of handwritten digits. G4 mono-
grafiaväitöskirja, Helsinki University of Technology (1997). https://urn.fi/urn:nbn:
fi:tkk-001249
21. Lanitis, A., Taylor, C., Cootes, T.: Toward automatic simulation of aging effects
on face images. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 442–455 (2002).
https://doi.org/10.1109/34.993553
22. Li, Z., Gong, D., Li, X., Tao, D.: Aging face recognition: a hierarchical learning
model based on local patterns selection. IEEE Trans. Image Process. 25(5), 2146–
2154 (2016). https://doi.org/10.1109/TIP.2016.2535284
23. Li, Z., Park, U., Jain, A.K.: A discriminative model for age invariant face recog-
nition. IEEE Trans. Inf. Forensics Secur. 6(3), 1028–1037 (2011). https://doi.org/
10.1109/TIFS.2011.2156787
24. Ling, H., Soatto, S., Ramanathan, N., Jacobs, D.W.: A study of face recognition
as people age. In: 2007 IEEE 11th International Conference on Computer Vision,
pp. 1–8 (2007). https://doi.org/10.1109/ICCV.2007.4409069
25. Ling, H., Soatto, S., Ramanathan, N., Jacobs, D.W.: Face verification across age
progression using discriminative methods. IEEE Trans. Inf. Forensics Secur. 5(1),
82–91 (2010). https://doi.org/10.1109/TIFS.2009.2038751
26. Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res.
9(11), 2579–2605 (2008)
27. Mahalingam, G., Kambhamettu, C.: Age invariant face recognition using graph
matching. In: 2010 Fourth IEEE International Conference on Biometrics: Theory,
Applications and Systems (BTAS), pp. 1–7 (2010). https://doi.org/10.1109/BTAS.
2010.5634496
28. McInnes, L., Healy, J.: UMAP: Uniform Manifold Approximation and Projection
for Dimension Reduction. ArXiv e-prints (2018)
29. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.:
Agedb: The first manually collected, in-the-wild age database. In: 2017 IEEE Con-
ference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.
1997–2005 (2017). https://doi.org/10.1109/CVPRW.2017.250
30. Oja, E., Kohonen, T.: The subspace learning algorithm as a formalism for pattern
recognition and neural networks. In: IEEE 1988 International Conference on Neural
Networks, vol. 1, pp. 277–284 (1988). https://doi.org/10.1109/ICNN.1988.23858
31. Park, U., Tong, Y., Jain, A.K.: Age-invariant face recognition. IEEE Trans. Pattern
Anal. Mach. Intell. 32(5), 947–954 (2010). https://doi.org/10.1109/TPAMI.2010.
14
32. Ramanathan, N., Chellappa, R.: Modeling age progression in young faces. In: 2006
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2006), vol. 1, pp. 387–394 (2006). https://doi.org/10.1109/CVPR.2006.
187
33. Sawant, M.M., Bhurchandi, K.M.: Age invariant face recognition: a survey on facial
aging databases, techniques and effect of aging. Artif. Intell. Rev. 52, 981–1008
(2019). https://doi.org/10.1007/s10462-018-9661-z
34. Serengil, S.I., Ozpinar, A.: Lightface: a hybrid deep face recognition framework.
In: 2020 Innovations in Intelligent Systems and Applications Conference (ASYU),
pp. 23–27. IEEE (2020). https://doi.org/10.1109/ASYU50717.2020.9259802
35. Serengil, S.I., Ozpinar, A.: Hyperextended lightface: a facial attribute analysis
framework. In: 2021 International Conference on Engineering and Emerging Tech-
nologies (ICEET), pp. 1–4. IEEE (2021). https://doi.org/10.1109/ICEET53442.
2021.9659697
36. Serengil, S.I., Ozpinar, A.: An evaluation of SQL and NoSQL databases for
facial recognition pipelines. Preprint (2023). https://www.cambridge.org/engage/coe/article-details/63f3e5541d2d184063d4f569, https://doi.org/10.33774/coe-2023-18rcn
37. Wang, Y., et al.: Orthogonal deep features decomposition for age-invariant face
recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Com-
puter Vision - ECCV 2018, pp. 764–779. Springer, Cham (2018)
38. Watanabe, S., Pakvasa, N.: Subspace method in pattern recognition. In: 1st Inter-
national Joint Conference on Pattern Recognition, Washington DC. pp. 25–32
(1973)
39. Watanabe, W., Lambert, P.F., Kulikowski, C.A., Buxto, J.L., Walker, R.: Evalu-
ation and selection of variables in pattern recognition. In: Tou, J. (ed.) Computer
and Information Sciences, vol. 2, pp. 91–122. Academic Press, New York (1967)
40. Xu, C., Liu, Q., Ye, M.: Age invariant face recognition and retrieval by coupled
auto-encoder networks. Neurocomputing 222, 62–71 (2017)
41. Yan, C., et al.: Age-invariant face recognition by multi-feature fusionand decom-
position with self-attention. ACM Trans. Multimedia Comput. Commun. Appl.
18(1s) (2022). https://doi.org/10.1145/3472810
42. Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch (2014).
https://doi.org/10.48550/ARXIV.1411.7923. https://arxiv.org/abs/1411.7923
43. Zheng, T., Deng, W., Hu, J.: Age estimation guided convolutional neural network
for age-invariant face recognition. In: 2017 IEEE Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW), pp. 503–511 (2017). https://doi.
org/10.1109/CVPRW.2017.77
BENet: A Lightweight Bottom-Up
Framework for Context-Aware Emotion
Recognition
Laboratoire Hubert Curien UMR CNRS 5516, Institut d’Optique Graduate School,
Université Jean Monnet Saint-Etienne, 42023 Saint-Etienne, France
{tristan.cladiere,olivier.alata}@univ-st-etienne.fr
1 Introduction
Understanding emotions is a difficult yet essential task in our daily life. They can
be defined as discrete categories or as coordinates in a continuous space of affect
dimensions [4]. For the discrete categories, Ekman and Friesen [5] defined six
basic ones: anger, disgust, fear, happiness, sadness, and surprise. Later, contempt
was added to the list [12]. Concerning the continuous space, valence, arousal, and
dominance form the commonly used three-dimensional frame [13].
Regarding non-verbal cues, facial expression is one of the most important
signals to convey emotional states and intentions [11]. However, the context
is also essential in some cases, because it can be misleading to infer emotions
using only the face [1]. For images, the context can include many cues, and
recent works have built different deep learning architectures to process them.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 100–111, 2023.
https://doi.org/10.1007/978-3-031-45382-3_9
Bottom-Up Emotions Network 101
2 Proposed Method
In this section, the components of our multitask approach will be detailed. Each
head of the architecture is dedicated to a specific task. First, the bottom-up
head is introduced. It makes it possible to estimate the emotions of multiple people
simultaneously, unlike the usual methods, which treat them sequentially.
102 T. Cladière et al.
Next, the detection head is presented. Combined with the bottom-up one, these blocks
make the model fully autonomous, since it becomes independent of the annotated
bounding boxes. Then, the person-centric head is described. It is specifically
used to predict the emotions of a single subject given as input. Here we
only rely on the subject's features, without processing either other people or
the background information. Finally, the background head is presented. It
makes a global prediction using only the scene features, i.e. everything
except the annotated subjects. An overview of our architecture is given in Fig. 1.
Bounding boxes are necessary to extract and attribute the predictions. They can be given either
by the ground truth or by a person detector.
Fig. 2. Example of the heatmap produced by the detection head, and two emotion maps
given by the bottom-up head. Normalisation is between 0 (black, emotion is absent)
and 1 (white, emotion is present). The predicted and annotated bounding boxes are
also added to the raw image.
further helps to infer the emotions of the subject, but they are also dependent on
external resources. In our case, a simpler method is used: a classification block
inspired by [16] is added to the model, and is referred to as the person-centric head.
The combination of our backbone and this specific head is very similar to the
work of [16]. The flexibility given by such an architecture seems profitable for
processing in-the-wild images, including close-range faces and far-range silhouettes,
mainly due to its multi-resolution design.
For background information, the corresponding head has the same design
as the person-centric one, but both the input and the training objective are
different. Following [19], all the annotated subjects in the raw image are masked,
forcing the model to rely on other sources of features (see Fig. 1). However, rather
than predicting the emotions of a single person, the architecture has to estimate
all the emotions present in the image, i.e. each emotion that is labelled for at
least one subject. Given N people annotated for E emotions on a single image,
a one-hot-encoded matrix of dimension N × E is therefore created, and the
maximum along the N axis is taken, resulting in a vector of shape 1 × E.
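This reduction from per-subject labels to an image-level target can be written in one line (the label matrix below is illustrative):

```python
import numpy as np

# N = 3 annotated people, E = 4 discrete emotion categories.
labels = np.array([[1, 0, 0, 1],   # subject 1
                   [0, 0, 1, 0],   # subject 2
                   [1, 0, 0, 0]])  # subject 3

# An emotion is present at image level if at least one subject has it.
target = labels.max(axis=0)        # shape (E,): [1, 0, 1, 1]
```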
Here again the whole architecture is end-to-end trainable, and all the tasks
are jointly trained. In this configuration, the shared backbone learns to extract
rich common features, so that each task benefits from the others.
3 Framework Details
In this section, the multitask architecture is first presented (see Fig. 1). Then
the used data and their processing to jointly train all heads are detailed. Finally
the loss functions are explained.
3.2 Databases
HECO. This database [19] comprises 9,385 images and 19,781 annotated people, with
rich context information and various agent interaction behaviours. The annota-
tions include 8 discrete categories and 3 continuous dimensions, but also the
novel Self-assurance (Sa) and Catharsis (Ca) labels, which describe the degree
of interaction between subjects and the degree of adaptation to the context.
Unfortunately, the authors do not provide any partition of their dataset, which
makes the evaluations difficult to compare. Thus, we only used HECO as extra
data for training.
The data augmentation is divided into two parts. The first part is designed
to randomly apply a specific pre-transformation on each image of the training
batch. Depending on the pre-transformations drawn, the images will be dis-
patched at the end of the backbone, and fed to the corresponding heads. Thus,
each image is designed to train a specific task. These pre-transformations are
named ExtractSubject, MaskAllSubjects, and RandomMaskSubjects. They will be
briefly explained, and examples of the inputs are shown in Fig. 1.
ExtractSubject uses the ground truth to crop the image around the bounding
box of a given subject. It will be used to train the model to extract person-centric
features. This pre-transformation can only be drawn if the bounding box of the
selected subject does not contain other people.
MaskAllSubjects uses the ground truth to mask all the annotated subjects in
the image. With such images, the background head will have to extract features
from everything except the people. To ensure that there is still enough informa-
tion left for the model to learn useful features, this pre-transformation can only
be applied to images in which the sum of the areas of the bounding boxes does
not exceed 40% of the total area of the input.
RandomMaskSubjects is the transformation used to train the bottom-up
head. If there are multiple annotated subjects in the image, we will randomly
mask them, but always make sure to keep at least one. The idea is to augment
and diversify the combinations of emotions presented to the model.
The last option is to keep the raw image. In this case, it will be used to train
the detection head. Given a batch size B, each image will be pre-processed by
picking one of the above pre-transformations, each with probability 0.25.
This equiprobability is chosen as the default experiment.
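The equiprobable dispatch might be sketched as follows (the pre-transformation functions are placeholders for the operations described above):

```python
import random

def extract_subject(img): return img        # placeholder implementations
def mask_all_subjects(img): return img
def random_mask_subjects(img): return img
def keep_raw(img): return img

PRE_TRANSFORMS = [extract_subject, mask_all_subjects,
                  random_mask_subjects, keep_raw]
PROBS = [0.25, 0.25, 0.25, 0.25]            # default equiprobable setting

def pre_process_batch(batch):
    """Draw one pre-transformation per image; the draw also decides
    which head each image will train."""
    picks = random.choices(PRE_TRANSFORMS, weights=PROBS, k=len(batch))
    return [t(img) for t, img in zip(picks, batch)]
```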
The second part of the data augmentation consists of adding random Gaussian
noise, random blur, random colour jittering, random horizontal flip, and random
perspective transformations to the pre-transformed images.
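Two of these augmentations, Gaussian noise and horizontal flip, can be sketched directly on arrays (σ and the flip probability are illustrative; blur, colour jitter, and perspective warps would follow the same pattern):

```python
import numpy as np

def random_augment(img, rng, sigma=0.02, p_flip=0.5):
    """Add Gaussian noise and randomly flip an (H, W, C) image."""
    out = img + rng.normal(scale=sigma, size=img.shape)
    if rng.random() < p_flip:
        out = out[:, ::-1]                  # flip along the width axis
    return np.clip(out, 0.0, 1.0)           # keep intensities in [0, 1]

rng = np.random.default_rng(0)
augmented = random_augment(np.zeros((4, 4, 3)), rng)
```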
\[ L_{size} = \frac{1}{N}\sum_{k=1}^{N}\bigl|\hat{S}(p_k) - s_k\bigr| \tag{2} \]
where Ŝ ∈ Rw×h×2 are the width and height prediction maps of size w × h for
a given resolution. Hence, the detection loss at this resolution is:
\[ L_{det} = \lambda_{center} L_{center} + \lambda_{size} L_{size} \tag{3} \]
λcenter is set to 1, and λsize to 0.1. Since the model gives predictions at resolu-
tions 1/4 and 1/2, the overall detection loss is therefore:
\[ L_{DET} = L_{det\text{-}1/4} + L_{det\text{-}1/2} \tag{4} \]
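With the stated weights, Eqs. (2)–(4) combine as follows (L_center is assumed to be computed elsewhere; the numbers are illustrative):

```python
import numpy as np

LAMBDA_CENTER, LAMBDA_SIZE = 1.0, 0.1

def size_loss(pred_sizes, gt_sizes):
    """L1 size loss of Eq. (2), averaged over the N annotated people."""
    return float(np.mean(np.abs(pred_sizes - gt_sizes)))

def detection_loss(l_center, l_size):
    """Weighted per-resolution detection loss of Eq. (3)."""
    return LAMBDA_CENTER * l_center + LAMBDA_SIZE * l_size

# Overall loss of Eq. (4): sum over the 1/4 and 1/2 resolutions.
l_size_14 = size_loss(np.array([[10.0, 20.0]]), np.array([[12.0, 18.0]]))
l_det = detection_loss(0.5, l_size_14) + detection_loss(0.4, 1.0)
```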
For the emotion recognition task, which concerns person-centric, background,
and bottom-up heads, a loss similar to [2] is used. It is a multi-label and binary
focal loss, which gives better results while dealing with unbalanced data. It is
defined as follows:
\[ L_{cat\text{-}emo} = -\frac{1}{N}\sum_{i=1}^{E}\Bigl[ Y_i\,\bigl(1-\hat{Y}_i\bigr)^{\alpha}\log\hat{Y}_i + \bigl(1-Y_i\bigr)\,\hat{Y}_i^{\alpha}\log\bigl(1-\hat{Y}_i\bigr) \Bigr] \tag{5} \]
\[ L_{CAT} = L^{person\text{-}centric}_{cat\text{-}emo} + L^{background}_{cat\text{-}emo} + L^{bottom\text{-}up}_{cat\text{-}emo\text{-}1/4} + L^{bottom\text{-}up}_{cat\text{-}emo\text{-}1/2} \tag{6} \]
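A minimal implementation in the spirit of Eq. (5), with the modulating exponent α down-weighting well-predicted labels (the values and the normaliser N are illustrative):

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=2.0, n=1, eps=1e-7):
    """Multi-label binary focal loss of Eq. (5)."""
    p = np.clip(y_pred, eps, 1 - eps)
    pos = y_true * (1 - p) ** alpha * np.log(p)          # present labels
    neg = (1 - y_true) * p ** alpha * np.log(1 - p)      # absent labels
    return float(-(pos + neg).sum() / n)

y = np.array([1.0, 0.0, 1.0])
loss_good = focal_loss(y, np.array([0.9, 0.1, 0.9]))  # confident, correct
loss_bad = focal_loss(y, np.array([0.1, 0.9, 0.1]))   # confident, wrong
```

Badly predicted labels dominate the sum, which is what makes the loss suited to unbalanced data.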
Regarding the detection task, the best score is obtained when all the heads
are trained together with extra data, as illustrated in Table 2. Yet, the
main drawback of the EMOTIC database is that it is not fully annotated. Indeed,
there are many images with several people in which only a few of them are
labeled. Thus, the detector tends to produce many False Positives (as illustrated
in Fig. 2), which are penalized during the training, may confuse the model,
and also reduce the precision during the evaluations.
The scores presented in Table 3 correspond to the new evaluation protocol
which considers the tasks of detection and emotion recognition together, intro-
duced in Sect. 4.2. As expected, the results are worse than those obtained with
the ground truth, but surprisingly the model using all heads and giving the best
results in both detection and emotion recognition is no longer the best with this
new metric. This can be explained by the fact that the latter detects more subjects,
even people whose emotions are particularly difficult to assess, for example
those who are partially occluded or quite distant in the background. In these
situations, the model is more likely to be wrong and produce more False Positives
and fewer True Positives, which decreases its precision. However, using external
data still leads to better results.
Table 2. Ablation Experiments on EMOTIC Dataset for person detection, using the
COCO API. Underlined values come from the final model instead of the best model.
Table 3. mAP scores for emotion recognition on EMOTIC Dataset, depending on the
detector predictions for thresholds from 0.50 to 0.95. (1): BU+Det; (2): BU+Det+PC;
(3): BU+Det+PC+BG; (4): BU+Det+PC+BG+ED. Underlined values in the “Aver-
age” column indicate that the scores in the whole row come from the final model instead
of the best model.
Det. thr. 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 Avg.
(1) 25.73 25.38 25.04 24.68 24.37 24.00 23.62 22.83 21.71 19.37 23.67
(2) 26.19 25.85 25.51 25.32 25.13 24.81 24.20 23.24 21.91 19.37 24.15
(3) 26.01 25.86 25.57 25.24 24.97 24.58 23.99 23.29 22.00 19.48 24.10
(4) 26.96 26.66 26.38 26.07 25.82 25.28 24.61 23.71 22.21 19.57 24.73
Even if our framework and our objectives are quite different from those of other
authors, we finally compared our scores with those of the state-of-the-art in
Table 4. The baseline on EMOTIC, provided by [9], is outperformed. Our model
is multitask, but possible ways to fuse the predictions of the different heads
have not been explored yet. Indeed, the person-centric and background heads
are only used to help the model during its training, but not while inferring.
Nevertheless, we still tried to average all the outputs, which requires pre-processing
the raw image for the person-centric and background heads. It finally
appears that the mean of the bottom-up and the person-centric outputs
gives the best refined prediction, which means that 2 streams are used here.
However, the most recent methods are still quite ahead, due to their rich and
complex frameworks and a well-made fusion.
Table 4. Comparison with state-of-the-art methods on EMOTIC.
Authors          [10]  [9]   [2]   [20]  [17]  [6]   [14]  [19]  Ours
nb. of streams   2     2     2     2     3     6     4     7     2
fusion module    ✓     ✓     ✓     ✓     ✓     ✓     ✓     ✓     ✗
NERR             1     0     0     1     1     2     3     3     0
mAP              20.84 27.38 28.33 28.42 30.17 35.16 35.48 37.73 29.30
References
1. Barrett, L.F., Mesquita, B., Gendron, M.: Context in emotion perception. Curr.
Dir. Psychol. Sci. 20(5), 286–290 (2011)
2. Bendjoudi, I., Vanderhaegen, F., Hamad, D., Dornaika, F.: Multi-label, multi-task
CNN approach for context-based emotion recognition. Inf. Fusion 76, 422–428
(2021). https://doi.org/10.1016/j.inffus.2020.11.007
3. Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: scale-
aware representation learning for bottom-up human pose estimation. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 5386–5395 (2020)
4. Ekman, P., Friesen, W.V.: Head and body cues in the judgement of emotion: a
reformulation. Percept. Mot. Skills 24(3), 711–724 (1967)
5. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. J.
Pers. Soc. Psychol. 17, 124–129 (1971). https://doi.org/10.1037/h0030377
6. Hoang, M.H., Kim, S.H., Yang, H.J., Lee, G.S.: Context-aware emotion recognition
based on visual relationship detection. IEEE Access 9, 90465–90474 (2021).
https://doi.org/10.1109/ACCESS.2021.3091169
7. Kingma, D.P., Ba, J.: ADAM: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
8. Kosti, R., Alvarez, J.M., Recasens, A., Lapedriza, A.: EMOTIC: emotions in con-
text dataset. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, pp. 61–69 (2017)
9. Kosti, R., Alvarez, J.M., Recasens, A., Lapedriza, A.: Context based emotion
recognition using EMOTIC dataset. IEEE Trans. Pattern Anal. Mach. Intell.
42(11), 2755–2766 (2020). https://doi.org/10.1109/TPAMI.2019.2916866
10. Lee, J., Kim, S., Kim, S., Park, J., Sohn, K.: Context-aware emotion recognition
networks. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp. 10143–10152 (2019)
11. Li, S., Deng, W.: Deep facial expression recognition: a survey. IEEE Trans. Affect.
Comput. 13(3), 1195–1215 (2022). https://doi.org/10.1109/TAFFC.2020.2981446
12. Matsumoto, D.: More evidence for the universality of a contempt expression. Motiv.
Emot. 16(4), 363–368 (1992). https://doi.org/10.1007/BF00992972
13. Mehrabian, A.: Framework for a comprehensive description and measurement of
emotional states. Genet. Soc. Gen. Psychol. Monogr. 121, 339–361 (1995)
14. Mittal, T., Guhan, P., Bhattacharya, U., Chandra, R., Bera, A., Manocha, D.:
EmotiCon: context-aware multimodal emotion recognition using Frege’s principle.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 14234–14243 (2020)
15. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning
library. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
16. Wang, J., et al.: Deep high-resolution representation learning for visual recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2021).
https://doi.org/10.1109/TPAMI.2020.2983686
17. Wang, Z., Lao, L., Zhang, X., Li, Y., Zhang, T., Cui, Z.: Context-dependent emo-
tion recognition. SSRN Electron. J. (2022). https://doi.org/10.2139/ssrn.4118383
18. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention
module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018.
LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
19. Yang, D., et al.: Emotion recognition for multiple context awareness. In: Avidan,
S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS,
vol. 13697, pp. 144–162. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19836-6_9
20. Zhang, M., Liang, Y., Ma, H.: Context-aware affective graph reasoning for emo-
tion recognition. In: 2019 IEEE International Conference on Multimedia and Expo
(ICME), pp. 151–156 (2019). https://doi.org/10.1109/ICME.2019.00034. ISSN:
1945-788X
21. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint
arXiv:1904.07850 (2019)
YOLOPoint: Joint Keypoint and Object
Detection
1 Introduction
Keypoints are low-level landmarks, typically points, corners, or edges that can
easily be retrieved from different viewpoints. This makes it possible for moving
vehicles to estimate their position and orientation relative to their surroundings
and even perform loop closure (i.e., SLAM) with one or more cameras. Histor-
ically, this task was performed with hand-crafted keypoint feature descriptors
such as ORB [17], SURF [2], HOG [5], SIFT [14]. However, these are either
not real-time capable or perform poorly under disturbances such as illumination
changes, motion blur, or they detect keypoints in clusters rather than spread out
over the image, making pose estimation less accurate. Learned feature descrip-
tors aim to tackle these problems, often by applying data augmentation in the
form of random brightness, blurring and contrast. Furthermore, learned keypoint
descriptors have shown to outperform classical descriptors. One such keypoint
descriptor is SuperPoint [6], a convolutional neural network (CNN) which we
use as a base network to improve upon.
SuperPoint is a multi-task network that jointly predicts keypoints and their
respective descriptors in a single forward pass. It does this by sharing the feature
outputs of one backbone between a keypoint detector and descriptor head. This
makes it computationally efficient and hence ideal for real-time applications.
This research is funded by dtec.bw – Digitalization and Technology Research Center
of the Bundeswehr. dtec.bw is funded by the European Union - NextGenerationEU.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 112–123, 2023.
https://doi.org/10.1007/978-3-031-45382-3_10
YOLOPoint: Joint Keypoint and Object Detection 113
Fig. 1. Example output of YOLOPointM on a KITTI scene with keypoint tracks from
3 frames and object bounding boxes.
2 Related Work
Classical keypoint descriptors use hand-crafted algorithms designed to over-
come challenges such as scale and illumination variation [2,5,14,17,20,22] and
have been thoroughly evaluated [16,21]. Although their main utility was key-
point description, they have also been used in combination with support vector
machines to detect objects [27].
Since then, deep learning-based methods have dominated benchmarks for
object detection (i.e., object localization and classification) and other computer
vision tasks [26–28]. Therefore, research has increasingly been dedicated to finding
ways in which they can also be used for point description. COTR [10] and
LoFTR [24], two CNN-based transformer models, achieve state-of-the-art
results on multiple image matching benchmarks.
dences between two images by concatenating their respective feature maps and
processing them together with positional encodings in a transformer module.
Their method, however, focuses on matching accuracy rather than speed and
performs inference on multiple zooms of the same image. LoFTR has a simi-
lar approach, the main difference being their “coarse-to-fine” module that first
predicts coarse correspondences, then refines them using a crop from a higher
level feature map. Both methods are detector-free, and while achieving excel-
lent results in terms of matching accuracy, neither method is suitable for real-
time applications. Methods that yield both keypoint detections and descriptors
are generally faster due to matching only a sparse set of points and include
R2D2 [19], D2-Net [7] and SuperPoint [6]. R2D2 tackles the matching problems
of repetitive textures by introducing a reliability prediction head that indicates
the discriminativeness of a descriptor at a given pixel. D2-Net has the unique
approach of using the output feature map both for point detection and description,
hence sharing all the weights of the network between both tasks. In contrast,
SuperPoint has a shared backbone but separate heads for the detection and
description task. What sets it apart from all other projects is its self-supervised
training framework. While other authors obtain ground truth point correspon-
dences from depth images gained from structure from motion, i.e., using video,
the SuperPoint framework can create labels for single images. It does this by first
generating a series of labeled grayscale images depicting random shapes, then
training an intermediate model on this synthetic data. The intermediate model
subsequently makes point-predictions on a large data set (here: MS COCO [13])
that are refined using their “homographic adaptation” method. The final model
is trained on the refined point labels.
While there exist several models that jointly predict keypoints and descrip-
tors, there are to our knowledge none that also detect objects in the same net-
work. Maji et al’s work [15] comes closest to ours. They use YOLOv5 to jointly
predict keypoints for human pose estimation as well as bounding boxes in a sin-
gle forward pass. The main differences are that the keypoint detection training
uses hand-labelled ground truth points, the object detector is trained on a single
class (human), and both tasks rely on similar features.
3 Model Architecture
Fig. 2. Full model architecture, exemplified by YOLOPointS. The two types of bottlenecks,
the C3 block (left) and a sequence of convolution, batch normalization, and SiLU
activation, form the main parts of YOLO and by extension YOLOPoint. k: kernel size,
s: stride, p: pad, c: output channels, bn: bottleneck, SPPF: fast spatial pyramid pooling
[11].
accuracy trade-off, the smallest vector is left at 64. Furthermore, in order for
the descriptor to be able to match and distinguish between other keypoints, it
requires a large receptive field [24]. This, however, comes with down-sampling
the input image and losing detail in the process. In order to preserve detail, we
enlarge the low-resolution feature map with nearest-neighbor interpolation and
concatenate it with a feature map higher up in the backbone before performing
further convolutions. The full model fused with YOLOv5 is depicted in Fig. 2.
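The fusion step can be sketched with plain arrays standing in for (C, H, W) feature maps; in the network, further convolutions follow the concatenation:

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(coarse, fine):
    """Enlarge the low-resolution map and concatenate along channels."""
    return np.concatenate([upsample_nearest(coarse), fine], axis=0)

coarse = np.zeros((64, 8, 8))   # deep, low-resolution features
fine = np.zeros((32, 16, 16))   # features from higher up in the backbone
fused = fuse(coarse, fine)      # shape (96, 16, 16)
```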
4 Training
To generate pseudo ground truth point labels, we follow the protocol of Super-
Point by first training the point detector of YOLOPoint on the synthetic shapes
dataset, then using it to generate refined outputs on COCO using homographic
adaptation for pre-training. Pre-training on COCO is not strictly necessary,
however it can improve results when fine-tuning on smaller data sets, as well as
reduce training time. Thus, the pre-trained weights are later fine-tuned on the
KITTI dataset [8].
For training the full model, pairs of RGB images that are warped by a known
homography are each run through a separate forward pass. The model subse-
quently predicts “point-ness” heat maps, descriptor vectors and object bounding
boxes. When training on data sets of variable image size (e.g., MS COCO) the
images must be fit to a fixed size in order to be processed as a batch. A com-
mon solution is to pad the sides of the image such that W = H, also known
as letterboxing. However, we found that this causes false positive keypoints to
be predicted close to the padding due to the strong contrast between the black
padding and the image, negatively impacting training. Therefore, when pre-
training on COCO, we use mosaic augmentation, which concatenates four images
side-by-side to fill out the entire image canvas, eliminating the need for image
padding [4]. All training is done using a batch size of 64 and the Adam optimizer
[12] with a learning rate of 10^−3 for pre-training and 10^−4 for fine-tuning.
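A simplified numpy sketch of the mosaic idea follows (placement only; the YOLOv4-style version [4] also jitters the mosaic center and remaps bounding boxes, which is omitted here):

```python
import numpy as np

def mosaic(images, canvas_hw=(640, 640)):
    """Tile four HxWx3 images into one canvas, one per quadrant,
    cropping each to the quadrant size. A simplified sketch: the
    full mosaic augmentation also jitters the center point and
    transforms the box annotations accordingly."""
    H, W = canvas_hw
    h2, w2 = H // 2, W // 2
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    anchors = [(0, 0), (0, w2), (h2, 0), (h2, w2)]
    for img, (y, x) in zip(images, anchors):
        patch = img[:h2, :w2]                      # crop to quadrant size
        canvas[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return canvas

imgs = [np.full((400, 400, 3), v, np.uint8) for v in (10, 20, 30, 40)]
m = mosaic(imgs)  # (640, 640, 3); each quadrant comes from a different image
```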
For fine-tuning on KITTI we split the data into 6481 training and 1000
validation images resized to 288 × 960. To accommodate the new object classes
we replace the final object detection layer and train for 20 epochs with all weights
frozen except those of the detection layer. Finally, we unfreeze all layers and train
the entire network for another 50 epochs.
The model outputs for the warped and unwarped image are used to calculate
the keypoint detector and descriptor losses. However, only the output of the
unwarped images is used for the object detector loss, so as not to learn strongly
distorted object representations.
The keypoint detector loss Ldet is the mean of the binary cross-entropy losses
over all pixels of the heatmaps of the warped and unwarped image of size H × W
and corresponding ground truth labels and can be expressed as follows:
YOLOPoint: Joint Keypoint and Object Detection 117

L_det = −(1/(H·W)) Σ_{i,j}^{H,W} ( y_{ij} · log x_{ij} + (1 − y_{ij}) · log(1 − x_{ij}) )    (1)
where yij ∈ {0, 1} and xij ∈ [0, 1] respectively denote the target and predic-
tion at pixel ij.
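Equation (1) translates directly into numpy (a small clipping term is added as a numerical guard for the logarithms; names are illustrative):

```python
import numpy as np

def detector_loss(y, x, eps=1e-12):
    """Mean binary cross-entropy over all H*W heatmap pixels, as in Eq. (1).
    y: ground-truth labels in {0, 1}; x: predicted "point-ness" in [0, 1]."""
    x = np.clip(x, eps, 1.0 - eps)            # guard against log(0)
    return -np.mean(y * np.log(x) + (1 - y) * np.log(1 - x))

# Toy 2x2 heatmap with confident, mostly-correct predictions.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = detector_loss(y, x)
```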
The original descriptor loss is a contrastive hinge loss applied to all correspondences
and non-correspondences of a low-resolution descriptor map D of size
H_c × W_c [6], creating a total of (H_c × W_c)^2 matches. This, however, becomes
computationally unfeasible for higher-resolution images. Instead, we opt for a
sparse version adapted from DeepFEPE [9] that samples N matching pairs of
feature vectors d_{ijk} and d′_{i′j′k} and M non-matching pairs from a batch of sampled
descriptors D̃ ⊂ D and their warped counterparts D̃′ ⊂ D′. Using the known
homography, each descriptor d at pixel ij of the kth image of a mini-batch can
be mapped to its corresponding warped pair at i′j′. Furthermore, m_p denotes the
positive margin of the sampled hinge loss.
L_{n.corr} = (1/M) Σ_{(i,j,k) ∈ D̃} Σ_{(o,p,q) ∈ D̃′} ( d_{ijk}^T d′_{opq} ),   (o, p, q) ≠ (i′, j′, k)    (4)
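Since Eqs. (2) and (3) fall outside this excerpt, only the non-correspondence term can be illustrated. The sketch below assumes L2-normalized descriptors stored as (batch, H_c, W_c, dim) tensors and pre-sampled index pairs; both layout and names are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def non_corr_loss(d, d_warped, pairs):
    """Mean dot product over M sampled NON-matching descriptor pairs,
    in the spirit of Eq. (4). `pairs` holds tuples ((i,j,k), (o,p,q))
    with (o,p,q) != (i',j',k); since descriptors are L2-normalized,
    each dot product is a cosine similarity in [-1, 1]."""
    total = 0.0
    for (i, j, k), (o, p, q) in pairs:
        total += float(d[k, i, j] @ d_warped[q, o, p])
    return total / len(pairs)

# Illustrative tensors: batch of 2 images, 4x4 descriptor grid, dim 8.
d = rng.normal(size=(2, 4, 4, 8))
d /= np.linalg.norm(d, axis=-1, keepdims=True)
dw = rng.normal(size=(2, 4, 4, 8))
dw /= np.linalg.norm(dw, axis=-1, keepdims=True)
pairs = [((0, 1, 0), (2, 3, 1)), ((2, 2, 1), (0, 0, 0))]
loss = non_corr_loss(d, dw, pairs)
```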
5 Evaluation
In the following sections we present our evaluation results for keypoint detection
and description on HPatches [1] and using all three task heads for visual odome-
try (VO) estimation on the KITTI benchmark. For evaluation on HPatches the
models trained for 100 epochs on MS COCO are used, for VO the models are
fine-tuned on KITTI data.
118 A. Backhaus et al.
Fig. 3. HPatches matches between two images with viewpoint change estimated with
YOLOPointS. Matched keypoints are used to estimate the homography matrix describ-
ing the viewpoint change.
The HPatches dataset comprises a total of 116 scenes, each containing 6 images.
57 scenes exhibit large illumination changes and 59 scenes large viewpoint
changes. The two main metrics used for evaluating keypoint tasks are repeatability,
which quantifies the keypoint detector's ability to consistently locate keypoints
at the same locations despite illumination and/or viewpoint changes, and
homography estimation, which tests both the repeatability and the discrimination
ability of the detector and descriptor. Our evaluation protocols are kept consistent
with SuperPoint's where possible.
Repeatability on HPatches (higher is better):

             Illumination  Viewpoint
YOLOPointL   .590          .540
YOLOPointM   .590          .540
YOLOPointS   .590          .540
YOLOPointN   .589          .529
SuperPoint   .611          .555
Fig. 4. Translation and rotation RMSE over all KITTI sequences plotted against mean
VO estimation time for YOLOPointL (YPL), M, S and N with filtered points as well as
SuperPoint and classical methods for comparison (lower left is better). VO estimation
was done with 376 × 1241 images, NVIDIA RTX A4000 and Intel Core i7-11700K.
Fig. 5. Sequence 01: Driving next to a car on a highway. Top: All keypoints. Bottom:
Keypoints on car removed via its bounding box.
References
1. Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: a benchmark and
evaluation of handcrafted and learned local descriptors. In: CVPR (2017)
2. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–
417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32
3. Beer, L., Luettel, T., Wuensche, H.J.: GenPa-SLAM: using a general panoptic
segmentation for a real-time semantic landmark SLAM. In: Proceedings of IEEE
Intelligent Transportation Systems Conference (ITSC), pp. 873–879. IEEE, Macau,
China (2022). https://doi.org/10.1109/ITSC55140.2022.9921983
4. Bochkovskiy, A., Wang, C., Liao, H.M.: YOLOv4: optimal speed and accuracy of
object detection. CoRR abs/2004.10934 (2020)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection, vol. 1,
pp. 886–893 (2005). https://doi.org/10.1109/CVPR.2005.177
6. DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest
point detection and description (2018). http://arxiv.org/abs/1712.07629
7. Dusmanu, M., et al.: D2-Net: a trainable CNN for joint detection and description
of local features. In: Proceedings of the 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2019)
8. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI
dataset (2013)
9. Jau, Y.Y., Zhu, R., Su, H., Chandraker, M.: Deep keypoint-based camera pose
estimation with geometric constraints, pp. 4950–4957 (2020). https://doi.org/10.
1109/IROS45743.2020.9341229
10. Jiang, W., Trulls, E., Hosang, J., Tagliasacchi, A., Yi, K.M.: COTR: correspon-
dence transformer for matching across images. CoRR abs/2103.14167 (2021)
1 Introduction
Electron Tomography (ET) [7] is a powerful technique to reconstruct 3D
nanoscale material microstructure. A Transmission Electron Microscope (TEM)
acquires sets of projections from several angles, allowing the reconstruction of
3D volumes. However, the resulting data contain noisy reconstruction artifacts
because the number of projections is limited, and their alignment remains a
challenging task [25] (Fig. 1). Thus, standard segmentation methods often fail
[8], requiring the input of an expert to achieve a good segmentation [9,13,26].
Deep learning (DL) based approaches have achieved excellent results in this
area [1,11,14], as advances are made in semantic segmentation of 2D and 3D
images [2,5,19,22,24]. Standard approaches rely on training a neural network on
fully labeled datasets, which requires many annotated 3D samples. To address
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 124–135, 2023.
https://doi.org/10.1007/978-3-031-45382-3_11
Less-than-One Shot 3D Segmentation Hijacking a Pre-trained STM Network 125
2 Related Works
Electron Tomography Segmentation. Segmentation of tomograms remains
challenging because of reconstruction artifacts and low signal-to-noise ratio.
Manual segmentation is still the preferred method [9] with the support of visu-
alization tools [13] and various image processing methods such as watershed
transform [26]. DL-based methods have been applied in electron tomography in
more recent work [1,14], with DL models performing better in general semantic
segmentation tasks [2,5,19,22,24]. The main bottleneck for ET segmentation
tasks is the low availability of labeled training data. Recent works addressed
the issue with either a semi-supervised setup with contrastive learning [15] or a
scalable DL model, which only requires small- and medium-sized ground-truth
datasets [14]. Our method goes further by repurposing a VOS model without
any training phase.
Memory Network. Memory networks have an external module that can access
past experiences [23]. Usually, an object in the memory is referred to by a
key feature vector and encoded by a value feature vector. Segmentation memory
networks such as STM [20], SwiftNet [27], or STCN [4] encode the first video
frame, annotated by the user, into the memory component. The next frame
(query) is encoded into key and value feature vectors. The query keys and the
memory are then compared, resulting in a query value feature vector used to
segment the object on that frame. The memory component is then completed
with the new key and value. This technique is often used in video segmentation,
as the object to segment, whose shape changes over time, is constantly added
to the memory, providing several examples that help segmentation [20]. Unlike our
approach, where only a fraction of the frame is needed, these methods require
the annotation of the first entire frame.
3 Proposed Method
Our method uses the same model to reconstruct the partially annotated frame
and the entire volume. The images and masks in the memory are stored as key
and value feature vectors. The key encodes a visual representation of the object
so that objects with similar keys have similar shapes and textures. The value
contains information for the decoder on the segmentation. Our intuition is that
if we disable areas containing unlabelled data in the memory, we can encode
useful information to segment whole slices even with a small amount of labeled
data.
Fig. 2. The partially annotated slice is encoded by the memory encoder into the mem-
ory. At inference time, the query slice is encoded by the query encoder into a key
and a value and compared with the memory keys and values. During the memory
reading, the labeling mask selects only keys from labeled pixels, indicating whether a
pixel is annotated. The result is given to the decoder, which reconstructs the whole
segmentation.
annotations given by the expert, a labeling mask M_s is built in which labeled and
unlabeled pixels are denoted. The image and the annotations are encoded into
one key and one value {k^M, v^M} stored in the memory. There are two ways to
propagate the annotation into the entire volume. In the first one, the memory
is directly used to segment other slices (Algorithm 1 and Fig. 2). In the second
one, we pass the same image V_s as a query into the network to get pseudo-labels
of the entire slice Ŷ_s. The entire slice and the newly acquired pseudo-label
mask are then encoded into a key and a value {k^S, v^S} to segment the other slices
V_i, i ∈ [1, N], with N the number of slices in the volume (Algorithm 2).
During the memory read, the key in memory is modified to mask unknown
zones to get the value to segment the whole slice. We use an STM network [20] as
the backbone of our method, as it was the first network to use memory networks
for 2D semantic segmentation. Moreover, as other methods in the field are based
on the STM network, our approach is generalizable to other networks.
f = [v^Q, R × v^M]    (7)
where × denotes the matrix product.
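A minimal numpy sketch of a masked memory read in the spirit of Eq. (7) and Fig. 2 follows; the dot-product affinity and softmax normalization for R are assumptions on our part, as the excerpt does not define R explicitly:

```python
import numpy as np

def masked_memory_read(kQ, vQ, kM, vM, label_mask):
    """Memory read with key masking. kQ: (C_k, N_q) query keys;
    kM: (C_k, N_m) memory keys; vM: (C_v, N_m) memory values;
    label_mask: (N_m,) bool, True = labeled pixel. Unlabeled memory
    locations are excluded from the similarity softmax, and the
    retrieved value R x v^M is concatenated with the query value v^Q."""
    sim = kM.T @ kQ                                  # (N_m, N_q) affinities
    sim[~label_mask] = -np.inf                       # mask unlabeled keys
    sim = sim - sim.max(axis=0, keepdims=True)       # numerically stable softmax
    R = np.exp(sim)
    R /= R.sum(axis=0, keepdims=True)
    return np.concatenate([vQ, vM @ R], axis=0)      # f = [v^Q, R x v^M]

rng = np.random.default_rng(1)
kQ, vQ = rng.normal(size=(8, 5)), rng.normal(size=(16, 5))
kM, vM = rng.normal(size=(8, 6)), rng.normal(size=(16, 6))
mask = np.array([True, True, False, False, True, False])
f = masked_memory_read(kQ, vQ, kM, vM, mask)         # shape (32, 5)
```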
We use the STM network architecture proposed in [20] as well as the weights
proposed by the authors. The network’s backbone is a ResNet, trained for a
video segmentation task with YouTube-VOS and DAVIS as training datasets.
The decoder outputs a probability map that is 1/4 of the initial input size,
which degrades the results for ET where fine details on porous areas are needed.
An upsampling operation is therefore performed on the input slice before it enters the
network. The input slice is upscaled by a factor of two as a compromise between
memory consumption and finer details.
All the results are computed with Intersection Over Union (IOU) on the
entire volume V :
IOU(V) = ( Σ_{i=1}^{N} |Ŷ_i ∩ Y_i| ) / ( Σ_{i=1}^{N} |Ŷ_i ∪ Y_i| )    (8)
with Ŷ the segmented volume and Y the ground truth. The closer the IOU is to
1, the better the segmentation is.
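Equation (8) accumulates intersections and unions over all N slices before dividing, rather than averaging per-slice IoUs; in numpy (boolean volumes, illustrative names):

```python
import numpy as np

def volume_iou(pred, gt):
    """Intersection over Union of Eq. (8): pixel counts are accumulated
    over the whole (N, H, W) boolean volume before the ratio is taken."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

# Toy volume: predicted top half vs. ground-truth middle band of each slice.
pred = np.zeros((2, 4, 4), bool); pred[:, :2, :] = True
gt = np.zeros((2, 4, 4), bool);  gt[:, 1:3, :] = True
iou = volume_iou(pred, gt)   # -> 1/3 (8 intersecting px over 24 union px)
```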
4.2 Data
Chemical processes in the energy field often require using zeolites [10]. However,
the numerous nanometric scale cavities make zeolites complex to segment. We
evaluated our method on three volumes of hierarchical zeolites, NaX Siliporite
G5 from Ceca-Arkema [18]. The volumes' sizes are 592 × 600 × 623, 512 × 512 ×
52, and 520 × 512 × 24 voxels.
The slices are automatically partially annotated to simulate real-world data.
A rectangular window of area A_w is considered labeled. The remainder of the slice is
unlabeled. The center of the window is drawn randomly on the border between
the object and the background (Fig. 3). The window is adjusted to fit entirely on
the screen while maintaining its area. We define the labeling rate as r = A_w / (H × W).
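This window sampling can be sketched as follows (a square window is assumed for simplicity; the paper specifies only the window's area A_w and its border-centered placement):

```python
import numpy as np

def window_mask(h, w, area, center):
    """Build the labeling mask Ms: a (here square) window of area `area`
    around `center`, shifted so it lies fully inside the HxW slice.
    In practice `center` is drawn on the object/background border."""
    side = int(round(np.sqrt(area)))
    cy, cx = center
    y0 = min(max(cy - side // 2, 0), h - side)   # clamp to stay on-screen
    x0 = min(max(cx - side // 2, 0), w - side)
    mask = np.zeros((h, w), bool)
    mask[y0:y0 + side, x0:x0 + side] = True
    return mask

m = window_mask(100, 100, area=400, center=(5, 95))  # center near a corner
r = m.sum() / m.size                                 # labeling rate r = 0.04
```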
4.3 Results
For each volume, we run each experiment on the same five randomly selected
slices. The mean IOU of the three volumes is reported.
Comparison with the STM Network. We first compare our method with
an unmodified STM network for several labeling rates r. We give the same partially
annotated slices to the STM network and our method. The results are
shown in Table 1. The STM network cannot process the partially labeled slice
because it was not intended to deal with such data. As a result, there is no
way for the STM network to differentiate labeled and unlabeled pixels, which
leads to poor segmentation. Our key masking allows the STM network to achieve
significantly better results with accurate segmentation (Fig. 4).
First Slice Propagation. We then study the different approaches for the first
slice. We tested our method with only the annotated parts in the memory shown
132 C. Li et al.
Table 1. Mean IOU on our volumes for our method and an unmodified STM net-
work. Our modification allows an STM-like model to produce good
segmentation.
(Ours) in Algorithm 1 against our method with the pseudo-labels of the entire
first slice in the memory (Ours+F), described by Algorithm 2. The results in
Table 2 demonstrate that better results are obtained with only the annotated parts
in the memory. The STM network performs better with accurate data than with
more variety in memory. Our method's implementation only uses the partially
labeled parts in the memory.
Table 2. Mean IOU on our volumes for our method with only the labeled parts in the
memory (Ours) and our method with the pseudo-labels of the first slice in the memory
(Ours+F). Our approach uses only the annotated zones in the memory.
Table 3. Mean IOU on our volumes for our method, a UNET adapted for partially
segmented areas, and a UNET using a contrastive loss to exploit both labeled and
unlabeled zones. Our proposed method achieves results close to these methods despite
no training phase.
Fig. 5. Mean IOU for several labeling rates r. No method requires an additional
training procedure except UNET and contrastive UNET [15].
5 Conclusion
In this paper, we illustrate that a slightly modified STM network can perform
accurate volumetric segmentation of 3D scans from ET with only a tiny portion
of a single slice labeled and without any further fine-tuning. This approach achieves
results close to those of methods that require a training procedure. The masking of
the memory shows that semi-labeled slices can be used to propagate accurate
segmentation in fields where annotated data are not widely available. A more
detailed segmentation mask could be obtained with further investigation, as the
original STM network output is 1/4 of the original size.
References
1. Akers, S., et al.: Rapid and flexible segmentation of electron microscopy data using
few-shot machine learning. NPJ Computat. Mater. 7(1), 1–9 (2021)
2. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab:
semantic image segmentation with deep convolutional nets, Atrous convolution,
and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–
848 (2018)
3. Cheng, H.K., Tai, Y.W., Tang, C.K.: Modular interactive video object segmenta-
tion: interaction-to-mask, propagation and difference-aware fusion. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
5559–5568 (2021)
4. Cheng, H.K., Tai, Y.W., Tang, C.K.: Rethinking space-time networks with
improved memory coverage for efficient video object segmentation. Adv. Neural.
Inf. Process. Syst. 34, 11781–11794 (2021)
5. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-net:
learning dense volumetric segmentation from sparse annotation. In: Ourselin, S.,
Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS,
vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-
46723-8_49
6. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsu-
pervised feature learning with convolutional neural networks. Adv. Neural Inf.
Process. Syst. 27 (2014)
7. Ersen, O., et al.: 3D-TEM characterization of nanometric objects. Solid State Sci.
9(12), 1088–1098 (2007)
8. Evin, B., et al.: 3D analysis of helium-3 nanobubbles in palladium aged under
tritium by electron tomography. J. Phys. Chem. C 125(46), 25404–25409 (2021)
9. Fernandez, J.J.: Computational methods for electron tomography. Micron 43(10),
1010–1030 (2012)
10. Flores, C., et al.: Versatile roles of metal species in carbon nanotube templates for
the synthesis of metal-zeolite nanocomposite catalysts. ACS Appl. Nano Mater.
2(7), 4507–4517 (2019)
11. Genc, A., Kovarik, L., Fraser, H.L.: A deep learning approach for semantic seg-
mentation of unbalanced data in electron tomography of catalytic materials. arXiv
preprint arXiv:2201.07342 (2022)
12. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invari-
ant mapping. In: 2006 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR 2006), vol. 2, pp. 1735–1742 (2006)
13. He, W., Ladinsky, M.S., Huey-Tubman, K.E., Jensen, G.J., McIntosh, J.R.,
Björkman, P.J.: FcRn-mediated antibody transport across epithelial cells revealed
by electron tomography. Nature 455(7212), 542–546 (2008)
14. Khadangi, A., Boudier, T., Rajagopal, V.: EM-net: deep learning for electron
microscopy image segmentation. In: 2020 25th International Conference on Pattern
Recognition (ICPR), pp. 31–38 (2021)
15. Li, C., Ducottet, C., Desroziers, S., Moreaud, M.: Toward few pixel annotations
for 3D segmentation of material from electron tomography. In: International Con-
ference on Computer Vision Theory and Applications, VISAPP 2023 (2023)
16. Liu, Q., Xu, Z., Jiao, Y., Niethammer, M.: iSegFormer: interactive segmentation
via transformers with application to 3D knee MR images. In: Wang, L., Dou, Q.,
Fletcher, P.T., Speidel, S., Li, S. (eds.) MICCAI 2022. LNCS, vol. 13435, pp. 464–
474. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16443-9_45
17. Mahadevan, S., Voigtlaender, P., Leibe, B.: Iteratively trained interactive segmen-
tation. In: British Machine Vision Conference (BMVC) (2018)
18. Medeiros-Costa, I.C., Laroche, C., Pérez-Pellitero, J., Coasne, B.: Characteriza-
tion of hierarchical zeolites: combining adsorption/intrusion, electron microscopy,
diffraction and spectroscopic techniques. Microporous Mesoporous Mater. 287,
167–176 (2019)
19. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks
for volumetric medical image segmentation. In: 2016 Fourth International Confer-
ence on 3D Vision (3DV), pp. 565–571 (2016)
20. Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time
memory networks. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision, pp. 9226–9235 (2019)
21. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng.
22(10), 1345–1359 (2009)
22. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
23. Sukhbaatar, S., Weston, J., Fergus, R., et al.: End-to-end memory networks. Adv.
Neural Inf. Process. Syst. 28 (2015)
24. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learn-
ing for human pose estimation. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
25. Tran, V.D., Moreaud, M., Thiébaut, É., Denis, L., Becker, J.M.: Inverse problem
approach for the alignment of electron tomographic series. Oil Gas Sci. Technol.-
Rev. d’IFP Energies Nouvelles 69(2), 279–291 (2014)
26. Volkmann, N.: Methods for segmentation and interpretation of electron tomo-
graphic reconstructions. Methods Enzymol. 483, 31–46 (2010)
27. Wang, H., Jiang, X., Ren, H., Hu, Y., Bai, S.: SwiftNet: real-time video object
segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 1296–1305 (2021)
28. Wurm, M., Stark, T., Zhu, X.X., Weigand, M., Taubenböck, H.: Semantic segmen-
tation of slums in satellite images using transfer learning on fully convolutional
neural networks. ISPRS J. Photogramm. Remote. Sens. 150, 59–69 (2019)
29. Zhao, X., et al.: Contrastive learning for label efficient semantic segmentation. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
10623–10633 (2021)
30. Zhou, T., Li, L., Bredell, G., Li, J., Konukoglu, E.: Quality-aware memory network
for interactive volumetric image segmentation. In: de Bruijne, M., et al. (eds.)
MICCAI 2021. LNCS, vol. 12902, pp. 560–570. Springer, Cham (2021). https://
doi.org/10.1007/978-3-030-87196-3_52
31. Zhou, T., Li, L., Bredell, G., Li, J., Unkelbach, J., Konukoglu, E.: Volumetric
memory network for interactive medical image segmentation. Med. Image Anal.
83, 102599 (2023)
Segmentation of Range-Azimuth Maps
of FMCW Radars with a Deep
Convolutional Neural Network
1 Introduction
Recent developments in low-cost frequency modulated continuous wave (FMCW)
mmWave radar sensors have opened up a new field of applications thanks to their
small packaging and low power requirements. Thanks to these new properties,
this type of sensor can now be mounted on a small-sized unmanned aerial vehicle
(UAV), commonly known as a drone, to be used for navigation and detection
purposes. Classical signal processing techniques can be applied for navigation and
detection applications; however, those quickly fall short in challenging
environments [1]. In particular, in indoor environments, multipath propagation of radar
signals creates a lot of radar clutter. Constant False Alarm Rate (CFAR) detectors
often fail to correctly interpret the structure of such environments, and point
detections further complicate the path planning process, as the precise
spatial extent of an obstacle is lost in the processing.
In order to keep radar sensing at the forefront of technology, the radar com-
munity has made a push into deep learning methods [3]. A deep Convolutional
Neural Network (CNN) is more suitable to more generally interpret and extract
features from the environment. A CNN acts as an approximation component
This research received funding from the Flemish Government under the “Onderzoek-
sprogramma Artificiele Intelligentie (AI) Vlaanderen” program.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 136–147, 2023.
https://doi.org/10.1007/978-3-031-45382-3_12
Segmentation of Range-Azimuth Maps of FMCW Radars with a CNN 137
2 Related Work
3 Proposed Method
The goal is to train a radar neural network for free space detection in the ground
plane using manually annotated free space maps. Note that reconstructing a com-
plete map of the environment is generally neither possible nor required because
of scene self-occlusion: mmWave radar has only limited penetration depth for
most commonly encountered materials and hence cannot look far beyond the
first obstacle encountered by the wavefront, but these areas behind an obstacle
are also unreachable, so irrelevant for navigation. Therefore, we will train our free
space estimation network to predict, per azimuth bin, the distance to the nearest
obstacle. More formally, our radar CNN f_CNN(·) computes a one-dimensional
vector Ŷ with ŷ_i ∈ [0, 1] representing the estimated normalized distance to the
closest obstacle in the ith azimuth bin. A value of 0 represents no free space
at all, and a value of 1 represents no obstacles within the maximum range of
the radar. This estimation vector is then compared to a ground truth vector Y
obtained from manually annotated free space maps, normalized similarly on the
maximum radar range.
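The target construction can be sketched as follows (a hypothetical helper; the polar free-space map layout, with range bin 0 nearest the sensor, is our assumption):

```python
import numpy as np

def free_space_targets(free_map):
    """Convert a polar free-space map (n_range x n_azimuth, True = free,
    range bin 0 = closest) into the target vector Y: per azimuth bin, the
    normalized distance to the nearest obstacle (0.0 = no free space at
    all, 1.0 = no obstacle within the maximum radar range)."""
    n_range, _ = free_map.shape
    obstacle = ~free_map
    first = np.where(obstacle.any(axis=0),
                     obstacle.argmax(axis=0), n_range)  # first blocked bin
    return first / n_range                              # normalize to [0, 1]

fm = np.ones((10, 3), bool)
fm[4:, 0] = False           # obstacle from range bin 4 in azimuth bin 0
fm[:, 1] = False            # fully blocked azimuth bin: no free space
y = free_space_targets(fm)  # -> [0.4, 0.0, 1.0]
```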
Fig. 1. Side view of the sensor system setup and its coordinate system. The UAV is
displayed with pitch angle θ. Our segmentation, including labelling, is done in the
local coordinate system x′y′z′.
3.2 Pre-processing
Our capture platform saves the ADC samples from the radar. The 3D radar cube
is subsequently constructed in software. This gives us increased flexibility in the
exploration of our algorithms. To improve the poor azimuth resolution of our
radar, we opted to use the CAPON [2] beamforming algorithm. After beamform-
ing, we obtain a range-azimuth map with increased angular resolution that will
be processed by our neural network. While CAPON represents a non-negligible
computational cost in our pipeline, we expect natural evolutions in commercial
FMCW radars (i.e., larger antenna arrays) will realize similar resolution at no
cost in the near future.
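For illustration, here is a minimal numpy sketch of the Capon spatial spectrum P(θ) = 1/(aᴴR⁻¹a) for a uniform linear array; the array model, snapshot simulation, diagonal loading, and angle grid are our assumptions, not the paper's processing chain:

```python
import numpy as np

def capon_spectrum(snapshots, angles, d=0.5):
    """Capon (MVDR) spatial spectrum for a uniform linear array:
    P(theta) = 1 / (a^H R^-1 a), with R the sample covariance and
    a(theta) the steering vector; `d` is element spacing in wavelengths.
    A sketch only; a practical pipeline would tune the diagonal loading."""
    n, _ = snapshots.shape
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]
    Rinv = np.linalg.inv(R + 1e-6 * np.eye(n))       # regularized inverse
    k = np.arange(n)[:, None]
    A = np.exp(-2j * np.pi * d * k * np.sin(angles)[None, :])
    return 1.0 / np.real(np.einsum('ij,ik,kj->j', A.conj(), Rinv, A))

# Simulate one point source at 20 degrees on an 8-element array.
rng = np.random.default_rng(2)
n, snaps = 8, 200
a0 = np.exp(-2j * np.pi * 0.5 * np.arange(n) * np.sin(np.deg2rad(20.0)))
s = rng.normal(size=snaps) + 1j * rng.normal(size=snaps)
X = a0[:, None] * s[None, :] \
    + 0.1 * (rng.normal(size=(n, snaps)) + 1j * rng.normal(size=(n, snaps)))
grid = np.deg2rad(np.linspace(-60, 60, 121))
P = capon_spectrum(X, grid)
peak = np.rad2deg(grid[np.argmax(P)])   # should land near 20 degrees
```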
While a typical U-Net would have a similar expanding path to predict a seg-
mentation result at the original 2D resolution, our problem formulation removes
the need to scale up to the input resolution. Instead we feed the feature vec-
tor into two fully connected (FC) layers that allow for prediction of the nearest
obstacle per azimuth bin. The final layer is a sigmoid activation which outputs a
normalized distance between 0 and 1. A detailed overview of the different layers
of the neural network is shown in Fig. 2.
Finally, we also augment every training sample by flipping and rotating the
sample around its azimuth axis, θ = 0. Rotating the sample is done over iΔθ
where Δθ = 10◦ and i ∈ {−8, −7, ..., 8}. This operation is achieved in practice
by rolling the array over the azimuth axis. It should be noted that this is not
strictly equivalent to rotating the scene, as the radar cross section (RCS) of an
object is typically angle dependent. However, for our purposes, the inaccuracy of
the resulting RCS does not negatively impact the results. After augmentation,
we have an abundance of training samples in which the typical characteristics of
radar clutter, exactly what we want our network to learn to suppress, are retained.
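The roll-based rotation (and the flip around θ = 0) reduce to simple array operations; a sketch with an assumed range × azimuth axis layout:

```python
import numpy as np

def augment(ra_map, i, dtheta_bins, flip=False):
    """Augment a range-azimuth map by rolling i * dtheta_bins along the
    azimuth axis (approximating a scene rotation while keeping the clutter
    statistics of the original sample) and optionally mirroring around
    theta = 0. The target vector would receive the same treatment."""
    out = np.roll(ra_map, i * dtheta_bins, axis=1)   # axis 1 = azimuth
    if flip:
        out = out[:, ::-1]
    return out

ra = np.arange(12).reshape(3, 4)         # toy 3-range x 4-azimuth map
rolled = augment(ra, i=1, dtheta_bins=1)
```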
Fig. 3. IoT lab acting as industrial warehouse and drone platform with Jetson Nano
(top), Intel RealSense depth camera (middle), TI IWR1443 mmWave radar (bottom
left) and Infineon 24 GHz radar (not used, bottom right).
total, 1430 labels were annotated which took 16 h. The labelling was done in
Cartesian space to achieve the highest possible accuracy. Afterwards, the polar
representation and the nearest obstacle per azimuth bin can be automatically
calculated for all our experiments. It should be noted that the obtained labels
are imperfect due to the poor radar resolution, especially at high azimuths. In
order to make a qualitative and quantitative comparison of our method, we compared its performance
to segmentation based on CFAR and by using a U-Net convolutional neural
network.
Fig. 4. A floor plan of our warehouse with the trajectory of the UAV that is used
for data generation. The three black lines represent the shelves between the different
aisles. Data captured in the green area is used for the test set, the blue area is used
for the training set. This results in a content-independent training and test dataset.
(Color figure online)
Fig. 5. Schematic overview of the evaluation procedure for our neural network com-
pared to CFAR and U-Net based segmentation.
coordinates. In the two examples, the U-Net at first sight outputs a slightly
more detailed segmented image. Although appearing more detailed, this is not
confirmed by the evaluation metrics already discussed in Table 1. In Fig. 7, a
representation of the estimated free space is displayed as an overlay on the RGB
camera.
Fig. 6. Output comparison between our neural network and the U-Net neural network
for two frames from the test dataset in polar coordinates. From left to right: RGB
camera, range-azimuth image from radar, label, output of our network converted to an
occupancy map, segmentation output of U-Net.
Fig. 7. Overlay of free space estimation on the RGB camera. The overlay is constructed
by mapping Ŷ into a color map of values between red and green. Red represents no
free space at all, green represents no obstacles within the maximum range of the radar.
The left frame corresponds with the top row in Fig. 6, the right frame corresponds with
the bottom row. (Color figure online)
5 Conclusion
In our work, we show that a low-power millimeter wave radar can be used as a
navigation sensor on a UAV in an indoor environment by using a segmentation
approach. We presented a novel CNN for free space estimation in range-azimuth
146 P. Meiresone et al.
References
1. Bilik, I., Longman, O., Villeval, S., Tabrikian, J.: The rise of radar for autonomous
vehicles: signal processing solutions and future research directions. IEEE Signal
Process. Mag. 36(5), 20–31 (2019). https://doi.org/10.1109/MSP.2019.2926573
2. Capon, J.: High-resolution frequency-wavenumber spectrum analysis. Proc. IEEE
57(8), 1408–1418 (1969). https://doi.org/10.1109/PROC.1969.7278
3. Dickmann, J., et al.: Automotive radar the key technology for autonomous driving:
from detection and ranging to environmental understanding. In: 2016 IEEE Radar
Conference (RadarConf), pp. 1–6 (2016). https://doi.org/10.1109/RADAR.2016.
7485214. ISSN 2375-5318
4. Dimitrievski, M., Shopovska, I., Hamme, D.V., Veelaert, P., Philips, W.: Weakly
supervised deep learning method for vulnerable road user detection in FMCW
radar. In: 2020 IEEE 23rd International Conference on Intelligent Transporta-
tion Systems (ITSC), pp. 1–8 (2020). https://doi.org/10.1109/ITSC45102.2020.
9294399
5. Kaul, P., de Martini, D., Gadd, M., Newman, P.: RSS-net: weakly-supervised multi-
class semantic segmentation with FMCW radar. In: 2020 IEEE Intelligent Vehi-
cles Symposium (IV), pp. 431–436 (2020). https://doi.org/10.1109/IV47402.2020.
9304674. ISSN 2642-7214
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). https://
doi.org/10.48550/arXiv.1412.6980, http://arxiv.org/abs/1412.6980
7. de Oliveira, M.L.L., Bekooij, M.J.G.: Deep convolutional autoencoder applied for
noise reduction in range-doppler maps of FMCW radars. In: 2020 IEEE Interna-
tional Radar Conference (RADAR), pp. 630–635 (2020). https://doi.org/10.1109/
RADAR42522.2020.9114719. ISSN 2640-7736
8. Rebut, J., Ouaknine, A., Malik, W., Pérez, P.: Raw high-definition radar for
multi-task learning. In: 2022 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 17000–17009 (2022). https://doi.org/10.1109/
CVPR52688.2022.01651. ISSN 2575-7075
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
10. Safa, A., et al.: A low-complexity radar detector outperforming OS-CFAR for
indoor drone obstacle avoidance. IEEE J. Sel. Top. Appl. Earth Observ. Remote
Sens. 14, 9162–9175 (2021). https://doi.org/10.1109/JSTARS.2021.3107686
11. Xiao, Y., Daniel, L., Gashinova, M.: Image segmentation and region classification in
automotive high-resolution radar imagery. IEEE Sens. J. 21(5), 6698–6711 (2020).
https://doi.org/10.1109/JSEN.2020.3043586
12. Çatal, O., Jansen, W., Verbelen, T., Dhoedt, B., Steckel, J.: LatentSLAM: unsuper-
vised multi-sensor representation learning for localization and mapping. In: 2021
IEEE International Conference on Robotics and Automation (ICRA), pp. 6739–
6745 (2021). https://doi.org/10.1109/ICRA48506.2021.9560768. ISSN 2577-087X
Upsampling Data Challenge: Object-Aware Approach for 3D Object Detection in Rain
Supported by A*STAR.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 148–159, 2023.
https://doi.org/10.1007/978-3-031-45382-3_13

1 Introduction

LiDAR object detection in adverse weather conditions poses a significant challenge in autonomous vehicle research and remains an open issue. Given an unordered sparse point cloud received by LiDAR in the rain, one could explore point cloud upsampling to achieve a denser point cloud in order to improve the
detection of different targets. Traditionally, upsampling methods have primarily
been employed to support tasks like object classification [1] and surface smooth-
ness reconstruction [2]. For upsampling experiments, the different patches of
the samples present in the training dataset are selected for training and down-
sampled using methods such as Poisson Disk Sampling. Subsequently, during
the training phase, the network is tasked with upsampling the point cloud and
comparing it with the ground truth to evaluate its performance. Traditional
approaches for LiDAR point clouds typically employ the Farthest Point Sampling
(FPS) method to select seed points, followed by the application of the K-Nearest
Neighbours (KNN) algorithm to acquire input patches. These patches are then
merged after the upsampling process. While this approach yields good results for many applications, it performs worse in adverse-weather driving scenarios, owing to the extreme sparsity of the data and its lack of emphasis on the downsampling pattern observed in natural settings.
Reconstructing complex geometry or topology from a sparse point cloud is
still an open problem. Recently, some work has optimized object detection by improving the resolution from a low-resolution LiDAR (32 channels) to a high-resolution LiDAR (64 channels) using 2D interpolation methods [3]. That work demonstrated improved mAP (mean Average Precision) for the different object classes of the publicly available KITTI dataset [4]. However, their method is
not adequate for rainy scenarios, where LiDARs can only receive point clouds
with high sparsity. We aim to enhance and detect objects rather than obtain
high-density point clouds. Building upon this motivation, we propose a novel
object-aware upsampling approach, trained using the object of interest instead
of small patches, to extend the detection range. The main contributions include:
– We propose a few object-aware upsampling strategies (an angle-invariant
approach, a semi-supervised approach and an object-aware-traditional-patch-
based combined approach) to increase the LiDAR detection range and to
overcome the difficulty of collecting labelled data.
– We verify a well-established simulator and generate a rain database, which
we can adopt as a benchmark for object detection in rain.
The paper is organized as follows. In Sect. 2, we present a rain model for our
experiments. Section 3 provides an overview of existing upsampling technolo-
gies. In Sect. 4, we introduce our novel object-aware approach. The experimental
results and concluding remarks are presented in Sect. 5.
physical sensor model and the rain model. However, it should be noted that the
simulator does not account for the noise introduced by rain.
The theoretical model used in this simulator regarding the impact of rain
on LiDAR measurements can be found in [6]. The power received by a LiDAR sensor [7] (reflected intensity) is

P_r(z) = E_l \frac{c \, \rho(z) \, A_r}{2 z^2} \, \tau_T \, \tau_R \, \exp\left(-2 \int_0^z \alpha(z') \, dz'\right) \quad (1)

where E_l is the laser pulse energy, c is the speed of light, z is the detection range, ρ(z) is the back-scattering coefficient of the target, α(z') is the scattering coefficient of rain along the path to the target, A_r is the effective receiver area, and τ_T and τ_R are the transmitter and receiver efficiencies. Without loss of generality, assuming
a homogeneous environment, the constant coefficient C_s = c \, E_l \, A_r \, \tau_T \, \tau_R
represents the particular characteristics of the sensor, and can be ignored when
calculating the relative sensor power, which is,
P_n(z) = \frac{\rho}{z^2} \, e^{-2 \alpha z} \quad (2)

Under clear weather conditions, corresponding to α = 0, at the maximum detection range of a LiDAR (z_max), the minimum detectable power for a high-reflectivity object (ρ = 0.9/π) will be

P_n^{min} = \frac{0.9}{\pi z_{max}^2} \quad (3)
To calculate the relative sensor power measured under rainy conditions, the
scattering coefficient, α, can be defined according to the power law [8],
\alpha = a R^b \quad (4)
where R is the rainfall rate (mm/h), and a and b are empirical coefficients. The
authors obtain the values a=0.01 and b=0.6 using the measurements of another
paper [9]. Therefore the final model for the relative intensity returned by the
LiDAR as a function of rainfall rate is:
P_n(z) = \frac{\rho}{z^2} \, e^{-0.02 \, R^{0.6} \, z} \quad (5)
The last equation is used to simulate rain in terms of rain rate. The points with
a power/intensity less than the value defined by (3) will be eliminated.
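Combining Eqs. (2)–(5), simulating rain amounts to attenuating each point's relative power and discarding points that fall below the clear-weather threshold of Eq. (3). A minimal sketch, where the toy ranges and reflectivity values are illustrative assumptions:

```python
import numpy as np

def simulate_rain(ranges, reflectivity, rain_rate, z_max=120.0, a=0.01, b=0.6):
    """Return a keep-mask: points whose rain-attenuated relative power stays
    above the minimum detectable power P_n^min of Eq. (3).

    ranges       : (N,) point distances z in metres
    reflectivity : (N,) back-scattering coefficients rho
    rain_rate    : rainfall rate R in mm/h
    """
    alpha = a * rain_rate ** b                                    # Eq. (4)
    p = reflectivity / ranges**2 * np.exp(-2.0 * alpha * ranges)  # Eqs. (2)/(5)
    p_min = 0.9 / (np.pi * z_max**2)                              # Eq. (3)
    return p >= p_min

# toy example: high-reflectivity points at increasing range, 50 mm/h rain
z = np.array([10.0, 40.0, 80.0, 120.0])
rho = np.full_like(z, 0.9 / np.pi)
keep = simulate_rain(z, rho, rain_rate=50.0)  # only the nearest point survives
```

At R = 50 mm/h the attenuation term dominates quickly, which is why the detection-range curves in Sect. 5 fall off so steeply with rain rate.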
4 Object-Aware Upsampling
Building upon the insights provided in the previous section, it has been estab-
lished that traditional upsampling approaches primarily emphasize patches and
treat individual points equally, overlooking the specific objects of interest. To
enhance the accuracy of object detection, this study proposes a novel rain
object-aware upsampling technique that extends the capabilities of patch-based
approaches. This approach leverages the objects of interest to guide the upsam-
pling process, thereby improving the overall effectiveness of object detection.
5 Experimental Results
5.1 Simulation Setting and Analysis
To validate the effectiveness of the simulator adopted in this work, we conducted
experiments using a publicly available database. Firstly, we focused on verifying
the rain model using NuScenes [10]. In their paper [6], Goodin et al. validated
the rain model by measuring the maximum detection range based on the rain
rate and object reflectivity. To assess the capabilities of our rain model, we repro-
duced these experiments and compared our results with the real-world findings
presented by the BMW Research group in [19]. The experimental results, shown
in Fig. 2, illustrate the distance to the farthest point in a random NuScenes
scene after applying the model (Eq. (5)). We assumed uniform reflectivity for all
points, employing the same values as [19]: 0.2 for the red line and 0.07 for the
blue line.
Fig. 2. Max Range vs Rain Rate for 1. Goodin et al. simulation [6], 2. Our reproduction
on NuScenes, 3. Real Results from BMW Research Lab [19]
As shown in Fig. 2, the behaviour of the model matches the real experiments, motivating us to use the MAVS [5] simulator for this paper.
– We incrementally moved the car away from the AV in 1 m steps until it became undetectable. This iterative process was repeated for various orientations of the car, ranging from 0 to 90◦ (perpendicular to the AV), with an interval of 10◦. Additionally, we varied the rain rate from 30 mm/h to 70 mm/h during these experiments.
A total of 5,170 scenarios have been recorded (with nine samples for each). The
hierarchy of this dataset is a series of subtrees which represent the data from
different rain rates, angles and distances.
We evaluated the impact of rain on the above-mentioned dataset using the
CenterPoint detector [12]. The detector was applied to predict bounding box
(bbox) scores and coordinates. For each sample scenario, we considered a mini-
mum distance of 5 m below the maximum detectable range, focusing specifically
on challenging cases where detection confidence and accuracy are significantly
reduced. The threshold for the bbox confidence score was set to 0.28, close to the commonly chosen value of 0.3 [20].
Moreover, we visually depicted the enhancement of the detection range for three
different methods: (1) PuGAN semi-supervised object-aware method combined
with MPU points, (2) Traditional state-of-the-art patch-based approach (MPU),
and (3) PuGAN semi-supervised object-aware method. The visualization, pre-
sented in Fig. 5, showcases the improvements achieved for two angles, namely 0◦
and 80◦ . The results illustrate the promising performance of combining points
from different upsampling methods (object-aware and patch-based) in improving
detection confidence in certain cases, while the semi-supervised method displays
better improvements in other cases. Notably, both methods outperform tradi-
tional approaches in terms of performance.
Fig. 4. Maximum detection distance vs rain rate for the target placed at an orientation of 0 and 80◦ for: a. Unsupervised PuGAN object-aware method trained only with non-rain samples; b. Semi-supervised PuGAN object-aware method trained with both rain/non-rain pairs and non-rain samples
Fig. 5. Maximum detection distance vs rain rate for the target placed at an orientation
of 0 and 80◦ for: PuGAN semi-supervised object-aware method combined with MPU
points (green line); Traditional state-of-the-art patch-based approach (MPU) (blue
line) and PuGAN semi-supervised object-aware method (red line) (Color figure online)
Table 1. Highest improvement of BEV IoU for different angles (◦), rain rates (mm/h) and distances (m)
Despite the promising nature of the findings, we acknowledge limitations in our method. Current and expected limitations encompass the following aspects: potential instability in performance, i.e., a degradation in object detection confidence or accuracy when the method is employed with various object detectors across diverse scenarios, necessitating further fine-tuning (e.g., for experiments conducted on KITTI, there is minimal-to-no improvement in the average precision metric using the semi-supervised learning method); and limited generalizability, e.g., domain knowledge or a segmentation algorithm is required to select the targets of interest. These limitations primarily
stem from the ongoing research challenge of point cloud upsampling, as well as
the fact that existing approaches have not been specifically designed to address
object detection in adverse weather conditions. Specifically, the reconstruction
of a target’s shape and the utilization of LiDAR measurements for learning pur-
poses prove challenging due to the inherent sparsity of the data. Moreover, intro-
ducing an excessive number of points or misplacing points during the upsampling
process can adversely affect the detection performance. To overcome these chal-
lenges, it is imperative to devise a more robust upsampling algorithm that takes
into account the unique difficulties associated with upsampling sparse LiDAR
point clouds.
6 Conclusion
LiDAR object detection range and accuracy are reduced significantly in rain. In this paper, we explore an object-aware upsampling method to increase the LiDAR object detection range and accuracy in the rain. Unlike existing upsampling approaches, which increase the density of all patches equally, ours focuses on the objects to be detected. We verified a well-established simulator and
the experiments on a database generated by this simulator have shown that our
object-aware networks can extend the detection range from traditional patch-
based upsampling approaches by several meters in rain conditions. In addition,
it can improve object detection accuracy in terms of BEV IoU.
Although some preliminary results obtained by applying our novel approach
presented in this study are very encouraging, more experiments and optimization
are needed to make it a stable solution to perception in rain. Currently, limited
experimentation has been conducted to ascertain the characteristics including
the conditions under which the approach is effective or unsuccessful, as well as
the accuracy limitations associated with it.
A benchmark database could be built based on more simulation data in
the near future. Furthermore, experiments on real rain data, collected using
our autonomous vehicle, will allow us to evaluate our methods under challenging adverse weather conditions. Finally, we could extend our work to improve robustness and adaptability across various targets, angles, and sparsity characteristics in the future.
References
1. Li, R., Li, X., Fu, C.-W., Cohen-Or, D., Heng, P.-A.: PU-GAN: a point cloud
upsampling adversarial network (2019). https://arxiv.org/abs/1907.10844
2. Zhou, H., Chen, K., Zhang, W., Fang, H., Zhou, W., Yu, N.: DUP-net: denoiser
and upsampler network for 3D adversarial point clouds defense (2019). https://
arxiv.org/abs/1812.11017
3. You, J., Kim, Y.-K.: Up-sampling method for low-resolution lidar point cloud to
enhance 3D object detection in an autonomous driving environment. Sensors 23(1)
(2023). https://www.mdpi.com/1424-8220/23/1/322
4. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI
dataset. Int. J. Robot. Res. 32, 1231–1237 (2013)
5. MSU autonomous vehicle simulator. https://www.cavs.msstate.edu/capabilities/
mavs.php. Accessed 29 Jan 2023
6. Goodin, C., Carruth, D., Doude, M., Hudson, C.: Predicting the influence of rain
on LiDAR in ADAS. Electronics 8, 89 (2019)
7. Dannheim, C., Icking, C., Mader, M., Sallis, P.: Weather detection in vehicles by
means of camera and LiDAR systems. In: 2014 Sixth International Conference on
Computational Intelligence, Communication Systems and Networks, pp. 186–191
(2014)
8. Lewandowski, P.A., Eichinger, W.E., Kruger, A., Krajewski, W.F.: LiDAR-based
estimation of small-scale rainfall: Empirical evidence. J. Atmos. Oceanic Tech-
nol. 26(3), 656–664 (2009). https://journals.ametsoc.org/view/journals/atot/26/
3/2008jtecha11221.xml
9. Filgueira, A., Gonzalez-Jorge, H., Laguela, S., Diaz-Vilarino, L., Arias, P.: Quan-
tifying the influence of rain in LiDAR performance. Measurement 95, 143–148
(2017). https://www.sciencedirect.com/science/article/pii/S0263224116305577
10. Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous driving (2019).
https://arxiv.org/abs/1903.11027
11. Yifan, W., Wu, S., Huang, H., Cohen-Or, D., Sorkine-Hornung, O.: Patch-based
progressive 3D point set upsampling (2018). https://arxiv.org/abs/1811.11286
12. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D object detection and tracking
(2020). https://arxiv.org/abs/2006.11275
13. Wu, J., Xu, H., Zheng, J., Zhao, J.: Automatic vehicle detection with roadside
lidar data under rainy and snowy conditions. IEEE Intell. Transp. Syst. Mag.
13(1), 197–209 (2021)
14. Wu, J., Xu, H.: The influence of road familiarity on distracted driving activities
and driving operation using naturalistic driving study data. Traffic Psychol. Behav.
52, 75–85 (2018)
15. MPU tensorflow implementation. https://github.com/yifita/3PU. Accessed 29 Jan
2023
16. PyTorch unofficial implementation of PU-net and PUGAN. https://github.com/
UncleMEDM/PUGAN-pytorch. Accessed 29 Jan 2023
17. Yang, Q., Zhang, Y., Chen, S., Xu, Y., Sun, J., Ma, Z.: MPED: quantifying point
cloud distortion based on multiscale potential energy discrepancy. IEEE Trans.
Pattern Anal. Mach. Intell. 1–18 (2022)
18. Yan, Y., Mao, Y., Li, B.: SECOND: sparsely embedded convolutional detection.
Sensors 18, 3337 (2018)
19. Rasshofer, R., Spies, M., Spies, H.: Influences of weather phenomena on automotive
laser radar systems. Adv. Radio Sci. 9, 07 (2011)
20. MMDetection3D: OpenMMlab next-generation platform for general 3D object
detection. https://github.com/open-mmlab/mmdetection3d. Accessed 29 Jan 2023
21. Yang, Z., Sun, Y., Liu, S., Jia, J.: 3DSSD: point-based 3D single stage object
detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (2020)
A Single Image Neuro-Geometric Depth Estimation
1 Introduction
Depth estimation is a fundamental task in computer vision that aims to estimate the dis-
tance of each pixel in an image from the camera [16]. It has many applications in areas
such as 3D scene perception, autonomous driving, robotics, augmented reality etc. [5, 21,
22]. The computational estimation of depth can be performed using different methods
[4, 6, 14]. Among these methods, monocular depth estimation is particularly challenging
because it requires learning complex and ambiguous relationships between image fea-
tures and depth values [19]. However, it has some advantages over other methods, such
as lower cost and simplified system complexity [5]. Therefore, many researchers have
been developing deep learning models to tackle this problem using large-scale datasets
and powerful neural networks [16].
Various depth prediction methods have been proposed in the literature aiming at
estimating the depth of object of interest. Liu et al. [15] proposed a methodology for
monocular depth prediction based on semantic class knowledge learning under geometrical constraints. However, the accuracy of this methodology is highly constrained by
the quality of the training data. In another work, the camera and the various sensors of
a mobile device have been leveraged to perform depth estimation using a single image
[3]. A limitation of this work is that its performance is dependent on the accuracy of
mobile sensors and its requirement for user input. In [2], monocular dash-cam images are
employed to estimate the relative depth between vehicles in real-time. Nevertheless, this
methodology relies on certain geometric assumptions related to the scene geometry such
as the detection of the horizon. Other studies [17, 18] measure the distance between the
camera and objects of interest and perform size measurements, using RGB-D sensors.
These methodologies can produce predictions of enhanced accuracy; however, they rely
on specialized sensors which increase the hardware complexity.
Recently, research interest has been focused on deep learning (DL)-based methods
for monocular depth estimation. Fu et al. [7] proposed a Deep Ordinal Regression Net-
work (DORN) for performing monocular depth estimation given a single RGB image.
DORN is based on a supervised CNN model following a multi-scale feature extraction
scheme with an ordinal training regression loss. In addition, DORN employs a spacing-
increasing discretization strategy for the depth values to predict high resolution depth
maps. In [1], a methodology based on transfer-learning was presented for addressing the
issue of monocular depth estimation. A pre-trained CNN, using the DenseNet architec-
ture [12] as a backbone and trained via a loss considering edge consistency and structural
similarity, has been proposed to predict high quality depth maps. In [13], a supervised
deep learning framework that includes a synergy network architecture and an attention-
driven loss, was presented for jointly learning semantic labelling and depth estimation
from a single image. Godard et al. [9], proposed a CNN framework, trained in an unsu-
pervised manner, for performing monocular depth estimation. In that study, a training
scheme based on pairs of stereo images was leveraged for predicting a disparity map to
infer depth indirectly. The methodology presented in [10] expanded the work of [9] by
introducing a self-supervised training approach which incorporates an occlusion aware
loss along with an auto-masking training loss. Furthermore, a full-resolution multi-scale
estimation procedure was applied to reduce artifacts. The methodologies of [9, 10] facil-
itate depth estimation from a single image; however, they require stereoscopic datasets
for their training.
Considering the importance of single-image depth estimation, this paper proposes a novel neuro-geometric method, in the sense that it combines a geometric and a deep learning approach, for object depth estimation, abbreviated as NGDE. NGDE infers the depth
information of an object, i.e., the object-to-camera distance, by propagating a set of
probable depth values and the 2D pixel coordinates of the bounding box of an object
to an appropriately trained MLP. Then, the MLP model is tasked to approximate an
accurate estimation of the object depth given the respective inputs. The set of probable
depth values is estimated using a virtual, automatically generated 3D point cloud (PC)
that is subsequently projected to the 2D image plane using the intrinsic parameters of the
camera. In this way, NGDE establishes correspondences between 2D pixel coordinates
162 G. Dimas et al.
and 3D world points. By leveraging these correspondences and the 2D bounding box of
an object, a set of probable depth values between the object and the camera is estimated.
An advantage of NGDE over state-of-the-art DL methods is that the object depth
estimation is solely based on the bounding box defining an object of interest, and the
geometric properties of the camera model, i.e., the depth estimation process does not
consider pixel values of the image content. Hence, NGDE is not directly affected by
changes in the environment, e.g., illumination conditions, or by compression artifacts.
Additionally, in contrast to NGDE, other single-image depth estimation approaches
require prior knowledge regarding the horizon line [2]. Nevertheless, horizon detection
may be challenging in various scenarios, such as in urban areas or indoors. Instead,
NGDE leverages the parameters of the geometric camera model to establish 3D-2D
correspondences between the real world and the image plane to predict the object depth
accurately.
The rest of this paper is organized into 3 sections. Section 2 describes the proposed
methodology; Sect. 3 presents the experimental setup and the evaluation results of NGDE against state-of-the-art models; and Sect. 4 provides the conclusions of this study.
2 Methodology
be extracted. Then, in the second stage, NGDE uses both the estimated set of probable
depth values and the bounding box pixel coordinates as inputs to an MLP trained to
estimate an accurate approximation of the depth between the camera and the object.
Initially, NGDE requires the generation of a virtual point cloud (PC) V (Fig. 1) and the
projection of each point vi = (x i , yi , zi )T ∈ V to the image plane. This is feasible by using
the intrinsic parameters of the camera, i.e., focal length ( f x , f y ) and principal point (ppx ,
ppy ) through the use of the pinhole camera model [11]:
\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} \frac{f_x x_i}{z_i} + pp_x \\ \frac{f_y y_i}{z_i} + pp_y \end{pmatrix} \quad (1)
Each point vi is a coordinate in the XYZ Cartesian system with the X and Y axes
spanning along the height and width of a 3D scene whereas the Z-axis spans along the
depth of the scene. The 2D projection of V, in the image plane following the pinhole
camera model, is a set P comprising pixel coordinates pi = (ui , vi )T , where each pi has
a corresponding virtual 3D point vi ∈ V with known 3D world coordinates.
For the needs of this study, the virtual PC V consists of points defining a horizontal plane that approximates the ground, i.e., coordinate y of vi has a value in the range (−h − δ, −h + δ), where h is the height of the camera and δ denotes the offset from −h. Hence, a rough estimate of the camera height is required for the successful application of NGDE. In addition, the target objects are assumed to lie on the ground plane. By conditioning the PC generation to roughly approximate the ground plane, the uncertainty deriving from the 3D-2D projection is minimized: due to the properties of the pinhole camera model, the 3D-2D projection is many-to-one, i.e., different 3D points may share the same pixel coordinates on the image plane.
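The construction of the virtual ground-plane point cloud V and its projection P through Eq. (1) can be sketched as follows; the camera parameters, grid extents, and function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def ground_plane_cloud(cam_height, x_range=(-10, 10), z_range=(1, 60), step=0.25):
    """Virtual PC V: a grid of points on a horizontal plane y ≈ -h,
    with the camera at the origin of the XYZ system."""
    xs = np.arange(*x_range, step)
    zs = np.arange(*z_range, step)
    X, Z = np.meshgrid(xs, zs)
    Y = np.full_like(X, -cam_height)  # y lies in (-h - δ, -h + δ); here y = -h
    return np.stack([X.ravel(), Y.ravel(), Z.ravel()], axis=1)

def project_pinhole(V, fx, fy, ppx, ppy):
    """2D projection P of V using the pinhole model of Eq. (1)."""
    u = fx * V[:, 0] / V[:, 2] + ppx
    v = fy * V[:, 1] / V[:, 2] + ppy
    return np.stack([u, v], axis=1)

V = ground_plane_cloud(cam_height=1.65)  # KITTI-like camera height (assumed)
P = project_pinhole(V, fx=721.5, fy=721.5, ppx=609.6, ppy=172.9)
```

Each row of P retains a one-to-one link to its 3D source point in V, which is exactly the 3D-2D correspondence NGDE exploits.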
Once the virtual PC V and its 2D projection P are defined, a target object needs to
be identified with a bounding box. This study assumes that the objects are on the ground
plane (therefore, one side of the bounding box lies on the ground plane). To estimate the
approximate depth between the camera and the target object a point matching process
is performed between the point set P and a point of the linear segment of the bounding
box of the object. Since PC V is an approximation of the ground plane, the middle point
pm of the linear segment that denotes the side of the bounding box lying on the ground
plane is used (Fig. 2). The 2D point matching process uses the Euclidean distance to identify a subset P' comprising N 2D points pi ∈ P that share similar pixel coordinates with pm. Since each point pi ∈ P' has a directly corresponding 3D point
vi with known z values, a set of probable depth values d init = (z1 , z2 , z3 , …, zN ) can
be extracted. This set of probable depth values d init along with the pixel coordinates
of the bounding box are used as inputs to the MLP in the next stage of the proposed
methodology (Fig. 3).
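Given the projection P and the midpoint pm of the bounding box's bottom edge, the set of probable depths d_init is simply the z-values of the N nearest projected points. A minimal sketch with a hypothetical three-point cloud:

```python
import numpy as np

def probable_depths(P, V, p_m, n_neighbours=8):
    """Return d_init: z-values of the N points of P closest (in pixels) to p_m."""
    d = np.linalg.norm(P - np.asarray(p_m), axis=1)  # 2D Euclidean distances
    idx = np.argsort(d)[:n_neighbours]               # indices of the subset P'
    return V[idx, 2]                                 # corresponding depths z_i

# toy example: three virtual points with known projections (values assumed)
V = np.array([[0.0, -1.6, 10.0], [0.0, -1.6, 20.0], [0.0, -1.6, 40.0]])
P = np.array([[320.0, 260.0], [320.0, 215.0], [320.0, 195.0]])
d_init = probable_depths(P, V, p_m=(321.0, 214.0), n_neighbours=2)
```

The resulting d_init, concatenated with the bounding box coordinates, forms the MLP input of the next stage.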
Fig. 2. Example of the 3D and 2D components used for the estimation of the probable depth
values. (a) Illustration of the 3D PC V; (b) Illustration of the 2D projection of V, object bounding
box, and pm .
Fig. 3. Example of the estimation of the probable depth values given the subset P’ and 3D PC V.
The probable depth estimation process produces a set of rough object depth approxima-
tions. The selection of a depth value estimated by this process can be subject to potential
deviations in the depth estimation originating from the systematic bias due to the cal-
ibration and the position of the camera [4]. To cope with this problem an additional
component for the final depth estimation is integrated in the proposed methodology. In
detail, an MLP network is trained to estimate the final depth value given a set of rough
approximations and the pixel coordinates of the bounding box.
Given the initial set of probable depth values d init , and the bounding box, bb,
defining the object of interest, NGDE presumes that there is a non-linear function
f (dinit , bb; w), parametrized by w, capable of inferring the final object depth. That func-
tion f (dinit , bb; w) is approximated using an MLP network composed of hidden layers
and a single output neuron. Each hidden layer contains a fixed number of k neurons that
use the hyperbolic tangent (tanh) as activation function. The prediction layer consists
of a single neuron that employs the log-sigmoid function as its activation. The final
architecture of the MLP has been experimentally determined (see Sect. 3.2).
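The forward pass of such an MLP (tanh hidden layers, a single log-sigmoid output neuron) can be sketched in NumPy as follows. The input dimensionality (16, assumed here as N = 12 probable depths plus 4 bounding box coordinates) and the random weights are illustrative assumptions; the paper fixes only the activations and the hidden configuration.

```python
import numpy as np

def log_sigmoid(x):
    """log(1 / (1 + e^-x)), computed in a numerically stable way."""
    return -np.logaddexp(0.0, -x)

def mlp_forward(x, weights, biases):
    """Hidden layers use tanh; the single output neuron uses log-sigmoid.
    x: (d,) input = [normalized probable depths, normalized bbox coords]."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)
    return log_sigmoid(weights[-1] @ h + biases[-1])

# 4 hidden layers of k = 100 neurons, the configuration chosen in Sect. 3.2
rng = np.random.default_rng(0)
dims = [16, 100, 100, 100, 100, 1]
weights = [rng.normal(0, 0.1, (dout, din)) for din, dout in zip(dims, dims[1:])]
biases = [np.zeros(d) for d in dims[1:]]
y = mlp_forward(rng.normal(size=16), weights, biases)  # shape (1,)
```

Note that the log-sigmoid output is strictly negative, so in practice the network's output must be mapped back to the (normalized) depth range.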
To train the MLP of NGDE, the Mean Squared Logarithmic Error (MSLE) was used as the loss function, formally expressed as:

L(y, \hat{y}) = \frac{1}{M} \sum_{i=1}^{M} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2 \quad (2)
where the real and predicted depth values are denoted by y and ŷ, respectively, whereas M denotes the total number of training samples. MSLE was selected on the basis that it is less sensitive to large errors than other loss functions, such as the Mean Squared Error (MSE) [20]. MSE tends to emphasize larger errors because it considers squared differences, which inflate the total training error of a model. Additionally, the
the estimation errors occurring when objects are detected in larger distances compared to
the errors in small distances. Furthermore, to cope with the difference regarding the range
of the depth and the bounding box pixel coordinate values, the min-max normalization
has been applied separately to the bounding box and probable depths.
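The MSLE loss of Eq. (2) and the min-max normalization applied separately to each input group can be sketched as (the toy depth values are illustrative):

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error, Eq. (2); log1p(x) = log(x + 1)."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def min_max(x):
    """Min-max normalization, applied separately to the bounding box
    coordinates and to the probable depth values."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

y = np.array([5.0, 20.0, 60.0])     # ground-truth depths (m)
yhat = np.array([6.0, 18.0, 50.0])  # predicted depths (m)
loss = msle(y, yhat)
```

Note how the 10 m error at 60 m contributes only slightly more to the loss than the 1 m error at 5 m, which is precisely the insensitivity to large far-range errors motivating the choice of MSLE.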
The experimental evaluation of the proposed methodology was performed on the KITTI
dataset [8]. The KITTI dataset contains RGB images representing various outdoor urban and rural scenarios, along with 3D point clouds. It is considered a benchmark for
a variety of computer vision applications including stereo, optical flow, tracking, visual
odometry, SLAM and 3D object detection. The dataset was recorded with RGB and
grayscale stereo cameras and multiple sensors, e.g., depth lidar sensors. It provides 3D
bounding boxes for pedestrians, cars, cyclists, and other objects that are depicted at each
image (Fig. 4) as well as ground truth poses, IMU measurements, GPS coordinates and
calibration data. The MLP of NGDE has been trained using the training data partition of
the KITTI dataset. For a one-to-one comparison with other similar methodologies, only unoccluded objects were considered, and the same test data partitioning as in [2] was used.
The comparison of the trained MLP models in the task of depth estimation was performed with the metrics of Mean Absolute Error (MAE), Relative Error (RE), Relative Squared Error (RSER), Root Mean Squared Error (RMSE), logarithmic RMSE (RMSE-log) and the ratio threshold δ, defined as:
MAE(y, \hat{y}) = \frac{1}{M} \sum_{i=0}^{M} |y_i - \hat{y}_i| \quad (3)

RE(y, \hat{y}) = \frac{1}{M} \sum_{i=0}^{M} \frac{|y_i - \hat{y}_i|}{y_i} \quad (4)
Fig. 4. Images of the KITTI dataset. (a) Multiple objects of class person; (b) single object of class
person; and (c) single object of class vehicle
RSER(y, \hat{y}) = \frac{1}{M} \sum_{i=0}^{M} \frac{(y_i - \hat{y}_i)^2}{y_i} \quad (5)

RMSE(y, \hat{y}) = \sqrt{\frac{1}{M} \sum_{i=0}^{M} (y_i - \hat{y}_i)^2} \quad (6)

RMSE\text{-}log(y, \hat{y}) = \sqrt{\frac{1}{M} \sum_{i=0}^{M} \left( \log y_i - \log \hat{y}_i \right)^2} \quad (7)
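These evaluation metrics can be computed as in the following sketch, assuming the standard ratio-threshold criterion max(y_i/ŷ_i, ŷ_i/y_i) < threshold for the δ accuracies (the dictionary keys and toy values are our own):

```python
import numpy as np

def depth_metrics(y, yhat):
    """Monocular depth evaluation metrics (MAE, RE, RSER, RMSE, RMSE-log,
    and ratio-threshold accuracies)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    err = y - yhat
    ratio = np.maximum(y / yhat, yhat / y)
    return {
        "MAE": np.mean(np.abs(err)),
        "RE": np.mean(np.abs(err) / y),
        "RSER": np.mean(err**2 / y),
        "RMSE": np.sqrt(np.mean(err**2)),
        "RMSE-log": np.sqrt(np.mean((np.log(y) - np.log(yhat)) ** 2)),
        "d<1.25": np.mean(ratio < 1.25),
        "d<1.25^2": np.mean(ratio < 1.25**2),
        "d<1.25^3": np.mean(ratio < 1.25**3),
    }

m = depth_metrics([10.0, 20.0, 40.0], [11.0, 18.0, 60.0])
```

Note that RE and the δ accuracies are scale-relative, whereas MAE and RMSE are dominated by errors at large depths, which is why both families are reported in Tables 1 and 2.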
Table 1. Comparative results for different MLP architectures used by NGDE. ↑ and ↓ mean higher and lower is better, respectively.

Hidden layers  k    RE↓   MAE↓  RSER↓  RMSE↓  δ<1.25↑  δ<1.25²↑  δ<1.25³↑
1              10   0.16  3.18  0.84   5.25   0.85     0.96      0.98
1              100  0.16  3.07  0.79   5.20   0.83     0.95      0.98
1              256  0.16  3.16  0.80   5.20   0.83     0.96      0.98
2              10   0.12  2.32  0.56   4.24   0.89     0.96      0.98
2              100  0.11  2.22  0.46   3.94   0.90     0.97      0.99
2              256  0.11  2.13  0.42   3.77   0.89     0.97      0.99
3              10   0.11  2.07  0.57   3.89   0.90     0.97      0.98
3              100  0.10  1.73  0.38   3.28   0.92     0.97      0.99
3              256  0.11  1.92  0.45   3.49   0.92     0.97      0.99
4              10   0.11  2.04  0.45   3.88   0.90     0.97      0.99
4              100  0.09  1.51  0.30   2.95   0.94     0.97      0.99
4              256  0.09  1.48  0.34   3.21   0.94     0.97      0.99
Table 2. Comparisons of NGDE with state-of-the-art methods for depth estimation. ↑ and ↓ mean higher and lower is better, respectively.

Method               RE↓   RSER↓  RMSE↓  RMSE-log↓  δ<1.25↑  δ<1.25²↑  δ<1.25³↑
MonoDepth [9]        0.13  1.36   6.34   0.21       0.82     0.94      0.98
DORN [7]             0.11  0.44   2.44   0.18       0.92     0.96      0.98
Alhashim et al. [1]  0.09  0.59   4.17   0.17       0.89     0.97      0.99
MonoDepth2 [10]      0.11  0.81   4.63   0.19       0.88     0.96      0.98
Ali et al. [2]       0.29  2.02   6.24   0.30       0.53     0.89      0.98
DevNet [23]          0.10  0.70   4.41   0.17       0.89     0.97      0.99
Proposed (NGDE)      0.09  0.30   3.21   0.16       0.94     0.97      0.99
layers with 256 neurons each. However, considering the complexity of the network, the architecture with 4 hidden layers of 100 neurons each was chosen as the optimal network configuration, since its performance is comparable to that of the more complex model (Fig. 5).
4 Conclusions
In this paper, a novel hybrid methodology, in the sense that it incorporates both a geometric and a deep learning component, is proposed for the estimation of object depth. Unlike
other approaches, NGDE leverages the parameters of the geometric camera model to
initially estimate a set of probable values denoting the depth between the object and the
camera. These depth values, in combination with the bounding box defining the borders
of an object, are propagated to an MLP model that makes an accurate estimation of the
depth between the target object and the camera. The architecture of the MLP component of NGDE has been determined by an ablation study where various combinations of
hyperparameters were tested.
Since NGDE does not consider any information regarding the image content, i.e.,
the pixel values, it cannot be directly affected by various changes, e.g., changes in
illumination conditions or compression artifacts, and thus it can be easily applied in
various settings including indoor environments. However, in the case that an object
detector is used for the extraction of bounding boxes, inaccuracies in the bounding box
estimation can affect the performance of NGDE.
The experimental study performed in the context of this paper showed that NGDE
can achieve a relative error regarding the object depth estimation of 0.09, outperforming
other state-of-the-art approaches that have been proposed for object depth estimation.
A limitation of NGDE is that its MLP receives as input the bounding box of a target
object. This is considered a limitation because the dimensions of the bounding box
of the same object captured under the same conditions by different cameras may vary;
as a result, it can affect the generalization capacity of NGDE. NGDE assumes that the
objects are on the ground plane and that the camera parameters and height are given.
These can be acquired offline in many cases, such as in autonomous driving, using vehicle
specification and camera calibration. However, the automatic estimation of these factors
constitutes a direction for future work. Additional future work includes the investigation
of approaches that could make NGDE more computationally efficient and independent
of the camera model, e.g., by normalizing the bounding box coordinates considering the
intrinsic parameters of the camera, as well as comparing its performance in indoor
environments.
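One way the bounding-box normalization suggested above could look in practice is to map pixel coordinates to the normalized image plane via the pinhole intrinsics. The sketch below is purely illustrative, with hypothetical focal lengths and principal point, not the authors' implementation:

```python
# fx, fy: focal lengths in pixels; cx, cy: principal point (hypothetical).
def normalize_bbox(bbox, fx, fy, cx, cy):
    """Map a pixel bbox (u_min, v_min, u_max, v_max) to normalized
    image-plane coordinates, removing the camera dependence."""
    u0, v0, u1, v1 = bbox
    return ((u0 - cx) / fx, (v0 - cy) / fy,
            (u1 - cx) / fx, (v1 - cy) / fy)

# example: 1000-px focal length, principal point at the image centre
norm = normalize_bbox((600, 400, 800, 700), fx=1000, fy=1000, cx=960, cy=540)
```

After this transform, the same physical object seen by two calibrated cameras yields comparable normalized box dimensions, which is the camera-independence the text alludes to.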
Acknowledgements. We acknowledge support of this work by the project “Smart Tourist” (MIS
5047243) which is implemented under the Action “Reinforcement of the Research and Innovation
Infrastructure”, funded by the Operational Programme "Competitiveness, Entrepreneurship and
Innovation" (NSRF 2014–2020) and co-financed by Greece and the European Union (European
Regional Development Fund).
References
1. Alhashim, I., Wonka, P.: High quality monocular depth estimation via transfer learning. arXiv
preprint arXiv:1812.11941 (2018)
2. Ali, A., Hassan, A., Ali, A.R., Khan, H.U., Kazmi, W., Zaheer, A.: Real-time vehicle distance
estimation using single view geometry. In: Proceedings of the IEEE/CVF Winter Conference
on Applications of Computer Vision (WACV) (2020)
3. Chen, S., Fang, X., Shen, J., Wang, L., Shao, L.: Single-image distance measurement by a
smart mobile device. IEEE Trans. Cybernet. 47, 4451–4462 (2016)
4. Dimas, G., Bianchi, F., Iakovidis, D.K., Karargyris, A., Ciuti, G., Koulaouzidis, A.:
Endoscopic single-image size measurements. Meas. Sci. Technol. 31, 074010 (2020)
5. Dimas, G., Gatoula, P., Iakovidis, D.K.: MonoSOD: monocular salient object detection based
on predicted depth. In: 2021 IEEE International Conference on Robotics and Automation
(ICRA), pp 4377–4383. IEEE (2021)
6. Falkenhagen, L.: Depth estimation from stereoscopic image pairs assuming piecewise
continuous surfaces. In: Image Processing for Broadcast and Video Production: Proceedings of the
European Workshop on Combined Real and Synthetic Image Processing for Broadcast and
Video Production, Hamburg, 23–24 November 1994, pp 115–127. Springer, Cham (1995).
https://doi.org/10.1007/978-1-4471-3035-2
7. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for
monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp 2002–2011 (2018)
8. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision
benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition,
pp 3354–3361 (2012)
9. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with
left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp 270–279 (2017)
10. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocu-
lar depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp 3828–3838 (2019)
11. Heikkila, J., Silvén, O.: A four-step camera calibration procedure with implicit image correc-
tion. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pp. 1106–1112. IEEE (1997)
12. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolu-
tional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp 4700–4708 (2017)
13. Jiao, J., Cao, Y., Song, Y., Lau, R.: Look deeper into depth: monocular depth estimation with
semantic booster and attention-driven loss. In: Proceedings of the European Conference on
Computer Vision (ECCV), pp 53–69 (2018)
14. Johari, M.M., Carta, C., Fleuret, F.: DepthInSpace: exploitation and fusion of mul-
tiple video frames for structured-light depth estimation. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp 6039–6048 (2021)
15. Liu, B., Gould, S., Koller, D.: Single image depth estimation from predicted semantic labels.
In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
pp 1253–1260. IEEE (2010)
16. Ming, Y., Meng, X., Fan, C., Yu, H.: Deep learning for monocular depth estimation: a review.
Neurocomputing 438, 14–33 (2021)
17. Park, H., Van Messem, A., De Neve, W.: Box-Scan: an efficient and effective algorithm
for box dimension measurement in conveyor systems using a single RGB-D camera. In:
Proceedings of the 7th IIAE International Conference on Industrial Application Engineering,
Kitakyushu, Japan, pp. 26–30 (2019)
18. Shuai, S., et al.: Research on 3D surface reconstruction and body size measurement of pigs
based on multi-view RGB-D cameras. Comput. Electron. Agric. 175, 105543 (2020)
19. Spencer, J., et al.: The monocular depth estimation challenge. In: Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, pp 623–632 (2023)
20. Tyagi, K., et al.: Regression analysis. In: Artificial Intelligence and Machine Learning for
EDGE Computing, pp 53–63. Elsevier (2022)
21. Valentin, J., et al.: Depth from motion for smartphone AR. ACM Trans. Graph. (ToG) 37,
1–19 (2018)
22. Yang, X., Luo, H., Wu, Y., Gao, Y., Liao, C., Cheng, K.-T.: Reactive obstacle avoidance of
monocular quadrotors with online adapted depth prediction network. Neurocomputing 325,
142–158 (2019)
23. Zhou, K., et al.: DevNet: self-supervised monocular depth learning via density volume
construction. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel,
23–27 October 2022, Proceedings, Part XXXIX, pp 125–142. Springer, Cham (2022). https://
doi.org/10.1007/978-3-031-19842-7_8
Wave-Shaping Neural Activation for Improved
3D Model Reconstruction from Sparse Point
Clouds
1 Introduction
In the last few decades, the reconstruction of three-dimensional (3D) models has been
widely investigated. Nowadays, 3D models of real objects are typically created from
image sequences depicting a target object from different perspectives, or by using 3D
laser scanning techniques, e.g., LIght Detection And Ranging (LIDAR). The quality of
these methods can be hindered by various factors related to the digitization process, e.g.,
low image or volume resolution, while there is usually a tradeoff between resolution
and scanning speed, i.e., higher resolution scans require longer scanning times, whereas
faster scans result in coarser 3D models.
The significance of digital twins (DTs) has been recognized in a broad variety of
domains, including industrial, medical, and cultural heritage applications. Recently, the
use of DT models of human tissue structures has been identified as an impactful research
topic in biomedicine [18]. DTs are employed to simulate the pathophysiology of human
organs under different conditions. Hence, they can be of paramount importance for
biomedical applications, since they can shorten the time required for the various trial
phases, e.g., clinical trials required for the development of novel biomedical devices. The
accuracy of these simulations depends on the fidelity of the 3D model of a human organ.
Such a 3D model can be reconstructed from magnetic resonance (MR) or computed
tomography (CT) images. However, due to uncertainty factors introduced in this process,
e.g., imperfect segmentation of the tissue structures in the tomographic slices, the quality
of the obtained 3D models varies, and a post processing stage is required to improve the
overall model quality.
In the domain of cultural heritage, museums, cultural venues, and archeological sites
can benefit from the creation of digital replicas of their exhibits. Such digital replicas
can be used as part of a virtual experience for people who cannot physically visit these
venues, or to generate physical 3D-printed replicas [12], e.g., for the preservation of the
original artifacts or for the creation of tactile exhibitions for visually impaired people
(VIP) [37].
Considering the above, several deep learning (DL) approaches have been proposed
for the reconstruction of 3D models. These approaches employ voxels [10, 40], meshes
[2, 15], or point clouds (PCs) [32, 42] to train deep artificial neural networks (ANNs).
Nevertheless, the interpolation capability of these networks is limited. Implicit neural
representations (INRs) have exhibited the capacity of expressing shapes in the form of
continuous functions with the use of multilayer perceptrons (MLPs) [7, 8, 27, 30] or
convolutional neural networks (CNNs) [31]. These networks are tasked to approximate
implicit functions based on raw data, PCs, or latent codes with or without supervision
[27, 30, 36]. Signed distance functions (SDFs) have been recently utilized in INRs to
infer different geometries [14]. Recently, the utilization of the sinusoidal (sin) function
for enhancing the performance of INRs (SIREN) has been proposed in [36].
Methods related to the 3D reconstruction of complex human tissue structures include
structure from motion (SfM) approaches [16] and their more recent variations, such as
non-rigid-structure-from-motion (NRSfM) [13, 21, 34]. More recently, generative mod-
els for 3D organ shape reconstruction have been proposed using autoencoder architec-
tures [3, 38]. In [43], a DL-based framework was utilized to reconstruct 3D colon struc-
tures based on a colon model segmented from CT images and monocular colonoscopy
images. Similar methods have also been applied to the 3D reconstruction of cultural her-
itage artifacts, focusing mainly on obtaining 3D models based on SfM [26] and multiview
stereo photogrammetry techniques [19] as an alternative to high-resolution laser scan-
ners. Moreover, DL has been implemented for 3D reconstruction based on dense models
[6]. Other studies have utilized non-DL approaches that require manual curation or use
of 3D modelling-related software [4, 39]. Based on the above studies, it can be inferred
that all current methods depend on acquiring an initially high-resolution 3D model based
on traditional photogrammetry methods, e.g., SfM. Thus, in cases where only a coarse
3D model of the object can be obtained, the quality of the extracted mesh is limited. In
174 G. Triantafyllou et al.
such scenarios, an INR-based post-processing step that can successfully reconstruct the
extracted coarse 3D model at higher resolution without using computationally expensive
methodologies would be extremely beneficial.
To this end, this paper proposes an MLP-based INR comprising a neural network that
utilizes a novel activation function, called hereinafter Wave-shaping Neural Activation
(WNA). The development and use of this novel activation function has been motivated by
SIREN. WNA enables an MLP to provide better reconstruction results given just a sparse
PC of an object. Subsequently, coarse 3D models can be properly restored to be used
in various configurations, e.g., virtual reality and DT applications. Finally, the capacity
of MLPs to learn continuous INRs that reconstruct high-quality 3D meshes of the
gastrointestinal (GI) tract and cultural heritage artifacts given sparse PC representations
is demonstrated. To the best of our knowledge, WNA functions have not been previously
considered in the context of INRs, and this is the first time that INRs have been used to
reconstruct 3D models given sparse PCs in either the cultural or biomedical domains.
2 Method
The proposed methodology aims at reconstructing 3D coarsely retrieved models by
employing a fully-connected MLP network tasked to learn an implicit continuous rep-
resentation of that model. An overview of the proposed methodology describing its
pre-processing, training, and reconstruction stages is illustrated in Fig. 1. Each hidden
layer of the MLP model comprises neurons formally expressed as ϕ(x;W,b,ω), where
ϕ(·) denotes the WNA function, W represents the weights of the neurons, b is the bias,
and ω is the learnable parameter of the WNA function. The proposed methodology does
not require training on large datasets since it focuses on a single 3D model. The MLP
network utilizes the proposed WNA function to learn an SDF [28], which efficiently
describes the 3D model that it aims to reconstruct. The MLP receives as input a point
u = (x, y, z)T of a 3D model, and it outputs a value approximating the respective SDF
response. That value describes the distance of a point from the surface of the 3D model.
After the training process, the model is capable of reconstructing the 3D model, which
was initially coarse, at a higher resolution by predicting through inference if a point
in the defined 3D space belongs to the surface of the model, i.e., its distance from the
surface is 0.
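As a concrete illustration of the SDF formulation the MLP is trained to approximate, the sketch below uses a hypothetical analytic sphere in place of the learned network: negative inside, positive outside, zero on the surface.

```python
import math

# Toy signed distance function of a sphere, standing in for the MLP:
# the network is trained so that its output follows the same convention.
def sdf_sphere(u, radius=1.0):
    x, y, z = u
    return math.sqrt(x * x + y * y + z * z) - radius

def on_surface(u, eps=1e-6):
    """A point belongs to the reconstructed surface when |SDF| ~ 0."""
    return abs(sdf_sphere(u)) < eps

# points inside, on, and outside the surface
values = [sdf_sphere(p) for p in [(0, 0, 0), (1, 0, 0), (0, 2, 0)]]
```

At inference time, reconstruction amounts to querying such a function on a dense 3D grid and extracting the zero level set, e.g., with marching cubes.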
Fig. 1. Overview of the INR approach for 3D model refinement and restoration
The WNA function is introduced aiming at further improving the implicit representation
capacity of MLPs regarding coarse 3D models. The implementation of the proposed
WNA is achieved by applying the tanh function as a “wave-shaper” to the sin function.
The tanh function is chosen since it is widely used as a wave-shaper in various applica-
tions in the context of signal processing [17, 22, 29]. The proposed WNA function and
its derivative are:
WS(x; ω) = tanh(sin(ωx))    (1)

dWS(x; ω)/dx = ω cos(ωx) sech²(sin(ωx))    (2)
where ω ∈ R is a learnable parameter of WNA in contrast to SIREN, where it is applied
manually as a constant throughout the training and inference processes. Equation (2)
describes the first derivative of the WNA function. As it can be seen in Fig. 2(a), for ω
= 1, the WNA function can be regarded as a scaled approximation of sin.
Fig. 2. Graphical representation of the proposed WNA and sin functions. (a) Responses and (b)
derivatives of the WNA and sin functions.
This means that WNA maintains the properties of sin, i.e., phase, periodicity, and
it has upper and lower bounds. Nevertheless, it can be observed that the derivative of
the WNA function produces a more complex expression (Fig. 2(b)); thus, it is expected
that, during the training process, WNA will produce a different computation of gradients
compared to sin. The following section shows that the gradients computed during the
training of an MLP that utilizes the WNA function enable the network to efficiently
reconstruct 3D models given only a small number of surface points.
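The relationship between the activation and its gradient can be checked numerically. The sketch below assumes WS(x; ω) = tanh(sin(ωx)), i.e., tanh applied as a wave-shaper to sin, which is consistent with the derivative in Eq. (2):

```python
import math

# Wave-shaped activation WS(x; omega) = tanh(sin(omega * x)).
# omega is the learnable frequency parameter.
def wna(x, omega=1.0):
    return math.tanh(math.sin(omega * x))

def wna_derivative(x, omega=1.0):
    # omega * cos(omega x) * sech^2(sin(omega x)), as in Eq. (2)
    sech = 1.0 / math.cosh(math.sin(omega * x))
    return omega * math.cos(omega * x) * sech ** 2

# central-difference check that the analytic derivative matches
h = 1e-6
for x in (0.0, 0.5, 2.0):
    numeric = (wna(x + h) - wna(x - h)) / (2 * h)
    assert abs(numeric - wna_derivative(x)) < 1e-6
```

At x = 0 the derivative equals ω, matching the observation that for ω = 1 the function behaves like a scaled sin near the origin.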
the ground truth mesh. This is achieved by selecting n samples of points (n = 1000)
from the ground truth mesh and finding the fraction of points that are closer than a
minimum distance Δd from the generated mesh. In this study, the value of Δd was
selected according to the scale of each model.
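The completeness measure described above can be sketched as follows, using hypothetical point sets and approximating the point-to-mesh distance by the distance to the nearest generated point:

```python
import math, random

def completeness(gt_points, gen_points, delta_d, n=1000, seed=0):
    """Mesh completeness: fraction of n points sampled from the ground
    truth that lie closer than delta_d to the generated geometry
    (nearest generated point used as the distance proxy)."""
    rng = random.Random(seed)
    sample = [rng.choice(gt_points) for _ in range(n)]
    hits = sum(
        min(math.dist(p, q) for q in gen_points) < delta_d for p in sample
    )
    return hits / n

# toy check: identical point sets are fully "complete"
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
mc = completeness(pts, pts, delta_d=0.01, n=100)
```

In practice Δd is chosen per model, as the text notes, since the point clouds are not normalized to a common scale.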
In all the experiments, the architecture of the neural network comprised 4 hidden
layers, with 256 neurons each, the proposed WNA function, and a linear output layer.
This type of MLP architecture has been widely used in related works [35, 36]. All models
used in the evaluation process were trained with the same number of epochs, batch size,
and initialization parameters. In particular, the batch size was 5 × 10³ on-surface points
and 5 × 10³ off-surface points, and the number of epochs for the experiments was
3 × 10³; it was observed that training the network for more epochs did not yield better
results. The Adam optimizer was utilized for the weights optimization with a learning
rate of 10⁻⁴. To simulate 3D models with a sparse 3D representation, the dense PC of
each high-resolution 3D model was used to generate sparse PCs with densities of 0.5%,
1%, 5%, 10%, 20%, and 40% the original PC density. This was achieved by randomly
sampling points from each original PC based on a uniform distribution.
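The density simulation described above amounts to uniform random subsampling of the original point cloud; a minimal sketch (illustrative, not the authors' code):

```python
import random

def subsample(points, density, seed=0):
    """Keep a uniformly random fraction `density` of the point cloud,
    e.g. density=0.05 simulates a 5% sparse scan."""
    k = max(1, int(len(points) * density))
    return random.Random(seed).sample(points, k)

# toy dense cloud of 1000 points along a line
cloud = [(float(i), 0.0, 0.0) for i in range(1000)]
sparse = subsample(cloud, 0.05)   # 5% density
```

Each sparse cloud is then used to train the INR, so the experiments measure how gracefully reconstruction quality degrades with input density.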
Fig. 3. Qualitative comparison of the reconstruction outcome for a representative large intestine
reconstructed from PCs with various densities: (a) 0.5%; (b) 5%; (c) 20%; (d) CT-sampled.
that it can produce meshes with better quality. Moreover, the proposed method and
SIREN needed 40% of the original PC to learn the represented shape and only 5% to
generate high-quality meshes.
Table 1. Average CD, EMD, and MC of the proposed WNA function against other methods for GI
tract 3D models with different PC densities. ↑ and ↓ mean higher and lower is better, respectively.
Fig. 4. Qualitative comparison of the reconstruction of a sarcophagus from PCs with various
initial densities: (a) 0.5%; (b) 5%; (c) 10%; (d) 20%.
As in Figs. 3 (b, c), the quality of the meshes improves with increasing PC density.
Moreover, it can be seen that the baseline model produces low-quality meshes even for
higher densities (Figs. 4(c, d)).
Table 2 exhibits the quantitative results obtained from the comparative evaluation of
the cultural objects. Based on the CD values, the proposed methodology outperforms the
other methods in terms of the similarity between PCs, except from the 40% PC density.
In addition, the MC values verify the capacity of the proposed method to generate more
complete meshes given PCs with low density (Fig. 4(a)). Again, for higher PC densities
the proposed method is on par with SIREN. These types of objects (Fig. 4) encompass
challenging surface geometries, i.e., both smooth and irregular. This may be the reason
why both the proposed method and SIREN produce meshes with comparable EMD
values.
Table 2. Average CD, EMD, and MC of the proposed WNA function against other methods for
cultural heritage 3D models with different PC densities. ↑ and ↓ mean higher and lower is better,
respectively.
PC Density (%)   WNA: CD↓  EMD↓  MC↑      SIREN: CD↓  EMD↓  MC↑    Baseline (SP): CD↓  EMD↓  MC↑
0.5              3.747  6.084  0.747      7.533  7.065  0.700      9.459  6.799  0.055
1                2.571  6.465  0.860      2.590  6.891  0.851      5.863  4.206  0.166
5                1.668  6.941  0.975      1.692  6.778  0.969      2.862  3.209  0.610
10               1.544  7.144  0.989      1.565  6.909  0.991      2.120  3.835  0.889
20               1.476  6.495  0.993      1.486  6.523  0.994      1.870  7.683  0.943
40               1.650  5.395  0.984      1.459  6.371  0.990      1.699  7.414  0.963
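The Chamfer distance (CD) reported in Tables 1 and 2 can be sketched for two point sets as a symmetric nearest-neighbour measure; the exact variant and scaling the authors use may differ:

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two 3D point sets: mean
    squared nearest-neighbour distance, accumulated in both directions."""
    def one_way(src, dst):
        return sum(
            min(math.dist(p, q) ** 2 for q in dst) for p in src
        ) / len(src)
    return one_way(a, b) + one_way(b, a)

# toy point sets; the second set displaces one point by (0, 1, 0)
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(a, b)
```

This brute-force version is O(|a|·|b|); real evaluations typically use a k-d tree for the nearest-neighbour queries.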
4 Conclusions
In this paper, the use of INRs and a novel periodic parametric activation function for the
reconstruction of coarse 3D models has been evaluated in two case studies. To the best
of our knowledge, this is the first time that INRs are exploited for the reconstruction of
complex human tissue models and cultural heritage artifacts. The proposed method is
based on a 5-layer MLP combined with a novel neural activation function. The effect
of the proposed WNA function was compared with state-of-the-art methodologies. The
evaluation study suggests that the employment of the WNA function produces better
results, when evaluated in the context of 3D reconstruction of sparse 3D models. In
addition, the parametrization of WNA alleviates the need for its manual adjustment
when the target model changes. The proposed 3D reconstruction methodology can be
regarded as unsupervised, since it is trained directly on a PC of a coarse 3D model
without the need for a large dataset of other models or any labeled data. Given the
fact that periodic functions seem to have a positive effect on INRs, future work should
attempt to explore other periodic activation functions, determine how different types of
parameterization affect the performance of the network, and seek ways to improve
the reconstruction quality in scenarios with extremely sparse PCs.
Acknowledgement. We acknowledge support of this work by the project “Smart Tourist” (MIS
5047243) which is implemented under the Action “Reinforcement of the Research and Innovation
Infrastructure”, funded by the Operational Programme "Competitiveness, Entrepreneurship and
Innovation" (NSRF 2014–2020) and co-financed by Greece and the European Union (European
Regional Development Fund).
References
1. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and genera-
tive models for 3D point clouds. In: International Conference on Machine Learning. PMLR,
pp. 40–49 (2018)
2. Bagautdinov, T., Wu, C., Saragih, J., Fua, P., Sheikh, Y.: Modeling facial geometry using
compositional VAEs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp 3877–3886 (2018)
3. Balashova, E., Wang, J., Singh, V., Georgescu, B., Teixeira, B., Kapoor, A.: 3D organ shape
reconstruction from Topogram images. In: Chung, A.C.S., Gee, J.C., Yushkevich, P.A., Bao,
S. (eds.) IPMI 2019. LNCS, vol. 11492, pp. 347–359. Springer, Cham (2019). https://doi.org/
10.1007/978-3-030-20351-1_26
4. Ballarin, M., Balletti, C., Vernier, P.: Replicas in cultural heritage: 3D printing and the museum
experience. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 42, 55–62 (2018)
5. Chabra, R., et al.: Deep local shapes: learning local SDF priors for detailed 3D reconstruction.
In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374,
pp. 608–625. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58526-6_36
6. Chen, X., et al.: A fast reconstruction method of the dense point-cloud model for cultural
heritage artifacts based on compressed sensing and sparse auto-encoder. Opt. Quant. Electron.
51, 1–16 (2019)
7. Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5939–5948
(2019)
8. Chibane, J., et al.: Neural unsigned distance fields for implicit function learning. In: Advances
in Neural Information Processing Systems, vol. 33, pp. 21638–21652 (2020)
9. Clark, K., et al.: The cancer imaging archive (TCIA): maintaining and operating a public
information repository. J. Digit. Imaging 26, 1045–1057 (2013)
10. Dai, A., Ruizhongtai Qi, C., Nießner, M.: Shape completion using 3D-encoder-predictor
CNNs and shape synthesis. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp 5868–5877 (2017)
11. Deng, Z., Yao, Y., Deng, B., Zhang, J.: A robust loss for point cloud registration. In: Pro-
ceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6138–6147
(2021)
12. Garcia Carrizosa, H., Sheehy, K., Rix, J., Seale, J., Hayhoe, S.: Designing technologies for
museums: accessibility and participation issues. J. Enabl. Technol. 14, 31–39 (2020)
13. Gómez-Rodríguez, J.J., Lamarca, J., Morlana, J., Tardós, J.D., Montiel, J.M.: SD-DefSLAM:
semi-direct monocular SLAM for deformable and intracorporeal scenes. In: 2021 IEEE
International Conference on Robotics and Automation (ICRA), pp 5170–5177. IEEE (2021)
14. Gropp, A., Yariv, L., Haim, N., Atzmon, M., Lipman, Y.: Implicit geometric regularization
for learning shapes. arXiv preprint arXiv:2002.10099 (2020)
15. Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: A papier-mâché approach to
learning 3D surface generation. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp 216–224 (2018)
16. Hu, M., Penney, G., Edwards, P., Figl, M., Hawkes, D.J.: 3D reconstruction of internal organ
surfaces for minimal invasive surgery. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI
2007. LNCS, vol. 4791, pp. 68–77. Springer, Heidelberg (2007). https://doi.org/10.1007/978-
3-540-75757-3_9
17. Huovilainen, A.: Non-linear digital implementation of the Moog ladder filter. In: Proceedings
of the International Conference on Digital Audio Effects (DAFx-04), pp 61–64 (2004)
18. Kalozoumis, P.G., Marino, M., Carniel, E.L., Iakovidis, D.K.: Towards the development of a
digital twin for endoscopic medical device testing. In: Hassanien, A.E., Darwish, A., Snasel,
V. (eds.) Digital Twins for Digital Transformation: Innovation in Industry. Studies in Systems,
Decision and Control, vol. 423, pp. 113–145. Springer, Cham (2022). https://doi.org/10.1007/
978-3-030-96802-1_7
19. Kaneda, A., Nakagawa, T., Tamura, K., Noshita, K., Nakao, H.: A proposal of a new automated
method for SfM/MVS 3D reconstruction through comparisons of 3D data by SfM/MVS and
handheld laser scanners. PLoS ONE 17, e0270660 (2022)
20. Kazhdan, M., Hoppe, H.: Screened Poisson surface reconstruction. ACM Trans. Graph. (ToG)
32, 1–13 (2013)
21. Lamarca, J., Parashar, S., Bartoli, A., Montiel, J.: DefSLAM: tracking and mapping of
deforming scenes from monocular sequences. IEEE Trans. Rob. 37, 291–303 (2020)
22. Lazzarini, V., Timoney, J.: New perspectives on distortion synthesis for virtual Analog
oscillators. Comput. Music. J. 34, 28–40 (2010)
23. Levina, E., Bickel, P.: The earth mover’s distance is the mallows distance: some insights from
statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV
2001, pp 251–256. IEEE (2001)
24. Lewiner, T., Lopes, H., Vieira, A.W., Tavares, G.: Efficient implementation of marching cubes’
cases with topological guarantees. J. Graph. Tools 8, 1–15 (2003)
25. Ma, B., Han, Z., Liu, Y.-S., Zwicker, M.: Neural-pull: learning signed distance functions from
point clouds by learning to pull space onto surfaces. arXiv preprint arXiv:2011.13495 (2020)
26. Makantasis, K., Doulamis, A., Doulamis, N., Ioannides, M.: In the wild image retrieval
and clustering for 3D cultural heritage landmarks reconstruction. Multimed. Tools Appl. 75,
3593–3629 (2016)
27. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks:
learning 3D reconstruction in function space. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp 4460–4470 (2019)
28. Osher, S., Fedkiw, R.: Signed distance functions. In: Level Set Methods and Dynamic Implicit
Surfaces, pp 17–22. Springer (2003)
29. Pakarinen, J., Yeh, D.T.: A review of digital techniques for modeling vacuum-tube guitar
amplifiers. Comput. Music. J. 33, 85–100 (2009)
30. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: DeepSDF: learning contin-
uous signed distance functions for shape representation. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp 165–174 (2019)
31. Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M., Geiger, A.: Convolutional occupancy
networks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12348, pp. 523–540. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_31
32. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3D classifi-
cation and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp 652–660 (2017)
33. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation
of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR 2006), pp 519–528. IEEE (2006)
34. Sengupta, A., Bartoli, A.: Colonoscopic 3D reconstruction by tubular non-rigid structure-
from-motion. Int. J. Comput. Assist. Radiol. Surg. 16, 1237–1241 (2021)
35. Ben-Shabat, Y., Koneputugodage, C.H., Gould, S.: DiGS: divergence guided shape implicit
neural representation for unoriented point clouds. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp 19323–19332 (2022)
36. Sitzmann, V., Martel, J., Bergman, A., Lindell, D., Wetzstein, G.: Implicit neural represen-
tations with periodic activation functions. In: Advances in Neural Information Processing
Systems, vol. 33, pp. 7462–7473 (2020)
37. Vaz, R., Freitas, D., Coelho, A.: Blind and visually impaired visitors’ experiences in museums:
increasing accessibility through assistive technologies. Int. J. Inclusive Mus. 13, 57 (2020)
38. Wang, Z., et al.: A deep learning based fast signed distance map generation. arXiv preprint
arXiv:2005.12662 (2020)
39. Wilson, P.F., Stott, J., Warnett, J.M., Attridge, A., Smith, M.P., Williams, M.A.: Evaluation
of touchable 3D-printed replicas in museums. Curator Mus. J. 60, 445–465 (2017)
40. Wu, Z., et al.: 3D shapenets: a deep representation for volumetric shapes. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp 1912–1920 (2015)
41. Xu, Z., Xu, C., Hu, J., Meng, Z.: Robust resistance to noise and outliers: screened Poisson
surface reconstruction using adaptive kernel density estimation. Comput. Graph. 97, 19–27
(2021)
42. Yuan, W., Khot, T., Held, D., Mertz, C., Hebert, M.: PCN: point completion network. In: 2018
International Conference on 3D Vision (3DV), pp 728–737. IEEE (2018)
43. Zhang, S., Zhao, L., Huang, S., Ma, R., Hu, B., Hao, Q.: 3D reconstruction of deformable colon
structures based on preoperative model and deep neural network. In: 2021 IEEE International
Conference on Robotics and Automation (ICRA), pp 1875–1881. IEEE (2021)
44. Zhou, L., Sun, G., Li, Y., Li, W., Su, Z.: Point cloud denoising review: from classical to deep
learning-based approaches. Graph. Models 121, 101140 (2022)
A Deep Learning Approach to Segment
High-Content Images of the E. coli
Bacteria
1 Introduction
Antimicrobial resistance (AMR) is a global health challenge [25]. A systematic
review published in 2019 estimated that approximately 1.27 million deaths are
attributable to AMR infections worldwide [22]. Remarkably, there were 33,000
deaths caused by resistant bacteria in Europe [7] and 23,000 deaths in the USA
[12]. According to the WHO, the hospitalization costs for patients infected with
AMR bacteria are significantly higher than those for patients infected with
susceptible organisms [24]. The economic cost of AMR infections in Europe was
estimated to be around 1.1 billion Euros [26]. In a broader view, without efficient
intervention, AMR has been predicted to directly or indirectly cause 10 million
deaths worldwide and to cost the global economy £55 trillion for the treatment of
infections with AMR pathogens [23].
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 184–195, 2023.
https://doi.org/10.1007/978-3-031-45382-3_16
ization, making it very useful in biomedical image analysis when annotated data
is scarce [14,19]. Transfer learning and fine-tuning approaches have been used
in recent research and competitions. Lee et al. [17] proposed MEDIAR, a holistic
pipeline that employs pretraining and fine-tuning for cell instance segmentation
in a multi-modality environment. The approach achieved the highest score on the
NeurIPS 2022 cell segmentation challenge.
In this study, we first develop a protocol to acquire high-content fluorescent
images of the E. coli bacteria cells in densely populated environments. One hun-
dred full-size images of 2160 × 2160 pixels are carefully annotated and will be
released for public access to contribute to the biomedical community. Compared
to other public data sets [9,30], our data set contains images at a higher reso-
lution, with a dense distribution of bacterial cells in each field of view. We then
employ a convolutional neural network approach, namely EffNet-UNet, to boost
the segmentation network performance by utilizing transfer learning. The pro-
posed method outperforms other methods in this study and can be considered
a standardized benchmark for bacterial segmentation for high-content imaging
analysis.
2 Methodology
This section presents our main segmentation problem, how we collected and
annotated data for the experiments, the proposed deep learning approach, and
the evaluation metrics.
One antimicrobial-resistant isolate was randomly selected for this study from
a collection of twenty-six bloodborne E. coli isolates [35]. The study was approved by
the Oxford Tropical Research Ethics Committee (OxTREC, reference number
35–16) and the Ethics Committee of Children’s Hospital 1 (reference number
73/GCN/BVND1). One laboratory isolate of ATCC 25922 was included in this
study as a susceptible reference.
Image acquisition was performed on the Opera Phenix high-content imaging
platform (PerkinElmer, Germany); 100 images were selected randomly for
annotation at full resolution (2160 × 2160 pixels) using the APEER Annotate platform1. An
example of a raw image and its corresponding instance segmentation mask is
shown in Fig. 1.
Fig. 1. A full-size E. coli fluorescence image and the corresponding segmentation mask.
the individual cells for the final instance segmentation result. EfficientNet is
a convolutional neural network (CNN) architecture with a compound scaling method
[34] that effectively scales network depth, width, and resolution to achieve
state-of-the-art results on the ImageNet classification task. There
are eight variants in the EfficientNets family, namely EfficientNetB0 to Efficient-
NetB7. EfficientNets have been shown to consistently achieve higher accuracy
on small new data sets with an order of magnitude fewer parameters than exist-
ing methods by employing transfer learning using the pre-trained model on the
ImageNet [10] data set. We aim to combine the effectiveness of EfficientNet as a
feature extractor in the encoder and the U-Net [29] decoder for generating fine
segmentation maps.
Because microscopy images have a very high resolution, it is difficult
for neural network models to train on native-resolution data. Many approaches
have been proposed to resolve this issue, such as resizing, random cropping, and
creating smaller patches of the input images to feed into the model [6,8,17]. We
adopt the approach of creating a patch data set to use as input for training
the neural networks. Multiple patches of size 256 × 256 are extracted from the
original data set of size 2160 × 2160; we choose a minimal overlap between
patches and remove patches that contain no bacterial cells or only cells
smaller than 100 pixels. During validation and inference, the full-size image
will be predicted using a sliding-window process. An overview of our approach
is shown in Fig. 2.
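The patch-generation step above can be sketched as follows. The stride value and the foreground-count filter are illustrative assumptions: the paper filters on individual cell size, which would additionally require connected-component labeling.

```python
import numpy as np

def extract_patches(image, mask, patch=256, stride=248, min_area=100):
    """Tile a full-size image into overlapping patches, keeping only
    patches whose mask contains at least min_area foreground pixels."""
    patches = []
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            m = mask[y:y + patch, x:x + patch]
            # keep the patch only if it contains enough foreground pixels
            if np.count_nonzero(m) >= min_area:
                patches.append((image[y:y + patch, x:x + patch], m))
    return patches
```

A stride of 248 with a 256-pixel window gives the minimal 8-pixel overlap between neighboring patches.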
decoder part from the original U-Net is reused. The input size is 256 × 256 patch
images from the preprocessing step. In the decoder, transposed convolution was
employed to restore the resolution of the feature maps. The output segmentation
maps consist of two channels, one for the semantic mask of the cell cytoplasm and
the other for the cell instance boundary. The pre-trained weights of EfficientNet
on ImageNet are used as initialization in the encoder and then further fine-tuned
to our data set.
Loss Function. For training the U-Net and EfficientNet-UNet models, we adopt
a weighted binary cross-entropy (WBCE) loss function to account for the binary
semantic mask $y_{seg}$ and the boundary contour mask $y_{cont}$ [18], defined as:

$$\mathcal{L} = \lambda \, \mathrm{WBCE}(\hat{y}_{seg}, y_{seg}) + (1 - \lambda) \, \mathrm{WBCE}(\hat{y}_{cont}, y_{cont}) \tag{1}$$

where $\hat{y}_{seg}$ and $\hat{y}_{cont}$ are the decoder outputs that correspond to the segmentation
mask and the boundary contours given the input data x. We set λ as 0.5 in
our experiments to balance between optimizing the segmentation mask and the
boundary contour.
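As a rough sketch of the combined loss: the exact pixel-weighting scheme of the WBCE in [18] is not reproduced here, so the scalar `weight` on the positive class is an assumption.

```python
import numpy as np

def bce(y_true, y_pred, weight=1.0, eps=1e-7):
    """(Weighted) binary cross-entropy, averaged over pixels.
    weight scales the positive-class term (assumed weighting scheme)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_pixel = -(weight * y_true * np.log(y_pred)
                  + (1 - y_true) * np.log(1 - y_pred))
    return per_pixel.mean()

def combined_loss(y_seg, yhat_seg, y_cont, yhat_cont, lam=0.5, w_fg=1.0):
    # lam balances the segmentation-mask term against the boundary term
    return lam * bce(y_seg, yhat_seg, w_fg) + (1 - lam) * bce(y_cont, yhat_cont, w_fg)
```

With lam = 0.5, as in the experiments, the two terms contribute equally.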
[17]. In our experiments, we use a window size of (256 × 256) with a minimal
overlap size of 8.4 pixels to reconstruct the final image.
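A minimal sketch of such sliding-window reconstruction, assuming overlapping predictions are merged by averaging (the merging rule is our assumption, not stated in the text):

```python
import numpy as np

def sliding_window_predict(image, predict_fn, window=256, overlap=8):
    """Reconstruct a full-size prediction map by running predict_fn on
    overlapping windows and averaging the overlapping regions.
    Assumes both image dimensions are >= window."""
    h, w = image.shape
    stride = window - overlap
    acc = np.zeros((h, w), dtype=np.float64)
    cnt = np.zeros((h, w), dtype=np.float64)
    ys = list(range(0, h - window + 1, stride))
    xs = list(range(0, w - window + 1, stride))
    # make sure the last window reaches the image border
    if ys[-1] != h - window:
        ys.append(h - window)
    if xs[-1] != w - window:
        xs.append(w - window)
    for y in ys:
        for x in xs:
            acc[y:y + window, x:x + window] += predict_fn(image[y:y + window, x:x + window])
            cnt[y:y + window, x:x + window] += 1
    return acc / cnt
```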
Post-processing. We observe that the smallest cell in our data set has an area
of 101 pixels. Thus, after obtaining the semantic mask and the instance boundary
contours, small blobs with an area of fewer than 100 pixels are removed from
the predictions by a morphological masking operation. Then, a marker-controlled
watershed transformation [3] is applied to the output probability maps. In our
experiments, we set thresholds of 0.5, 0.1, and 0.3 for computing the seed
map, boundary contour, and semantic mask from the output probability maps,
respectively.
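The small-blob removal step can be sketched with a plain flood-fill connected-component pass; in practice, library routines such as `skimage.morphology.remove_small_objects` followed by `skimage.segmentation.watershed` would be used.

```python
import numpy as np
from collections import deque

def remove_small_blobs(mask, min_area=100):
    """Remove 4-connected foreground components smaller than min_area pixels.
    In the full pipeline this precedes the marker-controlled watershed."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    out = mask.copy()
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                # flood-fill one component and collect its pixels
                comp, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) < min_area:
                    for y, x in comp:
                        out[y, x] = 0
    return out
```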
$$P_\tau = \frac{TP_\tau}{TP_\tau + FP_\tau}; \quad R_\tau = \frac{TP_\tau}{TP_\tau + FN_\tau}; \quad F1_\tau = \frac{2 \cdot TP_\tau}{2 \cdot TP_\tau + FP_\tau + FN_\tau} \tag{2}$$
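Eq. (2) translates directly into code:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 at a given matching threshold tau (Eq. 2)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1
```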
3 Experiments
Due to the small size of the data set, we conducted a 5-fold cross-validation to
evaluate the performance and stability of the proposed methods. The number
of training, validation, and testing images for each fold is 78, 2, and 20, respec-
tively. During training, the patch data set is generated for every image in the
training folds; the validation and testing images are kept in full-size resolution.
The number of images in every training fold is increased to around 3000 images
after the generation of the patch data set.
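The 78/2/20 split per fold can be sketched as follows; the random seed and the exact assignment of validation images are illustrative assumptions.

```python
import random

def five_fold_splits(n_images=100, n_val=2, seed=0):
    """Split image indices into 5 folds of 20 test images each;
    from the remaining 80, hold out n_val images for validation."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    splits = []
    for k in range(5):
        test = folds[k]
        rest = [i for i in idx if i not in test]
        val, train = rest[:n_val], rest[n_val:]
        splits.append((train, val, test))
    return splits
```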
Hyper-Parameters. The Cellpose and Omnipose models are trained using the
default hyperparameters from the papers [9,32]. The U-Net and EffNet-UNet
models are trained using the Adam stochastic optimizer [16] with an initial
learning rate of 0.0003. The batch size is set to 8 and the number of epochs
to 200, with an early-stopping criterion based on the validation loss and a
patience of 50 epochs.
For the other deep learning approaches, Cellpose and Omnipose displayed better
results than the Harmony program, with F1-scores of 0.87 and 0.73, respectively.
Interestingly, the Omnipose model did not perform as well as the Cellpose
model after the fine-tuning process on our data set, although Omnipose was
developed for bacterial segmentation. These results can be explained by intrinsic
variation between distinct data sets acquired under different conditions.
Cellpose possessed the highest precision among all methods tested but
lower recall and F1-score compared to the original U-Net and EffNetB7-UNet
(Table 1). These effects are demonstrated in Fig. 4c: although Cellpose's
predictions were accurate, it failed to detect many bacterial cells
that lie close together or are affected by artifacts, leading to a lower recall
score.
Compared to the Cellpose and Omnipose methods, which are based on gradient-
flow tracking, all the marker-controlled watershed U-Net-based models per-
formed well in capturing dense bacterial cells, yielding high precision, recall,
and F1-score. Overall, the precisions of these models were over 0.87, while recalls
exceeded 0.91 and F1-scores reached at least 0.89 (Table 1). Among these mod-
els, the EffNetB7-UNet showed the highest performance with precision, recall,
and F1-Score of 0.89, 0.94, and 0.91, respectively (Table 1). As shown in Fig. 4f,
the method could segment 23 out of 23 cells in the cropped image, which is
consistent with its high recall and F1-score. In addition, we also investigated the
transfer learning performance of the EffNetB7-UNet model with weights ini-
tialized from a pre-trained model on the “bact_fluor” data set, i.e., the data
set of bacterial fluorescent images published along with the Omnipose method
[9]. However, only precision increased to 0.90, while the F1-score remained
unchanged (Table 2).
Table 1. The average results from the 5-Fold cross-validation procedure between the
proposed methods and other methods.
Fig. 4. Comparison between different segmentation methods. Images are zoomed in for
clear visualization.
References
1. Baheti, B., Innani, S., Gajre, S., Talbar, S.: Eff-UNet: a novel architecture for
semantic segmentation in unstructured environment. In: 2020 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle,
WA, USA, pp. 1473–1481. IEEE (2020)
2. Berg, S., et al.: ilastik: interactive machine learning for (bio)image analysis. Nat.
Methods 16(12), 1226–1232 (2019)
3. Beucher, S., Meyer, F.: Segmentation: the watershed transformation. Mathematical
morphology in image processing. Opt. Eng. 34, 433–481 (1993)
4. Boutros, M., Heigwer, F., Laufer, C.: Microscopy-based high-content screening.
Cell 163(6), 1314–1325 (2015)
5. Caicedo, J., et al.: Data-analysis strategies for image-based cell profiling. Nat.
Methods 14(9), 849–863 (2017)
6. Caicedo, J.C., et al.: Nucleus segmentation across imaging experiments: the 2018
data science bowl. Nat. Methods 16(12), 1247–1253 (2019)
7. Cassini, A., et al.: Attributable deaths and disability-adjusted life-years caused by
infections with antibiotic-resistant bacteria in the EU and the European economic
area in 2015: a population-level modelling analysis. Lancet. Infect. Dis 19(1), 56–66
(2019)
8. Chen, H., Qi, X., Yu, L., Heng, P.A.: DCAN: deep contour-aware networks for
accurate gland segmentation. In: 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 2487–2496. IEEE (2016)
9. Cutler, K.J., et al.: Omnipose: a high-precision morphology-independent solution
for bacterial cell segmentation. Nat. Methods 19(11), 1438–1448 (2022)
10. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-
scale hierarchical image database. In: 2009 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops (CVPR Workshops), Los
Alamitos, CA, USA, pp. 248–255. IEEE Computer Society (2009)
11. Dillencourt, M.B., Samet, H., Tamminen, M.: A general approach to connected-
component labeling for arbitrary image representations. J. ACM 39(2), 253–280
(1992)
12. Hampton, T.: Report reveals scope of US antibiotic resistance threat. JAMA
310(16), 1661–1663 (2013)
13. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN (2017)
14. Iman, M., Rasheed, K., Arabnia, H.R.: A review of deep transfer learning and
recent advancements (2022)
15. Jeckel, H., Drescher, K.: Advances and opportunities in image analysis of bacterial
cells and communities. FEMS Microbiol. Rev. 45(4), fuaa062 (2021)
16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Bengio, Y.,
LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR
2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings, p. 80.
ICLR, San Diego (2015)
17. Lee, G., Kim, S., Kim, J., Yun, S.Y.: MEDIAR: harmony of data-centric and model-
centric for multi-modality microscopy (2022)
18. Li, M., Chen, C., Liu, X., Huang, W., Zhang, Y., Xiong, Z.: Advanced deep net-
works for 3D mitochondria instance segmentation (2021)
19. Maqsood, M., et al.: Transfer learning assisted classification and detection of
Alzheimer’s disease stages using 3D MRI scans. Sensors 19, 2645 (2019)
20. Massey, A.J.: Multiparametric cell cycle analysis using the operetta high-content
imager and harmony software with phenologic. PLoS ONE 10(7), e0134306 (2015)
21. Mermillod, M., Bugaiska, A., Bonin, P.: The stability-plasticity dilemma: investi-
gating the continuum from catastrophic forgetting to age-limited learning effects.
Front. Psychol. 4, 504 (2013)
22. Murray, C.J., et al.: Global burden of bacterial antimicrobial resistance in 2019: a
systematic analysis. Lancet 399(10325), 629–655 (2022)
23. O’Neill, J.: Tackling Drug-Resistant Infections Globally: Final Report and Recom-
mendations. Review on Antimicrobial Resistance. Wellcome Trust and HM Gov-
ernment (2016)
24. World Health Organization, et al.: Antimicrobial resistance: global report on
surveillance. World Health Organization, Geneva (2014)
25. World Health Organization, et al.: Ten threats to global health in 2019 (2019)
26. World Health Organization, et al.: Antimicrobial resistance surveillance in Europe
2022–2020 data. WHO: World Health Organization, Copenhagen (2022)
27. Panigrahi, S., et al.: MiSiC, a general deep learning-based method for the high-
throughput cell segmentation of complex bacterial communities. Elife 10, e65151
(2021)
28. Quach, D., Sakoulas, G., Nizet, V., Pogliano, J., Pogliano, K.: Bacterial cytological
profiling (BCP) as a rapid and accurate antimicrobial susceptibility testing method
for Staphylococcus aureus. EBioMedicine 4, 95–103 (2016)
29. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
30. Spahn, C., et al.: DeepBacs for multi-task bacterial image analysis using open-
source deep learning approaches. Commun. Biol. 5(1), 688 (2022)
31. Sridhar, S., et al.: High-content imaging to phenotype antimicrobial effects on
individual bacteria at scale. Msystems 6(3), e00028-21 (2021)
32. Stringer, C., Wang, T., Michaelos, M., Pachitariu, M.: Cellpose: a generalist algo-
rithm for cellular segmentation. Nat. Methods 18(1), 100–106 (2021)
33. Stylianidou, S., Brennan, C., Nissen, S.B., Kuwada, N.J., Wiggins, P.A.: Super-
Segger: robust image segmentation, analysis and lineage tracking of bacterial cells.
Mol. Microbiol. 102(4), 690–700 (2016)
34. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural
networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th
International Conference on Machine Learning, Long Beach, California, USA. Pro-
ceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (2019)
35. Tuan-Anh, T., et al.: Pathogenic Escherichia coli possess elevated growth rates
under exposure to sub-inhibitory concentrations of azithromycin. Antibiotics 9(11),
735 (2020)
36. Wang, W., et al.: Learn to segment single cells with deep distance estimator and
deep cell detector. Comput. Biol. Med. 108, 133–141 (2019)
37. Wei, D., et al.: MitoEM dataset: large-scale 3D mitochondria instance segmentation
from EM images. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12265,
pp. 66–76. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59722-1_7
Multimodal Emotion Recognition System
Through Three Different Channels
(MER-3C)
1 Introduction
modality (face, speech, and text) as well as with their fusion (MER-3C). Finally,
the conclusions and perspectives on future work are stated in the last section.
2 Related Work
Recent research has shown that combining the classification judgments of dif-
ferent classifiers can produce better recognition outcomes. Classifier results in
pattern recognition have been combined effectively using the majority vote
method. Although majority voting [19] is by far the simplest way of implementing
the combination, research has shown that it is just as effective as more intri-
cate schemes. Another significant factor in pattern recognition applications is
achieving low error or substitution rates. When the judgments of n classifiers
are combined, a sample is assigned the class on which more than half of the
classifiers agree. Although each classifier individually may be right or wrong,
a combined decision is incorrect only when a majority of the votes are incorrect
and they all make the same error; by the nature of consensus, this reduces the
chance that most classifiers commit the same error [19].
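A minimal sketch of this majority-vote rule, rejecting samples on which no class reaches more than half of the votes:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier labels; accept a label only when more than
    half of the classifiers agree, otherwise return None (reject)."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None
```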
Our TER Sub-system. Our textual emotion recognition (TER) sub-system
predicts emotion through the phases shown in Fig. 3. The pre-processing steps
conducted for this study include “Case Folding”, which converts all letters in
the document to lowercase; “Data Cleaning”, which eliminates HTML tags,
e-mail addresses, special characters, and accents; “Tokenizing”, which breaks a
sentence down into smaller pieces that can more readily be given meaning; and
“Stopword Removal”, where stop words are a collection of frequently used words
(e.g., “the”, “a”, “an”). The pre-processed data is used as the input for our
model, which uses a Bi-LSTM architecture [24]. LSTMs can selectively retain
patterns for a long time; a memory cell, as the name suggests, enables this. Its
distinctive structure comprises four primary components: an input gate, a neuron
with a self-recurrent connection, a forget gate, and an output gate. The model
contains an embedding layer, a bidirectional layer, and a dense layer. The
embedding layer is an input layer that maps words/tokens to vectors of
input_dim dimensions. The bidirectional layer is an RNN-LSTM layer with an
LSTM output. The dense layer is an output layer whose number of nodes equals
the number of classes, with softmax as the activation function; softmax yields
the probability of the text belonging to each class. The TER model uses
categorical cross-entropy as its loss function and Adam as its optimizer.
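The four pre-processing steps can be sketched as follows; the stopword list shown is an illustrative subset (real systems typically use a full list, e.g. NLTK's), and the cleaning regexes are simplified assumptions.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "and"}  # illustrative subset

def preprocess(text):
    """Case folding -> data cleaning -> tokenizing -> stopword removal,
    mirroring the TER pre-processing pipeline."""
    text = text.lower()                               # case folding
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags
    text = re.sub(r"\S+@\S+", " ", text)              # strip e-mail addresses
    text = re.sub(r"[^a-z\s]", " ", text)             # drop special/accented characters
    tokens = text.split()                             # tokenizing
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal
```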
Our SER Sub-system. In our SER sub-system, we use three features as Mel-
Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and
Perceptual Linear Prediction (PLP). These features were separately classified
using five different classifiers such as Gaussian Process Classifier (GPC), Random
Forest Classifier (RFC), Convolutional Neural Network (CNN), Support Vector
Machine (SVM), and K-Nearest Neighbor (KNN).
After that, we use a majority vote technique to select the best machine
learning classifier among these options for each set of features. In theory, the
majority voting method is similar to the electoral system. In the latter, the
winner of an election is the candidate who obtains more than 50% of the vote.
In our instance, the classifier that has the highest accuracy is deemed to be
the best with regard to the relevant features. The architecture of our SER Sub-
system is presented in Fig. 4.
– FER-2013 [25] is used for face channel. It contains seven emotions mapped to
integers as follows: (0) anger, (1) disgust, (2) fear, (3) happiness, (4) sadness,
(5) surprise, and (6) neutral. Using this dataset is already a challenge
due to the class-imbalance problem, occlusions (mostly by hands), contrast
variations, and eyeglasses that hide part of the face.
– Tweet Emotions1 is used for the textual channel. This corpus consists of tweets,
each expressing one of thirteen emotions. In our experiments, however, we
focus on only four emotions: sadness, neutral, happiness, and anger.
– Ravdess [26] is used for the speech channel. It contains 1440 audio recordings
spoken by 24 trained actors with neutral North American accents, expressing
seven emotions. We chose this dataset first for its availability, second because
it includes mixed-gender speakers, and finally because RAVDESS is well known
and widely used for comparative analysis in the SER field.
However, we retrain our models on these datasets, since this time we predict
only four emotions: calm, happy, sad, and angry.
From Table 2, we can conclude that our decision fusion achieves a good
performance of 93%. This accuracy is 1% and 16% higher than those achieved
by our facial and textual emotion recognition baseline systems, respectively.
The confusion matrix illustrated in Fig. 6 shows that our proposed system,
named MER-3C, achieves good results in detecting emotions, although “calm”
is the most difficult.
Modality Accuracy
Face Channel [20] 92%
Text Channel [24] 91%
Speech Channel 97.31%
Our Work (MER-3C) 93%
We compare our MER-3C with some works that also predict the four emo-
tions (happy, sad, angry, and neutral) using the fusion of three modalities face,
text, and speech. Note that our emotion labeled “calm” corresponds to the
emotion labeled “neutral” in the other works. The comparison is presented in
Table 4 for further validation; compared to the other models, ours is the most
efficient.
1 https://www.kaggle.com/code/hanifkurniawan/nlp-with-tweet-emotions/data.
Declaration
Conflict of Interest. The authors declare that they have no conflict of interest.
Availability of Data and Materials. All data generated or analyzed during this
study are included in this published article.
References
1. Lisetti, C.L.: Affective computing. Pattern Anal. Appl. 1, 71–73 (1998)
2. Nikita, J., Vedika, G., Shubham, S., Agam, M., Ankit, C., Santosh, K.C.: Under-
standing cartoon emotion using integrated deep neural network on large dataset.
Neural Comput. Appl. (2021). https://doi.org/10.1007/s00521-021-06003-9
3. Andres, J., Semertzidis, N., Li, Z., Wang, Y., Floyd Mueller, F.: Integrated
exertion-understanding the design of human-computer integration in an exertion
context. ACM Trans. Comput.-Hum. Interact. 29(6), 1–28 (2023)
4. Fischer, F., Fleig, A., Klar, M., Müller, J.: Optimal feedback control for modeling
human-computer interaction. ACM Trans. Comput.-Hum. Interact. 29(6), 1–70
(2022)
5. Kosch, T., Welsch, R., Chuang, L., Schmidt, A.: The placebo effect of artificial
intelligence in human-computer interaction. ACM Trans. Comput.-Hum. Interact.
29, 1–32 (2022)
6. Glenn, A., LaCasse, P., Cox, B.: Emotion classification of Indonesian tweets using
bidirectional LSTM. Neural Comput. Appl. 35, 9567–9578 (2023). https://doi.org/
10.1007/s00521-022-08186-1
7. Tang, K., Tie, Y., Yang, T., Guan, L.: Multimodal emotion recognition (MER)
system. In: 2014 IEEE 27th Canadian Conference on Electrical and Computer
Engineering (CCECE), pp. 1–6. IEEE (2014)
8. Veni, S., Anand, R., Mohan, D., Paul, E.: Feature fusion in multimodal emotion
recognition system for enhancement of human-machine interaction. In: IOP Con-
ference Series: Materials Science and Engineering, vol. 1084, no. 1, p. 012004. IOP
Publishing (2021)
9. Luna-Jiménez, C., Griol, D., Callejas, Z., Kleinlein, R., Montero, J.M., Fernández-
Martínez, F.: Multimodal emotion recognition on RAVDESS dataset using transfer
learning. Sensors 21(22), 7665 (2021)
10. Aviezer, H., Trope, Y., Todorov, A.: Body cues, not facial expressions, discriminate
between intense positive and negative emotions. Science 338(6111), 1225–1229
(2012)
11. Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., Manocha, D.: M3ER: mul-
tiplicative multimodal emotion recognition using facial, textual, and speech cues.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 02,
pp. 1359–1367 (2020)
12. Tripathi, S., Tripathi, S., Beigi, H.: Multi-modal emotion recognition on IEMOCAP
with neural networks. arXiv (2018). arXiv preprint arXiv:1804.05788
13. Zhang, D., et al.: Multi-modal multi-label emotion recognition with heterogeneous
hierarchical message passing. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 35, no. 16, pp. 14338–14346 (2021)
14. Majumder, N., Hazarika, D., Gelbukh, A., Cambria, E., Poria, S.: Multimodal
sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based
Syst. 161, 124–133 (2018)
15. Lian, Z., Li, Y., Tao, J., Huang, J.: Investigation of multimodal features, classi-
fiers and fusion methods for emotion recognition. arXiv preprint arXiv:1809.06225
(2018)
16. Siriwardhana, S., Kaluarachchi, T., Billinghurst, M., Nanayakkara, S.: Multimodal
emotion recognition with transformer-based self supervised feature fusion. IEEE
Access 8, 176274–176285 (2020)
17. Heredia, J., et al.: Adaptive multimodal emotion detection architecture for social
robots. IEEE Access 10, 20727–20744 (2022)
18. Heredia, J., Cardinale, Y., Dongo, I., Díaz-Amado, J.: A multi-modal visual emo-
tion recognition method to instantiate an ontology. In: 16th International Confer-
ence on Software Technologies, pp. 453–464. SCITEPRESS-Science and Technology
Publications (2021)
19. Lam, L., Suen, C.Y.: A theoretical analysis of the application of majority voting to
pattern recognition. In: Proceedings of the 12th IAPR International Conference on
Pattern Recognition, vol. 3-Conference C: Signal Processing (Cat. No. 94CH3440-
5), vol. 2, pp. 418–420. IEEE (1994)
20. Khediri, N., Ben Ammar, M., Kherallah, M.: Deep learning based approach to
facial emotion recognition through convolutional neural network. In: International
Conference on Image Analysis and Recognition, ICIAR (2022)
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
22. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann
machines. In: Proceedings of the 27th International Conference on Machine Learn-
ing (ICML 2010), pp. 807–814 (2010)
23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
24. Khediri, N., BenAmmar, M., Kherallah, M.: A new deep learning fusion approach
for emotion recognition based on face and text. In: Nguyen, N.T., Manolopou-
los, Y., Chbeir, R., Kozierkiewicz, A., Trawiński, B. (eds.) ICCCI 2022. LNCS,
vol. 13501, pp. 75–81. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-
16014-1_7
25. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The
extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and
emotion-specified expression. In: 2010 IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition-Workshops, pp. 94–101. IEEE (2010)
26. Livingstone, S.R., Russo, F.A.: The Ryerson audio-visual database of emotional
speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expres-
sions in North American English. PLoS ONE 13(5), e0196391 (2018)
Multi-modal Obstacle Avoidance in USVs
via Anomaly Detection and Cascaded
Datasets
1 Introduction
Obstacle detection is crucial for autonomous vehicles. Unmanned robotic surface
vehicles (USVs) can use various sensors, like RGB cameras [37], sometimes in
stereo depth configuration [32], RADAR [35], LIDAR [23,41], and SONAR [20].
Cameras are appealing for their cost and superficial similarity to human per-
ception, but require substantial image processing, which falls in the domain of
computer vision.
The main drawback of cameras is their sensitivity to environmental variations.
Recently, however, data-driven algorithms and deep neural networks (DNNs)
have improved obstacle detection in marine environments [5,29,33], although
they require extensive annotations [46].
We propose semi-supervised learning for water obstacle detection, specifically
anomaly detection in a one-class learning setting [3,39,42]. This approach trains
on normal data and detects anomalies as non-conforming samples. Many (but
This work was financed by the Slovenian Research Agency (ARRS), research projects
[J2-2506] and [J2-2501 (A)] and research programs [P2-0095] and [P2-0250 (B)].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 209–221, 2023.
https://doi.org/10.1007/978-3-031-45382-3_18
Fig. 1. Left: Obstacle avoidance through anomaly detection has three learnable com-
ponents [(1), (2), (3)]. This approach allows separate learning using datasets without
marine environments (1), obstacles (2), and with minimal detailed obstacle annotations
(3). Right: Obstacle detection examples using the one-class anomaly detector CS-
Flow [39]: in the upper-right image it correctly detects a bridge, while in the
lower-right image it detects a floating dock and a riverbank house.
not all) USVs are expected to operate in limited geographical domains, allowing
one-class learning and anomaly detection.
Our contributions include: i) a novel strategy for water obstacle detection
with limited annotated data, ii) a new approach to training with limited marine
data, iii) evaluation of recent SOTA anomaly detection algorithms in realistic
USV scenarios, and iv) comparison of a fully-supervised SOTA segmentation
algorithm [6] with our approach.
2 Related Work
We review works using image data and computer vision for water obstacle detec-
tion, followed by literature on semi-supervised learning for anomaly detection,
which avoids laborious annotation.
4 Experiments
We based our research on LWIR and RGB image data acquired using our own
USV multi-sensor system [36] attached to a river boat; RGB and LWIR images
were recorded.1 The image acquisition took place on a stretch of the Ljubljanica
river2, which represents a predominantly natural (river, bush) environment
with some urban elements.
4.1 Dataset
Data was gathered in June 2021 and September 2021. In June, only RGB images
were taken under good weather conditions, while in September, both RGB and
LWIR images were collected under cloudy conditions.
Data was organized into three main datasets: LWIRSept, RGBJune, and
RGBSept. These were then further divided into five smaller, non-overlapping
subsets: LWIRSeptNoObs and RGBJuneNoObs, consisting of obstacle-free
images, and LWIRSeptObs, RGBJuneObs, and RGBSeptObs, containing images
with obstacles. Tuning datasets, LWIRSeptObs32 and RGBJuneObs32, were
created using selected images (see Subsect. 3.1). An overview of
the datasets is presented in Table 1. Images in some subsets don’t depict the
same geographical location; subsets with obstacles and without obstacles were
obtained on different river stretches.
1 Stereolabs ZED stereo camera (only the left frame) and Device A-lab SmartIR384L
thermal camera.
2 Data was sampled from a section between 46.0402°N, 14.5125°E and 46.0234°N,
14.5079°E.
4.2 Evaluation
Our evaluation protocol is based on several assumptions that are realistic for
USV environments and commonly used in such scenarios. We consider the per-
formance of the algorithm in the upper part of the image entirely irrelevant, as
this part of the image contains the sky and possibly the distant shore [27]. The
demarcation line between the upper and lower part of the image can be inferred
from the inertial sensor in the vehicle, which has previously been shown to help
with image segmentation in marine environments [6]. Due to the slow dynamics
of the vessel used for image acquisition (imperceptible pitch and roll), we
evaluate our algorithms using a fixed horizontal line, as shown in Fig. 2.
In the testing phase, only the part of the image below the water edge is used
for the quantitative evaluation of the trained model. Interesting structures, such
as bridges and riverbank houses, appearing above the water edge are evaluated
only qualitatively. Two such examples are presented in the Fig. 1. To exclude
Fig. 2. Left: Simplified water edge annotation, straight line. Right: precise polygon-
based annotation of the water edge. The former could be easily inferred from USV’s
inertial sensor, as proposed by [6]. Our method is evaluated using both kinds of anno-
tations.
4.3 Training
The modality adaptation stage for the RGB images was skipped in the pub-
licly available models of PaDiM and CS-Flow, since they both use a backbone
classification CNN, pre-trained on RGB images. The modality adaptation was
thus performed only for LWIR images. We trained the feature extraction CNN
from scratch, using an object detection task on the Teledyne FLIR Thermal
Dataset for Algorithm Training [44] as a proxy.
For the environment adaptation stage, PaDiM was trained on the RGB-
JuneNoObs and LWIRSeptNoObs subsets to adapt it to the target river envi-
ronment in the RGB and LWIR camera modalities, respectively, resulting in
two different models, one per imaging modality. CS-Flow was trained on
RGBJuneNoObs only, to adapt it to the target river environment in RGB
images.
Finally, in the tuning stage, thresholds T were obtained using the MODS evalu-
ation scheme [8] on RGBJuneObs32 for the RGB models, and on LWIRSeptObs32
for the PaDiM model adapted to LWIR images. Final threshold values were
selected based on the highest F1 score.
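The tuning step, selecting the threshold that maximizes F1 over candidate values, can be sketched as follows. Function names are mine and the counts are computed naively from per-sample scores; the actual MODS evaluation scheme [8] derives TP/FP/FN from its own detection-matching rules.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); zero when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_threshold(scores, labels, candidates):
    """Return (threshold, F1) for the candidate threshold with the highest
    F1, treating score >= threshold as an obstacle detection."""
    best = None
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = f1_score(tp, fp, fn)
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best
```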
5 Results
Each evaluation was performed twice: first using the straight line as the annota-
tion of the water boundary, and then using a more accurate polygon to delimit
the water area in which the evaluation is performed, as shown in Fig. 2. Both
RGB models, PaDiM and CS-Flow, were tested on the RGBJuneObs subset,
while the LWIR PaDiM was tested on LWIRSeptObs. The obtained results are
reported in Tables 2 and 3.
Table 3. PaDiM results on water edge annotated with fixed horizontal line.
Table 5. CS-Flow results on water edge annotated with fixed horizontal line.
Fig. 3. Comparison of obstacle detection using anomaly detection by PaDiM (left) and
WaSR (right) in the same river scene. PaDiM performs better in detecting obstacles
not fully surrounded by water, like the paddleman. WaSR’s misclassification can cause
issues for tracking or motion prediction.
From the results it can be concluded that WaSR still outperforms both
anomaly detection methods in terms of the chosen evaluation metrics. Comparing
the two anomaly detection methods to one another, CS-Flow [39]
outperforms PaDiM [13], where the distribution of data features is modelled
References
1. Akcay, S., Atapour-Abarghouei, A., Breckon, T.P.: GANomaly: semi-supervised
anomaly detection via adversarial training. In: Jawahar, C.V., Li, H., Mori, G.,
Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 622–637. Springer, Cham
(2019). https://doi.org/10.1007/978-3-030-20893-6_39
2. Bergman, L., Hoshen, Y.: Classification-based anomaly detection for general data.
In: International Conference on Learning Representations (ICLR) (2020)
3. Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: MVTec AD – a comprehensive
real-world dataset for unsupervised anomaly detection. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 9584–9592 (2019)
4. Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: Uninformed students: student-
teacher anomaly detection with discriminative latent embeddings. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), pp. 4183–4192 (2020)
5. Bovcon, B., Kristan, M.: A water-obstacle separation and refinement network for
unmanned surface vehicles. In: IEEE International Conference on Robotics and
Automation (ICRA) (2020)
6. Bovcon, B., Kristan, M.: WaSR—a water segmentation and refinement maritime
obstacle detection network. IEEE Trans. Cybern. 52, 12661–12674 (2021)
7. Bovcon, B., Muhovic, J., Perš, J., Kristan, M.: The MaSTr1325 dataset for train-
ing deep USV obstacle detection models. In: IEEE International Conference on
Intelligent Robots and Systems (IROS), pp. 3431–3438 (2019)
8. Bovcon, B., Muhovič, J., Vranac, D., Mozetič, D., Perš, J., Kristan, M.: MODS-A
USV-oriented object detection and obstacle segmentation benchmark. IEEE Trans.
Intell. Transp. Syst. 23, 13403–13418 (2021)
9. Caesar, H., et al.: NuScenes: a multimodal dataset for autonomous driving. In:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
11621–11631 (2020)
10. Cane, T., Ferryman, J.: Evaluating deep semantic segmentation networks for object
detection in maritime surveillance. In: IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS), pp. 1–6 (2018)
11. Cohen, N., Hoshen, Y.: Sub-image anomaly detection with deep pyramid corre-
spondences. arXiv preprint arXiv:2005.02357 (2020)
12. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
3213–3223 (2016)
13. Defard, T., Setkov, A., Loesch, A., Audigier, R.: PaDiM: a patch distribution mod-
eling framework for anomaly detection and localization. In: International Confer-
ence on Pattern Recognition (ICPR), vol. 12664, pp. 475–489 (2020)
14. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-
scale hierarchical image database. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 248–255 (2009)
15. Fei, Y., Huang, C., Jinkun, C., Li, M., Zhang, Y., Lu, C.: Attribute restoration
framework for anomaly detection. IEEE Trans. Multimed. 24, 116–127 (2021)
16. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The
KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 3354–3361 (2012)
17. González, A., et al.: Pedestrian detection at day/night time with visible and FIR
cameras: a comparison. Sensors 16(6), 820 (2016)
18. Gudovskiy, D., Ishizaka, S., Kozuka, K.: CFLOW-AD: real-time unsupervised
anomaly detection with localization via conditional normalizing flows. In: IEEE
Winter Conference on Applications of Computer Vision (WACV), pp. 1819–1828
(2022)
19. Haselmann, M., Gruber, D.P., Tabatabai, P.: Anomaly detection using deep learn-
ing based image completion. In: International Conference on Machine Learning
and Applications (ICMLA) (2018)
20. Heidarsson, H.K., Sukhatme, G.S.: Obstacle detection and avoidance for an
autonomous surface vehicle using a profiling sonar. In: IEEE International Confer-
ence on Robotics and Automation (ICRA), pp. 731–736 (2011)
21. Hendrycks, D., Mazeika, M., Kadavath, S., Song, D.: Using self-supervised learning
can improve model robustness and uncertainty. In: Advances in Neural Information
Processing Systems (NIPS) (2019)
22. Hermann, D., Galeazzi, R., Andersen, J., Blanke, M.: Smart sensor based obsta-
cle detection for high-speed unmanned surface vehicle. In: IFAC Conference on
Manoeuvring and Control of Marine Craft (MCMC), vol. 48, pp. 190–197 (2015)
23. Jeong, M., Li, A.Q.: Efficient LiDAR-based in-water obstacle detection and seg-
mentation by autonomous surface vehicles in aquatic environments. In: IEEE Inter-
national Conference on Intelligent Robots and Systems (IROS), pp. 5387–5394
(2021)
24. Jia, X., Zhu, C., Li, M., Tang, W., Zhou, W.: LLVIP: a visible-infrared paired
dataset for low-light vision. In: IEEE International Conference on Computer Vision
(ICCV), pp. 3496–3504 (2021)
25. Kiefer, B., et al.: 1st workshop on maritime computer vision (MaCVi) 2023: chal-
lenge results. In: IEEE Winter Conference on Applications of Computer Vision
Workshops (WACVW), pp. 265–302 (2023)
26. Kim, H., et al.: Vision-based real-time obstacle segmentation algorithm for
autonomous surface vehicle. IEEE Access 7, 179420–179428 (2019)
27. Kristan, M., Kenk, V.S., Kovačič, S., Perš, J.: Fast image-based obstacle detection
from unmanned surface vehicles. IEEE Trans. Cybern. 46(3), 641–654 (2016)
28. Lee, S.J., Roh, M.I., Lee, H.W., Ha, J.S., Woo, I.G.: Image-based ship detection and
classification for unmanned surface vehicle using real-time object detection neu-
ral networks. In: International Ocean and Polar Engineering Conference (ISOPE)
(2018)
29. Ma, L., Xie, W., Huang, H.: Convolutional neural network based obstacle detection
for unmanned surface vehicle. Math. Biosci. Eng. 17, 845–861 (2020)
30. Mai, K.T., Davies, T., Griffin, L.D.: Brittle features may help anomaly detection.
In: Women in Computer Vision Workshop at CVPR (2021)
31. Moosbauer, S., Konig, D., Jakel, J., Teutsch, M.: A benchmark for deep learning
based object detection in maritime environments. In: IEEE Conference on Com-
puter Vision and Pattern Recognition Workshops (CVPRW) (2019)
32. Muhovič, J., Mandeljc, R., Bovcon, B., Kristan, M., Perš, J.: Obstacle tracking
for unmanned surface vessels using 3-D point cloud. IEEE J. Oceanic Eng. 45,
786–798 (2019)
33. Nirgudkar, S., Robinette, P.: Beyond visible light: usage of long wave infrared
for object detection in maritime environment. In: International Conference on
Advanced Robotics (ICAR), pp. 1093–1100 (2021)
34. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving
jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016.
LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-
3-319-46466-4_5
35. Onunka, C., Bright, G.: Autonomous marine craft navigation: on the study of radar
obstacle detection. In: International Conference on Control Automation Robotics
& Vision (ICARCV), pp. 567–572 (2010)
36. Perš, J., et al.: Modular multi-sensor system for unmanned surface vehicles. In:
International Electrotechnical and Computer Science Conference (ERK) (2021)
37. Qiao, D., Liu, G., Li, W., Lyu, T., Zhang, J.: Automated full scene parsing for
marine ASVs using monocular vision. J. Intell. Robot. Syst. 104(2) (2022)
38. Rippel, O., Mertens, P., Merhof, D.: Modeling the distribution of normal data in
pre-trained deep features for anomaly detection. In: International Conference on
Pattern Recognition (ICPR) (2020)
39. Rudolph, M., Wehrbein, T., Rosenhahn, B., Wandt, B.: Fully convolutional cross-
scale-flows for image-based defect detection. In: IEEE Winter Conference on Appli-
cations of Computer Vision (WACV), pp. 1829–1838 (2022)
40. Ruff, L., et al.: Deep one-class classification. In: International Conference on
Machine Learning (ICML), pp. 4393–4402 (2018)
41. Ruiz, A.R.J., Granja, F.S.: A short-range ship navigation system based on ladar
imaging and target tracking for improved safety and efficiency. IEEE Trans. Intell.
Transp. Syst. 10(1), 186–197 (2009)
42. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: f-
AnoGAN: fast unsupervised anomaly detection with generative adversarial net-
works. Med. Image Anal. 54, 30–44 (2019)
43. Steccanella, L., Bloisi, D., Castellini, A., Farinelli, A.: Waterline and obstacle detec-
tion in images from low-cost autonomous boats for environmental monitoring.
Robot. Auton. Syst. 124, 103346 (2020)
44. Teledyne: Teledyne FLIR thermal dataset for algorithm training. https://www.flir.
eu/oem/adas/adas-dataset-form/
45. Tsai, C.C., Wu, T.H., Lai, S.H.: Multi-scale patch-based representation learning
for image anomaly detection and segmentation. In: IEEE Winter Conference on
Applications of Computer Vision (WACV), pp. 3992–4000 (2022)
46. Žust, L., Kristan, M.: Learning maritime obstacle detection from weak annotations
by scaffolding. In: IEEE Winter Conference on Applications of Computer Vision
(WACV), pp. 955–964 (2022)
47. Wang, G., Han, S., Ding, E., Huang, D.: Student-teacher feature pyramid match-
ing for unsupervised anomaly detection. In: British Machine Vision Conference
(BMVC) (2021)
48. Wang, M., Wang, Q., Hong, D., Roy, S.K., Chanussot, J.: Learning tensor low-rank
representation for hyperspectral anomaly detection. IEEE Trans. Cybern. 53(1),
679–691 (2022)
49. Wang, W., Gheneti, B., Mateos, L.A., Duarte, F., Ratti, C., Rus, D.: Roboat: an
autonomous surface vehicle for urban waterways. In: IEEE International Confer-
ence on Intelligent Robots and Systems (IROS), pp. 6340–6347 (2019)
50. Waymo: An autonomous driving dataset (2019)
51. Yang, J., Li, Y., Zhang, Q., Ren, Y.: Surface vehicle detection and tracking with
deep learning and appearance feature. In: International Conference on Control,
Automation and Robotics (ICCAR), pp. 276–280 (2019)
52. Yao, L., Kanoulas, D., Ji, Z., Liu, Y.: ShorelineNet: an efficient deep learning
approach for shoreline semantic segmentation for unmanned surface vehicles. In:
IEEE International Conference on Intelligent Robots and Systems (IROS), pp.
5403–5409 (2021)
53. Yu, J., et al.: FastFlow: unsupervised anomaly detection and localization via 2D
normalizing flows. arXiv preprint arXiv:2111.07677 (2021)
54. Zavrtanik, V., Kristan, M., Skočaj, D.: Reconstruction by inpainting for visual
anomaly detection. Pattern Recogn. 112, 1–9 (2021)
55. Zhan, W., et al.: Autonomous visual perception for unmanned surface vehicle nav-
igation in an unknown environment. Sensors 19(10) (2019)
A Contrario Mosaic Analysis for Image
Forensics
Quentin Bammey(B)
1 Introduction
The once-reliable status of photographic images as evidence is now uncertain,
owing to the proliferation of digital photography and the development of sophisti-
cated photo editing tools. Although image modifications are frequently intended
to enhance an image’s aesthetic appeal, they can alter the meaning of the image.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 222–234, 2023.
https://doi.org/10.1007/978-3-031-45382-3_19
A Contrario Mosaic Analysis for Image Forensics 223
Fig. 1. Mimic automatically detects forgeries based on the analysis of the underlying
mosaic of an image. An a contrario detection automatically filters the grid estimates
in search for significant inconsistencies, while keeping false positives under control.
However, images contain traces and artefacts left by the various operations
of the image signal processing (ISP) pipeline, from the camera sensors to the
compressed version of the image. Those traces act as a signature of the image,
as modifications made to an image will alter the original traces. The resulting
inconsistencies can thus be detected to show that the image has been forged.
One such trace that can be analysed is the image mosaic.
Most cameras do not capture colours directly; instead, a colour filter array
(CFA) is used to sample each pixel's value in a single colour. By applying filters of
different colours to adjacent sensors, the pixels are sampled in different colours.
The missing colours are interpolated with a demosaicing algorithm to produce a
true colour image. We focus on the Bayer CFA, used in nearly all commercial
cameras. This matrix samples half of the pixels in green, a quarter in red, and a
quarter in blue, in a quincunx pattern. Depending on the offset, the image can be
sampled in one of four patterns: RGGB, GRBG, GBRG, or BGGR. These patterns
are phases of the 2-periodic CFA, offset by 0 or 1 in each direction. As demosaicing
involves the reconstruction of missing data, no demosaicing method can be
considered perfect, and each method introduces artefacts to some degree. As a
result, these artefacts can reveal the image mosaic.
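The four Bayer phases can be written down explicitly. A small sketch (naming is mine) that maps a pixel position to the channel it samples; shifting the image by one pixel switches to a neighbouring phase, which is exactly the dephasing that copy-move forgeries can introduce.

```python
# The four phases of the 2-periodic Bayer CFA: the colour channel sampled
# at each position of the repeating 2x2 tile.
BAYER_PHASES = {
    "RGGB": (("R", "G"), ("G", "B")),
    "GRBG": (("G", "R"), ("B", "G")),
    "GBRG": (("G", "B"), ("R", "G")),
    "BGGR": (("B", "G"), ("G", "R")),
}

def sampled_channel(pattern: str, row: int, col: int) -> str:
    """Colour channel sampled at pixel (row, col) for a given Bayer phase."""
    tile = BAYER_PHASES[pattern]
    return tile[row % 2][col % 2]
```

For example, shifting an RGGB-sampled image by one column makes it look GRBG-sampled, since each phase is an offset of the same 2-periodic lattice.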
When an image is forged, the underlying mosaic of the image is altered as
well. Copy-move forgeries, for instance, will displace the mosaic and might induce
dephasing. Common operations in photo editing software, such as cloning and
healing, often consist of multiple small copy-moves from smooth regions; the
underlying mosaic of the resulting image will thus feature many small blobs of
the original mosaic. Splicing from JPEG-compressed or resampled sources will
make the underlying mosaic harder to detect, and might even alter its periodicity.
Overall, revealing the underlying mosaic of an image provides important clues
to the presence of image forgeries.
Of course, this mosaic is not explicitly known. Revealing the sampling colour
of each pixel is a difficult task, as the mosaic traces are hidden deep in the
highest frequencies of an image. The slightest JPEG compression can dampen
or even erase these traces [19], making them even more difficult to uncover.
224 Q. Bammey
The recent advent of positional learning [5,6], coupled with internal fine-
tuning, has enabled the analysis of demosaicing traces in an image, even after
slight compression. However, even these methods remain locally inaccurate;
reliable information on the mosaic can only be obtained by aggregating the method's
output over a larger scale. Simply revealing the estimated mosaic is consequently
no longer enough to detect forgeries. To provide reliable detections, a method
must be able to analyse its own estimation so as to distinguish true mosaic
inconsistencies from regions where the analysis is not accurate enough.
A contrario detection theory [13,14] provides a way to perform such an anal-
ysis. Based on the non-accidentalness principle, this theory proposes to detect
data based on their unlikelihood under a background hypothesis, by threshold-
ing the results based on a tolerated limit on the number of false alarms (NFA)
under the hypothesis. This paradigm has seen successful applications in varied
detection tasks [1,20–23,27–29], including forensics [3,7,8,16,18,28,31].
In this article, we propose mimic: Mosaic Integrity Monitoring for Image a
Contrario forensics, an a contrario method to analyse a mosaic estimate and reli-
ably detect image forgeries. Taking as its only input the pixelwise mosaic estimate
from an existing demosaicing analysis algorithm, the proposed method detects
regions in which the estimate is significantly incoherent due to a shifted or even
locally erased mosaic. Mimic beats the forgery detection state of the art on
high-quality images, even under slight compression.
2 Related Works
Demosaicing Analysis. Early work focused on linear estimation or on filters to
detect inconsistencies [6,7,33]. Popescu and Farid jointly estimated a linear
model and detected sampled pixels [33]. Ferrara looked for the local absence
of demosaicing traces by comparing the variance between sampled and inter-
polated pixels [17]. Kirchner and Milani identified the sampling pattern by per-
forming demosaicing with multiple algorithms [24,30]. Choi compared the counts
of intermediate values in each lattice to estimate the correct pattern [4,10].
A contrario analysis is more than twenty years old [14], and has recently seen
use in image forensics, primarily to attempt to detect inconsistencies in the
JPEG [31] or demosaicing patterns [7,8] while controlling the risk of false pos-
itives. Blocks are made to vote for the most likely pattern; the algorithm then
looks for regions where one of the votes is significant enough to reject the back-
ground hypothesis of an absence of demosaicing or JPEG compression. If two
different patterns are significantly detected in different places, the methods
conclude that the image is forged. This paradigm, however, is unable to detect regions
without demosaicing traces, as is often possible in forged images, and also fails
to reliably detect many forgeries even when they are visible in the vote maps.
Positional Learning. In AdaCFA [5], it is noted that due to translation invariance,
convolutional neural networks cannot inherently detect the position of pixels, or
information thereon. However, they can do so if the input image itself contains
cues on the position of each pixel, as they will then learn to infer the positional
information from these cues. In the case of demosaicing, the sampling mosaic of
an image is 2-periodic, and demosaicing artefacts thus feature a strong 2-periodic
component. If a CNN is trained on demosaiced images to detect information such
as the modulo (2, 2) position (horizontally and vertically) of each pixel, it will
naturally rely on the demosaicing artefacts, using them as clues to the position
of the pixel. Of course, the actual position of the pixel is already known; there
is thus no need to actually detect it. What matters is that the network mimics
the underlying mosaic of an image. Being trained on authentic images with
intact mosaics, the network indeed detects the correct positions of pixels in the
training set. When used on a real image, however, the correctness of the output
depends on the integrity of the image mosaic. If an image is authentic with an
intact mosaic, the network should correctly detect the position of pixels. More
interestingly, if the tested image is locally forged, its mosaic will likely be locally
altered as well. As a consequence, the network will yield incorrect outputs. If the
mosaic has simply been shifted, for instance due to an internal copy-move, the
model output will likewise be locally shifted on the forged area. If the mosaic is
locally destroyed, for instance due to blurring or the insertion of a compressed
and/or resampled object, whose mosaic is no longer visible, the network will
simply render noise-like output instead of the actual position. In both cases,
the fact that the network yields locally erroneous results is the very proof of a
local forgery. This positional learning was introduced with AdaCFA [5] and refined
with 4Point [6]. Combined with internal retraining on the tested image itself, the
internal mosaic of an image can be revealed even on slightly compressed images.
in the rest of the image. Indeed, while the mosaic may be harder to detect on
some images due to their processing, this difficulty should not naturally vary
within an image; as such, a locally erroneous estimate is a sign of forgery.
In each 2×2 block of the image, we have derived an estimation of the diagonal
tile and full pattern of the underlying mosaic. These estimations can be compared
to the global estimations to look for forgeries. So as to avoid false positives that
are solely due to misinterpretation of the detected mosaic map, we propose an
a contrario framework to automatically detect significantly deviant regions.
As proposed in [7], we could focus on regions that present a significant grid
that is different from the grid of the global image. Yet this would not enable us
to detect areas with multiple small patches of different grids (as is frequently the
case in inpainted images), nor would it reveal the localised absence of demosaicing.
Instead, we detect regions where the detection is significantly erroneous, i.e.
where the network makes more mistakes than in the rest of the image. We apply
the method separately to the detected diagonals and patterns. Let Ed (resp. Ep)
be a binary map which equals 1 for each block whose detected diagonal (resp.
pattern) differs from Dg (resp. Pg). This is a map of the “wrong” blocks.
The computation of these maps is described in Algorithm 1.
For the rest of the subsection, E represents either Ed or Ep . The empirical
probability of any block on the image being wrong is denoted by p0 , and is
computed as the mean of E. We want to find regions in which the error density
is significantly higher than p0 .
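The construction of the binary error map E and of the empirical error probability p0 can be sketched as follows; the function name and the plain-list representation are mine.

```python
def error_map(detected, global_value):
    """Build the binary map E marking 2x2 blocks whose detected
    diagonal/pattern differs from the global estimate, and compute
    p0, the empirical probability of a block being wrong (mean of E)."""
    E = [[1 if v != global_value else 0 for v in row] for row in detected]
    total = sum(len(row) for row in E)
    p0 = sum(sum(row) for row in E) / total
    return E, p0
```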
Let us assume that, in a given rectangle, k out of the n blocks contained in the
rectangle are incorrect. Under the background hypothesis that the probability
of error is p0 , and assuming that the blocks are independent, the probability of
having at least k wrong blocks in the area is the survival function of the binomial
distribution, Binomsf(k, n, p0). Yet a first obstacle to this simple strategy arises:
the grid values of different blocks are not independent, since the neural network
uses inputs that overlap between neighbouring blocks. To approximate independence,
we simulate down-sampling and divide k and n by d², where d is the distance
between two independent outputs. We set d to 17, the radius of the CNN. To
account for the fact that the integers of the binomial are then replaced by
floating-point values, we use the Beta distribution to interpolate the binomial.
The probability of having at least k wrong blocks in this area is thus evaluated as
p_{k,n,p0} = I_{p0}(k/d² + 1, (n − k)/d²),
where I_x(a, b) denotes the regularized incomplete beta function.
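The interpolated survival probability can be sketched numerically as follows. Function names are mine, and the quadrature-based beta evaluation stands in for a library routine; for d = 1 and integer k, the formula reduces to the binomial survival function P(X > k), which the usage below relies on.

```python
import math

def regularized_incomplete_beta(a: float, b: float, x: float,
                                steps: int = 200_000) -> float:
    """I_x(a, b): regularized incomplete beta function, evaluated by
    midpoint quadrature of the Beta(a, b) density over [0, x]. Adequate
    here since a >= 1 and x < 1, so the integrand has no singularity
    on the integration interval."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(log_norm
                          + (a - 1.0) * math.log(t)
                          + (b - 1.0) * math.log1p(-t))
    return total * h

def block_error_probability(k: float, n: float, p0: float, d: int = 17) -> float:
    """p_{k,n,p0} = I_{p0}(k/d^2 + 1, (n - k)/d^2): interpolated probability
    of observing at least k erroneous blocks out of n under error rate p0,
    with counts divided by d^2 to account for correlated neighbouring
    outputs of the CNN."""
    return regularized_incomplete_beta(k / d**2 + 1.0, (n - k) / d**2, p0)
```

As a sanity check, with d = 1 the value matches the exact binomial tail sum, via the identity P(Bin(n, p) > k) = I_p(k + 1, n − k).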
Fig. 2. Visual results on Korus forged images. In most cases, mimic detects the forg-
eries, even with inaccurate grid estimates. Although it uses the same grid estimate as
4Point, its detections are much more precise, in addition to being automatic. Its main
caveat is that it misses some detections when the forged regions are too thin or
diagonal, such as in the second and last columns.
that of the most significant rectangle. However, the number of rectangles scales
quadratically with the number of pixels in an image. Hence, checking all possible
rectangles is not possible. Even if a forgery is detected, some rectangles bigger
than the forgery itself may still be significant, and the detection will therefore
be too large; conversely, if part of a forgery is detected, we should detect nearby
parts of the same forgery as well, even if they are not as significant as the
detected part. As a consequence, we propose to first detect and separate all
potential forgeries, and then to decide on their significance, so as to improve the
localization of the forgeries. The method used is described in Algorithm 3.
Still working separately on the diagonal and full patterns, we use the map E of 2 × 2
blocks whose diagonal/pattern is erroneous. We apply a morphological closing
to this map with a disk of size 2 to connect inconsistent blocks, and segment the
resulting map into connected components. Components where the global pattern
(respectively diagonal) represents more than 25% (respectively 50%) of the blocks
are immediately rejected and not tested for forgeries.
Each of the remaining components is tested to determine whether it is a
forgery. On each component, we test all the rectangles contained within the
bounding box of the component, with a step of 16 pixels. The selected stride
represents a compromise between precision and computation time, as a lower
stride means more rectangles need to be checked.
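The strided rectangle enumeration can be sketched as follows (illustrative naming; rectangle corners are snapped to a regular grid inside the component's bounding box, which keeps the otherwise quadratic number of candidates tractable).

```python
def candidate_rectangles(x0: int, y0: int, x1: int, y1: int, stride: int = 16):
    """Enumerate axis-aligned rectangles (xa, ya, xb, yb) whose corners lie
    on a regular grid of the given stride inside a bounding box."""
    xs = list(range(x0, x1 + 1, stride))
    ys = list(range(y0, y1 + 1, stride))
    return [
        (xa, ya, xb, yb)
        for i, xa in enumerate(xs) for xb in xs[i + 1:]
        for j, ya in enumerate(ys) for yb in ys[j + 1:]
    ]
```

Halving the stride roughly quadruples the number of grid lines per axis and thus greatly increases the number of rectangles to test, which is the precision/computation trade-off mentioned above.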
Finally, we keep the NFA of the most significant rectangle. We set the score
of the whole component to this NFA, thus solving the final two issues raised
above: only blocks that belong to the component are given this NFA, and blocks
outside a significant rectangle but still inside the component are kept. Forgery
detection is performed separately on the full pattern and on the diagonals, and
the detected forgeries are then merged. The final NFA map is the pointwise
minimum of the pattern and diagonal NFA maps.
The NFA of a region is an upper bound on the expected number of regions
that would be falsely detected as forged under the background hypothesis. We
set the threshold to ε = 10⁻³, and the final, binary map keeps pixels whose NFA
is below this threshold. Under the background hypothesis, the expected false
detection rate is therefore below one per 1000 images. Of course, this rate
only concerns the risk of false positives due to a misinterpretation of the
estimated mosaic; it provides no further guarantee against significant errors
within the estimated mosaic itself, which could be due to the image structure,
such as the presence of textured areas, or to post-processing such as resampling,
which would modify the mosaic traces. Still, this enables us to filter out regions
which are only marginally less precisely detected than the rest of the image. The
proposed a contrario method thus only selects regions which are significantly more
erroneous than the rest of the image, regardless of the reason, and provides a
mathematically rigorous way to automatically interpret the estimated mosaic
and yield an automatic detection.
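The final merging and thresholding step can be sketched as follows (a NumPy-based sketch with my own naming): the pattern and diagonal NFA maps are combined by pointwise minimum, and only pixels below the tolerated number of false alarms survive.

```python
import numpy as np

def forgery_mask(nfa_pattern: np.ndarray, nfa_diagonal: np.ndarray,
                 epsilon: float = 1e-3) -> np.ndarray:
    """Merge the pattern and diagonal NFA maps by pointwise minimum, and
    keep the pixels whose NFA falls below the tolerated number of false
    alarms epsilon."""
    nfa = np.minimum(nfa_pattern, nfa_diagonal)
    return nfa < epsilon
```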
4 Results
We test mimic on the Trace CFA Grid [9] and on the Korus [25,26] datasets.
The Trace CFA Grid dataset, on which results are shown in Table 1 contains
1000 forged images that can only be detected by their demosaicing traces, thus
enabling a comparison between demosaicing analysis methods and evaluation
of the sensitivity to demosaicing artefacts of more generic methods. The Korus
dataset, on which results are shown in Table 2 and visually in Fig. 2, features
220 forged images from four cameras.
Results are presented using the Matthews correlation coefficient (MCC), a met-
ric ranging from 1 (perfect detection) to −1 (exactly inverted detection). Any
input-independent method has an expected MCC of zero. The MCC can only be
measured on binary detections, which only mimic produces natively. For the other
methods, we binarize the output at the threshold that maximizes the score,
giving those methods a slight advantage over a real-world scenario, where
adjusting the threshold to the data would not be possible.
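The MCC can be computed directly from a binary confusion matrix; a minimal sketch:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient: 1 for a perfect detection, -1 for
    an exactly inverted one, and 0 in expectation for any predictor that
    ignores its input. Returns 0 when a marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```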
Experiments on the Trace dataset show that mimic beats the state of the
art even when the images are JPEG-compressed at quality levels 95 and 90,
while generic methods are shown to be insensitive to demosaicing artefacts and
Table 1. Results on the CFA Grid exomask (Grid) dataset of the Trace database,
on uncompressed images and after compression with quality factors 95 and 90. The
methods are grouped: after the proposed method mimic come methods based on
demosaicing analysis, then more generic methods that do not specifically target
demosaicing artefacts. Our analysis of the mosaic improves on the results of 4Point, especially
on stronger compression (Q = 90). As already established in the literature [5, 6, 8, 9],
generic methods that do not specifically target demosaicing artefacts are entirely blind
to shifts in the mosaic.
Table 2. Results on the Korus dataset of forged images. No demosaicing artefacts are
found in the Canon 60D images by any of the demosaicing-based methods; we can
thus safely conclude that these images do not feature demosaicing artefacts. On all the other images, as
well as overall, the proposed method significantly improves over the existing state of
the art on this dataset.
Method Overall Canon 60D Nikon D7000 Nikon D90 Sony α57
Proposed 0.472 0.000 0.595 0.630 0.662
4Point [6] 0.353 0.000 0.401 0.378 0.624
AdaCFA [5] 0.167 0.002 0.049 0.044 0.574
Shin [34] 0.143 0.021 0.003 0.012 0.511
Choi [4, 10] 0.238 0.004 0.176 0.251 0.251
Ferrara [17, 36] 0.321 −0.016 0.498 0.461 0.339
Dirik [15, 36] 0.153 0.036 0.241 0.275 0.062
Park [32] 0.338 0.018 0.540 0.491 0.302
Noiseprint [12] 0.202 0.153 0.322 0.236 0.148
Splicebuster [11] 0.238 0.153 0.329 0.222 0.155
ManTraNet [2, 35] 0.169 0.121 0.229 0.193 0.143
inconsistencies. On the Korus dataset, one quarter of the images do not feature
traces of demosaicing, as validated by all tested methods. The proposed method
is thus unable to detect forgeries on this part of the dataset. Despite that, mimic
still yields the best overall score on this dataset. We further note that on the
images without traces of demosaicing, mimic does not make any false detection.
5 Conclusion
In this paper, we proposed mimic (Mosaic Integrity Monitoring for Image a Con-
trario forensics), an a contrario method that builds on the demosaicing analysis
network 4Point [6], refining its analysis to automatically detect image forgeries.
Mimic identifies forgeries as regions where the network output is significantly
more erroneous than in the rest of the image. An a contrario framework helps
limit the risk of false positives.
Demosaicing artefacts are frail and subtle, yet they can provide highly
valuable information to detect forgeries. On high-quality images, their analysis
alone yields better forgery detection results than any other state-of-the-art
method. These results are furthermore fully complementary to non-demosaicing-
specific methods, which are not sensitive to demosaicing artefacts.
Acknowledgment. This work has received funding by the European Union under
the Horizon Europe vera.ai project, Grant Agreement number 101070093, and by the
ANR under the APATE project, grant number ANR-22-CE39-0016. Centre Borelli is
also a member of Université Paris Cité, SSA and INSERM. I would also like to thank
Jean-Michel Morel and Rafael Grompone von Gioi for their insightful advice regarding
this work.
References
1. Aguerrebere, C., Sprechmann, P., Muse, P., Ferrando, R.: A-contrario localization
of epileptogenic zones in SPECT images. In: 2009 IEEE International Symposium
on Biomedical Imaging: From Nano to Macro (2009)
2. Bammey, Q.: Analysis and experimentation on the ManTraNet image forgery
detector. Image Processing On Line 12 (2022)
3. Bammey, Q.: Jade OWL: JPEG 2000 forensics by wavelet offset consistency anal-
ysis. In: 8th International Conference on Image, Vision and Computing (ICIVC).
IEEE (2023)
4. Bammey, Q., Grompone von Gioi, R., Morel, J.M.: Image forgeries detection
through mosaic analysis: the intermediate values algorithm. IPOL (2021)
5. Bammey, Q., von Gioi, R.G., Morel, J.M.: An adaptive neural network for unsu-
pervised mosaic consistency analysis in image forensics. In: CVPR (2020)
6. Bammey, Q., von Gioi, R.G., Morel, J.M.: Forgery detection by internal positional
learning of demosaicing traces. In: WACV (2022)
7. Bammey, Q., von Gioi, R.G., Morel, J.M.: Reliable demosaicing detection for image
forensics. In: 2019 27th European Signal Processing Conference (EUSIPCO), pp.
1–5 (2019). https://doi.org/10.23919/EUSIPCO.2019.8903152
A Contrario Mosaic Analysis for Image Forensics 233
8. Bammey, Q., Grompone Von Gioi, R., Morel, J.M.: Demosaicing to detect demo-
saicing and image forgeries. In: 2022 IEEE International Workshop on Informa-
tion Forensics and Security (WIFS), pp. 1–6 (2022). https://doi.org/10.1109/
WIFS55849.2022.9975454
9. Bammey, Q., Nikoukhah, T., Gardella, M., von Gioi, R.G., Colom, M., Morel, J.M.:
Non-semantic evaluation of image forensics tools: methodology and database. In:
WACV (2022)
10. Choi, C.H., Choi, J.H., Lee, H.K.: CFA pattern identification of digital cameras
using intermediate value counting. In: MM&Sec (2011)
11. Cozzolino, D., Poggi, G., Verdoliva, L.: Splicebuster: a new blind image splicing
detector. In: WIFS (2015)
12. Cozzolino, D., Verdoliva, L.: Noiseprint: a CNN-based camera model fingerprint.
IEEE TIFS (2020)
13. Desolneux, A., Moisan, L., Morel, J.: From gestalt theory to image analysis. Inter-
disc. Appl. Math. 35 (2007)
14. Desolneux, A., Moisan, L., Morel, J.M.: Meaningful alignments. IJCV (2000)
15. Dirik, A.E., Memon, N.: Image tamper detection based on demosaicing artifacts.
In: ICIP (2009)
16. Ehret, T.: Robust copy-move forgery detection by false alarms control. arXiv
preprint arXiv:1906.00649 (2019)
17. Ferrara, P., Bianchi, T., De Rosa, A., Piva, A.: Image forgery localization via fine-
grained analysis of CFA artifacts. IEEE TIFS 7 (2012)
18. Gardella, M., Musé, P., Morel, J.M., Colom, M.: Noisesniffer: a fully automatic
image forgery detector based on noise analysis. In: IWBF. IEEE (2021)
19. Gardella, M., Nikoukhah, T., Li, Y., Bammey, Q.: The impact of JPEG compression
on prior image noise. In: 2022 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 2689–2693 (2022). https://doi.
org/10.1109/ICASSP43922.2022.9746060
20. Grompone von Gioi, R., Jakubowicz, J., Morel, J.M., Randall, G.: LSD: a fast line
segment detector with a false detection control. IEEE Trans. Pattern Anal. Mach.
Intell. 32 (2010)
21. Grompone von Gioi, R., Jakubowicz, J., Morel, J.M., Randall, G.: LSD: a line
segment detector. IPOL 2 (2012)
22. Grompone von Gioi, R., Randall, G.: Unsupervised smooth contour detection.
Image Process. On Line 6 (2016)
23. Gómez, A., Randall, G., Grompone von Gioi, R.: A contrario 3D point alignment
detection algorithm. Image Process. On Line 7 (2017)
24. Kirchner, M., Fridrich, J.: On detection of median filtering in digital images. In:
MFS II (2010)
25. Korus, P., Huang, J.: Evaluation of random field models in multi-modal unsuper-
vised tampering localization. In: IEEE WIFS (2016)
26. Korus, P., Huang, J.: Multi-scale analysis strategies in PRNU-based tampering
localization. IEEE Trans. Inf. Forensics Secur. (2017)
27. Lezama, J., Randall, G., Morel, J.M., Grompone von Gioi, R.: An unsupervised
point alignment detection algorithm. Image Process. On Line 5 (2015)
28. Li, Y., et al.: A contrario detection of H.264 video double compression. In: 2023
IEEE International Conference on Image Processing (ICIP) (2023)
29. Lisani, J.L., Ramis, S.: A contrario detection of faces with a short cascade of
classifiers. Image Process. On Line 9 (2019)
30. Milani, S., Bestagini, P., Tagliasacchi, M., Tubaro, S.: Demosaicing strategy iden-
tification via eigenalgorithms. In: ICASSP (2014)
31. Nikoukhah, T., Anger, J., Ehret, T., Colom, M., Morel, J.M., Grompone von Gioi,
R.: JPEG grid detection based on the number of DCT zeros and its application to
automatic and localized forgery detection. In: CVPRW (2019)
32. Park, C.W., Moon, Y.H., Eom, I.K.: Image tampering localization using demosaic-
ing patterns and singular value based prediction residue. IEEE Access 9, 91921–
91933 (2021). https://doi.org/10.1109/ACCESS.2021.3091161
33. Popescu, A.C., Farid, H.: Exposing digital forgeries in color filter array interpolated
images. IEEE Trans. Signal Process. 53, 3948–3959 (2005)
34. Shin, H.J., Jeon, J.J., Eom, I.K.: Color filter array pattern identification using
variance of color difference image. J. Electron. Imaging (2017)
35. Wu, Y., AbdAlmageed, W., Natarajan, P.: Mantra-net: manipulation tracing net-
work for detection and localization of image forgeries with anomalous features. In:
IEEE CVPR (2019)
36. Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y.: Large-scale evaluation of
splicing localization algorithms for web images. Multimed. Tools Appl. 76 (2017)
IRIS Segmentation Technique Using IRIS-UNet
Method
Abstract. Precise segmentation of the iris from eye images is an essential
task in iris diagnosis. Many predictions fail due to improper segmentation of
iris images, which results in false predictions of the patient's disease, and the
traditional methods for segmenting the iris are not suitable for medical iris-
diagnosis applications. To overcome this issue, we design a deep learning model
called Iris-UNet which can effectively segment the limbic and pupillary
boundaries from eye images. In the Iris-UNet model, high-level features are
extracted in the encoder path, and segmentation of the limbic and pupillary
boundaries takes place in the decoder path. We have evaluated our Iris-UNet
model on real patient datasets: the CASIA, MMU, and PEC datasets. The
experimental results show that our Iris-UNet model outperforms other
traditional methods.
1 Introduction
Image segmentation is the approach where a digital image is split into different parts
known as image segments, which reduces the complexity of the image for further
processing or analysis. Image segmentation is mainly used to detect the objects
and boundaries of an image, such as lines, curves, etc. A key limitation of image
segmentation is over-segmentation, which occurs when the image suffers from
noise or has intensity variations; many new algorithms have been implemented
to overcome this. Image segmentation is involved in many applications such as
medical image segmentation, audio and video surveillance, iris detection, object
detection, traffic control systems, recognition tasks, etc.
The process of separating the iris portion from the ocular image is known as iris
segmentation. Sometimes, it is referred to as iris localization, an important step in iris
recognition. The thin, annular organ known as the iris, which is located inside the eye,
is in charge of regulating the pupil’s size and diameter. Five layers of fibre-like tissues
make up the iris’ extremely fine structure. These tissues are extremely complex and
can be seen as reticulation, thread-like, linen-like, burlap-like, etc. The iris’ surface is
covered in a variety of intricate texture patterns, including crystals, thin threads, spots,
concaves, radials, furrows, strips, etc. (Fig. 1).
2 Literature Survey
By fusing a memristive network and the genetic algorithm, Yu Yongbin [1] proposed a
network model known as the memristive network-based genetic algorithm (MNGA),
and suggested edge detection using MNGA. This new edge detection approach uses a
filtering technique and a fitness function to assess each pixel, in order to overcome
the limitations of the existing methods, such as noise and the difficulty of designing
an individual's fitness and edge information.
FF(x, y) = Σ_{i=−1,0,1} Σ_{j=−1,0,1} |P(x, y) − P(x + i, y + j)|   (1)
where P(x, y) stands for the image's pixel value at the x-th row and y-th column,
and FF(x, y) indicates the fitness of pixel (x, y). The performance was assessed
based on the FoM, which produced a value of 62.7%.
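As a minimal sketch, the fitness function of Eq. (1) can be written as follows, assuming `P` is a plain 2D list of grayscale values (the function name is ours; the i = j = 0 term is included for fidelity to the formula, and contributes zero):

```python
def fitness(P, x, y):
    """FF(x, y) of Eq. (1): sum of absolute differences between pixel
    (x, y) and every pixel of its 3x3 neighbourhood."""
    return sum(abs(P[x][y] - P[x + i][y + j])
               for i in (-1, 0, 1) for j in (-1, 0, 1))

# A pixel sitting on a strong intensity gradient receives a high
# fitness value, flagging it as a likely edge pixel.
```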
Mamta Mittal [2] proposed a robust edge detection algorithm using a multiple-threshold
approach (B-Edge). Edge connectivity and edge thickness are the two main restrictions
encountered. With minimal noise, the suggested approach successfully recognises
strong and thin edges, and provides better edge continuity and entropy values.
ψ = ( Σ_{α=1}^{m} Σ_{β=1}^{n} area(α, β) ) / (m · n)   (2)
In this case, m and n are the pixel dimensions, and area is an array representing
the input image. For each input image, Graythresh calculates a unique average of
the image.
Phi = (ψ · 20) / 8.33   (3)
The performance is evaluated based on PSNR, which gave a result of 62.47%. The
recommended solution does not work properly for blurred images, and the time
calculations need to be made more accurate. A deep learning strategy could be used
to obtain better outcomes.
Davood Karimi [3] proposed an alternative deep neural network that does not use
convolution operations, to provide more accurate segmentation than FCNs. The model
is based on self-attention between nearby 3D image patches rather than the traditional
convolution building block, and produced accurate outcomes. Starting from the
position- and embedding-encoded patch input sequence X^0, the k-th stage of the
network performs a set of operations to map X^k to X^{k+1}.
The proposed approach was assessed using the DSC (Dice Similarity Coefficient),
achieving 89.2%. Future study will involve applying this model to other medical
image analysis tasks, such as classifying and detecting anomalies.
Iman Aganj [4] introduced a brand-new atlas-based technique for supervised soft
image segmentation that assigns an expected label value at each position of the
new image, even when only one valid label is discovered. The expected label value
map is computed by performing a straightforward convolution with a key, after a
fast Fourier transform (FFT). When N atlases with manual labels are available and
are aligned in the same space, the equation can be expressed as follows:
E := (1/N) Σ_{i=1}^{N} E[L_i ∘ T | I, J_i]   (4)
where L_i and J_i are the i-th pair of manual label and atlas images, respectively.
The performance, evaluated with the DSC, showed a result of 92%. The segmentation
accuracy can still be improved by using an ELV map.
Dhillon and Chouhan [5] address two primary shortcomings, noisy structures and
discontinuous edges, in Canny edge detection. Along with SR, a comprehensive
examination of CED is also provided.
The suggested algorithm significantly outperformed CED, producing performance
outcomes of ODS = 0.79, OIS = 0.81, and AP = 0.50 across several measures. How-
ever, the algorithm's performance has been constrained by the lack of any pre- or
post-processing. Future research will concentrate on developing deep neural networks
for enhanced edge detection methods to overcome these constraints.
To achieve automatic spine parsing for volumetric MR images, Shumao Pang [6] sug-
gested a brand-new two-stage system called SpineParseNet. SpineParseNet performs
3D coarse segmentation on a 3D graph representation, followed by 2D segmentation
refinement with a 2D residual U-Net (ResUNet). Memory costs are decreased during
the training and testing phases of this two-stage segmentation scheme. The main
benefit of the suggested strategy is that a GCN is used to enhance the distinction
of various spinal structures. The semantic graph is produced as follows:
F^S = σ( A^e σ( A^e σ( A^e F^G W_1^e ) W_2^e ) W_3^e )   (5)

where W_1^e ∈ R^{m×m}, W_2^e ∈ R^{m×m}, and W_3^e ∈ R^{m×m} are three trainable weight matrices.
As a result, SpineParseNet successfully completed accurate spine parsing for
volumetric MR images. DSC was used as a metric, and it obtained 0.87. However,
the top structure of the image cannot be segmented by SpineParseNet. In the future,
regions with high levels of uncertainty could be segmented using a region-specific
classifier to obtain better segmentation results.
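A single graph-convolution step of the form σ(A F W), the building block that Eq. (5) applies three times, can be sketched on plain-list matrices (an illustrative toy, not the SpineParseNet implementation; the function names are ours):

```python
def matmul(X, Y):
    """Plain-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def gcn_layer(A, F, W, sigma=lambda v: max(v, 0.0)):
    """One graph-convolution step sigma(A @ F @ W): A is the (normalized)
    adjacency matrix, F the node features, W a trainable weight matrix,
    sigma a pointwise nonlinearity (ReLU by default)."""
    return [[sigma(v) for v in row] for row in matmul(matmul(A, F), W)]
```

Nesting three such calls with weights W_1^e, W_2^e, W_3^e reproduces the structure of Eq. (5).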
The optimisation loss for the target and source domain translation sub-network can
be quantified as follows:

L_sup = (1/N) Σ_{i=1}^{N} L_seg(Q_i^s, Y_i) + (α/N) Σ_{i=1}^{N} L_mse(Q_i^{s,sdm}, Y_i^{sdm})   (8)
SimCVD produced brand-new state-of-the-art outcomes. The Dice score for this
approach is 89.03%. The approach may also be expanded to address multi-class medical
image segmentation challenges.
240 M. Poovayar Priya and M. Ezhilarasan
Yue Zhao [10] created a teeth segmentation model using a two-stream graph convo-
lutional network (TSGCN). The naïve concatenation of many raw attributes during
the input phase causes unneeded complexity when describing and differentiating
between mesh cells; modern deep learning-based systems, in contrast, present
completely different geometric data and raw properties. The TSGCN model can
successfully handle the inter-view confusion between them in order to combine the
complementary information from the various raw attributes and create discriminative
multi-view geometric representations. The basic topology of the intra-oral scanning
image is extracted by the C-stream from each cell's coordinates. The information
from each neighbourhood is combined and sent to each centre as follows:
f_i^{l+1} = Σ_{m_{ij} ∈ N(i)} α_{ij}^l f_{ij}^l   (10)
For all the cells, the normal vectors are assigned to a canonical space as inputs to N-stream
via an input transformer module before the hierarchical extraction of higher-level feature
representations. The following formulation describes the boundary representation for the
relevant centre:
f_i^{l+1} = maxpooling( f_{ij}^l , ∀ m_{ij} ∈ N(i) )   (11)
This approach achieves an overall accuracy of 96.69%. Future work can be expanded
to handle a large number of training samples, since the current method is only
appropriate for a small number of training samples.
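The two neighbourhood-aggregation rules of Eqs. (10) and (11) can be sketched as follows (a toy illustration on plain lists; the function names are ours):

```python
def attention_aggregate(neighbor_feats, alphas):
    """Eq. (10): attention-weighted sum of neighbour feature vectors,
    one learned weight alpha per neighbour."""
    dim = len(neighbor_feats[0])
    return [sum(a * f[k] for a, f in zip(alphas, neighbor_feats))
            for k in range(dim)]

def max_aggregate(neighbor_feats):
    """Eq. (11): element-wise max-pooling over neighbour features."""
    return [max(f[k] for f in neighbor_feats)
            for k in range(len(neighbor_feats[0]))]
```

The C-stream uses the attention-weighted sum, while the N-stream's boundary representation uses max-pooling.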
Kai Han [11] proposed a deep semi-supervised method that incorporates a pseudo-
labeling technique for segmenting liver CT images. The volume of training data
remains the fundamental obstacle to developing a deep segmentation model. A liver
image segmentation method based on a semi-supervised framework was introduced,
using guidance from labeled images to produce high-quality pseudo-labels for
unlabeled images.
The Dice score for this approach was 86.83%. Future work will focus on creating
more reliable class representations, produced by the network architecture and the
quantity of output channels, to normalize and direct the formation of labels.
2.12 SSL-ALPNet
The mean value of this method is 80.16%. The future work can be extended to
segment more than one class i.e., multi-way segmentation.
Zhiyong Wang [14] created a deep convolutional neural network (DCNN) that automati-
cally extracts the iris and pupil pixels of each eye from the input image. UNet and
SqueezeNet were merged in this network to develop a potent convolutional neural
network for image categorization. With just a single RGB camera, this method advances
the state-of-the-art in 3D eye gaze tracking. The cross-entropy loss between the
segmentation outcome and the ground truth is expressed as follows:
E = arg min_θ − Σ_i [ w_i log(P_i) + (1 − w_i) log(1 − P_i) ]   (16)
By using this method, a mean inaccuracy of 8.42° in eye gaze directions was obtained.
This method works well since it is quick, completely automatic, and precise. Both PCs
and cell phones can use this technology in real-time.
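The weighted cross-entropy of Eq. (16) can be sketched as a stand-alone function (illustrative; the `eps` guard against log(0) is our addition, and the function name is ours):

```python
import math

def weighted_bce(weights, probs, eps=1e-12):
    """Weighted binary cross-entropy of Eq. (16): w_i are per-pixel
    ground-truth weights, probs the predicted probabilities P_i."""
    return -sum(w * math.log(p + eps) + (1 - w) * math.log(1 - p + eps)
                for w, p in zip(weights, probs))
```

Minimizing this quantity over the network parameters θ drives each predicted probability toward its ground-truth weight.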
4 Practical Details
4.1 Experimental Setup
In this paper, the automatic separation of the iris part from the eye image is shown
in Fig. 2. Training data from 20 participants and testing data from 5 subjects were
randomly selected from the datasets. The outputs are visualized using seismic and
salt plots. We use three different datasets for iris segmentation: the CASIA, MMU,
and PEC databases (Fig. 3).
Our Iris-UNet was compared with two different models, the traditional U-Net and
LinkNet, on the three datasets, as shown in Table 2. The overall segmentation
performance was computed using the overall accuracy metric, calculated as:
Accuracy = Number of Correct Predictions / Total Number of Predictions
Table 2. Comparison of our method - different three datasets with state-of-the-art methods
Two observations can be drawn from Table 2: 1) Among all three datasets, the PEC
database, which consists of student data collected at Puducherry Technological
University, achieves higher accuracy than the other, commercially available datasets.
2) Our Iris-UNet model achieves the highest accuracy on all three datasets. Figures 4
and 5 present the segmentation results with different plots (seismic, salt, salt
predicted, and salt predicted binary), from which we can make two observations:
1) In Fig. 4, our Iris-UNet segments the outer part of the iris, i.e., it separates the
limbic boundary from the eye image. 2) In Fig. 5, it segments the inner part of the
pupil by separating the pupillary boundary from the iris part.
4.3 Discussion
4.3.4 Limitations
Although our method and the dataset achieve higher performance compared with all
other methods, our method fails in eye occlusion cases, where the subject closes
their eye with the upper or lower lid while the photo of the iris is captured through
sensors. In those cases, our Iris-UNet is unable to detect the limbic and pupillary
boundaries from the eye image.
5 Conclusion
In this paper, the Iris-UNet approach has been developed to segment the limbic and
pupillary boundaries from the eye image, i.e., to separate the iris from the eye. First,
the limbic boundary was separated from the eye image. Then, the pupillary boundary
was separated from the iris part. An extensive comparison was performed with three
different datasets and two methods. The outcomes show how effective our suggested
strategy is, particularly in cases involving real patients. Eye occlusion is the major
concern to be focussed on so that all iris images can be detected, particularly for
medical diagnosis purposes, which could be the future direction of this research.
References
1. Yu, Y., Yang, C., Deng, Q., Tashi, N., Shouyi, L., Chen, Z.: Memristive
network-based genetic algorithm and its application to image edge detection. J. Syst.
Eng. Electron. 32(5), 1062–1070 (2021)
2. Mittal, M., et al.: An efficient edge detection approach to provide better edge connectivity for
image analysis. IEEE Access 7, 33240–33255 (2019)
3. Karimi, D., Dou, H., Gholipour, A.: Medical image segmentation using transformer networks.
IEEE Access 10, 29322–29332 (2022)
4. Aganj, I., Fischl, B.: Multi-atlas image soft segmentation via computation of the expected
label value. IEEE Trans. Med. Imaging 40(6), 1702–1710 (2021)
5. Dhillon, D., Chouhan, R.: Enhanced edge detection using SR-guided threshold maneuvering
and window mapping: handling broken edges and noisy structures in canny edges. IEEE
Access 10, 11191–11205 (2022)
6. Pang, S., et al.: SpineParseNet: spine parsing for volumetric MR image by a two-stage seg-
mentation framework with semantic image representation. IEEE Trans. Med. Imaging 40(1),
262–273 (2020)
7. Li, X., Yu, L., Chen, H., Fu, C.W., Xing, L., Heng, P.A.: Transformation-consistent self-
ensembling model for semisupervised medical image segmentation. IEEE Trans. Neural Netw.
Learn. Syst. 32(2), 523–534 (2020)
8. Han, X., et al.: Deep symmetric adaptation network for cross-modality medical image
segmentation. IEEE Trans. Med. Imaging 41(1), 121–132 (2021)
9. You, C., Zhou, Y., Zhao, R., Staib, L., Duncan, J.S.: SimCVD: simple contrastive voxel-
wise representation distillation for semi-supervised medical image segmentation. IEEE Trans.
Med. Imaging 41, 2228–2237 (2022)
10. Zhao, Y., et al.: Two-stream graph convolutional network for intra-oral scanner image
segmentation. IEEE Trans. Med. Imaging 41(4), 826–835 (2021)
11. Han, K., et al.: An effective semi-supervised approach for liver CT image segmentation. IEEE
J. Biomed. Health Inform. 26(8), 3999–4007 (2022). https://doi.org/10.1109/JBHI.2022.316
7384
12. Ouyang, C., Biffi, C., Chen, C., Kart, T., Qiu, H., Rueckert, D.: Self-supervised learning for
few-shot medical image segmentation. IEEE Trans. Med. Imaging 41(7), 1837–1848 (2022).
https://doi.org/10.1109/TMI.2022.3150682
13. Wang, C., Muhammad, J., Wang, Y., He, Z., Sun, Z.: Towards complete and accurate iris
segmentation using deep multi-task attention network for non-cooperative iris recognition.
IEEE Trans. Inf. Forensics Secur. 15, 2944–2959 (2020)
14. Wang, Z., Chai, J., Xia, S.: Realtime and accurate 3D eye gaze capture with DCNN-based
iris and pupil segmentation. IEEE Trans. Vis. Comput. Graph. 27(1), 190–203 (2019)
15. Sabry Abdalla, M., Omelina, L., Cornelis, J., Jansen, B.: Iris segmentation based on an
optimized U-Net. In: BIOSIGNALS, pp. 176–183 (2022)
Image Acquisition by Image Retrieval
with Color Aesthetics
1 Introduction
Due to the technological advancements in the past few decades, capturing well-
composed photos is no longer restricted to those with personal expertise. Independent
proxy cameras can produce stunning photographs, and photographers also share their
work on the Internet [2]. Consequently, desired, high-quality images can possibly
be found through online searches. However, with an unlimited number of images
available on the Internet, it would be time-consuming to search one by one to find
the image of interest. In this paper, we propose a technique that leverages location
information and cloud data to obtain photos without an actual image capture
operation. The idea is to create a cloud database and use it for image retrieval
under certain circumstances, based on an IoT framework [10]. Our method consists
of two parts, an insertion module and
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 250–261, 2023.
https://doi.org/10.1007/978-3-031-45382-3_21
a search module. The insertion module selects the images to be stored, as well as
records the related information such as acquisition time, location, and weather
conditions; it stores the photos and their information in the cloud database.
The search module retrieves the images from the database based on the location
information. Our system uses an image selection algorithm to recommend the
best photos to the users for consideration. The overall flowchart is illustrated in
Fig. 1.
– Self-collected photos with sensor data in the experiments demonstrate the
usability of the developed system.
– The NIMA technique based on the VGG16 model is used to adapt the data
before processing, improving the quality of photo selection.
– The photo retrieval system is realized in hardware and operated in outdoor
scenes, showing its practicability for applications.
2 Related Work
The human visual system receives a tremendous amount of sensor data from
the environment. To reduce the information complexity, saliency detection algo-
rithms prioritize the most visually prominent information within a given scene,
reflecting the way humans tend to focus on the most attention-grabbing element.
Zhao et al. introduced the Pyramid Feature Attention (PFA) network in an early
work for detecting image saliency [20]. In the PFA architecture, a context-aware
pyramid feature extraction module was employed to extract high-level features
at multiple scales. The channel attention module then selected the appropriate
scale to generate salient regions. Finally, the spatial attention module filtered
noise to refine the boundaries of the salient regions. Kroner et al. introduced a
convolutional neural network (CNN) for large-scale image classification, utilizing
a pre-training technique. The proposed network architecture is based on VGG16
and consists of an encoder-decoder structure with multiple parallel convolutional
layer modules. It enables the extraction of multi-scale features to predict salient
objects in images [8].
Psychologists believe that vision is the primary sense of human beings, and
color is considered to have the greatest influence [12]. It is also commonly believed
that color has a significant impact on people’s emotions and feelings. The impact
of colors on people’s psychological state is attributed to the origin of colors from
nature and the feelings associated with natural scenes. This is a fundamental
influence on human psychology. Adams et al. point out that there is no notable
difference in color perception between men and women, and individuals from
diverse cultural backgrounds exhibit similar attitudes towards the emotional
responses evoked by colors [1]. The Luscher color test is a personality assessment
tool [11], which is used to evaluate the subject’s personality characteristics with
eight color cards. Ram et al. conducted a study in which 944 respondents were
asked to provide 20 emotional associations for 12 different colors [16]. The color-
emotion associations were then quantified using neural networks, support vector
machine (SVM), and 10-fold cross-validation with the variation in both age and
gender.
3 Method
The schematic diagram of the proposed system for image acquisition using loca-
tion information and cloud data is illustrated in Fig. 1. The system is a combi-
nation of the insert and search modules. In the insert module, users can select
the photos to be stored in the cloud database. The image information, including
EXIF data and weather recognition results, will be extracted. In the search mod-
ule, users can retrieve photos by providing the location information and other
characteristics provided by EXIF. Then, the system searches for similar images
in the cloud database, evaluates their quality, and ranks them by score. Finally,
the system recommends the best photos and provides feedback to the user.
254 H.-F. Lin and H.-Y. Lin
Image Information. Upon entering the insert module and selecting an image,
the system reads its information using EXIF. It includes the image shooting time,
exposure time, and GPS latitude and longitude information. The Sensor Logger
APP is used to obtain a CSV file containing 3D accelerometer data, which is then
used to calculate the three-axis attitude angle of the image, specifically the roll,
pitch, and yaw values. This is achieved by converting the three-dimensional data
of the gyroscope and magnetometer. The following formula is used to calculate
the attitude angle value:
R = (180/π) · tan⁻¹( accelY / √(accelX² + accelZ²) )
P = (180/π) · tan⁻¹( accelX / √(accelY² + accelZ²) )
Y = (180/π) · tan⁻¹( −mag_y / mag_x )
The roll and pitch are determined based on the three-dimensional data obtained
from the accelerometer, while the yaw is computed using the three-dimensional
data acquired from the magnetometer.
mag_x = magX · cos P + magY · sin R · sin P + magZ · cos R · sin P
mag_y = magY · cos R − magZ · sin R
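The attitude-angle computation above can be sketched in Python (a sketch under the stated formulas; using `atan2` in place of a plain arctangent to preserve quadrants is our choice, as is the function name):

```python
import math

def attitude_angles(accelX, accelY, accelZ, magX, magY, magZ):
    """Roll and pitch from the accelerometer, and tilt-compensated yaw
    from the magnetometer, following the formulas above (in degrees)."""
    roll = math.degrees(math.atan2(accelY, math.hypot(accelX, accelZ)))
    pitch = math.degrees(math.atan2(accelX, math.hypot(accelY, accelZ)))
    # tilt compensation of the magnetometer reading
    r, p = math.radians(roll), math.radians(pitch)
    mx = (magX * math.cos(p) + magY * math.sin(r) * math.sin(p)
          + magZ * math.cos(r) * math.sin(p))
    my = magY * math.cos(r) - magZ * math.sin(r)
    yaw = math.degrees(math.atan2(-my, mx))
    return roll, pitch, yaw
```

For a device lying flat (gravity on the Z axis, magnetic field along X), all three angles come out as zero, as expected.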
We collect the nine-axis sensor data at the frequency of 1 Hz, which results in
approximately 5 or 6 data points per photo. To determine the three-axis attitude
angle, we calculate the time difference between the data point and the photo. If
it is within one second, the nine-axis sensor data is then used to calculate the
three-axis attitude angle.
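The one-second matching rule described above can be sketched as follows (illustrative; the representation of the sensor log as (timestamp, reading) pairs is our assumption):

```python
def match_sensor_sample(photo_t, samples):
    """Return the sensor reading whose timestamp is closest to the photo
    timestamp, provided it is within one second; otherwise None.
    `samples` is a list of (timestamp, reading) pairs at roughly 1 Hz."""
    t, reading = min(samples, key=lambda s: abs(s[0] - photo_t))
    return reading if abs(t - photo_t) <= 1.0 else None
```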
After providing the location of the scene for photo capture, the search module
will look for similar information in the cloud database, and use an algorithm to
calculate and recommend the best images. There are two different ways to enter
the location information: manually typing the GPS coordinates of the place of
interest, or selecting an image taken at that location and using the EXIF module
to extract the GPS latitude and longitude information from it.
To ensure that the results generated by the best image selection algorithm
are compliant with the human aesthetic preferences, we use the AVA dataset for
training [13]. The dataset contains around 255,000 images with scores, semantic
labels and style labels assigned by 200 amateur and professional photographers
based on their aesthetic sense. It utilizes the rating scale from 1 to 10, with 10
representing the best quality, and the average rating score for the AVA dataset is
approximately 5.5. This work mainly focuses on the outdoor image acquisition,
which is more complex than the indoor scenes in general. Thus, we use a saliency
detection technique to derive the regions of attention in the images. The network
is an encoder-decoder structure based on the VGG16 architecture [8]. It is trained
for 100 epochs, with the image size of 240 × 320, batch size of 1, and the learning
rate of 10−5 .
In photo aesthetics, we incorporate the concept of color aesthetics, since the
color representation greatly affects the viewing experience, in addition to the
main objects in the image. For training, the AVA dataset is adopted. We use
Pseudo Color to convert the number of colors in the images to 12, as shown in
Fig. 2. There are 11 basic colors selected based on the work of Kay et al. [4], and
an additional turquoise suggested by Mylonas et al.'s color naming experiment
conducted on the Internet [14]. Colors are grouped based on similar emotional
associations proposed by Ram et al. [16]. The colors with higher quantification
values signify common emotions within that color group as illustrated in Fig. 3.
It is believed that the common emotions evoked by the colors in an image, whether
joyful or sad, are pure and straightforward. If there are too many colors in an
image, it can lead to complex emotions and diminish the overall viewing
experience.
The images are transformed into 12 colors. ΔE (total color difference) is based on
the three color-difference components, ΔL*, Δa*, and Δb*, in the rectangular
coordinate system. After converting the pixel values of both images to the Lab
color space, we apply the CIEDE2000 color difference formula to calculate the
color difference for each pixel; the per-pixel differences are averaged to derive
the color difference value of the image.
region is obtained from the saliency map detection, the image is converted to the
one with 12 colors. We calculate the pixel value of the saliency region and check
if it falls within the same color category, and then adjust the score accordingly.
If the image contains only one color group, it implies that the image expresses a
single emotion. If the image contains various color groups, it implies that there
are different emotions conveyed in the image. In either case, we multiply the
color ratio of the image by the quantized value and add the resulting value to
the original score of the image.
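The per-pixel averaging step can be sketched with the simpler CIE76 Euclidean distance standing in for CIEDE2000 (an assumption made here for brevity; the paper uses the full CIEDE2000 formula, and the function names are ours):

```python
import math

def delta_e76(lab1, lab2):
    """Euclidean Lab distance (CIE76) -- a simpler stand-in for the
    CIEDE2000 formula used in the paper."""
    return math.dist(lab1, lab2)

def mean_color_difference(img1, img2):
    """Average per-pixel color difference between two images given as
    equal-length lists of (L, a, b) tuples."""
    return sum(delta_e76(p, q) for p, q in zip(img1, img2)) / len(img1)
```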
To recommend the best photos from the cloud image database to users, the
quality evaluation, score calculation and ranking are necessary. We use the NIMA
method [17] for image quality assessment, which predicts human opinion scores
using convolutional neural networks, not just a high or low quality classification.
We use the AVA dataset with adjusted scores for network training and prediction.
The backbone network we use for the NIMA method is VGG16, and the last
layer of the VGG16 network is replaced by a fully connected layer that outputs
a distribution over 10 quality score categories. The architecture diagram is
illustrated in Fig. 4. We train the network on the AVA dataset after saliency
detection and pseudo color processing, for 100 epochs, with 256 × 256 image
input cropped to 224 × 224, a batch size of 16 and a learning rate of 5 × 10−4 .
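Since NIMA predicts a full distribution over the ten score bins rather than a single class, the scalar quality used for ranking is the distribution's mean; a minimal numpy sketch (the helper names are our own):

```python
import numpy as np

def nima_mean_score(prob):
    """Collapse NIMA's 10-bin score distribution into a scalar quality score.

    prob: array of 10 softmax probabilities for score bins 1..10.
    """
    bins = np.arange(1, 11)            # the ten score categories
    return float(np.sum(bins * prob))  # expected value of the distribution

def nima_score_std(prob):
    """Standard deviation of the predicted distribution (score uncertainty)."""
    bins = np.arange(1, 11)
    mean = np.sum(bins * prob)
    return float(np.sqrt(np.sum(((bins - mean) ** 2) * prob)))
```

A one-hot distribution on bin 5 gives a mean score of exactly 5 with zero spread, while a flat distribution gives the midpoint 5.5.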
4 Experiments
In this section, we evaluate the effectiveness of our optimal image selection algo-
rithm and image retrieval system. We first compare our best image selection
algorithm with the original method, followed by the performance evaluation of
the best image selection algorithm through human aesthetic evaluation. Finally,
the developed system is presented. We adopt the modified and original AVA datasets and
use the VGG16 model for comparison. For testing, we use self-collected images
Fig. 5. The top contains the images in the AVA dataset, and the bottom presents the
results after adding BitwiseAND.
The recorded information for each photo includes its name, acquisition time,
geographic coordinates, and three-axis attitude angle.
To select the image capture or retrieval modules on Raspberry Pi, we use
multi-control buttons on the Sense HAT expansion board, with button direc-
tions acting as trigger keys. Although the three-axis attitude angle information
can be obtained, the GPS data are received at a 9600 baud rate, which takes
about 0.5 s per reading. We set the received data to the GNGLL positioning
format, which returns the time and the GPS latitude and longitude. However,
GNGLL data may not be received every time a readback is triggered, so multiple
attempts may be needed to obtain the data reliably. The image capture system
is activated by pressing the multi-control buttons on the Sense HAT expansion
board in the upward direction. Once the GPS module starts receiving
data in the GNGLL format, the images are captured by the camera and the
acquisition time is recorded by the Sense HAT expansion board. The location
coordinates collected by the GPS module are then stored in the cloud database.
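A minimal sketch of GNGLL handling is shown below. The field layout follows the standard NMEA 0183 GLL sentence; checksum verification is omitted, and the function name is our own, not part of the described system.

```python
def parse_gngll(sentence):
    """Parse an NMEA GNGLL sentence into (latitude, longitude, utc_time).

    Returns decimal-degree coordinates, or None if the fix is invalid.
    Checksum handling is omitted for brevity.
    """
    fields = sentence.strip().split("*")[0].split(",")
    if fields[0] != "$GNGLL" or len(fields) < 7 or fields[6] != "A":
        return None                      # not a valid GNGLL fix

    def dm_to_deg(value, hemi, deg_digits):
        deg = float(value[:deg_digits])          # leading whole degrees
        minutes = float(value[deg_digits:])      # remaining minutes
        dec = deg + minutes / 60.0
        return -dec if hemi in ("S", "W") else dec

    lat = dm_to_deg(fields[1], fields[2], 2)     # latitude is ddmm.mmmm
    lon = dm_to_deg(fields[3], fields[4], 3)     # longitude is dddmm.mmmm
    return lat, lon, fields[5]                   # UTC time as hhmmss.ss
```

This also illustrates why multiple attempts may be needed: a sentence with a `V` (void) status field is simply rejected and the read must be retried.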
If we set the direction of the multi-control buttons on the Sense HAT expan-
sion board to be down, it activates the retrieval module. In this mode, we also
set the GPS to receive data in the GNGLL format. The images captured by
the camera can be saved to a local folder. To overcome the limitations of the
display connected to Raspberry Pi, the optimal image selection algorithm is exe-
cuted on the computer instead. We utilize the Samba server running on Linux
to establish a network between the Raspberry Pi board and the computer. After
the computer reads the images, the optimal image selection algorithm calculates
their aesthetic scores and identifies the top three images. These images are then
sent back to the Raspberry Pi. A GUI built with PyQt5 is then displayed on the
Raspberry Pi. This interface presents the three images with the highest scores,
and users can subjectively choose among them based on their preferences.
Figure 7 shows several best image selection results with the proposed algorithm.
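The recommendation step described above reduces to ranking by aesthetic score; a minimal sketch (function name is illustrative):

```python
import heapq

def recommend_top_three(image_scores):
    """Return the three highest-scoring images as (name, score) pairs,
    best first. image_scores maps image name -> aesthetic score."""
    return heapq.nlargest(3, image_scores.items(), key=lambda kv: kv[1])
```

The returned pairs can be sent back to the Raspberry Pi and shown in the PyQt5 interface in score order.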
5 Conclusion
This paper develops an image acquisition system using location information and
cloud data. It consists of two main modules: an image insertion module and
an image retrieval module. The system enables sensor data recording for image
information acquisition, storage of desired images, and efficient search for the
images. Furthermore, the system utilizes an aesthetic score-based algorithm to
recommend the best-looking images to the users. As the current publicly avail-
able datasets do not meet our requirements, we conduct our own data collection
of photos and nine-axis sensor data. The best image selection algorithm and the
entire system are then evaluated with respect to weather recognition. To ensure
diversity in our dataset, we capture photos under three different weather conditions,
which meets our needs for variability in both weather and color presentation.
Future work will focus on identifying weather conditions to further improve
best-photo selection.
Image Acquisition by Image Retrieval with Color Aesthetics 261
References
1. Adams, F.M., Osgood, C.E.: A cross-cultural study of the affective meanings of
color. J. Cross Cult. Psychol. 4(2), 135–156 (1973)
2. AlZayer, H., Lin, H., Bala, K.: AutoPhoto: aesthetic photo capture using reinforce-
ment learning. In: 2021 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pp. 944–951. IEEE (2021)
3. Asarkar, S.V., Phatak, M.V.: Effects of color on visual aesthetics sense. In: Bhalla,
S., Kwan, P., Bedekar, M., Phalnikar, R., Sirsikar, S. (eds.) Proceeding of Interna-
tional Conference on Computational Science and Applications. AIS, pp. 181–194.
Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-0790-8 19
4. Berlin, B., Kay, P.: Basic Color Terms: Their Universality and Evolution. Univer-
sity of California Press (1991)
5. Chu, W.T., Zheng, X.Y., Ding, D.S.: Image2weather: a large-scale image dataset
for weather property estimation. In: 2016 IEEE Second International Conference
on Multimedia Big Data (BigMM), pp. 137–144. IEEE (2016)
6. Xia, J., Xuan, D., Tan, L., Xing, X.: ResNet15: weather recognition on traffic road
with deep convolutional neural network (2020)
7. Kang, L.W., Chou, K.L., Fu, R.H.: Deep learning-based weather image recognition.
In: 2018 International Symposium on Computer, Consumer and Control (IS3C),
pp. 384–387. IEEE (2018)
8. Kroner, A., Senden, M., Driessens, K., Goebel, R.: Contextual encoder-decoder
network for visual saliency prediction. Neural Netw. 129, 261–270 (2020)
9. Lin, H.Y., Chang, C.C., Chou, X.H.: No-reference objective image quality assess-
ment using defocus blur estimation. J. Chin. Inst. Eng. 40(4), 341–346 (2017)
10. Lin, H.Y., Wu, Z.Y.: Development of automatic gear shifting for bicycle riding
based on physiological information and environment sensing. IEEE Sens. J. 21(21),
24591–24600 (2021)
11. Lüscher, M.: The Luscher Color Test. Simon and Schuster (1971)
12. Mahnke, F.H.: Color, Environment, and Human Response: An Interdisciplinary
Understanding of Color and Its Use as a Beneficial Element in the Design of the
Architectural Environment. Wiley, Hoboken (1996)
13. Murray, N., Marchesotti, L., Perronnin, F.: AVA: a large-scale database for aes-
thetic visual analysis. In: 2012 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2408–2415. IEEE (2012)
14. Mylonas, D., MacDonald, L.: Augmenting basic colour terms in English. Color.
Res. Appl. 41(1), 32–42 (2016)
15. Purcell, M.: A new land: Deleuze and Guattari and planning. Plann. Theory Pract.
14(1), 20–38 (2013)
16. Ram, V., et al.: Extrapolating continuous color emotions through deep learning.
Phys. Rev. Res. 2(3), 033350 (2020)
17. Talebi, H., Milanfar, P.: NIMA: neural image assessment. IEEE Trans. Image Pro-
cess. 27(8), 3998–4011 (2018)
18. Yavuz, O.: Novel paradigm of cameraless photography: methodology of AI-
generated photographs. Proc. EVA Lond. 2021, 207–213 (2021)
19. Zhang, Z., Ma, H.: Multi-class weather classification on single images. In: 2015
IEEE International Conference on Image Processing (ICIP), pp. 4396–4400. IEEE
(2015)
20. Zhao, T., Wu, X.: Pyramid feature attention network for saliency detection. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pp. 3085–3094 (2019)
Improved Obstructed Facial Feature
Reconstruction for Emotion Recognition
with Minimal Change CycleGANs
1 Introduction
Facial expression recognition [12] is crucial in various research areas such as
psychology, medicine, and computer vision. In computer vision, occlusion of
facial features presents a challenge for existing algorithms, as most assume a fully
visible face and lack consideration for occluded features in available datasets.
We focused on detecting six basic emotions (anger, disgust, fear, happiness,
sadness, and surprise) defined by Ekman and Friesen [4], which are characterized
by specific facial muscle activation. Accurate measurement of muscle activation
typically requires recordings detected via surface electrodes (sEMG). These elec-
trodes and their cables cover parts of the faces. Combining EMG-based facial
expression recognition with computer vision algorithms can enhance understand-
ing of the underlying facial muscle activity. Our study involved 36 healthy sub-
jects performing the six basic emotions four times, with recordings taken both
with and without attached sEMG surface electrodes. This allowed for comparison
between mimicked and targeted expressions. We employed the ResMaskNet [20]
architecture for emotion detection but found that it struggled to handle occluded
facial features, resulting in predictions of only anger and surprise. Mimicked
expressions such as disgust were recognized only 2 out of 528 times.
Büchner et al. [3] attempted to restore facial features by interpreting face
coverage as a learnable style between uncovered and covered faces. Despite
promising results, the method has drawbacks: it requires a separate model for
each individual, and it memorizes uncovered faces, which results in hallucinated
occluded features, see Fig. 3. We extended Büchner et al.'s work
by introducing new regularization terms to the optimization problem, enabling
a single model to be trained for multiple individuals. We increased individual
accuracy from 33.8% (random guessing) up to 90% and demonstrated emotion
detection on individuals not part of the training set, providing generalizability
to unseen individuals. Fine-tuning the general model to specific individuals with
minimal data further improved results.
We conducted ablation studies to enhance the backbone network, signifi-
cantly reducing the number of parameters and computational cost while main-
taining comparable results. We evaluated emotion classification accuracy and the
following qualitative metrics: Fréchet Inception Distance (FID) [17], Structural
Similarity Index (SSIM) [25], and Learned Perceptual Image Patch Similarity
(LPIPS) [27]. These advancements are crucial for EMG-based facial expression
recognition applications. The reduced parameters, improved generalization, and
enhanced visual quality enable a more efficient, robust implementation. Addi-
tionally, these improvements could benefit live or therapeutic applications requir-
ing visual feedback.
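As a simple illustration of one of these metrics, a single-window (global) SSIM can be computed directly from its formula. This is a simplification: the SSIM values reported in practice use the standard local-window formulation.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM computed over the whole image.

    A simplification of the usual local-window SSIM: means, variances and
    covariance are taken globally instead of per sliding window.
    """
    c1 = (0.01 * data_range) ** 2          # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

Identical images score 1; any luminance, contrast, or structural change pushes the score below 1.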
2 Related Work
Restoring occluded facial features is challenging due to their invisibility in the
input images. Generative approaches, such as Generative Adversarial Networks
(GANs) [6], can learn anatomically correct facial features from non-occluded
faces, making them suitable for this task.
GANs have demonstrated strong results in image generation, particularly in
medical applications [26]. However, they are typically trained on specific datasets
and struggle to generalize to unseen data. Facial generation research [10,11,16]
employs GANs to create realistic faces from specific datasets, indistinguishable
from real faces [24]. These works focus on non-occluded faces, thus generating
only non-occluded facial images.
264 T. Büchner et al.
Restoring hidden facial features is more challenging since GANs must learn
facial features from non-occluded faces. Li et al. [15] used GANs to restore
artificially altered faces; however, the unrealistic changes limit its applicabil-
ity to real-world data. Moreover, the restored facial expressions do not match
the original ones. Alternative methods attempt to use GANs for transferring
facial attributes [18] to modify faces.
Some approaches attempted to generalize GANs to unseen data. For instance,
Zhu et al. [28] demonstrated that CycleGANs can translate images between
domains without paired data, allowing each generator to learn a specific domain
style. CycleGANs can thus translate covered faces to uncovered faces in medical
applications, treating sEMG coverage as a domain style.
Abramian and Eklund [1] used CycleGANs to restore obfuscated faces in MRI
images, focusing on patient privacy and side views with no facial expression dif-
ferences. However, they did not evaluate the quality of restored faces. In contrast,
our work focused on facial feature restoration and quality evaluation.
We build upon Büchner et al. [3], using the CycleGAN architecture for facial
feature restoration. While their architecture remained unchanged, every individ-
ual required a separately trained model which could hallucinate facial features.
We introduce a new regularization term to the optimization problem, enforcing
minimal changes and also enabling a single model to be trained for multiple
individuals.
3 Method
The CycleGAN architecture by Zhu et al. [28] consists of two generators and two
discriminators, shown in Fig. 1. The generators, trained adversarially [6], trans-
late images between domains while maintaining cycle consistency [28], allow-
ing the translated images to be reverted to the original domain. Discriminators
distinguish between real and fake generated faces. To improve visual quality,
generators are trained using an identity loss [22], preserving color composition
between input and output. We hypothesize that the identity loss encourages gen-
erators to learn input faces’ facial features, enabling better feature restoration.
Facial Feature Reconstruction with Minimal Change CycleGANs 265
Fig. 1. The CycleGAN architecture for facial feature restoration: The generators GA
and GB translate images between domains A and B. The minimal change regularization
ensures that the generators do not modify uncovered areas.
Lcycle = λA · LL1(GB(GA(A)), A) + λB · LL1(GA(GB(B)), B),   (2)
Lidt = λidt · (LL1(GB(A), A) + LL1(GA(B), B)).   (3)
While the identity loss helps generators to learn facial features, it is insufficient
for correct facial feature restoration. As seen in Fig. 3, the generator hallucinated
facial features not present in the original face, contrary to our goal. Limitations
shown in [3] result from changes in uncovered face areas.
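Equations (2) and (3) can be sketched numerically as follows, with the generators passed in as callables. This is a numpy illustration, not the training code, and the λ defaults are common CycleGAN choices rather than values from the paper.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, the L_L1 term of Eqs. (2) and (3)."""
    return np.abs(a - b).mean()

def cycle_loss(G_A, G_B, a, b, lam_a=10.0, lam_b=10.0):
    """Eq. (2): translating A -> B -> A (and B -> A -> B) must recover the input."""
    return lam_a * l1(G_B(G_A(a)), a) + lam_b * l1(G_A(G_B(b)), b)

def identity_loss(G_A, G_B, a, b, lam_idt=5.0):
    """Eq. (3): a generator fed an image of its target domain should not change it."""
    return lam_idt * (l1(G_B(a), a) + l1(G_A(b), b))
```

With perfect (identity) generators both losses vanish, which is exactly the fixed point the training objective rewards.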
In style transfer tasks, every pixel of the input image might change according
to the target domain’s style. However, we wanted to enforce the generator to only
change covered areas, preserving original facial features. We introduced a new
regularization term (Eq. 4) to the optimization problem, penalizing the generator
for significant changes and reducing the reconstruction error between input and
output images:
Due to the GAN loss, generators must still remove surface electrodes to
deceive the discriminator. We introduced hyperparameters λ_MC,A and λ_MC,B to
weight the regularization term, enabling control over each domain's regulariza-
tion amount. Notably, this regularization allows for simultaneous training on
multiple individuals. Since surface electrodes are positioned consistently across
individuals, the model learns to preserve uncovered face areas. Consequently, the
model is not limited to a single individual and can be applied to unseen indi-
viduals. The final optimization problem (Eq. 5) comprises the GAN loss, cycle
consistency loss, identity loss, and minimal change loss:
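The bodies of Eqs. (4) and (5) are not reproduced in the text above. The sketch below shows one plausible form of the minimal-change term, an L1 penalty between generator input and output, and the additive combination of all loss terms; both functions are reconstructions under stated assumptions, not the authors' exact formulation.

```python
import numpy as np

def minimal_change_loss(x, g_x, lam_mc=1.0):
    """Plausible minimal-change regularizer (the exact Eq. (4) is not shown):
    an L1 penalty between generator input x and output g_x, so the generator
    only alters the pixels it must, i.e., the electrode-covered regions."""
    return lam_mc * np.abs(g_x - x).mean()

def total_objective(loss_gan, loss_cycle, loss_idt, loss_mc_a, loss_mc_b):
    """Additive combination in the spirit of Eq. (5): GAN, cycle-consistency,
    identity and per-domain minimal-change terms (weights folded into each)."""
    return loss_gan + loss_cycle + loss_idt + loss_mc_a + loss_mc_b
```

An unchanged output incurs zero minimal-change penalty, so only electrode removal, driven by the GAN loss, justifies modifying pixels.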
4 Dataset
Electrodes | Per-emotion accuracy (six basic emotions) | Mean Acc. | SSIM | LPIPS | FID
Without | 61.6% 72.5% 28.1% 90.8% 47.1% 85.2% | 64±10% | 0.63±0.08 | 0.10±0.04 | 0.50±0.74
Attached | 88.0% 0.3% 20.2% 21.9% 3.8% 68.8% | 34±10% | 0.38±0.05 | 0.25±0.02 | 10.46±2.10
Removed | 66.2% 49.0% 11.9% 72.1% 42.5% 60.8% | 54±16% | 0.66±0.09 | 0.08±0.04 | 0.35±0.48
Fig. 2. Time series excerpt of emotion activations by ResMaskNet [20]: Dashed lines
represent predicted emotions for uncovered faces, while solid lines represent predictions
for covered faces. ResMaskNet struggles to correctly classify emotions for covered faces
and predominantly activates for anger or sadness. A neutral expression is never predicted
for covered faces. (Best viewed digitally.)
The restoration quality increased with both larger amounts of training data and
larger backbone networks. Fine details, highlighted in Fig. 3, are better preserved
with larger backbone networks and might not be measurable with these metrics.
Considering the tradeoff between visual quality and training time, we opted for
generators with three residual blocks and a feature dimension of 64. All further
experiments utilize this model configuration.
Features | Depth | Emo. Max | Emo. Acc. (↑) | SSIM (↑) | LPIPS (↓) | FID (↓)
The proper restoration of faces was already shown by Büchner et al. [3], but their
model might hallucinate features, see Fig. 3. Thus, we investigated whether the
minimal change regularization improves the results and whether it reduces the
hallucination of features. We trained the CycleGAN architecture
Fig. 3. Quality difference for backbone network sizes: The third column shows the
base model without minimal change regularization, leading to hallucinated features. In
contrast, the remaining columns incorporate minimal change regularization and tuned
hyperparameters, showcasing improved restoration. (Best viewed digitally.)
The CycleGAN architecture has the problem of not being able to generalize to
unseen individuals, as shown by Büchner et al. [3]. Thus, we investigate the
generalization to unseen individuals.
Table 3. Emotion classification accuracy and FID score increase with stronger regu-
larization: SSIM and LPIPS did not change significantly.
λ_MC | Emo. Max | Emo. Acc. (↑) | SSIM (↑) | LPIPS (↓) | FID (↓)
Trained On | Emo. Max | Emo. Acc. (↑) | SSIM (↑) | LPIPS (↓) | FID (↓)
False | 0.83 | 0.53±0.15 | 0.60±0.06 | 0.12±0.03 | 0.53±0.84
True | 0.88 | 0.55±0.18 | 0.61±0.09 | 0.10±0.04 | 0.33±0.29
Fig. 4. Qualitative results of the generalization capabilities using the minimal change
regularization: The model removes surface electrodes for individuals outside the train-
ing set. Important facial features are unchanged but the head shape is altered slightly.
Uncovered expressions are shown in the lower right corner.
6 Conclusion
Our study demonstrated that CycleGAN with minimal change regularization
effectively restores individuals’ faces with EMG surface electrodes, improving
emotion classification accuracy. This regularization does not impede the model’s
ability to remove electrodes, and eliminates CycleGAN limitations like hallu-
cinations, making the restored faces suitable for emotion classification tasks.
Our approach enables direct utilization of existing methods without fine-tuning
on our data. Additionally, we demonstrated its ability to generalize to unseen
individuals and improve the visual quality of the restored faces without hallu-
cinations. We showed that sEMG-based measurements can now be used jointly
with computer vision-based techniques to enhance the comprehension of facial
anatomy. The data-driven approach seamlessly integrates into existing
pipelines, enabling real-time restoration of individuals’ faces with surface elec-
trodes. Thus, it can be used in applications where electrodes are used for facial
muscle stimulation, such as physical therapy [2,14,19].
References
1. Abramian, D., Eklund, A.: Refacing: reconstructing anonymized facial features
using GANs. In: 2019 IEEE 16th International Symposium on Biomedical Imaging
(ISBI 2019), pp. 1104–1108 (2019). https://doi.org/10.1109/ISBI.2019.8759515
2. Arnold, D.: Selective surface electrostimulation of the denervated zygomaticus mus-
cle. Diagnostics 11(2), 188 (2021). https://doi.org/10.3390/diagnostics11020188
3. Büchner, T., Sickert, S., Volk, G.F., Anders, C., Guntinas-Lichius, O., Denzler,
J.: Let’s get the FACS straight - reconstructing obstructed facial features. In:
International Conference on Computer Vision Theory and Applications (VISAPP),
pp. 727–736. SciTePress (2023). https://doi.org/10.5220/0011619900003417
4. Ekman, P., Friesen, W.V.: Facial Action Coding System. Consulting Psychologists
Press (1978)
5. Fridlund, A.J., Cacioppo, J.T.: Guidelines for human electromyographic research.
Psychophysiology 23(5), 567–589 (1986). https://doi.org/10.1111/j.1469-8986.
1986.tb00676.x
6. Goodfellow, I.J., et al.: Generative adversarial networks. In: Advances in Neu-
ral Information Processing Systems, vol. 27 (2014). https://doi.org/10.48550/
arXiv.1406.2661
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 770–778 (2016). https://doi.org/10.48550/arXiv.1512.03385
8. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs
trained by a two time-scale update rule converge to a local Nash equilibrium.
In: Advances in Neural Information Processing Systems, vol. 30 (2017). https://
doi.org/10.48550/arXiv.1706.08500
9. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1125–1134 (2017). https://doi.org/10.48550/
arXiv.1611.07004
10. Kammoun, A., Slama, R., Tabia, H., Ouni, T., Abid, M.: Generative adversarial
networks for face generation: a survey. ACM Comput. Surv. 55(5), 1–37 (2023)
11. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of StyleGAN. In: 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 8107–8116 (2020). https://
doi.org/10.1109/CVPR42600.2020.00813
28. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation
using cycle-consistent adversarial networks. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 2223–2232 (2017). https://
doi.org/10.48550/arXiv.1703.10593
Quality Assessment for High Dynamic Range
Stereoscopic Omnidirectional Image System
Liuyan Cao1 , Hao Jiang1 , Zhidi Jiang2 , Jihao You1 , Mei Yu1 , and Gangyi Jiang1(B)
1 Faculty of Information Science and Engineering, Ningbo University, Ningbo 315211, China
jianggangyi@126.com
2 College of Science and Technology, Ningbo University, Ningbo 315300, China
Abstract. This paper focuses on visual experience of high dynamic range (HDR)
stereoscopic omnidirectional image (HSOI) system, which includes HSOI
generation, encoding/decoding, tone mapping (TM) and terminal visualization.
From the perspective of quantifying coding distortion and TM distortion in HSOI
system, a “no-reference (NR) plus reduced-reference (RR)” HSOI quality assess-
ment method is proposed by combining Retinex theory and two-layer distortion
simulation of HSOI system. The NR module quantizes coding distortion for HDR
images only with coding distortion. The RR module mainly measures the effect of
TM operator based on the HDR image only with coding distortion and the mixed
distorted image after TM. Experimental results show that the objective prediction
of the proposed method is better compared some representative method and more
consistent with users’ visual perception.
1 Introduction
Omnidirectional images can provide users with a sense of immersion and interaction
within a 360° × 180° field of view (FOV) [1–3]. Due to the large FOV, the luminance
of the scene may be very inconsistent. High dynamic range (HDR) stereoscopic omnidi-
rectional image (HSOI) system can record real scene information [2], and it consists of
HSOI generation, coding/decoding, tone mapping (TM), and visualization with head-
mounted display (HMD), as shown in Fig. 1. The generated HSOI is compressed with
the JPEG XT standard and transmitted to the client. Then, TM operator (TMO) is used to
compress the dynamic range of the decoded HSOI so as to adapt to HMD to display the
image. Since these processes will introduce distortion and result in a decrease in HSOI
quality, how to establish a reliable objective prediction model to monitor the quality of
HSOIs is an important issue to be studied.
General image quality assessment (IQA) methods can be divided into the 2D-IQA
and 3D-IQA. For 2D-IQA, some representative 2D-IQA metrics were proposed, such as
IL-NIQE [4], GWH-GLBP [5], OG [6], BRISQUE [7], SISBLIM [8], dipIQ [9], BMPRI
[10], and so on. For 3D-IQA, Liu et al. [11] designed a S3D INtegrated Quality (SINQ)
Predictor, Shen et al. [12] used deep learning to explore high-level features to characterize
complex binocular effects. For stereoscopic omnidirectional IQA (SOIQA), Zhou et al.
[3] combined invariant features, visual salience and position priori of projection, and
extended the framework to other projection formats. Qi et al. [13] presented a viewport
perception based blind SOIQA method (VP-BSOIQA). For TM-IQA, Gu et al. [14]
designed the blind tone mapping quality index (BTMQI) by adopting the methods of
naturalness interference and structure preservation. Jiang et al. [15] designed a blind tone
mapping image IQA (BTMIQA) method with the help of global and aesthetic features.
In HSOI system, an HSOI at the server is coded by JPEG XT and transmitted to
the client, where it is decoded and processed with TMO into a distorted HSOI for
visualization on the standard dynamic range HMD, so that the coding distortion and TM
distortion will be produced [2]. The process of TM may enhance or mask the coding
distortion, hence the actual visual content presented to users in subjective assessment
is the result of the joint effect of these two types of distortion. Retinex theory points
out that the color of an object observed by human eye is determined by the color of the
object itself and the surrounding lighting environment [16]. The color of the light source
illuminated on the surface of the object is the final perceived visual signal, and these two
factors can be separated.
In this paper, combining Retinex theory and the processing flow of the HSOI system,
a quality assessment method for HSOI system (denoted as QA-HSOI) is proposed to
quantify the distortions in the processing process of the system. No-reference (NR) and
reduced-reference (RR) modules are established to measure coding distortion and TM
distortion, respectively. Additionally, the idea of intrinsic image decomposition based
on Retinex theory is also adopted in preprocessing of the NR module.
Suppose that the size of I_h^L is W × H, where W and H represent the width and
height, respectively. After the above processing, the Tchebichef score matrix M_c is
obtained with the size of floor(W/8) × floor(H/8), where floor(·) denotes rounding down. Texture
masking effect indicates that complex texture regions have a stronger ability to hide
distortion compared to flat regions. Here, the just noticeable difference (JND)
coefficient of local image blocks is calculated by the JND model in the pixel domain.
For an input 8 × 8 image block P, its JND coefficient is expressed as the sum, over
all pixels, of the maximum absolute response after gradient filtering in four directions
(0°, 45°, 90° and 135°). The filter coefficients are taken from [19]. Let g_o(i, j) be the
gradient filter for direction o. Then
G_o(x, y) = (1/16) Σ_{i=1}^{5} Σ_{j=1}^{5} P(x − 3 + i, y − 3 + j) · g_o(i, j), where o = 1, 2, 3, 4.
Then, JND = Σ_{(x,y)} max_{o=1,2,3,4} |G_o(x, y)|.
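The JND computation above can be sketched as follows. The actual 5 × 5 filter coefficients come from the model in [19] and are not reproduced here, so simple zero-sum directional kernels are used as stand-ins.

```python
import numpy as np
from scipy.ndimage import convolve

def jnd_coefficient(block):
    """JND coefficient of an image block: the sum over pixels of the maximum
    absolute response among four 5x5 directional gradient filters.

    The kernels below are simple zero-sum stand-ins; the paper takes the
    actual coefficients from the JND model in [19].
    """
    k0 = np.zeros((5, 5))
    k0[:, :2], k0[:, 3:] = -1.0, 1.0           # 0 degrees (horizontal gradient)
    k0 /= 16.0                                  # 1/16 normalization as in the text
    k90 = k0.T                                  # 90 degrees (vertical gradient)
    k45 = np.zeros((5, 5))
    k45[np.triu_indices(5, 1)] = 1.0 / 16.0     # 45 degrees (diagonal gradient)
    k45[np.tril_indices(5, -1)] = -1.0 / 16.0
    k135 = np.fliplr(k45)                       # 135 degrees (anti-diagonal)
    responses = [np.abs(convolve(block, k)) for k in (k0, k90, k45, k135)]
    return float(np.max(responses, axis=0).sum())
```

A flat block yields a JND of zero (the zero-sum kernels respond only to gradients), while edges and texture raise the coefficient, consistent with the texture masking effect.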
The Tchebichef score matrix M_c and the JND coefficient matrix M_j of {ζ_i,f, ζ_r,f} and
{ζ_i,c, ζ_r,c} are obtained after the above steps. The JND coefficient is further used as the
distortion sensitivity index to fuse the flat region and the complex texture region. Taking
ζ_i as an example, the blocking artifact matrix M_i after regional fusion is expressed as
M_i = Σ_{k=f,c} M_c^{ζ_{i,k}} · exp(−(M_j^{ζ_{i,k}})² / (2 × 0.3²))   (2)
P_d(d) = d^{α_d − 1} (1 − d)^{β_d − 1} / Beta(α_d, β_d)   (4)
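Eq. (4) is the standard Beta density, which can be checked directly against SciPy (the function name here is our own):

```python
import numpy as np
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist

def p_d(d, alpha_d, beta_d):
    """Eq. (4): Beta density of the parameter d."""
    return d ** (alpha_d - 1) * (1 - d) ** (beta_d - 1) / beta_fn(alpha_d, beta_d)
```

For any valid shape parameters, this manual form agrees with `scipy.stats.beta.pdf`.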
Quality Assessment 279
Let P_ch/P_cl denote the PCMF map of C_h/C_l; the global PCMF similarity f_g is defined
as the sum of the similarity matrix M_fc obtained as follows:
M_fc = (2 · P_ch · P_cl + τ) / (P_ch² + P_cl² + τ)   (9)
We also define a first-n% saliency bit feature f_s as an auxiliary feature of f_g. Specif-
ically, the pixel values of the saliency map of the corresponding viewport image are
arranged in descending order, and the pixel coordinates {a, b} corresponding to the first
n% (n = 10, 30, 50, 70, 90) saliency bits are selected. The mean of the values of M_fc
at {a, b} is taken as the feature of the first n% saliency bits. f_g and f_s are calculated
on four scales. The final fusion channel features are denoted as F5 = {f_g^fc, f_s^fc}.
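The first-n% saliency bit feature can be sketched as follows (function and argument names are our own):

```python
import numpy as np

def saliency_bit_feature(m_fc, saliency, n_percent):
    """Auxiliary feature f_s: mean of the similarity map M_fc over the pixel
    coordinates holding the top n% of saliency values."""
    sal = saliency.ravel()
    sim = m_fc.ravel()
    k = max(1, int(round(sal.size * n_percent / 100.0)))
    top_idx = np.argsort(sal)[::-1][:k]   # coordinates of the first n% bits
    return float(sim[top_idx].mean())
```

With n = 100 this degenerates to the global mean of M_fc; smaller n focuses the feature on the most salient viewport pixels.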
For the competition channel, the absolute difference image (ADI) is used to character-
ize the differences between the two views. Taking V_L^{l,m} and V_R^{l,m} as an example,
the ADI is defined as A_{l,m} = |V_L^{l,m} − V_R^{l,m}|. Similarly, the HDR version of the
ADI is represented as A_{h,m}. The ADI reflects two kinds of information. One is
approximate contour information; usually, more contour stimulation will dominate the
binocular competition. The other is content differences between the two views. Here,
5-tap derivative (T5) maps are used to describe this special structure, where 5-tap refers
to the five parameters set when performing derivative filtering. Different derivative maps
have different parameters [21]. Let x and y be the horizontal and vertical directions, and
ψ_x, ψ_y, ψ_xx, ψ_yy and ψ_xy be the five derivative maps. The former two are first-order
derivatives, and the latter three are second-order derivatives; the five maps complement
each other. Let ψ_{υ,h}/ψ_{υ,l} (υ = x, y, xx, yy, xy) denote the T5 maps of A_{h,m}/A_{l,m}.
Then the global similarity f_g^cc is defined as the sum of the similarity matrix M_cc
obtained as follows:
M_cc = (2 · ψ_{υ,h} · ψ_{υ,l} + τ) / (ψ_{υ,h}² + ψ_{υ,l}² + τ), (υ = x, y, xx, yy, xy)   (10)
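Eqs. (9) and (10) share the same normalized similarity form, which can be written once in numpy (the value of τ used below is an assumption for illustration):

```python
import numpy as np

def similarity_map(p, q, tau=1e-3):
    """Normalized similarity shared by Eqs. (9) and (10): values near 1 where
    the two feature maps agree, smaller where they differ. tau stabilizes the
    ratio near zero."""
    return (2.0 * p * q + tau) / (p ** 2 + q ** 2 + tau)
```

When the two maps are identical the ratio is exactly 1 at every pixel, since 2p² + τ appears in both numerator and denominator.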
Three classical indicators are used to measure the accuracy of regression tasks: Pear-
son linear correlation coefficient (PLCC), Spearman rank order correlation coefficient
(SROCC) and root mean squared error (RMSE).
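The three indicators can be computed with SciPy; a sketch with illustrative variable names:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def regression_metrics(predicted, subjective):
    """PLCC, SROCC and RMSE between objective predictions and subjective scores."""
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    plcc = pearsonr(predicted, subjective)[0]     # linear correlation
    srocc = spearmanr(predicted, subjective)[0]   # rank-order correlation
    rmse = float(np.sqrt(np.mean((predicted - subjective) ** 2)))
    return plcc, srocc, rmse
```

Predictions that are a perfect linear rescaling of the subjective scores give PLCC = SROCC = 1 while RMSE still reflects the absolute offset.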
Fig. 3. The SHOID source sequences (left images). (a) - (j) Scene 1 - Scene 10.
The proposed method contains six perceptual feature sets F = {F1, F2, F3, F4, F5, F6},
which are as follows: blocking artifact feature F1 = {f_i^L, f_i^R, f_r^L, f_r^R}, naturalness
feature F2 = {f_n^L, f_n^R}, color fidelity feature F3 = {f_c^L, f_c^R}, perceptual hashing
feature F4 = {f_p^L, f_p^R}, fusion feature F5 = {f_g^fc, f_s^fc} and competition feature
F6 = {f_g^cc, f_ed^cc}.
The RF model is used to train each feature set individually and then reports its perfor-
mance. Table 2 shows the results of ablation experiment, and some observations can be
obtained as follows. (1) Among the six single feature sets, F1, F2, F3 and F4 perform
relatively well. Among them, F1, F2 and F3 are evaluated in the ERP format, while
F4 is evaluated on viewports. Compared with F5 and F6, they can be regarded as
global information, which indicates that preserving the integrity of global information
plays a great role in improving performance. The performance of the perceptual hashing
feature F4 is the best, which demonstrates the effectiveness of multi-channel and
multi-band feature extraction to a certain extent. (2) For NRhdr, the performance of
F1 + F2 is relatively stable. Although it is not significantly improved compared to
their individual performance, we still retain both features, because F1 and F2 consider
characteristics from different perspectives, which benefits the overall performance.
(3) For RRhdr,ldr, F3 + F4
is extracted by single-channel module, and F5 + F6 is extracted with double-channel
module. Although the performance of the former is better, the double-channel features
improve PLCC by 0.0261 over the single-channel features, which is a considerable gain,
indicating that it is necessary and effective to consider the stereoscopic effect. (4) For
PLCC, NRhdr reaches 0.8037, RRhdr,ldr reaches 0.8654, and the overall model reaches
0.9144. HDR images and LDR images can be comprehensively utilized and can assist
each other to further improve the accuracy of the objective model.
284 L. Cao et al.
To verify the treatment of binocular characteristics in this paper, Table 3 shows comparison results for 3D-IQA, TM-IQA, SOIQA and the proposed method. The SHOID dataset is divided into symmetric and asymmetric distortions to train the RF models, respectively, and Table 3 lists the performance indexes of these methods. The performance of all methods on symmetric distortion is better than on asymmetric distortion, with overall performance lying between the two. This shows that asymmetric distortion should be considered in the HSOI system: the better the performance on asymmetric distortion, the better the overall performance.
Finally, some discussion follows. The HSOI system involves various perceptual characteristics: omnidirectional viewing, stereoscopy, and HDR. This paper proposes
a new quality assessment method combining NRhdr and RRhdr,ldr . The former quantifies
coding distortion and the latter measures the effect of TMO. The SHOID dataset is used
to verify the performance of the proposed method. However, in some challenging situations, the work of this paper leaves room for improvement. The first concerns omnidirectional characteristics: the viewport sampling method in this paper is based on the viewing habits of users, but such a method struggles to accurately simulate user behavior in the actual viewing process. In the future, it will be necessary to combine HMD interaction sensors to develop a more effective viewport sampling method. The second concerns binocular perception. How to simulate the binocular effect is
always the focus of SIQA research. In general, research on HSOI systems is still in the exploratory stage; such systems involve a variety of perceptual characteristics, which pose challenges for future research work.
4 Conclusions
A high dynamic range stereoscopic omnidirectional image (HSOI) system contains coding distortion and tone mapping (TM) distortion, and monitoring the visual experience of the HSOI system is an important issue. From the perspective of quantifying coding distortion and TM distortion in the HSOI generation process, a quality assessment method of
“no-reference (NR) plus reduced-reference (RR)” has been proposed. The NR module decomposes the HSOI containing only coding distortion, and the corresponding features are extracted according to the distortion characteristics of the decomposed image. The NR module quantifies coding distortion, while the RR module measures the TM distortion introduced by the TM operator. The RR module takes the HSOI with only coding distortion as the reference to evaluate the HSOI after TM and, considering the stereoscopic characteristics of the HSOI, extracts features with single-channel and double-channel modules. Finally, the
feasibility of the proposed method is verified. Experimental results show that the pro-
posed method is superior to the existing state-of-the-art methods and can be used as an
effective evaluator for HSOI systems.
Acknowledgements. This work was supported in part by the National Natural Science Founda-
tion of China under Grant Nos. 61871247, 62071266 and 61931022, and Science and Technology
Innovation 2025 Major Project of Ningbo (2022Z076).
References
1. Liu, Y., Yin, X., Wang, Y., Yin, Z., Zheng, Z.: HVS-based perception-driven no-reference
omnidirectional image quality assessment. IEEE Trans. Instrum. Meas. 72, art. no. 5003111
(2023)
2. Cao, L., You, J., Song, Y., Xu, H., Jiang, Z., Jiang, G.: Client-oriented blind quality metric for high dynamic range stereoscopic omnidirectional vision systems. Sensors 22, art. no. 8513 (2022)
3. Zhou, X., Zhang, Y., Li, N., Wang, X., Zhou, Y., Ho, Y.-S.: Projection invariant feature and
visual saliency-based stereoscopic omnidirectional image quality assessment. IEEE Trans.
Broadcast. 67(2), 512–523 (2021)
4. Zhang, L., Zhang, L., Bovik, A.C.: A feature-enriched completely blind image quality
evaluator. IEEE Trans. Image Process. 24(8), 2579–2591 (2015)
5. Li, Q., Lin, W., Fang, Y.: No-reference quality assessment for multiply-distorted images in
gradient domain. IEEE Sig. Process. Lett. 23(4), 541–545 (2016)
6. Liu, L., Hua, Y., Zhao, Q., Huang, H., Bovik, A.C.: Blind image quality assessment by relative
gradient statistics and adaboosting neural network. Sig. Process. Image Commun. 40, 1–15
(2016)
7. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial
domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012)
8. Gu, K., Zhai, G., Yang, X., Zhang, W.: Hybrid no-reference quality metric for singly and
multiply distorted images. IEEE Trans. Broadcast. 60(3), 555–567 (2014)
9. Ma, K., Liu, W., Liu, T., Wang, Z., Tao, D.: DipIQ: blind image quality assessment by
learning-to-rank discriminable image pairs. IEEE Trans. Image Process. 26(8), 3951–3964
(2017)
10. Min, X., Zhai, G., Gu, K., Liu, Y., Yang, X.: Blind image quality estimation via distortion
aggravation. IEEE Trans. Broadcast. 64(2), 508–517 (2018)
11. Liu, L., Liu, B., Su, C., Huang, H., Bovik, A.C.: Binocular spatial activity and reverse saliency
driven no-reference stereopair quality assessment. Sig. Process. Image Commun. 58, 287–299
(2017)
12. Shen, L., Chen, X., Pan, Z., Fan, K., Li, F., Lei, J.: No-reference stereoscopic image quality
assessment based on global and local content characteristics. Neurocomputing 424, 132–142
(2021)
13. Qi, Y., Jiang, G., Yu, M., Zhang, Y., Ho, Y.-S.: Viewport perception based blind stereoscopic
omnidirectional image quality assessment. IEEE Trans. Circ. Syst. Video Technol. 31(10),
3926–3941 (2021)
14. Gu, K., et al.: Blind quality assessment of tone-mapped images via analysis of information,
naturalness, and structure. IEEE Trans. Multimedia 18(3), 432–443 (2016)
15. Jiang, G., Song, H., Yu, M., Song, Y., Peng, Z.: Blind tone-mapped image quality assessment
based on brightest/darkest regions, naturalness and aesthetics. IEEE Access 6, 2231–2240
(2018)
16. Xu, J., et al.: STAR: A structure and texture aware Retinex model. IEEE Trans. Image Process.
29, 5022–5037 (2020)
17. Chi, B., Yu, M., Jiang, G., He, Z., Peng, Z., Chen, F.: Blind tone mapped image quality assessment with image segmentation and visual perception. J. Vis. Commun. Image Represent. 67, art. no. 102752 (2020)
18. Li, L., Zhu, H., Yang, G., Qian, J.: Referenceless measure of blocking artifacts by Tchebichef
kernel analysis. IEEE Sig. Process. Lett. 21, 122–125 (2014)
19. Yang, X., Ling, W., Lu, Z., Ong, E., Yao, S.: Just noticeable distortion model and its
applications in video coding. Sig. Process. Image Commun. 20(7), 662–680 (2005)
20. Wang, L., Zhang, C., Liu, Z., Sun, B.: Image feature detection based on phase congruency by
Monogenic filters. In: Proceedings of the Chinese Control and Decision Conference, pp. 2033–
2038 (2014)
21. Farid, H., Simoncelli, E.P.: Differentiation of discrete multidimensional signals. IEEE Trans.
Image Process. 13(4), 496–508 (2004)
Genetic Programming with Convolutional
Operators for Albatross Nest Detection
from Satellite Imaging
1 Introduction
Advances in remote sensing have had a tremendous impact on conservation [4].
Through very high-resolution (VHR) satellite imaging, it is possible to obtain
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 287–298, 2023.
https://doi.org/10.1007/978-3-031-45382-3_24
288 M. Rogers et al.
daily data for remote regions, thereby allowing year-round monitoring of inac-
cessible areas [9]. Previous remote sensing research has led to many impressive
applications in machine learning and computer vision, such as detecting boats
[11] and counting elephant herds [5]. As the spatial resolution of satellite imag-
ing has improved to sub-40 cm resolution, the range of possible applications has
expanded to enable monitoring of birds in remote regions, such as the albatross
[3,6,8].
This study aims to detect and count southern royal albatross nests on a ridge
of remote Campbell Island, south of New Zealand. The sub-Antarctic Campbell
Island group lies 700 km south of New Zealand’s South Island. The subject of this
study, the southern royal albatross, has one of the largest wingspans of any bird, but the
population is declining at an alarming rate [15]. Currently, manual species count-
ing methods are laborious, infrequent, and expensive. Traversing the entire island
on foot requires a full-day hike, and manual counting of nests requires weeks to
months of work. An alternative is to assess the area via a helicopter, which
disturbs wildlife and is expensive, with various logistical and health and safety
hurdles. Owing to Campbell Island’s remoteness, transporting conservationists
and helicopters takes weeks, meaning it is not currently possible to regularly
count the population.
Over the last 15 years, several advances in image classification and segmen-
tation have been based on convolutional neural networks (CNNs) [14], which
apply a series of local convolutions to images to extract increasingly higher-level
features. These features range from simple features, such as edges or corners, to
more complex features, such as cars or faces. Over this period, researchers have
created increasingly complex models by adding new layer types and tweaking
loss functions to incrementally increase the performance [10]. State-of-the-art
deep learning methods often require large datasets and extensive expert knowl-
edge to implement and tweak for maximum performance in new applications.
These models tend to extract redundant or overly complex features [13] and are
difficult to interpret. As such, they are often referred to as black boxes. Genetic
Programming (GP) offers an explainable, simple, and self-configuring alternative
for designing these architectures [19] and can learn generalisable models from a
small number of training samples [2].
Simple, explainable approaches that do not require extensive AI expert
knowledge, large-scale datasets, or expensive GPUs are crucial for wildlife conser-
vation efforts. Our paper improves on previous albatross-counting deep learning
methods [3] to create an interpretable computer vision model to count potential
southern royal albatross (Diomedea epomophora) nests year-round from satellite
images.
This paper aims to develop a new GP-based approach with convolutional
filters using an explainable and flexible tree-based program structure for binary
image segmentation. The contributions of this study can be summarised as fol-
lows:
2 Proposed Approach
This section describes the proposed convolutional-GP approach for image seg-
mentation, including the algorithm, program structure, fitness function, and
function set.
Based on the described program structure, four types of functions constitute the
function set. The functions and their input types are listed in Table 1.
The convolve functions take either an input channel of the image or a previ-
ous convolution map along with an evolved filter and generate a new convolution
map. The filter is applied to the image as a convolution.
The activation functions transform the convolution map values. The min-
max normalisation rescales the values between 0 and 255. This function can
be applied to a convolution map or an input channel because normalisation is
a common input preprocessing technique for deep learning. The other activa-
tion functions in this function set were ReLU, sqrt, and absolute value. The
ReLU function transforms the image by applying a rectified linear unit function, returning the pixel value if it is positive and zero otherwise. The square root function is protected to return zero if the value is negative.
The threshold and threshold inverse operators convert a convolved image
into a binary image using a given threshold value (τ ). Logical operators OR,
AND, XOR, and NOT are applied to the binary images and can be chained to
combine multiple thresholded feature maps. A binary image may also be eroded
or dilated to decrease or increase the size of nest components masked by the
image.
The convolution and threshold functions are required by each individual; they are the only two functions that change the input type of an image into a different output type. All other functions are optional, and the evolutionary process automatically determines their suitability.
The values of the convolutional filters, thresholds, and kernel sizes are selected
randomly from a distribution of values. The convolutional filters have sizes of
3 × 3, 5 × 5, or 7 × 7 with values randomly selected from −5 to 5. The threshold
values are randomly sampled from values between −150 and 150, as the input
feature maps may be positive or negative floating-point values. Finally, the kernel
sizes were either 3 × 3 or 5 × 5 with a square structuring element.
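To make the function set concrete, the sketch below hand-codes one hypothetical individual of the kind this grammar can express: convolve, ReLU, threshold, then dilate. The 3 × 3 filter values, the threshold τ, and the structuring element are illustrative choices, not evolved ones:

```python
import numpy as np
from scipy.ndimage import convolve, binary_dilation

def relu(x):
    """Activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0)

def example_individual(channel):
    """One hypothetical evolved tree: convolve -> ReLU -> threshold -> dilate."""
    filt = np.array([[-1, -1, -1],
                     [-1,  8, -1],
                     [-1, -1, -1]], dtype=float)  # values drawn from [-5, 5]
    conv_map = convolve(channel.astype(float), filt)
    conv_map = relu(conv_map)
    mask = conv_map > 120.0  # threshold tau sampled from [-150, 150]
    # Dilation grows the detected nest components by one pixel
    return binary_dilation(mask, structure=np.ones((3, 3)))

# Tiny synthetic "channel" with one bright nest-like pixel
img = np.zeros((9, 9))
img[4, 4] = 200
mask = example_individual(img)
```

Running this on the synthetic image yields a single 3 × 3 blob around the bright pixel, i.e. a thresholded convolution response grown by dilation.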
Fβ = ((1 + β²)·TP) / ((1 + β²)·TP + β²·FN + FP)    (1)
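A direct reading of Eq. (1) in code, assuming boolean prediction and ground-truth masks (β > 1 weights recall over precision):

```python
import numpy as np

def f_beta(pred, truth, beta=1.0):
    """F-beta score of Eq. (1); pred and truth are boolean segmentation masks."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    num = (1 + beta ** 2) * tp
    den = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return num / den if den else 0.0

pred = np.array([1, 1, 0, 0], dtype=bool)
truth = np.array([1, 0, 1, 0], dtype=bool)
score = f_beta(pred, truth, beta=1.0)  # TP=1, FP=1, FN=1 -> F1 = 0.5
```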
3 Experimental Design
In this study, we used satellite data from the Faye Bump, north of Mount Faye, on Campbell Island, south of New Zealand. This island is home to the majority (>99%) of the southern royal albatross population [17]. An RGB satellite image (6,668 × 3,335 pixels) of the region of interest, taken on 4 June 2021 by the Maxar WorldView-3 satellite (DigitalGlobe, Inc., USA), was used for training in this study. Each pixel has a spatial resolution of approximately 31 cm. The annotated locations are shown in Fig. 3b.
Fig. 3. Visual examples of the possible nest points and their locations along the Faye
Bump, north of Mount Faye, Campbell Island.
4 Experimental Results
In this section, the experimental results of the GP-based method are compared
with those of the nnU-Net model, and the evolved operators and structures of
GP individuals are examined.
Table 2. Overall test set performance of the GP-based method compared with the
average of the individual nnU-Net models and the ensemble of nnU-Net models.
Among the compared models, the nnU-Net models achieved the highest performance, correctly identifying an average of 128.4 nest-like points out of 166. There was minimal variation
among the five models from cross-validation, and the configured ensemble of the
two best models performed slightly worse in terms of per-pixel metrics on the
test set.
The best GP individuals had extremely low false-positive rates, with some
having per-pixel precision of 1.0. However, recall rates were significantly lower
than those of the nnU-Net models, and they could only detect 45% of the total
number of nests on average. Given the shallow nature of the GP-based method,
a single function tree with a constrained depth cannot come close to the deep
feature extraction ability of nnU-Net.
The ensemble of the best GP trees, where each pixel was labelled as a nest if at
least one individual classified it as a nest, achieved a detection rate comparable to
the nnU-Net models. The combination of the ten best GP individuals predicted
a similar number of false nests as the nnU-Net model and correctly identified
123. These results indicate that a shallow model that extracts a broad range of
features may be better suited for segmenting small objects than deep features
from large CNN models.
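The union-style ensemble described above amounts to a logical OR across the binary masks of the individuals; a minimal sketch with dummy masks (not real GP outputs):

```python
import numpy as np

def or_ensemble(masks):
    """Label a pixel as nest if at least one individual predicts nest."""
    return np.any(np.stack(masks), axis=0)

# Two dummy 2x2 binary predictions from different individuals
m1 = np.array([[True, False], [False, False]])
m2 = np.array([[False, False], [False, True]])
combined = or_ensemble([m1, m2])
```

This OR rule maximizes detections at the cost of precision, which matches the behaviour discussed above.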
The nnU-Net models had a lower per-pixel recall than the GP method ensem-
ble but a higher number of correct nests because the GP method’s fitness function
did not penalise false positives surrounding correct nest pixels as the nnU-Net
loss function did. Many evolved function trees would predict abnormally large
nest objects by using functions such as dilation. Further investigation found
the GP-ensemble method labelled 42.58% of the pixels surrounding the labelled
ground truth values as nests, compared with an average for each individual of
11.94% and only 0.27% of pixels from the nnU-Net ensemble.
4.2 GP Individuals
This subsection analyses the top individual obtained in each GP run. Figure 4
presents an example of one of the best individuals. The most commonly selected image channel across the best individuals was blue (18 times), followed by red (ten times), with green (three times) the least used. The green channel is
likely the least informative because of the large amount of green vegetation on the
island. Only three out of ten individual trees contained function branches apply-
ing more than one convolutional operation, indicating that the extracted features
were typically responses from a single convolution filter. The nnU-Net architec-
ture, on the other hand, applied 26 convolutional blocks, each with between 32
and 512 convolutional filters, meaning that the GP individuals used <1% the
number of convolutional filters of the nnU-Net model. Typically, the filters were 3 × 3: these occurred 28 times in the convolutions, compared with the 5 × 5 and 7 × 7 filters, used four and five times, respectively. As the nest objects were typically
only a few pixels wide, the evolved filters did not need to be very large. Only
two out of the 31 feature extraction branches utilised the min-max normalisation
function. The min-max normalisation function may yield inconsistent results for
small images, where the results of the convolution operations may have varying
pixel intensity distributions between images.
Fig. 4. An example of one of the best individuals from the ten individual runs. (Color
figure online)
Eight individuals included dilation operations, increasing the size of the nest
components, and two had erosion operations. Due to these operations, the GP
individuals often labelled pixels directly surrounding the labelled nest as nests.
Some GP individuals contain structurally ineffective code sequences, such as the
absolute value followed by the ReLU function in the above example. These are
commonly referred to as introns and can be manually pruned from the resulting
tree; otherwise, they have no effect on model performance. Natural selection
during the evolutionary process removes functionally detrimental code sequences.
5 Conclusion
Acknowledgments. We thank Kāi Tahu and Kaitiaki Rōpū ki Murihiku for allowing
us to work on their taonga. We also thank DOC Murihiku for their logistical support.
References
1. Bi, Y., Xue, B., Zhang, M.: An evolutionary deep learning approach using genetic
programming with convolution operators for image classification. In: 2019 IEEE
Congress on Evolutionary Computation (CEC), pp. 3197–3204 (2019)
2. Bi, Y., Xue, B., Zhang, M.: Genetic programming with image-related operators
and a flexible program structure for feature learning in image classification. IEEE
Trans. Evol. Comput. 25(1), 87–101 (2021)
3. Bowler, E., Fretwell, P.T., French, G., Mackiewicz, M.: Using deep learning to
count albatrosses from space: assessing results in light of ground truth uncertainty.
Remote Sens. 12(12), 1–18 (2020)
4. Corbane, C., et al.: Remote sensing for mapping natural habitats and their conser-
vation status - new opportunities and challenges. Int. J. Appl. Earth Obs. Geoinf.
37, 7–16 (2015). Special issue on earth observation for habitat mapping and bio-
diversity monitoring
5. Duporge, I., Isupova, O., Reece, S., Macdonald, D.W., Wang, T.: Using very-high-
resolution satellite imagery and deep learning to detect and count African elephants
in heterogeneous landscapes. Remote Sens. Ecol. Conserv. 7(3), 369–381 (2021)
6. Ferreira, A.C., et al.: Deep learning-based methods for individual recognition in
small birds. Methods Ecol. Evol. 11(9), 1072–1085 (2020)
7. Fortin, F.A., Rainville, F.M.D., Gardner, M.A., Parizeau, M., Gagné, C.: DEAP:
evolutionary algorithms made easy. J. Mach. Learn. Res. 13(70), 2171–2175 (2012)
8. Fretwell, P.T., Scofield, P., Phillips, R.A.: Using super-high resolution satellite
imagery to census threatened albatrosses. Ibis 159(3), 481–490 (2017)
9. Hoffman-Hall, A., Loboda, T.V., Hall, J.V., Carroll, M.L., Chen, D.: Mapping
remote rural settlements at 30 m spatial resolution using geospatial data-fusion.
Remote Sens. Environ. 233, 1–19 (2019)
10. Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net:
a self-configuring method for deep learning-based biomedical image segmentation.
Nat. Methods 18(2), 203–211 (2021)
11. Kanjir, U., Greidanus, H., Oštir, K.: Vessel detection and classification from space-
borne optical images: a literature survey. Remote Sens. Environ. 207, 1–26 (2018)
12. Marchant, S., Higgins, P., Ambrose, S.: Handbook of Australian, New Zealand
& Antarctic Birds: Volume I, Ratites to Ducks, vol. 1. Oxford University Press,
Oxford (1990)
13. Menghani, G.: Efficient deep learning: a survey on making deep learning models
smaller, faster, and better. ACM Comput. Surv. 55(12), 1–37 (2023)
14. Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.:
Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal.
Mach. Intell. 44(7), 3523–3542 (2022)
15. Mischler, C., Wickes, C.: Campbell Island/Motu Ihupuku Seabird Research &
Operation Endurance February 2023. POP2022-11 final report prepared for Con-
servation Services Programme. Department of Conservation (2023)
16. Montana, D.J.: Strongly typed genetic programming. Evol. Comput. 3, 199–230
(1995)
17. Moore, P.J.: Southern royal albatross on Campbell Island/Motu Ihupuku: solving a
band injury problem and population survey, 2004–08. DOC research and develop-
ment series, 333, Publishing Team, Department of Conservation, Wellington, New
Zealand (2012)
18. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
19. Suganuma, M., Shirakawa, S., Nagao, T.: A genetic programming approach to designing convolutional neural network architectures. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO) (2017)
20. Westerskov, K.: The nesting habitat of the royal albatross on Campbell Island, vol.
6, pp. 16–20. New Zealand Ecological Society (1958)
Reinforcement Learning for Truck
Eco-Driving: A Serious Game as Driving
Assistance System
1 Introduction
Eco-driving has emerged as a crucial aspect of road freight transport. The requirements for adopting eco-driving behavior are becoming stronger: transport companies aim to reduce their fuel consumption and CO2 emissions for economic and environmental reasons, and logistics and transportation enterprises benefit from research on their numerical data. Today, most trucks are equipped with numerous sensors that provide real-time information. When coupled with driving indicators, these can effectively contribute to punctually optimizing driving efficiency; for example, a gear-shift indicator can be coupled with engine speed.
In this work, we aim to go beyond this simple, punctual driving assistance with a complete eco-driving system that assists the driver over the whole journey. By analysing vehicle technical data such as vehicle speed, real-time consumption, or braking state, our system should be able to evaluate driving
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 299–310, 2023.
https://doi.org/10.1007/978-3-031-45382-3_25
300 M. Fassih et al.
quality and propose actions that globally optimize fuel consumption and CO2 emissions. Our solution can be viewed as a serious game in which the driver has to improve a driving score. This work is jointly developed with the company Strada1, which specializes in Transport Management Systems (TMS) for fleet management optimization. Strada possesses a vast amount of transport data thanks to its 4800 customers and approximately 5800 connected vehicles. The company’s primary goal is to offer drivers a solution that enhances their eco-driving awareness and learning capabilities through a driving simulator.
The analysis of Strada's requirements involves some assumptions and constraints. First, since several driving modes and situations exist, depending on the driver, vehicle, journey, road, etc., the simulator must autonomously adapt to various contexts and provide precise driving recommendations tailored to each specific situation. Secondly, the solution should appear as a positive experience for the learner, so gamification sounds like an appropriate approach. As driving can be viewed as a continuous sequence of actions, our proposition is to develop a simulator that increases the driving performance of drivers over a whole sequence. Under these hypotheses, reinforcement learning algorithms appear well suited to our problem. Indeed, reinforcement learning (RL) is a branch of machine learning between unsupervised and supervised learning: the agent learns to behave in an environment according to future rewards, with the goal of developing an efficient policy that optimizes a cumulative reward. From our point of view, RL is an appropriate response to our concern: increasing the driving score in a gamified approach.
A large number of papers exist on reinforcement learning; we can cite the introduction to RL by Sutton [1] and several surveys [2–4]. RL and deep RL are currently employed in a wide range of practical applications with numerous and diverse use cases. In [5], the authors present a review of RL for the cybersecurity domain; in [6], healthcare problems are treated; in [7], a survey of deep RL for blockchain in industrial IoT is presented. Closer to our problem, RL and deep RL are also used in vehicle management, but most of the time in the context of autonomous driving; one survey is proposed by Elallid et al. [8]. In the particular context of eco-driving, some propositions exist but mainly concern electric and personal vehicles [9–11]; consumption assistance for trucks in freight transport remains little studied.
In this article, we propose a solution for truck eco-driving. As in a chess game, our proposal consists in giving the driver optimal recommendations (driving actions) that increase a global performance measure based on fuel consumption, using a reinforcement learning algorithm. As RL is based on the principle of action-reward estimation, our solution must be able to identify the current state of a vehicle and to provide local rewards. The originality of our proposition is to combine a clustering approach for state estimation with RL. Rewards and states are defined using expert knowledge.
The remaining part of this paper is organized as follows. Section 2 introduces the theoretical basics of RL, and Sect. 3 describes our proposal: its general scheme is given in Sect. 3.1. The
1 https://www.stradaworld.com/
use of experts’ knowledge for actions-rewards modeling is presented in Sect. 3.2
and Sect. 3.3 describes how states are estimated. Section 4 presents our experi-
ments and results. We conclude and propose some new perspectives in Sect. 5.
– a set S of states,
– a set A of actions,
– a transition probability function p : S × A → Psa(·), giving the transition probabilities upon taking action a in state s,
– a reward function R : S × S × A → r, which models the reward R(st+1, st, at), the expected reward for state–action–next-state triples,
– a future discount factor γ, the discount rate that determines the present value of future rewards.
In the previous equation, the expectation, noted E, is over the state sequence (s0, s1, ...) passed through when executing the policy π starting from s0. In our setting, we use a finite time horizon T (a particular context: the end of the truck's journey). Qπ(s0, a0) is the expected cumulative reward received when starting from state s0 with action a0 and following policy π. The solution of an MDP is a policy π∗ that, for every initial state s0, maximizes the expected cumulative reward.
Reinforcement learning essentially discovers, by trial and error, which actions are best in each state of an environment. The model starts with a random policy, and one action is taken at each step. This continues until the state is terminal (T in the previous equation).
To solve this problem, a classical strategy is the Q-Learning algorithm, first introduced by [12] and one of the most widely used model-free RL methods. Its value update rule is

Q(st, at) ← Q(st, at) + α [rt+1 + γ maxa Q(st+1, a) − Q(st, at)]    (2)

where α ∈ [0, 1] is the learning rate. With the ε-greedy strategy, the agent chooses the optimal action with probability (1 − ε) and a random action with probability ε. The value of ε can be decreased over the course of training.
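A minimal tabular sketch of the Q-Learning update and the ε-greedy choice; the toy states, rewards, and hyper-parameters below are illustrative, not the expert-defined driving states of this paper:

```python
import random

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    """Explore randomly with probability eps, otherwise exploit the best action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Toy problem: from state 0, action 1 always yields reward 1 and leads to
# state 1, whose Q-values never change (it acts as a terminal state).
actions = [0, 1]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
for _ in range(200):
    q_learning_step(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
# Q[(0, 1)] converges toward r + gamma * 0 = 1.0
```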
the experts have identified significant technical parameters to monitor and eval-
uate the driver’s adherence to the objective of minimizing fuel consumption.
The Start-up phase is characterized by a strong speed acceleration in a certain
time slot: during this stage, the driver has to increase speed very quickly to
reach the cruising speed of the vehicle. Factually, driver can act on the vehicle
acceleration (more or less acceleration) and change gears. So, engine speed (in
rpm) and acceleration (in %) can be used to characterize vehicle state during
the start-up phase. This couple of measures defines the start-up phase feature
space. Using RL, each driver’s action can be perceived as a movement within the
measurement space, necessitating the ability to assign a state and a localized
score to each position in the feature space. Strada’s experts, based on their
experience, propose to divide feature space into several sub-spaces and associate
each sub-space to a local reward. Figure 3 describes their proposal. As we can see,
Fig. 3. Start-up phase feature space, states and scoring table (Color figure online)
On the same principle, scoring tables can be defined for the two other driving phases. Considering that the rolling phase corresponds to a constant speed with neither acceleration nor deceleration, and the braking phase corresponds to a strong speed deceleration in a certain time slot, the experts propose to retain the measures of speed variation and fuel consumption for the rolling phase and the measures of braking percentage and deceleration for the braking phase. Figure 4 presents the proposed scoring tables for these phases, and Table 1 summarizes the measures and driver actions.
Fig. 4. Rolling and braking phase feature space, states and scoring tables (Color figure
online)
Figures 3 and 4 define the possible states in our RL modeling. The experts' choice of the number of states and of the reward values is based on their experience and meets their main objective: determining the optimal driving path with the fewest possible states.
However, using the Q-Learning method for RL requires identifying, at each iteration, the current state of the vehicle (see Eq. 2). The advantage of the scoring tables as defined by the experts is their simplicity: a simple couple of measures provides the state. However, at this point, the experts are unable to precisely define the boundaries of each sub-space. Moreover, we have to keep in mind that the boundaries between sub-spaces cannot be universal. Indeed, many external parameters influence the driving context (type of journey, cargo, vehicle, etc.), and it seems obvious that the boundaries between states cannot be static. They have to be estimated for each driving situation.
Some solutions estimate states by combining a convolutional neural network
with the Q-learning algorithm, as proposed in [13]. In this deep Q-learning [13],
a fully connected neural network identifies states from the input values during
a learning stage. However, this process is very time consuming and requires
large databases, because the reinforcement learning algorithm must also learn
the weights of the convolutional and fully connected parts of the network.
In our application, we propose another strategy: a clustering process to
identify the state of the vehicle and the sub-space boundaries. Our proposition
is to use a Self-Organizing Map (SOM), a well-known unsupervised learning
tool. Some authors have already proposed using SOMs in RL [14,15]. In these
articles, the SOM maps the input space in response to the real-valued state
information, and each unit is then interpreted as a discrete state. This context
closely aligns with our specific problem, so we adopt this structure, since it
enables a clustering process for state detection. Moreover, a major advantage
of the SOM is its ability to preserve the topological properties of the input
space. The number of neurons corresponds to the number of sub-spaces proposed
by the experts.
Figure 5 illustrates the SOM convergence. Applied to our process, the grid
corresponds to the score space and the blue cloud to the real data measures.
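A minimal SOM, sketched below in NumPy under simplifying assumptions (the grid size, learning-rate schedule and toy data are illustrative, not our tuned configuration), shows how a pair of measures is mapped to a discrete state:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid_w=3, grid_h=3, epochs=50, lr0=0.5, sigma0=1.5):
    """Train a small 2D Kohonen map; each unit becomes one discrete state."""
    weights = rng.random((grid_h, grid_w, data.shape[1]))
    # Grid coordinates, used by the neighbourhood function
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)
        sigma = sigma0 * (1.0 - t / epochs) + 1e-3
        for x in rng.permutation(data):
            # Best-matching unit (BMU) for this sample
            d = np.linalg.norm(weights - x, axis=2)
            by, bx = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood pulls nearby units toward the sample
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights

def state_of(weights, x):
    """Discrete state = index of the winning SOM unit for measure pair x."""
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)
```

Fed with pairs of measures (e.g. engine speed and acceleration for the start-up phase), each trained unit approximates one sub-space of the scoring table, and `state_of` replaces hand-drawn boundaries.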
306 M. Fassih et al.
In this part, we describe our experiments, in particular the data used, the
clustering stage and the driving recommendations.
Fig. 6. Example of data corresponding to start-up, rolling and braking phases (from
left to right)
Fig. 7. Kohonen2D result associated with start-up, rolling and braking phases
Table 2. Recommendations for start-up phase (s: engine speed in rpm, a: acceleration
in m/s²)
Fig. 8. Displacements in feature/state space - scatter space (from left to right: for
start-up, rolling and braking phases)
The same analysis is performed for the rolling and braking phases. Tables 3
and 4 sum up these cases. We observe that the system always tries to reach
the high-score zone, which is the desired behavior for eco-driving optimization.
The system appears efficient at providing accurate driving recommendations:
the recommended actions successively move the feature data toward the "green"
regions of the score tables, as expected.
These results were presented to the Strada experts to evaluate the global
solution on a complete driving sequence. They consider that the system offers
consistent recommendations from an eco-driving point of view. The recommen-
dations offered by our system appear highly relevant: they match the driving
advice that should be given for the simulated driving sequence. Right now, the
solution can be viewed as three parallel processes, one per driving phase. Indi-
vidually, each process provides accurate recommendations for eco-driving. Obvi-
ously, driving a truck is a continuous process in which start-up, rolling
Table 3. Recommendations for rolling phase (with fc: fuel consumption in L/100 km,
Δs: speed variation in km/h)
and braking phases follow one another. The global solution includes automatic
detection of the driving phase. The global fuel consumption over a simulated
journey also remains to be evaluated. Given these results, the solution will be
adapted to a real truck, and further research based on deep RL should also be
undertaken.
5 Conclusion
In the context of eco-driving, we presented a real-time recommendation system
for simulator-based driver training. The proposed system is based on reinforce-
ment learning. Since the state domain is continuous, we introduced an identi-
fication of discrete states using a Self-Organizing Map. All the parameters of
our strategy were set by qualified experts. Finally, we applied our proposed
system to simulated driving and, according to the experts, the recommendations
are coherent and allow the driver to adopt an eco-driving behavior. The perspec-
tives of this work are numerous. First, the modeling of states and actions has
been deliberately simplified; each driving phase could be improved by refining
the representation of states and actions, for example by implementing more
specific and precise actions for each phase. Secondly, introducing deep RL should
simplify the identification of the current states for Q-learning, although the need
for abundant data makes acquiring real-world transportation data a challenge
for deep learning-based ML and RL systems. Thirdly, the proposal can be enriched
by taking into account new information such as the type of road during the travel
(streets, roundabouts, highways. . . ) and the topology of missions or shipments.
Acknowledgments. This work was supported by the ANRT funding. Special thanks
to the Strada team for their time, knowledge and expertise.
References
1. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, 1st edn. MIT
Press, Cambridge (1998)
2. Wiering, M., Van Otterlo, M.: Reinforcement Learning: State of the Art. Springer,
Heidelberg (2012). https://doi.org/10.1007/978-3-642-27645-3
3. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning.
J. Mach. Learn. Res. 16(1), 1437–1480 (2015)
4. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J.
Artif. Intell. Res. 4, 237–285 (1996)
5. Adawadkar, A.M.K., Kulkarni, N.: Cyber-security and reinforcement learning – a
brief survey. Eng. Appl. Artif. Intell. 114, 105116 (2022). https://doi.org/10.1016/
j.engappai.2022.105116
6. Coronato, A., Naeem, M., De Pietro, G., Paragliola, G.: Reinforcement learning
for intelligent healthcare applications: a survey. Artif. Intell. Med. 109, 101964
(2020). https://doi.org/10.1016/j.artmed.2020.101964
7. Frikha, M.S., Gammar, S.M., Lahmadi, A., Andrey, L.: Reinforcement and deep
reinforcement learning for wireless Internet of Things: a survey. Comput. Commun.
178, 98–113 (2021). https://doi.org/10.1016/j.comcom.2021.07.014
8. Elallid, B.B., Benamar, N., Hafid, A.S., Rachidi, T., Mrani, N.: A comprehen-
sive survey on the application of deep and reinforcement learning approaches in
autonomous driving. J. King Saud Univ. Comput. Inf. Sci. 34(9), 7366–7390 (2022).
https://doi.org/10.1016/j.jksuci.2022.03.013
9. Yeom, K.: Model predictive control and deep reinforcement learning based energy
efficient eco-driving for battery electric vehicles. Energy Rep. 8, 34–42 (2022).
https://doi.org/10.1016/j.egyr.2022.10.040
10. Li, J., Wu, X., Xu, M., Liu, Y.: Deep reinforcement learning and reward shap-
ing based eco-driving control for automated HEVs among signalized intersections.
Energy 251, 123924 (2022). https://doi.org/10.1016/j.energy.2022.123924
11. Du, G., Zou, Y., Zhang, X., Liu, T., Wu, J., He, D.: Deep reinforcement learning
based energy management for a hybrid electric vehicle. Energy 201, 117591 (2020).
https://doi.org/10.1016/j.energy.2020.117591
12. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992).
https://doi.org/10.1007/BF00992698
13. Mnih, V., et al.: Playing Atari with deep reinforcement learning. CoRR
abs/1312.5602 (2013)
14. Osana, Y.: Reinforcement learning using Kohonen feature map probabilistic asso-
ciative memory based on weights distribution. In: Advances in Reinforcement
Learning. IntechOpen (2011)
15. Montazeri, H., Moradi, S., Safabakhsh, R.: Continuous state/action reinforcement
learning: a growing self-organizing map approach. Neurocomputing 74(7), 1069–
1082 (2011)
16. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol.
Cybern. 43(1), 59–69 (1982). https://doi.org/10.1007/BF00337288
17. Kohonen, T.: Self-Organization and Associative Memory, vol. 8. Springer, Heidel-
berg (2012)
18. Kohonen, T.: Essentials of the self-organizing map. Neural Netw. 37, 52–65 (2013).
https://doi.org/10.1016/j.neunet.2012.09.018
19. Euro Truck Simulator 2. https://eurotrucksimulator2.com/
Underwater Mussel Segmentation Using
Smoothed Shape Descriptors
with Random Forest
1 Introduction
In mussel restoration ecology there is a need to rapidly quantify the spatial
extent and three-dimensional (3D) geometry of restored mussel beds to assess
the efficacy of restoration efforts. Traditional surveying methods are labour
intensive, utilising multiple divers and measurement tools such as weighted tapes
and calipers. Limited both temporally and spatially by diving logistics and safe
diving practice, it is difficult to obtain high-resolution continuous data and large
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 311–321, 2023.
https://doi.org/10.1007/978-3-031-45382-3_26
312 D. A. S. Valdez et al.
coverage of the growth of restored mussel beds over time. With the increased
availability of open-source 3D reconstruction software, obtaining accurate geo-
metrical data on underwater mussel structures only requires high-resolution pho-
tos or videos [3]. Utilising cameras allows marine ecologists to simplify under-
water data collection procedures or introduce autonomous underwater vehicles
(AUVs) into their monitoring programmes [8,16]; however, the labour is shifted
from in-situ measurements to isolating the benthic structures in the reconstruc-
tions using 3D software. The increasing volume of data gathered by modern
sampling procedures will soon be untenable without reliable autonomous pro-
cessing pipelines. Point cloud segmentation is one such approach, suited to the
autonomous processing of unstructured and irregular 3D data.
2 Related Work
3 Methodology
The methodology is divided into three sections: data preparation, shape
descriptor computation and point cloud element classification.
Fig. 1. Different locations of 3D seafloor reconstruction with color and texture (top
row). Corresponding labeled mussels structures colored in magenta (bottom row).
(Color figure online)
It is possible to use the labeled data to train a decision tree, which can segment
the mussel structures. However, by looking at the color data, it is obvious that
texture is not an ideal feature to perform any classification of structures. Results
can be further degraded by turbidity variations around the locations of interest.
In order to minimize the impact of the above on the segmentation process, a
shape descriptor approach was selected for this work.
We propose the use of a shape descriptor based on the method of Smoothed
Particle Hydrodynamics (SPH) [14]. A similar approach was used successfully for
archaeological studies [28]. This shape descriptor aims to combine the robustness
of local shape descriptors, such as the splash shape descriptor [24], with that of
global operator descriptors, such as smoothing of salient features [11].
The shape descriptor is defined as a field property As for each particle P (see
Eq. 1), where the mass m and the density ρ are fixed in an arbitrary fashion.
While the smoothing length h of the kernel W can be computed by considering
the size of the features of the underwater structures, it was set manually across
this work. In order to avoid changes in shape description and sensitivity, the
smoothing kernel function W (see Eq. 2) was not modified from the one pro-
posed in the original work, where R is the distance between particles and h the
interaction radius of the kernel.
A_s(P_i) = Σ_{j=1}^{n} (m_j / ρ_j) A_s(P_j) W(r_i − r_j, h)    (1)

Underwater Mussel Segmentation 315

W(R, h) = 315/(64π h⁹) · (h² − R²)³ for 0 ≤ R ≤ h, and 0 for R > h    (2)
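A naive numerical sketch of this field summation and kernel follows (a quadratic-cost NumPy version; in practice a spatial index or the CUDA implementation of [28] would be used, and the mass m, density ρ and carried property A are set arbitrarily, as stated above):

```python
import numpy as np

def poly6_kernel(R, h):
    """W(R, h) = 315/(64*pi*h^9) * (h^2 - R^2)^3 for 0 <= R <= h, else 0 (Eq. 2)."""
    w = (315.0 / (64.0 * np.pi * h ** 9)) * (h ** 2 - R ** 2) ** 3
    return np.where((R >= 0) & (R <= h), w, 0.0)

def sph_field(points, A, m, rho, h):
    """A_s(P_i) = sum_j (m_j / rho_j) * A_s(P_j) * W(|r_i - r_j|, h) (Eq. 1)."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise r_i - r_j
    R = np.linalg.norm(diff, axis=2)
    W = poly6_kernel(R, h)
    return W @ (A * m / rho)
```

Each output entry is the smoothed field value at one vertex, so dense neighbourhoods within the interaction radius h contribute most.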
Descriptors used in detection and classification are mainly based on region
covariance and have proven quite successful [26,27]. Taking this into consid-
eration, we added a region covariance computation to our shape descriptor. It
was applied to the main feature used for our approach, as well as to the position
covariance of the 3D data. The main feature used for our mussel segmentation
approach is the vertex normal.
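As an illustration, a region covariance over vertex normals can be computed as follows (the descriptor layout, flattening the upper triangle into a vector, is our illustrative choice):

```python
import numpy as np

def region_covariance(features):
    """3x3 covariance of per-vertex features (here: vertex normals) in a region.
    The upper triangle is flattened into a compact descriptor vector."""
    C = np.cov(features, rowvar=False)   # features: (n_vertices, 3)
    iu = np.triu_indices(3)
    return C[iu]                          # 6-dimensional descriptor
```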
Since the 3D point cloud of each transect exhibits a principal axis, along which
the data is acquired, we decided to subdivide it into patches to provide regions
with more regular dimensions. Following this approach we were able to use an
octree with the same depth level for both the SPH and HOG methods. The
depth of these octrees was chosen by taking into consideration the 64-cell grid
dimension proposed by Dalal [7]. Since we aim to obtain a cubic volume, we
opted for an octree with a maximum depth level of 6 for the HOG initial grid
(64 × 64 × 64). This value is also used for the kernel size of the SPH shape
descriptor, so that the gradient and SPH computations are equivalent region-
wise.
Using these conditions we tested both the SPH shape descriptor and the HOG
shape descriptor. To measure performance we evaluated several combinations of
the SPH features and the complete set of HOG features, comparing four scores:
precision, recall, F1 and accuracy.
Object classification using decision trees along with texture and shape features
has proven successful on tasks that were challenging before machine learning
adoption [2,20]. Our approach provides a set of features for the classification of
objects. This novel shape descriptor, along with traditional ones such as region
covariance, delivers a multi-feature set based solely on 3D shape data. For per-
formance comparison, we evaluated our approach's outputs against the results
obtained using solely HOG shape features.
We used the Python implementation of random forest provided in the library
Scikit-learn [15]. The random forest classifier was executed with one hundred
estimators and the random state parameter set to 42.
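That configuration corresponds to the following scikit-learn call; the feature matrix below is a random placeholder standing in for the actual shape-descriptor features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: one row of descriptor features per vertex/patch; y: 1 = "Mussel", 0 = "NOT Mussel".
# Placeholder data standing in for the SPH/covariance/HOG feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 6)), rng.normal(3, 1, (200, 6))])
y = np.array([0] * 200 + [1] * 200)

# The configuration stated in the text: 100 estimators, random_state 42
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
pred = clf.predict(X)
```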
For our binary classification, we defined a single class, "Mussel", while every-
thing else was considered as "NOT Mussel". The random forest classifier was
trained on data labeled as "Mussel", as well as data labeled as "NOT Mussel",
which typically includes seafloor, rocks, algae, and other underwater objects.
While our focus is the detection of mussel structures, the approach can be mod-
ified to include more classes and detect other features.
4 Results
We tested our approach using a data set which contains nine 3D reconstructions
from different underwater locations. Our data set contains over 12 million ver-
tices. Since the number of transects is limited, we opted for a cross-validation
approach in which each transect in turn is used as training data and the classifier
is tested on the remaining 8 transects.
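This train-on-one-transect, test-on-the-rest protocol can be sketched as follows (the transect arrays here are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def cross_validate(transects):
    """Train on each transect in turn; test on all the remaining ones.
    `transects` is a list of (X, y) pairs, one per reconstruction."""
    scores = []
    for i, (X_tr, y_tr) in enumerate(transects):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X_tr, y_tr)
        X_te = np.vstack([X for j, (X, _) in enumerate(transects) if j != i])
        y_te = np.concatenate([y for j, (_, y) in enumerate(transects) if j != i])
        pred = clf.predict(X_te)
        p, r, f1, _ = precision_recall_fscore_support(
            y_te, pred, average="binary", zero_division=0)
        scores.append((p, r, f1, accuracy_score(y_te, pred)))
    # Average precision, recall, F1 and accuracy over the folds
    return np.array(scores).mean(axis=0)
```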
The features selected are listed in Table 1. The same encoding is used in
subsequent tables and figures in this paper. Those that are named “covariance”
are computed using region covariance, while those with “SPH” in their name are
computed using the smoothed shape descriptor.
Table 2 shows the average results for precision, recall, F1 and accuracy. Each
row indicates which shape descriptor data was included for the random forest
training. The most dominant feature is "NORMAL", and the covariance com-
putation yields some of the best results. However, when combined with the SPH
computation, all measures improve slightly; as such, the use of SPH can be
considered a fine-tuning tool. The most relevant results involving our SPH
descriptor were obtained for experiment 9 (see Table 2), where using the SPH
features along with the SPH centroid features achieves a performance similar to
the configurations that used covariance, or covariance and SPH.
Our SPH shape descriptor performed better than the shape descriptor based
on HOG features in most cases: only experiments 1, 2 and 5 obtained a lower
precision score, and only in experiment 2 were the precision, F1 and accuracy
scores worse by 20% compared to HOG. As expected, position was never a reliable
feature, but the configurations that included features based on the vertex normal
were the better performing, even better than HOG. Experiments 3, 5, 6, 8 and 9
all performed better in every score we evaluated, with gains ranging from 2% to
9% in precision, 15% to 21% in recall, 12% to 15% in F1, and 4% to 5% in
accuracy (see Table 2).
5 Discussion
As with 2D segmentation tasks, 3D segmentation benefits from the use of robust
shape descriptors invariant to the orientation and scale differences present in 3D
point clouds. Our method requires normalization of the orientation and scale
differences between the point clouds to improve the robustness of the SPH
descriptor, whereas readily extracted HOG features are scale and orientation
invariant. Despite this difference in invariance, the SPH shape descriptor out-
performed HOG in accuracy and recall during segmentation. Our SPH shape
descriptor therefore requires further extension for scale and orientation invari-
ance, or should be used alongside descriptors such as HOG in 3D point cloud
segmentation tasks.
For highly complex structures, segmentation pipelines should seek to ade-
quately represent or leverage this geometrical complexity. Runyan et al. [19]
used SparseConvNet to process point clouds similarly derived from SfM pho-
togrammetry on complex coral structures. SparseConvNet operates on voxels
requiring the creation of implicit surfaces from the point cloud using a fixed
voxel resolution. The chosen voxel resolution however determines the minimum
geometric complexity maintained (at the cost of computation). Comparatively,
our approach works directly on the unstructured point cloud. SPH computes
per-vertex field properties that can be turned into individual features per vertex,
where closer and denser sets of elements significantly influence the features
assigned to each vertex, exploiting the geometric complexity of the full point
cloud. PointNet++ [18] similarly operates directly on the unstructured point
cloud. While the features learned are orientation- and scale-invariant, the net-
work's ability to learn representative features for the naturally fractal mussel
structures [23] should be investigated.
References
1. Akhtar, A., Gao, W., Li, L., Li, Z., Jia, W., Liu, S.: Video-based point cloud
compression artifact removal. IEEE Trans. Multimed. 24, 2866–2876 (2021)
2. Ali, J., Khan, R., Ahmad, N., Maqsood, I.: Random forests and decision trees. Int.
J. Comput. Sci. Issues (IJCSI) 9(5), 272 (2012)
3. Azhar, M., Hillman, J.R., Gee, T., Thrush, S., Delmas, P.: A low-cost stereo
pipeline for semi-automated spatial mapping of mussel structures within mussel
beds. Remote Sens. Environ. (Manuscript in review) (2023)
4. Behley, J., et al.: Semantickitti: a dataset for semantic scene understanding of
lidar sequences. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision, pp. 9297–9307 (2019)
5. Chang, A.X., et al.: Shapenet: an information-rich 3d model repository. arXiv
preprint arXiv:1512.03012 (2015)
6. Chang, Y.L., Fang, C.Y., Ding, L.F., Chen, S.Y., Chen, L.G.: Depth map gener-
ation for 2D-to-3D conversion by short-term motion assisted color segmentation.
In: 2007 IEEE International Conference on Multimedia and Expo, pp. 1958–1961.
IEEE (2007)
7. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2005), vol. 1, pp. 886–893. IEEE (2005)
8. Ferrari, R., et al.: 3D photogrammetry quantifies growth and external erosion of
individual coral colonies and skeletons. Sci. Rep. 7(1), 1–9 (2017)
9. Grilli, E., Poux, F., Remondino, F.: Unsupervised object-based clustering in sup-
port of supervised point-based 3D point cloud classification. Int. Arch. Photogram-
metry Remote Sens. Spat. Inf. Sci. 43, 471–478 (2021)
10. Li, H., Huang, D., Lemaire, P., Morvan, J.M., Chen, L.: Expression robust 3D
face recognition via mesh-based histograms of multiple order surface differential
quantities. In: 2011 18th IEEE International Conference on Image Processing, pp.
3053–3056 (2011). https://doi.org/10.1109/ICIP.2011.6116308
11. Li, X., Guskov, I.: Multiscale features for approximate alignment of point-based
surfaces. In: Symposium on Geometry Processing, vol. 255, pp. 217–226 (2005)
12. Lu, B., Wang, Q., Li, A.: Massive point cloud space management method based
on octree-like encoding. Arab. J. Sci. Eng. 44, 9397–9411 (2019)
13. Martin-Abadal, M., Piñar-Molina, M., Martorell-Torres, A., Oliver-Codina,
G., Gonzalez-Cid, Y.: Underwater pipe and valve 3D recognition using deep learn-
ing segmentation. J. Mar. Sci. Eng. 9(1), 5 (2020)
14. Monaghan, J.J.: Smoothed particle hydrodynamics. ARAA 30, 543–574 (1992).
https://doi.org/10.1146/annurev.aa.30.090192.002551
15. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
16. Pizarro, O., Eustice, R.M., Singh, H.: Large area 3-D reconstructions from under-
water optical surveys. IEEE J. Oceanic Eng. 34(2), 150–169 (2009)
17. Pulido, A., Qin, R., Diaz, A., Ortega, A., Ifju, P., Shin, J.J.: Time and cost-efficient
bathymetric mapping system using sparse point cloud generation and automatic
object detection. In: OCEANS 2022, Hampton Roads, pp. 1–8 (2022). https://doi.
org/10.1109/OCEANS47191.2022.9977073
18. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learn-
ing on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017)
19. Runyan, H., et al.: Automated 2D, 2.5 D, and 3D segmentation of coral reef point-
clouds and orthoprojections. Front. Rob. AI 9 (2022)
20. Schroff, F., Criminisi, A., Zisserman, A.: Object class segmentation using random
forests. In: BMVC, pp. 1–10 (2008)
21. Söhnlein, G., Rush, S., Thompson, L.: Using manned submersibles to create 3D
sonar scans of shipwrecks. In: OCEANS 2011 MTS/IEEE KONA, pp. 1–10 (2011).
https://doi.org/10.23919/OCEANS.2011.6107130
22. Shu, C., Ding, X., Fang, C.: Histogram of the oriented gradient for face recog-
nition. Tsinghua Sci. Technol. 16(2), 216–224 (2011). https://doi.org/10.1016/
S1007-0214(11)70032-3
23. Snover, M.L., Commito, J.A.: The fractal geometry of mytilus edulis l. spatial
distribution in a soft-bottom system. J. Exp. Mar. Biol. Ecol. 223(1), 53–64 (1998)
24. Stein, F., Medioni, G.: Structural indexing: efficient 2d object recognition. IEEE
Trans. Pattern Anal. Mach. Intell. 14(12), 1198–1204 (1992)
25. Surasak, T., Takahiro, I., Cheng, C.H., Wang, C.E., Sheng, P.Y.: Histogram of
oriented gradients for human detection in video. In: 2018 5th International Confer-
ence on Business and Industrial Research (ICBIR), pp. 172–176 (2018). https://
doi.org/10.1109/ICBIR.2018.8391187
26. Tabia, H., Laga, H., Picard, D., Gosselin, P.H.: Covariance descriptors for 3D shape
matching and retrieval. In: 2014 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 4185–4192 (2014). https://doi.org/10.1109/CVPR.2014.533
27. Tuzel, O., Porikli, F., Meer, P.: Region covariance: a fast descriptor for detec-
tion and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006). https://doi.org/10.
1007/11744047_45
28. Valdez, D.A.S., et al.: CUDA implementation of a point cloud shape descriptor
method for archaeological studies. In: Blanc-Talon, J., Delmas, P., Philips, W.,
Popescu, D., Scheunders, P. (eds.) ACIVS 2020. LNCS, vol. 12002, pp. 457–466.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-40605-9_39
29. Wang, G., Tie, Y., Qi, L.: Action recognition using multi-scale histograms of ori-
ented gradients based depth motion trail Images. In: Falco, C.M., Jiang, X. (eds.)
Ninth International Conference on Digital Image Processing (ICDIP 2017), vol.
10420, p. 104200I. SPIE (2017). https://doi.org/10.1117/12.2281553
30. Zhang, Y., et al.: Polarnet: an improved grid representation for online lidar point
clouds semantic segmentation. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 9601–9610 (2020)
31. Zhou, W., Gao, S., Zhang, L., Lou, X.: Histogram of oriented gradients feature
extraction from raw Bayer pattern images. IEEE Trans. Circ. Syst. II Express
Briefs 67(5), 946–950 (2020). https://doi.org/10.1109/TCSII.2020.2980557
A 2D Cortical Flat Map Space
for Computationally Efficient Mammalian
Brain Simulation
1 Introduction
Brain simulation is an active and promising area of research in computational
neuroscience. The aim is to model the structure and function of the brain using
biologically realistic or simplified neurons and synapses. Brain simulation has
several potential applications, such as understanding the neural basis of cog-
nition and perception, testing hypotheses about brain disorders and diseases,
developing novel brain-inspired algorithms and technologies, and advancing the
field of artificial intelligence. However, simulations face many challenges that can
limit their feasibility and validity.
The first challenge is the scale of cortical brain simulation. The human cortex
contains a huge number of neurons and even more synapses. Simulating such a
large and complex system requires enormous computational resources and power.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 322–331, 2023.
https://doi.org/10.1007/978-3-031-45382-3_27
Cortical Flat Map for Simulation 323
Fig. 1. (a) The Brain/MINDS marmoset brain atlas shown in horizontal, sagittal and
coronal slices. (b) The cortical flat map generated from the atlas. (c) The map of the
scaling at each point in space when going from 2D back to 3D. (d) The map of (normal-
ized) neuron densities. (e) An example of neuron positions generated probabilistically
based on the densities.
Another challenge is dealing with the complexity of the brain. The brain is
not a homogeneous or static system, but rather a heterogeneous and dynamic one
that exhibits multiple levels of organization and adaptation. The brain consists of
different types of neurons and glial cells, which have diverse morphologies, electro-
physiological properties, molecular profiles, synaptic connections, and plasticity
mechanisms. Moreover, the brain changes over time due to development, learn-
ing, aging, injury, and disease. To capture all these aspects of brain complexity
would require a vast amount of data and parameters that are not easily accessi-
ble or measurable, especially in humans. Therefore, most brain simulations rely
on simplifications or assumptions that may compromise their biological realism
or relevance.
One of the main goals and contributions of this paper is to propose a method
for computationally efficient mammalian cortical brain simulation by construct-
ing a 2D cortical flat map space on a regular grid. The rationale behind this is
that in the mammalian brain the cortex forms a sheet with (up to) six layers. In
principle it can therefore be flattened out onto a (2D) surface. A 2D space has
a number of attractive properties, and computation on a regular grid maps well
onto computer hardware.
This method allows us to map neuroscience data into the flat map space and
use it as experimental constraints for the simulation. Moreover, it enables us
to determine neuron locations probabilistically by sampling from neuron density
distributions, which can be used for parameterizing brain simulations at different
scales. The method accounts for the spatial warping of the cortical surface when
324 A. Woodward et al.
flattening it into 2D. This information can be used to scale properties such as
diffusion rates of neural activity across the flat map. Finally, in this work we
combine the flat map model with region-to-region connectivity information to
create a simulation that accounts for the spread of local activity across the
cortical sheet and long-range connectivity. We demonstrate the feasibility and
utility of our method using a brain atlas and diffusion weighted imaging (DWI)
data of the common marmoset brain.
Our method provides a computationally efficient and flexible way to map
neuroimaging data and specify parameters for cortical brain simulation using a
2D cortical flat map space.
2 Methods
2.1 Construction of a 2D Cortical Flat Map Space
We used our Brain/MINDS marmoset brain atlas [2,10] (see Fig. 1a) to construct
a cortical flat map using the Caret software [9]. The result is shown in Fig. 1b.
This is in the form of a triangular mesh, so it must be processed in order to
use it for computational purposes. We do this by mapping the mesh into a 2D
image space. The domain of an image is the set of coordinates (x, y) that define
the location of each pixel in the image. The domain can be represented by a
regular grid of points that span the width and height of the image. We define
an M × N arithmetic grid: R_{M,N} = {(x, y) : 1 ≤ x ≤ M ∧ 1 ≤ y ≤ N}. An
image model treats an image as a function I : R_{M,N} → V, where V is a set of
signal values. For example, grey level intensities can be described as I(p) ≡
I(x, y) ∈ V = {0, 1, · · · , G_max}. In our case, we can take any neuroscience data as our
signal and map it into this image space. We map the flat map so that it fits
within an arbitrarily chosen M × N grid (image). Each pixel of this 2D image is
then assigned to a particular brain region based on the mapped information. We
can then take advantage of the regular grid structure to accelerate computation
through parallelization. This can speed up the processing time significantly when
compared to applying the operations to the whole image sequentially by a single
processor or core.
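As a sketch, the flat-map image space amounts to a per-pixel region-label grid plus vectorized lookups; the grid size, the toy circular region and the signal values below are illustrative stand-ins for the rasterized triangle mesh:

```python
import numpy as np

M, N = 256, 256                            # arbitrarily chosen grid size
region_id = np.zeros((M, N), np.int32)     # per-pixel brain-region label (0 = outside)

# Toy region assignment standing in for rasterizing the flat-map mesh:
yy, xx = np.mgrid[0:M, 0:N]
region_id[(xx - 128) ** 2 + (yy - 128) ** 2 < 100 ** 2] = 1

# Any per-region signal can now be mapped into image space with one
# vectorized (hence easily parallelized) per-pixel lookup:
lut = np.array([0.0, 0.7])                 # signal value per region
I = lut[region_id]                         # image I(x, y) over the M x N grid
```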
One of the challenges of constructing a 2D cortical flat map space is to
account for the spatial warping of the cortical surface that occurs when it is
flattened from 3D. This warping may affect the properties of neural activity that
depend on the distance or area of the cortical regions, such as diffusion rates.
To address this issue, we estimated a scale factor that quantifies the relative
change in surface area of triangles in the brain surface mesh when going from
the 2D cortical flat map space, back to 3D. The factor can be used to modify the
diffusion rates of neural activity across the flat map, so that they are consistent
with the 3D geometry. The result is shown in Fig. 1c.
As an example, we also mapped information about per-region neuron densi-
ties into the flat map space (obtained from [3]). These densities were then nor-
malized to give an empirical probability distribution of relative neural densities
(see Fig. 1d). If we sample this repeatedly we can generate N neuron locations
that respect the true relative neuron densities of the cortex (see Fig. 1e). In this
manner we can specify neuron counts up to the real number in the brain. Further
information, such as region connectivity profiles derived from tracer studies or
DWI, neuron types, or cortical layer information, could be mapped into the flat
map space to enhance the realism of the simulation.
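Sampling neuron locations from the normalized density image can be sketched as follows (the density map below is a toy stand-in for the per-region densities of Fig. 1d):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neurons(density, n):
    """Draw n pixel locations with probability proportional to neuron density."""
    p = density.ravel() / density.sum()          # empirical probability distribution
    idx = rng.choice(density.size, size=n, p=p)
    return np.column_stack(np.unravel_index(idx, density.shape))

# Toy density map standing in for the normalized densities of Fig. 1d
density = np.ones((64, 64))
density[20:40, 20:40] = 10.0                     # a denser cortical region
positions = sample_neurons(density, 5000)        # (row, col) per neuron
```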
For this work we used connectivity information derived from common marmoset
brain in vivo DWI data, from a cohort of 126 individuals. This data was obtained
from the NA216 dataset [5] at the Brain/MINDS Data Portal [2]. Diffusion
weighted imaging (DWI) is a form of magnetic resonance imaging (MRI) that
measures the random motion of water molecules within brain tissue. By applying
different diffusion gradients along different directions, DWI can generate contrast
based on the diffusion anisotropy of the tissue, which reflects its microstructural
organization.
Fig. 2. (a) The brain atlas mapped with 126 tractograms used as input to generate
(b) the canonical (average) connection matrix. (c) Distribution of connection weights,
showing a long tail. (d) Distribution of the base-10 logarithm of the connection
weights.
The DWI data was processed to obtain the fiber orientation density (FOD)
function at each voxel location. This was then used to generate a tractogram: a
set of streamlines that estimates the axonal paths running through the brain.
A simplified version of the Brain/MINDS marmoset brain atlas [2,10], with 52
326 A. Woodward et al.
regions per hemisphere, suitable for MRI studies, was registered with the DWI
data using a nonlinear transformation (see Fig. 2a for a conceptual diagram). The
number of streamlines connecting two regions was used as a measure of strength
and stacked in a connection matrix. The matrices of the 126 individuals were
averaged to generate a final average connection matrix (Fig. 2b). This symmetric
104 × 104 matrix describes the long range structural connectivity of the common
marmoset brain.
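The streamline-counting and averaging steps can be sketched as follows (hypothetical names; `streamline_regions` holds the atlas labels of each streamline's two endpoints):

```python
import numpy as np

def connection_matrix(streamline_regions, n_regions):
    """Symmetric matrix of streamline counts between atlas regions.

    streamline_regions: iterable of (region_a, region_b) endpoint labels."""
    C = np.zeros((n_regions, n_regions))
    for a, b in streamline_regions:
        C[a, b] += 1
        C[b, a] += 1          # count both directions so C is symmetric
    return C

def average_matrix(matrices):
    """Canonical connectome: element-wise mean over the individual matrices."""
    return np.mean(np.stack(matrices), axis=0)
```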
The detailed methodology for data acquisition, preprocessing and connection
matrix generation is described in [2,5,10,11].
We plotted the distribution of weights for the average connection matrix,
shown in Figs. 2c–d. The distribution shows a long tail, a feature of real brains
that is important for realistic brain simulation.
¹ Some versions of the model add a diffusion term to w to give: ∂w/∂t = Dw ∇²w + ε(v − γw − β).
∂u/∂t = Du ∇²u + f(u) − v + I + ω, (2a)
∂v/∂t = ε(u − γv − β) (2b)
Here, Du is a positive constant that represents the diffusion coefficient of u.
The diffusion term accounts for the spatial coupling of neighboring cells, and
can lead to the formation of traveling waves or spiral waves in the system. We
modify Du on a per-pixel basis to include the scale factor describing the warping
of space when going from 2D back to 3D. We used f(u) = u(u − α)(1 − u), as
described in Eq. (5) of [5], with α = −0.3, β = 0, γ = 10⁻⁸ and ε = 0.1. Here
I represents input into the neural assembly. In our case this comes from input
based on the region-to-region long range (extrinsic) connectivity derived from
the brain atlas and cohort DWI data. In terms of a reaction diffusion system this
can be seen as a global feedback term. A scale factor G is applied to the input
term (i.e. I = G × Iextrinsic) in order to explore the effects of different levels of
global coupling.
The Spatially-extended FitzHugh-Nagumo model is a special case of the 2D
Generic Oscillator model used in The Virtual Brain (TVB) platform [6,7].
Simulation Loop. At each iteration of the simulation the steps are as follows:
(1) calculate the Laplacian of u to estimate the diffusion term; (2) update the
values of u and v using Eqs. (2a, 2b) through integration; (3) based on the
region-to-region connectivity matrix, sum the activity of u to calculate I for each
region; this will be used for the next iteration.
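The loop can be sketched in NumPy as an explicit Euler discretisation (an illustrative sketch, not the authors' implementation: periodic boundaries for brevity, the noise term ω omitted, dt chosen arbitrarily; Du may be a per-pixel array carrying the 2D-to-3D scale factor):

```python
import numpy as np

def laplacian(u):
    """Five-point stencil Laplacian; periodic boundaries for brevity."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
            + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)

def step(u, v, I, Du, dt=0.05, alpha=-0.3, beta=0.0, gamma=1e-8, eps=0.1):
    """One explicit Euler iteration of Eqs. (2a)-(2b); Du may be per-pixel."""
    f = u * (u - alpha) * (1.0 - u)                  # cubic nonlinearity f(u)
    u_new = u + dt * (Du * laplacian(u) + f - v + I)
    v_new = v + dt * eps * (u - gamma * v - beta)
    return u_new, v_new

def extrinsic_input(u, region_ids, W, G):
    """Step (3): route per-region mean activity through the connection matrix W."""
    n = W.shape[0]
    means = np.array([u[region_ids == r].mean() for r in range(n)])
    return (G * (W @ means))[region_ids]             # I = G * I_extrinsic per pixel
```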
Fig. 3. (a) An example of mapping the 2D activity back onto the 3D brain surface. (b)
The flat map space (sampled at 200×200 pixels) based on the 52 region per hemisphere
atlas. (c) The mask used for specifying the region of interest for computations. (d) A
comparison of the average computation time for one iteration of the simulation, for
pure Python vs. Numba.
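The pure Python vs. Numba comparison in Fig. 3d reflects JIT compilation of per-pixel loops such as the masked Laplacian below (a sketch; it falls back to plain Python when Numba is absent):

```python
import numpy as np

try:
    from numba import njit               # JIT-compile the per-pixel loops
except ImportError:                      # graceful fallback: run as plain Python
    def njit(func):
        return func

@njit
def laplacian_masked(u, mask):
    """Five-point Laplacian evaluated only inside the flat-map mask."""
    out = np.zeros_like(u)
    for i in range(1, u.shape[0] - 1):
        for j in range(1, u.shape[1] - 1):
            if mask[i, j]:
                out[i, j] = (u[i - 1, j] + u[i + 1, j]
                             + u[i, j - 1] + u[i, j + 1] - 4.0 * u[i, j])
    return out
```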
Fig. 4. Results for different global coupling parameter values. (a) Strong global cou-
pling shows dynamics that emphasize global feedback, with different brain region activ-
ity switching over time. (b) Reduced global coupling shows a mixture of local and
global feedback dynamics. (c) With no global coupling only the spontaneous activity
from local noise drives the dynamics.
We will need to validate our method against experimental data and compare it
with other existing methods for cortical brain simulation. This will be tackled
in future research.
4 Conclusion
In this paper we have proposed a novel method for mammalian cortical brain
simulation that leverages the advantages of 2D cortical flat maps. Our method
can map neuroscience data into the flat map space and generate realistic neuron
locations by sampling from empirical density distributions. Our method can also
account for the spatial warping of the cortical surface when going from the 2D flat
map space back into 3D. We have demonstrated the feasibility and utility of our
method using connectivity information from a marmoset brain atlas and DWI
data. Our method has implications for neuroscience research and its applications.
We envision that such a model can shed light on how neural activity propagates
across regions and how different regions interact with each other. This may then
contribute to the study of various aspects of brain function and dysfunction, such
as sensory processing, memory formation, cognitive control, and neurological
disorders.
Some possible future directions for our work are: implementing a CUDA version
of our method that can run on GPUs, which would enable even faster and larger
simulations; validating and tuning the parameters of our model using data such as
resting state and functional MRI and electrocorticography (ECoG); integrating
our method with other computational models of neural dynamics and plasticity
to simulate brain function and learning; and extending our method into 3D to
include other brain regions, such as the subcortical brain structures.
Acknowledgements. The authors wish to thank Drs. Jun Igarashi and Hiromichi
Tsukada for their valuable discussions on the topic of brain simulation. This research
was supported by the program for Brain Mapping by Integrated Neurotechnologies
for Disease Studies (Brain/MINDS) from the Japan Agency for Medical Research
and Development, AMED. Grant number: JP15dm0207001 to A.W. and R.G.,
JP19dm0207088 to K.N.
References
1. Numba: A High Performance Python Compiler. https://numba.pydata.org/.
Accessed 30 Apr 2023
2. The Brain/MINDS Data Portal. https://dataportal.brainminds.jp. Accessed 30
Apr 2023
3. Atapour, N., Majka, P., Wolkowicz, I.H., Malamanova, D., Worthy, K.H., Rosa,
M.G.P.: Neuronal distribution across the cerebral cortex of the marmoset monkey
(Callithrix jacchus). Cereb. Cortex 29(9), 3836–3863 (2019). https://doi.org/10.
1093/cercor/bhy263
4. Dahlem, M.A., Isele, T.M.: Transient localized wave patterns and their application
to migraine. J. Math. Neurosci. 3(1), 7 (2013). https://doi.org/10.1186/2190-8567-
3-7
5. Hata, J., et al.: Multi-modal brain magnetic resonance imaging database covering
marmosets with a wide age range. Sci. Data 10(1), 221 (2023). https://doi.org/10.
1038/s41597-023-02121-2
6. Sanz Leon, P., et al.: The virtual brain: a simulator of primate brain network
dynamics. Front. Neurosci. 7, 10 (2013). https://doi.org/10.3389/fninf.2013.00010
7. Sanz-Leon, P., Knock, S.A., Spiegler, A., Jirsa, V.K.: Mathematical framework
for large-scale brain network modeling in the virtual brain. NeuroImage 111,
385–430 (2015). https://doi.org/10.1016/j.neuroimage.2015.01.002. https://www.
sciencedirect.com/science/article/pii/S1053811915000051
8. Stefanescu, R.A., Jirsa, V.K.: A low dimensional description of globally coupled
heterogeneous neural networks of excitatory and inhibitory neurons. PLoS Comput.
Biol. 4(11), 1–17 (2008). https://doi.org/10.1371/journal.pcbi.1000219
9. Van Essen, D.C.: Cortical cartography and caret software. NeuroImage 62(2),
757–764 (2012). https://doi.org/10.1016/j.neuroimage.2011.10.077. https://www.
sciencedirect.com/science/article/pii/S1053811911012419. 20 YEARS OF fMRI
10. Woodward, A.: The NanoZoomer artificial intelligence connectomics pipeline for
tracer injection studies of the marmoset brain. Brain Struct. Funct. 225(4), 1225–
1243 (2020). https://doi.org/10.1007/s00429-020-02073-y
11. Woodward, A., et al.: The Brain/MINDS 3D digital marmoset brain atlas. Sci.
Data 5, 180009 (2018). https://doi.org/10.1038/sdata.2018.9
12. Xu, B., Binczak, S., Jacquir, S., Pont, O., Yahia, H.: Isolation and characterization
of plasmid deoxyribonucleic acid from streptomyces fradiae. In: Annual Interna-
tional Conference of the IEEE Engineering in Medicine and Biology Society, pp.
4334–4337 (2014). https://doi.org/10.1109/EMBC.2014.6944583
Construction of a Novel Data Set
for Pedestrian Tree Species Detection
Using Google Street View Data
1 Introduction
Computer vision applied to object detection has improved in performance with
the development of larger deep-learning models such as the YOLO family of
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 332–344, 2023.
https://doi.org/10.1007/978-3-031-45382-3_28
methods [16]. The subsequent addition of specialised processing units, such as
tensor processing units (TPUs) and neural processing units (NPUs), to computing
devices has made it possible to add AI capabilities to small
devices while maintaining a modest level of power consumption [12]. These devel-
opments have enabled the deployment of AI models for low-power edge devices
[20]. One of the major limitations of deep-learning models is the need for a consid-
erable amount of data. Although a vast amount of data is already available, every
problem is different, and maximising detection rates under uncontrolled condi-
tions requires domain-specific data. Therefore, constructing custom datasets is
an essential step in training models for niche computer vision problems. How-
ever, labelling, reviewing, and validating the data requires considerable time and
effort. It is preferable to avoid some of this work by using available datasets and
customizing the data to meet specific requirements [3]. Once a custom dataset
is constructed and validated using a state-of-the-art object detection method, it
can be implemented using off-the-shelf hardware, according to the user’s specific
needs. One object detection application is tree censuses to assess the number of
tree species in urban environments. Previous deep learning applications for tree
censuses applied more cumbersome deep learning models [19] and often utilised
Google Street View (GSV) data [9,15]. In this study, we train deep-learning
models for arbitrary tree species detection from pedestrian view and apply one
of these models for on-the-edge inference using an Nvidia Jetson device¹. The
main contributions of this study are as follows:
– We present a novel data set unique to urban trees within the New Zealand
ecosystem. The data acquisition methodology described in this paper consid-
ers data collection at scale using easily accessible data sources, and we capture
images and related metadata for localised urban trees linked to a systematic
census.
– We demonstrate the effectiveness of this methodology combined with data
sets from previous authors to pre-train a model to accelerate the annotation
process.
– We deploy a trained model to an edge device that is capable of providing
inference on live streams for detecting multiple urban tree species at near
real-time speeds.
2 Methodology
Since there was (at the time of writing) no systematic, publicly available street-
view image database of New Zealand urban trees, we acquired and annotated
images of trees within the Auckland Central region (comprising the Auck-
land Central Business District (CBD) and central suburban areas) using GSV
images, with a specific focus on three tree species: (i) Metrosideros excelsa (or
¹ https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/.
334 M. Ooi et al.
Pohutukawa), (ii) Cordyline australis (or Cabbage Tree), and (iii) Vitex lucens
(or Puriri). These species were selected because of their visual distinctness, prolif-
eration in urban and suburban Auckland, and because they are evergreen species
and hence are not affected by seasonal variation in appearance.
To build our street-view image database, we obtained an urban tree register
from the University of Auckland GeoDataHub [14]. This register contains a list
of identified tree assets within the Greater Auckland region. Each identified tree
asset was recorded using a unique table key, a unique asset ID, the species and
common name of the tree, and its geographical coordinates. From the 198,065
records in the urban tree register, we discarded all entries that did not record
the species of a tree asset, leaving 178,215 records.
The geographic coordinates for each of the remaining tree assets were then
submitted to the GSV Street View Static API [1] which searched for the nearest
GSV panorama within a 50-metre radius and returned its metadata, including
a unique panorama ID, the geographical coordinates of the panorama location,
and the year and month of the panorama capture. A total of 12,353 tree assets
within the register failed to return the nearest panorama ID, either because
no panorama existed within 50 m of the location of the tree asset or because
the GSV Street View Static API timed out, resulting in a total of 165,862 tree
assets successfully matched to a panorama. Tree assets that contained invalid
species labels (e.g., “Not Applicable” or “Unknown”, totalling 5,720 records) were
removed and 20,257 tree assets that were located within the Auckland CBD
and central suburban areas (defined as areas that were within the 2014 Auck-
land Central General Electorate District [10], excluding islands or inlets) were
selected. The purpose of this exclusion was to focus on tree specimens located in
urban or suburban settings. Finally, the remaining tree assets were divided into
specimens that belonged to species of interest to us (3,354) and other tree species
(16,903). These tree assets were identified by their common names, owing to the
degree of variation in species spellings and granularity (e.g., identification of genus
only, identification of species, and identification of specific cultivars). From the loca-
tions of the 3,354 tree records belonging to the three selected species within our
region of interest, we selected 1,249 unique GSV panoramas associated with the
locations of these tree records as our sample for further analysis.
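The per-asset metadata query can be sketched as a request URL for the Street View Static API's metadata endpoint (the exact `radius` and `source` parameters shown are our assumptions about the query used):

```python
from urllib.parse import urlencode

METADATA_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview/metadata"

def metadata_url(lat, lng, api_key, radius=50):
    """Build a metadata query for the nearest panorama within `radius` metres
    of a tree asset's geographic coordinates."""
    params = {
        "location": f"{lat},{lng}",
        "radius": radius,            # search radius in metres
        "source": "outdoor",         # prefer outdoor imagery (assumption)
        "key": api_key,
    }
    return f"{METADATA_ENDPOINT}?{urlencode(params)}"
```

The JSON response of a successful request contains the panorama ID, its coordinates, and the capture date used to build the register described above.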
A potential concern for street-view imagery is leaf abscission due to seasonal weather changes. The three selected tree species
(Pohutukawas, Cabbage Trees and Puriris) are evergreen plants; therefore, we
would not expect detection or classification tasks on these species to be sig-
nificantly affected by seasonal variation. Finally, the majority of the selected
panoramas (in excess of 86%) were acquired in 2022, indicating a good degree
of recency in our resulting street-view image dataset.
For image annotation, we first trained the YOLO-v7 model on the urban tree
dataset released by Wang et al. [18]. We used Wong et al.’s implementation of
the YOLO-v7 model [7], with the default YOLO-v7 model and hyper-parameter
settings, and initialised the model with pretrained weights on the COCO dataset.
The purpose of this trained model is to produce an initial set of bounding boxes
for use in the labelling process.
We used an open-source data labelling solution, Label Studio [6], to edit
the generated bounding boxes, produce new bounding boxes for trees that were
not detected by the initial model, and provide class labels corresponding to
tree species. An additional label, “other trees” was assigned to all bounding
boxes of trees that did not belong to one of the three species of interest. Only
trees containing a visible section of a branch (or trunk) and crown within the
image were annotated. Along with the classification label, we also included
additional metadata with each bounding box indicating whether the tree was in a
state of leaf abscission, was occluded (and if so, whether it was occluded by a tree
or by another object), and was only partially visible (and if so, which parts of the
tree were visible).
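Feeding the initial model's detections into the labelling tool can be sketched as a conversion to Label Studio's JSON task format (a hedged example; the `from_name`/`to_name` values must match the project's labelling configuration):

```python
def to_labelstudio_task(image_url, detections, img_w, img_h):
    """Wrap pixel-space model detections as a Label Studio task with predictions.

    detections: iterable of (x_min, y_min, x_max, y_max, class_name) tuples."""
    results = []
    for x1, y1, x2, y2, cls in detections:
        results.append({
            "from_name": "label",
            "to_name": "image",
            "type": "rectanglelabels",
            "original_width": img_w,
            "original_height": img_h,
            "value": {                       # Label Studio expects percentages
                "x": 100.0 * x1 / img_w,
                "y": 100.0 * y1 / img_h,
                "width": 100.0 * (x2 - x1) / img_w,
                "height": 100.0 * (y2 - y1) / img_h,
                "rectanglelabels": [cls],
            },
        })
    return {"data": {"image": image_url}, "predictions": [{"result": results}]}
```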
horizontal flips, cropping, hue and saturation adjustments, Gaussian noise, and
affine scaling and translations. No augmentation was applied to the testing data
to ensure that the test dataset resembled real-world data encountered during
detection or inference.
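Two of the listed augmentations, horizontal flipping and Gaussian noise, can be sketched in NumPy (illustrative only; the study's actual augmentation pipeline and parameters may differ):

```python
import numpy as np

def augment(img, rng):
    """Random horizontal flip plus additive Gaussian pixel noise (training only)."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                       # flip along the width axis
    out += rng.normal(0.0, 5.0, size=out.shape)     # sigma = 5 is illustrative
    return np.clip(out, 0, 255).astype(np.uint8)
```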
This section describes the process of training various object detection models
on the collected data set for two tasks: single-class detection and multi-class
detection (i.e. detection of trees by labelled species). YOLO-v7, a member of the
You Only Look Once (YOLO) family of models, was selected for this task. First
published by Redmon et al. [13], YOLO is a single-stage object detection model
that treats the object detection problem as a regression problem, simultaneously
predicting bounding box coordinates and class probability scores from a single
convolutional network.
3.1 YOLO-v7
The changes proposed by Wang et al. [16] in YOLO-v7 reduced the required
parameters and computation, providing a better balance between inference speed
and accuracy compared with prior versions. We briefly discuss some of the con-
tributions of Wang et al. [16] in their paper.
E-ELAN. Wang et al. [17] used gradient path analysis to study the effects of
stacking increasingly large numbers of computational blocks, examining the
impact on accuracy and convergence. This led to a new network design paradigm,
gradient path driven design, which emphasises the source of gradients and how
they are updated during training. Based on this network
design strategy, the authors proposed E-ELAN (Extended ELAN) which, in
addition to preventing the shortest gradient path from increasing too rapidly,
used “expand, merge and shuffle cardinality” to allow the network to learn with-
out destroying the original gradient path.
Fig. 1. Figure 1a illustrates traditional lead and auxiliary label assignment, with pre-
dictions and loss calculations being performed independently. Figure 1b illustrates the
label assignment used by [16], showing the shared prediction between the lead and
auxiliary heads and the use of fine and coarse labels.
3.3 Evaluation
Validation and Test Metrics. In this section, we present our evaluation of
the trained models before and after hyper-parameter evolution. For each model
iteration, we report the mAPs of the best epoch, defined as the epoch that
produced the highest fitness value, based on the function defined in Eq. 1.
We can see from Table 2 that for single-class detection, the YOLO-v5s model
exceeded the performance of the YOLO-v7 tiny model but performed poorly in
the multi-class detection task. The YOLO-v7 model had the highest mAP in
both tasks, which was offset by slower inference speeds than either the YOLO-v5s or
YOLO-v7 tiny models. However, the incremental improvement in mAP for the
YOLO-v7 model is relatively small compared to that of the YOLO-v7 tiny model.
The relative success of the smaller and more parsimonious model suggests that
the current volume and variety of training data are still insufficient to support a
larger model, leaving the more complex YOLO-v7 model prone to overfitting.
Fig. 2. Portable computer vision system (Left) with example frames of tree detection
results obtained using the multi-class YOLO-v5s model.
Figure 3 shows the selection of successful detections (a) and poor detections
(b) on images retrieved from Google Image Search using multi- and single-class
models, respectively. We see in the successful detections that there were no false
positives and, in all but one case, the multi-class model successfully identified
the correct species of trees. The false negatives in these detections are likely
attributed to either the presence of occlusion by foreign objects (as in the traffic
poles in image 3 of the single-class model) or occlusions by other trees or foliage,
resulting in either major features of the tree being obscured by foliage (in image
5) or lighting being obscured (in image 3).
However, the poor detections exhibited several detection failures. In addition
to false negatives, some of the detections failed to properly separate instances,
produced false positives from green shrubbery, misclassified species, and pro-
duced overlapping predictions for the same object. In image 2, the single-class
model failed to detect any objects, despite the clearly defined tree objects.
Because these detections were performed using default sensitivity and IoU
threshold settings (for non-max suppression), refinement of these settings would
likely produce better results, particularly for false negatives or overlapping pre-
dictions. Overlapping predictions of different classes may also be improved using
class-agnostic non-max suppression.
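Class-agnostic non-max suppression simply ignores the class label when suppressing overlaps; a minimal NumPy sketch:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def class_agnostic_nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS that suppresses overlaps regardless of predicted class."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep
```

Running the class-wise detections through a single suppression pass like this removes duplicate boxes of different classes over the same tree.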
4 Discussion
Volume of Training Data. The data set curated and used in this study was
acquired and labelled over a relatively short span of time by a limited number
of people. A larger volume of data would reduce the amount of class imbalance
across the dataset, preventing over-weighting of the larger classes and reducing
the errors observed in Fig. 3. Additionally, a larger and more diverse dataset
would improve the model performance during inference by allowing it to learn
a greater range of representations and patterns from the underlying data and
reduce the generalisation error.
Data Validation. Whilst the specific species being labelled for fine-grained detec-
tion are visually distinct and every care was taken to provide accurate species
labels, the resulting labels should be inspected and validated by a domain expert.
This will help reduce inaccuracies due to label noise in the model training
process.
Tree Appearance Variation Over Time. The tree species specified in this study
have no seasonal variation in appearance. However, if the study is expanded to
incorporate other prominent species, consideration would need to be given to
the effect of season, age and health on tree appearance and how this can be
efficiently tracked over time.
Fig. 3. Out-of-sample detections showing (a) high quality detections and classifications
and (b) low quality detections and classifications by the multi and single-class models.
Preprocessing. During the real-world test using the NVIDIA Jetson Nano device,
we observed that the species labels assigned to detections were sensitive to ambi-
ent exposure. The introduction of a pre-processing step with adaptive exposure
correction (or alternatively, adjustment of the camera aperture settings) would
likely assist in obtaining greater stability in classification.
5 Conclusion
This study presented our methodology for constructing a novel dataset for
the task of fine-grained detection of select urban tree species in New Zealand.
We created this detection dataset of geographically unique urban tree species
by combining pre-existing tree inventory data with Google Street View imagery,
an easily accessed data source that allows the acquisition of image data and
metadata at scale. We also
demonstrated the effectiveness of the current generation of detection models for
performing species detection and the viability of performing inference on edge
devices at near real-time speeds.
References
1. Alphabet Inc: Google Street View Static API. https://maps.googleapis.com/maps/
api/streetview/. Accessed 25 Mar 2023
2. Beery, S., et al.: The auto arborist dataset: a large-scale benchmark for multiview
urban forest monitoring under domain shift. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21294–
21307 (2022)
3. Braithwaite, J.M.: Chapter 17: challenges and payoffs of building a dataset from
scratch, pp. 300–316. Edward Elgar Publishing, Cheltenham, UK (2022). https://
doi.org/10.4337/9781839101014.00028
4. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning
augmentation policies from data. CoRR (2018)
5. Fu-En, W.: Equirec2perspec. https://github.com/fuenwang/Equirec2Perspec.
Accessed 9 Apr 2023
6. Heartex Inc: Label studio. https://labelstud.io/. Accessed 9 Apr 2023
7. Kin Yiu, W.: YOLOv7: implementation of paper (2022). https://github.com/
WongKinYiu/yolov7. Accessed 9 Apr 2023
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/
10.1145/3065386
9. Lumnitz, S.: Mapping urban trees with deep learning and street-level imagery.
Ph.D. thesis, University of British Columbia (2019). http://dx.doi.org/10.14288/
1.0387513
10. NZ, S.: General electoral district 2014. https://datafinder.stats.govt.nz/layer/
104062-general-electoral-district-2014/. Accessed 3 Apr 2023
11. Orlita, T.: Street View Download 360 (2016). https://svd360.istreetview.com/.
Accessed 9 Apr 2023
12. Pias, M., Botelho, S., Drews, P.: Perfect storm: DSAs embrace deep learning for
GPU-based computer vision. In: 2019 32nd SIBGRAPI Conference on Graphics,
Patterns and Images Tutorials (SIBGRAPI-T), pp. 8–21 (2019). https://doi.org/
10.1109/SIBGRAPI-T.2019.00007
13. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified,
real-time object detection (2015). https://doi.org/10.48550/ARXIV.1506.02640
14. University of Auckland: Geodatahub. https://geodatahub.library.auckland.ac.nz/.
Accessed 25 Mar 2023
15. Velasquez, L., Echeverria, L., Etxegarai, M., Anzaldi Varas, G., Miguel, S.D.: Map-
ping street trees using google street view and artificial intelligence (2022)
16. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Yolov7: trainable bag-of-freebies
sets new state-of-the-art for real-time object detectors (2022). https://doi.org/
10.48550/ARXIV.2207.02696
17. Wang, C.Y., Liao, H.Y.M., Yeh, I.H.: Designing network design strategies through
gradient path analysis (2022). https://doi.org/10.48550/ARXIV.2211.04800
18. Wang, Y., et al.: Utd dataset (2022). https://github.com/yz-wang/OD-UTDNet.
Accessed 23 Oct 2022
19. Wegner, J.D., Branson, S., Hall, D., Schindler, K., Perona, P.: Cataloging public
objects using aerial and street-level images - urban trees. In: 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 6014–6023 (2016)
20. Zou, Z., Chen, K., Shi, Z., Guo, Y., Ye, J.: Object detection in 20 years: a sur-
vey. Proc. IEEE 111(3), 257–276 (2023). https://doi.org/10.1109/JPROC.2023.
3238524
Texture-Based Data Augmentation
for Small Datasets
1 Introduction
This paper proposes a domain-specific out-of-domain texture-based data aug-
mentation technique for small dataset training. Data augmentation is a tech-
nique for supplementing small datasets by artificially generating “new” training
images. The increased availability of large public general datasets has signif-
icantly contributed to the successful application of deep convolutional neural
networks (CNNs) to difficult computer vision tasks. However, it is difficult to
create large custom datasets for specific domains. For example, medical datasets
are typically very small due to legal and privacy regulations in the medical world.
When training models from scratch, undesirable behaviour occurs, such as over-
fitting. Recent studies indicate that overfitting is not an extensive issue [15,25];
however, these studies are restricted to large datasets. When training on small
datasets, issues such as slow convergence, vanishing gradients, sensitive param-
eter tuning, etc. are amplified during optimization. In fact, [21] notes that the
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 345–356, 2023.
https://doi.org/10.1007/978-3-031-45382-3_29
346 A. Dash and A. B. Albu
dataset, Oxford IIIT-Pet, and evaluating against a subset of the Open Images
Database to simulate large data variation in a small dataset.
2 Related Work
Image mixing is a type of data augmentation where entire images or patches
are interpolated or replaced, thus obscuring or discarding some image areas;
image mixing can be thought of as performing dropout in the image space. Data
augmentation does not affect inference, unlike other regularization methods like
dropout. This section discusses related works on data augmentation based on
image mixing.
Data disruption is a domain-specific image-mixing technique that uses out-of-domain
samples to represent object obstruction or occlusion in the dataset. Cutout [4]
performs a dropout-like process where a square region of the training image is
zero-masked. The authors reported a 2.56% test error rate on the CIFAR-10
dataset [9]. Zhong et al. [27] use random pixel values or the ImageNet mean
pixel value instead of a zero-mask (with random pixel values yielding better
results). The authors reported an error rate reduction from 5.17 to 4.31% on the
CIFAR-10 dataset.
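The two data-disruption variants can be sketched together (a square zero-mask for Cutout [4], random pixel values for the variant of Zhong et al. [27]; the mask size and placement policy here are illustrative):

```python
import numpy as np

def cutout(img, size, rng, fill="zero"):
    """Mask a random square region: zeros (Cutout) or random pixel values
    (the Random-Erasing-style variant)."""
    out = img.copy()
    h, w = img.shape[:2]
    cy, cx = rng.integers(h), rng.integers(w)      # centre may lie near a border
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, w)
    if fill == "zero":
        out[y1:y2, x1:x2] = 0
    else:
        out[y1:y2, x1:x2] = rng.integers(0, 256, size=out[y1:y2, x1:x2].shape)
    return out
```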
In-domain image mixing techniques use multiple images from the training
dataset to construct a unique training sample. Unlike conventional data warp-
ing techniques, such as random cropping or horizontal flipping, mixing images
together seems counter-intuitive as the resulting images can be difficult for
humans to interpret. However, these images successfully address misclassifica-
tions due to class competition.
Mixup [26] encourages a linear relationship between the soft labels of the
training samples, allowing the prediction confidence to have a linear transi-
tion between classes, and thus improving optimization. The authors use alpha-
blending to interpolate and superimpose two random training samples; the newly
created training images increase data coverage, as well as emulate adversarial
examples. They use soft labels by mixing class labels. Soft labels are interme-
diate probabilities, unlike binary or one-hot label encoding [8]. Mixup sets the
soft labels to the mixing/blending proportions, governed by the hyperparameter
α. The authors reported a decrease in the error rate of 2.7% on the CIFAR-
10 dataset over previous state-of-the-art. Summers and Dinneen [20] explored
non-linear mixed-example localized image mixing. The authors explored mul-
tiple methods, such as “VH-Mixup” and “VH-BC+” which concatenated two
training samples in four grids, two of which use Mixup alpha-mixing. Summers
and Dinneen reported a decrease in the error rate of 3.8% on the CIFAR-10
dataset for VH-Mixup/BC+ over previous state-of-the-art. Takahashi et al. [22],
with RICAP, depart from alpha-blending [20,26]; they concatenate four training
samples, cropped together to form a new training sample. They also use soft
labels, proportional to the relative random crop sizes. The authors reported a
test error rate of 2.85% on CIFAR-10 over previous state-of-the-art.
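Mixup's alpha-blending of images and labels can be sketched as follows (λ drawn from Beta(α, α) as in [26]; one-hot labels assumed):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two training samples and their labels; lam ~ Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2      # blended image
    y = lam * y1 + (1.0 - lam) * y2      # soft label
    return x, y, lam
```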
Data disruption methods cause opaque obstruction of objects, which does
not take into account partially translucent obstructions, such as shadows or
overlaid classes (i.e. a dog behind a gate). Label noise can also be introduced
if a key feature is obstructed for a given class. Label noise is also an issue for
data-dependent regularization augmentation. Mixup employs an alpha-mixing
strategy which may generate local features not present in the dataset. It was noted
in [5] that Mixup can cause underfitting resulting from the incongruities between the
synthetic soft-labels and training labels. RICAP [22], which generates non-linear
four-grid training images and corresponding soft-labels, also exhibits label con-
flicts due to randomized crop selection on the training samples. RICAP preserves
local features by using random crops (spatial blending), but there are no guar-
antees that the foreground object is predominantly displayed in the crop, which
may cause low confidence logits due to class competition. For datasets that con-
tain classes with features very similar to common background scenes (i.e. shower
curtains), if the random crop contains too much background then not only will
too few object features be learned by the CNNs, but the image would essen-
tially be mis-annotated (i.e. the wrong label is applied to that image).
Our proposed approach is domain-specific, but applies an out-of-domain
localized sparse disruption to the entire image globally, unlike [4,27]. The local-
ized sparse disruptions are generated using a set of textures. The textures provide
varying local and global patterns to the obstructions. The chosen textures occur
naturally in the real-world (see Fig. 1) mimicking complex backgrounds and nat-
ural occlusion.
Fig. 1. Spectrum of texture categories from highly structured (left) to random (right).
Regular textures have regular patterns. The regularity of these patterns varies
from repeating duplicate structures to repeating similar structures. Stochastic textures look
like noise, with very small repeating primitives.
3 Proposed Method
We propose a texture-based method for deep convolutional neural networks (CNNs)
which performs label-preserving augmentation for small datasets. Label noise is
usually present in large datasets, which have limited expert curation due to their
size. With small datasets, label noise and label preservation become critical to
training generalizable and robust models; small datasets allow for careful curation,
but this advantage is lost if the data augmentation contributes significantly
to label noise.
Texture-Based Data Augmentation for Small Datasets 349
Our approach consists of three modules: (1) generation of the texture-based
perturbation maps, (2) augmentation via image mixing and (3) modification of
the loss function using a perceptual difference regularization.
The first step is the conversion of natural texture images into templates used for
augmentation. We consider five classes of natural textures: regular, near regu-
lar, irregular, near stochastic, and stochastic. The texture spectrum is shown in
Fig. 1. At one end of the spectrum, regular textures are arrangements of texels
(elementary texture elements) exhibiting a high spatial order, while at the other
end of the spectrum, stochastic textures are random arrangements of texels.
Standard image processing steps are taken to generate the texture-based pertur-
bation maps from the original textured images. The gray-scale texture image is
processed using histogram equalization to enhance subtle textural patterns. A
bilateral filter [23] removes noise before an edge detection step using the
Laplacian operator is applied. The texture is re-scaled to the range [0, 1] so that
pixels belonging to uniform regions are zero-valued and thus have no effect on
the training image. The edges are important for our algorithm, as they represent
local regions of rapid change and constitute potential features to "fool" the model.
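The pipeline above can be sketched in pure Python. This is a sketch under stated assumptions: a simple box blur stands in for the bilateral filter, and the exact kernels and boundary handling are illustrative choices, not the authors' implementation:

```python
def perturbation_map(gray):
    """Sketch of the texture-to-perturbation-map pipeline. 'gray' is a
    list of rows of ints in 0..255. A box blur stands in for the
    (edge-preserving) bilateral filter of the paper."""
    h, w = len(gray), len(gray[0])

    # 1) Histogram equalization to enhance subtle textural patterns.
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    cdf, total, n = [], 0, h * w
    for c in hist:
        total += c
        cdf.append(total)
    eq = [[round(255 * cdf[p] / n) for p in row] for row in gray]

    # 2) Denoise (box blur as a stand-in for the bilateral filter [23]).
    def at(img, v, u):  # clamped pixel access
        return img[min(max(v, 0), h - 1)][min(max(u, 0), w - 1)]
    blur = [[sum(at(eq, v + dv, u + du) for dv in (-1, 0, 1)
                 for du in (-1, 0, 1)) / 9.0
             for u in range(w)] for v in range(h)]

    # 3) Laplacian edge detection (4-neighbour kernel, magnitude only).
    lap = [[abs(4 * at(blur, v, u) - at(blur, v - 1, u) - at(blur, v + 1, u)
                - at(blur, v, u - 1) - at(blur, v, u + 1))
            for u in range(w)] for v in range(h)]

    # 4) Rescale to [0, 1] so uniform regions stay zero-valued.
    mx = max(max(row) for row in lap) or 1.0
    return [[p / mx for p in row] for row in lap]
```

A uniform input yields an all-zero map (no effect on the training image), while any edge in the texture produces non-zero perturbation values, as the text requires.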
A random pair consisting of a training image from the original dataset and
a texture-based perturbation map is selected and resized to matching spatial
dimensions. The training sample is converted to the L*a*b* colorspace. The
L*-channel represents the perceptual lightness of the image. We maintain the
plausibility of the data augmentation by only generating “textured” images, as
$TV(\hat{I}) = \frac{1}{HWC} \sum_{v,u,w} \big[ (\hat{I}(v, u+1, w) - \hat{I}(v, u, w))^2 + (\hat{I}(v+1, u, w) - \hat{I}(v, u, w))^2 \big]$  (2)

$Loss(I(x), \hat{I}(x)) = Loss(I(x)) + \frac{1}{4e^{\lambda}} \, \delta \, TV(\hat{I}(x))$  (3)
where Loss is the task loss function (e.g., cross-entropy loss), λ is the learning rate,
δ ∈ (0, 1] is a tuning hyperparameter, and TV is the penalty regularization
term (Eq. 2). The factor 1/(4e^λ) acts as an adaptive weighting term for the task loss
function and was determined experimentally.
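Equations (2) and (3) can be sketched directly in pure Python (the nested-list image layout and the treatment of the last row/column are illustrative assumptions):

```python
import math

def total_variation(img):
    """TV penalty of Eq. (2): averaged squared differences between
    horizontally and vertically adjacent pixels (img: H x W x C lists)."""
    H, W, C = len(img), len(img[0]), len(img[0][0])
    tv = 0.0
    for v in range(H):
        for u in range(W):
            for w in range(C):
                if u + 1 < W:
                    tv += (img[v][u + 1][w] - img[v][u][w]) ** 2
                if v + 1 < H:
                    tv += (img[v + 1][u][w] - img[v][u][w]) ** 2
    return tv / (H * W * C)

def regularized_loss(task_loss, tv_value, delta, lam):
    """Eq. (3): task loss plus the adaptively weighted TV penalty."""
    return task_loss + delta * tv_value / (4 * math.exp(lam))
```

A smooth augmented image incurs no penalty, while a noisy one is penalized, which is how the regularizer discourages implausible textured outputs.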
4 Experimental Results
In this section we describe the experimental methodology for evaluating our
approach against RICAP and Mixup for two-class classification tasks using
the “Bird-or-Bicycle” [2] and Oxford-IIIT-Pet [14] datasets. We also evaluated
on a large random selection of “dog” or “cat” images from the Open Images
Dataset [10] (“Dog-or-Cat”).
For the texture-based augmentation, we use five different classes of natural tex-
tures: regular (R), near regular (NR), irregular (IR), near stochastic (NS), and
stochastic (S). The texture spectrum is shown in Fig. 1; we use textures that
vary from having high local structures (high local patch similarity) to textures
that are very noisy. We constructed a small textures dataset from the following
datasets: Kylberg [11], the University of Illinois Urbana-Champaign (UIUC) tex-
ture dataset [12], and the Describable Textures Dataset (DTD) [3]. Two DTD classes
were chosen per texture category, with each texture category consisting of
20 images, for a total of 100 images. To evaluate the behaviour of the network
when using our method with different texture categories, we trained a model
using the training hyperparameters discussed in the previous section.
When using the same hyperparameters, the textures have a mean test error
rate of 0.2035 ± 0.0132. We observed cases in which the test error rate for the
textured (augmented) validation/test images was lower than the test error rate
for the unaugmented sample images (see "regular (R)" in Table 1).
This seems to indicate that, for textures with low local variance and high global
structure, neurons co-adapt too much and overfit, causing non-data-domain
features (i.e., texture features) to be learned.
The behaviour is greatly reduced at high λ and Δ and lower δ settings,
resulting in a decrease in the test error rate. The best hyperparameter set found
was λ = 1.0, Δ = 32, δ = 0.6 for the near stochastic (NS) class. When using
all (A) textures, λ and Δ needed to be reduced to 0.1 and 16 respectively, due
to the higher variation in training samples, with δ increased to 0.7. As the δ
value decreases, the results converge to the baseline, as the data augmentation
becomes negligible. If the Δ becomes too large, then smooth regions have more
influence on the augmentation and the data distortion causes label noise. Since
the stochastic textures have smaller smooth regions, the better performance at
higher Δ seems to confirm this theory.
Table 1. Test error rates (%) of the best set of hyperparameters (λ, Δ, δ) for each tex-
ture category, trained on the ResNet-50 model with the "Bird-or-Bicycle" dataset.
The test error rate is calculated for unaugmented (original) images and our texture-
based augmented images.
                               OxfordPet-B                         OxfordPet
Method                         OxfordPet-B  OxfordPet  Dog-or-Cat  OxfordPet  OxfordPet-B  Dog-or-Cat
Baseline                       5.25         5.22       28.60       5.22       7.00         31.07
+ mixup (α = 0.2)              6.75         5.57       31.06       4.32       9.11         29.60
+ RICAP (β = 0.2)              5.55         4.64       28.37       5.16       7.05         31.83
+ Ours+all (Δ = 16, δ = 0.6)   8.75         7.92       31.15       3.99       4.70         27.59
+ Ours+R (Δ = 16, δ = 0.6)     7.20         7.18       29.46       4.78       5.30         28.91
as the dataset increases in size, the cost-benefit of using a less sparse but more
structured texture increases. In Fig. 3, we provide images from the "Dog-or-Cat"
dataset with complex background textural patterns that were misclassified when
using the other data augmentation methods but correctly classified when using
our best method, trained on OxfordPet.
Fig. 3. Examples of images from the “Dog-or-Cat” dataset that were misclassified by
the ResNet-50 model trained using no augmentation (baseline), Mixup, and RICAP
but correctly classified using our method (using all textures) on the OxfordPet dataset
(see Table 3). Image class left to right: (a) Cat; (b) Cat; (c) Dog; (d) Dog
5 Conclusion
This paper proposed a texture-based domain-specific data augmentation tech-
nique applicable when training on small datasets for deep learning classifica-
tion tasks. Our method uses textures and focuses on label preservation to improve
generalization and optimization robustness over data-dependent augmentation
methods. Naturally occurring textures are used to apply patterned occlusion
to training images. Image-mixing augmentation on the light channel of a sam-
pled L*a*b* image creates a label-preserving training image. The mixing is con-
strained by a perturbation hyperparameter, Δ, and a randomly sampled sparse
texture map. We explored different texture taxonomies: regular, near regular,
irregular, near stochastic and stochastic. For tiny datasets like “Bird-or-Bicycle”,
we achieved a test error rate of 17.95% using the near stochastic texture, improv-
ing over the baseline by 0.52 pp. Experimentally, we evaluated the generaliza-
tion to out-of-distribution examples using Oxford-IIIT-Pet and Open Images
database (“Dog-or-Cat”). Using all available textures, we improved our test error
rate by 0.33 pp on Oxford-IIIT-Pet and 2.01 pp on “Dog-or-Cat” over RICAP.
In future work, we aim to overcome the weakness of using texture images by
using generated synthetic textures based on gradient information.
References
1. Bergstra, J., et al.: Making a science of model search: Hyperparameter optimization
in hundreds of dimensions for vision architectures. In: ICML, pp. 115–23. PMLR
(2013)
1 Introduction
Visual reasoning, the ability to reason about the visual world, encompasses var-
ious canonical sub-tasks such as object and attribute categorization, object and
relationship detection, comparison, and spatial reasoning. Solving this complex
task requires robust computational models that can effectively capture visual
cues and perform intricate reasoning operations. In recent years, deep learn-
ing approaches have gained prominence in tackling visual reasoning challenges,
with the emergence of foundation models playing a key role in advancing the
field. Among the deep learning techniques employed for visual reasoning tasks,
integrated attention networks, such as transformers [16], have demonstrated
remarkable success in natural language processing and computer vision appli-
cations, including image classification and object detection. These models lever-
age attention mechanisms to capture long-range dependencies and contextual
relationships for language and vision inputs.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 357–369, 2023.
https://doi.org/10.1007/978-3-031-45382-3_30
358 W. Aissa et al.
However, despite their impressive
performance, reasoning solely based on integrated attention networks may be
susceptible to taking “shortcuts” and relying heavily on dataset bias. There is
an increasing need to address the issue of interpretability and explainability,
enabling researchers to understand the reasoning behind the model's predic-
tions. By leveraging transformer models as a backbone for VQA systems to
encode language and visual information, we can adopt modular approaches that
enable an improved understanding and transparency in the visual reasoning pro-
cess. Modular approaches, such as neural module networks (NMNs), break down
the problem (question) into smaller sub-tasks which can be independently solved
and then combined to produce the final answer. The modular design confers the
advantage of greater transparency and interpretability, as the model explicitly
represents the various sub-tasks and their interrelationships.
In this paper we aim to bridge the gap between accuracy and explainabil-
ity in visual question answering systems by providing insights into the reason-
ing process. To accomplish this, we propose an enhanced training approach for
our modular network using a teacher forcing training technique [17], where the
ground truth output of an intermediate module is used to guide the learning
process of subsequent modules during training. By employing this approach, our
module gains the ability to learn its reasoning sub-task in both stand-alone
and end-to-end manners, leading to improved training efficiency.
To evaluate the effectiveness of our approach, we conduct experiments using
training programs sourced from the GQA dataset. This dataset provides a diverse
range of scenarios, enabling us to thoroughly assess the capability of our app-
roach to reason about the visual world. Through experimentation and analysis,
our results demonstrate the effectiveness of our proposed method in achieving
explainability in VQA systems while maintaining a high degree of effectiveness.
In summary, this work makes two key contributions: first, the utilization of
decaying teacher forcing during training, which enhances generalization capabil-
ities, and second, the incorporation of cross-modal language and vision features
to capture intricate relationships between text and images, resulting in more
accurate and interpretable results.
The remaining sections of the paper are structured as follows: Sect. 2 pro-
vides a discussion on related work, Sect. 3 presents our cross-modal neural mod-
ule network framework, and Sect. 4 introduces our teacher guidance procedure.
In Sect. 5, we outline the validation protocol, followed by the presentation of
experimental results in Sect. 6. Finally, in Sect. 7, we conclude the paper by
synthesizing our findings and discussing potential future developments.
2 Related Work
In this section, we begin by examining integrated and modular approaches
employed in visual reasoning tasks. We then introduce the teacher forcing train-
ing method and its application to modular neural networks.
Transformer Networks. Transformers [16] have been widely applied as foun-
dation models for various language and vision tasks due to their remarkable
performance.
Multimodal Teacher-Guided Compositional Visual Reasoning 359
They have also been adapted for reasoning problems like Visual
Question Answering (VQA). Notably, models such as ViLBERT [13], Visual-
BERT [12] and LXMERT [15] have demonstrated interesting performance on
popular VQA datasets like VQA2.0 [6] and GQA [8]. These frameworks follow a
two-step approach: first, they extract textual and image features. Word embed-
dings are obtained using a pre-trained BERT [5] model, while Faster RCNN
generates image region bounding boxes along with their corresponding visual
features. Subsequently, a cross-attention mechanism is employed to align the
word embeddings with the image features, leveraging training on a diverse range
of multi-modal tasks.
Despite the benefits of the integrated approaches, these models also have
notable drawbacks. One prominent limitation is their lack of interpretability,
making it challenging to understand—and debug, when necessary—the underly-
ing reasoning process. Moreover, these models often rely on “shortcuts” in the rea-
soning, which means learning biases present in the training data. Consequently,
their performance tends to suffer when confronted with out-of-distribution data,
as shown on GQA-OOD [9]. This research also emphasizes the importance of
employing high-quality input representations for the transformer model.
To address interpretability concerns, we use features produced by an off-the-
shelf cross-modal transformer encoder in a step-by-step explainable reasoning
architecture. This approach balances the power of the transformer model in
capturing relationships between modalities with the ability to understand the
reasoning process.
Neural Module Networks. To enhance transparency and emulate
human-like reasoning, compositional Neural Module Networks (NMNs) such as
those introduced by [7] and [11] break down complex reasoning tasks into more
manageable subtasks through a multi-hop reasoning approach. A typical NMN
comprises a generator and an executor. The generator maps a given question
to a sequence of reasoning instructions, known as a program. Subsequently, the
executor assigns each sub-task from the program to a neural module and prop-
agates the results to subsequent modules.
In a recent study by [4], a meta-learning approach is adopted within the NMN
framework to enhance the scalability and generalization capabilities of the result-
ing model. The generator decodes the question to generate a program, which is
utilized to instantiate a meta-module. Visual features are extracted through a
transformer-based visual encoder, while a cross-attention layer combines word
embeddings and image features. Although the combination of a generator and an
executor in NMNs may appear more intricate compared to an integrated model,
the inherent transparency of the “hardwired” reasoning process in NMNs has the
potential to mitigate certain reasoning “shortcuts” resulting from data bias.
A more recent study [1] has investigated the effects of curriculum learning
techniques in the context of neural module networks. The research demonstrated
that reorganizing the dataset to begin training with simpler programs and pro-
gressively increasing the difficulty by incorporating longer programs (based on
the number of concepts involved in the program) facilitates faster convergence
and promotes a more human-like reasoning process. This highlights the impor-
tance of curriculum learning in improving the training dynamics and enhancing
the model’s ability to reason and generalize effectively.
Interestingly, [10] demonstrated that leveraging the programs generated from
questions as additional supervision for the LXMERT integrated model led to a
reduction in sample complexity and improved performance on the GQA-OOD
(Out Of Distribution) dataset [9].
Building upon this, our work aims to capitalize on both the transparency
offered by NMN architectures and the high-quality transformer-encoded rep-
resentations by implementing a composable NMN that integrates multimodal
vision and language features.
Teacher Forcing. Teacher forcing (TF) [17] is a widely used technique in
sequence prediction or generation tasks, especially in RNNs with an encoder-
decoder architecture. It involves training the model using the true output as
novel input, which helps improve prediction accuracy. However, during infer-
ence, the model relies on its own predictions without access to ground-truth
information, leading to a discrepancy known as exposure bias.
Scheduled sampling (SS) is a notable approach to mitigating the train-test
discrepancy in sequence generation tasks [2]. It introduces randomness during
training by choosing between using ground truth tokens or the model’s predic-
tions at each time step. This technique, initially developed for RNN architec-
tures, has also been adapted for transformer networks [14], helping to align the
model’s performance during training and inference.
NMNs, on the other hand, are trained using only the output of a module
as input for the next module, which has drawbacks. Errors made by an inter-
mediate module can propagate to subsequent modules, leading to cumulative
bad predictions. This effect is particularly prominent during the early stages of
training when the model’s predictions are close to random.
NMNs can leverage the TF strategy to enhance their training process. Ini-
tially, training begins with a fully guided schema, where the true previous out-
puts are used as input. As training progresses, the model gradually transitions
to a less guided scheme, relying more on the generated outputs from previous
steps as input. This gradual reduction in guidance and increased reliance on
the model’s own predictions, named decaying TF, helps NMNs better learn and
adapt to the complexity of the task. With decaying TF, modules can conform
to their expected behavior for their respective sub-tasks.
Fig. 1. The proposed modular VQA framework. Plain arrows represent the output
flow, while dotted arrows represent the Multi-Task loss backward flow.
Fig. 2. The teacher guidance for the program execution process related to the question
‘On what is the animal to the right of the laptop sleeping?’. Plain arrows represent
input guidance and dotted arrows represent the output feedback.
interrupted at the first ground truth input. However, in the case of collaborative
module interactions without TF, the full back-propagation can be computed. The
intermediate outputs are preserved in continuous form throughout the program
execution, enabling the flow of backward gradients between modules. Errors
and updates can be propagated through the entire network, facilitating effective
learning and enhancing the overall performance of the NMN. We give details
about our guidance mechanism in the following subsections.
Input Guidance. The modules receive input guidance through decaying teacher
forcing (TF). As shown in Fig. 2, at each reasoning step t the executor randomly
decides whether to use the predicted output ô_{t−1} or the ground-truth output
o*_{t−1} from the previous module m_{t−1} as its input. This decision is made by
flipping a coin, where o*_{t−1} is chosen with probability ε and ô_{t−1} is chosen
with probability 1 − ε. The coin-flipping process for input selection occurs
at each reasoning step, allowing the model to train on various sub-programs.
The probability ε of selecting o*_{t−1} depends on the epoch number e. As training
progresses and the epoch number increases, ε decreases, giving more preference
to the module's predictions over the ground-truth intermediate outputs.
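The decaying-TF coin flip can be sketched as follows. The linear decay schedule is a hypothetical choice; the text only states that the probability of feeding the ground-truth output decreases with the epoch number:

```python
import random

def teacher_forcing_prob(epoch, num_epochs):
    """Hypothetical linear decay for the probability of feeding the
    ground-truth intermediate output (the paper does not give the
    exact schedule; this is one plausible choice)."""
    return max(0.0, 1.0 - epoch / num_epochs)

def select_input(predicted, ground_truth, epoch, num_epochs,
                 rng=random.random):
    """Coin flip at each reasoning step: ground truth with the decayed
    probability, the previous module's prediction otherwise."""
    eps = teacher_forcing_prob(epoch, num_epochs)
    return ground_truth if rng() < eps else predicted
```

Early in training the executor is fully guided (ground truth always chosen); by the final epoch it relies entirely on its own predictions, matching the inference-time setting.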
Output Feedback. We employ a multi-task (MT) loss approach to provide
feedback to the modules based on their outputs. The loss consists of a weighted
sum L = α·L_att + β·L_bool + γ·L_answer of individual losses for the attention modules,
boolean modules and answer modules, with α, β and γ scaling factors. Each
module is assigned its own average loss, considering its frequency of appearance,
to prevent overemphasis on frequent modules at the expense of infrequent ones.
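A sketch of this weighted multi-task loss with per-module-family averaging (the function and argument names are ours, not from the paper):

```python
def multi_task_loss(att_losses, bool_losses, ans_losses,
                    alpha, beta, gamma):
    """Weighted MT loss L = alpha*L_att + beta*L_bool + gamma*L_answer.
    Each module family contributes its *average* loss so that frequently
    invoked modules do not dominate infrequent ones."""
    def avg(losses):
        return sum(losses) / len(losses) if losses else 0.0
    return (alpha * avg(att_losses)
            + beta * avg(bool_losses)
            + gamma * avg(ans_losses))
```

Averaging within each family is the key design point here: without it, a program with many attention steps would implicitly up-weight the attention loss relative to the single answer loss.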
For Boolean modules, we rely on the provided answer to infer the module’s
output and generate the intermediate Boolean outputs. However, for attention
modules, it is necessary to establish correspondences between the bounding boxes
in the image graph and those obtained from Faster-RCNN, which brings us to
the issue of the ground-truth intermediate outputs, detailed in the following.
Soft Matching and Hard Matching. Two mapping techniques, namely hard
matching and soft matching, are employed to align the ground-truth bounding
boxes with those obtained from the feature extractor. In the hard matching
approach, a ground-truth bounding box bg is matched with the bounding box
o∗i from the feature extractor that has the highest Intersection over Union (IoU)
factor. On the other hand, the soft mapping matches bg with all o∗i that have
an IoU value above a threshold, resembling a multi-label classification task. The
choice of the matching technique directly affects the representation of the atten-
tion intermediate output vectors. Hard mapping produces one-hot-like vectors,
while soft mapping produces multi-label vectors, with one(s) for positive matching and
zeros for negative matching boxes. It is important to acknowledge that not all
modules have ground-truth outputs that can be extracted.
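The two matching schemes can be sketched as follows (the corner-coordinate box format and the soft-matching threshold value are illustrative assumptions):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def hard_match(gt_box, det_boxes):
    """One-hot vector: 1 only for the detection with the highest IoU."""
    best = max(range(len(det_boxes)),
               key=lambda i: iou(gt_box, det_boxes[i]))
    return [1.0 if i == best else 0.0 for i in range(len(det_boxes))]

def soft_match(gt_box, det_boxes, thresh=0.5):
    """Multi-label vector: 1 for every detection above the IoU
    threshold (threshold value is an assumption)."""
    return [1.0 if iou(gt_box, b) >= thresh else 0.0 for b in det_boxes]
```

As the text notes, hard matching yields one-hot-like targets while soft matching can mark several Faster-RCNN boxes as positive, turning the alignment into a multi-label problem.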
5 Protocol Design
Dataset and Metrics. The GQA balanced dataset [8] consists of over 1 mil-
lion compositional questions and 113,000 real-world images. The questions are
represented by functional programs that capture the reasoning steps involved
in answering them. To ensure consistent evaluation, the dataset authors suggest
using the testdev split instead of the val split when utilizing object-based fea-
tures due to potential overlap in training images. In line with the latter and
following LXMERT, our model is trained on the combined train+val set. For
testing, we evaluate the model’s performance on the testdev-all split from
the unbalanced set. This allows us to gather additional examples and gain a
comprehensive understanding of the NMN’s behavior. To simplify the module
structure in the GQA dataset, we consolidate specific modules into more general
ones based on similar operations. For example, modules like ChooseHealthier
and ChooseOlder are combined into ChooseAttribute module, with an argu-
ment txtm specifying the attribute to select. This reduces the number of modules
from 124 to 32. Our experiments directly utilize the pre-processed GQA dataset
programs, with a specific focus on evaluating the teacher forcing training on the
Program Executor module. While our system employs a transformer model as a
generator to convert the question into its corresponding program, this task is rel-
atively straightforward compared to the training of the executor. As in previous
studies [4,11], we achieve nearly perfect translation results on testdev-all.
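The module consolidation can be pictured as a lookup table (only the two examples given in the text are shown; the helper name is ours, and the full table covers the 124 → 32 reduction):

```python
# Hypothetical consolidation table: specialized GQA modules are folded
# into general ones, with a text argument (txt_m) carrying the attribute.
CONSOLIDATION = {
    "ChooseHealthier": ("ChooseAttribute", "healthier"),
    "ChooseOlder": ("ChooseAttribute", "older"),
}

def consolidate(op):
    """Map a specialized module name to (general_module, txt_m);
    unknown modules pass through unchanged."""
    return CONSOLIDATION.get(op, (op, None))
```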
We assess the performance of our approach by measuring answer accuracy.
Additionally, we conduct a qualitative evaluation of the intermediate outputs,
visualized through plotted images in Sect. 6.
Evaluated Methods. As presented in the previous sections, we propose two
contributions to improve neural module networks for VQA. First, we use teacher
guidance during training, leading to better generalization. Second, we lever-
age cross-modal language and vision features to capture complex relationships
between text and images, resulting in more accurate and interpretable results.
6 Results Analysis
Model            Accuracy
LXV-TF-hard      0.548
LXV-MT-hard      0.598
LXV-TF-MT-hard   0.630
LXV-TF-soft      0.536
LXV-MT-soft      0.563
LXV-TF-MT-soft   0.632
Model                  Accuracy
FasttextV-TF-MT-hard   0.495
BertV-TF-MT-hard       0.506
LXV-TF-MT-hard         0.630
BertV-TF-MT-soft       0.485
FasttextV-TF-MT-soft   0.511
LXV-TF-MT-soft         0.632
7 Conclusion
We have presented a neural module framework trained using a teacher guid-
ance strategy, which has demonstrated several key contributions. First, our app-
References
1. Aissa, W., Ferecatu, M., Crucianu, M.: Curriculum learning for compositional
visual reasoning. In: Proceedings of VISIGRAPP 2023, Volume 5: VISAPP (2023)
2. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence
prediction with recurrent neural networks. CoRR (2015)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. Trans. ACL 5, 135–146 (2016)
4. Chen, W., Gan, Z., Li, L., Cheng, Y., Wang, W.Y., Liu, J.: Meta module network
for compositional visual reasoning. In: WACV (2021)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep
bidirectional transformers for language understanding. In: Proceedings of the 2019
Conference of the NAACL: Human Language Technologies, Volume 1 (2019)
6. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in
VQA matter: elevating the role of image understanding in visual question answer-
ing. In: CVPR (2017)
7. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason:
end-to-end module networks for visual question answering. In: ICCV (2017)
8. Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning
and compositional question answering (2019)
9. Kervadec, C., Antipov, G., Baccouche, M., Wolf, C.: Roses are red, violets are
blue... but should VQA expect them to? In: CVPR (2021)
10. Kervadec, C., Wolf, C., Antipov, G., Baccouche, M., Nadri, M.: Supervising the
transfer of reasoning patterns in VQA, vol. 34. Curran Associates, Inc. (2021)
11. Li, G., Wang, X., Zhu, W.: Perceptual visual reasoning with knowledge propaga-
tion. In: ACM MM, pp. 530–538. ACM, New York, NY, USA (2019)
12. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple
and performant baseline for vision and language. In: Arxiv (2019)
13. Lu, J., Batra, D., Parikh, D., Lee, S.: VilBERT: pretraining task-agnostic visiolin-
guistic representations for vision-and-language tasks. In: NeurIPS (2019)
14. Mihaylova, T., Martins, A.F.T.: Scheduled sampling for transformers. In: Proceed-
ings of ACL: Student Research Workshop. Florence, Italy (2019)
15. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations
from transformers. In: Proceedings of EMNLP-IJCNLP (2019)
16. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017)
17. Williams, R.J., Zipser, D.: A Learning algorithm for continually running fully recur-
rent neural networks. Neural Comput. 1(2), 270–280 (1989)
Enhanced Color QR Codes with Resilient
Error Correction for Dirt-Prone Surfaces
Minh Nguyen
1 Introduction
QR (Quick Response) codes have become the most widely recognized and inter-
nationally standardized 2D barcodes in today’s world. They are commonly found
on websites, magazines, and various marketing materials. QR codes offer numer-
ous benefits across industries, as they can digitally store information about a
product or service, allowing users to efficiently scan and transfer this information
to smartphones, tablets, and other electronic devices. The primary advantage of
QR codes lies in their ability to eliminate tedious typing and searching for infor-
mation. During the COVID-19 pandemic, QR codes have become instrumental
in contact tracing efforts in many countries [9], enabling contactless QR code
scanning for check-ins at workplaces and other locations [6].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 370–381, 2023.
https://doi.org/10.1007/978-3-031-45382-3_31
Fig. 1. A traditional black and white QR code (left) and our proposed CQR Code
(right); both codes store the same data.
Increasing the error correction level enhances the code’s resilience, albeit at
the cost of a larger code size. Typically, Level M (15%) is the most commonly
used. However, these percentages can be misleading. The data bytes are stored
within a fish-shaped area, as illustrated in Fig. 2. Damages occurring within this
area can be restored, as demonstrated by the damaged but decodable QR codes
in Fig. 3. It is crucial to note that damages can occur anywhere on the code, and
there is no guarantee that they will be confined to this fish-shaped area.
A more detailed examination of the QR code structure is provided in Fig. 2,
which outlines the crucial patterns that must be preserved for accurate QR code
detection and decoding. Among these patterns, the three “Finder Patterns” sit-
uated in the top-left, top-right, and bottom-left corners are arguably the most
critical. Damage to any of these patterns would render the QR code undetectable
and thus undecodable. Keeping the "Quiet Zone" clear is also essential for
proper QR code functionality.
Conventional QR codes may be susceptible to minor damages affecting the
“Finder Patterns.” For example, in Fig. 4-left, the QR code becomes unreadable
when a single yellow dot is placed inside one of its “Finder Patterns.” Fig. 4-
middle illustrates a real-world scenario where the top-right corner of the QR
code is torn off, rendering it unscannable.
QR codes work optimally when printed on flat, matte surfaces. However, many
real-world situations require QR codes to be displayed on reflective LCD screens,
adhered behind glass doors, or printed on plastic bags, tubes, bottles, syringes,
Fig. 3. A damaged but decodable QR code: the code can still be read despite a
quarter of it being missing (demonstration from wikipedia.org).
beer mugs, coffee cups, and fruits. Figure 4-right demonstrates an example where
a QR code is printed on a bottle, and the curved surface prevents it from being
properly decoded using a standard scanner. Several attempts have been made to
address this issue, such as [7,10]. However, they can only handle uniformly curved
or nearly curved shapes, and the entire QR code must be visible. Currently, for
applications involving printing QR codes on such surfaces, most companies opt
for smaller versions of the codes. However, smaller QR codes also mean reduced
data capacity, and the customer’s camera might not capture enough detail to
decode it correctly. In summary, there are still no adequate solutions for printing
and scanning 2D barcodes on products with shiny, glossy, and uneven surfaces.
Fig. 4. Three situations where minor damage or an uneven surface makes a code
non-decodable: (left) a yellow dot added, (middle) a torn corner, (right) a curved
surface. (Color figure online)
374 M. Nguyen
Fig. 5. Two demonstrations of our proposed Colour QR Code, which remains readable
even when large parts of it are covered.
Fig. 6. The steps to create our proposed Colour QR Code from an original QR Code:
(1) extract the red channel, (2) extract the green channel and flip diagonally around
the top-left bottom-right axis, (3) extract the blue channel and flip diagonally around
the top-right bottom-left axis, (4) reconstruct a dot-version of the code, (5) merge all
four to construct the CQR code. (Color figure online)
The green channel IG is flipped diagonally around the top-left bottom-right axis
(step 2); for an N × N module matrix, this maps position (x, y) to (y, x), i.e.
IG(x, y) = QRCode(y, x). The blue channel IB is flipped diagonally around the
top-right bottom-left axis (step 3), which maps (x, y) to (N − 1 − y, N − 1 − x),
i.e. IB(x, y) = QRCode(N − 1 − y, N − 1 − x).
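The two flips can be sketched with numpy on a toy module matrix (variable names here are ours, not the paper's):

```python
import numpy as np

# Toy 5x5 binary "QR" module matrix (1 = black module, 0 = white).
Q = (np.arange(25).reshape(5, 5) % 2).astype(np.uint8)

# Step 2: green channel, flipped around the top-left/bottom-right axis.
# This is an ordinary transpose: IG[x, y] = Q[y, x].
IG = Q.T

# Step 3: blue channel, flipped around the top-right/bottom-left axis.
# Reversing both axes and transposing gives IB[x, y] = Q[N-1-y, N-1-x].
IB = Q[::-1, ::-1].T

# Both flips are involutions: applying the same flip again restores Q.
assert np.array_equal(IG.T, Q)
assert np.array_equal(IB[::-1, ::-1].T, Q)
```

Because each flip is its own inverse, a decoder can re-align any channel with a single cheap index operation.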
The merged QRC contains four layers. Each of the three R, G, and B layers
holds a copy of the original QR Code QRCode in different orientations, while
the fourth layer is there to maintain backward compatibility. This means most
available QR code scanners can read this CQR code. We tested several scanner apps
from the Apple and Android app stores; all read it correctly. The encoding process
for the proposed CQR code offers increased
robustness against damage while maintaining compatibility with existing QR
code scanners. By leveraging the redundancy provided by the three layers of
differently oriented QR codes, the CQR code can effectively recover information
even when some portions are damaged.
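A minimal sketch of this encoding, under our reading of steps 1–3 and 5 (the backward-compatible dot layer of step 4 is omitted, and all names are ours):

```python
import numpy as np

def encode_cqr(Q):
    """Stack three differently oriented copies of the binary module
    matrix Q into the R, G and B channels of one colour image."""
    IR = Q                    # red: original orientation
    IG = Q.T                  # green: main-diagonal flip
    IB = Q[::-1, ::-1].T      # blue: anti-diagonal flip
    return np.stack([IR, IG, IB], axis=-1)

# Toy 7x7 module matrix standing in for a real QR code.
Q = (np.arange(49).reshape(7, 7) % 3 == 0).astype(np.uint8)
cqr = encode_cqr(Q)

# Undoing each flip recovers the same original code from any channel.
assert np.array_equal(cqr[..., 0], Q)
assert np.array_equal(cqr[..., 1].T, Q)
assert np.array_equal(cqr[..., 2][::-1, ::-1].T, Q)
```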
To decode, we first maximize the saturation of each color channel of every pixel in
the input image, producing an output image with maximized saturation (Fig. 7).
Subsequently, we eliminate the dots by applying a median filter. Following this, we
separate the CQR code into its Red, Green, and Blue channels:
The proposed decoding process is robust to various types of damage and can
effectively recover the information stored in the CQR code.
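One way to realize this recovery is a per-module majority vote across the three re-aligned copies (our interpretation; the paper's exact decoder may differ):

```python
import numpy as np

def decode_cqr(cqr):
    """Recover the module matrix by majority vote across the three
    differently oriented copies stored in the R, G and B channels."""
    r = cqr[..., 0]                  # red: already in original orientation
    g = cqr[..., 1].T                # green: undo the main-diagonal flip
    b = cqr[..., 2][::-1, ::-1].T    # blue: undo the anti-diagonal flip
    votes = r.astype(int) + g + b    # 0..3 "black" votes per module
    return (votes >= 2).astype(np.uint8)

# Toy example: damage the same image region in all three channels.
Q = (np.arange(49).reshape(7, 7) % 3 == 0).astype(np.uint8)
cqr = np.stack([Q, Q.T, Q[::-1, ::-1].T], axis=-1)
damaged = cqr.copy()
damaged[0:2, 2:4, :] = 1 - damaged[0:2, 2:4, :]  # invert a 2x2 patch everywhere

# After re-alignment the damaged areas land in different places, so at
# most one of the three copies is wrong at any module, and the vote wins.
assert np.array_equal(decode_cqr(damaged), Q)
```

Because the three copies are stored in different orientations, spatially localized damage rarely hits the same module in more than one re-aligned copy; damage centred on a diagonal axis of symmetry is the unfavourable case.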
Fig. 8. Top: QR codes with 10, 20, 30, 40, 50, and 60% damage. Bottom: CQR codes
with 10, 20, 30, 40, 50, and 60% damage.
After scanning all 24,000 barcode samples, we gathered data on the accuracy
of the decoded information. The first 44 rows of the results, which represent the
average percentage of correct scanning outcomes for all barcodes with varying
levels of damage, are presented in Table 1. These findings are also illustrated
graphically in Fig. 10. Both the graph and the table clearly demonstrate that our
proposed CQR code outperforms the standard QR code in terms of error correction:
the red curve (representing the CQR code) is noticeably higher than the green curve
(representing the QR code). When damage to the QR code ranges from 0 to 1%, its
readability immediately falls from 100% to 94%. In contrast, the CQR code maintains
nearly 100% readability at damage levels of up to 20%. Furthermore, the QR code’s
readability approaches zero at approximately 35% damage, whereas the CQR code’s
readability remains above 55%. The CQR code only fails completely when damage
exceeds 50%. Overall, these results emphasize the superior error correction
capabilities of our proposed CQR code compared to traditional QR codes (Fig. 9).
Fig. 10. Successful decoding rates of CQR codes versus original QR codes.
ing a more robust and reliable means of encoding and decoding information in
challenging environments. Future work will involve further testing of the CQR
code on the mentioned surfaces and exploring additional applications where this
technology could provide a distinct advantage.
References
1. Berchtold, W., Liu, H., Steinebach, M., Klein, D., Senger, T., Thenee, N.: JAB
code: a versatile polychrome 2D barcode. Electron. Imaging 2020(3), 207-1 (2020)
2. Bhardwaj, N., Kumar, R., Verma, R., Jindal, A., Bhondekar, A.P.: Decoding algo-
rithm for color QR code: a mobile scanner application. In: 2016 International Con-
ference on Recent Trends in Information Technology (ICRTIT), pp. 1–6 (2016).
https://doi.org/10.1109/ICRTIT.2016.7569561
3. Bulan, O., Blasinski, H., Sharma, G.: Color QR codes: increased capacity via per-
channel data encoding and interference cancellation. In: Color and Imaging Con-
ference, vol. 2011, pp. 156–159. Society for Imaging Science and Technology (2011)
4. Kato, H., Tan, K.: 2D barcodes for mobile phones. In: 2005 2nd Asia Pacific Con-
ference on Mobile Technology, Applications and Systems, 8 pp. IEEE (2005)
5. Kieseberg, P., et al.: QR code security. In: Proceedings of the 8th International
Conference on Advances in Mobile Computing and Multimedia, pp. 430–435 (2010)
6. Lee, C.Y., Mohd-Mokhtar, R.: Contactless tool for COVID-19 surveillance system.
In: 2021 IEEE 19th Student Conference on Research and Development (SCOReD),
pp. 52–57. IEEE (2021)
7. Liu, P., Duan, M., Liu, W., Wang, Y., Li, Q., Dai, Y.: Research on the graphic cor-
rection technology based on morphological dilation and form function QR codes.
In: 2016 International Conference on Network and Information Systems for Com-
puters (ICNISC), pp. 323–327. IEEE (2016)
8. Melgar, M.E.V., Zaghetto, A., Macchiavello, B., Nascimento, A.C.: CQR codes:
colored quick-response codes. In: 2012 IEEE Second International Conference on
Consumer Electronics-Berlin (ICCE-Berlin), pp. 321–325. IEEE (2012)
9. Nakamoto, I., Wang, S., Guo, Y., Zhuang, W.: A QR code-based contact tracing
framework for sustainable containment of COVID-19: evaluation of an approach
to assist the return to normal activity. JMIR Mhealth Uhealth 8(9), e22321 (2020)
10. Qian, J., Xing, B., Zhang, B., Yang, H.: Optimizing QR code readability for curved
agro-food packages using response surface methodology to improve mobile phone-
based traceability. Food Packag. Shelf Life 28, 100638 (2021)
11. Querini, M., Grillo, A., Lentini, A., Italiano, G.F.: 2D color barcodes for mobile
phones. Int. J. Comput. Sci. Appl. 8(1), 136–155 (2011)
12. Querini, M., Italiano, G.F.: Reliability and data density in high capacity color
barcodes. Comput. Sci. Inf. Syst. 11(4), 1595–1615 (2014)
13. Ramya, M., Jayasheela, M.: Improved color QR codes for real time applications
with high embedding capacity. Int. J. Comput. Appl. 91(8) (2014)
Author Index

A
Ababou, Rachel 332
Aissa, Wafa 357
Alata, Olivier 100
Albu, Alexandra Branzan 345
Azhar, Mihailo 311

B
Backhaus, Anton 112
Baker, Stephen 184
Bammey, Quentin 222
Ben Ammar, Mohammed 196
Ben-Artzi, Gil 14
Boisbunon, Pierre-Yves 299
Bouhlel, Fatma 66
Büchner, Tim 262

C
Cao, Liuyan 275
Capelle-Laizé, Anne-Sophie 299
Capraru, Richard 148
Carré, Philippe 299
Chandra Sekhar, C. 40
Cladière, Tristan 100
Crucianu, Michel 357
Cvenkel, Tilen 209

D
Dammak, Sahar 76
Dash, Amanda 345
Debski, Igor 287
Delmas, Patrice 1, 287, 311, 322, 332
Denzler, Joachim 262
Desroziers, Sylvain 124
Dimas, George 160, 172
Ducottet, Christophe 100, 124
Duong, Dat Q. 184

E
Endo, Toshio 53
Ezhilarasan, M. 235

F
Fassih, Mohamed 299
Fendri, Emna 76
Ferecatu, Marin 357
Fischer, Johannes 287
Frost, Peter 287

G
Gatoula, Panagiota 160
Gong, Rui 322
Guntinas-Lichius, Orlando 262

H
Hammami, Mohamed 66
Hast, Anders 88
Heidari, Shahrokh 1
Hershkovitch Neiterman, Evgeny 14
Hillman, Jen 311
Hirofuchi, Takahiro 53

I
Iakovidis, Dimitris K. 160, 172
Ikegami, Tsutomu 53
Ivanovska, Marija 209

J
Jiang, Gangyi 275
Jiang, Hao 275
Jiang, Zhidi 275

K
Kalozoumis, Panagiotis G. 172
Khediri, Nouha 196
Kherallah, Monji 196
Kieu, Phuong Nhi Nguyen 184
Klyuchka, Michael 14
Konik, Hubert 100
Koottungal, Akash 27

L
Le, Bao 184
Legrand, Anne-Claire 100
Li, Cyril 124
Lin, Huei-Fang 250
Lin, Huei-Yung 250
Luettel, Thorsten 112

M
McComb, Peter 287
Meiresone, Pieter 136
Mliki, Hazar 66, 76
Moreaud, Maxime 124
Muhovič, Jon 209

N
Nakae, Ken 322
Nambiar, Athira 27
Nguyen, Binh T. 184
Nguyen, Minh 370
Nguyen, Tien K. 184

O
Ooi, Martin 332

P
Pandey, Shailesh 27
Perš, Janez 209
Philips, Wilfried 136
Poovayar Priya, M. 235

R
Rogers, Mitchell 287, 332

S
Soong, Boon Hee 148
Strozzi, Alfonso Gastelum 311

T
Thrush, Simon 311
Tran, Tuan-Anh 184
Triantafyllou, Georgios 172

V
Vakada, Naveen 40
Valdez, David Arturo Soriano 311, 332
Van Hamme, David 136

W
Wang, Chenyu 53
Wang, Jian-Gang 148
Woodward, Alexander 322
Wuensche, Hans-Joachim 112

X
Xue, Bing 287

Y
You, Jihao 275
Yu, Mei 275

Z
Zhang, Mengjie 287
Zhao, Kaiqi 332

© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2023
J. Blanc-Talon et al. (Eds.): ACIVS 2023, LNCS 14124, pp. 383–384, 2023.
https://doi.org/10.1007/978-3-031-45382-3