Search Results (80)

Search Parameters:
Keywords = perceptual coding

14 pages, 1803 KiB  
Article
Generative Adversarial Network-Based Distortion Reduction Adapted to Peak Signal-to-Noise Ratio Parameters in VVC
by Weihao Deng and Zhenglong Yang
Appl. Sci. 2024, 14(24), 11561; https://doi.org/10.3390/app142411561 - 11 Dec 2024
Viewed by 461
Abstract
In order to address the image quality degradation and distortion that arise in video transmission coding and decoding, a method based on an enhanced version of CycleGAN is proposed. A lightweight attention module is integrated into the residual blocks of the generator, facilitating the extraction of image details and motion compensation. Furthermore, an LPIPS perceptual loss term is introduced to align the image restoration effect more closely with human perception. Additionally, the network training method is modified: the original image is divided into 128 × 128 blocks for training, enhancing the network’s accuracy in restoring details. The experimental results demonstrate that the algorithm attains an average PSNR of 30.1147 on the publicly accessible YUV Trace Dataset of YUV sequences, a 9.02% improvement over the original network. The LPIPS value reaches 0.2639, a 10.42% reduction, effectively addressing the issue of image quality deterioration during transmission. Full article
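As a rough illustration of the patch-based training and combined pixel/perceptual objective described in the abstract, the sketch below splits frames into 128 × 128 blocks and mixes an L1 term with an LPIPS term. The `lpips` package, the weight `lam`, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; downloads pretrained AlexNet weights on first use

perceptual = lpips.LPIPS(net="alex")  # LPIPS expects inputs scaled to [-1, 1]

def to_patches(frame, size=128):
    """Split a (B, C, H, W) tensor into non-overlapping size x size patches (H, W divisible by size)."""
    b, c, _, _ = frame.shape
    p = frame.unfold(2, size, size).unfold(3, size, size)           # (B, C, nH, nW, size, size)
    return p.contiguous().view(b, c, -1, size, size).permute(0, 2, 1, 3, 4).reshape(-1, c, size, size)

def restoration_loss(restored, reference, lam=0.5):
    """Pixel fidelity (L1) plus perceptual similarity (LPIPS); lam is a hypothetical weight."""
    pixel = F.l1_loss(restored, reference)
    perc = perceptual(restored * 2 - 1, reference * 2 - 1).mean()   # inputs assumed in [0, 1]
    return pixel + lam * perc

# Usage sketch: compare decoded-frame patches against original-frame patches
# (in training, the generator output would replace `decoded`).
decoded, original = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
loss = restoration_loss(to_patches(decoded), to_patches(original))
```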
Figures: (1) 8K video transmission system; (2) control of video decoding by quantization parameters; (3) basic structure of the GAN network; (4) the CycleGAN adversarial process; (5, 6) lightweight attention mechanism; (7) training and inference adapted to PSNR.
21 pages, 1649 KiB  
Article
Bridging Perceptual Gaps: Designers vs. Non-Designers in Urban Wayfinding Signage Preferences
by Jialu Zhou, Norsidah Ujang, Mohd Shahrudin Abd Manan and Faziawati Abdul Aziz
Sustainability 2024, 16(22), 9653; https://doi.org/10.3390/su16229653 - 6 Nov 2024
Viewed by 903
Abstract
As urban environments become increasingly complex and the costs and challenges of infrastructure upgrades continue to rise, wayfinding signage has become an effective solution to cope with urban dynamics due to its low cost and high flexibility. Although the functionality of wayfinding signage has been extensively studied, the perceptual differences between designers and non-designers have not been adequately explored. Ignoring these differences may lead to the overlooking of users’ real and diverse needs, resulting in suboptimal signage performance in practical applications and ultimately a reduction in the overall functionality and user experience of urban spaces. This study aims to bridge this perceptual gap. For this study, we conducted a questionnaire survey in China to compare the visual preferences of designers and non-designers regarding text, shape, color coding, and patterns. The results indicate that designers prioritize functionality and clarity to ensure the effective use of signage in complex urban environments, whereas non-designers prefer wayfinding signages that reflect local cultural symbols and characteristics. Our conclusions suggest that the public’s expectations for wayfinding signage extend beyond basic navigational functions, with an emphasis on cultural expression and visual appeal. Understanding these perceptual differences is crucial in developing design strategies that balance functionality, esthetics, and sustainability, thereby facilitating the sustainable integration of signage into urban landscapes. Full article
Figures: (1) text information on signage; (2) shapes of signage; (3) color coding in signage; (4) patterns in signage.
34 pages, 2908 KiB  
Article
A Hybrid Contrast and Texture Masking Model to Boost High Efficiency Video Coding Perceptual Rate-Distortion Performance
by Javier Ruiz Atencia, Otoniel López-Granado, Manuel Pérez Malumbres, Miguel Martínez-Rach, Damian Ruiz Coll, Gerardo Fernández Escribano and Glenn Van Wallendael
Electronics 2024, 13(16), 3341; https://doi.org/10.3390/electronics13163341 - 22 Aug 2024
Viewed by 722
Abstract
As most videos are destined for human perception, many techniques have been designed to improve video coding based on how the human visual system perceives video quality. In this paper, we propose the use of two perceptual coding techniques, namely contrast masking and texture masking, jointly operating under the High Efficiency Video Coding (HEVC) standard. These techniques aim to improve the subjective quality of the reconstructed video at the same bit rate. For contrast masking, we propose the use of a dedicated weighting matrix for each block size (from 4×4 up to 32×32), unlike the HEVC standard, which only defines an 8×8 weighting matrix that is upscaled to build the 16×16 and 32×32 weighting matrices (a 4×4 weighting matrix is not supported). Our approach achieves average Bjøntegaard Delta-Rate (BD-rate) gains of between 2.5% and 4.48%, depending on the perceptual metric and coding mode used. On the other hand, we propose a novel texture masking scheme based on the classification of each coding unit to provide an over-quantization depending on the coding unit’s texture level. For each coding unit, its mean directional variance features are computed and fed to a support vector machine model that predicts the texture type (plane, edge, or texture). According to this classification, the block’s energy, the type of coding unit, and its size, an over-quantization value is computed as a QP offset (DQP) to be applied to this coding unit. By applying both techniques in the HEVC reference software, an overall average BD-rate gain of 5.79% is achieved, proving their complementarity. Full article
(This article belongs to the Special Issue Recent Advances in Image/Video Compression and Coding)
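The texture-masking step described above (directional-variance features, SVM texture classification, then a QP offset derived from block energy) can be sketched as follows. The feature extraction, the linear energy-to-ΔQP ramp standing in for Equation (6), and all thresholds and labels are simplified assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.svm import SVC

def directional_variances(block):
    """Crude directional-variance features along 0/90 degrees and the two diagonals."""
    h = np.var(np.diff(block, axis=1))             # horizontal differences
    v = np.var(np.diff(block, axis=0))             # vertical differences
    d1 = np.var(block[1:, 1:] - block[:-1, :-1])   # diagonal differences
    d2 = np.var(block[1:, :-1] - block[:-1, 1:])   # anti-diagonal differences
    return np.array([h, v, d1, d2])

def delta_qp(energy, min_e, max_e, max_qstep):
    """Linear ramp from 0 to max_qstep over [min_e, max_e]; a placeholder for Eq. (6)."""
    t = np.clip((energy - min_e) / (max_e - min_e), 0.0, 1.0)
    return t * max_qstep

# Train the texture classifier; real training would use manually labelled blocks
# (plane / edge / texture), random data here is only a stand-in.
X_train = np.random.rand(300, 4)
y_train = np.random.choice(["plane", "edge", "texture"], 300)
svm = SVC(kernel="rbf").fit(X_train, y_train)

block = np.random.rand(16, 16)
label = svm.predict([directional_variances(block)])[0]
energy = block.var() * block.size                   # crude energy proxy for the block
dqp = delta_qp(energy, min_e=10.0, max_e=200.0, max_qstep=12.0) if label == "texture" else 0.0
```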
Figures: (1) default HEVC quantization weighting matrices; (2) contrast sensitivity function (original CSF vs. flattened CSF with spatial frequencies below the peak sensitivity saturated); (3) proposed 4×4 quantization weighting matrices for intra- and inter-prediction modes; (4) rate-distortion curves comparing the proposed CSF with the HEVC default for the BQTerrace (class B) and ChinaSpeed (class F) sequences; (5) samples of manually classified 8×8, 16×16 and 32×32 blocks (plain, edge, texture) with their MDV polar diagrams; (6) scatter plot of manually classified 16×16 blocks and the trained SVM classification results; (7) example block classification for the first frame of BasketballDrill; (8) box-and-whisker plot of block energy by size and texture class; (9) representation of Equation (6) for two sets of MinE, MaxE and MaxQStep parameters; (10) flowchart of candidate selection for the brute-force analysis of perceptually optimal parameters; (11) BD-rate curves (MS-SSIM) for PeopleOnStreet over the MaxQStep parameter for size-8 texture blocks; (12) rate-distortion curves of the first frame of BQSquare comparing contrast masking and contrast-plus-texture masking with the HM reference (SSIM, MS-SSIM, PSNR-HVS-M); (13) visual comparison of the first frame of BQSquare at QP = 22, HM reference vs. contrast and texture masking.
Appendix figures A1–A24: per-sequence figures for the Class A–F test sequences (Traffic, PeopleOnStreet, NebutaFestival, SteamLocomotiveTrain, Kimono, ParkScene, Cactus, BQTerrace, BasketballDrive, RaceHorses, BQMall, PartyScene, BasketballDrill, RaceHorses 416×240, BQSquare, BlowingBubbles, BasketballPass, FourPeople, Johnny, KristenAndSara, BasketballDrillText, ChinaSpeed, SlideEditing, SlideShow).
15 pages, 2334 KiB  
Article
Ensemble Coding of Crowd with Cross-Category Facial Expressions
by Zhi Yang, Yifan Wu, Shuaicheng Liu, Lili Zhao, Cong Fan and Weiqi He
Behav. Sci. 2024, 14(6), 508; https://doi.org/10.3390/bs14060508 - 19 Jun 2024
Viewed by 1111
Abstract
Ensemble coding allows observers to form an average to represent a set of elements. However, it is unclear whether observers can extract an average from a cross-category set. Previous investigations on this issue using low-level stimuli yielded contradictory results. The current study addressed this issue by presenting high-level stimuli (i.e., a crowd of facial expressions) simultaneously (Experiment 1) or sequentially (Experiment 2), and asked participants to complete a member judgment task. The results showed that participants could extract average information from a group of cross-category facial expressions with a short perceptual distance. These findings demonstrate cross-category ensemble coding of high-level stimuli, contributing to the understanding of ensemble coding and providing inspiration for future research. Full article
Figures: (1) the morphed facial-expression continuum, the six faces used in the formal experiments, and the identification-task results; (2) unique ensemble members in each condition and the procedure of Experiment 1; (3) results of Experiment 1 (error bars represent standard errors); (4) procedure of Experiment 2 (sequential presentation of 20 faces); (5) results of Experiment 2 (error bars represent standard errors).
20 pages, 898 KiB  
Article
Singing to a Genre: Constraints on Variable Rhoticity in British Americana
by Rebeka Campos-Astorkiza
Languages 2024, 9(6), 203; https://doi.org/10.3390/languages9060203 - 31 May 2024
Cited by 1 | Viewed by 1039
Abstract
This study focuses on accent shift or stylization to American English features in Anglophone pop-rock music and examines linguistic constraints alongside music-related considerations, as well as the effect of changes in musical genre on variable accent shift. The case study is the British band Mumford and Sons and their variable production of non-prevocalic rhotics as either present or absent. Mumford and Sons is of interest because they have displayed a change in their musical style throughout their career from Americana to alt-rock. The band’s four studio albums were auditorily analyzed and coded for rhotic vs. non-rhotic with aid from spectrograms. The linguistic factors considered were word class, preceding vowel according to the word’s lexical set, complexity of the preceding vowel, syllable complexity, stress, and location within the word and phrase. In addition, the effect of singing-related factors of syllable elongation and rhyming, and of the specific album, were also explored. Results show that rhoticity is favored in content words, stressed contexts, complex syllables, and NURSE words. This pattern is explained as stemming from the perceptual prominence of those contexts based on their acoustic and phonological characteristics. Results further show that syllable elongation leads to more rhoticity and that rhyming words tend to agree in their (non-)rhoticity. Finally, the degree of rhoticity decreases as the band departs from Americana in their later albums, highlighting the relevance of music genre for accent stylization. Full article
(This article belongs to the Special Issue Interface between Sociolinguistics and Music)
Figures: (1) overall rate of rhotic and non-rhotic production; (2) rate of rhotic and non-rhotic production by word class and stress position; (3) rate of rhotic and non-rhotic production by syllable elongation.
15 pages, 2006 KiB  
Article
Effects of Temporal Processing on Speech-in-Noise Perception in Middle-Aged Adults
by Kailyn A. McFarlane and Jason Tait Sanchez
Biology 2024, 13(6), 371; https://doi.org/10.3390/biology13060371 - 23 May 2024
Viewed by 1690
Abstract
Auditory temporal processing is a vital component of auditory stream segregation, or the process in which complex sounds are separated and organized into perceptually meaningful objects. Temporal processing can degrade prior to hearing loss, and is suggested to be a contributing factor to difficulties with speech-in-noise perception in normal-hearing listeners. The current study tested this hypothesis in middle-aged adults—an under-investigated cohort, despite being the age group where speech-in-noise difficulties are first reported. In 76 participants, three mechanisms of temporal processing were measured: peripheral auditory nerve function using electrocochleography, subcortical encoding of periodic speech cues (i.e., fundamental frequency; F0) using the frequency following response, and binaural sensitivity to temporal fine structure (TFS) using a dichotic frequency modulation detection task. Two measures of speech-in-noise perception were administered to explore how contributions of temporal processing may be mediated by different sensory demands present in the speech perception task. This study supported the hypothesis that temporal coding deficits contribute to speech-in-noise difficulties in middle-aged listeners. Poorer speech-in-noise perception was associated with weaker subcortical F0 encoding and binaural TFS sensitivity, but in different contexts, highlighting that diverse aspects of temporal processing are differentially utilized based on speech-in-noise task characteristics. Full article
(This article belongs to the Special Issue Neural Correlates of Perception in Noise in the Auditory System)
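The FFR F0-encoding measure mentioned in the abstract (the mean spectral magnitude around the stimulus fundamental) can be illustrated with a minimal sketch. The 19.5–44.2 ms analysis window and 75–175 Hz bin follow the figure captions; the sampling rate, windowing choice, and synthetic signal are assumptions, not the authors' analysis code.

```python
import numpy as np

def f0_magnitude(ffr, fs, band=(75.0, 175.0), window=(0.0195, 0.0442)):
    """Mean FFT magnitude of an FFR trace within `band` Hz over the `window` (seconds)."""
    start, stop = (int(t * fs) for t in window)
    segment = ffr[start:stop] * np.hanning(stop - start)       # taper the analysis window
    spectrum = np.abs(np.fft.rfft(segment)) / len(segment)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return spectrum[mask].mean()

# Example with a synthetic 100 Hz response sampled at 20 kHz (illustrative values)
fs = 20000
t = np.arange(0, 0.05, 1 / fs)
ffr = 0.5 * np.sin(2 * np.pi * 100 * t) + 0.1 * np.random.randn(t.size)
print(f0_magnitude(ffr, fs))
```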
Figures: (1) audiograms of each participant, with the mean audiogram (n = 76) and the 25 dB HL normal-hearing cut-off; (2) representative FFR trace to the 40 ms /da/ stimulus (FFT window 19.5–44.2 ms) and its spectrum, with the 75–175 Hz bin around F0 used to average spectral magnitude; (3) representative ECochG responses to broadband clicks at 9.1/s and 21.1/s, with wave I amplitude measured peak to trough; (4) wave I amplitude change in top vs. bottom AzBio and SR2-SRM performers (no significant group differences); (5) FFR F0 magnitude in top vs. bottom AzBio and SR2-SRM performers (significantly lower in the bottom SR2-SRM group, p < 0.05); (6) dichotic FM thresholds in top vs. bottom performers (significantly lower thresholds in the top AzBio performers, p < 0.05); (A1) distribution of self-rated hearing abilities across SSQ12 questions.
14 pages, 431 KiB  
Article
Deep Learning-Driven Interference Perceptual Multi-Modulation for Full-Duplex Systems
by Taehyoung Kim and Gyuyeol Kong
Mathematics 2024, 12(10), 1542; https://doi.org/10.3390/math12101542 - 15 May 2024
Viewed by 1095
Abstract
In this paper, a novel data transmission scheme, interference perceptual multi-modulation (IP-MM), is proposed for full-duplex (FD) systems. In contrast to conventional uplink (UL) data transmission, which uses a single modulation and coding scheme (MCS) over the entire assigned UL bandwidth, IP-MM enables UL data channels to be transmitted with multiple MCS levels, where a different MCS level is applied to each subband of the UL transmission. In IP-MM, a deep convolutional neural network predicts the MCS level for each UL subband by estimating the potential residual self-interference (SI) according to the downlink (DL) resource allocation pattern. In addition, a subband-based UL transmission procedure is introduced from a specification point of view to enable IP-MM-based UL transmission. The benefits of IP-MM are verified through simulations, which show that IP-MM achieves approximately 20% throughput gain compared to the conventional UL transmission scheme. Full article
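To make the per-subband MCS prediction concrete, here is a hypothetical PyTorch sketch of a small network that maps a DL resource-allocation pattern to an MCS index for each UL subband. The layer sizes, the number of MCS levels, and the network shape are illustrative assumptions rather than the architecture used in the paper; only N_sub = 8 and N_RBG^DL = 8 follow the values quoted in the results.

```python
import torch
import torch.nn as nn

N_SUB, N_RBG, N_MCS = 8, 8, 29   # subbands, DL resource-block groups, assumed MCS levels

class McsPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * N_RBG, N_SUB * N_MCS)

    def forward(self, dl_alloc):                 # dl_alloc: (batch, N_RBG) allocation bitmap
        x = self.features(dl_alloc.unsqueeze(1))
        logits = self.head(x.flatten(1))
        return logits.view(-1, N_SUB, N_MCS)     # per-subband MCS logits

model = McsPredictor()
dl_pattern = torch.randint(0, 2, (4, N_RBG)).float()
mcs_per_subband = model(dl_pattern).argmax(dim=-1)   # predicted MCS index for each UL subband
```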
Figures: (1) conventional vs. proposed IP-MM UL transmission in FD systems; (2) CDF of effective SINR for different N_sub with N_RBG^DL = 8 and ρ_SIC = −130 dB; (3) examples of MCS determination for different N_sub; (4) UL throughput versus N_sub; (5) UL throughput versus N_RBG^DL with N_sub = 8; (6) UL throughput versus ρ_SIC with N_RBG^DL = 8 and N_sub = 8.
13 pages, 1805 KiB  
Article
Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction
by Yusuf Brima, Ulf Krumnack, Simone Pika and Gunther Heidemann
Information 2024, 15(2), 114; https://doi.org/10.3390/info15020114 - 15 Feb 2024
Viewed by 2322
Abstract
Self-supervised learning (SSL) has emerged as a promising paradigm for learning flexible speech representations from unlabeled data. By designing pretext tasks that exploit statistical regularities, SSL models can capture useful representations that are transferable to downstream tasks. Barlow Twins (BTs) is an SSL technique inspired by theories of redundancy reduction in human perception. In downstream tasks, BTs representations accelerate learning and transfer this learning across applications. This study applies BTs to speech data and evaluates the obtained representations on several downstream tasks, showing the applicability of the approach. However, limitations exist in disentangling key explanatory factors, with redundancy reduction and invariance alone being insufficient for factorization of learned latents into modular, compact, and informative codes. Our ablation study isolated gains from invariance constraints, but the gains were context-dependent. Overall, this work substantiates the potential of Barlow Twins for sample-efficient speech encoding. However, challenges remain in achieving fully hierarchical representations. The analysis methodology and insights presented in this paper pave a path for extensions incorporating further inductive priors and perceptual principles to further enhance the BTs self-supervision framework. Full article
(This article belongs to the Topic Advances in Artificial Neural Networks)
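Since the abstract hinges on the Barlow Twins objective (invariance on the diagonal of the cross-correlation matrix, redundancy reduction off it), a minimal PyTorch sketch of that loss is shown below. This is the generic BTs loss rather than the authors' training code; the embedding size, batch size, and trade-off weight `lambd` are illustrative.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """z_a, z_b: (batch, dim) embeddings of two augmented views of the same audio."""
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-9)   # standardize each feature over the batch
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-9)
    c = (z_a.T @ z_b) / n                              # empirical cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()     # invariance term: push diagonal to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction term
    return on_diag + lambd * off_diag

loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))
```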
Figures: (1) the BTs framework for learning invariant speech representations (Stage 1: an encoder processes two augmented views of the same speech input and the BTs loss enforces redundancy reduction while maximizing correlation for positive pairs; Stage 2: the learned latents are evaluated on downstream tasks); (2) time-domain and spectrogram views of two augmented versions of the same audio signal; (3) empirical cross-correlation matrices of the 128 latent features before and after training, showing stronger diagonal correlation (invariance) and off-diagonal de-correlation after training; (4) Top-1 accuracy for speaker and gender recognition across five base models over 50 runs; (5) Top-1 accuracy for emotion recognition and keyword spotting across the same models.
17 pages, 15546 KiB  
Article
A Pedestrian Trajectory Prediction Method for Generative Adversarial Networks Based on Scene Constraints
by Zhongli Ma, Ruojin An, Jiajia Liu, Yuyong Cui, Jun Qi, Yunlong Teng, Zhijun Sun, Juguang Li and Guoliang Zhang
Electronics 2024, 13(3), 628; https://doi.org/10.3390/electronics13030628 - 2 Feb 2024
Cited by 3 | Viewed by 1518
Abstract
Pedestrian trajectory prediction is one of the most important topics to be researched for unmanned driving and intelligent mobile robots to perform perceptual interaction with the environment. To solve the problem of the SGAN (social generative adversarial networks) model lacking an understanding of pedestrian interaction and scene constraints, this paper proposes a trajectory prediction method based on a scenario-constrained generative adversarial network. Firstly, a self-attention mechanism is added, which can integrate information at every moment. Secondly, mutual information is introduced to enhance the influence of latent code on the predicted trajectory. Finally, a new social pool is introduced into the original trajectory prediction model, and a scene edge extraction module is added to ensure the final output path of the model is within the passable area in line with the physical scene, which greatly improves the accuracy of trajectory prediction. Based on the CARLA (CAR Learning to Act) simulation platform, the improved model was tested on the public dataset and the self-built dataset. The experimental results showed that the average moving deviation was reduced by 26.4% and the final offset was reduced by 23.8%, which proved that the improved model could better solve the uncertainty of pedestrian turning decisions. The accuracy and stability of pedestrian trajectory prediction are improved while maintaining multiple modes. Full article
(This article belongs to the Special Issue Intelligent Mobile Robotic Systems: Decision, Planning and Control)
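As a sketch of the self-attention step added to the trajectory encoder, the following hypothetical PyTorch module attends over per-timestep features and adds the result back residually. The feature dimension, sequence length, and residual formulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TrajectorySelfAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, h):                                  # h: (batch, timesteps, dim)
        scores = self.query(h) @ self.key(h).transpose(1, 2) * self.scale
        attn = torch.softmax(scores, dim=-1)               # attention weights over timesteps
        return h + attn @ self.value(h)                    # residual keeps the original features

attn = TrajectorySelfAttention()
fused = attn(torch.randn(8, 12, 64))                       # e.g., 12 observed timesteps per pedestrian
```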
Figures: (1) SGAN algorithm framework; (2) structure of the SC-SIGAN generator and encoder; (3) modified VGG network structure; (4) social pooling layer structure; (5) self-attention module; (6) pseudocode of the SC-SIGAN network model; (7) simulated traffic scene; (8) ADE and FDE comparison curves on the open dataset; (9) visual comparison of SGAN and the proposed model; (10) trajectory prediction visualization on the self-built dataset; (11) ADE and FDE histograms for five algorithms; (12) visualization of trajectory prediction results for SGAN and the proposed model.
21 pages, 1182 KiB  
Perspective
Development, Insults and Predisposing Factors of the Brain’s Predictive Coding System to Chronic Perceptual Disorders—A Life-Course Examination
by Anusha Yasoda-Mohan and Sven Vanneste
Brain Sci. 2024, 14(1), 86; https://doi.org/10.3390/brainsci14010086 - 16 Jan 2024
Viewed by 2138
Abstract
The predictive coding theory is currently widely accepted as the theoretical basis of perception, and chronic perceptual disorders are explained as the brain’s maladaptive compensation for a prediction error. Although this gives us a general framework to work with, it is still not clear who may be more susceptible and/or vulnerable to aberrations in this system. In this paper, we study changes in predictive coding through the lens of tinnitus and pain. We take a step back to understand how the predictive coding system develops from infancy, which neural and biomarkers characterise this system in the acute, transition and chronic phases, and which factors pose a risk of aberration to this system. Through this paper, we aim to identify people who may be at a higher risk of developing chronic perceptual disorders as a reflection of aberrant predictive coding, thereby giving future studies more facets to incorporate in their investigation of early markers of tinnitus, pain and other disorders of predictive coding. We therefore intend this paper to encourage thinking about the development of preclinical biomarkers of maladaptive predictive coding. Full article
(This article belongs to the Section Sensory and Motor Neuroscience)
Figures: (1) event-related potentials generated by a three-stimulus active oddball paradigm (responses to standard, target and novel distractor stimuli, and the standard–target difference wave); (2) neural markers and predictive-coding changes in the acute, transition and chronic phases of tinnitus and pain, covering the domain-specific lateral auditory/somatosensory pathways, connectivity changes (mPFC–NAc for pain, PHC–PCC for tinnitus), and the domain-general distress network; (3) risk factors that can predispose a person to aberrant predictive coding, including developmental insults and environmental and genetic risk factors.
19 pages, 2809 KiB  
Article
Learning Adaptive Quantization Parameter for Consistent Quality Oriented Video Coding
by Tien Huu Vu, Minh Ngoc Do, Sang Quang Nguyen, Huy PhiCong, Thipphaphone Sisouvong and Xiem HoangVan
Electronics 2023, 12(24), 4905; https://doi.org/10.3390/electronics12244905 - 6 Dec 2023
Viewed by 1671
Abstract
In the Industry 4.0 era, video applications such as surveillance visual systems, video conferencing, and video broadcasting play a vital role. In these applications, the quality of decoded video should be consistent for manipulating and tracking objects, because it largely affects the performance of machine analysis. To cope with this problem, we propose a novel perceptual video coding (PVC) solution in which a full-reference quality metric, video multimethod assessment fusion (VMAF), is employed together with a deep convolutional neural network (CNN) to obtain consistent quality while still achieving high compression performance. First, to meet the consistent-quality requirement, we propose a CNN model that takes an expected VMAF as input and adaptively adjusts the quantization parameters (QP) for each coding block. Then, to increase compression performance, the Lagrange coefficient of the rate-distortion optimization (RDO) mechanism is adaptively computed according to rate-QP and quality-QP models. The experimental results show that the proposed PVC solution achieves two targets simultaneously: the quality of the video sequence is kept consistent at an expected quality level, and the bit rate saving of the proposed method is higher than that of traditional video coding standards and the relevant benchmark, notably around 10% bitrate saving on average. Full article
(This article belongs to the Section Computer Science & Engineering)
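The adaptive Lagrange multiplier idea (deriving λ from fitted rate-QP and quality-QP models) can be sketched as follows. The exponential rate model, the linear quality model, and the sample statistics below are assumptions standing in for the paper's models, not its actual equations.

```python
import numpy as np

# Illustrative per-sequence statistics (QP, bitrate in kbps, VMAF) from previously coded blocks
qp_hist = np.array([22.0, 27.0, 32.0, 37.0])
rate_hist = np.array([5200.0, 2600.0, 1300.0, 700.0])
qual_hist = np.array([95.0, 90.0, 83.0, 74.0])

# Rate model R(QP) = a * exp(b * QP): fit linearly in the log domain
b_r, log_a = np.polyfit(qp_hist, np.log(rate_hist), 1)
# Quality model Q(QP) = c * QP + d (d is unused below)
c_q, d_q = np.polyfit(qp_hist, qual_hist, 1)

def rd_lambda(qp):
    """lambda = -dD/dR evaluated at the current QP, with distortion taken as falling quality."""
    dR_dqp = b_r * np.exp(log_a + b_r * qp)   # derivative of the fitted rate model
    dD_dqp = -c_q                             # distortion rises as quality falls with QP
    return -dD_dqp / dR_dqp

print(rd_lambda(30.0))
```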
Figures: (1) example of video with inconsistent quality (Foreman frames and their differences from the originals); (2) framework of the proposed method; (3) steps in generating the CNN model; (4) a sample QP map; (5) fitting curve of the “City” sequence for the rate and distortion functions; (6) steps in the dataset generation process; (7) architecture of the proposed CNN model; (8) architecture of the test methodology; (9, 10) comparison of quality level between methods; (11) reconstructed frames for x.264, CADQ and LAQP.
23 pages, 9472 KiB  
Article
Underwater Image Enhancement Based on Hybrid Enhanced Generative Adversarial Network
by Danmi Xu, Jiajia Zhou, Yang Liu and Xuyu Min
J. Mar. Sci. Eng. 2023, 11(9), 1657; https://doi.org/10.3390/jmse11091657 - 24 Aug 2023
Cited by 1 | Viewed by 1645
Abstract
In recent years, underwater image processing has played an essential role in ocean exploration. The complexity of seawater leads to the phenomena of light absorption and scattering, which in turn cause serious image degradation problems, making it difficult to capture high-quality underwater images. A novel underwater image enhancement model based on Hybrid Enhanced Generative Adversarial Network (HEGAN) is proposed in this paper. By designing a Hybrid Underwater Image Synthesis Model (HUISM) based on a physical model and a deep learning method, many richly varied paired underwater images are acquired to compensate for the missing problem of underwater image enhancement dataset training. Meanwhile, the Detection Perception Enhancement Model (DPEM) with Perceptual Loss is designed to transfer the coding knowledge in the form of the gradient to the enhancement model through the perceptual loss, which leads to the generation of visually better and detection-friendly underwater images. Then, the synthesized and enhanced models are integrated into an adversarial network to generate high-quality underwater clear images through game learning. Experiments show that the proposed method significantly outperforms several state-of-the-art methods both qualitatively and quantitatively. Furthermore, it is also demonstrated that the method can improve target detection performance in underwater environments, which has specific application value for subsequent image processing. Full article
(This article belongs to the Section Ocean Engineering)
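A perceptual loss of the kind described, where a frozen feature extractor passes coding knowledge to the enhancement model through its gradients, can be sketched as below. VGG16 features stand in for the detection network used in the paper, and the layer cut-off and input normalization are illustrative choices.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen backbone (downloads ImageNet weights on first use); only its features guide the loss.
backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in backbone.parameters():
    p.requires_grad_(False)

def perceptual_loss(enhanced, reference):
    """L1 distance between frozen backbone activations of the enhanced and reference images."""
    # In practice both inputs would first be normalized with the backbone's expected statistics.
    return F.l1_loss(backbone(enhanced), backbone(reference))

loss = perceptual_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```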
Figures: (1) network structure of HUISM (light absorption, light scattering, CNN and fusion modules); (2, 3) visual comparison of different UIS algorithms on the Multiview and OUC datasets; (4) simplified frameworks of the two detection perceptual enhancement models (patch-based and target-focus); (5) the HEGAN framework with forward and reverse cycle-consistency paths; (6–8) visual comparisons of challenging examples from the ChinaMM, Multiview and OUC datasets; (9–11) mAP assessment metrics on the ChinaMM, Multiview and OUC datasets; (12, 13) target detection visualizations for the ChinaMM and Multiview datasets; (14, 15) model ablation study on the Multiview and OUC datasets; (16, 17) ablation study of the detection perceptual model on the Multiview and OUC datasets.
34 pages, 7332 KiB  
Article
Assist-Dermo: A Lightweight Separable Vision Transformer Model for Multiclass Skin Lesion Classification
by Qaisar Abbas, Yassine Daadaa, Umer Rashid and Mostafa E. A. Ibrahim
Diagnostics 2023, 13(15), 2531; https://doi.org/10.3390/diagnostics13152531 - 29 Jul 2023
Cited by 14 | Viewed by 1979
Abstract
A dermatologist-like automatic classification system is developed in this paper to recognize nine different classes of pigmented skin lesions (PSLs), using a separable vision transformer (SVT) technique to assist clinical experts in early skin cancer detection. In the past, researchers have developed a few systems to recognize nine classes of PSLs. However, they often require enormous computations to achieve high performance, which is burdensome to deploy on resource-constrained devices. In this paper, a new approach to designing SVT architecture is developed based on SqueezeNet and depthwise separable CNN models. The primary goal is to find a deep learning architecture with few parameters that has comparable accuracy to state-of-the-art (SOTA) architectures. This paper modifies the SqueezeNet design for improved runtime performance by utilizing depthwise separable convolutions rather than simple conventional units. To develop this Assist-Dermo system, a data augmentation technique is applied to control the PSL imbalance problem. Next, a pre-processing step is integrated to select the most dominant region and then enhance the lesion patterns in a perceptual-oriented color space. Afterwards, the Assist-Dermo system is designed to improve efficacy and performance with several layers and multiple filter sizes but fewer filters and parameters. For the training and evaluation of Assist-Dermo models, a set of PSL images is collected from different online data sources such as Ph2, ISBI-2017, HAM10000, and ISIC to recognize nine classes of PSLs. On the chosen dataset, it achieves an accuracy (ACC) of 95.6%, a sensitivity (SE) of 96.7%, a specificity (SP) of 95%, and an area under the curve (AUC) of 0.95. The experimental results show that the suggested Assist-Dermo technique outperformed SOTA algorithms when recognizing nine classes of PSLs. The Assist-Dermo system performed better than other competitive systems and can support dermatologists in the diagnosis of a wide variety of PSLs through dermoscopy. The Assist-Dermo model code is freely available on GitHub for the scientific community. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
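As a rough illustration of the architectural idea described in the abstract, the sketch below (PyTorch; layer widths and the module names SepConv and FireSep are illustrative assumptions, not the paper's exact Assist-Dermo configuration) shows a depthwise separable convolution block standing in for the plain convolutions of a SqueezeNet-style fire module, which is how such designs reduce parameter counts while keeping accuracy.

```python
# Minimal sketch (not the paper's code): a depthwise separable convolution block
# used in place of a standard 3x3 convolution inside a SqueezeNet-style fire module.
import torch
import torch.nn as nn

class SepConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class FireSep(nn.Module):
    """Fire-style module: 1x1 squeeze, then parallel 1x1 and separable 3x3 expand paths."""
    def __init__(self, in_ch: int, squeeze: int, expand: int):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze, expand, 1), nn.ReLU(inplace=True))
        self.expand3x3 = SepConv(squeeze, expand)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

x = torch.randn(1, 96, 56, 56)                       # dummy feature map
print(FireSep(96, squeeze=16, expand=64)(x).shape)   # torch.Size([1, 128, 56, 56])
```

The parameter saving comes from splitting spatial filtering and channel mixing: the separable 3x3 path costs roughly k²·C + C·C′ multiply-accumulates per position instead of k²·C·C′ for a full convolution.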
Show Figures
Figure 1: A visual example of nine types of PSLs, where (a) actinic keratosis (AK), (b) basal cell carcinoma (BCC), (c) dermatofibroma (DF), (d) melanoma (MEL), (e) nevus (NV), (f) pigmented benign keratosis (PBK), (g) seborrheic keratosis (SK), and (h) squamous cell carcinoma (SCC).
Figure 2: A methodical illustration of the suggested Assist-Dermo system to recognize nine classes of pigmented skin lesions.
Figure 3: An example of data augmentation techniques applied to the selected datasets from the ISIC-2019, ISIC-2020, and HAM10000 sources for benign and malignant skin lesions.
Figure 4: An example of the preprocessing enhancement step applied to the SCC lesion from Figure 1, where (a–e) show the original PSLs and (f–j) show the corresponding contrast enhancement using a nonlinear sigmoidal function.
Figure 5: SqueezeNet-Light architecture: (a) SqueezeNet-Light base structure, (b) fire module structure, and (c) SepConv module basic structure.
Figure 6: Example feature representations on a PSL image from (a) the original SqueezeNet model and (b) our SqueezeNet-Light model.
Figure 7: Plot of loss, accuracy, AUC, and recall on the train/validation sets without data augmentation.
Figure 8: Plot of loss, accuracy, AUC, and recall on the train and validation sets with data augmentation.
Figure 9: Plot of loss, accuracy, AUC, and recall on the train/validation sets with data augmentation and 40 epochs.
Figure 10: Confusion matrices: sub-figures (a–c) show the results of the proposed model with preprocessing in the CIECAM, CIELab, and HSV color spaces; sub-figures (d–f) show the results of the proposed model without the preprocessing step to enhance contrast and adjust brightness, for the same color spaces.
Figure 11: Performance of SOTA systems compared to our Assist-Dermo system for (a) binary classification (malignant and benign lesions), (b) five-class classification (AK, BCC, DF, Mel, NV), (c) seven-class classification (AK, BCC, DF, Mel, NV, PBK, SK), and (d) nine-class classification (AK, BCC, DF, Mel, NV, PBK, SK, SCC, Vasc).
Figure 12: A visual example of the proposed SqueezeNet-Light classification, where (a) actinic keratosis, (b) basal cell carcinoma, (c) dermatofibroma, (d) melanoma, (e) nevus, (f) pigmented benign keratosis, (g) seborrheic keratosis, (h) squamous cell carcinoma, and (i) vascular lesion.
19 pages, 4702 KiB  
Article
Examining Factors Influencing Cognitive Load of Computer Programmers
by Didem Issever, Mehmet Cem Catalbas and Fecir Duran
Brain Sci. 2023, 13(8), 1132; https://doi.org/10.3390/brainsci13081132 - 28 Jul 2023
Viewed by 2064
Abstract
In this study, the factors influencing the cognitive load of computer programmers during the perception of different code tasks were investigated. The programmers' eye movement features were used to establish a significant relationship between the perceptual processing of the sample code and cognitive load. Thanks to this relationship, the influence of various personal characteristics of the programmers on cognitive load was examined, using parameters such as programming experience, age, native language, and programming frequency. The study was performed on the Eye Movements in Programming (EMIP) dataset, which contains 216 programmers with different characteristics. Eye movement information recorded during two different code comprehension tasks was decomposed into sub-signals, such as pupil movement speed and diameter change. Rapid changes in the eye movement signals were adaptively detected using the z-score peak detection algorithm. For the cognitive load calculations, canonical correlation analysis was used to build an efficient mathematical model connecting the extracted eye movement features and the programmers' personal parameters, and the results were statistically significant. Based on this analysis, the factors affecting the cognitive load of computer programmers for this dataset were expressed as percentages; linguistic distance emerged as an essential factor in programmers' cognitive load, whereas the effect of gender on cognitive load was quite limited. Full article
(This article belongs to the Section Computational Neuroscience and Neuroinformatics)
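A minimal sketch of the two analysis ingredients named in the abstract, z-score peak detection on an eye-movement signal and canonical correlation analysis between eye-movement features and personal parameters, is given below. It runs on synthetic data; the window length, threshold, and feature choices are assumptions for illustration, not the authors' settings.

```python
# Minimal sketch (not the authors' code): flag saccade-like events in a pupil signal
# with a rolling z-score threshold, then relate eye-movement features to personal
# parameters with canonical correlation analysis (CCA).
import numpy as np
from sklearn.cross_decomposition import CCA

def zscore_peaks(signal, lag=30, threshold=3.0):
    """Flag samples deviating from a trailing window by more than `threshold` std devs."""
    signal = np.asarray(signal, dtype=float)
    flags = np.zeros(len(signal), dtype=bool)
    for i in range(lag, len(signal)):
        window = signal[i - lag:i]
        mu, sigma = window.mean(), window.std()
        if sigma > 0 and abs(signal[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags

rng = np.random.default_rng(0)

# Synthetic pupil-centre speed trace with a few injected "saccades".
speed = rng.normal(0.0, 1.0, 2000)
speed[[400, 900, 1500]] += 12.0
print("detected saccade samples:", np.flatnonzero(zscore_peaks(speed)))

# X: per-programmer eye-movement features (e.g., saccade rate, pupil diameter change).
# Y: numerically coded personal parameters (e.g., experience, age, programming frequency).
X = rng.normal(size=(216, 4))
Y = rng.normal(size=(216, 4))
U, V = CCA(n_components=2).fit_transform(X, Y)
print("first canonical correlation:", np.corrcoef(U[:, 0], V[:, 0])[0, 1])
```

On real data, the strength of the first canonical correlation and the loadings of each personal parameter on its canonical variate are what allow the contribution of each factor to be expressed as a percentage.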
Show Figures
Figure 1: CLT and schema representation.
Figure 2: Schematic diagram of CCA with parameters.
Figure 3: Rectangle code comprehension task. (a) Java code. (b) Multiple choice for Rectangle.
Figure 4: Vehicle class code comprehension task. (a) Java code for vehicle class. (b) Multiple choice for vehicle class.
Figure 5: Movement of pupil and detected saccadic movement via z-score-based peak detection.
Figure 6: Pupil diameter and center coordinate. (a) Diameter change. (b) Coordinate change.
Figure 7: First pair (U1 and V1).
Figure 8: Second pair (U2 and V2).
Figure 9: Path diagram of cognitive load and personal parameters.
Figure 10: Factors influencing the cognitive load of programmers.
13 pages, 2812 KiB  
Article
Effects of Different Full-Reference Quality Assessment Metrics in End-to-End Deep Video Coding
by Weizhi Xian, Bin Chen, Bin Fang, Kunyin Guo, Jie Liu, Ye Shi and Xuekai Wei
Electronics 2023, 12(14), 3036; https://doi.org/10.3390/electronics12143036 - 11 Jul 2023
Cited by 2 | Viewed by 1254
Abstract
Visual quality assessment is often used as a key performance indicator (KPI) to evaluate the performance of electronic devices. There is thus a significant association between visual quality assessment and electronic devices. In this paper, we bring attention to alternative choices of perceptual loss function for end-to-end deep video coding (E2E-DVC), which can be used to reduce the amount of data generated by electronic sensors and other sources. Thus, we analyze the effects of different full-reference quality assessment (FR-QA) metrics on E2E-DVC. First, we select five optimization-suitable FR-QA metrics as perceptual objectives, which are differentiable and thus support backpropagation, and use them to optimize an E2E-DVC model. Second, we analyze the rate–distortion (R-D) behaviors of an E2E-DVC model under different loss function optimizations. Third, we carry out subjective human perceptual tests on the reconstructed videos to show the performance of different FR-QA optimizations on subjective visual quality. This study reveals the effects of the competing FR-QA metrics on E2E-DVC and provides a guide for future studies on E2E-DVC in terms of perceptual loss function design. Full article
(This article belongs to the Special Issue Security and Privacy Evaluation of Machine Learning in Networks)
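The sketch below illustrates, under stated assumptions rather than as the paper's implementation, how a differentiable FR-QA metric can be slotted into an E2E-DVC rate-distortion objective so that gradients flow through both the rate estimate and the distortion term. MAE stands in for any of the five metrics studied, and the tensors, function names, and λ value are placeholders.

```python
# Minimal sketch (not the paper's code): a rate-distortion loss with a pluggable,
# differentiable distortion metric. Differentiable LPIPS, MS-SSIM, VIF, or DISTS
# modules could be substituted for the MAE callable in the same way.
import torch

def rd_loss(bits_per_pixel: torch.Tensor,
            reconstruction: torch.Tensor,
            target: torch.Tensor,
            distortion_fn,
            lmbda: float = 0.01) -> torch.Tensor:
    """Rate-distortion objective: R + lambda * D, with D given by any FR-QA metric."""
    distortion = distortion_fn(reconstruction, target)
    return bits_per_pixel.mean() + lmbda * distortion

# MAE as one of the differentiable FR-QA choices discussed in the abstract.
mae = lambda x, y: torch.mean(torch.abs(x - y))

# Toy tensors standing in for the codec's estimated rate and decoded frame.
recon = torch.rand(1, 3, 64, 64, requires_grad=True)
target = torch.rand(1, 3, 64, 64)
bpp = torch.tensor([0.15], requires_grad=True)

loss = rd_loss(bpp, recon, target, mae)
loss.backward()  # gradients reach both the rate estimate and the reconstruction
print(float(loss))
```

Swapping `distortion_fn` while keeping the rate term fixed is the kind of controlled comparison the paper's R-D and subjective tests are built around.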
Show Figures
Figure 1: Framework of traditional hybrid video coding and E2E-DVC: (a) framework of traditional hybrid video coding and (b) framework of E2E-DVC.
Figure 2: Framework of the DVC and perceptual optimization.
Figure 3: Performance of different FR-QA metrics on the HEVC Class B dataset in terms of PSNR: (a) R-D performance and (b) fitted R-D functions.
Figure 4: Objective rankings and subjective ranking of the reconstructed videos by the five FR-QA metrics. The horizontal axis indicators are used for evaluation.
Figure 5: Visual results of the DVC optimized using different FR-QA metrics: (a) reference, (b) original, (c) MS-SSIM, (d) LPIPS, (e) MAE, (f) VIF, and (g) DISTS.