Vicsgaze: a gaze estimation method using self-supervised contrastive learning
Existing deep learning-based gaze estimation methods achieve high accuracy, and their performance depends on large-scale datasets with gaze labels. However, collecting large-scale gaze datasets is time-consuming and ...
3D human pose estimation with multi-hypotheses gated transformer
Human pose estimation aims to locate human joints from inputs such as images and videos. Recent works have made significant progress in 3D human pose estimation, but they still face the ill-posed problem caused by the depth ambiguity of estimating ...
Mutual-weighted feature disentanglement for unsupervised domain adaptation
Unsupervised domain adaptation (UDA) aims to reduce the distribution discrepancy across domains, enabling the transfer of knowledge from the labeled source domain to the unlabeled target domain. The main focus of most current UDA methods lies in ...
Motion synthesis via distilled absorbing discrete diffusion model
In this work, we explore the potential of discrete diffusion models in text-driven motion synthesis. Previous methods aimed at improving the quality of generated motions often led to an increase in model parameters, while neglecting the diversity ...
TEST-Net: transformer-enhanced spatio-temporal network for infectious disease prediction
Outbreaks of infectious diseases have caused tremendous human suffering and incalculable economic losses, and infectious diseases are a global public health problem that threatens human society. Therefore, it is necessary to model the spatial and ...
Model-based portrait video compression with spatial constraint and adaptive pose processing
Motion-model-based video coding approaches, which employ sparse sets of keypoints instead of dense optical flows, can efficiently compress videos at ultra-low bitrates. Such schemes obtain notable performance gains over traditional video codecs in ...
Dynamical semantic enhancement network for continuous sign language recognition
In the field of sign language recognition, effective interpretation of semantic information, which is primarily conveyed through facial and hand gestures, poses significant challenges. Previous methods often struggle to simultaneously capture ...
DS-SRD: a unified framework for structured representation distillation
To improve the representation performance of smaller models, representation distillation has been investigated to transfer structured knowledge from a larger model (teacher) to a smaller model (student). Current work aims to maximize a lower bound ...
Local and global context cooperation for temporal action detection
Temporal action detection (TAD) is a fundamental task for video understanding. The task aims to locate the start and end boundaries of action instances and identify their corresponding categories within untrimmed videos. Distinguishing between ...
Multi-scale feature correspondence and restriction mechanism for visible X-ray baggage re-identification
Recently, social security surveillance has posed a new AI challenge, i.e., Visible-X-ray baggage Re-Identification (VX-ReID), which aims to re-identify and retrieve baggage between visible and X-ray imaging modalities. Compared with cross-modality ...
Exploring the impact of volumetric graphics on the engagement of broadcast media professionals
The purpose of this study is to explore content creator preferences in broadcast media, with a specific focus on the impact of integrating 3D graphics to enhance viewer engagement. In this study, we investigated the integration of 3D technology in ...
SS-CMT: a label independent cross-modal transferable adversarial video attack with sparse strategy
Deep neural networks are vulnerable to adversarial examples, which are generated by adding carefully crafted perturbations to benign examples. Some research works explore the transferability of adversarial examples between hetero-modal models from ...
PillarVTP: vehicle trajectory prediction method based on local point cloud aggregation and receptive field expansion
Vehicle trajectory prediction plays a crucial role in the control and safety warning of autonomous vehicles. Existing methods often depend on costly high definition (HD) maps for generating trajectories to fit their scenarios, or involve ...
DSTANet: learning a dual-stream model for anomaly driving action detection using spatio-temporal and appearance features
Driving action anomaly detection based on in-cab surveillance video has become the mainstream of current driving action research. However, there is a substantial redundancy of spatio-temporal information in the spatio-temporal action features ...
SiamRCSC: robust siamese network with channel and spatial constraints for visual object tracking
Siamese-based tracking frameworks locate and classify the target object by evaluating the similarity between the feature maps from the template and search branches. While promising tracking performance has been achieved by ...
Hierarchical bi-directional conceptual interaction for text-video retrieval
The large pre-trained vision-language models (VLMs) utilized in text-video retrieval have demonstrated strong cross image-text understanding ability. Existing works leverage VLMs to extract features and design fine-grained uni-directional ...
Multi-view anomaly detection via hybrid instance-neighborhood aligning and cross-view reasoning
Multi-view anomaly detection aims to identify anomalous instances whose patterns are disparate across different views, and existing works usually project the multi-view data into a common subspace for abnormal instance identification. Nevertheless, ...
A lightweight distillation recurrent convolution network on FPGA for real-time video super-resolution
In the application of image super-resolution (SR) based on field-programmable gate array (FPGA), depthwise separable convolution is widely utilized. However, existing network designs overly simplify the structures used for deep feature extraction ...
Design and experimental evaluation of an intelligent sugarcane stem node recognition system based on enhanced YOLOv5s
The rapid and accurate identification of sugarcane internodes is of great significance for tasks such as field operations and precision management in the sugarcane industry, and it is also a fundamental task for the intelligence of the sugarcane ...
Dynamic spatial-temporal topology graph network for skeleton-based action recognition
Over the past few years, skeleton-based action recognition has gained significant attention for its simple yet robust representation of the human body structure. Many researchers have employed Graph Convolutional Networks (GCNs) to explore ...
Expressive feature representation pyramid network for pulmonary nodule detection
Lung cancer has the highest fatality rate among all types of cancers. The detection of pulmonary nodules serves as the primary means for early diagnosis. Utilizing deep learning models for pulmonary nodule detection can improve the accuracy and ...
Universal NIR-II fluorescence image enhancement via covariance weighted attention network
Second near-infrared (NIR-II) fluorescence imaging has become a new imaging modality owing to its capacity for real-time intraoperative imaging. The NIR-IIb window (1500–1700 nm) offers stronger light penetration and a clearer imaging effect ...
CFFANet: category feature fusion and attention mechanism network for retinal vessel segmentation
Retinal vessel segmentation is a computer-aided diagnostic method for ophthalmic disease analysis. Owing to the complex structure of the retinal vasculature, it is difficult for the segmentation network to capture effective features, and the ...
Collaborative multi-knowledge distillation under the influence of softmax regression representation
Knowledge distillation can transfer knowledge from a powerful yet cumbersome teacher model to a less-parameterized student model, thus effectively achieving model compression. Various knowledge distillation methods have mainly focused on the task ...
Dual-branch network object detection algorithm based on dual-modality fusion of visible and infrared images
To address the limitations of visible images in object detection, this paper proposes a dual-branch network object detection algorithm based on dual-modality fusion of visible and infrared images. Based on YOLOv7-s, the algorithm first introduces ...
Panoramic image semantic segmentation using channel attention-based HarDNet and distorted boundary learning
In this paper, we propose a semantic segmentation framework for panoramic images. First, in order to solve the problem of large panoramic image size, we use HarDNet in the backbone. By applying HarDNet, while improving segmentation accuracy, it ...