Front Matter
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns ...
Tensorial Template Matching for Fast Cross-Correlation with Rotations and Its Application for Tomography
Object detection is a main task in computer vision. Template matching is the reference method for detecting objects with arbitrary templates. However, template matching computational complexity depends on the rotation accuracy, being a limiting ...
FreeAugment: Data Augmentation Search Across All Degrees of Freedom
Data augmentation has become an integral part of deep learning, as it is known to improve the generalization capabilities of neural networks. Since the most effective set of image transformations differs between tasks and domains, automatic data ...
Learning Representations of Satellite Images From Metadata Supervision
Self-supervised learning is increasingly applied to Earth observation problems that leverage satellite and other remotely sensed data. Within satellite imagery, metadata such as time and location often hold significant semantic information that ...
I²-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM
We present an inverse image-formation module that can enhance the robustness of existing visual SLAM pipelines for casually captured scenarios. Casual video captures often suffer from motion blur and varying appearances, which degrade the final ...
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach ...
GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence
Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but are typically hindered by the need for large datasets of ...
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot ...
Lagrangian Hashing for Compressed Neural Field Representations
- Shrisudhan Govindarajan
- Zeno Sambugaro
- Akhmedkhan Shabanov
- Towaki Takikawa
- Daniel Rebain
- Weiwei Sun
- Nicola Conci
- Kwang Moo Yi
- Andrea Tagliasacchi
We present Lagrangian Hashing, a representation for neural fields combining the characteristics of fast-training NeRF methods that rely on Eulerian grids (e.g., InstantNGP) with those that employ points equipped with features as a way to ...
EDformer: Transformer-Based Event Denoising Across Varied Noise Levels
Currently, there is relatively limited research on the background activity noise of event cameras in different brightness conditions, and the relevant real-world datasets are extremely scarce. This limitation contributes to the lack of robustness ...
Foster Adaptivity and Balance in Learning with Noisy Labels
Label noise is ubiquitous in real-world scenarios, posing a practical challenge to supervised models due to its effect in hurting the generalization performance of deep neural networks. Existing methods primarily employ the sample selection ...
MetaAug: Meta-data Augmentation for Post-training Quantization
Post-Training Quantization (PTQ) has received significant attention because it requires only a small set of calibration data to quantize a full-precision model, which is more practical in real-world applications in which full access to a large ...
Thermal3D-GS: Physics-Induced 3D Gaussians for Thermal Infrared Novel-View Synthesis
Novel-view synthesis based on visible light has been extensively studied. In comparison to visible light imaging, thermal infrared imaging offers the advantage of all-weather imaging and strong penetration, providing increased possibilities for ...
Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach
In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this ...
Unleashing the Power of Prompt-Driven Nucleus Instance Segmentation
- Zhongyi Shui
- Yunlong Zhang
- Kai Yao
- Chenglu Zhu
- Sunyi Zheng
- Jingxiong Li
- Honglin Li
- Yuxuan Sun
- Ruizhe Guo
- Lin Yang
Nucleus instance segmentation in histology images is crucial for a broad spectrum of clinical applications. Current dominant algorithms rely on regression of nuclear proxy maps. Distinguishing nucleus instances from the estimated maps requires ...
Gaze Target Detection Based on Head-Local-Global Coordination
This paper introduces a novel approach to gaze target detection leveraging a head-local-global coordination framework. Unlike traditional methods that rely heavily on estimating gaze direction and identifying salient objects in global view images, ...
3DSA: Multi-view 3D Human Pose Estimation With 3D Space Attention Mechanisms
In this study, we introduce the 3D space attention module (3DSA) as a novel approach to address the drawback of multi-view 3D human pose estimation methods, which fail to recognize the object’s significance from diverse viewpoints. Specifically, ...
An Economic Framework for 6-DoF Grasp Detection
Robotic grasping in clutter is a fundamental task in robotic manipulation. In this work, we propose an economic framework for 6-DoF grasp detection, aiming to reduce the resource cost of training while maintaining effective grasp ...
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as ...
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation ...
AdaLog: Post-training Quantization for Vision Transformers with Adaptive Logarithm Quantizer
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community. Despite the high accuracy, deploying it in real applications raises critical challenges including the high computational ...
Multi-label Cluster Discrimination for Visual Representation Learning
Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by CLIP can ...
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion
Infrared-visible object detection aims to achieve robust, even full-day, object detection by fusing the complementary information of infrared and visible images. However, highly dynamically variable complementary characteristics and commonly ...
Index Terms
- Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXVII