Front Matter
Revisit Human-Scene Interaction via Space Occupancy
Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is its limited data scale. High-quality data with simultaneously captured human and 3D environments is ...
Face-Adapter for Pre-trained Diffusion Models with Fine-Grained ID and Attribute Control
- Yue Han,
- Junwei Zhu,
- Keke He,
- Xu Chen,
- Yanhao Ge,
- Wei Li,
- Xiangtai Li,
- Jiangning Zhang,
- Chengjie Wang,
- Yong Liu
Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the ...
Grid-Attention: Enhancing Computational Efficiency of Large Vision Models Without Fine-Tuning
Recently, transformer-based large vision models, e.g., the Segment Anything Model (SAM) and Stable Diffusion (SD), have achieved remarkable success in the computer vision field. However, the quadratic complexity within the transformer’s Multi-Head ...
Mitigating Background Shift in Class-Incremental Semantic Segmentation
Class-Incremental Semantic Segmentation (CISS) aims to learn new classes without forgetting the old ones, using only the labels of the new classes. To achieve this, two popular strategies are employed: 1) pseudo-labeling and knowledge distillation ...
BKDSNN: Enhancing the Performance of Learning-Based Spiking Neural Networks Training with Blurred Knowledge Distillation
Spiking neural networks (SNNs), which mimic biological neural systems to convey information via discrete spikes, are well-known as brain-inspired models with excellent computing efficiency. By utilizing the surrogate gradient estimation for ...
Agent Attention: On the Integration of Softmax and Linear Attention
The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel ...
Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion
This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton ...
Resolving Scale Ambiguity in Multi-view 3D Reconstruction Using Dual-Pixel Sensors
Multi-view 3D reconstruction, namely structure-from-motion and multi-view stereo, is an essential component in 3D computer vision. In general, multi-view 3D reconstruction suffers from unknown scale ambiguity unless a reference object of known ...
Object-Oriented Anchoring and Modal Alignment in Multimodal Learning
Modality alignment has been of paramount importance in recent developments of multimodal learning, which has inspired many innovations in multimodal networks and pre-training tasks. Single-stream networks can effectively leverage self-attention ...
Towards Stable 3D Object Detection
In autonomous driving, the temporal stability of 3D object detection greatly impacts driving safety. However, detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently is less explored by the ...
FYI: Flip Your Images for Dataset Distillation
Dataset distillation synthesizes a small set of images from a large-scale real dataset such that synthetic and real images share similar behavioral properties (e.g., distributions of gradients or features) during a training process. Through ...
On-the-Fly Category Discovery for LiDAR Semantic Segmentation
LiDAR semantic segmentation is important for understanding the surrounding environment in autonomous driving. Existing methods assume closed-set situations with the same training and testing label space. However, in the real world, unknown classes ...
Dual-Camera Smooth Zoom on Mobile Phones
When zooming between dual cameras on a mobile, noticeable jumps in geometric content and image color occur in the preview, inevitably affecting the user’s zoom experience. In this work, we introduce a new task, i.e., dual-camera smooth zoom (DCSZ) ...
ProtoComp: Diverse Point Cloud Completion with Controllable Prototype
Point cloud completion aims to reconstruct the geometry of partial point clouds captured by various sensors. Traditionally, training a point cloud model is carried out on synthetic datasets, which have limited categories and deviate significantly ...
CONDA: Condensed Deep Association Learning for Co-salient Object Detection
- Long Li,
- Nian Liu,
- Dingwen Zhang,
- Zhongyu Li,
- Salman Khan,
- Rao Anwer,
- Hisham Cholakkal,
- Junwei Han,
- Fahad Shahbaz Khan
Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still have limitations on sufficient inter-image association modeling. Because most of them focus on image feature ...
PolyRoom: Room-Aware Transformer for Floorplan Reconstruction
Reconstructing geometry and topology structures from raw unstructured data has always been an important research topic in indoor mapping research. In this paper, we aim to reconstruct the floorplan with a vectorized representation from point ...
BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning on data in common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains ...
HENet: Hybrid Encoding for End-to-End Multi-task 3D Perception from Multi-view Cameras
Three-dimensional perception from multi-view cameras is a crucial component in autonomous driving systems, which involves multiple tasks like 3D object detection and bird’s-eye-view (BEV) semantic segmentation. To improve perception precision, ...
Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation
Source-free domain adaptation (SFDA) aims to transfer a model trained on a labeled source domain to an unlabeled target domain without accessing the source data. Recent SFDA methods predominantly rely on self-training, which supervises the model with ...
Customized Generation Reimagined: Fidelity and Editability Harmonized
Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model, enabling new generations of the concept in novel contexts guided by textual prompts. However, customized generation suffers from an inherent trade-...
AUFormer: Vision Transformers Are Parameter-Efficient Facial Action Unit Detectors
Facial Action Units (AUs) are a vital concept in affective computing, and AU detection has long been an active research topic. Existing methods suffer from overfitting issues due to the utilization of a large number of learnable ...
Improving Video Segmentation via Dynamic Anchor Queries
Modern video segmentation methods adopt feature transitions between anchor and target queries to perform cross-frame object association. The smooth feature transitions between anchor and target queries enable these methods to achieve satisfactory ...
Index Terms
- Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part L