Computer Vision – ECCV 2024 | Guide Proceedings

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part L

Sep 2024

2024 Proceeding

Editors:
Aleš Leonardis, University of Birmingham, Birmingham, UK
Elisa Ricci, University of Trento, Trento, Italy
Stefan Roth, Technical University of Darmstadt, Darmstadt, Germany
Olga Russakovsky, Princeton University, Princeton, NJ, USA
Torsten Sattler, Czech Technical University in Prague, Prague, Czech Republic
Gül Varol, École des Ponts ParisTech, Marne-la-Vallée, France

Publisher:

Springer-Verlag
Berlin, Heidelberg

Conference:

European Conference on Computer Vision, Milan, Italy, 29 September 2024

ISBN:

978-3-031-72972-0

Published:

14 November 2024

Abstract

No abstract available.

Revisit Human-Scene Interaction via Space Occupancy

Pages 1–19, https://doi.org/10.1007/978-3-031-72973-7_1

Abstract

Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is its limited data scale. High-quality data with simultaneously captured human and 3D environments is ...

Article

Face-Adapter for Pre-trained Diffusion Models with Fine-Grained ID and Attribute Control

Pages 20–36, https://doi.org/10.1007/978-3-031-72973-7_2

Abstract

Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the ...

Article

WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model

Pages 37–53, https://doi.org/10.1007/978-3-031-72973-7_3

Abstract

Recently learned image compression (LIC) has achieved great progress and even outperformed the traditional approach using DCT or discrete wavelet transform (DWT). However, LIC mainly reduces spatial redundancy in the autoencoder networks and ...

Article

Grid-Attention: Enhancing Computational Efficiency of Large Vision Models Without Fine-Tuning

Pages 54–70, https://doi.org/10.1007/978-3-031-72973-7_4

Abstract

Recently, transformer-based large vision models, e.g., the Segment Anything Model (SAM) and Stable Diffusion (SD), have achieved remarkable success in the computer vision field. However, the quartic complexity within the transformer’s Multi-Head ...

Article

Mitigating Background Shift in Class-Incremental Semantic Segmentation

Pages 71–88, https://doi.org/10.1007/978-3-031-72973-7_5

Abstract

Class-Incremental Semantic Segmentation (CISS) aims to learn new classes without forgetting the old ones, using only the labels of the new classes. To achieve this, two popular strategies are employed: 1) pseudo-labeling and knowledge distillation ...

Article

Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Pages 89–105, https://doi.org/10.1007/978-3-031-72973-7_6

Abstract

This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from the self-...

Article

BKDSNN: Enhancing the Performance of Learning-Based Spiking Neural Networks Training with Blurred Knowledge Distillation

Pages 106–123, https://doi.org/10.1007/978-3-031-72973-7_7

Abstract

Spiking neural networks (SNNs), which mimic biological neural systems to convey information via discrete spikes, are well-known as brain-inspired models with excellent computing efficiency. By utilizing the surrogate gradient estimation for ...

Article

Agent Attention: On the Integration of Softmax and Linear Attention

Pages 124–140, https://doi.org/10.1007/978-3-031-72973-7_8

Abstract

The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel ...

Article

Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion

Pages 141–161, https://doi.org/10.1007/978-3-031-72973-7_9

Abstract

This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton ...

Article

Resolving Scale Ambiguity in Multi-view 3D Reconstruction Using Dual-Pixel Sensors

Pages 162–178, https://doi.org/10.1007/978-3-031-72973-7_10

Abstract

Multi-view 3D reconstruction, namely structure-from-motion and multi-view stereo, is an essential component in 3D computer vision. In general, multi-view 3D reconstruction suffers from unknown scale ambiguity unless a reference object of known ...

Article

Object-Oriented Anchoring and Modal Alignment in Multimodal Learning

Pages 179–196, https://doi.org/10.1007/978-3-031-72973-7_11

Abstract

Modality alignment has been of paramount importance in recent developments of multimodal learning, which has inspired many innovations in multimodal networks and pre-training tasks. Single-stream networks can effectively leverage self-attention ...

Article

Towards Stable 3D Object Detection

Pages 197–213, https://doi.org/10.1007/978-3-031-72973-7_12

Abstract

In autonomous driving, the temporal stability of 3D object detection greatly impacts the driving safety. However, the detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently is less explored by the ...

Article

FYI: Flip Your Images for Dataset Distillation

Pages 214–230, https://doi.org/10.1007/978-3-031-72973-7_13

Abstract

Dataset distillation synthesizes a small set of images from a large-scale real dataset such that synthetic and real images share similar behavioral properties (e.g., distributions of gradients or features) during a training process. Through ...

Article

On-the-Fly Category Discovery for LiDAR Semantic Segmentation

Pages 231–249, https://doi.org/10.1007/978-3-031-72973-7_14

Abstract

LiDAR semantic segmentation is important for understanding the surrounding environment in autonomous driving. Existing methods assume closed-set situations with the same training and testing label space. However, in the real world, unknown classes ...

Article

Dual-Camera Smooth Zoom on Mobile Phones

Pages 250–269, https://doi.org/10.1007/978-3-031-72973-7_15

Abstract

When zooming between dual cameras on a mobile, noticeable jumps in geometric content and image color occur in the preview, inevitably affecting the user’s zoom experience. In this work, we introduce a new task, i.e., dual-camera smooth zoom (DCSZ) ...

Article

ProtoComp: Diverse Point Cloud Completion with Controllable Prototype

Pages 270–286, https://doi.org/10.1007/978-3-031-72973-7_16

Abstract

Point cloud completion aims to reconstruct the geometry of partial point clouds captured by various sensors. Traditionally, training a point cloud model is carried out on synthetic datasets, which have limited categories and deviate significantly ...

Article

CONDA: Condensed Deep Association Learning for Co-salient Object Detection

Pages 287–303, https://doi.org/10.1007/978-3-031-72973-7_17

Abstract

Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still have limitations on sufficient inter-image association modeling. Because most of them focus on image feature ...

Article

Cascade Prompt Learning for Vision-Language Model Adaptation

Pages 304–321, https://doi.org/10.1007/978-3-031-72973-7_18

Abstract

Prompt learning has surfaced as an effective approach to enhance the performance of Vision-Language Models (VLMs) like CLIP when applied to downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of ...

Article

PolyRoom: Room-Aware Transformer for Floorplan Reconstruction

Pages 322–339, https://doi.org/10.1007/978-3-031-72973-7_19

Abstract

Reconstructing geometry and topology structures from raw unstructured data has always been an important research topic in indoor mapping research. In this paper, we aim to reconstruct the floorplan with a vectorized representation from point ...

Article

BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models

Pages 340–358, https://doi.org/10.1007/978-3-031-72973-7_20

Abstract

Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning on data in common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains ...

Article

SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution

Pages 359–375, https://doi.org/10.1007/978-3-031-72973-7_21

Abstract

Transformer-based restoration methods achieve significant performance as the self-attention (SA) of the Transformer can explore non-local information for better high-resolution image reconstruction. However, the key dot-product SA requires ...

Article

HENet: Hybrid Encoding for End-to-End Multi-task 3D Perception from Multi-view Cameras

Pages 376–392, https://doi.org/10.1007/978-3-031-72973-7_22

Abstract

Three-dimensional perception from multi-view cameras is a crucial component in autonomous driving systems, which involves multiple tasks like 3D object detection and bird’s-eye-view (BEV) semantic segmentation. To improve perception precision, ...

Article

Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation

Pages 393–409, https://doi.org/10.1007/978-3-031-72973-7_23

Abstract

Source free domain adaptation (SFDA) aims to transfer the model trained on labeled source domain to unlabeled target domain without accessing source data. Recent SFDA methods predominantly rely on self-training, which supervise the model with ...

Article

Customized Generation Reimagined: Fidelity and Editability Harmonized

Pages 410–426, https://doi.org/10.1007/978-3-031-72973-7_24

Abstract

Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model, enabling new generations of the concept in novel contexts guided by textual prompts. However, customized generation suffers from an inherent trade-...

Article

AUFormer: Vision Transformers Are Parameter-Efficient Facial Action Unit Detectors

Pages 427–445, https://doi.org/10.1007/978-3-031-72973-7_25

Abstract

Facial Action Units (AU) is a vital concept in the realm of affective computing, and AU detection has always been a hot research topic. Existing methods suffer from overfitting issues due to the utilization of a large number of learnable ...

Article

Improving Video Segmentation via Dynamic Anchor Queries

Pages 446–463, https://doi.org/10.1007/978-3-031-72973-7_26

Abstract

Modern video segmentation methods adopt feature transitions between anchor and target queries to perform cross-frame object association. The smooth feature transitions between anchor and target queries enable these methods to achieve satisfactory ...

Article

Controllable Contextualized Image Captioning: Directing the Visual Narrative Through User-Defined Highlights

Pages 464–481, https://doi.org/10.1007/978-3-031-72973-7_27

Abstract

Contextualized Image Captioning (CIC) evolves traditional image captioning into a more complex domain, necessitating the ability for multimodal reasoning. It aims to generate image captions given specific contextual information. This paper further ...

Index Terms

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part L
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Interest point and salient region detections
        Object recognition
        Video segmentation
      2. Computer vision tasks
        Scene understanding
  2. Computer graphics

Index terms have been assigned to the content through auto-classification.
