- research-article, June 2024
QAVidCap: Enhancing Video Captioning through Question Answering Techniques
ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval, Pages 155–164, https://doi.org/10.1145/3652583.3658061
Video captioning is the task of describing video content using natural sentences. While recent models have shown significant improvements in metrics, there are still some unresolved issues. Model-generated captions often contain factual errors and omit ...
- research-article, October 2023
Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers
NarSUM '23: Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos, Pages 51–56, https://doi.org/10.1145/3607540.3617141
Video captioning aims to generate natural language descriptions of an input video. Generating coherent natural language sentences is a challenging task due to the complex nature of video content, such as object and scene understanding, extraction of object- ...
- research-article, October 2023
Emotion-Prior Awareness Network for Emotional Video Captioning
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 589–600, https://doi.org/10.1145/3581783.3611726
Emotional video captioning (EVC) is an emerging task to describe the factual content with the inherent emotion expressed in a video. It is crucial for the EVC task to effectively perceive subtle and ambiguous visual emotion cues in the stage of caption ...
- short-paper, October 2023
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation
CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Pages 5391–5395, https://doi.org/10.1145/3583780.3615120
Despite the recent emergence of video captioning models, how to generate vivid, fine-grained video descriptions based on the background knowledge (i.e., long and informative commentary about the domain-specific scenes with appropriate reasoning) is still ...
- research-article, September 2023
Video Annotation & Descriptions using Machine Learning & Deep learning: Critical Survey of methods
IC3-2023: Proceedings of the 2023 Fifteenth International Conference on Contemporary Computing, Pages 722–735, https://doi.org/10.1145/3607947.3608091
Video description methods aim to produce the most relevant description of a video. This could be a description based on the full video, on individual frames, or on the video's important events. This work focuses on recent advancements in video description ...
- research-article, July 2023
Read It, Don't Watch It: Captioning Bug Recordings Automatically
ICSE '23: Proceedings of the 45th International Conference on Software Engineering, Pages 2349–2361, https://doi.org/10.1109/ICSE48619.2023.00197
Screen recordings of mobile applications are easy to capture and include a wealth of information, making them a popular mechanism for users to inform developers of the problems encountered in bug reports. However, watching the bug recordings and ...
- research-article, October 2022
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
MM '22: Proceedings of the 30th ACM International Conference on Multimedia, Pages 7070–7074, https://doi.org/10.1145/3503161.3551581
In this work, we present Auto-captions on GIF (ACTION), which is a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from ...
- research-article, October 2022
Differentiate Visual Features with Guidance Signals for Video Captioning
CCRIS '22: Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System, Pages 235–240, https://doi.org/10.1145/3562007.3562052
The task of video captioning is to generate comprehensible and grammatically correct sentences that describe the main visual content of videos. Existing neural-module-based methods improve model interpretability by separately predicting words of ...
- research-article, June 2022
Ingredient-enriched Recipe Generation from Cooking Videos
ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, Pages 249–257, https://doi.org/10.1145/3512527.3531388
Cooking video captioning aims to generate text instructions that describe the cooking procedures presented in the video. Current approaches tend to use large neural models or more robust feature extractors to increase the expressive ability of ...
- research-article, June 2022
Dual-Level Decoupled Transformer for Video Captioning
ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, Pages 219–228, https://doi.org/10.1145/3512527.3531380
Video captioning aims to understand the spatio-temporal semantic concept of the video and generate descriptive sentences. The de-facto approach to this task dictates a text generator to learn from offline-extracted motion or appearance features from pre-...
- research-article, March 2022
Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 4, Article No. 104, Pages 1–17, https://doi.org/10.1145/3503927
Fully mining visual cues to aid in content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods are limited to generating captions purely based on straightforward information while ignoring the scenario ...
- research-article, January 2022
Soccer captioning: dataset, transformer-based model, and triple-level evaluation
Procedia Computer Science (PROCS), Volume 210, Issue C, Pages 104–111, https://doi.org/10.1016/j.procs.2022.10.125
This work aims at generating captions for soccer videos using deep learning. The paper introduces a novel dataset, model, and triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, ...
- short-paper, October 2021
Semantic Tag Augmented XlanV Model for Video Captioning
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4818–4822, https://doi.org/10.1145/3474085.3479228
The key to video captioning is to leverage cross-modal information from both vision and language perspectives. We propose to leverage semantic tags to bridge the gap between these modalities rather than directly concatenating or attending to the ...
- short-paper, October 2021
Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4750–4754, https://doi.org/10.1145/3474085.3479217
This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We start from the Bottom-Up-Top-Down model, with technical improvements on both video content encoding and ...
- short-paper, October 2021
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4853–4857, https://doi.org/10.1145/3474085.3479216
The quality of video representation directly determines the performance of video-related tasks, for both understanding and generation. In this paper, we propose a single-modality pretrained feature-fusion technique composed of reasonable multi-view ...
- research-article, October 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 5600–5608, https://doi.org/10.1145/3474085.3475703
BERT-type structures have driven the revolution in vision-language pre-training and achieved state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on multi-modal inputs with mask ...
- research-article, October 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4220–4229, https://doi.org/10.1145/3474085.3475557
Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method ...
- research-article, October 2021
Discriminative Latent Semantic Graph for Video Captioning
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 3556–3564, https://doi.org/10.1145/3474085.3475519
Video captioning aims to automatically generate natural language sentences that can describe the visual contents of a given video. Existing generative models like encoder-decoder frameworks cannot explicitly explore the object-level interactions and ...
- research-article, May 2021
Toward Automatic Audio Description Generation for Accessible Videos
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Article No. 277, Pages 1–12, https://doi.org/10.1145/3411764.3445347
Video accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of ...
- short-paper, October 2020
VideoTRM: Pre-training for Video Captioning Challenge 2020
MM '20: Proceedings of the 28th ACM International Conference on Multimedia, Pages 4605–4609, https://doi.org/10.1145/3394171.3416291
The Pre-training for Video Captioning Challenge 2020 mainly focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and further transferring the pre-trained model to MSR-VTT ...
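A formulation that recurs throughout these abstracts is encoder-decoder video captioning: encode pre-extracted frame features, then decode a sentence token by token. The sketch below is a minimal, generic illustration of that setup, not a reproduction of any listed paper's method; the module choices, dimensions, and teacher-forced training interface are all assumptions made for the example.

```python
# Minimal encoder-decoder video captioner (PyTorch) -- an illustrative
# sketch of the generic task setup, not any listed paper's model.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encoder: project pre-extracted per-frame features (e.g. CNN
        # activations) and summarize the sequence with a GRU.
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Decoder: autoregressive GRU language model conditioned on the
        # video by using the encoder's final state as its initial state.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) offline-extracted features
        # captions:    (batch, seq_len) token ids, teacher-forced at training
        _, video_state = self.encoder(self.proj(frame_feats))
        dec_out, _ = self.decoder(self.embed(captions), video_state)
        # Next-token logits per step; train with cross-entropy against
        # the caption shifted by one position.
        return self.out(dec_out)  # (batch, seq_len, vocab_size)

model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)           # two clips, 16 frames each
caps = torch.randint(0, 10000, (2, 12))    # hypothetical token ids
logits = model(feats, caps)
```

The works listed above can be read as replacing pieces of this skeleton: transformer decoders in place of the GRU, auditory or sensor streams fused with the visual features, semantic tags or emotion cues injected as extra conditioning, and large-scale pre-training on datasets such as Auto-captions on GIF.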