- research-article, June 2024
QAVidCap: Enhancing Video Captioning through Question Answering Techniques
ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval, Pages 155–164, https://doi.org/10.1145/3652583.3658061
Video captioning is the task of describing video content using natural sentences. While recent models have shown significant improvements in metrics, there are still some unresolved issues. Model-generated captions often contain factual errors and omit ...
- research-article, October 2023
Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers
NarSUM '23: Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos, Pages 51–56, https://doi.org/10.1145/3607540.3617141
Video captioning aims to generate natural language descriptions of an input video. Generating coherent natural language sentences is a challenging task due to the complex nature of video content, such as object and scene understanding, extraction of object- ...
- research-article, October 2023
Emotion-Prior Awareness Network for Emotional Video Captioning
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 589–600, https://doi.org/10.1145/3581783.3611726
Emotional video captioning (EVC) is an emerging task to describe the factual content with the inherent emotion expressed in a video. It is crucial for the EVC task to effectively perceive subtle and ambiguous visual emotion cues in the stage of caption ...
- short-paper, October 2023
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation
CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Pages 5391–5395, https://doi.org/10.1145/3583780.3615120
Despite the recent emergence of video captioning models, how to generate vivid, fine-grained video descriptions based on the background knowledge (i.e., long and informative commentary about the domain-specific scenes with appropriate reasoning) is still ...
- research-article, September 2023
Video Annotation & Descriptions using Machine Learning & Deep learning: Critical Survey of methods
IC3-2023: Proceedings of the 2023 Fifteenth International Conference on Contemporary Computing, Pages 722–735, https://doi.org/10.1145/3607947.3608091
Video description methods aim to produce the most relevant description of a video. This could be a description based on the full video, on individual frames, or on the video's important events. This work focuses on recent advancements in video description ...
- research-article, July 2023
Read It, Don't Watch It: Captioning Bug Recordings Automatically
ICSE '23: Proceedings of the 45th International Conference on Software Engineering, Pages 2349–2361, https://doi.org/10.1109/ICSE48619.2023.00197
Screen recordings of mobile applications are easy to capture and include a wealth of information, making them a popular mechanism for users to inform developers of the problems encountered in bug reports. However, watching the bug recordings and ...
- research-article, October 2022
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
MM '22: Proceedings of the 30th ACM International Conference on Multimedia, Pages 7070–7074, https://doi.org/10.1145/3503161.3551581
In this work, we present Auto-captions on GIF (ACTION), which is a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from ...
- research-article, October 2022
Differentiate Visual Features with Guidance Signals for Video Captioning
CCRIS '22: Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System, Pages 235–240, https://doi.org/10.1145/3562007.3562052
The task of video captioning is to generate comprehensible and grammatically correct sentences that describe the main visual content of videos. Existing neural-module-based methods improve model interpretability by separately predicting words of ...
- research-article, June 2022
Ingredient-enriched Recipe Generation from Cooking Videos
ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, Pages 249–257, https://doi.org/10.1145/3512527.3531388
Cooking video captioning aims to generate text instructions that describe the cooking procedures presented in the video. Current approaches tend to use large neural models or more robust feature extractors to increase the expressive ability of ...
- research-article, June 2022
Dual-Level Decoupled Transformer for Video Captioning
ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, Pages 219–228, https://doi.org/10.1145/3512527.3531380
Video captioning aims to understand the spatio-temporal semantic concept of the video and generate descriptive sentences. The de-facto approach to this task dictates a text generator to learn from offline-extracted motion or appearance features from pre-...
- research-article, March 2022
Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 4, Article No. 104, Pages 1–17, https://doi.org/10.1145/3503927
Fully mining visual cues to aid in content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods are limited to generating captions purely based on straightforward information while ignoring the scenario ...
- research-article, January 2022
Soccer captioning: dataset, transformer-based model, and triple-level evaluation
Procedia Computer Science (PROCS), Volume 210, Issue C, Pages 104–111, https://doi.org/10.1016/j.procs.2022.10.125
This work aims at generating captions for soccer videos using deep learning. The paper introduces a novel dataset, model, and triple-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, ...
- short-paper, October 2021
Semantic Tag Augmented XlanV Model for Video Captioning
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4818–4822, https://doi.org/10.1145/3474085.3479228
The key to video captioning is to leverage cross-modal information from both vision and language perspectives. We propose to leverage semantic tags to bridge the gap between these modalities rather than directly concatenating or attending to the ...
- short-paper, October 2021
Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4750–4754, https://doi.org/10.1145/3474085.3479217
This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We start from the Bottom-Up-Top-Down model, with technical improvements on both video content encoding and ...
- short-paper, October 2021
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4853–4857, https://doi.org/10.1145/3474085.3479216
The quality of video representation directly determines the performance of video-related tasks, for both understanding and generation. In this paper, we propose a single-modality pretrained feature-fusion technique composed of reasonable multi-view ...
- research-article, October 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 5600–5608, https://doi.org/10.1145/3474085.3475703
BERT-type structures have driven the revolution in vision-language pre-training and achieved state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on multi-modal inputs with mask ...
- research-article, October 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 4220–4229, https://doi.org/10.1145/3474085.3475557
Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method ...
- research-article, October 2021
Discriminative Latent Semantic Graph for Video Captioning
MM '21: Proceedings of the 29th ACM International Conference on Multimedia, Pages 3556–3564, https://doi.org/10.1145/3474085.3475519
Video captioning aims to automatically generate natural language sentences that can describe the visual contents of a given video. Existing generative models like encoder-decoder frameworks cannot explicitly explore the object-level interactions and ...
- research-article, May 2021
Toward Automatic Audio Description Generation for Accessible Videos
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Article No. 277, Pages 1–12, https://doi.org/10.1145/3411764.3445347
Video accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of ...
- short-paper, October 2020
VideoTRM: Pre-training for Video Captioning Challenge 2020
MM '20: Proceedings of the 28th ACM International Conference on Multimedia, Pages 4605–4609, https://doi.org/10.1145/3394171.3416291
The Pre-training for Video Captioning Challenge 2020 mainly focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and further transferring the pre-trained model to MSR-VTT ...
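A formulation that recurs throughout these abstracts is encoder-decoder video captioning: encode pre-extracted frame features, then decode a sentence token by token. The sketch below is a minimal, generic illustration of that setup, not a reproduction of any listed paper's method; the module choices, dimensions, and teacher-forced training interface are all assumptions made for the example.

```python
# Minimal encoder-decoder video captioner (PyTorch) -- an illustrative
# sketch of the generic task setup, not any listed paper's model.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encoder: project pre-extracted per-frame features (e.g. CNN
        # activations) and summarize the sequence with a GRU.
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Decoder: autoregressive GRU language model conditioned on the
        # video by using the encoder's final state as its initial state.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) offline-extracted features
        # captions:    (batch, seq_len) token ids, teacher-forced at training
        _, video_state = self.encoder(self.proj(frame_feats))
        dec_out, _ = self.decoder(self.embed(captions), video_state)
        # Next-token logits per step; train with cross-entropy against
        # the caption shifted by one position.
        return self.out(dec_out)  # (batch, seq_len, vocab_size)

model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)           # two clips, 16 frames each
caps = torch.randint(0, 10000, (2, 12))    # hypothetical token ids
logits = model(feats, caps)
```

The works listed above can be read as replacing pieces of this skeleton: transformer decoders in place of the GRU, auditory or sensor streams fused with the visual features, semantic tags or emotion cues injected as extra conditioning, and large-scale pre-training on datasets such as Auto-captions on GIF.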