research-article

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Authors:

Peng Wu,

Xiangteng He,

Mingqian Tang,

Yiliang Lv,

Jing LiuAuthors Info & Claims

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 3518 - 3527

https://doi.org/10.1145/3474085.3475515

Published: 17 October 2021 Publication History

Get Access

Abstract

Video-text retrieval is an important yet challenging task in vision-language understanding, which aims to learn a joint embedding space where related video and text instances are close to each other. Most current works simply measure the video-text similarity based on video-level and text-level embeddings. However, the neglect of more fine-grained or local information causes the problem of insufficient representation. Some works exploit the local details by disentangling sentences, but overlook the corresponding videos, causing the asymmetry of video-text representation. To address the above limitations, we propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching. Specifically, we first decompose video and text into three semantic levels, namely event (video and text), action (motion and verb), and entity (appearance and noun). Based on these, we naturally construct hierarchical representations in the individual-local-global manner, where the individual level focuses on the alignment between frame and word, local level focuses on the alignment between video clip and textual context, and global level focuses on the alignment between the whole video and text. Different level alignments capture fine-to-coarse correlations between video and text, as well as take the advantage of the complementary information among three semantic levels. Besides, our HANet is also richly interpretable by explicitly learning key semantic concepts. Extensive experiments on two public datasets, namely MSR-VTT and VATEX, show the proposed HANet outperforms other state-of-the-art methods, which demonstrates the effectiveness of hierarchical representation and alignment. Our code is publicly available at https://github.com/Roc-Ng/HANet.

References

[1]

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299--6308.

Abstract

References

Cited By

Index Terms

Recommendations

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

FeatInter: Exploring fine-grained object features for video-text retrieval

Boosting Video-Text Retrieval with Explicit High-Level Semantics

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Upcoming Conference

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Get Access

Login options

Full Access

View options

PDF

eReader

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations