Text-Embedded Bilinear Model for Fine-Grained Visual Recognition

Authors:

Liang Sun,

Xiang Guan,

Yang Yang,

Lei Zhang

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 211 - 219

https://doi.org/10.1145/3394171.3413638

Published: 12 October 2020

Abstract

Fine-grained visual recognition, which aims to identify subcategories of the same base-level category, is a challenging task because of its large intra-class variance and small inter-class variance. Human beings recognize objects based not only on visual appearance but also on knowledge from texts, as texts can point out the discriminative parts or characteristics that are always the key to distinguishing different subcategories. This is an involuntary transfer from human textual attention to visual attention, suggesting that texts can assist fine-grained recognition. In this paper, we propose a Text-Embedded Bilinear (TEB) model which incorporates texts as extra guidance for fine-grained recognition. Specifically, we first construct a text-embedded network that embeds the text feature into discriminative image feature learning to obtain an embedded feature. In addition, since cross-layer part feature interaction and fine-grained feature learning are mutually correlated and can reinforce each other, we also extract a candidate feature from the text encoder and embed it into the inter-layer feature of the image encoder to obtain an embedded candidate feature. Finally, we utilize a cross-layer bilinear network to fuse the two embedded features. In comparisons with state-of-the-art methods on the widely used CUB-200-2011 and Oxford Flowers-102 datasets for fine-grained image recognition, the experimental results demonstrate that our TEB model achieves the best performance.
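The fusion step can be illustrated with a generic bilinear pooling sketch. This is not the authors' implementation: the abstract does not specify the network, so the function name `bilinear_pool`, the toy inputs, and the signed square root and L2 normalization (a common post-processing choice in bilinear models) are all assumptions made for illustration. The sketch sums the outer products of two feature maps over their shared spatial locations.

```python
import math

def bilinear_pool(feat_a, feat_b):
    """Sum of outer products over spatial locations (illustrative only).

    feat_a: list of N location vectors, each of length Ca
    feat_b: list of N location vectors, each of length Cb
    Returns a flattened Ca*Cb vector after signed square root
    and L2 normalization.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("both feature maps must cover the same locations")
    ca, cb = len(feat_a[0]), len(feat_b[0])
    pooled = [0.0] * (ca * cb)
    for xa, xb in zip(feat_a, feat_b):
        for i in range(ca):
            for j in range(cb):
                pooled[i * cb + j] += xa[i] * xb[j]
    # Signed square root keeps the sign and compresses large magnitudes.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled

# Hypothetical toy features: 2 locations, 2 channels per feature map.
feat_a = [[1.0, 0.0], [0.0, 2.0]]
feat_b = [[1.0, 1.0], [1.0, -1.0]]
fused = bilinear_pool(feat_a, feat_b)
```

Here `feat_a` and `feat_b` stand in for two embedded features taken from different layers. In a real network they would be convolutional activations, typically projected to a lower dimension before pooling.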

Supplementary Material

MP4 File (3394171.3413638.mp4)

In the video, we introduce our Text-Embedded Bilinear (TEB) model, which incorporates texts as extra guidance for fine-grained visual recognition. Specifically, we propose a text-embedded network, which learns a channel-wise attention from text to embed text into image feature learning. We also utilize the candidate features from both image and text through a cross-layer bilinear network. Experiments and evaluations conducted on the CUB-200-2011 and Oxford Flowers-102 datasets demonstrate the superiority of our TEB model over existing state-of-the-art methods.
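A minimal sketch of channel-wise attention driven by text follows. It assumes one plausible form: a linear map from the text embedding followed by a sigmoid, giving one gate per image channel. The description above does not give the actual layer design, so the function names, weights, and shapes here are hypothetical.

```python
import math

def text_channel_gate(text_vec, weights, bias):
    """Map a text embedding to one gate in (0, 1) per image channel."""
    gates = []
    for row, b in zip(weights, bias):
        z = sum(w * t for w, t in zip(row, text_vec)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid
    return gates

def apply_gate(feature_map, gates):
    """Scale every channel of a C x N feature map by its gate."""
    return [[g * v for v in channel] for channel, g in zip(feature_map, gates)]

# Hypothetical toy values: a 2-d text embedding gating 3 image channels.
text_vec = [1.0, -1.0]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # assumed learned parameters
bias = [0.0, 0.0, 0.0]
feature_map = [[1.0, 1.0], [1.0, 1.0], [4.0, 2.0]]  # 3 channels, 2 locations

gates = text_channel_gate(text_vec, weights, bias)
embedded = apply_gate(feature_map, gates)
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate near 0 are suppressed. This is one way a text description can emphasize the image channels that respond to the parts it mentions.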

