DOI: 10.1145/3394171.3413638
research-article

Text-Embedded Bilinear Model for Fine-Grained Visual Recognition

Published: 12 October 2020

Abstract

Fine-grained visual recognition, which aims to identify subcategories of the same base-level category, is challenging because of large intra-class variance and small inter-class variance. Humans recognize objects based not only on visual appearance but also on knowledge from texts, because texts can point out the discriminative parts or characteristics that are always the key to distinguishing different subcategories. This is an involuntary transfer from human textual attention to visual attention, which suggests that texts can assist fine-grained recognition. In this paper, we propose a Text-Embedded Bilinear (TEB) model that incorporates texts as extra guidance for fine-grained recognition. Specifically, we first construct a text-embedded network that embeds the text feature into discriminative image feature learning to obtain an embedded feature. In addition, since cross-layer part feature interaction and fine-grained feature learning are mutually correlated and can reinforce each other, we also extract a candidate feature from the text encoder and embed it into the inter-layer feature of the image encoder to obtain an embedded candidate feature. Finally, we use a cross-layer bilinear network to fuse the two embedded features. Compared with state-of-the-art methods on the widely used CUB-200-2011 and Oxford Flowers-102 datasets for fine-grained image recognition, the experimental results show that our TEB model achieves the best performance.
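
The fusion step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a Hadamard-product (low-rank) form of bilinear pooling, which is one common way to realize cross-layer bilinear interaction. The function name `cross_layer_bilinear`, the projection matrices `proj_a` and `proj_b`, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def cross_layer_bilinear(feat_a, feat_b, proj_a, proj_b):
    """Fuse two feature maps with a low-rank bilinear interaction.

    feat_a: (C_a, H, W) feature map, e.g. an embedded feature (assumed shape).
    feat_b: (C_b, H, W) feature map, e.g. an embedded candidate feature.
    proj_a: (D, C_a) projection matrix (hypothetical learned weights).
    proj_b: (D, C_b) projection matrix (hypothetical learned weights).
    Returns an L2-normalized vector of length D.
    """
    c_a, h, w = feat_a.shape
    c_b = feat_b.shape[0]
    # Project both maps into a shared D-dimensional space at every position.
    a = proj_a @ feat_a.reshape(c_a, h * w)  # (D, H*W)
    b = proj_b @ feat_b.reshape(c_b, h * w)  # (D, H*W)
    # Element-wise product models pairwise interaction; sum-pool over positions.
    z = (a * b).sum(axis=1)  # (D,)
    # Signed square root and L2 normalization, as is usual for bilinear features.
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-12)


# Usage with random stand-in features (no real data involved).
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 4, 4))
feat_b = rng.standard_normal((6, 4, 4))
proj_a = rng.standard_normal((16, 8))
proj_b = rng.standard_normal((16, 6))
fused = cross_layer_bilinear(feat_a, feat_b, proj_a, proj_b)
print(fused.shape)  # (16,)
```

The element-wise product keeps the fused vector at dimension D instead of the C_a × C_b size of a full outer product, which is why this factorized form is often preferred in practice.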

Supplementary Material

MP4 File (3394171.3413638.mp4)
In the video, we introduce our Text-Embedded Bilinear (TEB) model, which incorporates texts as extra guidance for fine-grained visual recognition. Specifically, we propose a text-embedded network that learns a channel-wise attention from text to embed text into image feature learning. We also use the candidate features from both image and text through a cross-layer bilinear network. Experiments and evaluations on the CUB-200-2011 and Oxford Flowers-102 datasets demonstrate the superiority of our TEB model over existing state-of-the-art methods.
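
As a rough illustration of channel-wise attention learned from text, the sketch below gates the channels of an image feature map with weights computed from a text feature vector. It assumes a squeeze-and-excitation style gate (two linear layers, ReLU, then sigmoid); the paper's actual layer design is not given on this page, so the function name `text_channel_attention`, the weights `w1` and `w2`, and all dimensions are hypothetical.

```python
import numpy as np


def text_channel_attention(image_feat, text_feat, w1, w2):
    """Reweight image channels with a gate computed from a text feature.

    image_feat: (C, H, W) image feature map (assumed shape).
    text_feat:  (T,) text feature vector from some text encoder.
    w1: (R, T) and w2: (C, R) are hypothetical learned weights.
    Returns the gated feature map and the per-channel gate.
    """
    hidden = np.maximum(w1 @ text_feat, 0.0)       # ReLU bottleneck, (R,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid, one weight per channel
    # Broadcast the (C,) gate over the spatial dimensions.
    return image_feat * gate[:, None, None], gate


# Usage with random stand-in features (no real data involved).
rng = np.random.default_rng(0)
image_feat = rng.standard_normal((8, 4, 4))
text_feat = rng.standard_normal(12)
w1 = rng.standard_normal((4, 12))
w2 = rng.standard_normal((8, 4))
attended, gate = text_channel_attention(image_feat, text_feat, w1, w2)
print(attended.shape, gate.shape)  # (8, 4, 4) (8,)
```

Because the gate depends only on the text, the same description emphasizes or suppresses the same channels at every spatial position.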

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-layer bilinear network
  2. deep learning
  3. fine-grained visual recognition
  4. multi-modal analysis

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • Fundamental Research Funds for the Central Universities
  • Sichuan Science and Technology Program China
  • Dongguan Songshan Lake Introduction Program of Leading Innovative and Entrepreneurial Talents

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 25
  • Downloads (last 6 weeks): 2

Reflects downloads up to 21 Nov 2024.

Cited By

  • (2024) Fine-Grained Recognition With Learnable Semantic Data Augmentation. IEEE Transactions on Image Processing 33, 3130-3144. DOI: 10.1109/TIP.2024.3364500. Online publication date: 2024
  • (2024) Integrating IoT and visual question answering in smart cities: Enhancing educational outcomes. Alexandria Engineering Journal 108, 878-888. DOI: 10.1016/j.aej.2024.09.059. Online publication date: Dec-2024
  • (2023) Object Size Recognition as Intra-class Variations using Transfer Learning. 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), 568-573. DOI: 10.1109/ICCoSITE57641.2023.10127785. Online publication date: 16-Feb-2023
  • (2022) MAVT-FG. Proceedings of the 30th ACM International Conference on Multimedia, 3811-3819. DOI: 10.1145/3503161.3548383. Online publication date: 10-Oct-2022
  • (2022) Rethinking Open-World Object Detection in Autonomous Driving Scenarios. Proceedings of the 30th ACM International Conference on Multimedia, 1279-1288. DOI: 10.1145/3503161.3548165. Online publication date: 10-Oct-2022
  • (2021) Learning Hierarchal Channel Attention for Fine-grained Visual Classification. Proceedings of the 29th ACM International Conference on Multimedia, 5011-5019. DOI: 10.1145/3474085.3475184. Online publication date: 17-Oct-2021
