
Fine-Grained Instance-Level Sketch-Based Image Retrieval

  • Published in: International Journal of Computer Vision

Abstract

The problem of fine-grained sketch-based image retrieval (FG-SBIR) is defined and investigated in this paper. In FG-SBIR, free-hand human sketches are used as queries to retrieve photo images containing the same object instances. It is thus a cross-domain (sketch to photo) instance-level retrieval task. The problem is extremely challenging because (i) visual comparison and matching must be carried out across a large domain gap, i.e., from black-and-white line-drawing sketches to colour photos; (ii) it requires capturing the fine-grained (dis)similarities between sketches and photos, while free-hand sketches drawn by different people exhibit different levels of deformation and expressive interpretation; and (iii) annotated cross-domain fine-grained SBIR datasets are scarce, challenging many state-of-the-art machine learning techniques, particularly those based on deep learning. In this paper, for the first time, we address all these challenges, providing a step towards the capabilities that would underpin a commercial sketch-based object instance retrieval application. Specifically, a new large-scale FG-SBIR database is introduced, carefully designed to reflect real-world application scenarios. A deep cross-domain matching model is then formulated to cope with the intrinsic drawing-style variability and the large domain gap, and to capture instance-level discriminative features. It distinguishes itself by a carefully designed attention module. Extensive experiments on the new dataset demonstrate the effectiveness of the proposed model and validate the need for a rigorous definition of the FG-SBIR problem and the collection of suitable datasets.
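Cross-domain matching models of the kind described in the abstract are commonly trained with a triplet ranking objective over attention-pooled feature embeddings. The following minimal sketch is illustrative only, not the authors' implementation: the function names, the single projection vector `w` standing in for the attention module, and the margin value are all assumptions.

```python
import numpy as np

def attention_pool(feat_map, w):
    """Collapse a spatial feature map into one descriptor via soft attention.

    feat_map: (H*W, C) array of local CNN features (hypothetical input).
    w:        (C,) learned attention projection (hypothetical parameter).
    """
    scores = feat_map @ w                      # one relevance score per location
    alpha = np.exp(scores - scores.max())      # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ feat_map                    # attention-weighted average, shape (C,)

def triplet_loss(anchor, pos, neg, margin=0.3):
    """Ranking loss: pull the matching photo closer than a non-matching one."""
    d_pos = np.sum((anchor - pos) ** 2)        # sketch-to-matching-photo distance
    d_neg = np.sum((anchor - neg) ** 2)        # sketch-to-other-photo distance
    return max(0.0, margin + d_pos - d_neg)
```

At retrieval time, the attended descriptor of a query sketch would be compared to those of all gallery photos by Euclidean distance, and the gallery ranked by increasing distance.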

Figs. 1–6 (figures not reproduced in this version)

Notes

  1. Free-hand sketch in this work refers to sketches drawn by amateurs based on their mental recollection. Specifically, we assume that before a human draws a sketch, (s)he has seen a reference object instance, but does not have the object or a photo at hand while drawing.

  2. Here ‘CFF’ refers to the operation of combining a feature map extracted from an earlier layer with the final-layer output. This differs from its meaning in the preliminary version (Song et al. 2017), where it denoted both the feature fusion and the residual attention module.
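As a rough illustration of the fusion operation described in this note (a hypothetical sketch, not the paper's exact formulation: the pooling and concatenation choices are assumptions), the earlier-layer feature map can be pooled into a vector and joined with the final-layer output:

```python
import numpy as np

def coarse_fine_fusion(early_map, final_vec):
    """Fuse an earlier conv feature map with the final-layer output.

    early_map: (H, W, C_e) feature map from an intermediate layer.
    final_vec: (C_f,) final-layer embedding.
    """
    early_vec = early_map.mean(axis=(0, 1))        # global average pooling
    return np.concatenate([early_vec, final_vec])  # joint coarse+fine descriptor
```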

References

  • Bui, T., Ribeiro, L., Ponti, M., & Collomosse, J. (2016). Generalisation and sharing in triplet convnets for sketch based visual search. arXiv preprint arXiv:1611.05301.

  • Bui, T., Ribeiro, L., Ponti, M., & Collomosse, J. (2018). Sketching out the details: sketch-based image retrieval using convolutional neural networks with multi-stage regression. Computers & Graphics, 71, 77–87.

  • Cao, Y., Wang, H., Wang, C., Li, Z., Zhang, L., & Zhang, L. (2010). Mindfinder: interactive sketch-based image search on millions of images. In International conference on multimedia.

  • Cao, Y., Wang, C., Zhang, L., & Zhang, L. (2011). Edgel index for large-scale sketch-based image search. In CVPR.

  • Chen, T., Cheng, M. M., Tan, P., Shamir, A., & Hu, S. M. (2009). Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28, 1–10.

  • Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE computer society conference on computer vision and pattern recognition.

  • Collomosse, J., Bui, T., Wilber, M. J., Fang, C., & Jin, H. (2017). Sketching with style: visual search with sketches and aesthetic context. In Proceedings of the IEEE international conference on computer vision.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In CVPR.

  • Eitz, M., Hildebrand, K., Boubekeur, T., & Alexa, M. (2010). An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics, 34(5), 482–498.

  • Eitz, M., Hildebrand, K., Boubekeur, T., & Alexa, M. (2011). Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11), 1624–1636.

  • Eitz, M., Hays, J., & Alexa, M. (2012). How do humans sketch objects? ACM Transactions on Graphics (TOG), 31, 1–10.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.

  • Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. CoRR, arXiv:1505.07376.

  • Gong, Y., Wang, L., Guo, R., & Lazebnik, S. (2014). Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision.

  • Gordo, A., Almazan, J., Revaud, J., & Larlus, D. (2017). End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2), 237–254.

  • Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., & Van Gool, L. (2013). The interestingness of images. In IEEE international conference on computer vision.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Hu, R., & Collomosse, J. (2013). A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7), 790–806.

  • Hu, R., Barnard, M., & Collomosse, J. (2010). Gradient field descriptor for sketch based retrieval and localization. In IEEE international conference on image processing.

  • Hu, R., Wang, T., & Collomosse, J. (2011). A bag-of-regions approach to sketch based image retrieval. In IEEE international conference on image processing.

  • Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems.

  • James, S., Fonseca, M., & Collomosse, J. (2014). Reenact: Sketch based choreographic design from archival dance footage. In Proceedings of international conference on multimedia retrieval.

  • Jiang, Y. G., Wang, Y., Feng, R., Xue, X., Zheng, Y., & Yang, H. (2013). Understanding and predicting interestingness of videos. In AAAI.

  • Johnson, J., Krishna, R., Stark, M., Li, L. J., Shamma, D., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In CVPR.

  • Krizhevsky, A., & Hinton, G. E. (2011). Using very deep autoencoders for content-based image retrieval. In European symposium on artificial neural networks.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.

  • Landay, J. A., & Myers, B. A. (2001). Sketching interfaces: toward more human interface design. IEEE Computer, 34(3), 56–64.

  • Li, Y., Hospedales, T., Song, Y. Z., & Gong, S. (2014). Fine-grained sketch-based image retrieval by matching deformable part models. In BMVC.

  • Li, Y., Hospedales, T. M., Song, Y. Z., & Gong, S. (2015). Free-hand sketch recognition by multi-kernel feature learning. Computer Vision and Image Understanding, 137, 1–11.

  • Li, K., Pang, K., Song, Y. Z., Hospedales, T. M., Xiang, T., & Zhang, H. (2017). Synergistic instance-level subspace alignment for fine-grained sketch-based image retrieval. IEEE Transactions on Image Processing, 26(12), 5908–5921.

  • Lin, Y., Huang, C., Wan, C., & Hsu, W. (2013). 3D sub-query expansion for improving sketch-based multi-view image retrieval. In Proceedings of the IEEE international conference on computer vision.

  • Lin, T. Y., RoyChowdhury, A., & Maji, S. (2015). Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision (pp. 1449–1457).

  • Liu, L., Shen, F., Shen, Y., Liu, X., & Shao, L. (2017a). Deep sketch hashing: fast free-hand sketch-based image retrieval. arXiv preprint arXiv:1703.05605.

  • Liu, Y., Guo, Y., & Lew, M. S. (2017b). On the exploration of convolutional fusion networks for visual recognition. In International conference on multimedia modeling.

  • Lu, J., Xiong, C., Parikh, D., & Socher, R. (2016). Knowing when to look: adaptive attention via a visual sentinel for image captioning. arXiv preprint arXiv:1612.01887.

  • Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them. In IEEE conference on computer vision and pattern recognition.

  • Marr, D. (1982). Vision. New York: W. H. Freeman and Company.

  • Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. In Advances in neural information processing systems.

  • Moulin, C., Largeron, C., Ducottet, C., Géry, M., & Barat, C. (2014). Fisher linear discriminant analysis for text-image combination in multimedia information retrieval. Pattern Recognition, 47(1), 260–269.

  • Nam, H., Ha, J. W., & Kim, J. (2016). Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471.

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European conference on computer vision.

  • Noh, H., Araujo, A., Sim, J., Weyand, T., & Han, B. (2017). Large-scale image retrieval with attentive deep local features. In IEEE international conference on computer vision.

  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In IEEE conference on computer vision and pattern recognition.

  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2008). Lost in quantization: improving particular object retrieval in large scale image databases. In IEEE conference on computer vision and pattern recognition.

  • Prosser, B. J., Zheng, W. S., Gong, S., Xiang, T., & Mary, Q. (2010). Person re-identification by support vector ranking. In British machine vision conference.

  • Radenovic, F., Tolias, G., & Chum, O. (2018). Deep shape matching. In Proceedings of the European conference on computer vision.

  • Radenović, F., Tolias, G., & Chum, O. (2018). Fine-tuning cnn image retrieval with no human annotation. TPAMI, 41(7), 1655–1668.

  • Ren, X. (2008). Multi-scale improves boundary detection in natural images. In Proceedings of the European conference on computer vision.

  • Sangkloy, P., Burnell, N., Ham, C., & Hays, J. (2016). The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35, 1–12.

  • Sermanet, P., Frome, A., & Real, E. (2014). Attention for fine-grained categorization. arXiv preprint arXiv:1412.7054.

  • Song, J., Yu, Q., Song, Y. Z., Xiang, T., & Hospedales, T. M. (2017). Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proceedings of the IEEE international conference on computer vision.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. IEEE conference on computer vision and pattern recognition. arXiv:1409.4842.

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016a). Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.

  • Wang, X., & Tang, X. (2009). Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 1955–1967.

  • Wang, C., Li, Z., & Zhang, L. (2010). Mindfinder: image search by interactive sketching and tagging. In Proceedings of the 19th international conference on world wide web.

  • Wang, F., Kang, L., & Li, Y. (2015). Sketch-based 3D shape retrieval using convolutional neural networks. In IEEE conference on computer vision and pattern recognition.

  • Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., & Zhang, Z. (2015). The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In IEEE conference on computer vision and pattern recognition.

  • Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In IEEE international conference on computer vision.

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., & Bengio, Y. (2015). Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning.

  • Yang, S., & Ramanan, D. (2015). Multi-scale recognition with DAG-CNNS. In IEEE international conference on computer vision.

  • Yu, A., & Grauman, K. (2014). Fine-grained visual comparisons with local learning. In IEEE conference on computer vision and pattern recognition.

  • Yu, Q., Yang, Y., Song, Y., Xiang, T., & Hospedales, T. (2015). Sketch-a-net that beats humans. In BMVC.

  • Yu, Q., Liu, F., Song, Y. Z., Xiang, T., Hospedales, T. M., & Loy, C. C. (2016). Sketch me that shoe. In IEEE conference on computer vision and pattern recognition.

  • Yu, Q., Yang, Y., Liu, F., Song, Y. Z., Xiang, T., & Hospedales, T. M. (2017). Sketch-a-net: a deep neural network that beats humans. International Journal of Computer Vision, 122(3), 411–425.

  • Zhang, J., Shen, F., Liu, L., Zhu, F., Yu, M., Shao, L., Tao Shen, H., & Van Gool, L. (2018). Generative domain-migration hashing for sketch-to-image retrieval. In Proceedings of the European conference on computer vision (ECCV).

  • Zhu, J. Y., Lee, Y. J., & Efros, A. A. (2014). AverageExplorer: interactive exploration and alignment of visual data collections. ACM Transactions on Graphics (TOG), 33, 1–11.

  • Zitnick, C. L., & Dollár, P. (2014). Edge boxes: locating object proposals from edges. In Proceedings of the European conference on computer vision.

Author information

Corresponding author

Correspondence to Qian Yu.

Additional information

Communicated by Patrick Perez.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yu, Q., Song, J., Song, YZ. et al. Fine-Grained Instance-Level Sketch-Based Image Retrieval. Int J Comput Vis 129, 484–500 (2021). https://doi.org/10.1007/s11263-020-01382-3
