research-article

GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce

Authors:

Fedor Borisyuk

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 2608 - 2616

https://doi.org/10.1145/3394486.3403311

Published: 20 August 2020

Abstract

In this paper, we present GrokNet, a deployed image recognition system for commerce applications. GrokNet leverages a multi-task learning approach to train a single computer vision trunk. We achieve a 2.1x improvement in exact product match accuracy when compared to the previous state-of-the-art Facebook product recognition system. We achieve this by training on 7 datasets across several commerce verticals, using 80 categorical loss functions and 3 embedding losses. We share our experience of combining diverse sources with wide-ranging label semantics and image statistics, including learning from human annotations, user-generated tags, and noisy search engine interaction data. GrokNet has demonstrated gains in production applications and operates at Facebook scale.
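The abstract says GrokNet trains one trunk against 80 categorical losses drawn from 7 datasets, but this page does not say how those losses are combined. The sketch below is a hedged illustration only. It assumes a weighted sum of per-head softmax cross-entropy losses, where an image contributes only to the heads it has labels for, which is one plausible rule when datasets carry different label sets. The head names, the weights, and the masking rule are hypothetical and not taken from the paper. The 3 embedding losses are omitted.

```python
import math
from typing import Dict, List, Optional


def softmax_cross_entropy(logits: List[float], target: int) -> float:
    """Categorical cross-entropy for one example, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(
    head_logits: Dict[str, List[float]],
    labels: Dict[str, Optional[int]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-head categorical losses for one image.

    Assumption (not from the paper): a head contributes only if this image
    carries a label for it, so images from a dataset that does not annotate
    an attribute simply skip that head. Heads absent from `weights` default
    to weight 1.0, an arbitrary placeholder.
    """
    total = 0.0
    for head, logits in head_logits.items():
        target = labels.get(head)
        if target is None:
            continue  # no label for this head on this image
        total += weights.get(head, 1.0) * softmax_cross_entropy(logits, target)
    return total


if __name__ == "__main__":
    # Hypothetical heads: a product-category classifier and a color classifier.
    logits = {"category": [2.0, 0.5, -1.0], "color": [0.0, 0.0]}
    # This image is labeled for category only, so the color head is skipped.
    labels = {"category": 0, "color": None}
    print(round(multitask_loss(logits, labels, {"category": 1.0, "color": 0.5}), 4))
```

In a trained system each head would read features from the shared trunk, and the per-head weights would be tuned rather than fixed; the masking step is what lets a single batch mix images from datasets with unrelated label vocabularies.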

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
  2. Machine learning
    1. Learning paradigms
      1. Multi-task learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
    2. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Information & Contributors

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

August 2020

3664 pages

ISBN: 9781450379984

DOI: 10.1145/3394486

General Chairs:
Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)

Program Chairs:
Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)

Publications Chairs:
Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '20

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 6 - 10, 2020

Virtual Event, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Article Metrics

Total Citations: 20
Total Downloads: 558
Downloads (last 12 months): 41
Downloads (last 6 weeks): 5

Reflects downloads up to 18 Nov 2024

Cited By

Zhu X, Huang S, Ding H, Yang J, Chen K, Zhou T, Neiman T, Xie O, Tran S, Yao B, Gray D, Bindal A, Dhua A. 2024. Bringing Multimodality to Amazon Visual Search System. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds. Baeza-Yates R, Bonchi F), 6390-6399. Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671640

Velioglu R, Chan R, Hammer B. 2024. FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation. In 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. Online publication date: 30-Jun-2024.
https://doi.org/10.1109/IJCNN60899.2024.10651287

Hu Z, Li S, Du M, Dhua A, Gray D. 2024. De-noised Vision-language Fusion Guided by Visual Cues for E-commerce Product Search. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1986-1996. Online publication date: 17-Jun-2024.
https://doi.org/10.1109/CVPRW63382.2024.00204

Torbarina L, Ferkovic T, Roguski L, Mihelcic V, Sarlija B, Kraljevic Z. 2024. Challenges and Opportunities of Using Transformer-Based Multi-Task Learning in NLP Through ML Lifecycle: A Position Paper. Natural Language Processing Journal 7, 100076. Online publication date: Jun-2024.
https://doi.org/10.1016/j.nlp.2024.100076

He Y, Tian Y, Wang M, Chen F, Yu L, Tang M, Chen C, Zhang N, Kuang B, Prakash A. 2023. Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook Marketplace. In Companion Proceedings of the ACM Web Conference 2023, 386-390. Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543873.3584633

Dagan A, Guy I, Novgorodov S. 2023. Shop by image: characterizing visual search in e-commerce. Information Retrieval Journal 26:1. Online publication date: 3-Mar-2023.
https://doi.org/10.1007/s10791-023-09418-1

Pancha N, Zhai A, Leskovec J, Rosenberg C. 2022. PinnerFormer. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds. Zhang A, Rangwala H), 3702-3712. Online publication date: 14-Aug-2022.
https://doi.org/10.1145/3534678.3539156

Yu L, Chen J, Sinha A, Wang M, Chen Y, Berg T, Zhang N. 2022. CommerceMM. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds. Zhang A, Rangwala H), 4433-4442. Online publication date: 14-Aug-2022.
https://dl.acm.org/doi/10.1145/3534678.3539151

Du M, Ramisa A, K C A, Chanda S, Wang M, Rajesh N, Li S, Hu Y, Zhou T, Lakshminarayana N, Tran S, Gray D. 2022. Amazon Shop the Look: A Visual Search System for Fashion and Home. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds. Zhang A, Rangwala H), 2822-2830. Online publication date: 14-Aug-2022.
https://dl.acm.org/doi/10.1145/3534678.3539071

Shin W, Park J, Woo T, Cho Y, Oh K, Song H. 2022. e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (eds. Al Hasan M, Xiong L), 3484-3494. Online publication date: 17-Oct-2022.
https://dl.acm.org/doi/10.1145/3511808.3557067
