
Sequential Learning for Ingredient Recognition From Images

Published: 01 May 2023

Abstract

Incorporating cooking logic into ingredient recognition from food images is beneficial for food cognition. Compared with food categorization, ingredient recognition gives a better understanding of food by providing crucial information about its composition. However, different foods are made from different ingredients, so incorporating cooking logic into ingredient recognition is necessary for better food cognition. Based on this observation, this paper proposes a sequential learning method that guides a neural network based (NN-based) model to produce ingredients following the corresponding cooking logic in recipes. First, to make maximal use of the visual features in an image, a double-flow feature fusion module (DFFF) is proposed to obtain features from two image-based visual tasks (food name proposal and multi-label ingredient proposal). The fused features from the DFFF, together with the original image features, are then fed into a bidirectional long short-term memory (Bi-LSTM) based ingredient generator that produces sequential ingredients. To guide the sequential ingredient generation process, reinforcement learning is employed with a hybrid loss covering both the common and individual traits of ingredients, optimizing the model's ability to associate images with sequential ingredients. In addition, the sequential ingredients are used in a backward flow to reconstruct the food images, so that sequential ingredient generation can be further optimized in a complementary manner. Experimental results demonstrate that our method drives the model to allocate more attention to the correlation between images and sequential ingredients, and that the produced ingredients are comprehensive and logical.
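The overall wiring described above can be illustrated with a minimal sketch. The PyTorch snippet below is a hypothetical illustration only, not the authors' implementation: features from the two visual branches (food name proposal and multi-label ingredient proposal) are fused with the original image feature and decoded by a Bi-LSTM into per-step ingredient predictions. All module names, dimensions, and the fixed sequence length are assumptions.

import torch
import torch.nn as nn

class SequentialIngredientSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab_size=1500):
        super().__init__()
        # two visual branches standing in for the food-name and multi-label ingredient proposals
        self.name_branch = nn.Linear(feat_dim, feat_dim)
        self.multi_label_branch = nn.Linear(feat_dim, feat_dim)
        # DFFF-style fusion of both branch outputs with the raw image feature
        self.fuse = nn.Linear(3 * feat_dim, hidden)
        # Bi-LSTM ingredient generator followed by a per-step ingredient classifier
        self.decoder = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, img_feat, steps=8):
        # img_feat: (batch, feat_dim) global feature from a CNN backbone
        fused = torch.cat([img_feat,
                           self.name_branch(img_feat),
                           self.multi_label_branch(img_feat)], dim=-1)
        fused = torch.relu(self.fuse(fused))
        # repeat the fused feature as a fixed-length input sequence for the Bi-LSTM
        seq_in = fused.unsqueeze(1).repeat(1, steps, 1)
        out, _ = self.decoder(seq_in)
        return self.classifier(out)  # (batch, steps, vocab_size) ingredient logits

model = SequentialIngredientSketch()
logits = model(torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 8, 1500])

In the actual model, this generation step would additionally be trained with the reinforcement-learning hybrid loss and the image-reconstruction backward flow described in the abstract.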




Information & Contributors

Information

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 33, Issue 5
May 2023
524 pages

Publisher

IEEE Press

Publication History

Published: 01 May 2023

Qualifiers

  • Research-article


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Dec 2024

Other Metrics

Citations

Cited By

  • (2024) "Recognize after early fusion: The Chinese food recognition based on the alignment of image and ingredients," Multimedia Systems, vol. 30, no. 2, Mar. 2024. 10.1007/s00530-024-01297-w
  • (2023) "An Active Task Cognition Method for Home Service Robot Using Multi-Graph Attention Fusion Mechanism," IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4957–4972, Dec. 2023. 10.1109/TCSVT.2023.3339292
