
Sequential Learning for Ingredient Recognition From Images

Published: 01 May 2023

Abstract

Incorporating cooking logic into ingredient recognition from food images is beneficial for food cognition. Compared with food categorization, ingredient recognition gives a better understanding of food by providing crucial information about its composition. However, different foods are made from different ingredients, so incorporating cooking logic into ingredient recognition is necessary for better food cognition. Based on this observation, this paper proposes a sequential learning method that guides a neural network based (NN-based) model to produce ingredients following the corresponding cooking logic in recipes. First, to make maximal use of the visual features in an image, a double-flow feature fusion module (DFFF) is proposed to obtain features from two image-based visual tasks (food name proposal and multi-label ingredient proposal). The fused features from the DFFF, together with the original image features, are then fed into a bidirectional long short-term memory (Bi-LSTM) based ingredient generator that produces sequential ingredients. To guide the sequential ingredient generation process, reinforcement learning is employed with a hybrid loss covering both the common and individual traits of ingredients, optimizing the model's ability to associate images with sequential ingredients. In addition, the sequential ingredients are used in a backward flow to reconstruct the food images, so that sequential ingredient generation can be further optimized in a complementary manner. Experimental results demonstrate that our method drives the model to allocate more attention to the correlation between images and sequential ingredients, and that the produced ingredients are comprehensive and logical.
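The overall wiring described above can be illustrated with a minimal sketch. The PyTorch snippet below is a hypothetical illustration only, not the authors' implementation: features from the two visual branches (food name proposal and multi-label ingredient proposal) are fused with the original image feature and decoded by a Bi-LSTM into per-step ingredient predictions. All module names, dimensions, and the fixed sequence length are assumptions.

import torch
import torch.nn as nn

class SequentialIngredientSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab_size=1500):
        super().__init__()
        # two visual branches standing in for the food-name and multi-label ingredient proposals
        self.name_branch = nn.Linear(feat_dim, feat_dim)
        self.multi_label_branch = nn.Linear(feat_dim, feat_dim)
        # DFFF-style fusion of both branch outputs with the raw image feature
        self.fuse = nn.Linear(3 * feat_dim, hidden)
        # Bi-LSTM ingredient generator followed by a per-step ingredient classifier
        self.decoder = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, img_feat, steps=8):
        # img_feat: (batch, feat_dim) global feature from a CNN backbone
        fused = torch.cat([img_feat,
                           self.name_branch(img_feat),
                           self.multi_label_branch(img_feat)], dim=-1)
        fused = torch.relu(self.fuse(fused))
        # repeat the fused feature as a fixed-length input sequence for the Bi-LSTM
        seq_in = fused.unsqueeze(1).repeat(1, steps, 1)
        out, _ = self.decoder(seq_in)
        return self.classifier(out)  # (batch, steps, vocab_size) ingredient logits

model = SequentialIngredientSketch()
logits = model(torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 8, 1500])

In the actual model, this generation step would additionally be trained with the reinforcement-learning hybrid loss and the image-reconstruction backward flow described in the abstract.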




Information & Contributors

Information

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 33, Issue 5
May 2023
524 pages

Publisher

IEEE Press

Publication History

Published: 01 May 2023

Qualifiers

  • Research-article


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Dec 2024

Other Metrics

Citations

Cited By

  • (2024) "Recognize after early fusion: The Chinese food recognition based on the alignment of image and ingredients," Multimedia Systems, vol. 30, no. 2, Mar. 2024. 10.1007/s00530-024-01297-w
  • (2023) "An Active Task Cognition Method for Home Service Robot Using Multi-Graph Attention Fusion Mechanism," IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4957–4972, Dec. 2023. 10.1109/TCSVT.2023.3339292
