Generation of virtual digital human for customer service industry

Published: 01 February 2024

Abstract

With the widespread development of digital technology, individuals’ daily activities are inseparable from interaction with electronic devices. Researchers have therefore become interested in developing novel methods that give users the social and emotional satisfaction that traditional face-to-face interaction provides. In this study, we propose a novel deep learning-based pipeline for generating virtual digital humans for the customer service industry. Specifically, we propose a method for constructing a database of template service actions. Furthermore, we propose a two-stage method for generating 2D virtual human videos with gestures and emotional lip-sync expressions. We conducted qualitative and quantitative experiments on the proposed 2D virtual human video generation method; the results demonstrate that it effectively generates high-quality virtual digital humans for the customer service industry.
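
The abstract describes a two-stage pipeline (template service gestures, then emotional lip-sync on the face region) but does not include code. The following is a minimal, illustrative Python sketch of how such a pipeline could be orchestrated; every name in it (TemplateActionDB, generate_gesture_video, apply_emotional_lipsync) is a hypothetical placeholder and not taken from the paper.

```python
# Illustrative sketch only: the paper does not release code, and every name
# below is a hypothetical placeholder for the two-stage pipeline the
# abstract describes (template gesture video, then emotional lip-sync).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TemplateAction:
    """One template service-gesture clip stored in the action database."""
    name: str               # e.g. "greeting", "pointing", "bowing"
    frame_paths: List[str]  # paths to the clip's RGB frames


class TemplateActionDB:
    """Database of template service actions, as proposed in the abstract."""

    def __init__(self, actions: List[TemplateAction]) -> None:
        self._by_name: Dict[str, TemplateAction] = {a.name: a for a in actions}

    def lookup(self, name: str) -> TemplateAction:
        return self._by_name[name]


def generate_gesture_video(action: TemplateAction) -> List[str]:
    """Stage 1 (hypothetical): synthesize a 2D virtual-human body video
    that performs the selected service gesture."""
    return list(action.frame_paths)  # placeholder: a gesture-generation network would run here


def apply_emotional_lipsync(frames: List[str], audio_path: str, emotion: str) -> List[str]:
    """Stage 2 (hypothetical): re-render the face region so the lips follow
    the service audio and carry the requested emotional expression."""
    return frames  # placeholder: an audio-driven face-generation network would run here


def generate_service_clip(db: TemplateActionDB, action_name: str,
                          audio_path: str, emotion: str) -> List[str]:
    """End-to-end flow implied by the abstract: pick a template action,
    generate the gesture video, then add emotional lip-sync."""
    body_frames = generate_gesture_video(db.lookup(action_name))
    return apply_emotional_lipsync(body_frames, audio_path, emotion)


if __name__ == "__main__":
    db = TemplateActionDB([TemplateAction("greeting", ["frame_000.png", "frame_001.png"])])
    clip = generate_service_clip(db, "greeting", "welcome.wav", emotion="happy")
    print(f"generated {len(clip)} frames")
```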

Highlights

Constructing a dataset of template actions used in service-industry scenarios.
Proposing a simple yet effective method for generating 2D virtual human videos.
Generating gestures and emotional lip-sync expressions using neural networks.

        Published In

        Computers and Graphics  Volume 115, Issue C
        Oct 2023
        554 pages

        Publisher

        Pergamon Press, Inc.

        United States

        Author Tags

        1. 2D virtual humans
        2. Service gestures
        3. Emotion editing
        4. Gesture generation

        Qualifiers

        • Research-article
