Generation of virtual digital human for customer service industry

Published: 01 February 2024

Abstract

With the widespread development of digital technology, individuals’ daily activities are inseparable from interaction with electronic devices. Researchers have therefore become interested in developing novel methods that give users the social and emotional satisfaction that traditional face-to-face interaction provides. In this study, we propose a novel deep learning-based pipeline for generating virtual digital humans for the customer service industry. Specifically, we propose a method for constructing a database of template service actions. Furthermore, we propose a two-stage method for generating 2D virtual human videos with gestures and emotional lip-sync expressions. We conducted qualitative and quantitative experiments on the proposed 2D virtual human video generation method; the results demonstrate that it effectively generates high-quality virtual digital humans for the customer service industry.
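
The abstract describes a two-stage pipeline (template service gestures, then emotional lip-sync on the face region) but does not include code. The following is a minimal, illustrative Python sketch of how such a pipeline could be orchestrated; every name in it (TemplateActionDB, generate_gesture_video, apply_emotional_lipsync) is a hypothetical placeholder and not taken from the paper.

```python
# Illustrative sketch only: the paper does not release code, and every name
# below is a hypothetical placeholder for the two-stage pipeline the
# abstract describes (template gesture video, then emotional lip-sync).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TemplateAction:
    """One template service-gesture clip stored in the action database."""
    name: str               # e.g. "greeting", "pointing", "bowing"
    frame_paths: List[str]  # paths to the clip's RGB frames


class TemplateActionDB:
    """Database of template service actions, as proposed in the abstract."""

    def __init__(self, actions: List[TemplateAction]) -> None:
        self._by_name: Dict[str, TemplateAction] = {a.name: a for a in actions}

    def lookup(self, name: str) -> TemplateAction:
        return self._by_name[name]


def generate_gesture_video(action: TemplateAction) -> List[str]:
    """Stage 1 (hypothetical): synthesize a 2D virtual-human body video
    that performs the selected service gesture."""
    return list(action.frame_paths)  # placeholder: a gesture-generation network would run here


def apply_emotional_lipsync(frames: List[str], audio_path: str, emotion: str) -> List[str]:
    """Stage 2 (hypothetical): re-render the face region so the lips follow
    the service audio and carry the requested emotional expression."""
    return frames  # placeholder: an audio-driven face-generation network would run here


def generate_service_clip(db: TemplateActionDB, action_name: str,
                          audio_path: str, emotion: str) -> List[str]:
    """End-to-end flow implied by the abstract: pick a template action,
    generate the gesture video, then add emotional lip-sync."""
    body_frames = generate_gesture_video(db.lookup(action_name))
    return apply_emotional_lipsync(body_frames, audio_path, emotion)


if __name__ == "__main__":
    db = TemplateActionDB([TemplateAction("greeting", ["frame_000.png", "frame_001.png"])])
    clip = generate_service_clip(db, "greeting", "welcome.wav", emotion="happy")
    print(f"generated {len(clip)} frames")
```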

Highlights

Constructing a dataset of template actions used in service-industry scenarios.
Proposing a simple yet effective method for generating 2D virtual human videos.
Generating gestures and emotional lip-sync expressions using neural networks.

        Published In

        Computers and Graphics  Volume 115, Issue C
        Oct 2023
        554 pages

        Publisher

        Pergamon Press, Inc.

        United States

        Author Tags

        1. 2D virtual humans
        2. Service gestures
        3. Emotion editing
        4. Gesture generation

        Qualifiers

        • Research-article
