
Visual-audio correspondence and its effect on video tipping: Evidence from Bilibili vlogs

Published: 01 May 2023

Abstract

Video tipping accounts for a remarkable share of the income of online streaming platforms such as Bilibili. Viewers can sense specific mappings between audio and visual signals (e.g., the congruency of pitch and size), a phenomenon generally called visual-audio correspondence (VAC), which is believed to influence viewer satisfaction with video clips. However, no method yet exists to measure VAC automatically, and its possible effect on video tipping has rarely been examined in previous work. In this study, a deep neural network with two sub-networks, named VAC-Net, is established to map visual and audio stimuli into a shared embedding space, and the Euclidean distance between the visual and audio representations in this space serves as the indicator of VAC. Pre-trained models of both modalities and the triplet loss are leveraged to train VAC-Net, which evaluates the VAC of video clips with a test accuracy of 68.37%, outperforming alternative baselines and even exceeding humans on a similar task. Lab experiments further show that VAC-Net's measurements conform to human cognition. Second, considering that viewers' tipping behavior (TIP) on videos is consistent with the Pay-What-You-Want (PWYW) pricing strategy, it is hypothesized that VAC indirectly influences TIP by reshaping viewer satisfaction (VS). Regression models built to test this hypothesis show that VAC significantly promotes TIP by enhancing VS. Additional tests accounting for various controls and measurement errors demonstrate the robustness of this mechanism. Our results supplement PWYW in streaming videos with a new motive for viewer tipping, VAC, and provide streaming practitioners with an automatic tool to estimate the tips videos will receive.
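The measurement idea in the abstract can be made concrete with a short sketch: two branch networks project pre-extracted modality features into one shared space, a triplet loss pulls matched visual-audio pairs together while pushing mismatched pairs apart, and the Euclidean distance between the two embeddings is read off as the VAC score. The PyTorch sketch below is illustrative only; the branch architectures, feature dimensions, and names such as `VACNetSketch` and `vac_distance` are assumptions, not the authors' actual VAC-Net.

```python
# Minimal two-branch embedding sketch of the VAC idea (illustrative, not the
# paper's architecture). Assumes visual/audio features already extracted by
# pre-trained models, as the abstract describes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VACNetSketch(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, embed_dim=64):
        super().__init__()
        # Each branch maps one modality into the shared embedding space.
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, visual_feat, audio_feat):
        v = F.normalize(self.visual_branch(visual_feat), dim=-1)
        a = F.normalize(self.audio_branch(audio_feat), dim=-1)
        return v, a

def vac_distance(v, a):
    # Euclidean distance in the shared space: smaller distance = higher VAC.
    return torch.norm(v - a, p=2, dim=-1)

# Triplet loss: the visual embedding is the anchor, the clip's own audio is
# the positive, and audio from another clip is the negative.
triplet_loss = nn.TripletMarginLoss(margin=0.2)

model = VACNetSketch()
visual = torch.randn(8, 512)      # anchor: visual features of 8 clips
audio_pos = torch.randn(8, 128)   # positive: each clip's own audio
audio_neg = torch.randn(8, 128)   # negative: audio from other clips

v, a_pos = model(visual, audio_pos)
_, a_neg = model(visual, audio_neg)
loss = triplet_loss(v, a_pos, a_neg)
loss.backward()
print(vac_distance(v, a_pos).detach())  # per-clip VAC scores
```

After training, a clip's VAC indicator is simply `vac_distance` evaluated on its own visual-audio pair; ranking clips by this distance is what allows the comparison against baselines and human judgments reported in the abstract.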


Highlights

Establish VAC-Net to measure visual-audio correspondence.
Demonstrate the competence of VAC-Net against baselines and in lab experiments.
Reveal the positive effect of visual-audio correspondence on video tipping.
Explain this positive effect through the mediation of viewer satisfaction (a test sketch follows this list).
Help practitioners with an automatic tool to predict the tips videos will receive.
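The mediation mechanism in the highlights (VAC → viewer satisfaction → tipping) is commonly tested with three regressions in the Baron-Kenny style, with the indirect effect estimated as the product of the VAC→VS and VS→TIP coefficients. Below is a minimal sketch on synthetic data using statsmodels; the variable names and data-generating process are assumptions for illustration, not the paper's actual specification or estimates.

```python
# Sketch of a three-step mediation test (VAC -> VS -> TIP) on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
vac = rng.normal(size=n)                          # measured VAC score
vs = 0.5 * vac + rng.normal(size=n)               # viewer satisfaction
tip = 0.4 * vs + 0.1 * vac + rng.normal(size=n)   # tipping amount

X = sm.add_constant(vac)
total = sm.OLS(tip, X).fit()    # Step 1: total effect of VAC on TIP
a_path = sm.OLS(vs, X).fit()    # Step 2: effect of VAC on the mediator VS

# Step 3: TIP on both VAC and VS; a significant VS coefficient alongside a
# shrunken VAC coefficient indicates (partial) mediation.
Xm = sm.add_constant(np.column_stack([vac, vs]))
b_path = sm.OLS(tip, Xm).fit()

indirect = a_path.params[1] * b_path.params[2]    # a * b
print("total effect:", total.params[1])
print("indirect effect (a*b):", indirect)
```

In practice the significance of the indirect effect would be assessed with a Sobel test or a bootstrap confidence interval rather than the point estimate alone.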



Published In

Information Processing and Management: An International Journal
Volume 60, Issue 3, May 2023
1647 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 May 2023

Author Tags

  1. Streaming videos
  2. Visual-audio correspondence
  3. Deep neural network
  4. Pay What You Want
  5. Consumer satisfaction
  6. Tipping behavior

Qualifiers

  • Research-article

