DOI: 10.1145/3581783.3611737
research-article
Open access

Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach

Published: 27 October 2023

Abstract

The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on limited distortion types, VQA on in-the-wild videos is especially challenging as it can be affected by complicated factors, including various distortions and diverse contents. Though subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate to specific factors remains obscure, hindering VQA methods from giving more concrete quality evaluations (e.g., the sharpness of a video). To solve this problem, we collect over two million opinions on 4,543 in-the-wild videos across 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g., motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences with semantic contents and aesthetic issues (e.g., composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask subjects to choose among a positive, a negative, and a neutral option for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to analyze their strengths and weaknesses more comprehensively. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture the important quality issues observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores, achieving state-of-the-art accuracy on all dimensions and superb generalization on existing datasets. Code and data are available at https://github.com/VQAssessment/MaxVQA.
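To make the language-prompted idea concrete, the sketch below scores a single video frame against antonym text prompts with CLIP, in the spirit of the zero-shot, prompt-pairing baseline that approaches like MaxVQA build on. It is a minimal illustration, not the authors' implementation: the model/weights tag, the four example dimensions, and the prompt pairs are assumptions chosen for readability, and the actual MaxVQA additionally adapts CLIP's text and vision features and aggregates over whole videos on all 13 Maxwell dimensions.

```python
# Minimal sketch (not the MaxVQA implementation): score one frame on a few
# quality dimensions by comparing it against antonym text prompts with CLIP.
# The prompt pairs, dimension names, and model/weights tag are illustrative
# assumptions.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

# Hypothetical antonym prompt pairs for four Maxwell-style dimensions.
DIMENSIONS = {
    "sharpness":   ("a sharp photo",        "a blurry photo"),
    "noise":       ("a clean photo",        "a noisy photo"),
    "exposure":    ("a well-exposed photo", "a poorly exposed photo"),
    "composition": ("a well-composed photo", "a badly composed photo"),
}

@torch.no_grad()
def score_frame(image_path: str) -> dict:
    """Return, per dimension, the softmax probability of the positive prompt."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)

    scores = {}
    for dim, (pos, neg) in DIMENSIONS.items():
        txt = model.encode_text(tokenizer([pos, neg]).to(device))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Two cosine similarities -> logits -> probability of the positive prompt.
        logits = 100.0 * img @ txt.T
        scores[dim] = logits.softmax(dim=-1)[0, 0].item()
    return scores

# Usage: scores = score_frame("frame0001.png")
# e.g. {"sharpness": 0.71, "noise": 0.63, ...}
```

Replacing the fixed prompt pairs with learned per-dimension text embeddings and adding temporal aggregation over frames is where a trained approach such as MaxVQA departs from this zero-shot baseline.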





    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. dataset
    2. explainable
    3. video quality assessment
    4. vision-language

    Qualifiers

    • Research-article

    Funding Sources

    • RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP)

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

• Triple Alignment Strategies for Zero-shot Phrase Grounding under Weak Supervision. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4312-4321. DOI: 10.1145/3664647.3680897
• Semantic-Aware and Quality-Aware Interaction Network for Blind Video Quality Assessment. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 9970-9979. DOI: 10.1145/3664647.3680598
• A Spatial-Temporal Video Quality Assessment Method via Comprehensive HVS Simulation. IEEE Transactions on Cybernetics 54, 8 (2024), 4749-4762. DOI: 10.1109/TCYB.2023.3338615
• BVI-Artefact: An Artefact Detection Benchmark Dataset for Streamed Videos. 2024 Picture Coding Symposium (PCS), 1-5. DOI: 10.1109/PCS60826.2024.10566356
• Q-Boost: On Visual Quality Assessment Ability of Low-Level Multi-Modality Foundation Models. 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-6. DOI: 10.1109/ICMEW63481.2024.10645451
• A Dataset for Understanding Open UGC Video Datasets. 2024 IEEE International Conference on Image Processing (ICIP), 165-171. DOI: 10.1109/ICIP51287.2024.10647939
• NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 6415-6431. DOI: 10.1109/CVPRW63382.2024.00643
• AIGC-VQA: A Holistic Perception Metric for AIGC Video Quality Assessment. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 6384-6394. DOI: 10.1109/CVPRW63382.2024.00640
• NTIRE 2024 Quality Assessment of AI-Generated Content Challenge. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 6337-6362. DOI: 10.1109/CVPRW63382.2024.00637
• COVER: A Comprehensive Video Quality Evaluator. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 5799-5809. DOI: 10.1109/CVPRW63382.2024.00589
