DOI: 10.1007/978-3-031-20068-7_31

FAST-VQA: Efficient End-to-End Video Quality Assessment with Fragment Sampling

Published: 23 October 2022

Abstract

Current deep video quality assessment (VQA) methods usually incur high computational costs when evaluating high-resolution videos, which hinders them from learning better video-quality-related representations via end-to-end training. Existing approaches typically rely on naive sampling schemes such as resizing and cropping to reduce this cost, but these schemes clearly corrupt quality-related information in videos and are therefore suboptimal for learning good VQA representations. There is thus an urgent need for a quality-retaining sampling scheme for VQA. In this paper, we propose Grid Mini-patch Sampling (GMS), which captures local quality by sampling patches at their raw resolution and covers global quality with contextual relations via mini-patches sampled on uniform grids. These mini-patches are spliced and aligned temporally; we term the result fragments. We further build the Fragment Attention Network (FANet), specially designed to take fragments as inputs. Together, fragments and FANet form the FrAgment Sample Transformer for VQA (FAST-VQA), which enables efficient end-to-end deep VQA and learns effective video-quality-related representations. It improves state-of-the-art accuracy by around 10% while reducing FLOPs by 99.5% on 1080P high-resolution videos. The learned video-quality-related representations can also be transferred to smaller VQA datasets, boosting performance in those scenarios. Extensive experiments show that FAST-VQA performs well on inputs of various resolutions while retaining high efficiency. We publish our code at https://github.com/timothyhtimothy/FAST-VQA.
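The GMS scheme the abstract describes can be sketched in a few lines of code. The following is a minimal NumPy sketch, not the authors' implementation (which lives in the linked repository); the function name `grid_minipatch_sample` and the default 7×7 grid of 32×32 mini-patches are illustrative assumptions. Each frame is divided into a uniform grid, one mini-patch is cut from each cell at raw resolution, and the same spatial offsets are reused across frames so the spliced fragments stay temporally aligned.

```python
import numpy as np

def grid_minipatch_sample(video, grid=7, patch=32, rng=None):
    """Sketch of Grid Mini-patch Sampling (GMS).

    video: array of shape (T, H, W, C).
    Each frame is split into a grid x grid layout of cells; one
    patch x patch mini-patch is cut from each cell at raw resolution.
    The same per-cell offsets are shared across all frames, keeping
    the resulting fragments temporally aligned.
    Returns an array of shape (T, grid*patch, grid*patch, C).
    """
    rng = np.random.default_rng(rng)
    t, h, w, c = video.shape
    ch, cw = h // grid, w // grid  # grid-cell size
    assert ch >= patch and cw >= patch, "each cell must fit a mini-patch"
    out = np.empty((t, grid * patch, grid * patch, c), video.dtype)
    for i in range(grid):
        for j in range(grid):
            # one random offset per cell, reused for every frame
            dy = rng.integers(0, ch - patch + 1) + i * ch
            dx = rng.integers(0, cw - patch + 1) + j * cw
            out[:, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = \
                video[:, dy:dy+patch, dx:dx+patch]
    return out

# e.g. a 32-frame 1080p clip shrinks to a 224x224 fragment volume
frag = grid_minipatch_sample(np.zeros((32, 1080, 1920, 3), np.uint8))
print(frag.shape)  # (32, 224, 224, 3)
```

Because the mini-patches are kept at raw resolution, local quality cues (blur, noise, compression artifacts) survive, while the uniform grid preserves a coarse notion of global composition; this is the trade-off that plain resizing or cropping cannot offer.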




Published In

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VI
Oct 2022, 803 pages
ISBN: 978-3-031-20067-0
DOI: 10.1007/978-3-031-20068-7
Publisher: Springer-Verlag, Berlin, Heidelberg


    Author Tags

    1. Video quality assessment
    2. fragments

    Cited By

    • Using Spatial-Temporal Attention for Video Quality Evaluation. International Journal of Intelligent Systems (2024). DOI: 10.1155/2024/5514627
    • MT-VQA: A Multi-task Approach for Quality Assessment of Short-form Videos. Proceedings of the 3rd Workshop on Quality of Experience in Visual Multimedia Applications (2024), pp. 30–38. DOI: 10.1145/3689093.3689181
    • Dual-Criterion Quality Loss for Blind Image Quality Assessment. Proceedings of the 32nd ACM International Conference on Multimedia (2024), pp. 7823–7832. DOI: 10.1145/3664647.3681250
    • Subjective and Objective Quality-of-Experience Assessment for 3D Talking Heads. Proceedings of the 32nd ACM International Conference on Multimedia (2024), pp. 6033–6042. DOI: 10.1145/3664647.3680964
    • Highly Efficient No-reference 4K Video Quality Assessment with Full-Pixel Covering Sampling and Training Strategy. Proceedings of the 32nd ACM International Conference on Multimedia (2024), pp. 9913–9922. DOI: 10.1145/3664647.3680907
    • Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment. Proceedings of the 32nd ACM International Conference on Multimedia (2024), pp. 7793–7802. DOI: 10.1145/3664647.3680868
    • Semantic-Aware and Quality-Aware Interaction Network for Blind Video Quality Assessment. Proceedings of the 32nd ACM International Conference on Multimedia (2024), pp. 9970–9979. DOI: 10.1145/3664647.3680598
    • Q-Ground: Image Quality Grounding with Large Multi-modality Models. Proceedings of the 32nd ACM International Conference on Multimedia (2024), pp. 486–495. DOI: 10.1145/3664647.3680575
    • GMS-3DQA: Projection-Based Grid Mini-patch Sampling for 3D Model Quality Assessment. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1–19 (2024). DOI: 10.1145/3643817
    • Conformer Based No-Reference Quality Assessment for UGC Video. Advanced Intelligent Computing Technology and Applications (2024), pp. 464–472. DOI: 10.1007/978-981-97-5597-4_39
