NRVC: Neural Representation for Video Compression with Implicit Multiscale Fusion Network
<p>Figure 1. The framework of NRVC consists of three components: position encoding, MLP, and MSRVC. MSRVC includes FECA, five NoRVC blocks, and four upsampling blocks. Each upsampling block is composed of a 3 × 3 Conv and a PixelShuffle layer.</p>
<p>Figure 2. FECA block, where <span class="html-italic">x</span> represents the original channel weight parameters and <span class="html-italic">w</span> represents the regenerated ones.</p>
<p>Figure 3. NoRVC block, where S is the upsampling factor.</p>
<p>Figure 4. Comparison results of model pruning. Sparsity is the ratio of parameters pruned.</p>
<p>Figure 5. Comparison results of model quantization. Bit is the number of bits used to represent each parameter value.</p>
<p>Figure 6. PSNR vs. BPP on the UVG dataset.</p>
<p>Figure 7. MS-SSIM vs. BPP on the UVG dataset.</p>
<p>Figure 8. Video compression visualization. (<b>a</b>) is the original image. (<b>b</b>,<b>c</b>) are the decoded frames of NeRV and NRVC, respectively, on the UVG dataset. The red and light blue boxes show more specific details in each group of images. (<b>c</b>) shows noticeably better detail than (<b>b</b>) in the reconstruction of the rider’s face and the horse.</p>
<p>Figure 9. The visual results of ablation studies. (<b>a</b>) is the original image. The red and light blue boxes show more specific details in each group of images. As the final model, (<b>e</b>) shows more image details than (<b>b</b>–<b>d</b>) in the reconstruction of the rabbit’s ears.</p>
<p>Figure 10. Reconstruction at different resolutions. In (<b>a</b>–<b>d</b>), the left is the original video frame and the right is the frame reconstructed with NRVC. In (<b>a</b>), we use the “ShakeNDry” dataset with 800 × 720 resolution. In (<b>b</b>), we use the “YachtRide” dataset with 1280 × 1024 resolution. In (<b>c</b>), we use the “Bosphorus” dataset with 1600 × 1200 resolution. In (<b>d</b>), we use the “HoneyBee” dataset with 2128 × 2016 resolution.</p>
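The upsampling block named in the Figure 1 caption (a 3 × 3 Conv followed by a PixelShuffle layer) can be sketched as below. This is a minimal illustration in PyTorch under stated assumptions, not the paper's implementation: the channel counts and upsampling factor are hypothetical, and the block name `UpBlock` is ours.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Hypothetical sketch of a Conv + PixelShuffle upsampling block."""

    def __init__(self, in_ch: int, out_ch: int, s: int):
        super().__init__()
        # The 3x3 conv emits out_ch * s^2 channels so that PixelShuffle(s)
        # can rearrange them into an s-times larger out_ch feature map.
        self.conv = nn.Conv2d(in_ch, out_ch * s * s, kernel_size=3, padding=1)
        self.up = nn.PixelShuffle(s)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.conv(x))

# Example: a 9x16 feature map upsampled by a factor of 2.
x = torch.randn(1, 64, 9, 16)
y = UpBlock(64, 32, 2)(x)
print(y.shape)  # torch.Size([1, 32, 18, 32])
```

PixelShuffle trades channels for spatial resolution, which is why NeRV-style decoders favor it over transposed convolutions for cheap, artifact-free upsampling.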
Abstract
1. Introduction
- We propose a novel model of multiscale representations for video compression (MSRVC), which contains multiple NoRVC blocks that effectively extend feature information to improve the quality of reconstructed images.
- We discover that inserting the FECA attention module in the MSRVC network can significantly improve the feature extraction performance of the model. This effectively enhances the network’s performance without significantly increasing its complexity.
- We demonstrate that NRVC achieves a 2.16% increase in decoded peak signal-to-noise ratio (PSNR) compared to the NeRV method at similar bits per pixel (BPP). Meanwhile, NRVC also outperforms the conventional HEVC [15] in terms of PSNR.
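For reference, the two quantities used in the comparison above, PSNR and BPP, can be computed as in this minimal NumPy sketch (the sample array values and video dimensions are illustrative, not taken from the paper):

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a reconstruction."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def bpp(total_bits: float, num_frames: int, height: int, width: int) -> float:
    """Bits per pixel: bitstream size spread over every pixel of the video."""
    return total_bits / (num_frames * height * width)

# Toy example: every pixel off by 10 gray levels -> MSE = 100.
ref = np.zeros((4, 4), dtype=np.uint8)
rec = np.full((4, 4), 10, dtype=np.uint8)
print(round(psnr(ref, rec), 2))  # 10 * log10(255^2 / 100) ≈ 28.13

# A 1 Mbit model representing 100 frames of 1920x1080 video.
print(bpp(1_000_000, 100, 1080, 1920))  # ≈ 0.0048
```

For INR-based codecs such as NeRV and NRVC, the "bitstream" is the (pruned, quantized, entropy-coded) network weights themselves, so BPP is driven directly by model size.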
2. Related Work
2.1. Traditional Deep Video Compression
2.2. End-to-End Deep Video Compression
2.3. Implicit Neural Representation
3. The Proposed Approach
3.1. Overview of Model Framework
3.2. Position Encoding
3.3. MSRVC
3.3.1. FECA Block
3.3.2. NoRVC Block
3.4. Loss Function
4. Experiments
4.1. Datasets and Settings
4.2. Implementation Details
4.3. Comparative Results
4.4. Video Compression
4.5. Ablation Studies
4.6. Resolution Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Rong, Y.; Zhang, X.; Lin, J. Modified Hilbert Curve for Rectangles and Cuboids and Its Application in Entropy Coding for Image and Video Compression. Entropy 2021, 23, 836. [Google Scholar] [CrossRef] [PubMed]
- Wang, W.; Wang, J.; Chen, J. Adaptive block-based compressed video sensing based on saliency detection and side information. Entropy 2021, 23, 1184. [Google Scholar] [CrossRef] [PubMed]
- Bross, B.; Chen, J.; Ohm, J.R. Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 2021, 109, 1463–1493. [Google Scholar] [CrossRef]
- Yang, R.; Van Gool, L.; Timofte, R. OpenDVC: An open source implementation of the DVC video compression method. arXiv 2020, arXiv:2006.15862. [Google Scholar]
- Sheng, X.; Li, J.; Li, B.; Li, L.; Liu, D.; Lu, Y. Temporal context mining for learned video compression. IEEE Trans. Multimed. 2022. [Google Scholar] [CrossRef]
- Yilmaz, M.A.; Tekalp, A.M. End-to-end rate-distortion optimized learned hierarchical bi-directional video compression. IEEE Trans. Image Process. 2021, 31, 974–983. [Google Scholar] [CrossRef]
- Lu, G.; Ouyang, W.; Xu, D. DVC: An end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11006–11015. [Google Scholar]
- Chen, H.; He, B.; Wang, H. NeRV: Neural representations for videos. Adv. Neural Inf. Process. Syst. 2021, 34, 21557–21568. [Google Scholar]
- Li, Z.; Wang, M.; Pi, H. E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 267–284. [Google Scholar]
- Skorokhodov, I.; Ignatyev, S.; Elhoseiny, M. Adversarial generation of continuous images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10753–10764. [Google Scholar]
- Yu, S.; Tack, J.; Mo, S.; Kim, H.; Kim, J.; Ha, J.W.; Shin, J. Generating videos with dynamics-aware implicit generative adversarial networks. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Park, D.Y.; Lee, K.H. Arbitrary style transfer with style-attentional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5880–5888. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
- Feng, Y.; Yu, J.; Chen, F.; Ji, Y.; Wu, F.; Liu, S.; Jing, X.Y. Visible-Infrared Person Re-Identification via Cross-Modality Interaction Transformer. IEEE Trans. Multimed. 2022. [Google Scholar] [CrossRef]
- Sullivan, G.J.; Ohm, J.R.; Han, W.J. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
- Wallace, G.K. The JPEG still picture compression standard. Commun. ACM 1991, 34, 30–44. [Google Scholar] [CrossRef]
- Skodras, A.; Christopoulos, C.; Ebrahimi, T. The JPEG 2000 still image compression standard. IEEE Signal Process. Mag. 2001, 18, 36–58. [Google Scholar] [CrossRef]
- Le Gall, D. MPEG: A video compression standard for multimedia applications. Commun. ACM 1991, 34, 46–58. [Google Scholar] [CrossRef]
- Wiegand, T.; Sullivan, G.J.; Bjontegaard, G. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 560–576. [Google Scholar]
- Bross, B.; Wang, Y.K.; Ye, Y. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764. [Google Scholar] [CrossRef]
- Li, T.; Xu, M.; Tang, R. DeepQTMT: A deep learning approach for fast QTMT-based CU partition of intra-mode VVC. IEEE Trans. Image Process. 2021, 30, 5377–5390. [Google Scholar] [CrossRef]
- Liu, B.; Chen, Y.; Liu, S. Deep learning in latent space for video prediction and compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 701–710. [Google Scholar]
- Li, J.; Li, B.; Lu, Y. Deep contextual video compression. Adv. Neural Inf. Process. Syst. 2021, 34, 18114–18125. [Google Scholar]
- Sitzmann, V.; Martel, J.; Bergman, A.; Lindell, D.; Wetzstein, G. Implicit neural representations with periodic activation functions. Adv. Neural Inf. Process. Syst. 2020, 33, 7462–7473. [Google Scholar]
- Chng, S.F.; Ramasinghe, S.; Sherrah, J.; Lucey, S. Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 264–280. [Google Scholar]
- Chen, Z.; Chen, Y.; Liu, J.; Xu, X.; Goel, V.; Wang, Z.; Shi, H.; Wang, X. VideoINR: Learning video implicit neural representation for continuous space-time super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2047–2057. [Google Scholar]
- Mehta, I.; Gharbi, M.; Barnes, C.; Shechtman, E.; Ramamoorthi, R.; Chandraker, M. Modulated periodic activations for generalizable local functional representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 14214–14223. [Google Scholar]
- Zheng, M.; Yang, H.; Huang, D. Imface: A nonlinear 3d morphable face model with implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20343–20352. [Google Scholar]
- Chibane, J.; Pons-Moll, G. Neural unsigned distance fields for implicit function learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21638–21652. [Google Scholar]
- Wang, Y.; Rahmann, L.; Sorkine-Hornung, O. Geometry-consistent neural shape representation with implicit displacement fields. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Xian, W.; Huang, J.B.; Kopf, J.; Kim, C. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9421–9431. [Google Scholar]
- Yin, F.; Liu, W.; Huang, Z. Coordinates Are NOT Lonely–Codebook Prior Helps Implicit Neural 3D Representations. arXiv 2022, arXiv:2210.11170. [Google Scholar]
- Dupont, E.; Goliński, A.; Alizadeh, M. Coin: Compression with implicit neural representations. arXiv 2021, arXiv:2103.03123. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
- Tancik, M.; Srinivasan, P.; Mildenhall, B.; Fridovich-Keil, S.; Raghavan, N.; Singhal, U.; Ramamoorthi, R.; Barron, J.; Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. Adv. Neural Inf. Process. Syst. 2020, 33, 7537–7547. [Google Scholar]
- Liu, S.; Lin, T.; He, D. AdaAttN: Revisit attention mechanism in arbitrary neural style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6649–6658. [Google Scholar]
- Shi, W.; Caballero, J.; Huszár, F. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
- Le Cun, Y. Learning process in an asymmetric threshold network. In Disordered Systems and Biological Organization; Springer: Berlin/Heidelberg, Germany, 1986; pp. 233–240. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Mercat, A.; Viitanen, M.; Vanne, J. UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the ACM Multimedia Systems Conference, Istanbul, Turkey, 8–11 June 2020; pp. 297–302. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
- Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. Asilomar Conf. Signals Syst. Comput. 2003, 2, 1398–1402. [Google Scholar]
- Yang, R.; Mentzer, F.; Gool, L.V.; Timofte, R. Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6628–6637. [Google Scholar]
Methods | Parameters | PSNR (dB) | Decoding FPS | MS-SSIM |
---|---|---|---|---|
NeRV | 5.0M | 38.50 | 34.98 | 0.9910 |
NRVC | 5.2M | 39.19 | 31.60 | 0.9921 |
Epoch | NeRV (PSNR, dB) | NRVC (PSNR, dB) |
---|---|---|
300 | 33.53 | 34.10 |
600 | 36.43 | 37.00 |
900 | 37.78 | 38.35 |
1200 | 38.50 | 39.19 |
Datasets | Epoch | NeRV (PSNR, dB) | NRVC (PSNR, dB) |
---|---|---|---|
bunny | 600 | 36.43 | 37.00 |
bunny | 1200 | 38.50 | 39.19 |
UVG | 700 | 30.97 | 31.47 |
UVG | 1050 | 31.58 | 32.18 |
Methods | PSNR (dB) | MS-SSIM |
---|---|---|
NeRV | 38.50 | 0.9910 |
NeRV-FECA | 38.69 | 0.9914 |
NeRV-NoRVC | 38.87 | 0.9917 |
NRVC | 39.11 | 0.9921 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, S.; Cao, P.; Feng, Y.; Ji, Y.; Chen, J.; Xie, X.; Wu, L. NRVC: Neural Representation for Video Compression with Implicit Multiscale Fusion Network. Entropy 2023, 25, 1167. https://doi.org/10.3390/e25081167