An End-to-End Video Steganography Network Based on a Coding Unit Mask
Figure 1. Architecture of the proposed PyraGAN.
Figure 2. The CU mask of one video frame.
Figure 3. Convolutional block attention mechanism.
Figure 4. The encoder component of PyraGAN.
Figure 5. The decoder component of PyraGAN.
Figure 6. The discriminator component of PyraGAN.
Figure 7. VVC compressed reconstructed video frame and corresponding CU mask.
Figure 8. The visual quality of a steganographic video frame of PyraGAN.
Abstract
1. Introduction
- We introduce a simple and lightweight end-to-end model architecture, combined with a GAN, to achieve steganography for VVC videos.
- While ensuring a relatively large payload capacity, visual quality is improved by exploiting the CU mask and the convolutional block attention mechanism.
2. Proposed End-to-End Video Steganography System
2.1. Architecture
2.1.1. Coding Unit Mask and Convolutional Block Attention Mechanism
2.1.2. Encoder Network Design
- (1)
- A conv-module, consisting of a convolutional layer, a Leaky ReLU [30] activation function, and Batch Normalization (BN) [31,32], is first applied to extract features from the cover frames, which are sized C × H × W (C, H, and W represent the channel, height, and width of the video frame, respectively), producing F-channel feature maps that can be utilized by later conv-modules. The convolutional layer uses a 3 × 3 kernel with a stride of 1 and padding of 1, so that the output keeps the same size as the input.
- (2)
- The message is reshaped to D × H × W and concatenated along the depth dimension with the F-channel features of the cover frames and the CU mask, which is obtained by compressing the cover frames with VVC, forming (F + D + 1)-channel feature maps, where D, H, and W represent the depth, height, and width of the message, respectively.
- (3)
- Inspired by the U-Net [33] network structure and the different-sized CU partition modes of video frames in video compression, the features are fed into three feature-extraction branches. The first branch extracts features without changing the height and width; the other two branches perform double and quadruple down-sampling on the input feature maps, so that the height and width become one-half and one-quarter of the original size. At the same time, the conv-modules in these branches increase the number of extracted feature channels to 2F and 4F, respectively. Down-sampling integrates the spatial information of adjacent areas, captures contextual details for feature extraction, and helps the hidden message find appropriate features in the cover video frames: the larger the down-sampling factor, the more spatial-domain information can be integrated, while the fine details lost to down-sampling are compensated by the additional feature channels produced by the subsequent conv-modules.
- (4)
- CBAM is introduced into each of the three branches. The CA module enables the network to learn “which” feature maps to focus on, according to the block division of the video frame content in the CU mask; the SA module then enables the network to learn “where” to focus, according to the block division at different positions of the CU mask.
- (5)
- Bicubic interpolation is used to reshape the features of the three branches back to H × W, which at the same time allows the secret information to be located accurately within the cover video frames.
- (6)
- The three branch outputs and the inputs of the pyramid structure (the feature maps of the cover video frame, the hidden message, and the CU mask) are concatenated to obtain (8F + D + 1)-channel features.
- (7)
- A conv-module containing only a convolutional layer maps them into features of size C × H × W.
- (8)
- To avoid vanishing and exploding gradients, the identity-mapping idea of ResNet [34] is adopted for the input cover video frame; that is, the cover frame is added pixel by pixel to the C × H × W features. In effect, the encoder learns the residual between the stego and cover video frames, so the secret message is hidden in the high-frequency part of the cover video frames. Because human eyes are not sensitive to the high-frequency part of video frames, hiding messages there helps preserve the visual quality of the stego video frames. A minimal code sketch of this encoder pipeline is given after this list.
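The following is a minimal PyTorch sketch of the encoder pipeline above, not the authors' released code: the module and parameter names (ConvModule, CBAM, PyraEncoder, the defaults C = 3, D = 1, F = 32, the Leaky ReLU slope, and the CBAM reduction ratio) are illustrative assumptions, and plain bicubic interpolation stands in for the unspecified down-sampling operator. CBAM follows the compact channel-then-spatial form of Woo et al. [29].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_  # aliased to avoid clashing with channel count F


class ConvModule(nn.Module):
    """Conv 3x3 (stride 1, padding 1) -> Leaky ReLU -> Batch Normalization."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)  # slope 0.2 is an assumption
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return self.bn(self.act(self.conv(x)))


class CBAM(nn.Module):
    """Compact channel attention ("which") followed by spatial attention ("where")."""
    def __init__(self, c, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(c // reduction, c))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * w.view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))


class PyraEncoder(nn.Module):
    def __init__(self, C=3, D=1, F=32):
        super().__init__()
        self.stem = ConvModule(C, F)  # step (1): F-channel cover features
        # Steps (3)-(4): three branches at scales 1, 1/2, 1/4, each with CBAM.
        self.branch1 = nn.Sequential(ConvModule(F + D + 1, F), CBAM(F))
        self.branch2 = nn.Sequential(ConvModule(F + D + 1, 2 * F), CBAM(2 * F))
        self.branch4 = nn.Sequential(ConvModule(F + D + 1, 4 * F), CBAM(4 * F))
        # Step (7): a conv-module with only a convolutional layer.
        self.head = nn.Conv2d(8 * F + D + 1, C, kernel_size=3, padding=1)

    def forward(self, cover, message, cu_mask):
        h, w = cover.shape[2:]
        # Step (2): concatenate cover features, message, and CU mask -> F+D+1 channels.
        x = torch.cat([self.stem(cover), message, cu_mask], dim=1)
        y1 = self.branch1(x)
        y2 = self.branch2(F_.interpolate(x, scale_factor=0.5, mode="bicubic"))
        y4 = self.branch4(F_.interpolate(x, scale_factor=0.25, mode="bicubic"))
        # Step (5): bicubic up-sampling back to H x W.
        y2 = F_.interpolate(y2, size=(h, w), mode="bicubic")
        y4 = F_.interpolate(y4, size=(h, w), mode="bicubic")
        # Step (6): concatenate branch outputs with the pyramid inputs -> 8F+D+1 channels.
        out = self.head(torch.cat([y1, y2, y4, x], dim=1))
        return cover + out  # step (8): residual connection to the cover frame
```

With this sketch, `PyraEncoder()(cover, message, cu_mask)` on tensors of shape (B, 3, H, W), (B, 1, H, W), and (B, 1, H, W) returns a stego frame of the cover's shape; the residual connection means the network only needs to synthesize the high-frequency perturbation.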
2.1.3. Decoder Network Design
- (1)
- A conv-module first extracts F-channel feature maps from the stego video frames.
- (2)
- Feature extraction is then carried out along three branches, in which the feature maps are kept at the original size, down-sampled to one-half, and down-sampled to one-quarter, respectively. Through conv-modules, F-channel, 2F-channel, and 4F-channel features are obtained, which integrate information from differently sized neighborhoods of the stego video frames and enable comprehensive feature learning.
- (3)
- The features of all three branches are reshaped back to H × W by up-sampling.
- (4)
- These, together with the initial F-channel feature maps, are concatenated to form a tensor of size 8F × H × W.
- (5)
- This tensor is sent to the last convolutional layer, which performs the final prediction and outputs the extracted secret message. Note that Leaky ReLU and BN are used in all but the final conv-module. A minimal code sketch of this decoder is given after this list.
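Under the same assumptions (and reusing ConvModule and the F_ alias from the encoder sketch above), a matching decoder sketch might look as follows; the channel counts follow steps (1)-(5), while the module names and defaults remain illustrative:

```python
class PyraDecoder(nn.Module):
    def __init__(self, C=3, D=1, F=32):
        super().__init__()
        self.stem = ConvModule(C, F)      # step (1): F-channel stego features
        self.branch1 = ConvModule(F, F)   # step (2): scales 1, 1/2, 1/4
        self.branch2 = ConvModule(F, 2 * F)
        self.branch4 = ConvModule(F, 4 * F)
        # Step (5): final plain conv layer (no Leaky ReLU / BN) predicts the message.
        self.head = nn.Conv2d(8 * F, D, kernel_size=3, padding=1)

    def forward(self, stego):
        h, w = stego.shape[2:]
        feats = self.stem(stego)
        y1 = self.branch1(feats)
        y2 = self.branch2(F_.interpolate(feats, scale_factor=0.5, mode="bicubic"))
        y4 = self.branch4(F_.interpolate(feats, scale_factor=0.25, mode="bicubic"))
        # Steps (3)-(4): up-sample all branches to H x W and concatenate them
        # with the initial F-channel features -> 8F x H x W.
        y2 = F_.interpolate(y2, size=(h, w), mode="bicubic")
        y4 = F_.interpolate(y4, size=(h, w), mode="bicubic")
        return self.head(torch.cat([feats, y1, y2, y4], dim=1))
```

Binary message bits would then typically be recovered by thresholding the decoder output, although the exact recovery rule is not specified in the text above.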
2.1.4. Discriminator Network Design
2.2. Training
2.2.1. Loss Function
2.2.2. Dataset
- Because video compressed by VVC must be in the YUV color space, the six datasets are spliced into six videos and converted from the RGB color space to the YUV color space.
- VVC compresses them using the intra-prediction mode, with the quantization parameter (QP) set to 37. During this process, the CU mask of the corresponding cover video frame is generated according to the CU division produced by VVC. Figure 7 shows a video frame decoded after VVC compression and the corresponding CU mask. The pixel values of the CU division edge areas are set to 255 and the interiors of the CU blocks to 0, yielding the CU mask picture shown in Figure 7b, which is a single-channel grayscale image (a minimal rasterization sketch is given after this list). The CU mask in Figure 7b roughly outlines the content of the whole video frame in Figure 7a: the background area is flat, so its CU blocks are relatively large, whereas the human body contains rich texture, so its CU blocks are relatively small and diverse.
- The decoded videos are converted from the YUV color space back to the RGB color space.
- These frames, together with their CU masks, constitute the training dataset for the whole network; the validation dataset is constructed in the same way.
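As a minimal sketch of the mask-construction step, the snippet below rasterizes a CU mask from a list of CU block rectangles. Exporting that rectangle list from the VVC codec (e.g., an instrumented VTM encoder/decoder) is assumed and not shown, and the function name and edge width are illustrative:

```python
import numpy as np

def make_cu_mask(cus, height, width, edge=1):
    """Single-channel grayscale mask: 255 along CU partition edges, 0 inside blocks."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y, w, h in cus:  # each CU as (x, y, width, height), assumed exported from VVC
        mask[y:y + h, x:x + edge] = 255          # left edge
        mask[y:y + h, x + w - edge:x + w] = 255  # right edge
        mask[y:y + edge, x:x + w] = 255          # top edge
        mask[y + h - edge:y + h, x:x + w] = 255  # bottom edge
    return mask

# Hypothetical example: one large CU for a flat background region and several
# smaller CUs for a textured region, mirroring the behavior seen in Figure 7b.
mask = make_cu_mask([(0, 0, 64, 64), (64, 0, 32, 32), (64, 32, 32, 32), (96, 0, 32, 64)],
                    height=64, width=128)
```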
2.2.3. Configuration and Optimization
3. Experimental Results
3.1. Comparison of the Number of Kernel Parameters among Different Steganography Networks
3.2. Comparison of Performance in Capacity, Imperceptibility, and Extraction Accuracy
3.2.1. Capacity of Information Hiding
3.2.2. Imperceptibility of Stego Video Frames
3.2.3. Extraction Accuracy of Hiding Secret Message
4. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Hussain, M.; Wahab, A.W.A.; Idris, Y.I.B.; Ho, A.T.S.; Jung, K. Image steganography in spatial domain: A survey. Signal Process. Image Commun. 2018, 65, 46–66. [Google Scholar] [CrossRef]
- Chanu, Y.J.; Singh, K.M.; Tuithung, T. Image steganography and steganalysis: A survey. Int. J. Comput. Appl. 2012, 52, 1–11. [Google Scholar]
- Kadhim, I.J.; Premaratne, P.; Vial, P.J.; Halloran, B. Comprehensive survey of image steganography: Techniques, Evaluations, and trends in future research. Neurocomputing 2019, 335, 299–326. [Google Scholar] [CrossRef]
- Alenizi, F.A. Robust Data Hiding in Multimedia for Authentication and Ownership Protection. PhD Thesis, University of California, Irvine, CA, USA, 2017. [Google Scholar]
- Cheddad, A.; Condell, J.; Curran, K.; Kevitt, P.M. Digital image steganography: Survey and analysis of current methods. Signal Process. 2010, 90, 727–752. [Google Scholar] [CrossRef] [Green Version]
- Pevný, T.; Filler, T.; Bas, P. Using High-Dimensional Image Models to Perform Highly Undetectable Steganography. In Proceedings of the International Workshop on Information Hiding, Calgary, AB, Canada, 28–30 June 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 161–177. [Google Scholar]
- Filler, T.; Judas, J.; Fridrich, J. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Trans. Inf. Forensics Secur. 2011, 6, 920–935. [Google Scholar] [CrossRef] [Green Version]
- Holub, V.; Fridrich, J. Designing steganographic distortion using directional filters. In Proceedings of the 2012 IEEE International Workshop on Information Forensics and Security (WIFS), Costa Adeje, Spain, 2–5 December 2012; pp. 234–239. [Google Scholar]
- Li, B.; Wang, M.; Huang, J.; Li, X. A new cost function for spatial image steganography. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 4206–4210. [Google Scholar]
- Holub, V.; Fridrich, J.; Denemark, T. Universal distortion function for steganography in an arbitrary domain. EURASIP J. Inf. Secur. 2014, 2014, 1–13. [Google Scholar] [CrossRef] [Green Version]
- Sedighi, V.; Cogranne, R.; Fridrich, J. Content-adaptive steganography by minimizing statistical detectability. IEEE Trans. Inf. Forensics Secur. 2015, 11, 221–234. [Google Scholar] [CrossRef] [Green Version]
- Sahu, A.K.; Swain, G. An optimal information hiding approach based on pixel value differencing and modulus function. Wirel. Pers. Commun. 2019, 108, 159–174. [Google Scholar] [CrossRef]
- Zhang, C.; Karjauv, A.; Benz, P.; Kweon, I.S. Towards Robust Data Hiding Against (JPEG) Compression: A Pseudo-Differentiable Deep Learning Approach. arXiv 2020, arXiv:2101.00973. [Google Scholar]
- Shi, H.; Dong, J.; Wang, W.; Qian, Y.; Zhang, X. SSGAN: Secure Steganography Based on Generative Adversarial Networks. In Proceedings of the Pacific Rim Conference on Multimedia, Harbin, China, 28–29 September 2017; Springer: Cham, Switzerland, 2017; pp. 534–544. [Google Scholar]
- Volkhonskiy, D.; Nazarov, I.; Burnaev, E. Steganographic Generative Adversarial Networks. In Proceedings of the Twelfth International Conference on Machine Vision (ICMV 2019), Amsterdam, The Netherlands, 16–18 November 2019; International Society for Optics and Photonics: Bellingham, WA, USA, 2020; Volume 11433, p. 11433M. [Google Scholar]
- Tang, W.; Li, B.; Tan, S.; Barni, M.; Huang, J. CNN-based adversarial embedding for image steganography. IEEE Trans. Inf. Forensics Secur. 2019, 14, 2074–2087. [Google Scholar] [CrossRef] [Green Version]
- Tang, W.; Tan, S.; Li, B.; Huang, J. Automatic steganographic distortion learning using a generative adversarial network. IEEE Signal Processing Lett. 2017, 24, 1547–1551. [Google Scholar] [CrossRef]
- Yang, J.; Ruan, D.; Huang, J.; Kang, X.; Shi, Y. An embedding cost learning framework using GAN. IEEE Trans. Inf. Forensics Secur. 2020, 15, 839–851. [Google Scholar] [CrossRef]
- Baluja, S. Hiding images within images. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1685–1697. [Google Scholar] [CrossRef] [PubMed]
- Hayes, J.; Danezis, G. Generating steganographic images via adversarial training. arXiv 2017, arXiv:1703.00371. [Google Scholar]
- Zhu, J.; Kaplan, R.; Johnson, J.; Fei-Fei, L. HiDDeN: Hiding Data with Deep Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 657–672. [Google Scholar]
- Zhang, K.A.; Cuesta-Infante, A.; Xu, L.; Veeramachaneni, K. SteganoGAN: High capacity image steganography with GANs. arXiv 2019, arXiv:1901.03892. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
- Zhang, R.; Dong, S.; Liu, J. Invisible steganography via generative adversarial networks. Multimed. Tools Appl. 2019, 78, 8559–8575. [Google Scholar] [CrossRef] [Green Version]
- Chan, C.K.; Cheng, L.M. Hiding data in images by simple LSB substitution. Pattern Recognit. 2004, 37, 469–474. [Google Scholar] [CrossRef]
- Goel, A.K. An Overview of Image Steganography and Steganalysis based on Least Significant Bit (LSB) Algorithm. Des. Eng. 2021, 2021, 4610–4619. [Google Scholar]
- Zhang, K.A.; Xu, L.; Cuesta-Infante, A.; Veeramachaneni, K. Robust invisible video watermarking with attention. arXiv 2019, arXiv:1909.01285. [Google Scholar]
- Luo, X.; Li, Y.; Chang, H.; Liu, C.; Milanfar, P.; Yang, F. DVMark: A Deep Multiscale Framework for Video Watermarking. arXiv 2021, arXiv:2104.12734. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv 2015, arXiv:1505.00853. [Google Scholar]
- Santurkar, S.; Tsipras, D.; Ilyas, A.; Mądry, A. How does batch normalization help optimization? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 2488–2498. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Adler, J.; Lunz, S. Banach Wasserstein GAN. arXiv 2018, arXiv:1806.06621. [Google Scholar]
- Timofte, R.; Agustsson, E.; Gu, S.; Wu, J.; Ignatov, A.; van Gool, L. DIV2K Dataset: DIVerse 2K Resolution High Quality Images as Used for the Challenges @ NTIRE (CVPR 2017 and CVPR 2018) and @ PIRM (ECCV 2018). Available online: http://data.vision.ee.ethz.ch/cvl/DIV2K (accessed on 1 November 2020).
- Microsoft COCO Dataset, Train2017 Images. Available online: http://images.cocodataset.org/zips/train2017.zip (accessed on 1 November 2020).
- Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).