Unsupervised Low-Light Video Enhancement With Spatial-Temporal Co-Attention Transformer

Published: 01 January 2023

Abstract

Existing low-light video enhancement methods are dominated by Convolutional Neural Networks (CNNs) trained in a supervised manner. Because paired dynamic low-/normal-light videos are difficult to collect in real-world scenes, these methods are usually trained on synthetic, static, and uniform-motion videos, which undermines their generalization to real-world scenes. Additionally, they typically suffer from temporal inconsistency (e.g., flickering artifacts and motion blur) when handling large-scale motions, since the local perception property of CNNs limits their ability to model long-range dependencies in both the spatial and temporal domains. To address these problems, we propose, to the best of our knowledge, the first unsupervised method for low-light video enhancement, named LightenFormer, which models long-range intra- and inter-frame dependencies with a spatial-temporal co-attention transformer to enhance brightness while maintaining temporal consistency. Specifically, an effective yet lightweight S-curve Estimation Network (SCENet) is first proposed to estimate pixel-wise S-shaped non-linear curves (S-curves) that adaptively adjust the dynamic range of an input video. Next, to model the temporal consistency of the video, we present a Spatial-Temporal Refinement Network (STRNet) to refine the enhanced video. The core module of STRNet is a novel Spatial-Temporal Co-attention Transformer (STCAT), which exploits multi-scale self- and cross-attention interactions to capture long-range correlations among frames in both the spatial and temporal domains for implicit motion estimation. To enable unsupervised training, we further propose two non-reference loss functions based on the invertibility of the S-curve and the noise independence among frames. Extensive experiments on the SDSD and LLIV-Phone datasets demonstrate that LightenFormer outperforms state-of-the-art methods.
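Note: the abstract does not give the exact parameterization of the S-curves or of the invertibility-based loss. The short Python sketch below is only a hypothetical illustration of the general idea of a pixel-wise, invertible S-shaped tone curve and the kind of round-trip check an inverse-consistency, non-reference loss could build on; the functions s_curve and s_curve_inverse and the per-pixel parameter alpha are assumptions for illustration, not the paper's method.

import numpy as np

def s_curve(x, alpha):
    # Hypothetical pixel-wise S-shaped tone curve on [0, 1]; alpha has the
    # same spatial shape as x, so every pixel gets its own curve steepness.
    return x**alpha / (x**alpha + (1.0 - x)**alpha)

def s_curve_inverse(y, alpha):
    # Closed-form inverse of the curve above; having an exact inverse is what
    # an inverse-consistency (non-reference) loss can exploit.
    r = (y / np.clip(1.0 - y, 1e-8, None))**(1.0 / alpha)
    return r / (1.0 + r)

# Toy round-trip check on a dark frame.
frame = np.random.rand(4, 4) * 0.3            # dark input in [0, 0.3)
alpha = np.full_like(frame, 0.5)              # alpha < 1 lifts dark values
enhanced = s_curve(frame, alpha)              # brightened frame
recovered = s_curve_inverse(enhanced, alpha)  # map back toward the input
print(np.abs(recovered - frame).max())        # ~0 up to numerical precision

Because the mapping has a closed-form inverse, the round-trip error can be penalized without any reference frame, which is the general principle behind an invertibility-based non-reference loss.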


Published In

IEEE Transactions on Image Processing, Volume 32, 2023, 5324 pages
Publisher: IEEE Press