
Open access

Joint Source-Channel Decoding of Polar Codes for HEVC-Based Video Streaming

Authors:

Jinzhi Lin,

Yun Zhang,

Na Li,

Hongling Jiang

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 4

Article No.: 100, Pages 1 - 23

https://doi.org/10.1145/3502208

Published: 04 March 2022


Abstract

Ultra High-Definition (UHD) and Virtual Reality (VR) video streaming over 5G networks is emerging, in which High-Efficiency Video Coding (HEVC) is used as the source coding to compress videos more efficiently and the polar code is used as the channel coding to transmit bitstreams reliably over an error-prone channel. In this article, a novel Joint Source-Channel Decoding (JSCD) scheme of polar codes for HEVC-based video streaming is presented to improve the streaming reliability and visual quality. Firstly, a Kernel Density Estimation (KDE) fitting approach is proposed to estimate the positions of erroneous channel-decoded bits. Secondly, a modified polar decoder called R-SCFlip is designed to improve the channel decoding accuracy. Finally, to combine the KDE estimator and the R-SCFlip decoder, the JSCD scheme is implemented as an iterative process. Extensive experimental results reveal that, compared to the conventional methods without JSCD, the correction ratios of erroneous data-frames are increased. On average, 1.07% and 1.11% Frame Error Ratio (FER) improvements are achieved for Additive White Gaussian Noise (AWGN) and Rayleigh fading channels, respectively. Meanwhile, the quality of the recovered videos is significantly improved. For the 2D videos, the average Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity (SSIM) gains reach 14% and 34%, respectively. For the 360\(^\circ\) videos, the average improvements in terms of Weighted-to-Spherically-uniform PSNR (WS-PSNR) and Voronoi-based Video Multimethod Assessment Fusion (VI-VMAF) reach 21% and 7%, respectively.

1 Introduction

Ultra High-Definition (UHD) and Virtual Reality (VR) videos are becoming popular since they are capable of providing more realistic visual experiences. Due to the huge amount of data volume, these videos are usually compressed effectively with highly efficient source coding, such as H.265/High-Efficiency Video Coding (HEVC), which doubles the compression ratio as compared to the H.264/Advanced Video Coding (AVC). Even so, wide bandwidth is required to stream the compressed videos. Thanks to the development of network transmission, the 5th Generation (5G) mobile transmission technologies significantly improve the transmission bandwidth and lower the delay. However, there is still a big gap between the bandwidth provided by the existing communication networks and the bandwidth required by UHD and VR video streaming.

Due to channel noise, signal interference, multi-path fading, and the like, bitstreams transmitted over wireless channels are error-prone. Channel coding or Forward Error Correction (FEC) [50] is applied for error control [20]. The polar code is one of the channel coding schemes adopted in the 5G standard [1], and it will probably be further utilized in the 5G data plane as it is more suitable for long codes and more effective for massive data coding. Conventionally, source and channel en/decoding are optimized separately, as expounded extensively by Shannon [45]. However, Shannon’s separation theorems rely on some assumptions, including infinite code length, no delay feedback, and infinite feedback capacity, which may not be guaranteed in a practical system. For practical applications, separate source-channel en/decoding limits system performance [12]. The redundancies hidden in the source coding provide extra extrinsic information for channel coding, which can be utilized for improving channel decoding accuracy. Methods based on this idea are Joint Source-Channel Decoding (JSCD) approaches.

For the emerging VR broadcasting application, a 360\(^\circ\) video that allows a full 360\(^\circ\) high-quality viewing experience requires a data rate of about 400 Mb/s [53]. The motion-to-photon delay should be below 15–20 milliseconds [42]. Even over 5G networks, the wide bandwidth and low latency requirements are challenging for VR video streaming. To tackle these challenges, viewport-specific [51] and tile-based [37] streaming schemes were developed. Another challenging issue is handling error-prone bitstreams to improve data transmission efficiency. Automatic repeat requests could be employed for re-transmitting erroneous bitstreams. However, they consume extra communication resources. The error concealment/error resilience [24] technologies provided by AVC and HEVC decoders reduce the negative impact of errors. However, they bring extra computational complexity, and the quality gain from recovery is still limited. Many studies show that, at the cost of a small amount of additional computation, a JSCD method can improve data transmission efficiency without extra network bandwidth. Since bandwidth is the bottleneck of a network and is hard to expand, the JSCD method suits a practical video streaming system. In this article, we propose a novel JSCD scheme of polar codes for HEVC-based video streaming, which improves the accuracy of polar channel decoding by exploiting the HEVC bitstream syntax. The unique challenges of the proposed scheme include how to find the redundant source information in the HEVC coding standard and how to utilize it to improve the polar channel decoding performance. The proposed method is specifically suitable for video playback end-devices with rich computation capabilities that demand high-definition videos recovered from remote streaming servers. To focus on JSCD-related processes, video encryption and adaptive video streaming technologies are not considered in this article. The contributions of this article are given as follows:

• A Kernel Density Estimation (KDE) fitting approach for estimating positions of error channel decoded bits from HEVC bitstreams is proposed, based on the analysis of HEVC syntax error types and statistics of the error positions.

• An algorithm called R-SCFlip for decoding data-frames¹ with polar codes is designed, which improves the decoding accuracy compared to the original successive cancellation flip decoding algorithm.

• An iterative JSCD scheme is proposed by combining the KDE error bit range estimator and the R-SCFlip decoder, which improves reliability and visual quality of the video streaming.

The rest of this article is organized as follows. Firstly, Section 2 reviews the related works. Section 3 briefly introduces the fundamentals of polar codes and the corresponding successive cancellation decoder. Then, the proposed JSCD scheme is presented in Section 4. Section 5 presents the experimental results and analysis. Finally, Section 6 draws the conclusion.

2 Related Works

2.1 General JSCD, JSCC, and Cross-layer Schemes

There are a number of JSCD schemes for sensor data and controlling signal transmitting. Dumitrescu et al. [14] dealt with Markov sequence sources. JSCD approaches were proposed for accelerating both the Maximum A Posteriori (MAP) sequence decoding and the soft output Max-Log-MAP decoding when the Markov sources satisfied Monge property. Two correlated sensor data encoded by systematic LDPCs independently were considered in [28]. A JSCD decoder composed of two LDPC decoders was proposed, where the encoded bits at the output of each LDPC decoder were used as the a priori information at the other decoder. Abdessalem et al. [2] presented similar ideas by considering relay cooperative communications. Methods of joint channel decoding and state estimation for cyber-physical systems have been studied in [19].

In contrast to JSCD, Joint Source-Channel Coding (JSCC) combines source encoding with channel FEC codes. JSCC schemes are usually required for designing the corresponding JSCD approaches. For multimedia broadcasting, multilevel coded modulation was studied in [48] to provide an Unequal Error Protection (UEP) JSCC approach. Liu et al. [35] formulated error-resilient VR video transmission as a JSCC optimization problem and solved it by devising a heuristic algorithm. In [5], a hybrid JSCC scheme with binary and non-binary turbo codes was introduced, which utilized techniques including JSCD, regression-based extrinsic information scaling, and prioritized 16-quadrature amplitude modulation. In fact, Hybrid Digital Analog (HDA) coding can be viewed as another form of JSCC. Recently, some works using HDA coding for UHD and 3D video transmission have emerged [33, 36]. Deep learning based approaches have been investigated to improve JSCC and JSCD [7, 31, 34], which brought interesting inspirations and performance improvements.

In a cross-layer perspective, the basic principles and design highlights of JSCD considering coupling multiple protocol layers for video communicated in a wireless network were discussed in [13]. Qiwang et al. reviewed integrated physical-layer and cross-layer communication coding systems for optimization of component elements to achieve green communications [9]. A cross-layer optimization of caching and delivery control for minimizing the overall video delivery time in two-hop relaying networks was investigated in [49]. Cross-layer optimization framework for maximizing the total utility of 360\(^\circ\) videos delivering in the multi-cast systems was proposed in [40]. Cross-layer design techniques in wireless multimedia sensor networks for energy conservation were studied in [29]. Zhu et al. proposed a joint layered approach to achieve reliable and secure JPEG-2000 image streaming over mobile networks [52]. More cross-layer design methods dedicated in video streaming were presented in [10, 17].

2.2 Video Streaming Dedicated JSCD Schemes

Different kinds of H.264/AVC based JSCD schemes have been proposed in [22, 30, 32, 38, 47]. In general, the last step of the video source coding is the arithmetic coding, such as variable length coding and Context-Adaptive Binary Arithmetic Coding (CABAC) for H.264. It is straightforward to practice JSCD by tight coupling arithmetic coding with channel coding. Based on MAP estimation, Wang et al. [47] proposed a JSCD method for variable length coding and convolutional encoded 1D Markov source, and applied it to decode motion vectors of H.264 coded video streams. In [32], an iterative JSCD scheme for videos encoded by H.264 with CABAC entropy coding and a rate-1/2 Recursive Systematic Convolutional code was proposed. In this scheme, slice candidates with different likelihoods were generated and checked for source semantic validation. The bitstreams, corresponding to invalid slice candidates, were modified and fed back to the soft output viterbi algorithm decoder to do channel decoding recurrently. This scheme was extended in [38] by introducing a virtual checking method to accelerate the semantic verification process. Hanzo et al. [22] presented various Short Block Codes (SBC) based iterative JSCD system design for near-capacity H.264 source coded videos, and proposed a redundant source mapping scheme that can improve the convergence behavior of SBCs. In [30], a sequential MAP JSCD scheme was proposed. The scheme utilized forbidden symbols in arithmetic coding, semantics and syntax validation checking and a priori probability estimation for syntax element sequences to implement the core idea of JSCD. The above mentioned works focus on JSCD approaches dedicated to H.264. As HEVC is designed to focus on increasing video resolution and parallel processing, the semantics and syntax of HEVC are somehow different from its predecessor H.264. The existing methods of detecting error bits in bitstreams for H.264 cannot be applied to HEVC directly.

HEVC related JSCD schemes can be found in [25, 41]. Perera et al. [41] designed a cross-layer turbo decoder that makes use of HEVC source redundancy by exploiting slice header semantics and field patterns. Specifically, a Slice-header-field Parsing and Correction (SPC) module for checking the HEVC syntax was required. To implement SPC, algorithms for access unit boundary identification and for analyzing and amending Network Abstraction Layer (NAL) unit header fields were proposed in the article. As a boundary identification algorithm for access units was used, the reported error bit positions were always accurate. The disadvantage of this method is that it can only be applied to a small number of semantic and syntax error types. In [25], without being constrained to any specific video coding standard, Huo et al. modeled the correlation inherent in compressed video signals as a first-order Markov process and proposed a spatio-temporal iterative JSCD system.

2.3 Polar Code Related JSCD Schemes

Most existing JSCD schemes were developed based on the traditional channel codes, such as turbo codes and Low-Density Parity-Check (LDPC) codes. However, only a few JSCD related studies were on the latest polar codes adopted by the 5G communication standard. A polar code is constructed by a channel polarization transform process which is completely different from the LDPC and turbo codes. The existing JSCD schemes with LDPC or turbo codes are not suitable for polar codes. Jin et al. [27] proposed a distributed JSCD scheme for tackling general correlated sources encoded by systematic polar codes independently, and proposed a joint source-channel polarization scheme by using a quasi-uniform systematic polar code [26]. A JSCD method for language-based sources with polar encoding was exploited in [46]. Source redundancy was utilized by judging the validity of the decoded words in the decoded sequence with the help of a dictionary. The scenario that HEVC and polar codes are involved in source and channel coding respectively is not tackled in the above mentioned works. As for JSCC related polar codes, Jin et al. [26] considered joint source and channel polarization and Hadi et al. [21] utilized the channel polarisation property to achieve UEP.

In summary, JSCD schemes for video streaming differ from those for conventional data streaming, since videos have a large data volume and are lossily compressed. It is necessary to develop JSCD by considering the properties of both the HEVC bitstream and the polar code.

3 Preliminaries of Polar Codes

Let a vector \([{x_1},\ldots ,{x_N}]\) be denoted by \(x_1^N\). To transmit \(K\) information bits over the communication channel, a polar code of length \(N=2^n, N\gt K\) with rate \(R=K/N\) separates the \(N\) synthetic polarized channels into \(K\) reliable and \(N-K\) unreliable ones, and encodes the information bits and frozen bits on them, respectively. Denote by \(\mathcal {I}\) the set containing the indices of the \(K\) reliable synthetic channels. The encoding process can be described as

\begin{equation} \boldsymbol {X} = \boldsymbol {U} \cdot {G^{ \otimes n}}, \end{equation}

(1)

where \(\boldsymbol {U} = u_1^N = [{u_1},\ldots ,{u_N}]\) of length \(N\) is the input data vector, containing \(K\) information bits at positions \(i \in \mathcal {I}\) and \(N-K\) frozen bits that are set to zero. \(\boldsymbol {X} = x_1^N = [{x_1},\ldots ,{x_N}]\) is the encoded vector. \({G^{ \otimes n}}\) is the generator matrix, which is the \(n\)-th Kronecker power of the polarization matrix \(G = \left[\!\!\begin{array}{*{20}{c}} 1&0\\ 1&1 \end{array} \!\!\right]\).
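Equation (1) can be illustrated with a minimal sketch. The function names, the 0-based indexing, and the toy information set below are assumptions made for illustration only; a real code would choose \(\mathcal {I}\) from channel reliabilities, which is not modeled here.

```python
import numpy as np

def generator_matrix(n):
    """n-th Kronecker power of the 2x2 polarization matrix G."""
    G = np.array([[1, 0], [1, 1]], dtype=int)
    Gn = np.array([[1]], dtype=int)
    for _ in range(n):
        Gn = np.kron(Gn, G)
    return Gn

def polar_encode(info_bits, info_set, n):
    """Place info bits on the indices in info_set (0-based, an assumption),
    set the remaining frozen bits to zero, and compute X = U * G^{(x)n} mod 2."""
    N = 1 << n
    u = np.zeros(N, dtype=int)
    u[sorted(info_set)] = info_bits
    return (u @ generator_matrix(n)) % 2

# Toy example: N = 4, K = 2, hypothetical information set {1, 3}.
x = polar_encode([1, 1], {1, 3}, n=2)
```

Building the full \(N \times N\) matrix is only practical for small \(N\); it is used here because it mirrors the equation directly.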

The original polar code decoding algorithm is the Successive Cancellation (SC) decoder proposed by Arikan [4]. Denote the received signal from the channel as \(\boldsymbol {Y} = y_1^N = [{y_1},\ldots ,{y_N}]\), which is the noisy version of \(\boldsymbol {X}\). The SC decoder tries to recover \(u_1^N\) by successively decoding \(u_i\) in ascending order of index \(i\). For \(u_i\), its Log-Likelihood Ratio (LLR) is computed by

\begin{equation} L({u_i}) = \log \left({\frac{{P({u_i} = 0|\boldsymbol {Y},u_1^{i - 1})}}{{P({u_i} = 1|\boldsymbol {Y},u_1^{i - 1})}}} \right). \end{equation}

(2)

According to the sign of its LLR value and whether it is a frozen bit, \(u_i\) is decoded as \({\hat{u}_i}\)

\begin{equation} {\hat{u}_i} = \begin{cases} 1, & \text{if } i \in \mathcal {I} \text{ and } L({u_i}) \lt 0, \\ 0, & \text{if } i \notin \mathcal {I} \text{ or } L({u_i}) \geqslant 0. \end{cases} \end{equation}

(3)
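The SC recursion behind Equations (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder used in the article: it assumes the natural-order encoding of Equation (1) without bit reversal, channel LLRs defined as \(\log (P(y|x=0)/P(y|x=1))\), and the common min-sum approximation for the check-node update. The names `f_minsum` and `sc_decode` are hypothetical.

```python
import numpy as np

def f_minsum(a, b):
    """LLR of the XOR of two bits from their LLRs (min-sum approximation)."""
    return np.sign(a) * np.sign(b) * np.minimum(np.abs(a), np.abs(b))

def sc_decode(llr, is_info):
    """Recursive SC decoding for x = u * G^{(x)n} in natural order.

    llr     : channel LLRs, length N = 2^n
    is_info : boolean mask, True where the index belongs to the information set
    Returns (u_hat, x_hat, L), where L[i] is the decision LLR L(u_i).
    """
    llr = np.asarray(llr, dtype=float)
    is_info = np.asarray(is_info, dtype=bool)
    N = len(llr)
    if N == 1:
        # Eq. (3): frozen bits are 0, information bits follow the LLR sign.
        bit = 1 if (is_info[0] and llr[0] < 0) else 0
        return np.array([bit]), np.array([bit]), llr.copy()
    half = N // 2
    top, bottom = llr[:half], llr[half:]
    # The first half of u is observed through the XOR of the two code halves.
    u_a, x_a, L_a = sc_decode(f_minsum(top, bottom), is_info[:half])
    # The second half combines both observations, given the re-encoded x_a.
    u_b, x_b, L_b = sc_decode(bottom + (1 - 2 * x_a) * top, is_info[half:])
    return (np.concatenate([u_a, u_b]),
            np.concatenate([x_a ^ x_b, x_b]),
            np.concatenate([L_a, L_b]))

# Noiseless toy example: codeword [0, 0, 1, 1] mapped to LLRs of +/-10.
u_hat, x_hat, L = sc_decode([10, 10, -10, -10], [False, True, False, True])
```

Because each decision feeds the LLRs of all later bits through `x_a`, a single wrong decision can disturb everything decoded after it.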

4 Proposed Joint Source-Channel Decoding Scheme

4.1 System Model

Figure 1 depicts a video streaming system consisting of source video compression, channel encoding, transmission, channel decoding, and video decoding. The input video sequence is first fed into an HEVC encoder, which produces a sequence of syntax elements \(v_1^{{\ell _v}} = [{v_1},\ldots ,{v_{{\ell _v}}}]\) of length \({\ell _v}\), where \({v_i}\) is a syntax element that is placed into a NAL unit. Then, these syntax elements are mapped into a binary sequence \(s_1^{{\ell _s}} = [{s_1},\ldots ,{s_{{\ell _s}}}]\) of length \({\ell _s}\), where \({s_i}\) is either 0 or 1. Next, the entropy encoder CABAC compresses \(s_1^{{\ell _s}}\) to a bit sequence \(u_1^{{\ell _u}} = [{u_1},\ldots ,{u_{{\ell _u}}}]\) of length \({\ell _u}\), where \(u_i\) is also either 0 or 1. Before transmission, the bit sequence \(u_1^{{\ell _u}}\) is encoded with a polar code channel encoder, and the obtained encoded vector \(x_1^{{\ell _x}} = [{x_1},\ldots ,{x_{{\ell _x}}}]\) of length \({\ell _x}\) is modulated and sent through a noisy wireless channel. At the receiver side, the received signal vector \(y_1^{{\ell _y}} = [{y_1},\ldots ,{y_{{\ell _y}}}]\) is the result of \(x_1^{{\ell _x}}\) corrupted by noise \(\boldsymbol {n}\). The goal of the whole decoding process at the receiver is to recover \(u_1^{{\ell _u}}\), \(s_1^{{\ell _s}}\) and \(v_1^{{\ell _v}}\) as accurately as possible; their estimates are denoted as \(\hat{u}_1^{{\ell _u}}\), \(\hat{s}_1^{{\ell _s}}\) and \(\hat{v}_1^{{\ell _v}}\), respectively.

Fig. 1.

The goal of the polar decoder is to estimate \(u_1^{{\ell _u}}\) by maximizing a posteriori probability as

\begin{equation} \hat{u}_1^{{\ell _u}} = \mathop {\arg \max }\limits _{\tilde{u}_1^{{\ell _u}} \in {\mathcal {B}_{{\ell _u}}}} P\left(\tilde{u}_1^{{\ell _u}}|y_1^{{\ell _y}}\right), \end{equation}

(4)

where \({\mathcal {B}_{{\ell _u}}}\) is the set of all the bit sequences with a length of \({\ell _u}\). Using the Bayes’ rule, it can be written as

\begin{equation} \hat{u}_1^{{\ell _u}} = \mathop {\arg \max }\limits _{\tilde{u}_1^{{\ell _u}} \in {\mathcal {B}_{{\ell _u}}}} \frac{{P\left(y_1^{{\ell _y}}|\tilde{u}_1^{{\ell _u}}\right)P\left(\tilde{u}_1^{{\ell _u}}\right)}}{{P\left(y_1^{{\ell _y}}\right)}}. \end{equation}

(5)

There are three terms in the right side of the equation: channel transition probability \(P(y_1^{{\ell _y}}|\tilde{u}_1^{{\ell _u}})\) depending on the physical characteristics of the communication channel, modulation method and polar encoding; a priori probability \(P(\tilde{u}_1^{{\ell _u}})\) of bit sequence \(u_1^{{\ell _u}}\) determined by the distribution of binary sequence \(s_1^{{\ell _s}}\); and the denominator \(P(y_1^{{\ell _y}}),\) which is constant for all realizations of \(\tilde{u}_1^{{\ell _u}}\) and thus is insignificant in this maximization.

For source-channel separation decoding methodologies, \(P(\tilde{u}_1^{{\ell _u}})\) is unknown to the channel decoder, thus the bits of \(\tilde{u}_1^{{\ell _u}}\) are assumed to be independent and identically distributed, and the Bernoulli(0.5) distribution is usually adopted [18]. However, for an HEVC compressed video source, \(\tilde{u}_1^{{\ell _u}}\) corresponds to \(s_1^{{\ell _s}}\), which is constrained by the HEVC syntax elements \(v_1^{{\ell _v}}\). Not all \(\tilde{u}_1^{{\ell _u}} \in {\mathcal {B}_{{\ell _u}}}\) are valid bit sequences satisfying the HEVC semantic and syntax constraints. Information hidden in \(P(\tilde{u}_1^{{\ell _u}})\) can be utilized to improve the MAP decoding performance. However, obtaining a specific probability distribution of \(\tilde{u}_1^{{\ell _u}}\) by observing \(v_1^{{\ell _v}}(\tilde{u}_1^{{\ell _u}})\) is impractical. Therefore, implementing JSCD via MAP decoding is not feasible. In this article, to make use of the extrinsic information from the syntax elements, a model to estimate the error bit location range based on HEVC semantic and syntax verification (shown as the green blocks in Figure 1) and an improved SCFlip polar decoder to utilize the estimated ranges (shown as the yellow block in Figure 1) are designed.
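As a worked step on Equation (5), and only as an illustration: under the i.i.d. Bernoulli(0.5) assumption every candidate has the same prior, so the MAP rule reduces to maximum-likelihood decoding. If, as an idealization that is not part of the proposed scheme, the prior were uniform over a hypothetical set \(\mathcal {V} \subseteq {\mathcal {B}_{{\ell _u}}}\) of sequences satisfying the HEVC constraints, the search would shrink to that set:

\begin{equation*} \begin{aligned} P\left(\tilde{u}_1^{{\ell _u}}\right) = 2^{-{\ell _u}} \;&\Rightarrow\; \hat{u}_1^{{\ell _u}} = \mathop {\arg \max }\limits _{\tilde{u}_1^{{\ell _u}} \in {\mathcal {B}_{{\ell _u}}}} P\left(y_1^{{\ell _y}}|\tilde{u}_1^{{\ell _u}}\right), \\ P\left(\tilde{u}_1^{{\ell _u}}\right) = \frac{\mathbb {1}\left[\tilde{u}_1^{{\ell _u}} \in \mathcal {V}\right]}{|\mathcal {V}|} \;&\Rightarrow\; \hat{u}_1^{{\ell _u}} = \mathop {\arg \max }\limits _{\tilde{u}_1^{{\ell _u}} \in \mathcal {V}} P\left(y_1^{{\ell _y}}|\tilde{u}_1^{{\ell _u}}\right). \end{aligned} \end{equation*}

The gap between the two search sets is the source redundancy that a JSCD scheme tries to exploit.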

4.2 Error Bit Location Range Estimation

In HEVC, encoded bitstreams are organized in a bunch of NAL units separated (or synchronized) by start codes. Each NAL unit [44] consists of a header and the associated payload data called Raw Byte Sequence Payload (RBSP), where some logical syntax elements are presented together. The six bits NALType field in the NAL header specifies the role of the unit and determines the format of the unit’s RBSP. The syntax elements constructing all types of RBSPs are well defined in the HEVC standard, along with some semantic constraints if necessary.

In a video streaming system, NAL units are further channel encoded into data-frames and transmitted through noisy communication channels. At the decoder side, the HEVC decoder may find that some syntax elements are not compliant with the standard due to error bits. In this case, syntax errors are reported. Moreover, the decoded values of some syntax elements may violate the semantics in the standard, e.g., values that are out of the valid range, illegal, or undefined, in which case semantic errors may be reported as well.

Video decoding errors can propagate and accumulate due to intra/inter predictive coding. Information in the headers of NAL units is relatively important. Error bits inside them can prevent the following RBSP from being decoded properly, or even corrupt the decoding of units that refer to them. Most of a bitstream consists of data slices encoded by CABAC [44], which are extremely sensitive to bit errors due to context correlation and changing probabilities, causing desynchronization and value deviation problems. When the decoder reports semantic or syntax errors at some specific positions while decoding a bitstream, these positions are not necessarily the exact locations where the actual error bits occur. Instead, the error bits may be located nearby. To overcome this problem, a statistical approach for error bit location range estimation is proposed.

In the reference software HEVC test Model (HM) [39], there are many assertion statements associated with all kinds of HEVC semantic and syntax validation. Generally, an assertion failure is considered as a report of a semantic and syntax error. Table 1 lists the most common semantic and syntax errors that the HM software may report in decoding bitstreams. For example, because HEVC standard defines the first bit in a NAL unit’s header to be zero, if the HM software encounters a non-zero value in that, it would report a forbidden_zero_bit \(!=\) 0 semantic error.

Table 1.

Error type	In head	In RBSP	Sem./Syn.
forbidden_zero_bit \(!=\) 0	\(\checkmark\)		Semantic
PPS \(==\) null	\(\checkmark\)		Semantic
invalid sliceQP	\(\checkmark\)		Semantic
numAllocatedSlice \(!=\) SliceIdx	\(\checkmark\)		Syntax
invalid nalUnitType	\(\checkmark\)		Semantic
bits not byte aligned	\(\checkmark\)	\(\checkmark\)	Syntax
fifo_idx \(\geqslant\) fifo.size	\(\checkmark\)	\(\checkmark\)	Syntax
end_of_slice_segment_flag \(!=\) 1		\(\checkmark\)	Semantic
trailingNullByte \(!=\) 0		\(\checkmark\)	Semantic
trailing_zero_8bits \(!=\) 0		\(\checkmark\)	Semantic
rbsp_stop_one_bit \(!=\) 1		\(\checkmark\)	Semantic

Table 1. The Most Common Semantic and Syntax Errors Reported by the HM Software

We conduct the following experiment. A random bit of a bitstream is deliberately flipped, and the bitstream is then fed into the HM software for HEVC decoding. An assertion failure may be reported when the flipped bit causes a semantic or syntax error, and the position of the bit being decoded at that moment is recorded. Denote the recorded bit position and the actual flipped bit position as \(p^{\prime }\) and \(p\), respectively. The difference between \(p^{\prime }\) and \(p\) is called the Causality Position Deviation (CPD), denoted as:

\begin{equation} {\Delta _p} = p^{\prime } - p. \end{equation}

(6)
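The bit-flipping experiment and Equation (6) can be sketched as below. The decoder is passed in as a callable, `decode_fn`, which is a hypothetical stand-in for a wrapper around the HM software: it is assumed to return `None` when no assertion fails, or a pair `(error_type, p_reported)` otherwise. The MSB-first bit numbering is also an assumption.

```python
import random

def flip_bit(data, p):
    """Return a copy of `data` with bit p flipped (bit 0 = MSB of byte 0)."""
    out = bytearray(data)
    out[p // 8] ^= 0x80 >> (p % 8)
    return bytes(out)

def collect_cpds(bitstream, decode_fn, trials, seed=0):
    """Flip one random bit per trial and record CPD = p' - p per error type.

    decode_fn is a hypothetical decoder wrapper (see the lead-in); it is
    not an interface of the HM software itself.
    """
    rng = random.Random(seed)
    samples = {}
    for _ in range(trials):
        p = rng.randrange(len(bitstream) * 8)      # actual flipped position
        report = decode_fn(flip_bit(bitstream, p))
        if report is None:                         # flip went undetected
            continue
        error_type, p_reported = report            # p' in Eq. (6)
        samples.setdefault(error_type, []).append(p_reported - p)
    return samples
```

The per-type sample lists returned here are the kind of input a density fit over CPDs would consume.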

For an ideal error bit location estimator, the CPDs always fall within the predicted ranges. This can be achieved trivially by predicting a larger range. However, for a practical estimator, the predicted ranges should be as narrow as possible. Small slice sizes in HEVC encoding are associated with narrow predicted ranges. However, it is inefficient to use a very small slice size, since it incurs more slice header overhead. A slice size of 100 bytes is used, which is a trade-off obtained from statistical experiments. We collected the statistics of the CPDs of all kinds of assertion failures. Figure 2 shows the empirical Cumulative Distribution Function (CDF) of CPDs for the syntax error “fifo_idx \(\geqslant\) fifo.size” when bitstreams are encoded with the slice size set to 100 bytes. According to which specific parameter is being decoded when the error is reported, the errors are further divided into several categories, such as decodeSplitFlag, decodeCoeff, decodePredInfo, and others. In general, most of the CPDs are within the range \([-600, 2000]\), and the ranges vary with categories.

Fig. 2.

Actually, the distributions of CPDs can be accurately fitted by the KDE method [43]:

\begin{equation} {{\hat{f}}_h}(x) = \frac{1}{{nh}}\sum \limits _{i = 1}^n {K\left({\frac{{x - {x_i}}}{h}} \right),} \end{equation}

(7)

where \(x_i, i=1,\ldots ,n\) are the sampled data, \(n\) is the sample size, \(K(\cdot)\) is the kernel smoothing function, for which the Epanechnikov [15] function \(K(u) = \frac{3}{4}({1 - {u^2}})\) for \(|u| \leqslant 1\) (and zero otherwise) is used in this article, and \(h\) is the bandwidth. Using experimental data, \({\hat{f}_h}(x)\) for the CPDs of different semantic and syntax errors can be obtained. The estimator predicts the error bit location ranges as

\begin{equation} \begin{aligned} {p_{LB}} &= p^{\prime } + \hat{F}_h^{ - 1}({\sigma _{LB}}), \\ {p_{UB}} &= p^{\prime } + \hat{F}_h^{ - 1}({\sigma _{UB}}), \\ {\hat{F}_h}(x) &= \frac{1}{n}\sum \limits _{i = 1}^n {G\left({\frac{{x - {x_i}}}{h}} \right)}, \\ G(x) &= \int _{ - \infty }^x {K(t)\,dt}, \end{aligned} \end{equation}

(8)

where \({\hat{F}_h}(x)\) is the CDF corresponding to \({\hat{f}_h}(x)\), and \(\sigma _{LB}\) and \(\sigma _{UB}\) are configurable parameters that determine the probability that the error bits are located in the predicted range. To obtain a probability of 0.95, 0.025 and 0.975 are used in this article, respectively. The predicted range is denoted as \([{p_{LB}},{p_{UB}}]\), indicating the lower and upper bounds of the error bit position.
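Equations (7) and (8) can be sketched numerically as follows. This is a minimal illustration, not the fitting procedure used in the article: the function names are hypothetical, the inverse CDF is found by plain bisection, the bandwidth \(h\) is taken as given, and the quantiles are added to \(p^{\prime }\) exactly as Equation (8) is written.

```python
import numpy as np

def epanechnikov_cdf(t):
    """G(t): integral of K(u) = 3/4 (1 - u^2) on [-1, 1], clipped outside."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * (t - t**3 / 3.0)

def kde_cdf(x, samples, h):
    """F_h(x) from Eq. (8): average of kernel CDFs centred on the samples."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(epanechnikov_cdf((x - samples) / h)))

def kde_quantile(q, samples, h, tol=1e-6):
    """Inverse of F_h by bisection (F_h is non-decreasing)."""
    samples = np.asarray(samples, dtype=float)
    lo, hi = samples.min() - h, samples.max() + h   # F_h = 0 and 1 here
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def predict_range(p_reported, samples, h, sigma_lb=0.025, sigma_ub=0.975):
    """[p_LB, p_UB] from Eq. (8), rounded outward to whole bit positions."""
    lb = p_reported + kde_quantile(sigma_lb, samples, h)
    ub = p_reported + kde_quantile(sigma_ub, samples, h)
    return int(np.floor(lb)), int(np.ceil(ub))
```

Rounding outward is a design choice of this sketch so that the integer range never excludes a position the continuous range covers.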

To implement the KDE model-based error bit location estimator, a large number of CPD samples are collected through experiments. By applying KDE fitting to these samples, the bandwidths and distributions for different semantic and syntax errors are obtained. Next, the estimated ranges are calculated according to Equation (8). Table 2 gives the adopted range predictions and the corresponding accuracy for different semantic and syntax errors. The predicted ranges of items with \(h\) indicated by ‘-’ are not configured by KDE fitting, as there are not enough samples during the experiments due to their rare occurring, and their \([{p_{LB}},{p_{UB}}]\) values are simply set by the minimum and maximum of their few corresponding samples respectively. Note that the error bit location range estimator is previously established before running the JSCD process. There is no need to run KDE fitting for each specific video during conducting JSCD.

Table 2.

Error type	\(h\)	\([p_{LB},p_{UB}]\)	Acc. (%)
forbidden_zero_bit \(!=\) 0	-	\([p^{\prime },p^{\prime }]\)	100
PPS \(==\) null	8.32	\([p^{\prime }-32,p^{\prime }+11]\)	100
invalid sliceQP	-	\([p^{\prime }-34,p^{\prime }-2]\)	100
numAllocatedSlice \(!=\) SliceIdx	1.11	\([p^{\prime }-39,p^{\prime }-25]\)	98
invalid nalUnitType	-	\([p^{\prime }-27,p^{\prime }-27]\)	100
bits not byte aligned	-	\([p^{\prime }-56,p^{\prime }-2]\)	100
fifo_idx \(\geqslant\) fifo.size (decodeSplitFlag)	42.84	\([p^{\prime }-606,p^{\prime }+1539]\)	95
fifo_idx \(\geqslant\) fifo.size (decodeSkipFlag)	36.78	\([p^{\prime }-652,p^{\prime }+555]\)	94
fifo_idx \(\geqslant\) fifo.size (decodePredMode)	53.13	\([p^{\prime }-504,p^{\prime }+598]\)	93
fifo_idx \(\geqslant\) fifo.size (decodePartSize)	39.79	\([p^{\prime }-624,p^{\prime }+848]\)	88
fifo_idx \(\geqslant\) fifo.size (decodeCoeff)	37.32	\([p^{\prime }-366,p^{\prime }+2012]\)	89
fifo_idx \(\geqslant\) fifo.size (decodeMergeIndex)	48.09	\([p^{\prime }-715,p^{\prime }+514]\)	94
fifo_idx \(\geqslant\) fifo.size (decodePredInfo)	32.15	\([p^{\prime }-549,p^{\prime }+1284]\)	89
fifo_idx \(\geqslant\) fifo.size (others)	2.93	\([p^{\prime }-75,p^{\prime }+8]\)	100
end_of_slice_segment_flag \(!=\) 1	16.82	\([p^{\prime }-423,p^{\prime }+525]\)	92
trailingNullByte \(!=\) 0	1.51	\([p^{\prime }-781,p^{\prime }+44]\)	94
trailing_zero_8bits \(!=\) 0	104.34	\([p^{\prime }-1067,p^{\prime }-59]\)	94
rbsp_stop_one_bit \(!=\) 1	20.07	\([p^{\prime }-662,p^{\prime }+405]\)	91

Table 2. Configuration of KDE Parameters for Common Syntax Errors (\(\sigma _{LB}=0.025, \sigma _{UB}=0.975\))

4.3 Range Specified Successive Cancellation Flip Decoding

In the standard SC polar code decoder, due to its sequential nature, intermediate bit decoding errors can propagate to the decoding of subsequent bits. It was observed in [3] that one or more incorrect bit estimations can cause further incorrect estimations in the subsequent decoding. The Successive Cancellation Flip (SCFlip) decoding [16] was proposed to find and flip the first falsely decoded bit, hoping that the follow-up error bits caused by this incorrect estimation can be corrected. In the original SCFlip decoder, the bits to be flipped are those with the smallest LLR magnitudes among all the non-frozen bits. In fact, this choice is not guaranteed to be correct: the bit with the smallest LLR magnitude is not necessarily the first error bit to be flipped for the next SC decoding trial. When the SCFlip decoder searches for bits to be flipped only within a previously known range in which the actual first error bit may be located, the decoding performance can be improved. We call this range-specified SCFlip decoding R-SCFlip.

Consider a polar code of length \(N\) for sending \(K\) information bits, in which \(r\) CRC bits for checking the validity of the decoded codewords are included. The sent codeword and received signal are denoted as \(u_1^N\) and \(y_1^N\), respectively. Algorithm 1 gives the pseudo-code of R-SCFlip, in which the function \({\rm {SC}}({y_1^N,\mathcal {A},k})\) stands for the SC algorithm based on the received signal \(y_1^N\) and the set of non-frozen bits \(\mathcal {A}\), with bit \({\hat{u}_k}\) flipped. Similar to the SCFlip decoder, R-SCFlip starts by performing a standard SC pass to obtain the first estimate \(\hat{u}_1^N\), as well as the corresponding LLR values \({L({y_1^N,\hat{u}_1^{i - 1}|{u_i}})}\). If \(\hat{u}_1^N\) passes the CRC check, the decoding finishes. Otherwise, R-SCFlip attempts to find the \(T\) most unreliable bits (denote by \(\mathcal {U}\) the set of their indices) within the given range \(\mathcal {R}\) according to \({L({y_1^N,\hat{u}_1^{i - 1}|{u_i}})}\). Then, for every bit \({\hat{u}_k},k \in \mathcal {U}\), R-SCFlip performs one more SC decoding trial in which \({\hat{u}_k}\) is flipped with respect to its decoded result in the standard SC algorithm. The flipped value of \({\hat{u}_k}\) is fed forward into the LLR calculation of the following bits, and thus affects the whole decoding result \(\hat{u}_1^N\). Again, if the newly decoded \(\hat{u}_1^N\) passes the CRC check, the decoding is completed. Otherwise, R-SCFlip continues the process until all \(T\) trials have been used.
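The control flow described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the article's Algorithm 1: the SC pass and the CRC check are passed in as hypothetical callables (`sc_decode(channel_llr, info_set, flip_index)` returning the decoded bits and their decision LLRs, and `crc_ok(u_hat)` returning a boolean), the range \(\mathcal {R}\) is taken as an inclusive pair of 0-based indices into \(u_1^N\), and "most unreliable" is taken to mean smallest LLR magnitude.

```python
def select_flip_candidates(bit_llrs, info_set, bit_range, T):
    """Indices of the T least reliable non-frozen bits inside bit_range."""
    lo, hi = bit_range
    in_range = [i for i in sorted(info_set) if lo <= i <= hi]
    in_range.sort(key=lambda i: abs(bit_llrs[i]))   # smallest |LLR| first
    return in_range[:T]

def r_scflip(channel_llr, info_set, bit_range, T, sc_decode, crc_ok):
    """Range-restricted SCFlip loop; returns (u_hat, crc_passed).

    sc_decode and crc_ok are hypothetical callables (see the lead-in).
    """
    # Initial standard SC pass (no bit flipped).
    u_hat, bit_llrs = sc_decode(channel_llr, info_set, None)
    if crc_ok(u_hat):
        return u_hat, True
    # At most T extra trials, each flipping one candidate inside the range.
    for k in select_flip_candidates(bit_llrs, info_set, bit_range, T):
        candidate, _ = sc_decode(channel_llr, info_set, k)
        if crc_ok(candidate):
            return candidate, True
    return u_hat, False   # all trials used without passing the CRC
```

Passing the whole frame as `bit_range` makes the loop behave like a plain SCFlip search, which is a convenient way to compare the two in a simulation.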

The difference between SCFlip and R-SCFlip lies in the range searched for the first error bit to flip. R-SCFlip identifies the most unreliable bits by searching for the smallest \({L({y_1^N,\hat{u}_1^{i - 1}|{u_i}})}\) values in the given range \(\mathcal {R}\), while SCFlip searches the whole range of the current data-frame. If the actual first error bit does lie in the given range, R-SCFlip has a higher probability than SCFlip of flipping the exact error bit and is therefore more likely to produce the correct decoding result. Thus, R-SCFlip is expected to correct more data-frames than SCFlip. Figure 3 shows the simulated performance of the SC, SCFlip, and R-SCFlip decoders. The polar code parameters used in the simulations are the same as those of the experiments in Section 5, given in Table 3. Two of the most common channel models, Additive White Gaussian Noise (AWGN) and Rayleigh fading, are considered. The simulated Signal-to-Noise Ratio (SNR) ranges for the two channels are 1.5\(\sim\)2.6 dB and 13.5\(\sim\)15.4 dB (measured in \(E_b/N_0\)), respectively. Because the fading channel is more realistic than the AWGN channel, experiments with both models help to show that the proposal works in both ideal and practical channels. As can be seen from Figure 3, the R-SCFlip decoder has the smallest Frame Error Ratio (FER), defined as the percentage of erroneous data-frames, and the smallest Bit Error Ratio (BER). R-SCFlip significantly decreases the FER and BER relative to the SCFlip decoder in both the AWGN and fading channels.

Fig. 3.

Table 3.

Source coding		Channel coding
Parameter	Value	Parameter	Value
Video sequence	\(AerialCity\), \(DrivingInCity\), \(DrivingInCountry\), \(PoleVault\), \(BasketBallDrive\), \(RaceHorses\)	Polar codeword size (bit)	16,384
		Data-frame size (bit)	10,240
		Rate of polar code	0.5
Resolution	3840 \(\times\) 1920, 1920 \(\times\) 1080, 832 \(\times\) 480	CRC size (bit)	8
Frames per second	30	Number of SCFlip trials	4
HEVC encoding QP	37, 32, 27, 22	Bandwidth (MHz)	20
GOP size	4	Modulation type	BPSK
Intra period	32	Channel model	AWGN, Rayleigh fading
Max. bytes per slice	100	SNR \(\frac{{{E_b}}}{{{N_0}}}\) (dB)	2.0, 2.1, 2.2, 2.3 (AWGN); 14.6, 14.8, 15.0, 15.2 (fading)

Table 3. Summary of the Experimental Settings

4.4 Joint Range Estimation and R-SCFlip Decoding

By combining the KDE error bit location estimation with R-SCFlip polar decoding, an iterative JSCD scheme is proposed. As depicted in Figure 4, the channel encoded data-frames are first gathered by demodulating the wireless channel signals. They are then channel decoded by SCFlip decoding, which is implemented as R-SCFlip decoding with the range \(R\) set to the whole data-frame \([0, MAX]\). Gathering all the R-SCFlip decoded data-frames yields the raw source encoded video bitstream. As the bitstream is composed of NAL units, the video decoder separates it into a list of NAL units \(L_{NALs}\) by performing NAL boundary identification.
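The text does not detail how NAL boundary identification is done. As a rough illustration only, the sketch below assumes that the bitstream uses the HEVC Annex B byte stream format, in which each NAL unit is preceded by the start code prefix `0x000001` (optionally with an extra leading zero byte). The function name `split_annexb` is hypothetical, and the sketch does nothing about start codes that were themselves corrupted by channel errors.

```python
from typing import List

START_CODE = b"\x00\x00\x01"  # Annex B start code prefix (assumed stream format)


def split_annexb(stream: bytes) -> List[bytes]:
    """Split a byte stream into NAL units at every start code prefix."""
    # Collect the offsets of all start code prefixes.
    positions = []
    i = stream.find(START_CODE)
    while i != -1:
        positions.append(i)
        i = stream.find(START_CODE, i + len(START_CODE))

    nal_units = []
    for n, pos in enumerate(positions):
        begin = pos + len(START_CODE)
        end = positions[n + 1] if n + 1 < len(positions) else len(stream)
        # Trailing zero bytes belong to the next 4-byte start code or to padding,
        # since a NAL unit does not end with a zero byte.
        nal_units.append(stream[begin:end].rstrip(b"\x00"))
    return nal_units
```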

Fig. 4.

The next step is to perform JSCD iteratively on the NAL units in \(L_{NALs}\), one by one. The current NAL unit to be decoded is denoted as \(NAL_{cur}\) and is assigned the elements of \(L_{NALs}\) in turn by invoking Next(\(L_{NALs}\)). If it is not null (meaning that there are still units to be processed), it is passed on to JSCD; otherwise, the iterative JSCD process finishes. First, \(NAL_{cur}\) is fed into the video decoder for HEVC decoding. If the decoder reports no assertion failure, the decoded source bits for \(NAL_{cur}\) are assumed to be correct and the process moves on to the next NAL unit. Otherwise, the error type and the error bit position \(p^{\prime }\) are deduced from the reported assertion message and the position of the bit currently being decoded in \(NAL_{cur}\). This information is saved and compared with the result of the previous JSCD pass. If they are the same, the last R-SCFlip channel decoding has failed to correct the error bits, so conventional video decoding error concealment is performed and the process moves on to the next NAL unit. If they are not the same, a new error situation caused by newly encountered error bits has occurred, and this information has not yet been used by R-SCFlip decoding to correct error bits. In that case, the error bit location range estimation and R-SCFlip decoding are combined as follows.

The data-frame \(F\) corresponding to \(NAL_{cur}\) is first determined. The location range of the possible error bits in data-frame \(F\) that caused the assertion failure is estimated as \(R=[p_{LB},p_{UB}]\) through the KDE fitting model according to the semantic and syntax error type, as described in Section 4.2. With the estimated range \(R\), data-frame \(F\) is fed into the R-SCFlip decoder for another round of channel decoding. The newly generated \(NAL_{cur}\) after R-SCFlip decoding is returned to the video decoder for HEVC decoding, and a new JSCD pass begins. When every NAL unit in \(L_{NALs}\) has either passed HEVC decoding without an assertion failure or undergone error concealment, the JSCD process finishes.
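The control flow of this iterative process can be outlined as below. It is a minimal sketch under stated assumptions rather than the authors' implementation: the callables `hevc_decode`, `estimate_range`, and `redecode` are hypothetical stand-ins for the HEVC decoder, the KDE range estimator, and the R-SCFlip re-decoding of the data-frame behind a NAL unit, and the `max_rounds` guard is an added safety limit that the text does not mention.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class DecodeError:
    error_type: str  # semantic/syntax error category reported by the video decoder
    bit_pos: int     # position p' of the bit being decoded when the assertion failed


def iterative_jscd(
    nal_units: List[bytes],
    hevc_decode: Callable[[bytes], Optional[DecodeError]],
    estimate_range: Callable[[DecodeError], Tuple[int, int]],
    redecode: Callable[[int, Tuple[int, int]], bytes],
    max_rounds: int = 16,  # safety limit, not part of the described scheme
) -> List[str]:
    """Return 'decoded' or 'concealed' for each NAL unit (updated in place)."""
    status = []
    for idx in range(len(nal_units)):
        last_error = None
        outcome = "concealed"  # used when the same error repeats or rounds run out
        for _ in range(max_rounds):
            err = hevc_decode(nal_units[idx])
            if err is None:
                outcome = "decoded"  # no assertion failure: accept this NAL unit
                break
            if err == last_error:
                break  # the last re-decoding did not help: fall back to concealment
            last_error = err
            search_range = estimate_range(err)              # R = [p_LB, p_UB]
            nal_units[idx] = redecode(idx, search_range)    # R-SCFlip on data-frame F
        status.append(outcome)
    return status
```

Comparing only against the immediately preceding error mirrors the description above; the round limit merely prevents an endless loop if two different errors were to alternate.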

In the proposed iterative JSCD, the assertion failures reported by the HEVC decoder result from the source bitstream violating HEVC semantic and syntax restrictions, and they are essentially a product of the information redundancy hidden in the HEVC standard. For the R-SCFlip channel decoder, they serve as external information that helps improve the channel decoding accuracy. Specifically, the proposed scheme alternates between estimating the position ranges of error bits and performing R-SCFlip polar channel decoding. The performance of the proposed iterative JSCD mainly relies on the accuracy of the error bit location range estimation and the effectiveness of the R-SCFlip channel decoder.

5 Experimental Results and Analysis

In this section, extensive experiments are designed to validate the proposed JSCD scheme for transmitting HEVC encoded video bitstreams in the wireless channels with polar encoding. The experimental settings are firstly presented. Then the channel decoding accuracy performance is validated. Next, the video quality improvements are evaluated in terms of different metrics and visual results. Moreover, the computational complexity of the proposal is analyzed and the comparison with a related scheme is given as well.

5.1 Experimental Settings

To simulate polar encoding and decoding, the open-source AFF3CT tool [8] was used. To encode and decode VR videos in HEVC, the HM software [6] integrated with 360lib [23] was used. Four 360\(^\circ\) video sequences (\(AerialCity\), \(DrivingInCity\), \(DrivingInCountry\), \(PoleVault\)) representing VR multimedia and two regular 2D videos (\(BasketBallDrive\), \(RaceHorses\)) were chosen for evaluation. They were encoded in HEVC with four Quantization Parameters (QPs): 37, 32, 27, and 22. Note that, in the following tables, the names of the test videos are abbreviated as A.C., D.I.C., D.I.Cnt., P.V., B.D., and R.H., respectively. For wireless communications, the AWGN and Rayleigh fading channels with different SNR levels were considered.

Table 3 summarizes the experimental configurations. The two regular 2D videos were divided into five and three segments, respectively (each segment consists of 100 frames), and all the segments were tested in the experiments. The experimental results reported below for these two 2D videos are averages over all of their segments. It has been found that the number of test video frames has no significant influence on the performance of JSCD schemes [47]. Therefore, for the 360\(^\circ\) videos, the first 50 video frames are selected for QP 37 and 32, and the first 25 and 10 video frames are selected for QP 27 and 22, respectively. Detailed information on the test videos is summarized in Table 4, which gives the number of video frames, the bitstream sizes, and the numbers of NAL units (the 2D videos are listed by segment). All the experiments with the same parameter configuration are run 100 times for the 360\(^\circ\) videos and 10 times for the 2D videos, and all the presented data are averages over these individual simulations.

Table 4.

Seq.	Number of video frames				Bitstream size (bytes)				Number of NALs
Seq.	QP = 22	QP = 27	QP = 32	QP = 37	QP = 22	QP = 27	QP = 32	QP = 37	QP = 22	QP = 27	QP = 32	QP = 37
A.C.	10	25	50	50	1,437,444	683,439	545,609	283,424	8,262	5,596	4,515	2,656
D.I.C.	10	25	50	50	1,200,762	958,512	866,025	445,354	6,839	7,412	7,754	4,449
D.I.Cnt.	10	25	50	50	2,099,940	1,614,957	1,337,263	564,394	9,148	10,973	10,878	5,493
P.V.	10	25	50	50	2,668,564	1,821,949	1,395,539	644,376	11,708	11,747	10,613	5,648
B.D.1	100	100	100	100	5,924,782	2,152,390	1,086,692	589,493	30,131	17,435	10,523	6,227
B.D.2	100	100	100	100	6,930,240	2,412,996	1,186,813	637,157	32,471	18,513	11,087	6,665
B.D.3	100	100	100	100	5,842,612	2,119,768	1,059,965	571,621	28,378	16,422	9,971	6,003
B.D.4	100	100	100	100	5,763,228	2,062,464	1,035,605	555,595	28,441	16,258	9,771	5,786
B.D.5	100	100	100	100	7,025,233	2,392,721	1,181,242	632,550	30,837	17,889	10,840	6,585
R.H.1	100	100	100	100	3,169,178	1,319,091	633,682	304,993	9,840	7,769	4,833	2,966
R.H.2	100	100	100	100	3,122,401	1,223,352	560,953	259,744	9,199	6,801	4,164	2,559
R.H.3	100	100	100	100	1,858,032	848,466	438,942	235,197	8,827	6,361	3,958	2,480

Table 4. Detail Summary of Testing Video Sequences

5.2 Analysis on Channel Decoding Accuracy

With the help of the proposed JSCD, the R-SCFlip channel decoder attempts to re-decode the data-frames of the physical layer raw video bitstream by flipping error bits in the range predicted by the KDE estimator. Consequently, it has more chances to recover erroneously decoded data-frames, which reduces the FER. Suppose \(N_{frm}\) is the total number of data-frames and \(N_{E}\) is the number of error frames without JSCD. Some of these \(N_{E}\) error frames are corrected after performing JSCD; denote the number of corrected frames as \(N_{E}^{^{\prime }}\). Then, define \(\Delta _{FER}=N_{E}^{^{\prime }}/N_{frm}\) and \(\Delta _{EC}=N_{E}^{^{\prime }}/N_{E}\) as the FER improvement and the error data-frame correction ratio, respectively, which indicate how much the JSCD method improves the channel decoding accuracy.
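To make the two definitions concrete, the short sketch below computes them from frame counts. The function name `fer_metrics` and the counts in the usage note are purely illustrative and are not taken from the experiments.

```python
from typing import Tuple


def fer_metrics(n_frames: int, n_error: int, n_corrected: int) -> Tuple[float, float, float, float]:
    """Return (FER without JSCD, delta_FER, delta_EC, FER after JSCD)."""
    fer = n_error / n_frames                 # N_E / N_frm
    delta_fer = n_corrected / n_frames       # N_E' / N_frm
    # N_E' / N_E; defined as 0 here when there are no error frames (assumption)
    delta_ec = n_corrected / n_error if n_error else 0.0
    return fer, delta_fer, delta_ec, fer - delta_fer
```

For example, with hypothetical counts of 1,000 data-frames, 80 error frames, and 30 corrected frames, the FER drops from 8% to 5%, so \(\Delta _{FER}\) is 3% and \(\Delta _{EC}\) is 37.5%. The two quantities are linked by \(\Delta _{EC}=\Delta _{FER}/FER\).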

Figures 5 and 6 give the FER, \(\Delta _{FER}\), and \(\Delta _{EC}\) results for the AWGN and Rayleigh fading channels, respectively. The black solid lines represent the FERs for different QPs. The red dashed lines represent the FER improvements \(\Delta _{FER}\), and the colored lines with triangles represent the error data-frame correction ratios \(\Delta _{EC}\). As can be seen, over all the test videos and QPs, \(\Delta _{FER}\) is between 0.18% and 3.01% for the AWGN channel and between 0.26% and 2.63% for the fading channel. \(\Delta _{EC}\) is between 25.08% and 73.41% for the AWGN channel and between 28.02% and 65.16% for the fading channel, which means that at least 25 percent of the error data-frames have been corrected with the help of JSCD. Take \(AerialCity\) with QP 22 in the AWGN channel as an example. As shown in Figure 5(a), the FER is 7.72% when \(\frac{{{E_b}}}{{{N_0}}}=2.0\) dB and \(\Delta _{FER}\) is 2.80%, which means that the FER is reduced to \(7.72\%-2.80\%=4.92\%\) after performing JSCD. \(\Delta _{EC}\) is 36.20%, which means that 36.20 percent of the error data-frames have been corrected. Similarly, when \(\frac{{{E_b}}}{{{N_0}}}=2.1,2.2,2.3\) dB, the \(\Delta _{FER}\) values are 1.05%, 0.53%, and 0.20%, respectively, and the \(\Delta _{EC}\) values are 42.22%, 43.57%, and 51.32%, respectively. For QP 27, the FER improvements for \(\frac{{{E_b}}}{{{N_0}}}=2.0,2.1,2.2,2.3\) dB are 2.57%, 0.86%, 0.50%, and 0.27%, respectively, and the correction ratios are 33.70%, 39.63%, 47.11%, and 62.44%, respectively. The results for QP 32 and 37 show similar performance.

Fig. 5.

Fig. 6.

Generally, as \(\frac{{{E_b}}}{{{N_0}}}\) grows, a larger percentage of the error data-frames can be corrected, while smaller FER improvements are obtained. With respect to QP, it seems that more error data-frames can be corrected for larger QPs. The best case for \(\Delta _{FER}\) is \(AerialCity\) with \(\frac{{{E_b}}}{{{N_0}}}=2.3\) dB and QP = 37, where a \(3.01\%\) FER improvement is achieved. The best case for \(\Delta _{EC}\) is \(BasketBallDrive\) with \(\frac{{{E_b}}}{{{N_0}}}=2.3\) dB and QP = 37, where \(73.41\%\) of the error data-frames are corrected. On average, over the sequences with different QPs and \(\frac{{{E_b}}}{{{N_0}}}\) levels, a \(1.07\%\) FER improvement is achieved and \(44.09\%\) of the error data-frames are corrected.

The proposed approach also works well in the fading channel. As shown in Figure 6, its performance is similar to that in the AWGN channel. The best case for \(\Delta _{FER}\) is \(DrivingInCountry\) with \(\frac{{{E_b}}}{{{N_0}}}=14.6\) dB and QP = 22, where a \(2.63\%\) FER improvement is achieved. The best case for \(\Delta _{EC}\) is also \(DrivingInCountry\), with \(\frac{{{E_b}}}{{{N_0}}}=15.2\) dB and QP = 37, where \(65.16\%\) of the error data-frames are corrected. On average, a \(1.11\%\) FER improvement is achieved and \(46.98\%\) of the error data-frames are corrected. Overall, the scheme performs a little better in the fading channel than in the AWGN channel.

5.3 Video Quality Performance Evaluation

To evaluate the quality of the 360\(^\circ\) videos, Weighted-to-Spherically-uniform Peak Signal-to-Noise Ratio (WS-PSNR) and Voronoi-based Video Multimethod Assessment Fusion (VI-VMAF) [11] are utilized. For the regular 2D videos, Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity (SSIM) are measured. In this subsection, the proposed JSCD (denoted as “Pro.”) is mainly compared to the conventional source-channel separated decoding scheme (denoted as “NoJ.”) and to the proposed JSCD scheme with the range of the channel decoded error bits given as ground truth (denoted as “Pro_RGT”, and as “RGT” in the tables due to space limitations). In addition, the quality of the videos reconstructed from the original HEVC encoded bitstreams is given as the baseline for comparison (denoted as “Rec.”). For a specific metric, if the values for “NoJ.”, “Pro.”, and “Pro_RGT” are \(A\), \(B\), and \(C\), respectively, then the performance improvements of “Pro.” and “Pro_RGT” are calculated as \(\Delta _{Pro}=(B-A)/A\) and \(\Delta _{RGT}=(C-A)/A\), respectively. In the following tables, \(\Delta _{Pro}\) and \(\Delta _{RGT}\) denote these performance improvements.
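The relative improvement \((B-A)/A\) used throughout the tables can be reproduced with a one-line helper. The function name `relative_improvement` is illustrative; the sample values are the WS-PSNR figures quoted in this subsection for \(DrivingInCity\) with QP 27 at 2.2 dB (17.29 for “NoJ.” and 25.52 for “Pro.”).

```python
def relative_improvement(baseline: float, value: float) -> float:
    """Relative gain of `value` over `baseline`, i.e. (value - baseline) / baseline."""
    return (value - baseline) / baseline


# WS-PSNR of "NoJ." (A) and "Pro." (B) for DrivingInCity, QP 27, Eb/N0 = 2.2 dB
gain = relative_improvement(17.29, 25.52)
print(f"{gain:.0%}")  # prints 48%
```

Because the gain is normalized by the “NoJ.” value, the same absolute gain yields a larger percentage when the baseline quality is low.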

Tables 5 and 6 present the WS-PSNR and VI-VMAF results for the 360\(^\circ\) videos in the AWGN channel. It can be seen that “Pro.” and “Pro_RGT” improve the video quality significantly in all cases of \(\frac{{{E_b}}}{{{N_0}}}\) compared to “NoJ.”. Specifically, in terms of WS-PSNR, the best improvements occur for \(DrivingInCity\) with QP 27 at \(\frac{{{E_b}}}{{{N_0}}}=2.2\) dB, which are \((25.52-17.29)/17.29\ \times \ 100\% \approx 48\%\) for “Pro.” and \((36.68-17.29)/17.29 \times 100\% \approx 110\%\) for “Pro_RGT”, as shown in Table 5. The “Avg.” and “All Avg.” improvement values presented in the table are averaged over the four videos. The overall average WS-PSNR improvements of “Pro.” for QP 22, 27, 32, and 37 are \(22\%\), \(25\%\), \(24\%\), and \(20\%\), respectively. In terms of VI-VMAF, as shown in Table 6, the largest improvements reach \((81.48-70.75)/70.75 \times 100\% \approx 15\%\) for “Pro.” on \(AerialCity\) at \(\frac{{{E_b}}}{{{N_0}}}=2.3\) dB with QP 27, and \((82.59-61.60)/61.60 \times 100\% \approx 34\%\) for “Pro_RGT” on \(DrivingInCity\) at \(\frac{{{E_b}}}{{{N_0}}}=2.2\) dB with QP 27. On average, the VI-VMAF improvements of “Pro.” for QP 22, 27, 32, and 37 are \(6\%\), \(7\%\), \(8\%\), and \(7\%\), respectively.

Table 5.

Table 6.

Tables 7 and 8 give the detailed PSNR and SSIM results for the 2D videos in the AWGN channel. They also show that “Pro.” and “Pro_RGT” improve the 2D video quality in all cases of \(\frac{{{E_b}}}{{{N_0}}}\) compared to “NoJ.”. Specifically, in terms of PSNR, the best improvement for “Pro.” occurs for \(RaceHorses\) with QP 37 at \(\frac{{{E_b}}}{{{N_0}}}=2.1\) dB, which is \((21.37-16.31)/16.31 \times 100\% \approx 31\%\). The best improvement for “Pro_RGT” occurs for \(RaceHorses\) with QP 32 at \(\frac{{{E_b}}}{{{N_0}}}=2.1\) dB, which is \((27.30-15.02)/15.02 \times 100\% \approx 82\%\), as shown in Table 7. The overall average PSNR improvements of “Pro.” for QP 22, 27, 32, and 37 are \(7\%\), \(13\%\), \(20\%\), and \(17\%\), respectively. In terms of SSIM, as shown in Table 8, the largest improvements reach \((0.68-0.37)/0.37 \times 100\% \approx 84\%\) for “Pro.” on \(RaceHorses\) at \(\frac{{{E_b}}}{{{N_0}}}=2.1\) dB with QP 37 and \((0.63-0.15)/0.15 \times 100\% \approx 320\%\) for “Pro_RGT” at \(\frac{{{E_b}}}{{{N_0}}}=2.0\) dB with QP 32. On average, the SSIM improvements of “Pro.” for QP 22, 27, 32, and 37 are 24%, 34%, 37%, and 39%, respectively.

Table 7.

Table 8.

For the Rayleigh fading channels, the proposed algorithms achieve similar results. Tables 9 and 10 show that all the metrics are improved in all cases of \(\frac{{{E_b}}}{{{N_0}}}\) and QP for both the 360\(^\circ\) and regular 2D videos. Compared to “NoJ.”, the overall average improvements of WS-PSNR and VI-VMAF for the 360\(^\circ\) videos are \(19\%\) and \(6\%\), respectively, and the overall average improvements of PSNR and SSIM for the 2D videos are \(15\%\) and \(37\%\), respectively.

Table 9.

Table 10.

The experimental results show that the proposed JSCD scheme is sensitive to the channel noise level rather than to the video encoding QP. In addition, the above tables show that the proposed JSCD is inferior to “Pro_RGT”, which means that higher video quality can be obtained if the error bit range is estimated more accurately. This indicates that the proposed JSCD could be further improved. A possible direction is to refine the HEVC semantic and syntax error checking so as to improve the accuracy of the error bit range estimation.

Figures 7 and 8 give the visual results of the final recovered 360\(^\circ\) videos in EquiRectangular (ERP) and CubeMap (CMP) projections, respectively, and Figure 9 gives those of the two 2D videos. The proposed scheme successfully recovers several parts of the slices inside the video frames and thus improves the quality of the whole video. For some video frames, there are gaps between “Pro.” and “Pro_RGT” in the percentage of recovered slices, which indicates that the accuracy of the error bit range prediction plays a key role in improving video quality.

Fig. 7.

Fig. 8.

Fig. 9.

5.4 Complexity Analysis

Figure 4 shows that the proposed JSCD scheme involves iterative loops, which cost extra computation depending on the number of loops performed. In the iterative JSCD process, both the R-SCFlip decoding of data-frames with error bits and the HEVC decoding of the corresponding NAL units need to be run repeatedly. Therefore, the increase in computational complexity can be measured by the number of additional JSCD iterations.

The experimental results are collected from the 360\(^\circ\) videos streamed over Rayleigh fading channels. Table 11 gives the average percentage of additional iterative decoding loops incurred by the JSCD process, where \({\eta _c}\) is the percentage of channel data-frames that need extra R-SCFlip decoding and \({\eta _s}\) is the percentage of NAL units that need extra HEVC decoding. The overall increase in computational complexity can be measured by \({\eta _c}+{\eta _s}\). For lower \(\frac{{{E_b}}}{{{N_0}}}\), more data-frames tend to be channel decoded incorrectly, so more iterative decoding loops are involved when performing JSCD and more extra computation is required. Each loop involves one data-frame channel decoding process and one NAL unit source decoding process. Each experiment with the same configuration is run 100 times, the number of iterative decoding loops is recorded, and the average value is computed. The worst case occurs when \(QP=27\) and \(\frac{{{E_b}}}{{{N_0}}}=14.6\) dB, where the proposed JSCD has to run about 85.97 extra decoding loops on average. In this case, the average total numbers of data-frames and NAL units are 1,984 and 8,932 (according to the parameter configuration given in Table 4), respectively. Thus, the extra computation overheads for channel and HEVC decoding are \(4.35\%\) and \(0.94\%\), respectively, as specified in Table 11, and the total computation overhead is \(4.35\%~+~0.94\%=5.29\%\). According to Table 11, the overall average computation overhead \({\eta _c}+{\eta _s}\) is \((2.85\%+3.01\%+2.95\%+2.95\%)/4=2.94\%\), which buys \(19\%\) and \(6\%\) video quality improvements in WS-PSNR and VI-VMAF, respectively, according to the data in Table 9. Overall, the computational complexity results indicate that performing JSCD is more worthwhile for larger \(\frac{{{E_b}}}{{{N_0}}}\).
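The overhead measure can be sketched as follows, assuming that each extra loop costs exactly one data-frame decoding and one NAL unit decoding, as stated above. The function name `jscd_overhead` is illustrative. Plugging in the rounded averages quoted for the worst case gives values close to, but not identical to, the percentages reported from Table 11; the text does not state how the per-run ratios were averaged, so the small gap is left unexplained here.

```python
from typing import Tuple


def jscd_overhead(extra_loops: float, n_data_frames: int, n_nal_units: int) -> Tuple[float, float, float]:
    """Return (eta_c, eta_s, eta_c + eta_s) as fractions."""
    eta_c = extra_loops / n_data_frames  # share of data-frames re-decoded by R-SCFlip
    eta_s = extra_loops / n_nal_units    # share of NAL units re-decoded by HEVC
    return eta_c, eta_s, eta_c + eta_s


# Worst case quoted in the text: QP = 27, Eb/N0 = 14.6 dB
eta_c, eta_s, total = jscd_overhead(85.97, 1984, 8932)
print(f"eta_c={eta_c:.2%}, eta_s={eta_s:.2%}, total={total:.2%}")

# Overall average of the four eta_c + eta_s values quoted from Table 11
table11_totals = [2.85, 3.01, 2.95, 2.95]
print(round(sum(table11_totals) / len(table11_totals), 2))  # prints 2.94
```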

Table 11.

5.5 Comparison with Other JSCD Methodologies

The JSCD scheme in [41] (referred to as “R. Perera et al. JSCD”) is selected for comparison. For fairness, it is compared with the “Pro_RGT” approach of this article. In the experiments of “R. Perera et al. JSCD”, a total of 60 data-frames (transport blocks) from the Foreman video sequence are considered. The numbers of error blocks that can be recovered are \(60 \times (20\%-3\%)=10.2\) and \(60 \times (53\%-22\%)=18.6\) for SNRs (measured in \(\frac{{{E_s}}}{{{N_0}}}\)) of 11.5 dB and 11.4 dB, respectively. Thus, the improvements are \(10.2/(60 \times 20\%)=85.00\%\) and \(18.6/(60 \times 53\%)=58.49\%\), respectively. For comparison, in our experiments, the total number of transmitted data-frames is 853, derived from the 360\(^\circ\) video sequence AerialCity with QP 32 in ERP projection. The average numbers of erroneously decoded frames are 43.51 and 52.56 under fading channels with SNR = 11.5 dB and 11.4 dB, respectively, of which “Pro_RGT” corrects 40.22 and 47.72, respectively. Thus, the improvements are \(92.44\%\) and \(90.79\%\), respectively. Table 12 summarizes the comparison. The “Pro_RGT” scheme shows better performance in terms of error data-frame recovery. This suggests that, compared to turbo decoders, the polar decoder can correct more data-frames when extrinsic information on the positions of error bits is given.

Table 12.

SNR (dB)	R. Perera et al. JSCD in [41]	Pro_RGT
11.5	85.00%	92.44%
11.4	58.49%	90.79%

Table 12. Comparisons of Improvements of Error Data-frame Recovery
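The entries of Table 12 follow from two equivalent forms of the same recovery ratio: one starting from block error rates before and after JSCD (as reported for the scheme in [41]) and one starting from frame counts (as reported for “Pro_RGT”). The helper names below are illustrative; the inputs are the figures quoted in this subsection.

```python
def recovery_from_rates(error_rate_before: float, error_rate_after: float) -> float:
    """Share of error blocks recovered, given error rates before and after JSCD."""
    return (error_rate_before - error_rate_after) / error_rate_before


def recovery_from_counts(n_error: float, n_corrected: float) -> float:
    """Share of error frames recovered, given (average) frame counts."""
    return n_corrected / n_error


# Scheme in [41]: block error rates quoted in the text
print(f"{recovery_from_rates(0.20, 0.03):.2%}")  # 11.5 dB, prints 85.00%
print(f"{recovery_from_rates(0.53, 0.22):.2%}")  # 11.4 dB, prints 58.49%

# Pro_RGT: average error and corrected frame counts quoted in the text
print(f"{recovery_from_counts(43.51, 40.22):.2%}")  # 11.5 dB, prints 92.44%
print(f"{recovery_from_counts(52.56, 47.72):.2%}")  # 11.4 dB, prints 90.79%
```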

In terms of the quality of visual experience, SSIM results for “R. Perera et al. JSCD” are given in [41] for channel SNRs from 11.375 dB to 11.6 dB. The best case there is the Beergarden video sequence, which has a higher resolution. Therefore, only this case is used for comparison here. The experiments for “Pro_RGT” are likewise performed in the Rayleigh fading channel at the corresponding SNRs for AerialCity in ERP. Figure 10 gives the comparison. It shows that “Pro_RGT” produces higher SSIM values in general and performs significantly better in the low SNR range. The largest performance gap occurs at SNR = 11.375 dB and amounts to an improvement of about \(25\%\). These improvements are owing to the higher ratio of error data-frame recovery provided by the polar decoder utilizing HEVC semantic and syntax error checking.

Fig. 10.

6 Conclusions

We propose a novel JSCD approach of polar codes for HEVC-based video streaming. Based on the semantic and syntax errors reported by the HEVC decoder, the error bit positions in the input video bitstreams are estimated by a KDE fitting approach. Using the estimated error bit position ranges, an R-SCFlip polar decoding algorithm is presented. By combining the KDE error bit range estimator with the R-SCFlip polar decoder, an iterative JSCD methodology is proposed. Experimental results show significant performance improvements compared to the scheme without JSCD. On average, a \(1.09\%\) FER improvement is achieved. The average PSNR and WS-PSNR gains reach \(14\%\) and \(21\%\) for 2D and 360\(^\circ\) videos, respectively. The experiments also indicate that the computational cost of these improvements is affordable. Compared with benchmark JSCD methods, the proposed JSCD recovers more error data-frames, especially at low channel SNR.

Footnote

To distinguish the concepts of a “frame” in a video and a “frame” in the communication physical layer, the word “data-frame” is used in the whole article referring to the physical layer frame. Otherwise, “frame” refers to a video frame.

References

[1] 3GPP. 2018. Multiplexing and Channel Coding. Technical Specification (TS) 38.212. 3rd Generation Partnership Project (3GPP). Version 15.2.0.
