WO2012158159A1

WO2012158159A1 - Packet loss concealment for audio codec

Info

Publication number: WO2012158159A1
Application number: PCT/US2011/036662
Authority: WO
Inventors: Turaj ZAKIZADEH SHABESTARY; Tina LE GRAND
Original assignee: Google Inc.
Priority date: 2011-05-16
Filing date: 2011-05-16
Publication date: 2012-11-22
Also published as: CN103688306A; CN103688306B

Abstract

A speech signal is encoded as a sequence of consecutive frames. When a frame is lost, the loss is concealed at a receiver by reconstructing audio that would be contained in the lost frame based on other previously received frames. The frames contain a residual signal and linear predictive coding parameters representing a segment of audio data. For a lost frame the content of a previous frame is not copied, but is modified to make the reconstructed audio sound natural. The modification includes creating a weighted sum of a quasi-periodic signal derived from the latest two pitch cycles and a pseudo random sequence. The weights are selected based on a determination of whether the previous frame contains voiced or unvoiced utterances.

Description

PACKET LOSS CONCEALMENT FOR AUDIO CODEC

TECHNICAL FIELD

[0001] The technical field relates to packet loss concealment in communication systems (such as Voice over IP, also referred to as VoIP), having an audio codec (coder/decoder). One such codec may be iSAC.

BACKGROUND

[0002] Telephone communication originally relied on dedicated connections between callers. Thus, every ongoing telephone conversation required a physical, real-time, connection to enable real-time communication. Real-time communication refers to communication where the delay between one user speaking and another user hearing the speech is so short that it is imperceptible or nearly imperceptible. In recent years, advances in communication technology have allowed packet-switched networks, such as the Internet, to support real-time communication.

[0003] VoIP is one audio communication approach enabling real-time communication over packet-switched networks. Instead of a dedicated connection between callers, an audio signal is broken up into short time segments by an audio coder, and the time segments are transmitted individually as audio frames in packets. The packets are received by the receiver, the audio frames are extracted, and the short time segments are reassembled by an audio decoder into the original audio signal, enabling the receiver to hear the transmitted audio signal.

[0004] Real-time audio communication over packet-switched networks has brought with it unique challenges. The available bandwidth of the network may be limited, and may change over time. Packets may also get lost or corrupted. A packet is considered lost when it fails to arrive at the intended receiver within a limited time interval, even if the packet does eventually arrive at the receiver.

[0005] One approach for dealing with lost packets is Backward Error Correction (BEC), where the receiver notifies the transmitter that an expected packet was not received, causing the transmitter to re-transmit the expected packet. While viable for tasks such as file transmission, BEC is not desirable for a real-time communication system. In real-time audio communication re-transmission is not a viable option because it typically results in a large delay before the missing packet is received by the receiver. Waiting for re-transmission of a packet would result in the loss of the real-time nature of the communication.

[0006] Another approach for dealing with lost packets is to use information from received packets to recreate the lost packet or packets. The received packets may contain information specifically for this purpose, such as redundant information about audio data from preceding time segments. Such an approach, however, reduces the effective bandwidth available for communication, because part of the available bandwidth is used for transmitting redundant data, which may not be needed at all if packets are not lost.

[0007] The present invention recognizes the problem posed by lost packets in real-time audio communication over packet switched networks, and provides a solution that avoids the disadvantages of the above examples.

[0008] According to an embodiment of the present invention, the loss of packets is concealed by simulating the audio information that would have likely been contained in the lost packets based on previously received packets. The invention utilizes packets that were previously received to reconstruct dropped packets in a particular way, without the use of a jitter buffer. Specifically, information from a previously received packet is used to reconstruct a lost packet, but the information is not merely copied. If it were simply copied, the resulting audio would sound unnatural and "robotic." Instead, the information from the previously received packet is modified in a special way to make the reconstructed packet result in natural sounding audio.

SUMMARY

[0009] In an embodiment, a method of decoding an audio signal having been encoded as a sequence of consecutive frames may include receiving a first frame of the consecutive frames, the first frame containing decoding parameters and residual signals for reconstructing audio data represented by the first frame, storing the residual signals contained in the first frame, decoding the first frame based on the stored residual signals to reconstruct the audio signal encoded by the first frame, determining that a second frame subsequent to the first frame in time has been lost, modifying the stored residual signals, and reconstructing an estimate of the audio signal encoded by the second frame based on the modified residual signals.

[0010] In an embodiment, the modifying the stored residual signals may include generating a periodic signal, generating a colored pseudo-random signal based on the stored residual signals, multiplying the periodic signal and the colored pseudo-random signal with weight factors selected based on energy of an input and an output signal of a pitch synthesis filter created from the stored residual signals and based on pitch gain of the stored residual signals, and summing the weighted periodic signal and the weighted colored pseudo-random signal.

[0011] In an embodiment, the generating the periodic signal may include retrieving at least two most recently stored pitch cycles, altering periodicity of each pitch cycle, weighting each pitch cycle, and summing the two weighted pitch cycles.

[0012] In an embodiment, the altering the periodicity may include resampling pitch pulses of the pitch cycles.

[0013] In an embodiment, the generating the colored pseudo-random signal may include generating a pseudo-random sequence, and filtering the pseudo-random sequence with an Nth-order all-zero filter with coefficients given by the N latest samples of previously decoded lower-band residual signals of a previously received frame.

[0014] In an embodiment, the stored residual signals may include input of a pitch synthesis filter, and input of an LPC synthesis filter. The decoding parameters may include pitch gains, pitch lags, and LPC parameters.

[0015] In an embodiment, the frames may contain encoded information for a first frequency band and a distinct second frequency band higher than the first frequency band, and only a residual signal of the first frequency band is pitch post-filtered, but not a residual signal of the second frequency band.

[0016] In another embodiment, a decoding apparatus for decoding an audio signal having been encoded as a sequence of consecutive frames includes a receiver configured to receive a first frame of the consecutive frames, the first frame containing decoding parameters and residual signals for reconstructing audio data represented by the first frame, a storage unit storing the residual signals contained in the first frame, a decoding unit configured to decode the first frame based on the stored residual signals to reconstruct the audio signal encoded by the first frame, a loss detector configured to determine that a second frame subsequent to the first frame in time has been lost, a modification unit configured to modify the stored residual signals, and a reconstruction unit configured to reconstruct an estimate of the audio signal encoded by the second frame based on the stored residual signals modified by the modification unit.

[0017] In an embodiment, the modification unit may include a first signal generator configured to generate a periodic signal, a second signal generator configured to generate a colored pseudo-random signal based on the stored residual signals, a multiplier multiplying the periodic signal generated in the first signal generator and the colored pseudo-random signal generated in the second signal generator with weight factors selected based on energy of an input and an output signal of a pitch synthesis filter created from the stored residual signals and based on pitch gain of the stored residual signals, and an adder summing the weighted periodic signal and the weighted colored pseudo-random signal output from the multiplier.

[0018] In an embodiment, the first signal generator may be configured to retrieve at least two most recently stored pitch cycles, alter periodicity of each pitch cycle, weight each pitch cycle, and sum the two weighted pitch cycles.

[0019] In an embodiment, the first signal generator may be configured to alter the periodicity by resampling pitch pulses of the pitch cycles.

[0020] In an embodiment, the second signal generator may be configured to generate a pseudo-random sequence, and filter the pseudo-random sequence with an Nth-order all-zero filter with coefficients given by the N latest samples of previously decoded lower-band residual signals of a previously received frame.

[0021] In an embodiment, the stored residual signals may include input of a pitch synthesis filter, and input of an LPC synthesis filter. The decoding parameters may include pitch gains, pitch lags, and LPC parameters.

[0022] In yet another embodiment, a computer readable tangible recording medium is encoded with instructions, wherein the instructions, when executed on a processor, cause the processor to perform a method including receiving a first frame of the consecutive frames, the first frame containing decoding parameters and residual signals for reconstructing audio data represented by the first frame, storing the residual signals contained in the first frame, decoding the first frame based on the stored residual signals to reconstruct the audio signal encoded by the first frame, determining that a second frame subsequent to the first frame in time has been lost, modifying the stored residual signals, and reconstructing an estimate of the audio signal encoded by the second frame based on the modified residual signals.

BRIEF DESCRIPTION OF DRAWINGS

[0023] The present invention will become more fully understood from the detailed description given herein below and the accompanying drawings which are given by way of illustration only, and thus do not limit the present invention.

[0024] FIG. 1 is a block diagram illustrating an example of a communication system according to an embodiment of the present invention.

[0025] FIG. 2 illustrates an example of a stream of packets with a lost packet according to an embodiment of the present invention.

[0026] FIG. 3 illustrates an example of a process flow of receiving packets according to an embodiment of the present invention.

[0027] FIG. 4 illustrates an example of a process flow of decoding received packets according to an embodiment of the present invention.

[0028] FIGS. 5A and 5B illustrate an example of a process flow of an algorithm for concealing packet loss according to an embodiment of the present invention.

[0029] FIGS. 6A and 6B illustrate an example of a process flow of an algorithm for generating a quasi-periodic pulse train according to an embodiment of the present invention.

[0030] FIG. 7 illustrates an example of a processing system for implementing the packet loss algorithm according to an embodiment of the present invention.

DETAILED DESCRIPTION

[0031] Fig. 1 illustrates a communication system. Audio input is passed into one end of the system, and is ultimately output at the other end of the system. The communication can be concurrently bi-directional, as in a telephone conversation between two callers. The audio input can be generated by a user speaking, by a recording, or any other audio source. The audio input is supplied to encoder 102.

[0032] Encoder 102 encodes the audio input into multiple packets, which are transmitted over packet network 104 to decoder 106. Packet network 104 can be any packet-switched network, whether using physical link connection and/or wireless link connections. Packet network 104 may also be a wireless communication network, and/or an optical link network. Packet network 104 conveys packets from encoder 102 to decoder 106. Some of the packets sent from the encoder 102 may get lost, as illustrated in Fig. 2.

[0033] The encoder 102 may be the iSAC coder, and produces packets (also referred to as frames) as output. An embodiment of the invention relies on pitch information, and assumes that pitch parameters are available at the decoder. But even if pitch parameters are not embedded in the payload, they could be estimated at the decoder based on the previously decoded audio. Each frame corresponds to a short segment of time, for example 30 or 60 milliseconds for iSAC. Other segment lengths may also be used with other encoders. One-way delay is at least as large as one frame size, so frame sizes longer than 60 ms may create unacceptably long delays. Furthermore, longer frames are harder to conceal in the event of a lost packet. Shorter frames, on the other hand, may introduce too much packet overhead, reducing the effective bandwidth. If delay were not a concern (for instance in streaming), high quality could be achieved by allowing long frame sizes for stationary segments.

[0034] When the encoder 102 is the iSAC coder, it may separate the incoming audio signal into two frequency bands, referred to as the lower band (LB) and the upper band (UB). For example, the LB may be 0-4 kHz, and the UB may be 4-8 kHz. Other selections of the bands are possible and may be used (e.g., LB = 0-8 kHz, UB = 8-16 kHz). A single frequency band (e.g., 0-8 kHz) may also be used, without separating the incoming audio signal into separate bands.

[0035] As illustrated in Fig. 2, each frame contains at least pitch gain, pitch lag, LPC parameters, and DFT coefficients of a residual signal during the corresponding time segment. In the case where the incoming audio signal was separated into the LB and UB bands, each of the bands has its own information in the frame, and the information for each band can be individually selected from the frame. There are no pitch parameters associated with the UB band. When the encoder used is iSAC, there are 4 sets of pitch parameters and 6 sets of LPC parameters in a frame, to capture the evolution of the signal within the frame. The pitch lag can be thought of as the "optimal" delay of a long-term predictor, and the pitch gain can be thought of as the prediction gain, while the LPC coefficients are optimal short-term prediction coefficients.

[0036] Decoder 106 receives packets conveyed by network 104 and decodes the packets into audio data, which is output from decoder 106. Details of the processing performed by the decoder 106 are illustrated in Figs. 3-6. Decoder 106 may be implemented on a processor, such as illustrated in Fig. 7, or on other hardware platforms, such as mobile telecommunication devices. The processing performed by decoder 106 is advantageous for mobile devices that lack sufficient processing power to perform alternate types of packet loss concealment, as the approach according to the present invention is of a relatively low computational complexity.

[0037] Fig. 3 illustrates a high level processing flow of the PLC approach according to an embodiment of the present invention. In step S 306, a determination is made whether frame N has been received, i.e., not lost. If frame N has been received, the processing continues to step S 320, where frame N is decoded. Fig. 4 illustrates additional details of the processing in step S 320.

[0038] After frame N is decoded in step S 320, the processing increments index N in step S 340, and continues with step S 306 to determine if frame N+1 has been received. So long as frames are not lost, the processing continues along the loop of steps S 306, S 320, and S 340.

[0039] If it is determined in step S 306 that a frame has been lost, the processing continues to step S 350, where the loss of the frame is concealed. Figs. 5A-B illustrate additional details of the processing in step S 350.
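The receive loop of Fig. 3 can be sketched as follows. This is a minimal illustration, not the decoder's actual code: `run_receiver`, `decode`, and `conceal` are hypothetical names, a lost frame is modeled as `None`, and the sketch assumes that processing advances to the next frame after concealment, which the text does not spell out.

```python
from typing import Callable, Iterable, List, Optional


def run_receiver(
    frames: Iterable[Optional[bytes]],
    decode: Callable[[bytes], List[float]],
    conceal: Callable[[], List[float]],
) -> List[List[float]]:
    """Decode received frames and conceal lost ones (lost frame == None)."""
    output = []
    for frame in frames:  # moving to the next item plays the role of S 340
        if frame is not None:  # S 306: was frame N received?
            output.append(decode(frame))  # S 320: normal decoding
        else:
            output.append(conceal())  # S 350: packet loss concealment
    return output
```

In this sketch the concealment callback takes no arguments because it is assumed to work only from state stored during earlier calls to the decoder.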

[0040] Fig. 4 illustrates an example of the process of decoding frames that are received by decoder 106. When a frame is received, frame size and bandwidth information are decoded from the frame in step S 410. The frame size represents the size of the time segment represented by the frame, and can be represented in milliseconds, or count of samples at a particular sampling rate. The sampling rate may also be encoded in the frame. Sampling rate may be negotiated before a call takes place and is not supposed to change during a call. The bandwidth information reflects the bandwidth of the audio data encoded in the frame, and may be LB, UB, or both.

[0041] In step S 415 the pitch lags and the pitch gains are decoded from the frame. Pitch lags and gains may be updated every 7.5 ms, resulting in 4 pitch lags and gains per 30 ms frame. The pitch lag represents the lag of a long-term predictor for the current signal. The pitch gain represents the long-term linear prediction coefficient.

[0042] The decoded pitch lags and pitch gains are stored in step S 420, as they may be needed for packet loss concealment, if subsequent frames are lost.

[0043] In step S 425 the LPC parameters (LPC shape and gain) are decoded. The LPC parameters represent short-term linear prediction coefficients, describing the spectral envelope of the signal.

[0044] The LPC shape and gain are stored in step S 430, as they may be needed for packet loss concealment, if subsequent frames are lost.

[0045] In step S 435 the DFT coefficients of the residual signal encoded in the frame are decoded. The residual signal is the result of filtering out the short term and the long term linear dependencies. The DFT coefficients are the result of transforming the residual signal into the frequency domain by an operation such as the FFT. The DFT coefficients may include separate information for the LB signal and separate information for the UB signal.

[0046] In step S 440 the DFT coefficients which were decoded in step S 435 are transformed from the frequency domain into the time domain, by an operation such as an inverse FFT, resulting in the residual signal. When both the LB and UB signals are used, a separate residual signal is created for LB (referred to as LB_Res) and a separate residual signal is created for UB (referred to as UB_Res).

[0047] In step S 445 the residual signals (LB_Res and UB_Res) are stored, as they may be needed for packet loss concealment.

[0048] In step S 450 the lower band residual signal (LB_Res) is filtered by a pitch post-filter. The pitch post-filter is a pole-zero filter whose coefficients are given by the pitch gain and lag. It is the inverse of the pitch pre-filter and therefore reintroduces the long-term structure that was removed by the pitch pre-filter. Even when both LB_Res and UB_Res are available, only LB_Res is pitch post-filtered. The output of the pitch post-filter (the filtered residual signal) is stored, as it may be needed for packet loss concealment.

[0049] In step S 455 the LPC parameters decoded in step S 425 are used to synthesize the lower band and the upper band signals. LPC synthesis is an all-pole filter with coefficients derived from the LPC parameters. This filter is the inverse of the LPC analysis performed at the encoder and therefore reintroduces the short-term structure of the signal.

[0050] The output of LPC synthesis is the time domain representation of the original encoded signal. In the case where LB and UB are used at the same time, the output is a separate LB signal and UB signal.
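As an illustration of the all-pole LPC synthesis described above, the following sketch applies a direct-form recursion to a residual. It is a simplified stand-in rather than the codec's implementation: the function name `lpc_synthesis`, the sign convention A(z) = 1 + a1*z^-1 + ... + ap*z^-p, and the omission of the LPC gain are all assumptions made for the example.

```python
from typing import List, Sequence


def lpc_synthesis(residual: Sequence[float], a: Sequence[float]) -> List[float]:
    """All-pole filter 1/A(z): y[n] = x[n] - sum_k a[k-1] * y[n-k].

    `a` holds a1..ap (assumed convention); samples before the start of
    the buffer are treated as zero.
    """
    out: List[float] = []
    for n, x in enumerate(residual):
        acc = x
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                acc -= coeff * out[n - k]
        out.append(acc)
    return out
```

For example, feeding a unit impulse through a first-order filter with a1 = -0.5 yields a geometrically decaying response, which is the short-term structure the synthesis filter adds back to the residual.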

[0051] When LB and UB are used together, in step S 460 the LB signal and the UB signal are combined, creating a representation of the original audio input. This combined output can be supplied to a receiver, as illustrated in Fig. 1. In an implementation where LB and UB are not treated separately, and only a single frequency band is used, step S 460 may be skipped.

[0052] As illustrated in Fig. 4, the re-creation of the audio depends on the availability of the residual signal, pitch gain and lag, and LPC parameters from the received frame. In case of packet loss, however, that information is not available. As each frame represents a time segment on the order of 30 milliseconds, it is possible to simply copy the information from a preceding frame to represent the lost frame. With that approach, however, the audio would sound artificial and robotic. Thus, the inventors have derived an approach to reconstruct the data from the lost frame based on previously received frames which creates natural sounding audio. The idea is to reconstruct residual signals, namely the input to pitch synthesis (the low-band residual) and the input to the upper-band LPC synthesis (the upper-band residual), that are similar to those of the previous packet but not exactly the same. The details are illustrated in Figs. 5A-B.

[0053] When it is determined in step S 306 that a frame has been lost, decoder 106 performs packet loss concealment in step S 350. As shown in Fig. 5A, the stored pitch lag and pitch gain are retrieved in step S 510. The pitch lag and pitch gain were stored in step S 420 for the previous received frame.

[0054] In step S 515 the residual signal is retrieved for the previous received frame. The residual signal was stored in step S 445.

[0055] In step S 516, the decoder determines whether the current lost frame is one of consecutive lost frames. If the lost frame is not one of multiple consecutive lost frames, the processing proceeds to step S 520.

[0056] In step S 520 the latest two pitch pulses are computed. The pitch pulses used are those closest in time to the lost frame. The computation is based on the pitch lag and the residual signal retrieved in steps S 510 and S 515. In an embodiment the two latest pitch pulses are only computed for the LB signal, even when both LB and UB signals are used. In other embodiments the two pitch pulses may be computed for both the LB and UB signals. The choice of using two pitch pulses is a design parameter determined by the inventors for optimal performance, but a different number of pitch pulses could also be used.

[0057] In step S 525 the pitch pulses obtained in step S 520 are stored. For the LB signal the pitch pulses will be referred to as LB_P1 and LB_P2.
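One simple way to picture steps S 520 and S 525 is to slice the last two pitch-lag-long segments off the stored residual. The sketch below does exactly that and nothing more. It assumes an integer pitch lag, assumes that the first returned cycle is the most recent one, and uses the hypothetical name `last_two_pitch_cycles`; the text does not state how the pulses are actually located.

```python
from typing import List, Sequence, Tuple


def last_two_pitch_cycles(
    residual: Sequence[float], pitch_lag: int
) -> Tuple[List[float], List[float]]:
    """Return (most recent cycle, the cycle before it), each pitch_lag long."""
    if pitch_lag <= 0:
        raise ValueError("pitch_lag must be positive")
    if len(residual) < 2 * pitch_lag:
        raise ValueError("residual is shorter than two pitch cycles")
    p1 = list(residual[-pitch_lag:])                  # cycle closest to the loss
    p2 = list(residual[-2 * pitch_lag:-pitch_lag])    # cycle just before it
    return p1, p2
```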

[0058] In step S 530 the pitch post-filter output stored in step S 450 is retrieved, and in step S 535 the pitch post-filter output is used to compute a long-term similarity measure. More specifically the long-term similarity measure is a ratio computed based on the energy of pitch pulses before and after the post-filtering of the previous frame. It is a measure of how periodic the previous frame was.

[0059] In step S 540 a voice indicator is computed based on the long-term similarity measure and the frequency of the computed pitch pulses. For example, the voice indicator may be calculated as log2( sigma2_out / sigma2_in ) + 2 * pitch_gain + pitch_gain / 256, where log2(x) is the logarithm of x in base 2, sigma2_out is the variance of the latest pitch pulse at the output of the pitch post-filter, and sigma2_in is the variance of the corresponding pulse at the input. The voice indicator is an indication of how periodic the last decoded frame was.

[0060] In step S 545 weight factors are computed for voiced and un-voiced segments. The weight factor for voiced segments is w_v, while the weight factor for un-voiced segments is w_u. The following pseudo code is an example of an algorithm for calculating the weight factors:

limLow = 0.0214;
limUp = 0.1526;
M = ( limLow + limUp ) / 2;
if( voiceIndicator < limLow ) {
    /* clearly un-voiced */
    w_u = 1;
    w_v = 0;
} else {
    if( voiceIndicator > limUp ) {
        /* clearly voiced */
        w_u = 0;
        w_v = 1;
    } else {
        /* mixed: quadratic transition between the two limits */
        if( voiceIndicator < M ) {
            s = ( voiceIndicator - limLow ) / (double)( 1.41421356237310 * M );
            b = s * s;
            a = 1 - b;
        } else {
            s = ( limUp - voiceIndicator ) / (double)( 1.41421356237310 * M );
            a = s * s;
            b = 1 - a;
        }
        w_u = a;
        w_v = b;
    }
}
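The voice indicator of paragraph [0059] and the weight selection of paragraph [0060] can be sketched in runnable form as follows. The thresholds and formulas are taken from the text; the function names, the `(w_v, w_u)` return order, and the use of floating point are assumptions of this sketch (the `pitch_gain / 256` term suggests the original may use a scaled or fixed-point gain, which is not modeled here).

```python
import math
from typing import Tuple

LIM_LOW = 0.0214
LIM_UP = 0.1526
SQRT2 = 1.41421356237310  # constant as written in the pseudo code


def voice_indicator(sigma2_out: float, sigma2_in: float, pitch_gain: float) -> float:
    """log2(sigma2_out / sigma2_in) + 2 * pitch_gain + pitch_gain / 256."""
    return math.log2(sigma2_out / sigma2_in) + 2 * pitch_gain + pitch_gain / 256


def weight_factors(indicator: float) -> Tuple[float, float]:
    """Return (w_v, w_u) for a given voice indicator."""
    mid = (LIM_LOW + LIM_UP) / 2
    if indicator < LIM_LOW:      # clearly un-voiced
        return 0.0, 1.0
    if indicator > LIM_UP:       # clearly voiced
        return 1.0, 0.0
    if indicator < mid:          # mixed, closer to un-voiced
        s = (indicator - LIM_LOW) / (SQRT2 * mid)
        w_v = s * s
        w_u = 1.0 - w_v
    else:                        # mixed, closer to voiced
        s = (LIM_UP - indicator) / (SQRT2 * mid)
        w_u = s * s
        w_v = 1.0 - w_u
    return w_v, w_u
```

In every branch the two weights sum to one, so the mixed case is a convex combination of the periodic and noise components.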

[0061] The weights are stored in step S 550. The description of steps S 520 through S 550 is based on non-consecutive lost frames. The processing differs for multiple consecutive lost frames as compared to a single lost frame. In the case of multiple consecutive lost frames, there is no immediately preceding frame that has been received. However, the first lost frame of a sequence of multiple lost frames will have been processed through steps S 520 to S 550. Any subsequent lost frames follow the processing through S 517 and S 547.

[0062] A voiced segment reproduced by simply repeating a pitch pulse would sound very artificial and unpleasant to human ears (an effect known as robotic sound). Thus, the weighting changes over the number of reconstructed pitch cycles to avoid artifacts. In step S 517 a decay rate is increased. The decay rate is the rate at which the synthesized residual signal is decayed to zero, and is applied in step S 590.

[0063] In step S 547 the weight factors w_v and w_u calculated during the previous PLC call (stored in step S 550) are retrieved.

[0064] The processing flow continues in Fig. 5B, where in step S 556 the weight factors w_v and w_u are analyzed to determine what kind of utterance is contained in the most recent received frame. Voiced utterances have a strong periodic nature, while unvoiced utterances do not. If the most recently received frame contains voiced utterances, w_v will be greater than zero. If the frame also contains unvoiced utterances, w_u will also be greater than zero. The weights reflect the relative mix of voiced to unvoiced utterances in the frame. A frame with only voiced utterances will have w_u equal to zero, while a frame with only unvoiced utterances will have w_v equal to zero. If both w_v and w_u are non-zero, the utterance is considered a mixed utterance.

[0065] If it is determined that the utterance is unvoiced (i.e., w_v is zero), the processing proceeds to step S 560, where a pseudo random vector is generated. A pseudo random vector may be generated for LB and a separate one for UB, when both LB and UB are used.

[0066] In step S 562 the pseudo random vector is filtered by an Nth-order all-zero filter with coefficients given by N latest samples of recently decoded residual signal. In an exemplary embodiment, N may be a fixed number equal to 30. This filtering will color the generated pseudo random vectors to have a spectrum envelope similar to that of the previous received packet.
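Steps S 560 and S 562 can be illustrated with a short sketch that generates white noise and shapes it with an all-zero (FIR) filter whose taps are the latest residual samples. The function name `colored_noise`, the uniform noise source, the fixed seed, and the ordering of the taps are assumptions made for this example; the text specifies only that the N latest residual samples serve as coefficients, with N = 30 in an exemplary embodiment.

```python
import random
from typing import List, Sequence


def colored_noise(
    residual: Sequence[float], length: int, order: int = 30, seed: int = 0
) -> List[float]:
    """Filter pseudo-random noise with an all-zero filter built from `residual`."""
    if order <= 0:
        raise ValueError("order must be positive")
    coeffs = list(residual[-order:])  # N latest residual samples as filter taps
    taps = len(coeffs)
    if taps == 0:
        return [0.0] * length
    rng = random.Random(seed)
    # Extra leading samples so every output sample has a full filter history.
    white = [rng.uniform(-1.0, 1.0) for _ in range(length + taps - 1)]
    colored = []
    for n in range(length):
        acc = 0.0
        for k in range(taps):
            acc += coeffs[k] * white[n + taps - 1 - k]
        colored.append(acc)
    return colored
```

Because the taps come from the previous frame's residual, the noise inherits a spectral shape close to that of the previous frame, which is the coloring effect the paragraph describes.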

[0067] If it is determined in step S 556 that the utterance is voiced (i.e., w_u is zero), the processing proceeds to step S 580. In step S 580 a quasi-periodic pulse train is constructed. The quasi-periodic pulse train is a weighted sum of the two latest pitch cycles. The output is the residual signal. In case both LB and UB are used, the output is the LB residual and the UB residual. Details of the process of generating the quasi-periodic pulse train are illustrated in Figs. 6A-B.

[0068] If it is determined in step S 556 that the utterance is mixed, the processing proceeds to step S 570. Step S 570 is functionally the same as step S 580. The details of the processing in step S 570 are illustrated in Figs. 6A-B. The output of step S 570 is a lower band pulse train (referred to as LB_P) and an upper band pulse train (referred to as UB_P).

[0069] In step S 572 two pseudo random vectors are generated, one for LB and one for UB. The process of generating the pseudo random vectors is the same as in step S 560. The LB pseudo random vector will be referred to as LB_N and the UB pseudo random vector will be referred to as UB_N.

[0070] In step S 574 weight factors w_v and w_u are applied to the quasi-periodic pulse train and to the pseudo random vectors as follows. The LB residual is LB_P*w_v + LB_N*w_u. The UB residual is UB_P*w_v + UB_N*w_u.
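The weighted sum of step S 574 amounts to a sample-by-sample mix of the pulse train and the colored noise. A minimal sketch, with the hypothetical helper name `mix_residual`, is shown below; the same function would be applied once for the lower band and once for the upper band.

```python
from typing import List, Sequence


def mix_residual(
    pulse_train: Sequence[float], noise: Sequence[float], w_v: float, w_u: float
) -> List[float]:
    """Return pulse_train * w_v + noise * w_u, computed per sample."""
    if len(pulse_train) != len(noise):
        raise ValueError("pulse train and noise must have the same length")
    return [p * w_v + n * w_u for p, n in zip(pulse_train, noise)]
```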

[0071] At this stage the residual signals have been calculated and weighted appropriately. In step S 590 the residual signal is decayed. The decay is linear and applied sample-by-sample. If K is the size of the reconstructed residual signal, the following pseudo code illustrates an exemplary algorithm for decaying the signal, where d is a number less than 1 and decay_rate is the amount by which d is reduced after each sample:

for n = 1 to K do
{
    s(n) = s(n) * d;
    d = d - decay_rate;
    if ( d < 0 ) then d = 0
}
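The decay loop can be written in runnable form as follows. This is a direct transcription of the pseudo code into Python under the hypothetical name `decay`; returning a new list rather than modifying the signal in place is a choice made for this sketch.

```python
from typing import List, Sequence


def decay(signal: Sequence[float], d: float, decay_rate: float) -> List[float]:
    """Scale each sample by d, lowering d by decay_rate per sample (floor 0)."""
    out = []
    for s in signal:
        out.append(s * d)
        d = max(d - decay_rate, 0.0)  # the gain never goes negative
    return out
```

With a larger decay_rate, as set in step S 517 for consecutive losses, the gain reaches zero after fewer samples and the concealed audio fades out sooner.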

[0072] In step S 592 the LB residual is pitch post-filtered, similar to step S 450. The pitch post-filtering uses filter coefficients derived from pitch lag and pitch gain stored in step S 420. The UB residual can skip pitch post-filtering.

[0073] In step S 594 LPC parameters stored in step S 430 are retrieved, and LPC synthesis of the LB and UB signal is performed based on the retrieved parameters.

[0074] In step S 596 the LB and UB signals are combined to create a synthesized representation of the audio of the lost frame.

[0075] Figs. 6A-B illustrate a detailed description of the process of constructing a quasi-periodic pulse train according to an embodiment of the present invention. A quasi-periodic pulse train is constructed in steps S 570 and S 580.

[0076] In step S 610 the pitch lag of a previous frame, LB_P1, LB_P2, and UB_Res are retrieved. These values were previously stored when the previous frame was received.

[0077] In step S 615 loop counters j and p_cntr are initialized to zero. In step S 616 the decoder determines whether the current frame is one of consecutive lost frames. If the lost frame is not one of multiple consecutive lost frames, the processing proceeds to step S 617, where the value of variable L is set equal to the pitch lag retrieved in step S 610. It can be appreciated that the first lost frame will cause L to be initialized to the value of the pitch lag, but subsequent lost frames will bypass step S 617, and the processing will continue to step S 620.

[0078] In step S 620 LB_P1 is resampled to L samples and assigned to R1. Thus, the length of R1 is L samples.

[0079] In step S 625 the last L samples of UB_Res are selected, and referred to as Q1.

[0080] In step S 630 loop counter i is initialized to zero.

[0081] In step S 635 the quasi-periodic pulse trains LB_P (for the lower band) and UB_P (for the upper band) are constructed. At each iteration of a loop over i and j, LB_P(j)=R1(i) and UB_P(j)=Q1(i), and i and j are incremented by one.

[0082] In step S 636 the decoder determines whether j is less than the frame size (extracted in step S 410). As long as j remains less than the frame size, the loop continues. When j reaches the frame size, LB_P and UB_P are returned as the quasi-periodic pulse trains.

[0083] In step S 638 the decoder determines whether i is less than L. If i is less than L, the process returns to step S 635 and continues the loop. Once i reaches L, the process continues to step S 640, shown in Fig. 6B.

[0084] In step S 640 p cntr is incremented by one.

[0085] In step S 642 the decoder determines whether L is greater than pitch Jag. If L is not greater, L is set to pitch Jag+1 in step S 644. If L is greater than pitch ag, L is set to pitch Jag in step S 646. This processing is an example of resampling of pitch pulses to avoid too much of periodicity in the reconstructed signal.

[0086] In step S 650 LB_P1 is resampled to L samples and assigned to R1. Thus, the length of R1 is L samples.

[0087] In step S 655 LB_P2 is resampled to L samples and assigned to R2. Thus, the length of R2 is L samples.

[0088] In step S 656 the decoder determines whether the value of p_cntr is equal to 1, 2, or 3.

[0089] If the value of p_cntr is 1, R1 is set to (3*R1+R2)/4 in step S 661.

[0090] If the value of p_cntr is 2, R1 is set to (R1+R2)/2 in step S 662.

[0091] If the value of p_cntr is 3, R1 is set to (R1+3*R2)/4 in step S 663, and p_cntr is set to 0 in step S 673.

[0092] At the conclusion of any of steps S 661, S 662, and S 673 the processing returns to step S 630 in Fig. 6A.
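By way of illustration only, the counter update and weighting of steps S 640 and S 656 through S 673 may be sketched as follows. The sketch assumes R1 and R2 have already been resampled to the same length; the names BLEND_WEIGHTS and blend_step and the sample values are hypothetical.

```python
# Weights (w1, w2) applied to R1 and R2 for each value of p_cntr.
BLEND_WEIGHTS = {
    1: (0.75, 0.25),  # S 661: (3*R1 + R2) / 4
    2: (0.50, 0.50),  # S 662: (R1 + R2) / 2
    3: (0.25, 0.75),  # S 663: (R1 + 3*R2) / 4
}


def blend_step(r1, r2, p_cntr):
    """Increment p_cntr, blend the two cycles, and wrap the counter.

    r1 and r2 are the freshly resampled cycles (equal length assumed).
    p_cntr is the counter value before step S 640 (0, 1, or 2).
    Returns the blended cycle and the updated counter.
    """
    if len(r1) != len(r2):
        raise ValueError("r1 and r2 must have the same length")
    if p_cntr not in (0, 1, 2):
        raise ValueError("p_cntr must be 0, 1, or 2 on entry")

    p_cntr += 1                                   # S 640
    w1, w2 = BLEND_WEIGHTS[p_cntr]                # S 656
    blended = [w1 * a + w2 * b for a, b in zip(r1, r2)]
    if p_cntr == 3:
        p_cntr = 0                                # S 673
    return blended, p_cntr


# Three successive cycles shift the weight from R1 toward R2.
r1, r2 = [4.0, 8.0], [0.0, 0.0]
first, c = blend_step(r1, r2, 0)
second, c = blend_step(r1, r2, c)
third, c = blend_step(r1, r2, c)
```

Because R1 and R2 are resampled anew from LB_P1 and LB_P2 before each blend, the weights in this sketch are always applied to the stored cycles rather than to a previously blended result.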

[0093] FIG. 7 is a block diagram illustrating an example of a computing device 700 that is arranged for packet loss concealment in accordance with the present disclosure. In a very basic configuration 701, computing device 700 typically includes one or more processors 710 and system memory 720. A memory bus 730 can be used for communicating between the processor 710 and the system memory 720.

[0094] Depending on the desired configuration, processor 710 can be of any type including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 710 can include one or more levels of caching, such as a level one cache 711 and a level two cache 712, a processor core 713, and registers 714. The processor core 713 can include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP Core), or any combination thereof. A memory controller 715 can also be used with the processor 710, or in some implementations the memory controller 715 can be an internal part of the processor 710.

[0095] Depending on the desired configuration, the system memory 720 can be of any type including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.) or any combination thereof. System memory 720 typically includes an operating system 721, one or more applications 722, and program data 724. Application 722 includes a decoding processing algorithm with packet loss concealment 723 that is arranged to decode incoming packets, and to conceal lost packets according to the present disclosure. Program Data 724 includes service data 725 that is useful for performing decoding of received packets and concealing lost packets, as will be further described below. In some embodiments, application 722 can be arranged to operate with program data 724 on an operating system 721 such as Android, Chrome, Windows, etc. This described basic configuration is illustrated in FIG. 7 by those components within dashed line 701.

[0096] Computing device 700 can have additional features or functionality, and additional interfaces to facilitate communications between the basic configuration 701 and any required devices and interfaces. For example, a bus/interface controller 740 can be used to facilitate communications between the basic configuration 701 and one or more data storage devices 750 via a storage interface bus 741. The data storage devices 750 can be removable storage devices 751, non-removable storage devices 752, or a combination thereof. Examples of removable storage and non-removable storage devices include magnetic disk devices such as flexible disk drives and hard-disk drives (HDD), optical disk drives such as compact disk (CD) drives or digital versatile disk (DVD) drives, solid state drives (SSD), and tape drives to name a few. Example computer storage media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data.

[0097] System memory 720, removable storage 751 and non-removable storage 752 are all examples of computer readable storage media, and store information as described in various steps of the processing algorithms described in this disclosure. Computer readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 700. Any such computer storage media can be part of device 700, and can store instructions that are executed by processor 710, and cause the computing device 700 to perform a method of decoding packets and concealing lost packets as described in this disclosure.

[0098] Computing device 700 can also include an interface bus 742 for facilitating communication from various interface devices (e.g., output interfaces, peripheral interfaces, and communication interfaces) to the basic configuration 701 via the bus/interface controller 740. Example output devices 760 include a graphics processing unit 761 and an audio processing unit 762, which can be configured to communicate to various external devices such as a display or speakers via one or more A/V ports 763. Example peripheral interfaces 770 include a serial interface controller 771 or a parallel interface controller 772, which can be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 773. An example communication device 780 includes a network controller 781, which can be arranged to facilitate communications with one or more other computing devices 790 over a network communication via one or more communication ports 782. The communication connection is one example of a communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. A "modulated data signal" can be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared (IR) and other wireless media. The term computer readable media as used herein can include both storage media and communication media. 

[0099] Computing device 700 can be implemented as a portion of a small-form factor portable (or mobile) electronic device such as a cell phone, a personal data assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions. Computing device 700 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.

[00100] There is little distinction left between hardware and software implementations of aspects of systems; the use of hardware or software is generally (but not always, in that in certain contexts the choice between hardware and software can become significant) a design choice representing cost vs. efficiency tradeoffs. There are various vehicles by which processes and/or systems and/or other technologies described herein can be effected (e.g., hardware, software, and/or firmware), and the preferred vehicle will vary with the context in which the processes and/or systems and/or other technologies are deployed. For example, if an implementer determines that speed and accuracy are paramount, the implementer may opt for a mainly hardware and/or firmware vehicle; if flexibility is paramount, the implementer may opt for a mainly software implementation; or, yet again alternatively, the implementer may opt for some combination of hardware, software, and/or firmware.

[00101] The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof. In one embodiment, several portions of the subject matter described herein may be implemented via Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), digital signal processors (DSPs), or other integrated formats. However, those skilled in the art will recognize that some aspects of the embodiments disclosed herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one of skill in the art in light of this disclosure. In addition, those skilled in the art will appreciate that the mechanisms of the subject matter described herein are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the subject matter described herein applies regardless of the particular type of signal bearing medium used to actually carry out the distribution. Examples of a signal bearing medium include, but are not limited to, the following: a recordable type medium such as a floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, a computer memory, etc.; and a transmission type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).

[00102] Those skilled in the art will recognize that it is common within the art to describe devices and/or processes in the fashion set forth herein, and thereafter use engineering practices to integrate such described devices and/or processes into data processing systems. That is, at least a portion of the devices and/or processes described herein can be integrated into a data processing system via a reasonable amount of experimentation. Those having skill in the art will recognize that a typical data processing system generally includes one or more of a system unit housing, a video display device, a memory such as volatile and nonvolatile memory, processors such as microprocessors and digital signal processors, computational entities such as operating systems, drivers, graphical user interfaces, and applications programs, one or more interaction devices, such as a touch pad or screen, and/or control systems including feedback loops and control motors (e.g., feedback for sensing position and/or velocity; control motors for moving and/or adjusting components and/or quantities). A typical data processing system may be implemented utilizing any suitable commercially available components, such as those typically found in data computing/communication and/or network computing/communication systems.

[00103] With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.

[00104] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

Claims

1. A method of decoding an audio signal having been encoded as a sequence of consecutive frames, the method comprising:

receiving a first frame of the consecutive frames, the first frame containing decoding parameters and residual signals for reconstructing audio data represented by the first frame;

storing the residual signals contained in the first frame;

decoding the first frame based on the stored residual signals to reconstruct the audio signal encoded by the first frame;

determining that a second frame subsequent to the first frame in time has been lost;

modifying the stored residual signals; and

reconstructing an estimate of the audio signal encoded by the second frame based on the modified residual signals.

2. The method according to claim 1, wherein the modifying the stored residual signals includes:

generating a periodic signal;

generating a colored pseudo-random signal based on the stored residual signals;

multiplying the periodic signal and the colored pseudo-random signal with weight factors selected based on energy of an input and an output signal of a pitch synthesis filter created from the stored residual signals and based on pitch gain of the stored residual signals; and

summing the weighted periodic signal and the weighted colored pseudo-random signal.

3. The method according to claim 2, wherein the generating the periodic signal includes:

retrieving at least two most recently stored pitch cycles;

altering periodicity of each pitch cycle;

weighting each pitch cycle; and

summing the two weighted pitch cycles.

4. The method according to claim 3, wherein the altering the periodicity includes: resampling pitch pulses of the pitch cycles.

5. The method according to claim 2, wherein the generating the colored pseudo-random signal includes:

generating a pseudo-random sequence; and

filtering the pseudo-random sequence with an Nth-order all-zero filter with coefficients given by N latest samples of previously decoded lower-band residual signals of a previously received frame.

6. The method according to claim 1, wherein

the stored residual signals include

input of a pitch synthesis filter, and

input of an LPC synthesis filter; and

the decoding parameters include

pitch gains,

pitch lags, and

LPC parameters.

7. The method according to claim 1, wherein

the frames contain encoded information for a first frequency band and a distinct second frequency band higher than the first frequency band, and

only a residual signal of the first frequency band is pitch post filtered, but not a residual signal of the second frequency band.

8. A decoding apparatus for decoding an audio signal having been encoded as a sequence of consecutive frames, comprising:

a receiver configured to receive a first frame of the consecutive frames, the first frame containing decoding parameters and residual signals for reconstructing audio data represented by the first frame;

a storage unit storing the residual signals contained in the first frame;

a decoding unit configured to decode the first frame based on the stored residual signals to reconstruct the audio signal encoded by the first frame;

a loss detector configured to determine that a second frame subsequent to the first frame in time has been lost;

a modification unit configured to modify the stored residual signals; and

a reconstruction unit configured to reconstruct an estimate of the audio signal encoded by the second frame based on the stored residual signals modified by the modification unit.

9. The decoding apparatus according to claim 8, wherein the modification unit comprises:

a first signal generator configured to generate a periodic signal;

a second signal generator configured to generate a colored pseudo-random signal based on the stored residual signals;

a multiplier multiplying the periodic signal generated in the first signal generator and the colored pseudo-random signal generated in the second signal generator with weight factors selected based on energy of an input and an output signal of a pitch synthesis filter created from the stored residual signals and based on pitch gain of the stored residual signals; and

an adder summing the weighted periodic signal and the weighted colored pseudo-random signal output from the multiplier.

10. The decoding apparatus according to claim 9, wherein the first signal generator is configured to:

retrieve at least two most recently stored pitch cycles;

alter periodicity of each pitch cycle;

weight each pitch cycle; and

sum the two weighted pitch cycles.

11. The decoding apparatus according to claim 10, wherein

the first signal generator is configured to alter the periodicity by resampling pitch pulses of the pitch cycles.

12. The decoding apparatus according to claim 9, wherein the second signal generator is configured to:

generate a pseudo-random sequence; and

filter the pseudo-random sequence with an Nth-order all-zero filter with coefficients given by N latest samples of previously decoded lower-band residual signals of a previously received frame.

13. The decoding apparatus according to claim 8, wherein