A Systematic Framework For Dynamically Optimizing Multi-User Wireless Video Transmission
A Systematic Framework For Dynamically Optimizing Multi-User Wireless Video Transmission
A Systematic Framework For Dynamically Optimizing Multi-User Wireless Video Transmission
3, APRIL 2010
Abstract—In this paper, we systematically formulate the prob- mission solutions, focusing on packet scheduling, error pro-
lem of multi-user wireless video transmission as a multi-user tection or cross-layer adaptation in order to maximize the re-
Markov decision process (MUMDP) by explicitly considering the ceiving user’s video quality [1][4] and multi-user video trans-
users’ heterogeneous video traffic characteristics, time-varying
network conditions as well as, importantly, the dynamic coupling mission, emphasizing multi-user resource allocation among
among the users’ resource allocations across time, which are often multiple users simultaneously transmitting video and sharing
ignored in existing multi-user video transmission solutions. To the same wireless resources [5][7]. However, most existing
comply with the decentralized wireless networks’ architecture, solutions in both categories do not explicitly consider both the
we propose to decompose the MUMDP into multiple local MDPs heterogeneous characteristics of the video traffic and the time-
using Lagrangian relaxation. Unlike in conventional multi-user
video transmission solutions stemming from the network utility varying network conditions (e.g. time-varying channel condi-
maximization framework, the proposed decomposition enables tions, dynamic multi-user channel access, etc.), thereby often
each wireless user to individually solve its own local MDP (i.e. leading to suboptimal performance for wireless media systems.
dynamic single-user cross-layer optimization) and the network For example, in the single-user video transmission category,
coordinator to update the Lagrangian multipliers (i.e. resource most solutions employ Unequal-Error-Protection (UEP) tech-
prices) based on not only current, but also the future resource
needs of all users, such that the long-term video quality of niques [3] to differentially protect the video packets, or deploy
all users is maximized. This MUMDP solution provides us rate-distortion optimization to schedule the video packets [1]
the necessary foundations and structures for solving multi- based on their distortion impact and delay deadline and the
user video communication problems. However, to implement packets’ dependencies. However, these solutions assume only
this framework in practice requires statistical knowledge of the simplistic underlying network (channel) models and they do
experienced environment dynamics, which is often unavailable
before transmission time. To overcome this obstacle, we propose not consider the adaptation of transmission parameters at the
a novel online learning algorithm, which allows the wireless other layers of the network stack, besides the application layer.
users to simultaneously update their policies at multiple states To deal with the wireless network dynamics, cross-layer adap-
during each time slot. This is different from conventional learning tation methods [2] have been proposed to optimize on-the-fly
solutions, which often update the current visited state per time the transmission parameters at various layers, based on current
slot. The proposed learning algorithm can significantly improve
the learning performance, thereby dramatically improving the observations of channel conditions. However, these cross-layer
video quality experienced by the wireless users over time. Our solutions are myopic and result in suboptimal performance
simulation results demonstrate the efficiency of the proposed because they do not account for the future channel conditions
MUMDP framework as compared to conventional multi-user and video traffic.
video transmission solutions.
In the multi-user video transmission category, many current
Index Terms—Multi-User Video Transmission, Markov Deci-
techniques [7][8] are based on the network utility maximiza-
sion Process, Lagrangian Relaxation, Online Learning.
tion (NUM) framework [6]. In the NUM framework, the basic
assumption is that each user has a static utility function of the
I. I NTRODUCTION
(average) allocated transmission rate (or QoS). For example,
wireless network, we propose a systematic framework for dy- online reinforcement learning [16], and the network coor-
namically and foresightedly optimizing the cross-layer trans- dinator will update the resource price dynamically using
mission strategies (e.g. packet scheduling and resource acqui- stochastic subgradient methods [15]. Unlike conventional
sition, etc) of multiple users coexisting in the same wireless online learning algorithms [16], which often update the
channel in order to maximize their long-term video quality. In policy for only the visited state during each time slot, our
the proposed framework, unlike existing video transmission proposed learning algorithm can simultaneously update
solutions, we explicitly consider the heterogeneous video multiple states during each time slot, which can signifi-
traffic characteristics, experienced time-varying network con- cantly improve the learning performance. This approach
ditions, as well as the dynamic coupling among the users’ has two advantages: (i) it does not require each user to
transmission strategies across time. Our contributions are: know the statistical distribution of channel conditions and
• To characterize the heterogeneous video data, we define incoming video traffic beforehand; and, (ii) the wireless
a traffic state for each user which considers the number user and the network coordinator need to perform only
of data units (e.g. video frames or video packets) to very simple computations in each time slot.
be transmitted, their distortion impacts, and the depen- The paper is organized as follows. Section II defines the
dencies between them at each transmission time. The traffic states, the state transition and the utility function for
traffic state together with the network state (e.g. chan- each wireless user at each time slot. Section III formulates the
nel conditions) characterizes the environment dynamics single-user cross-layer optimization as an MDP and proves
experienced by each user. Using this state definition, we that the utility function is concave. Section IV formulates
are able to dynamically optimize the resource acquisition the multi-user video transmission problem as an MUMDP.
and packet scheduling for video transmission over time- Section V presents how the MUMDP can be decomposed into
varying networks. multiple local MDPs using the Lagrangian relaxation method
• We further formulate the optimization of the packet and develop the corresponding subgradient method to update
scheduling and resource allocation for the dynamic multi- the resource price. Subsequently, Section VI describes the
user video transmission system as a weakly-coupled proposed distributed online learning algorithm to deal with
MUMDP [10] problem. The MUMDP formulation allows the unknown video characteristics and channel conditions.
each user to make foresighted transmission decisions Section VII presents numerical results to validate the proposed
by taking into account the future impact of its current framework. The conclusions are drawn in Section VIII.
decisions on the long-term utilities of all the users.
Unlike the conventional centralized solutions [9] to the
MDP which have very high computation complexity and II. M ODELS FOR H ETEROGENEOUS V IDEO T RAFFIC
unacceptable communication overheads, we propose to Unlike traditional traffic models [19], which only charac-
decompose this weakly-coupled MUMDP problem using terize the rate changes of video traffic, in this section we
Lagrangian relaxation into multiple local MDPs, each aim to develop a general model to represent the encoded
of which can be separately solved by the individual video traffic with heterogeneous characteristics (e.g. various
users. This decomposition is different from the conven- delay deadlines, distortion impacts, dependencies, etc.). Using
tional dual solutions [7] to the multi-user NUM-based this video traffic model, we will be able to dynamically
video transmission problem in two ways: (i) instead of optimize the resource acquisition and packet scheduling for
maximizing the static utility at each transmission time, video transmission over time-varying networks.
our approach allows each wireless user to solve the
dynamic cross-layer optimization problem (formulated as
the local MDP), which is vital for the delay-sensitive A. Attributes of data units
video applications; (ii) instead of updating the Lagrangian In this section, we discuss how the heterogeneous attributes
multipliers only based on the current resource require- of the video traffic can be modelled. The video data is
ments of all users, our approach updates the multipliers often encoded periodically using a Group of Pictures (GOP)
based on not only current, but also future resource needs, structure, which lasts a period of T time slots. The video
such that the long-term video quality of all the users is frames within one GOP are encoded interdependently using
maximized. To the best of our knowledge, this is the first motion estimation, while the frames belonging to different
attempt to formalize the multi-user video communication GOPs are encoded independently. Note that the prediction-
problem using MUMDP and decompose the MUMDP based coding schemes can lead to sophisticated dependencies
such that it can be solved in a decentralized manner by between the video data.
autonomous, yet collaborative, users. After being encoded, each GOP contains N data units
• To overcome the obstacle of the unknown environmen- (DUs), each of which represents one type of DU (e.g. I, P,
tal dynamics (e.g. state transition probabilities) in real- B frames) and indexed by j ∈ {1, 2, · · · , N } . The types
time video applications operating in dynamic multi-user of DUs are determined based on its distortion impact, delay
networks, the MUMDP framework provides the neces- deadline, and dependencies which are illustrated below. We
sary foundations and principles for how the users can assume that the GOP structure is fixed (i.e. the types of DUs
autonomously learn on-line to cooperatively optimize the are fixed at each GOP). The set of DUs within GOP g ∈ N is
global long-term video quality. Specifically, to deal with denoted by {f1g , · · · , fN
g
}. The attributes of DU fjg are listed
the unknown dynamics, each wireless user will deploy below.
310 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 28, NO. 3, APRIL 2010
Size: The size of DU fjg is denoted as ljg (measured in structure is fixed, it is easy to show that Ft is periodic with
packets1 ), where ljg ∈ [1, ljmax], and ljmax is the maximum the period of T , which means that, for any fjg ∈ Ft , there
allowable size for the j-th DU at each GOP. The size of exists fjg+1 ∈ Ft+T and vice versa. Hence, Ft and Ft+T have
DU fjg is determined when DU fjg is encoded. To simplify the same types of DUs and the same DAG between these
the exposition, ljg is generated from an i.i.d. random variable DUs. For example, as shown in Figure 1, Ft = {f1g , f2g , f3g }
2
with the probability mass function P M Fj (l) . Note that and Ft+3 = {f1g+1 , f2g+1 , f3g+1 } where T = 3. It is clear
P M Fj (l) is the same for the j-th DU across different GOPs, that the DAG between the DUs in Ft is also the same as
but is different for different types of DUs. the one between the DUs in Ft+3 . Due to the periodicity,
Distortion impact: Each DU fjg has a distortion impact qjg there are only T different DU-type states. The transition of Ft
per packet, which
is assumed to be the same for all the GOPs, is deterministic and denoted by δ(Ft+1 − N ext(Ft )) where
i.e. qjg = qjg , ∀g, g . N ext(Ft ) represents the next DU-type state following Ft and
Delay deadline: We define the delay deadline of DU fjg δ(x) = 1 if x = 0 and otherwise, 0. For example in Figure 1,
as the time by which the DU should be decoded in order to N ext(Ft ) = {f2g , f3g , f4g , f5g }.
be displayed. We denote by dgj the delay deadline of DU fjg . Furthermore, for each DU fjg ∈ Ft , we denote by bgj the
Since the GOP structure is fixed, the difference between the amount of packets remaining for transmission at time slot t.
delay deadlines of the two DUs within one GOP is constant, Note that bgj ≤ ljg which means the remaining packets must
i.e. dgj − dgj = Δdjj > 0 where j > j , and the delay be less than the amount of the originally available packets.
deadlines of the same DUs from different GOPs satisfy dgj − If DU fjg is undecodable, bgj = −1. We denote the buffer
dg−1
j = T . In other words, the delay deadline dgj is periodic state of the DUs in Ft by Bt = {bgj |fjg ∈ Ft }. The traffic
with the period of T , which is the length of one GOP. state Tt = (Ft , Bt ) of the video application is then defined as
Dependency: When one DU fjg is encoded based on the representing the types of DUs, the dependencies between them
prediction from the other DU fjg , we say that DU fjg depends and the amount of packets remaining for transmission. Hence,
on DU fjg . Note that the dependencies between DUs only the traffic state Tt is able to capture heterogeneous video traffic
occur within one GOP and DUs from different GOPs can be and is a super-set of existing well-known single-buffer models
decoded independently (i.e. no dependency between them.). (i.e. which ignore packet dependencies and delay deadlines)
The dependencies between the DUs within one GOP are or multi-buffer models (i.e. which ignore packet dependencies
expressed as a directed acyclic graph (DAG) [1]. The DAG or delay deadlines).
remains the same for a fixed GOP structure. In this paper, we
assume that, if DU fjg depends on DU fjg (i.e. there exists a C. Packet scheduling, state transition and immediate reward
path directed from DU fjg to DU fjg and denoted by j ≺ j
), then dgj ≥ dgj and qjg ≤ qjg . In other words, DU fjg should Given a transmission rate4 rt at time slot t, the wireless
be decoded prior to DU fjg and DU fjg has higher distortion user has to determine the amount of packets to be transmitted
for each DU in Ft , which we refer to as packet scheduling.
impact. One illustrative example of DAGs for video data is
given in Figure 1. The scheduling policy π maps the current traffic state Tt and
transmission rate rt into the amount of packets transmitted
during each time slot, yt = [yjg |fjg ∈ Ft ], i.e. π(Tt , rt ) = yt
B. Traffic state representation in each time slot . Formally, the scheduling policy π satisfies the following
In this subsection, we discuss how to represent the video conditions5: (i) Underflow
constraint: 0 ≤ yjg ≤ bgj , ∀fjg ∈ Ft ;
g
traffic which can be potentially transmitted in each time slot. (ii) Rate constraint: fjg ∈Ft yj ≤ rt . In other words, we
At time slot t, as in [1], we assume that the wireless user will allow partial DUs to be scheduled for transmission. The set
only consider for transmission the DUs with delay deadlines of possible policies in each traffic state Tt given a certain
in the range of [t, t + W ], where W is referred to as the transmission rate rt is denoted by P(Tt , rt ). In this paper,
scheduling time window (STW) and assumed to be given a we assume that the packet scheduling policy π(Tt , rt ) is a
|F |
priori3 . In this paper, we further assume that STW is chosen vector of nonnegative real number, i.e. π(Tt , rt ) ∈ R+ t
to satisfy the following condition: if DU fjg directly depends where |Ft | represents the number of DUs in Ft . This type of
on DU fjg (i.e. there is a direct arc from fjg to fjg in the packet scheduling policy can be implemented by mixing the
DAG), then fjg − fjg < W . This assumption ensures that DU packet scheduling policies taking an integer number of packets
fjg and fjg can be within one STW. to transmit. In the rest of paper, we assume that π(Tt , rt )
We denote by the set of DUs whose delay deadlines are takes values from the nonnegative real number set. P(Tt , rt )
within the range of Ft = {fjg |dgj ∈ [t, t + W ]}. This set Ft is represents the set of feasible mixed packet scheduling policies.
referred to as the DU-type state at time slot t. Since the GOP In the following, we discuss the transition of the traffic
state Tt , given the transmission rate rt . First, as discussed in
1 For simplicity, we assume in this paper that each packet has the same
Section II.B, the DU-type state has the deterministic transition
length, but this does not affect our proposed solution. It just simplifies our
exposition given the space limitations. δ(Ft+1 −N ext(Ft )), which is independent of the transmission
2 The DU size can also be modeled as a random variable depending on the
previous DUs [17]. 4 The transmission rate can be determined by the allocated network resource
3 The STW can be determined based on the channel conditions experienced and transmission strategies at the layers below application layer.
by the user in each time slot. For example, the STW can be set small when 5 Similar constraints are also considered in [18]. However, the authors
the channel conditions are poor (low SNR regime), and to be large whenever therein did not consider the time-varying transmission rate and foresighted
the channel condition are good. packet scheduling decisions aimed at maximizing the long-term video quality.
FU and VAN DER SCHAAR: A SYSTEMATIC FRAMEWORK FOR DYNAMICALLY OPTIMIZING MULTI-USER WIRELESS VIDEO TRANSMISSION 311
GOP g GOP g + 1
I P P I P P
g
DU f g DU f2g DU f4 DU f g +1 DUf g +1 DUf4g +1
1 1 2
g +1
g g DU f3 g +1
DU f3 DU f5 DUf5
B B B B
Ft = { f1g , f2g , f3g } Ft +1 = { f2g , f3g , f4g , f5g } Ft +2 = { f4g , f5g , f1g +1 } Ft + 3 = { f1g +1, f2g +1, f3g +1 }
Fig. 1. DAG-based dependencies and traffic states at each time slot using IBPBP GOP structure
rate rt . However, the buffer state transition is determined by wireless video user experiencing a slow fading wireless chan-
the scheduling policy, the dependencies between DUs and the nel. In each time slot, the wireless user experiences a channel
PMF of the incoming DUs. condition ht . We assume that the channel condition remains
In order to compute the transition from Bt to Bt+1 , we constant within one time slot, but varies across time slots.
divide the DUs in Ft+1 into two disjoint subsets: Ft ∩Ft+1 and The changes of can be modelled as a finite state Markov
Ft+1 /Ft 6 . Specifically, Ft ∩Ft+1 represents the set of DUs that chain (FSMC) [13] with the state transition probability given
are not expired at time slot t + 1, and Ft+1 /Ft represents the by ph (ht+1 |ht ), which is independent of the traffic state
newly arriving DUs at time slot t+1. We can also divide Ft+1 transition. The transmission rate attained by the wireless user
into two subsets: the decodable set Ft+1 d
and undecodable set is determined by rt (ht , xt ), where xt ∈ X represents the
u d
Ft+1 . Ft+1 represents the set of DUs whose parent DUs in amount of network resource (e.g. the transmission time in the
the DAG (if any) have been successfully transmitted, and is TDMA-like network [14] as discussed in Section IV) acquired
computed as Ft+1 d
= {fjg |fjg ∈ Ft+1 &∀j ≺ j, bgj = 0}. by the wireless user from the network and X represents the
u
Ft+1 represents the set of DUs that cannot be decoded (i.e. set of possible resource allocations. As we will discuss in
bgj = −1 or bgj > 0 where j ≺ j), and is computed as Section IV for the multi-user video transmission, the resource
u
Ft+1 = {fjg |fjg ∈ Ft+1 &∃j ≺ j, bgj = 0}. Let Bt+1 = acquisition will be affected by other users. The transmission
g g
{b̂j |fj ∈ Ft+1 } be the buffer state at time slot t + 1. We are rate function rt (ht , xt ) is assumed to be a convex increasing
now ready to compute the transition probability of the buffer function of xt , given the channel condition ht .
state from Bt to Bt+1 , as shown below. We define the state for the wireless user at time slot t as
st = (Tt , ht ) ∈ S, which includes the video traffic state and
p(Bt+1 |Bt , yt ) = δ(b̂gj + 1) channel state. The state st satisfies the Markovian property
fjg ∈Ft+1
u
fjg ∈Ft+1
d ∩(F
t+1 ∩Ft ) since both the traffic state and channel state are Markovian.
δ(b̂gj − (bgj − yjg )) P M Fj (b̂gj ) (1) Then, the wireless user state transition is expressed by
fjg ∈Ft+1
d ∩(F
t+1 /Ft ) p(st+1 |st , yt , xt ) = pT (Tt+1 |Tt , yt , rt )ph (ht+1 |ht ). (4)
where the first product in the right hand side represents the At each state st , the wireless user takes the actions including
buffer state transition probability of the undecodable DUs, the resource acquisition xt and scheduling yt , thereby leading
the second one represents the transition probability of the to the immediate utility ut (st , yt , xt ) − λst xt , where λst is
remaining decodable and unexpired DUs, and the third one interpreted as the resource price as in [1]. Note that we express
represents the transition probability of the decodable and ut (Tt , yt , rt (ht , xt )) as ut (st , yt , xt ) to emphasize that the
newly arriving DUs. The transition of the traffic state Tt is immediate utility is a function of the state st , scheduling action
then computed as yt and allocated time xt .
In this section, we assume that λst is determined a priori.
pT (Tt+1 |Tt , yt , rt ) = δ(Ft+1 − N ext(Ft ))p(Bt+1 |Bt , yt ). In Section V, we will discuss how the resource price can
(2) be determined in a multi-user scenario. The wireless user’s
From the computation in Eq. (2), we know that the traffic objective is to maximize its expected discounted accumulated
state transition is Markovian. Given the scheduling policy utility7 (we call this ”single-user primary problem (SUP)”):
yt = π(Tt , rt ) and transmission rate rt , the distortion reduc- SUP:
tion experienced by the wireless user is: ∞
g g
ut (Tt , yt , rt ) = qj yj . (3) max v(s0 )E α (ut (st , yt , xt ) − λst xt )|s0
t
{ytx∈P(s t ,xt )
t ≥0,t≥0
} s0 ∈S t=0
fjg ∈Ft
where α is the discounted factor in the range8 of [0, 1), and
III. DYNAMIC O PTIMIZATION FOR A S INGLE U SER v(s0 ) is the distribution of the initial state. The reasons why
In this section, we first consider the optimization of both 7 In this formulation, we interchangeably express the set of admissible
the packet scheduling and resource acquisition for a single policies as P(Tt , rt ) and P(st , xt ).
8 Our solutions discussed below are also applicable to the problem of
6 Here Ft+1 /Ft = Ft+1 − Ft+1 ∩ Ft maximizing the average accumulated utility by allowing α → 1.
312 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 28, NO. 3, APRIL 2010
we consider the discounted accumulated utility are: (1) for our allocations
M i to all the users satisfy the following inequalities:
considered delay-sensitive applications, the data needs to be i=1 xt ≤ 1, ∀t, which are referred to as the stage resource
sent out as soon as possible to avoid missing delay deadlines; constraints. Here one stage means one time slot.
and (2) since a wireless user may encounter unexpected We consider a collaborative multi-user video transmission
environmental dynamics in the future, it may care more about problem aimed at maximizing the expected discounted accu-
its immediate reward rather than the future reward. Based mulated video quality of all the users under the stage resource
on the discussion in Section II, the transition of the state st constraints (we call this problem ”the multi-user primary
only depends on the current action at = (xt , yt ). Hence, the problem with stage resource constraints - MUP/SRC”):
problem above can be formulated as an MDP. MUP/SRC:
Note that unlike the previous video transmission solutions
in [2][3][4], here we explicitly take into consideration the het- U∗ = max
yti ,xit ,i=1,··· ,M
erogeneous characteristics of video traffic (represented by the ∞ M
M
traffic states) and time-varying channel conditions (represented v(si0 )E αt uit (sit , yti , xit )|s0
as channel states). Similar to the work in [1], we optimize
s10 ∈S 1 ,··· ,sM
0 ∈S
M i=1 t=0 i=1
the trade-off between the consumed resource and the received
reward in terms of distortion reduction, but unlike in [1] we
M
s.t.yti ∈ P i (sit , xit ), xit ≤ 1
focus on a dynamic setting. The optimization of SUP is called
i=1
a foresighted optimization for video transmission because it
considers the impact of current decisions on the future utility. where v(si0 ) is the distribution of the initial state of user i and
Based on [9], the optimization of SUP can be solved using the is assumed to be independent of other users’.
Bellman’s equations (5), where U (s, λ) is the optimal reward- The multi-user transmission problem in MUP/SRC can be
to-go starting at state s, given λ = [λs ]s∈S . The Bellman’s formulated as an MUMDP. Specifically, we define the state
equations can be solved using the value iteration or policy of the multi-user system as mathbf s = (s1 , · · · , sM ). The
iteration methods [9]. action performed by each user is ai = (yi , xi ) and the action
We define (6). Given the resource price λ, H(s, λ, x) profile for all the users is a = (a1 , · · · , aM ). It is easy to
verify that p(s |s, a) = i=1 pi (si |si , yi , xi ) since the traffic
M
represents the utility function the wireless video user obtains
in state s under the optimal mixed packet scheduling policy. state and channel state are independent across M the users. The
The optimal mixed packet scheduling policy is characterized reward at each time slot is given by ut = i=1 uit (sit , yti , xit ).
in detail in [23] which leads to a concave utility function We note that, when α = 0 (i.e. all users make myopic
H(s, λ, x). decisions), the MUMDP problem reduces to the traditional
Lemma 1: H(s, λ, x) is a concave function of the allocated multi-user NUM-based resource allocation problems for video
resource x. transmission [6]-[8], which is performed repeatedly. However,
The proof is given in Appendix A. we consider here the dynamic optimization for the multi-user
The concavity of the utility function H(s, λ, x) plays an video transmission by taking into account the resource allo-
important role in deriving the optimal resource allocation so- cation and corresponding scheduling across time (i.e. α = 0).
lution in the multi-user video transmission, which is presented From [9], we know that, for this multi-user MDP problem,
in the subsequent sections. there is at least one optimal stationary policy that only depends
on the current multi-user system state. Hence, in this paper, we
restrict our focus to the stationary policies, i.e. the policy only
IV. M ULTI -U SER W IRELESS V IDEO T RANSMISSION
depends on the current state. Then, solving the maximization
F ORMULATION
problem in MUP/SRC is equivalent to solving the following
In this section, we aim to formulate the problem of multi- Bellman’s equations [9]:
user multimedia transmission over a slowly-fading wireless
channel by considering both the heterogeneous traffic and U (s) = max PM
yi ∈P i (si ,xi ),i=1,··· ,M, xi ≤1
i=1
time-varying channel conditions experienced by the users. The
users are indexed by i ∈ {1, · · · , M }, where M is the number
M
M
i
ui (s , y , x ) + α
i i i
p(s |s , y , x )U (s ) , ∀s (7)
i i i
of users sharing the channel. Similarly, we define the state for i=1 s i=1
the wireless user i at time slot t as sit = (Tti , hit ) ∈ S i (the M
superscript i represents user i and the same in the below), and U ∗ = s1 ∈S 1 ,··· ,sM ∈S M i=1 v(si0 )U (s0 ).
0 0
which includes the video traffic state Tti and channel state Based on these Bellman’s equations, we can make the
hit . As discussed in Section II, the traffic state Tti models following observations: (i) to solve the Bellman’s equations,
the heterogeneous characteristics of the delay-sensitive video we can use the centralized value iteration or policy iteration
data. The channel state hit represents the channel conditions [9] to find the optimal state function U (s) for the multi-user
experienced by the wireless user i. The channel state transition MDP problem. However, this solution requires knowing all the
probability ph (hit+1 |hit ) is independent of the traffic state users’ information (state spaces, action spaces, transition prob-
transition and also other users’ state transition. abilities, and utility functions) and also has a prohibitively high
We assume that the multiple users access the shared channel computation complexity. Hence, this centralized solution is
using the TDMA-like protocol [14]. At each time slot, the por- not applicable for the multi-user wireless video transmission;
tion of time allocated to user i is denoted by xit ∈ [0, 1]. The (ii) we note that the coupling among the multiple users’ video
FU and VAN DER SCHAAR: A SYSTEMATIC FRAMEWORK FOR DYNAMICALLY OPTIMIZING MULTI-USER WIRELESS VIDEO TRANSMISSION 313
U (s, λ) = max u(s, y, x) − λs x + αp(s |s, y, x)U (s , λ) , (5)
{y∈P(s,x)
x≥0 } s
H(s, λ, x) = max u(s, y, x) − λs x + αp(s |s, y, x)U (s , λ) (6)
y∈P(s,x)
s
Multi-user video Multi-user video state st . Then the dual function is given by
Single-user video
transmission (MDP)
transmission with
accumulated resource
Upper
transmission with
stage resource
(Section III) constraint (MUMDP)
Bound
constraint (MUMDP) U (λ) = i max v(si0 )
yt ∈P i (si i ),xi ≥0
(Section V.B) (Section IV) { i=1,···t ,M,t≥0
,x t t } si0 ∈S 1 ,··· ,sM
0 ∈S
M
∞
Dual with zero
duality gap
Dual with zero
duality gap
M
λ s
Subproblem with
E αt uit (sit , yti , xit ) − λst xit + t |s0 , (8)
scalar resource price t=0 i=1
M
(local MDP) Dual solution with Dual solution with per-
Upper
uniform resource price
Bound
state resource price with λ = [λs ]. We refer to λs as ”pre-state resource price”.
(Section V.B) (Section V.A)
Then, λs xi is the cost user i has to pay in state s and
λs · 1 is the amount of revenue received by the multi-user
Fig. 2. Relationship between the various proposed solutions for the
considered multi-user MDP problem system by allowing the users to consume the resources (i.e.
access the wireless channel). However, we should note that,
transmission is only through the resource allocation performed in the considered collaborative communication scenario, the
at each time slot. The optimal scheduling policy performed by resource price is used in order to efficiently allocate the limited
each user i depends on the multi-user system state through the resource, instead of maximizing the revenue of the multi-user
resource allocation xi . Then, given the resource allocation xi , system.
the scheduling policy is independent of other users’ states. The multi-user dual problem with the per-state resource
This type of MUMDP is weakly-coupled MDP [10] and thus, price (MUD/PSRP) is then given by
the decomposition into multiple local MDPs is possible. MUD/PSRP
In the next section, we will discuss how the multi-user MDP U λ,∗ = min U (λ).
λ≥0
problem can be decomposed when the resource allocation is
dynamic and depends on the multi-user system’s state. The The following proposition proves that the dual problem
relationships among the proposed solutions are illustrated in MUD/PSRP has zero duality gap compared to the primary
Figure 2. problem in MUP/SRC and thus, the optimal time allocation
and scheduling policies corresponding to the optimal resource
price λs at each state are also optimal policies for the primary
V. D UAL D ECOMPOSITION OF MUMDP
problem.
In this section, we will relax the per-stage resource con- Proposition 2: U λ,∗ = U ∗
straints and show how we can decompose the MUMDP. The sketch of the proof is given in Appendix B.
First, in Subsection A, we introduce a per-state Lagrangian Similar to the Bellman’s equations in Eq. (7) for the
multiplier associated with the resource constraint at each state. primary problem MUP/SRC, we have the following Bellman’s
This dual solution leads to the zero duality compared to equations corresponding to the dual function in MUD/PSRP:
the primary problem MUP/SRC, but requires a centralized
solution since the resource price depends on multi-user state U (s, λ) = max
yi ∈P(si ,xi ),xi ≥0,i=1,··· ,M
which cannot be observed by each individual user. Then, in M
1
i=1 (ui (s , y , x ) − λs x + M λs )+
i i i i
Subsection B, we impose a uniform resource price, which is M , ∀s. (9)
independent of the multi-user state. With this uniform resource α s i=1 p(si |si , yi , xi )U (s, λ)
price, the MUMDP problem can be decomposed into multiple We note that, by setting α = 0, the Bellman’s equations
local MDPs, each of which represents a dynamic cross-layer above degrade to the dual solutions [6][7] to the conven-
optimization problem that can be separately solved by each tional multi-user video transmission. The degraded Bellman’s
individual user. This decomposition is promising since (i) equations can be decomposed into multiple sub-equations,
it enables each user to perform the cross-layer optimization each corresponding to one user, by letting the user know the
independently of other users; and (ii) the network coordinator resource price. However, in general, this Bellman’s equation
only needs to simply update the resource price, which involves cannot be decomposed into independent subproblems which
only very few computations. can be autonomously solved by each user, since the Bell-
man’s equations are coupled through the resource price λs ,
A. Dual solution with per-state resource prices which varies with the state of the multi-user system. Hence,
a centralized solution has to be deployed by the network
At each state st , we introduce a Lagrangian
multiplier
λst coordinator, which requires all users’ information, as in the
M
i=1 xt − 1 at each
i
associated with the resource constraint primary solution to MUP/SRC.
314 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 28, NO. 3, APRIL 2010
B. Dual solution with uniform resource price The following theorem shows that the Bellman’s equations
In this subsection, we consider a scenario where the same corresponding to the dual function in Eq. (10) can be decom-
price (referred to as ”uniform resource price”) is imposed in posed into M local Bellman’s equations, each corresponding
all the states of the multi-user system, i.e. λs = λ, ∀s. Then, to one user.
the dual function is given by Theorem 4: Given λs = λ, ∀s, the optimization in Eq. (10)
is given by
M
U (λ) = max v(si0 )
M
{
i ∈P i (si ,xi ),xi ≥0
yt
} s10 ∈S 1 ,··· ,sM
t t t U (λ) = v(si0 )U i (si0 , λ), (12)
0 ∈S
M i=1
i=1,··· ,M,t≥0
∞
i=1 si0
M
λ
E αt
ut (st , yt , xt ) − λxt +
i i i i i
|s0 (10) with U i (si , λ) satisfying the local Bellman’s equation:
t=0 i=1
M
By minimizing over the uniform resource price λ, we U i (si , λ) = max
xi ≥0,yi ∈P(si ,xi )
have the multi-user dual problem with uniform resource price
1
ui (si , yi , xi ) − λxi + M λ+
(MUD/URP): (13)
MUD/URP α si p(si |si , yi , xi )U i (si , λ)
U λ,∗ = min U (λ) The proof is given in Appendix D.
λ≥0
The key result of Theorem 4 is that U (λ) can be de-
We would like to note that the Lagrangian relaxation using
composed into M local Bellman’s equations, which can be
uniform resource price has been also proposed in [10][11]
solved in a distributed fashion. Each user can solve its own
to decompose a weakly-coupled MUMDP problem. In the
Bellman’s equations (and accordingly solve its own cross-
below, we mathematically derive the duality and propose a
layer optimization problem) provided the resource price λ.
systematic way to compute the subgradient for updating the
This local Bellman’s equations correspond to the local MDP,
Lagrangian multiplier which is not explicitly addressed in
which is the single-user cross-layer optimization solved by
[10][11]. Interestingly, by setting the uniform resource price,
each individual user (see Figure 2).
the dual problem MUD/URP is not dual to the primary
Next, we discuss how the resource price can be updated.
problem in MUP/SRC. Instead, it is dual to the following
Given the resource price λ, each user can solve its own Bell-
problem:
man’s equations using e.g. value iteration, which results in the
MUP/ARC
optimal resource allocation xi,∗ (si , λ) and scheduling policy
M
yi,∗ (si , λ). Note that the resource acquisition is independent
Û ∗ = max v(si0 )
yti ,xt ,i=1,··· ,M
i of other users’ state given the uniform resource price. In the
s10 ∈S 1 ,··· ,sM
0 ∈S
M i=1
following proposition, we formally compute the subgradient
∞
M
with respect to the resource price λ for the dual problem
E αt uit (sit , yti , xit )|s0 MUD/URP, which will be used to update the resource price
t=0 i=1
in each iteration.
∞ M
1 Proposition 5: The subgradient with respect to λ is given
s.t.yti ∈ P i (sit , xit ), (xit − )≤0
M by
t=0 i=1
M
1
We call this optimization - ”the multi-user primary problem Zi − , (14)
with accumulated resource constraint (MUP/ARC)”. The dual- i=1
1−α
ity between MUD/URP and MUP/ARC can be easily verified. i −1 i
where Z i = si0 ∈S i v(s0 )esi0 (I − αP )
i T
x (λ) is the ex-
Similar to Proposition 1, we can prove that the duality gap
pected discounted accumulated resource consumption (note
between MUD/URP and MUP/ARC is zero.
that the expectation is taken over all the possible sample
We further notice that the resource constraint in the primary
paths), and P i is the state transition probability matrix, and
problem MUP/SRC satisfies the following condition:
esi is the vector with the si component being 1 and others
M being zero.
xt , i = 1, · · · , M, t ≥ 0|
i
xt ≤ 1 ⊂
i
The proof is given in Appendix E.
i=1 Using the subgradient method, the resource price is then
∞
M
1 updated as follows:
xit , i = 1, · · · , M, t ≥ 0| xit − ≤ 0 , (11) +
M
t=0 i=1 M
1
which means that the feasible resource allocations in the λk+1
= λ +β (k k
Z −
i
) (15)
i=1
1−α
MUP/SRC is a subset of the feasible resource allocations in the
MUP/ARC. Then, comparing to the dual solution with state- where β k is a diminishing
step-size
∞ which satisfies the follow-
wise prices, we have the following proposition which shows ing conditions: ∞ k=1 β k
= ∞, k 2
k=1 (β ) < ∞ [15]. One
1
that U λ,∗ serves as an upper bound of the optimal value for example is β = k . We notice that the subgradient computed
k
the primary problem. in Eq. (14) accounts for not only the current resource con-
Proposition 3: U λ,∗ = Û ∗ ≥ U ∗ = U λ,∗ straint, but also the future constraints, since MUMDP aims
The proof is given in Appendix C. to maximize the long-term utility. The subgradient method
FU and VAN DER SCHAAR: A SYSTEMATIC FRAMEWORK FOR DYNAMICALLY OPTIMIZING MULTI-USER WIRELESS VIDEO TRANSMISSION 315
Network coordinator compute V i (si , xi ). From Section V.B, we know that each user
Update λ based on is able to compute its optimal resource acquisition xi,∗ (si , λ)
subgradient method
with respect to the current uniform price λ. Comparing Eq.
(17) to Eq. (13), we note that
Z1 λ Zi λ ZM λ
i i i
xi,λ,∗ (si , λ) = arg max V (s , x ) − λxi (18)
User 1 User i User M xi
state B̃ti changes: B̃ti = Bti − yti . With the post-decision state satisfy limk→∞ βγ i = 0 as shown in [20]. One example
k
1
in mind, the Bellman equation in Eq. (13) to solve the local is γli = k0.7 . Then, the optimal resource allocation and
MDP can be rewritten as packet scheduling corresponding to the normal state sit+1 is
implemented at time slot t + 1, which leads to the actual
U i (s̃it , λ) = P M Fji (bi,g
j ) packet transmission. However, the optimal resource allocation
bi,g g i,d
j ,Ft+1 ,ht+1 fj ∈Ft+1 ∩(Ft+1 /Ft )
i i i i and packet scheduling corresponding to other states is only
computed but not implemented. The batch update of the post-
i
δ(Ft+1 − next(F̃ti ))p(hit+1 |h̃it ) max decision state-value function in Eq. (21) can significantly
xit+1 ≥0,yt+1
i ∈P i (sit+1 ,xit+1 )
i i 1 improve the learning performance (i.e. it reduces the number
u (st+1 , yt+1
i
, xit+1 ) − λxit+1 + M λ+
, (20) of time slots required to achieve a specific distortion reduction)
αU i (((Ft+1
i , Bi
t+1 − yt+1 ), ht+1 ), λ)
i i
as compared to the conventional on-line learning algorithm.
Furthermore, we note that the batch update of the post-
where the next post-decision state is s̃it+1 = ((Ft+1 i i
, Bt+1 − decision state value function is not affected by the allocated
yt+1 ), ht+1 ) and U (s̃t , λ) is the post-decision state value
i i i i
resource by the network coordinator. Actually, it is only
function. Using the post-decision state, we are able to separate
affected by the announced resource price.
the expectation (corresponding to the first line in the right hand
side of Eq. (20)) from the optimization. At each time slot, we
first perform the cross-layer optimization without worrying C. Stochastic subgradient-based resource price update
about the expectation. After the optimization, we then take From Section V.B, we notice that the subgradient of the
the expectation over all the possible state sit . This separation dual problem with uniform price is computed as in Eq.
provides us a new way to learn the optimal resource allocation (14), which is the expected discounted accumulated resource
and packet scheduling online, which is discussed below. consumption. Since each wireless user does not know the
transition probability, we only use the realized sample path
to estimate the subgradient of the dual problem (i.e. using the
B. Batch update of post-decision state value function stochastic subgradient). Specifically, we update the Lagrangian
multiplier as follows:
From the discussion in Section VI.A, we note that the
M ∞ +
channel state transition and incoming data dynamics are 1
independent of the buffer state, the resource allocation and the λk+1 = λk + κk α xt −
t i
(22)
1−α
packet scheduling. In the conventional reinforcement learning i=1 t=0
(e.g. Q-learning, actor-critic learning [16]), the state-value ∞
where t=0 αt xit is the stochastic subgradient approximating
function can be updated only one state per time slot, which the subgradient i
leads to slow convergence rate. However, in our context, we ∞ i Z and κk is a diminishing step-size sat-
∞ i 2
isfying κ
k=1 k = ∞, k=1 (κk ) < ∞ . One example
are able to update the post-decision state value function at 1
is κk = k . However, in practice, we cannot wait for an
multiple states per time slot. infinite time to update the Lagrangian multiplier. Instead, we
Specifically, at time slot t, the DU type is Fti and channel update the multiplier every K time slots, i.e. we use Z̃ i =
(k+1)K−1 t−kK i ∞ t i
state is hit , and the setof the feasible post-decision
states xt instead of
t=kK α t=0 α xt . The proposed
at time slot t is S̃ti = ((Fti , B̃ti ), hit ), ∀B̃ti . Note that the online learning algorithm is illustrated in Figure 4. It can
realized post-decision state s̃it belongs to S̃ti . When the expired be shown that the batch update on the post-decision state
DUs are deleted, the new DUs arrive and the new channel value function and subgradient-based resource price update
state hit+1 is realized, any post-decision state s̃ ∈ S̃ti transits will converge to the optimal solution [21].
to the normal state si = ((Ft+1 i i
, Bt+1 ), hit+1 ) ∈ St+1
i
. In
conventional reinforcement learning, the state value function VII. S IMULATION R ESULTS
is often updated in the current normal state sit . In contrast, in In this section, we present simulation results highlighting
our cross-layer optimization problem, we can update the post- the efficiency of the proposed single-user and multi-user video
decision state value function in all the possible post-decision transmission solutions compared to existing solutions. To
FU and VAN DER SCHAAR: A SYSTEMATIC FRAMEWORK FOR DYNAMICALLY OPTIMIZING MULTI-USER WIRELESS VIDEO TRANSMISSION 317
Compute the
resource Submit x ti ΔV i Scale the 100
acquisition resource
acquisition
x ti 90
Message exchange phase
Repeated NUM
80
Resource prices
No
t=kN?
70
Yes
60
Submit Zi Update
Compute Zi λk
50 Proposed solution
Update λ k Using stochastic
Receive λ k subgradient
40
Optimize the Polling users
scheduling Polling user i based on 30
Transmission phase
36
Proposed dual solution Conventional dual solution
Update phase
Batch update 34
U i ( si , λ )
32
PSNR (dB) User 1, Foreman
30
28
Fig. 4. Flowchart of the online learning algorithm User 2, Coastguard
26
User 3, Mobile
compress the video data, we used a scalable video coding 24
channel conditions, support for a variety of wireless receivers Fig. 6. PSNR of received video sequences
with different resource capabilities and power constraints, and
easy prioritization of various coding layers and video packets.
stage resource constraints into the accumulated resource con-
A. Dual solutions with uniform price straint (shown in the problem of MUP/ARC) and the resource
price is reduced. We scale down the resource acquisition at
In this section, we will verify the convergence of the dual
each multi-user system state to enforce the feasible allocation.
solution with uniform price to the proposed MUMDP. We will
Figure 6 shows the improvements in terms of PSNR when the
further compare the performance of our approach to that of
price converges (i.e. User 1 receives 0.5dB higher PSNR, User
the conventional multi-user dual solution. We first consider
2 receives 1dB higher PSNR and User 3 receives 1.1dB higher
three wireless users: User 1 streams the video sequence
PSNR). The improvement is due to the foresighted decisions
”Foreman” (CIF resolution, 30 Hz), User 2 streams the video
in our solution, as compared to the myopic decisions in the
sequence ”Coastguard” (CIF resolution, 30 Hz) and User 3
conventional NUM-based solution.
streams the video sequence ”Mobile” (CIF resolution, 30 Hz).
We compare our proposed dual solution with uniform price
to the conventional dual solution [7] based on the NUM B. Duality gap of proposed dual decomposition
framework. Figure 5 shows the convergence of the resource In this section, we illustrate the duality gap of the proposed
prices with various initial price selections. We notice that, our the dual decomposition with uniform price. As we know, the
proposed dual solution with uniform price shows much faster resource acquisitions xi,λ,∗ (si ), ∀i with respect to the optimal
convergence (less than 25 iterations) than the conventional uniform price λ∗ provide the upper bound on the optimal
dual solution (having more than 100 iterations). We also note utility U ∗ , while the feasible resource allocations x̂i,λ (s), ∀i
that our solution converges to a lower resource price than provide the lower bound on U ∗ . Thus, the duality gap must
the conventional one. This is because that the conventional be less than the difference between the utilities obtained by
solution myopically maximizes the video transmission over xi,λ,∗ (si ), ∀i and x̂i,λ (s), ∀i.
each time slot. Hence, to achieve a feasible resource allocation, The simulation settings are the same as in Section VII.A.
it has to increase its resource price to ensure that the resource We consider two scenarios in which the users experience
allocations over all the states are feasible (corresponding to the average channel conditions of 22dB and 28dB, respectively.
worst case scenario.) However, in our solution, we relax the The upper and lower bounds of the received video qualities
318 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 28, NO. 3, APRIL 2010
30 100
70
28
Average reward
60
PSNR (dB)
27 50 User 1
40
26
30
20 User 2
25
10
User 3
24 0
1 2 3 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
User Index # time slots
Fig. 7. Upper and lower bounds on the received video qualities under average Fig. 9. Learning curves with different online learning algorithms
channel condition of 22dB
Conventional online learning Proposed online learning
35
32
PSNR (dB)
Upper bound
31 Lower bound 30
User 1, Foreman
30
25
0 50 100 150 200 250 300
PSNR (dB)
29 32
PSNR (dB)
28
30
28
27
User 2, Coastguard
26
0 50 100 150 200 250 300
26
26
PSNR (dB)
25
1 2 3 24
User Index
22
Fig. 8. Upper and lower bounds on the received video qualities under average User 3, Mobile
channel condition of 28dB 20
0 50 100 150 200 250 300
#Frames
A PPENDIX A A PPENDIX D
P ROOF OF L EMMA 1 P ROOF OF T HEOREM 4
Proof : To prove the concavity of H(s, λ, x), we only need
Proof : We prove this by induction.
to show that the optimal mixed packet scheduling policy
We define
always schedules the packet with the highest marginal utility
(i.e. the immediate utility minus the future utility if the packet M
1
is delayed for transmission, which is computed as in [23]). U0 (s, λ) = max u (s , y , x ) − λx +
i i i i
λ . i
First, we note that the optimal mixed packet scheduling policy yi ∈P i (si ,xi ),xi ≥0
i=1
M
will always schedules the packet which has no parent packets M
or whose parent packets have been transmitted. This is because It is easy to verify that U0 (s, λ) = i=1 U0i (si , λ) with
scheduling this packet will lead to successful decoding at the
receiver side. If j ≺ j, then qj > qj . When both DUs j and j 1
U0i (si , λ) = max ui (si , yi , xi ) − λxi + λ
are available for transmission, the optimal packet scheduling yi ∈P i (si ,xi ),xi ≥0 M
policy transmits DU j first due to this dependency, which
M
automatically leads to a higher net utility contribution. If DUs Similarly we have U1 (s, λ) = U1i (si , λ) with
i=1
j and j do not depend on each other, the optimal mixed packet
scheduling policy will choose the DU with the higher marginal U1i (si , λ) = max
utility for transmission. The utility function is the summation yi ∈P i (si ,xi ),xi ≥0
of the marginal utility of the transmitted packets. Hence, the 1
optimal mixed packet scheduling policy gives a concave non- ui (si , yi , xi ) − λxi + λ + α p(si |si , yi , xi )U0i (si , λ)
M i
decreasing utility function H(s, λ, x) at state s. The properties s