
CN108737382A - SVC coding HTTP streaming media self-adaption method based on Q-Learning - Google Patents

SVC coding HTTP streaming media self-adaption method based on Q-Learning

Info

Publication number: CN108737382A
Authority: CN (China)
Prior art keywords: behavior, state, layer, thr, current
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201810366841.2A
Other languages: Chinese (zh)
Other versions: CN108737382B (en)
Inventors: 熊丽荣, 尤日晶, 沈树茂
Current assignee (the listed assignees may be inaccurate): Zhejiang University of Technology ZJUT
Original assignee: Zhejiang University of Technology ZJUT
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201810366841.2A; granted as CN108737382B


Classifications

    • H04L 67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L 65/65: Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
    • H04L 65/75: Media network packet handling
    • H04L 65/80: Responding to QoS
    • H04L 67/568: Storing data temporarily at an intermediate stage, e.g. caching

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to an adaptive method for SVC-coded HTTP streaming media based on Q-Learning. The method first constructs a Q-Learning model: for the SVC-coded streaming media interaction scenario, it builds a state set, an action set, and a return function, and selects an exploration strategy. Next, the constructed Q-Learning adaptive algorithm is trained offline in a real network environment until the knowledge learned by the algorithm converges. Finally, the resulting model is deployed online to make adaptive decisions.

Description

SVC-coded HTTP streaming media adaptive method based on Q-Learning
Technical field
The invention belongs to the field of information technology, and in particular to dynamic adaptive streaming media methods.
Background technology
In recent years, online streaming video services have been widely used, and online video traffic occupies an ever-increasing share of total Internet traffic. Scalable Video Coding (SVC) overcomes the redundancy problem of Advanced Video Coding (AVC): when providing a service of the same video quality as AVC, SVC coding can save 200%-300% of the server-side storage space required by AVC coding. Research on adaptive streaming technology based on SVC coding therefore has great practical significance for saving server-side storage resources and providing higher-quality streaming video services.
In streaming video services, the most critical technology at the playback end is the adaptive decision method. Current research on SVC-based adaptive methods falls broadly into two classes: one makes adaptive decisions per SVC-coded segment, the other makes layer-adaptive decisions per SVC-coded layer. Segment-based adaptive decisions mainly predict the quality grade of the next video segment from throughput or buffer state, and then serially download the base layer and enhancement layers of the segment according to that grade. Throughput-prediction methods suffer from frequent quality switching of segments when the bandwidth fluctuates. Buffer-based prediction methods keep the buffer full but always download segments of lower quality grades, so the overall QoE of watching the video is relatively low. Existing segment-based SVC decision methods often cannot respond in time to abrupt bandwidth changes, causing video stalls. The other class of methods makes decisions per layer; existing methods of this kind mainly follow two ideas. 1. Download layer by layer: first make sure the base layer fills the buffer, then make sure the first enhancement layer fills the buffer, and so on until all enhancement layers are downloaded; each time a slice layer is downloaded, the lower-grade layers of the buffered segments must already be filled. This method effectively guarantees smooth playback, but the quality over the whole playback tends to be low. 2. After each base or enhancement layer is downloaded, decide from the bandwidth variation whether to increase the quality of the current segment or to fill the buffer with the base layer of a new segment. Although this approach responds to bandwidth changes in time, when the buffer is well filled it cannot flexibly upgrade the quality of segments that have already been filled, so its flexibility in raising video quality is limited.
In general, existing adaptive streaming methods for SVC coding mainly have the following two problems: 1. methods that make adaptive decisions per SVC-coded segment cannot respond to bandwidth changes in time, which causes video stalls; 2. methods that make decisions per SVC-coded slice layer cannot upgrade the quality of segments that have already been filled, which lowers the overall QoE.
Invention content
The present invention overcomes the above shortcomings of the prior art by providing an SVC-coded HTTP streaming media adaptive method based on Q-Learning.
The SVC-coded HTTP streaming media adaptive method based on Q-Learning comprises the following steps:
1) Build a Q-Learning model from the SVC-coded streaming media interaction scenario: construct the state set (States), the action set (Actions), and the return function (Reward function), and select an exploration strategy. The main steps for building the reinforcement-learning Q-Learning model are as follows:
(1.1) Build the state set (States): the environment state is constructed from the bandwidth and the buffer occupancy state. The client needs to discretize both the bandwidth and the buffer occupancy state.
(1.1.1) Define the maximum bandwidth as BW_max. Each segment is split into M layers, and the lowest bandwidth required when at layer i is thr_i (0 ≤ i ≤ M). The bandwidth is discretized into {0~thr_0, thr_0~thr_1, ..., thr_{M-1}~thr_M}, M+1 states in total.
(1.1.2) The buffer occupancy state is discretized as follows: define the buffer range as 0~S_max segments. The buffer occupancy state bs (bufferState) consists of S_max elements [s_1, s_2, s_3, ..., s_Smax], where s_k is the total number of base and enhancement layers stored for the k-th buffered segment. For example, bs = [0,0,0,0,0,0,0] means no segment is filled, and bs = [1,1,1,1,1,1,1] means every buffered segment is filled with only the base layer.
The state is constructed as s = {bs, bw}; the discretization of the two elements is shown in Table 1:

Table 1. Environment state definition

Element | Range | Discretization
bs | [s_1, s_2, s_3, ..., s_Smax] | s_k ∈ {0, 1, ..., M}, k ∈ {1, 2, ..., S_max}
bw | 0~BW_max | {0~thr_0, thr_0~thr_1, ..., thr_{M-1}~thr_M}
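To make the discretization concrete, here is a minimal Python sketch of the state construction in (1.1). It is only an illustration: the threshold values in THR and the names (THR, S_MAX, discretize_bandwidth, make_state) are assumptions, not identifiers from the patent.

```python
THR = [500, 1000, 2000, 4000]   # thr_0..thr_M in kbit/s for M = 3 (assumed values)
S_MAX = 7                       # buffer holds S_max segments (assumed value)

def discretize_bandwidth(bw_kbps):
    """Map a measured bandwidth onto one of the M+1 bins {0~thr_0, ..., thr_M-1~thr_M}."""
    for i, thr in enumerate(THR):
        if bw_kbps < thr:
            return i
    return len(THR) - 1         # clamp at BW_max = thr_M

def make_state(bs, bw_kbps):
    """State s = {bs, bw}: buffer occupancy vector plus the discretized bandwidth bin."""
    assert len(bs) == S_MAX
    return (tuple(bs), discretize_bandwidth(bw_kbps))

# make_state([1, 0, 0, 0, 0, 0, 0], 1500) -> ((1, 0, 0, 0, 0, 0, 0), 2)
```

Storing bs as a tuple keeps the state hashable, so it can serve directly as part of the key of a Q table held in a dictionary.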
(1.2) The action set is defined as a = (index, layer): the buffer position subscript (index) and the layer grade to download at that position (layer). For example, a = (3, 0) means the next file to download is the base layer of the 3rd segment in the buffer. Different states generally have different selectable action sets. The discretization of the action-set elements is shown in Table 2:

Table 2. Action set definition

Element | Range | Discretization
index | 0~S_max | {1, 2, ..., S_max}
layer | 0~M | {0, 1, 2, ..., M}
(1.2.1) The decision behavior is selected from the selectable action set of the current state; once the action is determined, the corresponding next slice layer is downloaded. Actions are added to the set as follows: given the current buffer occupancy state bs = [s_1, s_2, ..., s_k], the selectable action set of the current state is built from the buffer occupancy state from left to right: if s_k is not 0, add the action a = (k, s_k); if s_k is 0, add a = (k, 0) and stop searching further positions. If bs is completely full, enter a sleep state and wait until a video segment is removed from the buffer before making a new decision. For bs = [1,0,0,0,0,0,0] the selectable download actions at the next moment are (1,1) and (2,0); for bs = [3,2,0,0,0,0,0] the next action can be one of the three actions (1,3), (2,2), (3,0).
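The left-to-right rule of (1.2.1) can be sketched in a few lines of Python; available_actions and max_layer are hypothetical names, and the two examples at the end reproduce the buffer states given in the text.

```python
def available_actions(bs, max_layer):
    """Selectable actions a = (index, layer) for buffer occupancy bs (rule 1.2.1).

    Scan the buffer from left to right: a partly filled segment offers its next
    layer; the first empty slot offers its base layer and ends the scan. An
    empty result means the buffer is full and the player should sleep.
    """
    actions = []
    for k, filled in enumerate(bs, start=1):   # buffer subscripts start at 1
        if filled == 0:
            actions.append((k, 0))             # base layer of the first empty slot
            break
        if filled <= max_layer:                # segment not yet at top quality
            actions.append((k, filled))        # next layer s_k to download
    return actions

# available_actions([1, 0, 0, 0, 0, 0, 0], 3) -> [(1, 1), (2, 0)]
# available_actions([3, 2, 0, 0, 0, 0, 0], 3) -> [(1, 3), (2, 2), (3, 0)]
```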
(1.3) Return function (Reward function): the return function comprises three factors, r_freeze, r_action, and r_switch, defined as follows:
(1.3.1) Define the action return value r_freeze: if the selected action causes video playback to pause, it is punished by setting r_freeze = -10000; otherwise r_freeze = 0.
(1.3.2) Define the action return value r_action = 100*(10 - index) + layer, where index denotes the segment position in the buffer and layer denotes the quality grade of the layer currently being downloaded, which also represents the quality of the current video segment. The selected action thus tends to obtain a higher return value when it fills a position with a smaller subscript nearer the front of the buffer.
(1.3.3) Define the quality switching of a segment as r_switch, with the formula r_switch = -10*abs(leftlayer - layer) - 10*abs(rightlayer - layer): compute the quality difference between the quality layer of the filled position and the slice-layer grade of the segment on its left (leftlayer), and between the quality layer of the filled position and the grade of the segment on its right (rightlayer); a large gap is punished.
(1.3.4) Define the linear overall return value r = r_freeze + r_action + r_switch.
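A minimal sketch of the return computation, using the weights from (1.3.1)-(1.3.4). The handling of positions at the buffer edges (no left or right neighbour) is an assumption, since the patent does not specify it; here a missing neighbour contributes no switching penalty.

```python
def reward(action, bs, caused_freeze):
    """Overall return r = r_freeze + r_action + r_switch for action (index, layer)."""
    index, layer = action
    r_freeze = -10000 if caused_freeze else 0      # (1.3.1) stall penalty
    r_action = 100 * (10 - index) + layer          # (1.3.2) favour front positions
    # (1.3.3) penalise quality gaps against both neighbours; a missing
    # neighbour is treated as having the same level as `layer` (assumption).
    left = bs[index - 2] if index > 1 else layer
    right = bs[index] if index < len(bs) else layer
    r_switch = -10 * abs(left - layer) - 10 * abs(right - layer)
    return r_freeze + r_action + r_switch          # (1.3.4) linear sum

# reward((2, 0), [3, 0, 0, 0, 0, 0, 0], False) -> 800 - 30 - 0 = 770
```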
(1.4) Exploration strategy
Softmax is selected as the exploration strategy. A Boltzmann probability distribution is computed from the Q values of the selectable actions of the current state; the probability distribution of the different actions is given by the formula:

π(a|s) = e^{Q(s,a)/τ} / Σ_{a'∈A(s)} e^{Q(s,a')/τ}

where π(a|s) is the probability of selecting action a in state s. The denominator accumulates the exponentials e^{Q(s,a')/τ} over the selectable action set of state s, and the parameter τ determines the weight of an action among all actions. This guarantees that different actions have different probabilities of being chosen.
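Under this formula, action selection is a weighted random draw; a small τ concentrates probability on high-Q actions, while a large τ explores more uniformly. A minimal sketch, assuming the Q table is a dictionary keyed by (state, action) pairs:

```python
import math
import random

def softmax_action(Q, s, actions, tau=1.0):
    """Pick an action with probability e^(Q(s,a)/tau) / sum over a' of e^(Q(s,a')/tau)."""
    weights = [math.exp(Q.get((s, a), 0.0) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```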
2) Train the Q-Learning algorithm offline:
(2.1.1) Determine the input parameters: learning rate α, discount factor γ, return function r, current bandwidth bw, and current buffer occupancy state bs;
(2.1.2) Determine the output parameter: the converged Q table;
(2.1.3) Randomly initialize the Q table;
(2.1.4) Check whether the Q table has converged: terminate if it has converged; start a new exploration if it has not;
(2.1.5) Play the video and carry out a new round of exploration;
(2.1.6) Determine the current state s from the current bandwidth and buffer occupancy state;
(2.1.7) Select action a from state s using the exploration strategy (Softmax);
(2.1.8) Execute action a, compute the return value r, and enter the next state s`;
(2.1.9) Update the Q table by the formula Q(s,a) = (1-α)·Q(s,a) + α·(r + γ·max_{a`} Q(s`,a`));
(2.1.10) Set the state to s := s`;
(2.1.11) Check whether video playback has finished: go to step (2.1.4) if it has; otherwise go to step (2.1.5).
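Steps (2.1.1)-(2.1.11) amount to the standard tabular Q-Learning loop sketched below. The environment object env and its methods (reset, done, available_actions, step) are hypothetical stand-ins for the real player/network interaction, and running a fixed number of episodes stands in for the convergence check of (2.1.4).

```python
ALPHA, GAMMA = 0.1, 0.9    # learning rate α and discount factor γ (assumed values)

def train(Q, env, episodes=1000):
    """Offline Q-Learning (step 2): one episode is one full video playback."""
    for _ in range(episodes):                                    # (2.1.4)/(2.1.5)
        s = env.reset()                                          # (2.1.6) s from bw and bs
        while not env.done():                                    # (2.1.11)
            a = softmax_action(Q, s, env.available_actions(s))   # (2.1.7)
            r, s_next = env.step(a)                              # (2.1.8) download slice layer a
            best_next = max((Q.get((s_next, b), 0.0)
                             for b in env.available_actions(s_next)), default=0.0)
            # (2.1.9) Q(s,a) := (1-α)·Q(s,a) + α·(r + γ·max_a' Q(s',a'))
            Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best_next)
            s = s_next                                           # (2.1.10) s := s'
    return Q
```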
3) Apply the model online
Given the current buffer occupancy state bs and current bandwidth bw, look up the current state in the Q table, query all actions executable in this state, determine which action has the maximum Q value, and execute that action: when the decided action is a, download the slice layer corresponding to a.
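Online, the converged table is used greedily rather than through Softmax exploration. A minimal sketch, reusing the hypothetical make_state and available_actions helpers from the earlier sketches:

```python
def decide(Q, bs, bw_kbps, max_layer):
    """Step 3): pick the executable action with the maximum Q value."""
    s = make_state(bs, bw_kbps)
    actions = available_actions(list(bs), max_layer)
    if not actions:            # buffer full: sleep until a segment is consumed
        return None
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# decide(Q, [1, 0, 0, 0, 0, 0, 0], 1500, 3) returns (1, 1) or (2, 0),
# whichever has the larger learned Q value.
```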
The present invention applies the Q-Learning algorithm of reinforcement learning to an SVC-coded adaptive cross-layer method, overcoming the limitations of existing layer-decision methods and improving the user's viewing experience. Its advantages are as follows:
It solves the problem that existing segment-based SVC decision methods cannot respond in time to abrupt bandwidth changes: according to the bandwidth and buffer occupancy state of the playback client, it makes inter-layer decisions dynamically, improving the user experience of streaming playback at the client.
By making inter-layer decisions with the Q-Learning algorithm, it improves on existing layer-based decision methods through more effective quality upgrades of buffer segments that have already been filled.
Description of the drawings
Fig. 1 is the flow chart of the method of the present invention.
Fig. 2 shows the total action set when the action set for the reinforcement learning of the present invention is constructed.
Fig. 3 shows the selectable action set after slice layer (1,0) has been filled according to the present invention.
Fig. 4 shows the selectable action set after the buffer has been filled with slice layers (1,0), (1,1), and (2,0) according to the present invention.
Fig. 5 is the algorithm flow chart of the present invention.
Specific implementation mode
The present invention is further illustrated with reference to the accompanying drawings:
The SVC-coded HTTP streaming media adaptive method based on Q-Learning comprises the following steps:
1) The interaction environment of the algorithm is shown in Fig. 1: the DASH server stores the base layer and multiple enhancement layers of each segment of a video, together with the MPD file. The client first downloads the MPD file to obtain the relevant information about the video segments, and then makes adaptive decisions according to the bandwidth and buffer occupancy factors. The algorithm builds the Q-Learning model from the buffer occupancy state and bandwidth of the interaction environment: construct the state set (States), the action set (Actions), and the return function (Reward function), and select an exploration strategy. The main steps for building the reinforcement-learning Q-Learning model are as follows:
(1.1) Build the state set (States): the environment state is constructed from the bandwidth and the buffer occupancy state. The client needs to discretize both the bandwidth and the buffer occupancy state.
(1.1.1) Define the maximum bandwidth as BW_max. Each segment is split into M layers, and the lowest bandwidth required when at layer i is thr_i (0 ≤ i ≤ M). The bandwidth is discretized into {0~thr_0, thr_0~thr_1, ..., thr_{M-1}~thr_M}, M+1 states in total.
(1.1.2) The buffer occupancy state is discretized as follows: define the buffer range as 0~S_max segments. The buffer occupancy state bs (bufferState) consists of S_max elements [s_1, s_2, s_3, ..., s_Smax], where s_k is the total number of base and enhancement layers stored for the k-th buffered segment. For example, bs = [0,0,0,0,0,0,0] means no segment is filled, and bs = [1,1,1,1,1,1,1] means every buffered segment is filled with only the base layer.
The state is constructed as s = {bs, bw}; the discretization of the two elements is shown in Table 1:

Table 1. Environment state definition

Element | Range | Discretization
bs | [s_1, s_2, s_3, ..., s_Smax] | s_k ∈ {0, 1, ..., M}, k ∈ {1, 2, ..., S_max}
bw | 0~BW_max | {0~thr_0, thr_0~thr_1, ..., thr_{M-1}~thr_M}
(1.2) The action set is defined as a = (index, layer): the buffer position subscript (index) and the layer grade to download at that position (layer); the full action set is shown in Fig. 2. For example, a = (3, 0) means the next file to download is the base layer of the 3rd segment in the buffer. Different states generally have different selectable action sets. The discretization of the action-set elements is shown in Table 2:

Table 2. Action set definition

Element | Range | Discretization
index | 0~S_max | {1, 2, ..., S_max}
layer | 0~M | {0, 1, 2, ..., M}

(1.2.1) The decision behavior is selected from the selectable action set of the current state; once the action is determined, the corresponding next slice layer is downloaded. Actions are added to the set as follows: given the current buffer occupancy state bs = [s_1, s_2, ..., s_k], the selectable action set of the current state is built from the buffer occupancy state from left to right: if s_k is not 0, add the action a = (k, s_k); if s_k is 0, add a = (k, 0) and stop searching further positions. If bs is completely full, enter a sleep state and wait until a video segment is removed from the buffer before making a new decision. For bs = [1,0,0,0,0,0,0] the selectable download actions at the next moment are (1,1) and (2,0), as shown in Fig. 3; for bs = [3,2,0,0,0,0,0] the next action can be one of the three actions (1,3), (2,2), (3,0), as shown in Fig. 4.
(1.3) Return function (Reward function): the return function comprises three factors, r_freeze, r_action, and r_switch, defined as follows:
(1.3.1) Define the action return value r_freeze: if the selected action causes video playback to pause, it is punished by setting r_freeze = -10000; otherwise r_freeze = 0.
(1.3.2) Define the action return value r_action = 100*(10 - index) + layer, where index denotes the segment position in the buffer and layer denotes the quality grade of the layer currently being downloaded, which also represents the quality of the current video segment. The selected action thus tends to obtain a higher return value when it fills a position with a smaller subscript nearer the front of the buffer.
(1.3.3) Define the quality switching of a segment as r_switch, with the formula r_switch = -10*abs(leftlayer - layer) - 10*abs(rightlayer - layer): compute the quality difference between the quality layer of the filled position and the slice-layer grade of the segment on its left (leftlayer), and between the quality layer of the filled position and the grade of the segment on its right (rightlayer); a large gap is punished.
(1.3.4) Define the linear overall return value r = r_freeze + r_action + r_switch.
(1.4) Exploration strategy
(1.4.1) Softmax is selected as the exploration strategy. A Boltzmann probability distribution is computed from the Q values of the selectable actions of the current state; the probability distribution of the different actions is given by the formula:

π(a|s) = e^{Q(s,a)/τ} / Σ_{a'∈A(s)} e^{Q(s,a')/τ}

where π(a|s) is the probability of selecting action a in state s. The denominator accumulates the exponentials e^{Q(s,a')/τ} over the selectable action set of state s, and the parameter τ determines the weight of an action among all actions. This guarantees that different actions have different probabilities of being chosen.
2) Train the Q-Learning algorithm offline:
(2.1.1) Determine the input parameters: learning rate α, discount factor γ, return function r, current bandwidth bw, and current buffer occupancy state bs;
(2.1.2) Determine the output parameter: the converged Q table;
(2.1.3) Randomly initialize the Q table;
(2.1.4) Check whether the Q table has converged: terminate if it has converged; start a new exploration if it has not;
(2.1.5) Play the video and carry out a new round of exploration;
(2.1.6) Determine the current state s from the current bandwidth and buffer occupancy state;
(2.1.7) Select action a from state s using the exploration strategy (Softmax);
(2.1.8) Execute action a, compute the return value r, and enter the next state s`;
(2.1.9) Update the Q table by the formula Q(s,a) = (1-α)·Q(s,a) + α·(r + γ·max_{a`} Q(s`,a`));
(2.1.10) Set the state to s := s`;
(2.1.11) Check whether video playback has finished: go to step (2.1.4) if it has; otherwise go to step (2.1.5).
3) Apply the model online
Given the current buffer occupancy state bs and current bandwidth bw, look up the current state in the Q table, query all actions executable in this state, determine which action has the maximum Q value, and execute that action: when the decided action is a, download the slice layer corresponding to a.
The content described in the embodiments of this specification is merely an enumeration of the forms in which the inventive concept may be realized. The protection scope of the present invention should not be construed as being limited to the specific forms stated in the embodiments; it also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.

Claims (1)

1. An SVC-coded HTTP streaming media adaptive method based on Q-Learning, comprising the following steps:
1) Build a Q-Learning model from the SVC-coded streaming media interaction scenario: construct the state set (States), the action set (Actions), and the return function (Reward function), and select an exploration strategy; the steps for building the reinforcement-learning Q-Learning model are as follows:
(1.1) Build the state set (States): the environment state is constructed from the bandwidth and the buffer occupancy state, and the client needs to discretize both the bandwidth and the buffer occupancy state;
(1.1.1) The bandwidth is discretized as follows: define the maximum bandwidth as BW_max; each segment is split into M layers, and the lowest bandwidth required when at layer i is thr_i, 0 ≤ i ≤ M; the bandwidth is discretized into {0~thr_0, thr_0~thr_1, ..., thr_{M-1}~thr_M}, M+1 states in total;
(1.1.2) The buffer occupancy state is discretized as follows: define the buffer range as 0~S_max segments; the buffer occupancy state bs (bufferState) consists of S_max elements [s_1, s_2, s_3, ..., s_Smax], where s_k is the total number of base and enhancement layers stored for the k-th buffered segment;
The state is constructed as s = {bs, bw}; the discretization of the two elements is shown in Table 1:

Table 1. Environment state definition

Element | Range | Discretization
bs | [s_1, s_2, s_3, ..., s_Smax] | s_k ∈ {0, 1, ..., M}, k ∈ {1, 2, ..., S_max}
bw | 0~BW_max | {0~thr_0, thr_0~thr_1, ..., thr_{M-1}~thr_M}
(1.2) Build the action set (Actions): an action is defined as a = (index, layer), comprising the buffer position subscript (index) and the layer grade to download at that position (layer); different states generally have different selectable action sets; the discretization of the action-set elements is shown in Table 2:

Table 2. Action set definition

Element | Range | Discretization
index | 0~S_max | {1, 2, ..., S_max}
layer | 0~M | {0, 1, 2, ..., M}

(1.2.1) The decision behavior is selected from the selectable action set of the current state; once the action is determined, the corresponding next slice layer is downloaded; actions are added to the set as follows: given the current buffer occupancy state bs = [s_1, s_2, ..., s_k], the selectable action set of the current state is built from the buffer occupancy state from left to right: if s_k is not 0, add the action a = (k, s_k); if s_k is 0, add a = (k, 0) and stop searching for further actions; if bs is completely full, enter a sleep state and wait until a video segment is removed from the buffer before making a new decision;
(1.3) Return function (Reward function): the return function comprises three factors, r_freeze, r_action, and r_switch, defined as follows:
(1.3.1) Define the action return value r_freeze: if the selected action causes video playback to pause, it is punished by setting r_freeze = -10000; otherwise r_freeze = 0;
(1.3.2) Define the action return value r_action = 100*(10 - index) + layer, where index denotes the segment position in the buffer and layer denotes the quality grade of the layer currently being downloaded, which also represents the quality of the current video segment; the selected action thus tends to obtain a higher return value when it fills a position with a smaller subscript nearer the front of the buffer;
(1.3.3) Define the quality switching of a segment as r_switch, with the formula r_switch = -10*abs(leftlayer - layer) - 10*abs(rightlayer - layer): compute the quality difference between the quality layer of the filled position and the slice-layer grade of the segment on its left (leftlayer), and between the quality layer of the filled position and the grade of the segment on its right (rightlayer);
(1.3.4) Define the linear overall return value r = r_freeze + r_action + r_switch;
(1.4) Exploration strategy:
Softmax is selected as the exploration strategy; a Boltzmann probability distribution is computed from the Q values of the selectable actions of the current state, and the probability distribution formula of the different actions is as follows:

π(a|s) = e^{Q(s,a)/τ} / Σ_{a'∈A(s)} e^{Q(s,a')/τ}

where π(a|s) is the probability of selecting action a in state s; the denominator accumulates the exponentials e^{Q(s,a')/τ} over the selectable action set of state s, and the parameter τ determines the weight of an action among all actions, which guarantees that different actions have different probabilities of being chosen;
2) Train the constructed Q-Learning algorithm offline:
(2.1.1) Determine the input parameters: learning rate α, discount factor γ, return function r, current bandwidth bw, and current buffer occupancy state bs;
(2.1.2) Determine the output parameter: the converged Q table;
(2.1.3) Randomly initialize the Q table;
(2.1.4) Check whether the Q table has converged: terminate if it has converged; start a new exploration if it has not;
(2.1.5) Play the video and carry out a new round of exploration;
(2.1.6) Determine the current state s from the current bandwidth and buffer occupancy state;
(2.1.7) Select action a from state s using the exploration strategy (Softmax);
(2.1.8) Execute action a, compute the return value r, and enter the next state s`;
(2.1.9) Update the Q table by the formula Q(s,a) = (1-α)·Q(s,a) + α·(r + γ·max_{a`} Q(s`,a`));
(2.1.10) Set the state to s := s`;
(2.1.11) Check whether video playback has finished: go to step (2.1.4) if it has; otherwise go to step (2.1.5);
3) Apply the model online:
Given the current buffer occupancy state bs and current bandwidth bw, look up the current state in the Q table, query all actions executable in this state, determine which action has the maximum Q value, and execute that action; when the decided action is a, download the slice layer corresponding to a.
CN201810366841.2A 2018-04-23 2018-04-23 SVC coding HTTP streaming media self-adaption method based on Q-Learning Active CN108737382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810366841.2A CN108737382B (en) 2018-04-23 2018-04-23 SVC coding HTTP streaming media self-adaption method based on Q-Learning


Publications (2)

Publication Number Publication Date
CN108737382A true CN108737382A (en) 2018-11-02
CN108737382B CN108737382B (en) 2020-10-09

Family

ID=63939733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810366841.2A Active CN108737382B (en) 2018-04-23 2018-04-23 SVC coding HTTP streaming media self-adaption method based on Q-Learning

Country Status (1)

Country Link
CN (1) CN108737382B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521202A (en) * 2011-11-18 2012-06-27 东南大学 Automatic discovery method of complex system oriented MAXQ task graph structure
CN103326946A (en) * 2013-07-02 2013-09-25 中国(南京)未来网络产业创新中心 SVC streaming media transmission optimization method based on OpenFlow
CN104022850A (en) * 2014-06-20 2014-09-03 太原科技大学 Self-adaptive layered video transmission method based on channel characteristics
US20150373075A1 (en) * 2014-06-23 2015-12-24 Radia Perlman Multiple network transport sessions to provide context adaptive video streaming
CN104270646A (en) * 2014-09-22 2015-01-07 何震宇 Self-adaption transmission method and system based on mobile streaming media
US20160248835A1 (en) * 2015-02-24 2016-08-25 Koninklijke Kpn N.V. Fair Adaptive Streaming
CN105072671A (en) * 2015-06-30 2015-11-18 国网山东省电力公司潍坊供电公司 Adaptive scheduling method for sensor nodes in advanced metering system network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
熊丽荣, 等: "一种基于HTTP自适应流的混合码率自适应算法" [A hybrid bitrate adaptation algorithm based on HTTP adaptive streaming], 《计算机科学》 [Computer Science] *
熊丽荣, 等: "基于Q-learning的HTTP自适应流码率控制方法研究" [Research on a Q-learning-based bitrate control method for HTTP adaptive streaming], 《通信学报》 [Journal on Communications] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109802964A (en) * 2019-01-23 2019-05-24 西北大学 A kind of HTTP self adaptation stream control energy consumption optimization method based on DQN
CN109802964B (en) * 2019-01-23 2021-09-28 西北大学 DQN-based HTTP adaptive flow control energy consumption optimization method
WO2021006972A1 (en) * 2019-07-10 2021-01-14 Microsoft Technology Licensing, Llc Reinforcement learning in real-time communications
US11373108B2 (en) 2019-07-10 2022-06-28 Microsoft Technology Licensing, Llc Reinforcement learning in real-time communications

Also Published As

Publication number Publication date
CN108737382B (en) 2020-10-09

Similar Documents

Publication Publication Date Title
Sengupta et al. HotDASH: Hotspot aware adaptive video streaming using deep reinforcement learning
Zhang et al. Video super-resolution and caching—An edge-assisted adaptive video streaming solution
CN113434212B (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
Bokani et al. Optimizing HTTP-based adaptive streaming in vehicular environment using markov decision process
CN103370709A (en) A cache manager for segmented multimedia and corresponding method for cache management
CN113315978B (en) Collaborative online video edge caching method based on federal learning
Li et al. An apprenticeship learning approach for adaptive video streaming based on chunk quality and user preference
CN115022684B (en) Video stream self-adaptive transmission method based on deep reinforcement learning under QUIC protocol
CN103338393A (en) Video code rate selecting method driven by user experience under HSPA system
CN108737382A (en) SVC coding HTTP streaming media self-adaption method based on Q-L earning
Li et al. DAVS: Dynamic-chunk quality aware adaptive video streaming using apprenticeship learning
CN116962414A (en) Self-adaptive video streaming transmission method and system based on server-free calculation
CN116249162A (en) Collaborative caching method based on deep reinforcement learning in vehicle-mounted edge network
CN112055263A (en) 360-degree video streaming transmission system based on significance detection
Shi et al. CoLEAP: Cooperative learning-based edge scheme with caching and prefetching for DASH video delivery
Feng et al. Timely and accurate bitrate switching in HTTP adaptive streaming with date-driven I-frame prediction
Zahran et al. ARBITER: Adaptive rate-based intelligent HTTP streaming algorithm
CN109802964A (en) A kind of HTTP self adaptation stream control energy consumption optimization method based on DQN
Cai et al. A multi-objective optimization approach to resource allocation for edge-based digital twin
Lin et al. KNN-Q learning algorithm of bitrate adaptation for video streaming over HTTP
CN112333456B (en) Live video transmission method based on cloud edge protocol
Lu et al. Deep-reinforcement-learning-based user-preference-aware rate adaptation for video streaming
CN118175356A (en) Video transmission method, device, equipment and storage medium
CN103179441A (en) Method and server for playing contents
CN116209015B (en) Edge network cache scheduling method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant