CN102427507A

CN102427507A - Football video highlight automatic synthesis method based on event model

Info

Publication number: CN102427507A
Application number: CN2011102943849A
Authority: CN
Inventors: 赵沁平; 陈小武; 蒋恺
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2011-09-30
Filing date: 2011-09-30
Publication date: 2012-04-25
Anticipated expiration: 2031-09-30
Also published as: CN102427507B

Abstract

The invention relates to a method for automatically synthesizing football video highlights based on an event model. The method comprises the following steps: defining a football video highlight clip as an important football event that can be decomposed into a combination of several actions; constructing a core-surrounding event model to represent a football highlight clip; building a training set from football match videos and their corresponding text commentary, selecting goals and red and yellow cards as two types of football highlights, and training the event model; inputting a football match video without commentary, identifying where football highlight clips occur in the input video, and assigning each a matching score; and, according to user requirements, automatically synthesizing the highest-scoring football highlight clips into a football video highlight reel. The method overcomes restrictions imposed by factors such as the shot distance of the input video, and can be widely applied in fields such as personal digital entertainment and sports film and television production.

Description

An automatic football video highlight synthesis method based on an event model

Technical field

The present invention relates to the fields of computer vision, video processing and augmented reality, and specifically to an automatic football video highlight synthesis method based on an event model.

Background technology

A sports video highlight reel is a kind of sports film and television program. Because it conveys a large amount of information in a short time, its compactness makes it very popular with audiences. This is especially true of football: watching a 90-minute match video just to see a favorite player or an exciting shot is very time-consuming, so football match highlights are often used to cover match-related topics such as replays of exciting shots, match summaries and players' personal stories. Traditional video highlights are edited manually from the match video. Although manual editing is precise and emotionally expressive, it requires a great deal of labor to inspect the video frame by frame for the desired shots, and it demands considerable match knowledge and experience from the editor. With continuing progress in video understanding and computer vision research, automatically generating highlight videos from sports match videos has gradually become a technical and research focus.

At present, depending on the video source, automatic generation of highlight videos from sports match videos falls into two broad categories. The first is automatic highlighting of television broadcast video. Because broadcast video incorporates the broadcast director's understanding of the match, broadcasting techniques can serve as implicit cues for video highlighting. For example, in a football broadcast, close-up or slow-motion shots usually appear after a goal; a single event usually takes place between two shot switches; and a long shot usually indicates the opening of play or a large-scale movement of the ball. Methods of this kind detect football highlight clips, and ultimately generate the highlight video, by detecting the above cues in the football video, or by directly detecting on-screen text (for example the scoreboard) to determine when a highlight clip occurs. Although such methods can produce fairly good highlights to some extent, they depend too heavily on broadcast video, which severely limits their scope of application.

The second is automatic highlighting of non-broadcast video. Within this category, methods targeted at a particular video theme usually exploit prior knowledge specific to that theme (for example, the netted goal, the large green pitch and the cheering of spectators in football video) to obtain cues for detecting exciting shots. Their strong specificity means that the models of such methods are fixed and poorly reusable. Of greater research value are highlight methods that are generally applicable within a certain range. Current research in this area concentrates on two directions: (1) video event analysis; (2) video content summarization.

In video event analysis, at the ECCV conference in 2010, Li Fei-Fei et al. of Stanford University proposed a behavior model based on the sequential relationships among human actions. The model represents an action as behavior segments at different points in time. The method trains two kinds of models, a discriminative model and an appearance model: the discriminative model encodes the temporal decomposition of a video sequence, and the appearance model describes each behavior segment. During recognition, the video is matched against the model using the learned features and the decomposition into behavior segments. By introducing temporal structure, the method recognizes simple and complex human actions well, but because its temporal structure pattern is fixed, it cannot handle complex events composed of several actions. At the CVPR conference in 2009, Larry S. Davis et al. of the University of Maryland proposed a method for learning a complete visual storyline model from videos with weakly labelled data. The storyline model is expressed as an AND-OR graph and can compactly encode the storyline variations in a video. The edges of the AND-OR graph correspond to causal relationships based on spatio-temporal constraints. With this model and the learned training data, behavior recognition and storyline extraction can be performed. Considering the association between human pose and surrounding objects in a video frame, Fowlkes et al. of the University of California proposed in 2010 a method that recognizes actions by modeling the relationship between human pose and surrounding objects. The method mainly addresses action recognition in still images and converts it into a latent structured labelling problem.

In video content summarization, the method proposed by Pritch et al. in the PAMI journal in 2008 analyzes a video to condense a long video into a short synopsis, showing motion information from multiple frames simultaneously in each frame. Its limitation is that it cannot handle videos in which the whole scene is in motion or videos that have been edited. Hwang et al. of the University of Washington proposed a key-frame extraction method based on video segmentation, and designed and implemented a corresponding system for fast and effective online processing. At the CVPR conference in 2005, Jojic et al. of Microsoft Research proposed a new interaction model for indexing and analyzing surveillance video. In addition, Wu et al. of the University of Vermont proposed a hierarchical video summarization strategy that analyzes the structure of the video content to provide users with multi-scale, multi-level video summaries.

In summary, current video highlighting technology has two main problems. (1) It depends heavily on the quality of the input video, so its scope of application is narrow. Although semantically rich cues such as shot switches, whistles and transitions allow football highlight clips to be detected quickly, such cues give no understanding of how a football event unfolds, so it is difficult to extract the time interval in which the event occurs. (2) Few methods perform video highlighting with the event as the unit. Because video events are rich and varied, models that rely directly on feature statistics can hardly cover the variation among events. How to make reasonable use of domain knowledge and combine it with the visual features of events to model them is both a difficulty and a research focus.

Summary of the invention

In view of the above practical needs and key problems, the object of the present invention is to propose an automatic football video highlight synthesis method based on an event model. The method overcomes restrictions imposed by factors such as the shot distance, length and sound of the input video. It is particularly suitable when the input video is a non-broadcast video from which key highlight cues such as close-up shots and cheering cannot be obtained.

The present invention regards a football video highlight reel as a synthesized video formed by combining several football highlight clips, each of which contains one important football event. Compared with videos of other sports, football match video has two characteristics: first, it is difficult to find cues in the video that mark the start and end of a video event; second, the rules of football are complex, and important football events of the same type (for example goals, or red and yellow cards) differ in duration and course each time they occur. Extensive observation shows that an important football event can usually be decomposed into a combination of several actions, one of which is an important, frequently occurring action called the core action; the other actions are called surrounding actions. The present invention therefore holds that a football match highlight clip can be represented by a core-surrounding event model.

To condense a football match video into a football highlight video, football highlight clips must be detected in and extracted from the input video. The present invention therefore first constructs a core-surrounding event model, which models an event together with the semantic relationships, temporal relationships and visual features of the actions that make up the event.

Training the core-surrounding event model comprises the following steps: (1) input a series of football match videos and their corresponding text commentary, extract keywords from the commentary and, based on the event records in the commentary, compute the occurrence probability of each keyword and the probability that several keywords occur together; (2) select the keyword with the highest occurrence probability as the core keyword; (3) align the commentary with the football match video, record the time at which each keyword occurs, and compute the duration of the action that the keyword represents and the duration of the event; (4) within the time period in which a keyword occurs, compute the gradient features and optical-flow features at the space-time interest points, and build the gradient histogram and optical-flow histogram as the local visual features of the action.
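Steps (1) and (2) of this training process can be illustrated with a minimal Python sketch. It assumes the commentary has already been reduced to one keyword list per annotated event; the function name keyword_statistics and the sample keywords are hypothetical and do not come from the patent.

```python
from collections import Counter
from itertools import combinations


def keyword_statistics(events):
    """events: list of keyword lists, one list per annotated event (assumed input format)."""
    n = len(events)
    single = Counter()
    pair = Counter()
    for keywords in events:
        unique = sorted(set(keywords))          # count each keyword once per event
        single.update(unique)
        pair.update(combinations(unique, 2))    # unordered keyword pairs in this event
    p_single = {k: c / n for k, c in single.items()}
    p_pair = {k: c / n for k, c in pair.items()}
    core = max(p_single, key=p_single.get)      # keyword with the highest occurrence probability
    return p_single, p_pair, core


# Hypothetical keyword lists extracted from the commentary of four goal events.
goal_events = [
    ["pass", "shoot", "celebrate"],
    ["shoot", "save"],
    ["cross", "shoot", "celebrate"],
    ["pass", "shoot"],
]
p_single, p_pair, core = keyword_statistics(goal_events)
```

Here the probabilities are simple relative frequencies over events, which is one plausible reading of "probability of occurrence"; the patent does not specify the estimator.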

In general, the core-surrounding event model captures: the visual statistical features of each action; the order of the actions as the event unfolds; the ratio of each action's duration to the event duration; and the probability that each action occurs.

After training, the model is used to detect and extract video events. In general, given an input football match video, synthesizing the football highlight video comprises two steps. (1) Extract highlight clips. For each type of football highlight clip, first detect on the input video the core action and the surrounding actions that make up the important football event contained in that type of clip, and obtain the time period in which each action occurs; then, taking the core action as the reference, determine the time period of the event using the temporal relationships among the actions, and take it as the time period of a candidate highlight clip; finally, match the candidate highlight clip against the event model to obtain a model matching score. (2) Synthesize the highlight video. First, for each type of football highlight clip, take the candidate list produced by step (1) and sort it by model matching score from high to low; then select a number of football highlight clips according to the clip types and highlight video length required by the user, and arrange them in order of occurrence; finally, apply a smooth transition between the last several frames of each football highlight clip and the first several frames of the next, so that the result is more visually pleasing.
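The selection part of step (2) can be sketched as follows. This is only an illustration: the Clip record, the select_clips function and the greedy "take the best-scoring clip that still fits" rule are assumptions, since the text states only that clips are ranked by matching score, chosen according to the requested types and total length, and then ordered by time of occurrence.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    kind: str     # clip type, e.g. "goal" or "card"
    start: float  # start time in the match video, in seconds
    end: float    # end time, in seconds
    score: float  # model matching score

    @property
    def duration(self):
        return self.end - self.start


def select_clips(candidates, wanted_kinds, max_length):
    """Pick high-scoring clips of the wanted types within a total length budget."""
    ranked = sorted(
        (c for c in candidates if c.kind in wanted_kinds),
        key=lambda c: c.score,
        reverse=True,
    )
    chosen, total = [], 0.0
    for clip in ranked:
        if total + clip.duration <= max_length:  # greedy rule (assumption)
            chosen.append(clip)
            total += clip.duration
    return sorted(chosen, key=lambda c: c.start)  # arrange by time of occurrence


candidates = [
    Clip("goal", 100, 130, 0.9),
    Clip("card", 300, 320, 0.7),
    Clip("goal", 50, 90, 0.8),
    Clip("goal", 500, 520, 0.4),
]
goals_only = select_clips(candidates, {"goal"}, 60)
mixed = select_clips(candidates, {"goal", "card"}, 100)
```

With a 60-second budget and goals only, the 40-second clip is skipped because it would exceed the budget after the top-scoring clip is taken, and a shorter, lower-scoring goal fills the remaining time.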

Compared with other video highlighting methods, the present invention has the following advantages. (1) It applies to a wide range of video sources. Other video highlighting methods rely on cues such as close-up shots and transitions that are present when a television station broadcasts the video; the present invention detects and recognizes the various events in a video by analyzing the visual features of video events, and can therefore be widely applied to video highlighting in personal digital entertainment, sports science research, television program production and the like. (2) Highlight clips can be combined flexibly. Because the present invention uses the video event as the unit of a highlight clip, the user can specify conditions such as the required highlight clip types and the highlight video length, so that a personalized video highlight product meeting the user's needs can be synthesized.

Description of drawings:

Fig. 1 is a structure diagram of the core-surrounding event model of the present invention;

Fig. 2 is a schematic diagram of the model training process of the present invention;

Fig. 3 is a flow chart of semantic-layer event model construction in the present invention;

Fig. 4 is a flow chart of the vision-layer event model training process of the present invention;

Fig. 5 is a schematic diagram of the football highlight clip extraction process of the present invention;

Fig. 6 is a schematic diagram of football highlight clip synthesis in the present invention.

Embodiment:

The present invention is described in detail below with reference to the accompanying drawings.

The present invention defines football video highlights as the set of important football events that occur in a football match and are carried by video. A football video highlight reel is formed by combining a series of football highlight clips, each of which contains one important football event. The present invention constructs a core-surrounding event model to detect and recognize important football events in football match video and thereby extract football highlight clips. Football highlight clips fall into different categories according to the type of important football event they contain. For example, goals and red and yellow cards are different important football events, so a football highlight clip containing a goal and one containing a red or yellow card belong to different categories.

Referring to Fig. 1, the structure diagram of the core-surrounding event model of the present invention, the model constructed by the present invention models the important football event contained in a football highlight clip both semantically and visually. The model comprises three main parts. (1) Semantic relationships: this part models the likelihood that the core action occurs together with each surrounding action, and the likelihood that each action occurs in the important football event. (2) Temporal order: this part models the temporal position at which each action may occur as the important football event unfolds, and the length of its duration. (3) Visual appearance: this part refers to the visual feature statistics at the space-time interest points in the video within the time interval of an action. For important football events of the same type, the action most likely to occur is taken as the core action, and the other actions are taken as surrounding actions that support the event. The temporal constraints between the surrounding actions and the core action are thus built into the model implicitly, which is very helpful for locating events in video.

When trained, the core-surrounding event model can be divided into two layers: a semantic layer and a vision layer. For a type of event E and the action set {a_i, i = 1, ..., n} that describes it, the semantic layer models the probability that a_i occurs in event E and whether a_i is the core of E. The vision layer models the visual appearance of the event and introduces the semantic-layer model as a prior probability. The vision-layer model has three parameters: the best classifier A_i for recognizing an action a_i; the best temporal anchor point t_i at which classifier A_i occurs; and the time interval r_i of A_i within the course of the event.

The training set of the event model comprises video segments {V^1, ..., V^N} and the class labels y_i of the corresponding actions (y_i ∈ {-1, 1}, i = 1, ..., N). The model is learned with a latent SVM (LSVM). In the LSVM framework the energy function is maximized over the latent variables; here the latent variable is the best position at which an action classifier occurs. This position is not given exactly but is obtained implicitly through training on the training samples.

Referring to Fig. 2, the schematic diagram of the football highlight clip model training process of the present invention, model training is divided into three main steps. (1) Semantic relationship modeling. The detailed process is shown in Fig. 3. First, the commentary carrying time and event annotations is used as training text; its verb and gerund keywords are extracted by sentence constituent analysis, and a keyword set representing the event is built. Based on the WordNet lexical classification, the keywords are mapped to different categories, and the category labels serve as action class labels. For each action, the number of occurrences within this category of highlight clip and the total number of occurrences are counted, the degree to which the action characterizes this category of highlight clip is computed, and the action with the highest degree is selected as the core action. The frequency of each action is recorded, and its occurrence probability is computed as the prior probability. (2) Action visual feature statistics. From the time annotations and action class labels of the commentary, the video time interval in which an action occurs is obtained; the video within this interval is divided into several parts, and in each part the gradient histogram and optical-flow histogram at the space-time interest points are computed. (3) Temporal relationship modeling. From the time annotations, event annotations and action class labels of the commentary, the order-of-occurrence graph of the actions in the events contained in highlight clips of the same type is derived, and, following the vision-layer event model, the LSVM is used to train the best occurrence position of each action.
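The core-action selection in step (1) can be sketched as below. The text does not give the formula for the degree to which an action characterizes a clip category, so this sketch assumes it is the ratio of in-category occurrences to total occurrences, with ties broken by the in-category count; the function name pick_core_action and the sample records are hypothetical.

```python
from collections import Counter


def pick_core_action(labelled_actions, category):
    """labelled_actions: iterable of (event_category, action_label) pairs (assumed format)."""
    in_category = Counter()
    overall = Counter()
    for cat, action in labelled_actions:
        overall[action] += 1
        if cat == category:
            in_category[action] += 1
    # Assumed measure: share of an action's occurrences that fall in this category.
    degree = {a: in_category[a] / overall[a] for a in in_category}
    # Prior probability of each action within the category, from its frequency.
    total = sum(in_category.values())
    prior = {a: c / total for a, c in in_category.items()}
    core = max(degree, key=lambda a: (degree[a], in_category[a]))
    return core, degree, prior


records = [
    ("goal", "shoot"), ("goal", "shoot"), ("goal", "pass"), ("goal", "celebrate"),
    ("card", "foul"), ("card", "pass"), ("card", "pass"),
]
core, degree, prior = pick_core_action(records, "goal")
```

In this toy data "pass" is frequent but appears mostly in card events, so its degree is low, while "shoot" appears only in goal events and is chosen as the core action.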

Referring to Fig. 4, the flow chart of the vision-layer event model training process of the present invention, the event model is trained on the vision layer as follows:

(1) Compute feature points. Each video V^p (p ∈ {1, ..., N}) in the training set is divided evenly into M video segments, and space-time interest points are detected in each video segment. [The symbols for the segments, the interest points and the number of interest points per segment are formula images missing from this text.]

(2) For each space-time interest point st_l, compute its gradient histogram and its optical-flow histogram [symbols missing]. The abscissa of the gradient histogram is the gradient vector bin, with the number of bins denoted ng, and the ordinate is the number of gradient vectors falling in each bin; the abscissa of the optical-flow histogram is the optical-flow vector bin, with the number of bins denoted nf, and the ordinate is the number of optical-flow vectors falling in each bin.

(3) Normalize the gradient histogram and optical-flow histogram of each space-time interest point of every video segment into one nd-dimensional vector, where nd = ng + nf, and use the k-means algorithm to cluster these vectors [their count is a missing formula image] into K classes, constructing the codebook of video segment visual statistical features.

(4) Initialize the best temporal anchor point t_i of classifier A_i and the time interval r_i of A_i within the course of the event, then train classifier A_i through steps (5) and (6).

(5) According to t_i and r_i, cut several video segments from video V^p, collect the vectors of the space-time interest points they contain, and map them onto the codebook to form a vector distribution histogram of length K [symbol missing]. Normalize this histogram into a K-dimensional vector and add it to the positive example set Ψ.

(6) With r_i determining the size of the cutting window, slide the window over video V^p and compute the vector distribution histogram of the video segment cut at temporal anchor point t [symbol missing]. Compute the distance between the K-dimensional vector formed by this histogram and the vectors in the positive example set [formula missing]. If the condition on this distance holds [formula missing; ε is a very small quantity], the new vector replaces the corresponding vector in the positive example set and this step is repeated; otherwise this step ends.

(7) Collect the positions at which t occurs in the videos V^p and fit them to a quadratic parabola [formula missing], where {α_i, β_i} are the parameters of the quadratic curve. The abscissa of the parabola is the normalized occurrence time of t and the ordinate is the number of occurrences at that time. The curve is kept for use as the time penalty function during recognition.
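The descriptor and codebook (coding table) mapping in steps (3) and (5) can be illustrated with a small pure-Python sketch. It assumes the K codewords have already been learned by k-means, which is not shown, and that the nearest codeword is found by Euclidean distance; the function names and the toy values (ng = nf = 2, K = 2) are hypothetical.

```python
import math


def normalise(vec):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)


def make_descriptor(grad_hist, flow_hist):
    """Concatenate the two histograms into one nd = ng + nf dimensional vector."""
    return normalise(list(grad_hist) + list(flow_hist))


def nearest_codeword(descriptor, codebook):
    """Index of the closest codeword (Euclidean distance is an assumption)."""
    return min(range(len(codebook)), key=lambda k: math.dist(descriptor, codebook[k]))


def codeword_histogram(descriptors, codebook):
    """Normalised length-K histogram of codeword assignments for one video segment."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1
    return normalise(hist)


# Toy codebook with K = 2 codewords of dimension nd = 4 (assumed already learned).
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
descriptors = [
    make_descriptor([4, 0], [0, 0]),
    make_descriptor([0, 0], [0, 2]),
    make_descriptor([3, 1], [0, 0]),
]
hist = codeword_histogram(descriptors, codebook)
```

The resulting K-dimensional vector is the kind of quantity that step (5) adds to the positive example set and that step (6) compares against it.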

Referring to Fig. 5, the schematic diagram of the football highlight clip extraction process of the present invention, extraction mainly comprises the following steps: (1) for the input football match video, detect all actions that may occur; (2) taking one type of football highlight clip as an example, use the core action of the important football event contained in that type of clip to locate a rough time period, which serves as the candidate time period of the football highlight clip; (3) compute the degree to which this candidate time period matches the corresponding event model, and express it as a score, called the matching score of the candidate time period for this football highlight clip. All candidate time periods for the same type of football highlight clip are arranged by matching score from high to low.

A candidate football highlight clip is matched against the event model as follows:

(1) Divide the candidate football highlight clip V^F into video segments at the same scale used for the training set videos [segment symbols missing].

(2) Take classifier A_i, set the sliding window size according to its time interval r_i, slide the window over the Q video segments of V^F, and compute the vector distribution histogram of the video segment cut at temporal anchor point t [symbol missing]. Compute the similarity between the K-dimensional vector formed by this histogram and the vectors in the positive example set [formula missing].

(3) Compute the time penalty at temporal anchor point t [formula missing].

(4) According to the formula [missing], compute the best score of classifier A_i on the candidate football highlight clip V^F as the matching score of classifier A_i.

(5) Add this score to the model matching score, and return to step (2) until all classifiers have been matched.
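The matching loop in steps (2) to (5) can be sketched as follows. The scoring formula is not recoverable from this text, so the sketch assumes a classifier's score at an anchor is its similarity minus a time penalty, that the classifier's matching score is the best such value over all anchors, and that the model score is the sum over classifiers. The quadratic penalty functions and similarity values are hypothetical.

```python
def classifier_score(similarities, penalty):
    """Best penalised score of one classifier over all anchor positions.

    similarities[q] is the similarity at the q-th anchor; penalty maps a
    normalised anchor time in [0, 1] to a cost (assumed combination rule).
    """
    count = len(similarities)
    best, best_index = float("-inf"), None
    for index, sim in enumerate(similarities):
        t = index / (count - 1) if count > 1 else 0.0  # normalised anchor time
        score = sim - penalty(t)
        if score > best:
            best, best_index = score, index
    return best, best_index


def model_score(per_classifier):
    """Accumulate the best scores of all classifiers (step 5)."""
    return sum(classifier_score(sims, pen)[0] for sims, pen in per_classifier)


# Hypothetical penalties: "shoot" is expected mid-clip, "celebrate" at the end.
def shoot_penalty(t):
    return 2.0 * (t - 0.5) ** 2


def celebrate_penalty(t):
    return 2.0 * (t - 1.0) ** 2


shoot_sims = [0.2, 0.6, 0.9, 0.7, 0.95]
celebrate_sims = [0.1, 0.1, 0.3, 0.5, 0.8]

shoot_best = classifier_score(shoot_sims, shoot_penalty)
total = model_score([(shoot_sims, shoot_penalty), (celebrate_sims, celebrate_penalty)])
```

In this example the raw similarity of "shoot" peaks at the last anchor, but the time penalty moves the best match to the middle of the clip, which is the effect the time penalty function is meant to have.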

Referring to Fig. 6, the schematic diagram of football highlight clip synthesis in the present invention, video highlighting is completed by editing the transition effect between every two football highlight clips, according to the football highlight clip types and highlight video length required by the user. The last N frames of football highlight clip A and the first N frames of football highlight clip B are chosen as the transition region. The transparency of each frame is adjusted so that the transparency of the x-th frame of the adjusted A and the transparency of the x-th frame of B satisfy a fixed relation [formula image missing].
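A transition of this kind can be sketched as a cross-fade. Because the relation between the two transparencies is missing from this text, the sketch assumes a common choice: the blending weights of A and B sum to 1 and change linearly across the N transition frames. Frames are represented as flat lists of pixel intensities, and the function names are hypothetical.

```python
def crossfade_weights(n):
    """(weight_a, weight_b) for transition frames x = 1..n; linear ramp (assumption)."""
    return [(1 - x / (n + 1), x / (n + 1)) for x in range(1, n + 1)]


def blend_frames(frame_a, frame_b, w_a, w_b):
    """Pixel-wise weighted sum of two frames of equal size."""
    return [w_a * pa + w_b * pb for pa, pb in zip(frame_a, frame_b)]


def crossfade(tail_a, head_b):
    """Blend the last N frames of clip A with the first N frames of clip B."""
    if len(tail_a) != len(head_b):
        raise ValueError("both transition regions must contain N frames")
    weights = crossfade_weights(len(tail_a))
    return [
        blend_frames(fa, fb, wa, wb)
        for fa, fb, (wa, wb) in zip(tail_a, head_b, weights)
    ]


# Three two-pixel frames from each clip (toy data).
tail_a = [[100, 100]] * 3
head_b = [[0, 200]] * 3
transition = crossfade(tail_a, head_b)
```

Clip A fades out while clip B fades in; a real implementation would apply the same weights to full image arrays.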

The present invention supports highlighting football match video according to user requirements: (1) given a highlight video length, generate the highlight video of a football match; (2) given football highlight clip types, generate a football highlight video of the specified clip types; (3) given both a highlight video length and clip types, generate a football highlight video of that length containing clips of those types.

The above is merely a basic description of the present invention. Any equivalent transformation made according to the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims

1. An automatic football video highlight synthesis method based on an event model, characterized in that it comprises the following steps:

(1) defining a football video highlight clip as an important football event that is carried out by one or more persons and can be decomposed into a combination of several actions;

(2) constructing a core-surrounding event model: according to the occurrence probabilities of the actions, the action most likely to occur is designated the core action and the remaining actions are surrounding actions; the event model comprises three parts, namely action semantic relationships, action temporal relationships and local visual features;

(3) building a training set from football match videos and their corresponding text commentary, selecting goals and red and yellow cards as two types of football highlights, and training the core-surrounding event model in three respects, namely action semantic relationships, action temporal relationships and local visual features;

(4) inputting a football match video without commentary, extracting football highlight clips from the input video using the trained event model, and giving the matching score between each candidate highlight clip and the model;

(5) sorting the football highlight clips of each category by matching score, and automatically synthesizing the higher-scoring football highlight clips into a football video highlight reel.

2. The automatic football video highlight synthesis method based on an event model according to claim 1, characterized in that: in step (1) the video event is used as the unit of a football highlight clip, so that football video highlighting can be performed separately for football highlight clips of a given type.

3. The automatic football video highlight synthesis method based on an event model according to claim 1, characterized in that: the core-surrounding event model of step (2) requires that an event can be decomposed into a plurality of actions, and the core-surrounding event model mainly models three parts:

(2.1) the action semantic relationships, comprising the probability that each action occurs and the probability that each surrounding action occurs together with the core action;

(2.2) the action temporal relationships, comprising the order of the actions as the event unfolds and the ratio of each action's duration to the event duration;

(2.3) the local visual features, comprising the gradient and optical-flow statistical features of each action over its duration.

4. The automatic football video highlight synthesis method based on an event model according to claim 1, characterized in that: step (3) requires that the input text commentary of the football match video contain time records and event records that can be aligned with the video time, and the core-surrounding event model is trained for a given type of football highlight as follows:

(3.1) inputting a series of football match videos and their corresponding text commentary, extracting keywords from the commentary and, based on the event records in the commentary, computing the occurrence probability of each keyword and the probability that several keywords occur together;

(3.2) selecting the keyword with the highest occurrence probability as the core keyword;

(3.3) aligning the commentary with the football match video, recording the time at which each keyword occurs, and computing the duration of the action that the keyword represents and the duration of the event;

(3.4) within the time period in which a keyword occurs, computing the gradient features and optical-flow features at the space-time interest points, and building the gradient histogram and optical-flow histogram as the local visual features of the action.

5. The automatic football video highlight synthesis method based on an event model according to claim 1, characterized in that: for a football match video input in step (4), the highlight clip extraction process is divided into the following steps:

(4.1) detecting the core action and the surrounding actions on the input video, and obtaining the time periods in which all actions occur;

(4.2) taking the core action as the reference, determining the time period of the event using the action temporal relationships, and taking it as a candidate football highlight clip;

(4.3) matching the candidate football highlight clip against the event model to obtain the model matching score.

6. The automatic football video highlight synthesis method based on an event model according to claim 1, characterized in that: when several candidate football highlight clips are combined into the football video highlight reel in step (5) according to the highlight types and video length required by the user, transition processing is applied to the beginning and end of each football highlight clip.