JP2008252667A

JP2008252667A - System for detecting event in moving image

Info

Publication number: JP2008252667A
Application number: JP2007093237A
Authority: JP
Inventors: Haruyo Ookubo; 晴代大久保; Toshihito Egami; 登志人江上
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2007-03-30
Filing date: 2007-03-30
Publication date: 2008-10-16

Abstract

PROBLEM TO BE SOLVED: To provide a robust system for detecting events in moving images that does not depend on the recording system.
SOLUTION: The system detects a highlight scene, or the start of one, in a moving image. It comprises a feature quantity extraction part, a feature quantity observation part, a model storing part, a category classification part, and a determination part. The feature quantity observation part observes the feature quantity computed by the feature quantity extraction part; if the observed value does not match a specified feature quantity level, it has the extraction part change the feature quantity, so that events can be detected properly.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a moving image event detection apparatus that analyzes moving images and detects important scenes so that large volumes of moving images can be handled efficiently.

In recent years, with the spread of digital video equipment such as DVD recorders and of personal computers equipped with TV tuners and capable of recording TV, it has become common to store and keep large quantities of everyday events and TV programs as digital moving images on hard disk drives (HDDs), optical discs, and the like. Going forward, larger HDD and recording media capacities, recorders with multiple tuners, and improved video compression efficiency are expected to further accelerate the accumulation of large amounts of video content by individuals.

To address this situation, it has been proposed to detect important scenes, such as highlights of recorded content or scenes the user wants to see, and to use the detection results during playback so that a large amount of video can be viewed efficiently in a short time. As conventional detection techniques for extracting highlights, methods have been proposed that apply a threshold to audio power to determine the type of audio (see, for example, Patent Document 2 and Patent Document 3). These methods are not well suited to properly separating and classifying nonlinear states, for example, a state in which speech and cheering overlap.

On the other hand, as a method of properly separating nonlinear states, it is known to learn feature vectors of scenes given in advance as training data to build a probability model, and to use this model to determine whether the feature amounts of an input moving image belong to a highlight scene. Because the model is learned in advance, the detection performance of this method depends on the training data supplied; when the method is built into an actual system, it may produce unexpected determination results for input moving images whose feature amounts were unknown at training time and could not be reflected in the model. To address this problem, it has been proposed, for example, to superimpose artificial noise on the training data during learning so that the output for an unknown input moving image does not deviate from the correct answer, that is, to improve robustness (see, for example, Patent Document 1).
JP H10-63789 A (FIG. 1); JP 2004-260734 A (FIG. 1); JP 2003-101939 A (FIG. 1)

The conventional methods described above can strengthen robustness against momentary disturbances of the input moving image, but they do not yield the expected detection results for input moving images whose characteristics deviate greatly from the training data, whether steadily or temporarily. The model therefore had to be trained or tuned for each recording system. Meanwhile, the recent spread of recording devices and TV-capable personal computers has diversified recording environments, and the spread of broadband and digital broadcasting has increased opportunities to view videos produced in many different places. Consequently, users do not necessarily view only videos recorded by one fixed recording device, and the same detection accuracy must be maintained, without bias, for input moving images stored in the various environments described above.

The present invention solves the conventional problems described above and provides a robust detection apparatus for machine-learning-based detection in moving images that does not depend on the system used to record the moving image.

To solve the conventional problems described above, the moving image event detection apparatus of the present invention comprises the following units. A feature amount extraction unit receives moving image data, modifies the input data according to a designated feature amount to be changed and its change level, and computes and outputs feature amounts. A feature amount observation unit receives the feature amounts output from the feature amount extraction unit, observes the feature amount designated for observation, and, when it is not at the feature amount level designated as input, notifies the feature amount extraction unit of the feature amount to be changed and its change level. A model storage unit stores learning model data. A category classification unit reads from the model storage unit the learning model data corresponding to given genre information, receives the feature amounts output from the feature amount extraction unit, computes which of the predetermined classification types the input moving image data is closest to, and outputs the closest classification as its result. A determination unit receives the classification result from the category classification unit together with the genre information and determines the start of an important scene or an important section.

The moving image event detection apparatus of the present invention also has a model storage unit that stores, in association with each learning model, the feature amount to be observed and its feature amount level, and a feature amount observation unit that receives from the model storage unit the feature amount to be observed and the feature amount level associated with the learning model corresponding to the genre information.

As described above, according to the moving image event detection apparatus of the present invention, the feature amount observation unit observes whether the feature amounts of the input moving image deviate from the characteristics of the training data used for learning and, if they do, the input moving image is corrected, so that appropriate detection results are obtained.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

(Embodiment 1)
The moving image event detection apparatus according to Embodiment 1 of the present invention is described in detail with reference to FIGS. 1 and 2.

FIG. 1 is a block diagram showing the configuration of the moving image event detection apparatus according to Embodiment 1 of the present invention. In FIG. 1, 11 is a feature amount extraction unit that calculates the feature amounts of the input moving image. 12 is a feature amount observation unit that holds an externally designated feature amount to be observed and its reference level, observes the feature amounts of the input moving image, checks that they are at the reference level, and, if they are not, sends a correction level to the feature amount extraction unit 11. 14 is a category classification unit that classifies the input moving image into categories segment by segment. 13 is a model storage unit that stores the model parameters for the category classification unit 14. 15 is a determination unit that detects event portions from the classification results of the category classification unit 14 and the genre information of the content being analyzed.
This embodiment is described for the case of extracting event sections from TV programs. Desired events are extracted from content that has already been recorded or is being received. For example, suppose that exciting portions such as hits, home runs, and goal scenes are to be extracted from sports programs such as baseball and soccer. In sports programs, audio feature amounts make it possible to detect excitement clearly from characteristics such as the cheering of the audience and the emphatic speech of commentators, while keeping the processing load of detection low, so this embodiment uses audio feature amounts.

The feature amount observation unit 12 receives an externally designated feature amount to be observed and a range limit for that feature amount, and checks whether the feature amount from the feature amount extraction unit 11 falls within the designated range. If it does not, the unit tells the feature amount extraction unit 11 which feature amount to correct and how. The feature amount to be observed and its level, designated externally, are levels determined from feature amounts obtained by analyzing the data used as teacher data when the learning model parameters were generated in advance. In this embodiment, it is assumed that the upper limit of the average audio power, Pw_max, and the reference value of the average audio power, Pw_ref, are designated externally. First, the feature amount observation unit 12 computes and holds, sample by sample, the average audio power over the first 20 seconds. Once 20 seconds have elapsed, it computes the average audio power over 20 seconds and determines whether the result exceeds Pw_max. If no gain is currently applied to the audio signal and the average exceeds Pw_max, the gain α to be applied is obtained, for example, by Formula 1, where Pw_realmean denotes the current average audio power.

Next, the feature amount observation unit 12 sends the feature amount extraction unit 11 the gain α to be applied to the audio signal input to the feature amount extraction unit 11, together with the type name "audio power feature amount". If the gain α is set to a value other than 1 and the average audio power no longer exceeds Pw_max, the gain α is returned to 1. The feature amount observation unit 12 continues this processing until the input signal ends.
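The observation loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: Formula 1 is not available in this text, so the gain rule `sqrt(Pw_ref / Pw_realmean)` (an amplitude gain that would bring the mean power to the reference value) is a stand-in, and the class name `PowerObserver`, the sliding window, and the choice to observe the pre-gain power are all hypothetical.

```python
import math
from collections import deque


class PowerObserver:
    """Tracks mean audio power over a sliding window and proposes a gain."""

    def __init__(self, pw_max, pw_ref, window_samples):
        self.pw_max = pw_max      # upper limit of the mean power (Pw_max)
        self.pw_ref = pw_ref      # reference mean power (Pw_ref)
        self.window = deque(maxlen=window_samples)
        self.power_sum = 0.0
        self.gain = 1.0

    def push(self, sample):
        """Feed one raw (pre-gain) sample and return the gain now in effect."""
        power = sample * sample
        if len(self.window) == self.window.maxlen:
            # The oldest value is about to be evicted by append().
            self.power_sum -= self.window[0]
        self.window.append(power)
        self.power_sum += power
        if len(self.window) < self.window.maxlen:
            return self.gain  # the first window is not full yet
        mean_power = self.power_sum / len(self.window)  # Pw_realmean
        if mean_power > self.pw_max:
            if self.gain == 1.0:
                # ASSUMPTION: stand-in for Formula 1, not the patent's formula.
                self.gain = math.sqrt(self.pw_ref / mean_power)
        else:
            self.gain = 1.0  # back within range: stop correcting
        return self.gain
```

For 20 seconds of 48 kHz audio, `window_samples` would be 960000; the sampling rate is likewise an assumption, since the text does not state one.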

FIG. 2 shows a configuration example of the feature amount extraction unit 11. In FIG. 2, 21 is a power calculation unit that computes the power of each sample, 22 is a feature amount calculation unit that computes feature amounts such as short-time power average, cepstrum, MFCC, and fundamental frequency, and 23 is a gain setting unit that applies a gain to the input signal.

The feature amount extraction unit 11 takes uncompressed audio as input and computes its feature amounts. Many feature amounts that express the acoustic properties of sound, such as short-time power average, cepstrum, MFCC, and fundamental frequency, can be computed and used in later processing. Here, the unit computes the feature amounts used in subsequent processing and also outputs the feature amount designated by the feature amount observation unit 12 to the feature amount observation unit 12.
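As a rough illustration of one of the feature amounts named here, the sketch below computes a short-time power average after applying a gain, loosely mirroring the gain setting unit 23 feeding the power and feature calculations. The function name, the non-overlapping framing, and the frame length are assumptions; the text does not specify how frames are formed.

```python
def short_time_power(samples, frame_len, gain=1.0):
    """Return the mean power of each non-overlapping frame after applying gain.

    A trailing partial frame is dropped.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Gain is applied to the amplitude, so power scales with gain squared.
        powers.append(sum((gain * s) ** 2 for s in frame) / frame_len)
    return powers
```

Because the gain multiplies the amplitude, halving the gain reduces each frame's power to a quarter.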

The model storage unit 13 stores the parameters of the learning model. These model parameters are generated by learning in advance in a separate step. Here, learning model parameter data is stored for each program genre.

The category classification unit 14 takes the feature amounts output from the feature amount extraction unit 11 as input and classifies the input audio data as speech, cheering, or music, as a combination of any two or all three of speech, cheering, and music, or as something else. A model trained with classification types defined in this way can also classify cases where cheering and music overlap, or where speech and cheering overlap. This embodiment uses a GMM (Gaussian Mixture Model). This is a generally known model that computes each output probability and determines, from the cumulative likelihood, which type of sound the input is closest to. The model is set up by retrieving from the model storage unit 13 the learning model parameters that correspond to the externally supplied program information. The input moving image data is divided into segments of fixed length, and the classification type closest to each segment is output as the classification result. For example, the closest classification type is output every second.
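The per-segment decision can be illustrated with a small pure-Python sketch. It assumes one diagonal-covariance GMM per class and accumulates log-likelihoods over the frames of a segment (the sum of logs corresponds to the product of per-frame likelihoods). The parameter layout, the function names, and the diagonal-covariance restriction are assumptions made for illustration and are not taken from the patent.

```python
import math


def gmm_log_likelihood(x, components):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    components is a list of (weight, means, variances) tuples.
    """
    logs = []
    for weight, means, variances in components:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        logs.append(log_p)
    # Log-sum-exp over the mixture components for numerical stability.
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def classify_segment(frames, models):
    """Return the class with the highest cumulative log-likelihood, and all scores."""
    scores = {
        label: sum(gmm_log_likelihood(f, comps) for f in frames)
        for label, comps in models.items()
    }
    return max(scores, key=scores.get), scores
```

Here `models` maps a class label such as `"speech"` or `"speech+cheer"` to its list of mixture components, and `frames` holds the feature vectors that fall inside one segment.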

TV program information can be acquired from EPG (Electronic Program Guide) data. This data is either superimposed on the television broadcast wave or obtained from the Internet. If the input moving image to be analyzed is not currently being broadcast but has already been recorded on an HDD or recording medium, the program information must be recorded in association with the recorded data.

The determination unit 15 determines and outputs highlight sections and scene change points by applying, to the classification results output by the category classification unit 14, rules defined according to the genre of the TV program supplied externally. Specifically, it removes noise from the classification results, that is, it discards changes in which the result flips within a predetermined short time, and outputs the sections that contain cheering (start time and end time) as the detection result.
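One possible reading of this rule is sketched below: runs of identical labels shorter than a minimum length are absorbed into the preceding run, and the remaining runs whose label contains cheering are reported as intervals. The label strings (for example `"speech+cheer"`), the minimum run length, and the absorb-into-previous policy are assumptions; the patent only states that short-lived changes are removed.

```python
def smooth_labels(labels, min_run):
    """Absorb runs shorter than min_run into the run that precedes them."""
    smoothed = []
    i = 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        run_label = labels[i]
        if j - i < min_run and smoothed:
            run_label = smoothed[-1]  # treat the short run as noise
        smoothed.extend([run_label] * (j - i))
        i = j
    return smoothed


def cheer_intervals(labels, segment_sec=1.0, target="cheer"):
    """Return (start, end) times of maximal runs whose label contains target."""
    intervals = []
    start = None
    for idx, label in enumerate(labels):
        hit = target in label
        if hit and start is None:
            start = idx
        elif not hit and start is not None:
            intervals.append((start * segment_sec, idx * segment_sec))
            start = None
    if start is not None:
        intervals.append((start * segment_sec, len(labels) * segment_sec))
    return intervals
```

With one label per second, `cheer_intervals(smooth_labels(labels, 2))` yields start and end times in seconds.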

With this configuration, the feature amount observation unit 12 observes the designated feature amount and, when the feature amount is not at the externally designated level, specifies to the feature amount extraction unit 11 a gain by which the input audio is multiplied. The feature amounts of the input audio can thus be kept within the designated range, which avoids the problem of the category classification unit 14 making inappropriate classification decisions because the learning model is driven into an unanticipated state.

In this embodiment, the condition for applying the gain is that the average audio power exceeds the designated maximum value, but the feature amount may instead be observed by comparison with the reference value of the average audio power.

In this embodiment, the model storage unit 13 stores learning model parameters for each TV program genre, but the parameters may instead be stored with one learning model associated with several genres, with one learning model per TV program, or with one learning model per sub-genre.

In this embodiment, the category classification unit 14 classifies speech, cheering, music, and combinations of any two or all three of them, but any set of classes may be used as long as it is the set defined when the GMM was trained. If the characteristics of the portions to be detected differ by genre, sub-genre, or program, it is advisable to define classes suited to each case and train a model on them.

In this embodiment, the category classification unit 14 outputs only the classification result, but it may output the classification result together with its likelihood. In that case, the downstream determination unit 15 can take the likelihood into account when making its determination.

(Embodiment 2)
The moving image event detection apparatus according to Embodiment 2 of the present invention is described in detail with reference to FIG. 3. FIG. 3 is a block diagram showing the configuration of the moving image event detection apparatus according to Embodiment 2 of the present invention. As in Embodiment 1, this embodiment uses audio feature amounts to extract event sections from TV programs.

In FIG. 3, the feature amount extraction unit 11, the category classification unit 14, and the determination unit 15 operate in the same way as in Embodiment 1. The second feature amount observation unit 32 operates in the same way as the feature amount observation unit 12 described in Embodiment 1, except that it obtains the feature amount to be observed and its feature amount level from the data associated with the learning model parameters for the input genre information. The second model storage unit 33 stores the learning model parameters. In addition, it stores, in association with the learning model parameters, the feature amount to be observed and the feature amount level derived from the feature amounts of the training data used to generate those parameters.

FIG. 4 illustrates the data stored in the second model storage unit 33. Learning model parameters are defined for each TV genre, and for each set of learning model parameters the type of feature amount to be observed and its average and maximum values are stored. The stored average and maximum values are those of the feature amount over the training data used to generate the learning model parameters; in this embodiment, they are the average and maximum audio power of the training data. With this configuration, the second feature amount observation unit 32 observes the designated feature amount and, when the feature amount level is outside the values read from the second model storage unit 33, computes a gain by which the input audio is multiplied and specifies it to the feature amount extraction unit 11, so that the feature amounts of the input audio are kept within the designated range. This avoids the problem of the category classification unit 14 making inappropriate classification decisions because the learning model is driven into an unanticipated state.
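The association described for FIG. 4 could be represented, for example, as one record per genre that bundles the model parameters with the observation settings derived from the same training data. The class and field names below, and the values in the usage note, are hypothetical placeholders; FIG. 4 itself is not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass
class ObservationSpec:
    feature: str     # type of feature amount to observe, e.g. "audio_power"
    mean: float      # average of that feature over the training data
    maximum: float   # maximum of that feature over the training data


@dataclass
class ModelEntry:
    parameters: dict              # learned model parameters (opaque here)
    observation: ObservationSpec  # observation settings tied to this model


class ModelStore:
    """Keeps one ModelEntry per genre key."""

    def __init__(self):
        self._entries = {}

    def register(self, genre, entry):
        self._entries[genre] = entry

    def load(self, genre):
        # Raises KeyError for a genre that has no registered model.
        return self._entries[genre]
```

A caller might register `ModelEntry({"gmm": "..."}, ObservationSpec("audio_power", 0.02, 0.35))` under `"baseball"`; the observation unit would then read `maximum` and `mean` from the same record the classifier's parameters come from, so the two cannot drift apart.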

In this embodiment, the second model storage unit 33 stores learning model parameters for each TV program genre, but the parameters may instead be stored with one learning model associated with several genres, with one learning model per TV program, or with one learning model per sub-genre.

In this embodiment, the learning model parameters and the type, average value, and maximum value of the feature amount to be observed are depicted as stored in the same file, but any arrangement may be used as long as the data set of observed feature type, average value, and maximum value is associated with the corresponding learning model parameter set.

(Embodiment 3)
The moving image event detection apparatus according to Embodiment 3 of the present invention is described in detail with reference to FIG. 5. FIG. 5 is a block diagram showing the configuration of the moving image event detection apparatus according to Embodiment 3 of the present invention. As in Embodiment 1, this embodiment uses audio feature amounts to extract event sections from TV programs.

In FIG. 5, 51 is a second feature amount extraction unit, 52 is a third feature amount observation unit, 53 is a third model storage unit, 54 is a second category classification unit, and 55 is a second determination unit.

The second feature amount extraction unit 51 extracts the same feature amounts as in Embodiment 1 and outputs them to the third feature amount observation unit 52 and the second category classification unit 54.

The third feature amount observation unit 52 computes the maximum and average values of each feature amount obtained from the second feature amount extraction unit 51 and sends them to the second category classification unit 54.

The third model storage unit 53 stores the learning model parameter set as in Embodiment 1 and, in addition, stores for each class the maximum and average values of the training data used to train the model. For example, because this embodiment classifies speech, cheering, music, and combinations of any two or all three of them, the maximum and average values of each feature amount of the training data are stored for each of these classes.

In addition to operating as in Embodiment 1, the second category classification unit 54 takes, from the per-class maximum and average values of each training data feature amount read from the third model storage unit 53, those of the class that matches the classification result, compares them with the maximum and average values of each feature amount received from the third feature amount observation unit 52, and checks that they do not differ greatly. If they differ greatly, it outputs the classification result together with a small weight (1 or less). If they do not, it outputs a weight of 1.

The second determination unit 55 determines and outputs highlight sections and scene change points by applying, to the classification results output by the second category classification unit 54, rules that depend on the genre of the TV program supplied externally. Specifically, when it removes noise from the classification results and determines the sections that contain cheering (start time and end time), it takes into account the weights output by the second category classification unit 54.

Suppose that the average and maximum audio power of the training data for each class are the values shown in FIG. 6, that at time t the maximum audio power input to the second category classification unit 54 is InPW_max(t) and the average is InPW_mean(t), and that the likelihoods computed by the second category classification unit 54 are the values shown in FIG. 7. For each class, if InPW_mean(t) is at least 1.5 times the average audio power of that class's training data, the likelihood of the class is replaced by likelihood × [average audio power of the training data / InPW_mean(t)]. The likelihoods of all classes are obtained in this way, and the class with the highest likelihood becomes the classification result for that time.
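The reweighting rule in this paragraph is simple enough to state as code. The sketch below is a minimal illustration; the function and argument names are hypothetical, and the numbers in the usage note are made up because the values of FIGS. 6 and 7 are not available in this text.

```python
def reweight_likelihoods(likelihoods, train_mean_power, in_mean_power, ratio=1.5):
    """Scale down likelihoods of classes whose training power is far below the input.

    likelihoods:      class label -> likelihood from the classifier
    train_mean_power: class label -> mean audio power of that class's training data
    in_mean_power:    InPW_mean(t), the mean audio power of the current input
    """
    adjusted = {}
    for label, likelihood in likelihoods.items():
        train_mean = train_mean_power[label]
        if in_mean_power >= ratio * train_mean:
            # likelihood x [training mean power / InPW_mean(t)]
            likelihood *= train_mean / in_mean_power
        adjusted[label] = likelihood
    best = max(adjusted, key=adjusted.get)
    return best, adjusted
```

For instance, with likelihoods `{"speech": 0.6, "cheer": 0.5}`, training means `{"speech": 10.0, "cheer": 40.0}`, and an input mean of 40.0, the speech likelihood is scaled by 10/40 while the cheer likelihood is left unchanged, so the result switches from speech to cheer.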

With this configuration, the third feature amount observation unit 52 observes the feature amounts obtained by the second feature amount extraction unit 51 and outputs the maximum and average values of each feature amount to the second category classification unit 54. The second category classification unit 54 compares these values with the maximum and average values of the training data of the model corresponding to the classification result obtained from the input feature vector, and outputs the reliability of the classification result as a weight. This avoids the problem of inappropriate classification decisions for input patterns that were insufficiently covered during training.

In this embodiment, the maximum and average values of every feature amount are compared, but the comparison may be limited to the maximum and average values of the feature amounts that contribute dominantly to the learning model. The comparison also need not use maximum and average values; any value that characterizes the data, such as variance, may be compared.

In the moving image event detection apparatus according to the present invention, the feature amount observation unit observes whether the feature amounts of the input moving image deviate from the characteristics of the training data used for learning and, if they do, the input moving image can be corrected. Because appropriate detection results are obtained without preparing training data for each recording system, the apparatus can also be applied to automatically detecting a desired section of a moving image, for example a section of heightened excitement.

Block diagram showing the configuration of the moving image event detection apparatus according to Embodiment 1 of the present invention
Configuration diagram of the feature quantity extraction unit 11 of the moving image event detection apparatus according to Embodiment 1 of the present invention
Block diagram showing the configuration of the moving image event detection apparatus according to Embodiment 2 of the present invention
Conceptual diagram of the data stored in the second model storage unit in Embodiment 2 of the present invention
Block diagram showing the configuration of the moving image event detection apparatus according to Embodiment 3 of the present invention
Diagram showing the maximum and average of each feature quantity of the training data, associated with each classification, stored in the third model storage unit in Embodiment 3 of the present invention
Diagram showing the likelihood for each classification calculated by the second category classification unit in Embodiment 3 of the present invention

Explanation of symbols

11 Feature quantity extraction unit
12 Feature quantity observation unit
13 Model storage unit
14 Category classification unit
15 Determination unit
21 Power calculation unit
22 Feature quantity calculation unit
23 Gain setting unit
32 Second feature quantity observation unit
33 Second model storage unit
51 Second feature quantity extraction unit
52 Third feature quantity observation unit
53 Third model storage unit
54 Second category classification unit
55 Second determination unit

Claims

A moving image event detection apparatus that analyzes a moving image to detect events and important parts, comprising: a feature quantity extraction unit that receives moving image data, modifies the input data according to a designated feature quantity to be changed and its change level, and calculates and outputs feature quantities; a feature quantity observation unit that receives the feature quantities output from the feature quantity extraction unit, observes the feature quantity to be observed, and, if it is not at the feature quantity level designated as input, conveys to the feature quantity extraction unit the feature quantity to be changed and its change level; a model storage unit that stores learning model data; a category classification unit that reads from the model storage unit the learning model data corresponding to given genre information, receives the feature quantities output from the feature quantity extraction unit, calculates which of the predetermined classification types the input moving image data is closest to, and outputs the closest classification result; and a determination unit that receives the classification result from the category classification unit and the genre information and determines the start of an important scene or an important section.

The moving image event detection apparatus according to claim 1, wherein the model storage unit stores an observed feature quantity and a feature quantity level in association with each learning model, and the feature quantity observation unit receives from the model storage unit the observed feature quantity and feature quantity level associated with the learning model corresponding to the genre information, and receives from the feature quantity extraction unit the feature quantity to be observed.

A moving image event detection apparatus that analyzes a moving image to detect events and important parts, comprising: a feature quantity extraction unit that receives moving image data and calculates and outputs feature quantities; a feature quantity observation unit that receives the feature quantities output from the feature quantity extraction unit and calculates and outputs the range and distribution of the feature quantities to be observed; a model storage unit that stores learning model data together with the range and distribution of the feature quantities of the training data of that learning model data; a category classification unit that reads from the model storage unit the learning model data corresponding to genre information, receives the feature quantities output from the feature quantity extraction unit, calculates which of the predetermined classification types the input data is closest to, outputs the closest classification result, compares the range and distribution of the observed feature quantities output from the feature quantity observation unit with the range and distribution, read from the model storage unit, of the feature quantities of the training data corresponding to the classification result, and outputs a weight for the classification result that is lighter when the ranges or distributions differ; and a determination unit that receives the classification result and weight from the category classification unit and the genre information and determines the start of an important scene or an important section.

The moving image event detection apparatus according to claim 1, wherein the feature quantity observation unit receives the audio power feature quantity output from the feature quantity extraction unit and the reference average and maximum of the audio power feature quantity, calculates the average and maximum of the audio power feature quantity, checks whether the calculated average and maximum are at the designated feature quantity level, and, if they are not, conveys to the feature quantity extraction unit the change level to be applied to the audio power feature quantity.

The moving image event detection apparatus according to claim 1, wherein the feature quantity observation unit receives the audio power feature quantity output from the feature quantity extraction unit and the reference average and maximum of the audio power feature quantity, calculates the average and maximum of the audio power feature quantity, and, when the calculated average exceeds the reference average for a certain period or the calculated maximum exceeds the reference maximum for a certain period, calculates the change level to be applied to the audio input and conveys that change level to the feature quantity extraction unit.