JP2004086124A

JP2004086124A - Device and method for creating metadata

Info

Publication number: JP2004086124A
Application number: JP2002334831A
Authority: JP
Inventors: Masaaki Kobayashi; 小林　正明; Hiroyuki Sakai; 酒井　啓行; Kenji Matsui; 松井　謙二; Hiroyasu Kuwano; 桑野　裕康
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-06-24
Filing date: 2002-11-19
Publication date: 2004-03-18

Abstract

PROBLEM TO BE SOLVED: To reduce the labor required to add metadata to video/audio content during content production.

SOLUTION: A metadata creation device comprises a voice input means, a voice recognition means, and a metadata creation means. Information associated with the video/audio content is input by the voice input means; the input voice signal is recognized by the voice recognition means and converted into metadata by the metadata creation means. The metadata or tags are thereby automatically associated with a time or scene of the content.

COPYRIGHT: (C)2004, JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a metadata production system and method in content production.
[0002]
[Prior art]
In recent years, in the production of video and audio content, it has become common to add metadata related to that content.
[0003]
However, such metadata has generally been produced by playing back the finished video/audio content, identifying the information to be recorded as metadata with reference to the scenario or narration script of the content, and entering it into a computer by hand, a method that requires considerable labor.
[0004]
[Patent Document 1]
JP-A-09-130736
[0005]
[Problems to be solved by the invention]
An object of the present invention is to solve the above problem of the conventional method by providing a system and method in which the information to be recorded as metadata is identified by playing back the produced video/audio content and is entered into a computer by voice input.
[0006]
[Means for Solving the Problems]
To solve the above problem, the present invention comprises: reproducing means for produced content; video monitor means for displaying the video signal reproduced by the reproducing means; audio monitor means for monitoring the audio signal reproduced by the reproducing means; voice input means for picking up, with a microphone, the operator's utterance of the metadata content to be produced, as confirmed by the operator on the video monitor means and the audio monitor means; voice recognition means for recognizing the voice signal input by the voice input means; metadata generating means for generating metadata by converting the voice information recognized by the voice recognition means into metadata; and time code adding means which, in order to associate the content with the metadata, receives the time code information attached to the content together with the metadata and produces metadata with a time code.
[0007]
As a result, metadata that has conventionally been entered and produced with a keyboard can be entered by voice using voice recognition, and metadata with a time code can be produced automatically.
[0008]
[Embodiments of the invention]
The invention according to claim 1 is an apparatus for producing metadata related to content, comprising voice input means, voice recognition means, and metadata production means, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means, and the recognized data is converted into metadata by the metadata production means.
[0009]
The invention according to claim 2 is an apparatus for producing metadata related to content, comprising voice input means, voice recognition means, metadata production means, and a dictionary related to the content, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means in association with the dictionary related to the content, and the recognized data is converted into metadata by the metadata production means.
[0010]
The invention according to claim 3 is the metadata production apparatus according to claim 1 or 2, further comprising time code adding means that receives the time code information attached to the content together with the metadata and generates metadata with a time code, whereby the content is associated with the generated metadata.
[0011]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0012]
(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of a metadata production apparatus according to Embodiment 1 of the present invention. In FIG. 1, 1 is content reproducing means, 2 is video monitor means, 3 is audio monitor means, 4 is a microphone, 5 is voice recognition means, 6 is metadata generating means, 7 is time code adding means, and 8 is a dictionary.
The content reproducing means 1 is, for example, a VTR. It may instead be a video/audio signal reproducing device built on a hard disk, one that uses memory such as semiconductor memory as its recording medium, one built on a rotating disk of the optical or magnetic recording type, or one that reproduces transmitted or broadcast video/audio signals. The content reproducing means 1 has a video signal output terminal 101, an audio signal output terminal 102, and a time code output terminal 103. The reproduced video signal is supplied to the video monitor means 2 via terminals 101 and 201, the reproduced audio signal is supplied to the audio monitor means 3 via terminals 102 and 302, and the reproduced time code is supplied to the time code adding means 7 via terminals 103 and 703. The producer of the metadata (not shown) utters the metadata to be entered while checking the video monitor means 2, the audio monitor means 3, or both, referring to the scenario or narration script where appropriate. The microphone 4 picks up the producer's utterance, converts it into a voice signal, and supplies it to the voice recognition means 5. The dictionary 8 for voice recognition is also supplied to the voice recognition means 5 as needed. The voice data recognized by the voice recognition means 5 is supplied to the metadata generating means 6 and converted into metadata or tags. So that the metadata or tags generated in this way roughly match the content in time or in scene, the time code adding means 7 attaches to them the time code information supplied from the content reproducing means 1.
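The signal flow described above (microphone, voice recognition, metadata generation, time code addition) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `TimecodedTag`, `make_pipeline`, `recognize`, and `current_timecode` are hypothetical and do not appear in the specification, and the recognizer is a stand-in that simply splits text, not a real speech engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TimecodedTag:
    timecode: str  # time code read from the player when the tag was spoken
    tag: str       # one recognized word converted to a tag


def make_pipeline(
    recognize: Callable[[bytes], List[str]],  # stands in for voice recognition means 5
    current_timecode: Callable[[], str],      # stands in for time code output terminal 103
) -> Callable[[bytes], List[TimecodedTag]]:
    """Return a function that turns one utterance into time-coded tags."""

    def process(utterance_audio: bytes) -> List[TimecodedTag]:
        words = recognize(utterance_audio)  # voice signal -> recognized words
        timecode = current_timecode()       # time code supplied by the player
        # Roles of metadata generating means 6 and time code adding means 7.
        return [TimecodedTag(timecode, word) for word in words]

    return process


# Hypothetical stand-ins: a fake recognizer and a fixed time code.
def fake_recognizer(audio: bytes) -> List[str]:
    return audio.decode("utf-8").split()


pipeline = make_pipeline(fake_recognizer, lambda: "00:01:23:10")
result = pipeline("salt 1-spoonful".encode("utf-8"))
```

Passing the recognizer and the time code source in as functions mirrors the block diagram, where each means is a separate unit connected by terminals.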
[0013]
To make this more concrete, take a scene in which a recipe is being explained. When the operator, watching the display of the video monitor means 2, utters "salt, 1 spoonful" into the microphone 4, the voice recognition means 5 refers to the dictionary 8 and recognizes "salt" and "1 spoonful", which the metadata generating means 6 converts into the tags "salt" and "1 spoonful", respectively. Voice recognition is not limited to the voice recognition means 5 described above; any of the various commonly used methods may be used, and the utterance may be recognized as the readings "shio" and "hitosaji".
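The specification does not say how the dictionary 8 is consulted. One simple possibility, shown here purely as an assumption, is a longest-match lookup of an already recognized transcript against a content-specific word list; a real recognizer would more likely use the dictionary during decoding. The function `to_tags` and the sample dictionary are hypothetical.

```python
from typing import List, Set


def to_tags(transcript: str, dictionary: Set[str]) -> List[str]:
    """Split a recognized transcript into tags using longest-match lookup."""
    words = transcript.split()
    tags: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in dictionary:
                tags.append(phrase)
                i = j
                break
        else:
            # No dictionary entry starts at this word; skip it.
            i += 1
    return tags


# Hypothetical dictionary for a cooking program.
cooking_dictionary = {"salt", "sugar", "1 spoonful", "2 spoonfuls"}
tags = to_tags("salt 1 spoonful", cooking_dictionary)
```

Restricting the output to dictionary entries is one way a content-specific vocabulary can suppress stray words in the utterance.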
In general, metadata means a collection of such tags. The time code adding means 7 generates metadata with a time code based on the time code signal that the content reproducing means 1 outputs from terminal 103 and that arrives at terminal 703. Specifically, packet data as shown in FIG. 2 is generated. The generated metadata may be output as is, or stored on a recording medium such as a hard disk. Although the above embodiment describes generating metadata in packet form, the format is not limited to this.
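FIG. 2 is not reproduced in this text, so the actual packet layout is unknown. As a minimal sketch, assuming a packet that simply pairs one time code with the tags spoken at that point, and assuming an HH:MM:SS:FF time code string, the packet could be serialized as JSON. The function `make_packet` and the field names are hypothetical.

```python
import json
from typing import List


def make_packet(timecode: str, tags: List[str]) -> str:
    """Serialize one time-coded metadata packet as JSON (assumed layout)."""
    # Basic check for an HH:MM:SS:FF time code (an assumption, not from FIG. 2).
    parts = timecode.split(":")
    if len(parts) != 4 or not all(p.isdigit() and len(p) == 2 for p in parts):
        raise ValueError(f"unexpected time code: {timecode!r}")
    return json.dumps({"timecode": timecode, "tags": tags}, ensure_ascii=False)


packet = make_packet("00:01:23:10", ["salt", "1 spoonful"])
```

The resulting string can be written to a file or sent onward as is, matching the statement that the metadata may be output directly or stored.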
[0014]
The above embodiment describes moving-image content to which a time code has been attached. For still-image content or digital data content, the content and the generated metadata may instead be associated using an address or number that identifies the content, which plays the role the time code plays for moving images.
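One way to picture this generalization is an index whose key is either a time code or a content number. The sketch below is an assumption for illustration only; the types `Timecode` and `ContentNumber` and the function `attach` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Timecode:
    value: str  # e.g. "00:01:23:10" for moving-image content


@dataclass(frozen=True)
class ContentNumber:
    value: int  # e.g. the number or address of a still image


ContentKey = Union[Timecode, ContentNumber]


def attach(index: Dict[ContentKey, List[str]], key: ContentKey, tags: List[str]) -> None:
    """Associate tags with a position in the content, whatever kind of key it is."""
    index.setdefault(key, []).extend(tags)


index: Dict[ContentKey, List[str]] = {}
attach(index, Timecode("00:01:23:10"), ["salt"])
attach(index, ContentNumber(42), ["finished dish"])
attach(index, ContentNumber(42), ["close-up"])
```

Because both key types are frozen dataclasses, they are hashable and can share one dictionary without being confused with each other.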
[0015]
In general, some erroneous recognition may occur in voice recognition. If misrecognition occurs, it is possible to correct the produced metadata and tags using information processing means such as computer means.
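A correction step of this kind could be as simple as replacing one tag at a given time code. The following is a sketch under the assumption that metadata is held as (time code, tag) pairs; the function `correct_tag` is hypothetical and the specification does not describe a particular correction procedure.

```python
from typing import List, Tuple

Entry = Tuple[str, str]  # (time code, tag)


def correct_tag(entries: List[Entry], timecode: str, wrong: str, right: str) -> List[Entry]:
    """Return a copy of entries with one misrecognized tag replaced."""
    return [
        (tc, right if (tc == timecode and tag == wrong) else tag)
        for tc, tag in entries
    ]


entries = [
    ("00:01:23:10", "salt"),
    ("00:01:23:10", "1 spoonful"),
    ("00:02:05:00", "sold"),  # misrecognition of "salt"
]
fixed = correct_tag(entries, "00:02:05:00", "sold", "salt")
```

Returning a new list leaves the original recognition result untouched, which makes it easy to review what was changed.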
[0016]
[Effect of the invention]
As described above, the present invention uses voice recognition of spoken input to create metadata or tags related to content, and associates the metadata or tags with a time or scene of the content, so that metadata creation and tagging can be carried out more efficiently than with conventional keyboard input.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of the metadata production apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of metadata with a time code according to Embodiment 1 of the present invention.
[Description of symbols]
1 Content reproducing means
2 Video monitor means
3 Audio monitor means
4 Microphone
5 Voice recognition means
6 Metadata generating means
7 Time code adding means
8 Dictionary
101 Video output terminal
102 Audio output terminal
103 Time code output terminal
201 Video input terminal
302 Audio input terminal
703 Time code input terminal

Claims

An apparatus for producing metadata related to content, comprising
voice input means, voice recognition means, and metadata production means,
wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means, and the recognized data is converted into metadata by the metadata production means.

An apparatus for producing metadata related to content, comprising
voice input means, voice recognition means, computer means including a keyboard, and metadata production means, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means, the recognized data is converted into metadata by the metadata production means, and, when the recognized data is judged to have been misrecognized, it is corrected using the computer means including the keyboard.

An apparatus for producing metadata related to content, comprising
voice input means, voice recognition means, metadata production means, and a dictionary related to the content, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means in association with the dictionary related to the content, and the recognized data is converted into metadata by the metadata production means.

An apparatus for producing metadata related to content, comprising
voice input means, voice recognition means, metadata production means, and a dictionary related to the content, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means word by word in association with the dictionary related to the content, and the recognized data is converted into metadata by the metadata production means.

An apparatus for producing metadata related to content, comprising
voice input means, voice recognition means, computer means including a keyboard, metadata production means, and a dictionary related to the content, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means word by word in association with the dictionary related to the content, the recognized data is converted into metadata by the metadata production means, and, when the recognized data is judged to have been misrecognized, it is corrected using the computer means including the keyboard.

The metadata production apparatus according to any one of claims 1, 2, 3, 4, and 5, further comprising time code adding means that receives the time code information attached to the content together with the metadata and generates metadata with a time code, whereby the content is associated with the generated metadata.

The metadata production apparatus according to claim 1, further comprising address, number, or frame number assigning means that receives the address, number, or frame number assigned to the content together with the metadata and generates metadata carrying that address, number, or frame number, whereby the content is associated with the generated metadata.

The metadata production apparatus according to any one of claims 1, 2, 3, 4, 5, 6, or 7, wherein the information related to the content is input by the voice input means while checking at least one of the video monitor means and the audio monitor means for monitoring the content.

The metadata production apparatus according to claim 1, wherein, when the information related to the content is input by the voice input means, the input is made while checking one or both of the video monitor means and the audio monitor means for monitoring the content and while referring to the scenario or narration script of the content.

A method of producing metadata related to content,
using voice input means, voice recognition means, and metadata production means, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means, and the recognized data is converted into metadata by the metadata production means.

A method of producing metadata related to content,
using voice input means, voice recognition means, metadata production means, and a dictionary related to the content, wherein information related to the content is input by the voice input means, the input voice signal is recognized by the voice recognition means in association with the dictionary related to the content, and the recognized data is converted into metadata by the metadata production means.

The metadata production method according to any one of claims 1 and 2, wherein time code adding means that generates metadata with a time code is used to receive the time code information attached to the content together with the metadata, and the content is associated with the generated metadata.