JP2008072572A - Content photographing apparatus

Info

Publication number: JP2008072572A
Application number: JP2006250827A
Authority: JP
Inventors: Yoshihiro Morioka; 芳宏森岡; Yozo Yamamoto; 洋三山本; Mitsuru Yasukata; 満安方; Masaaki Kobayashi; 正明小林
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2006-09-15
Filing date: 2006-09-15
Publication date: 2008-03-27

Abstract

PROBLEM TO BE SOLVED: To provide a content photographing apparatus capable of converting voice-input event information into metadata through voice recognition without increasing cost.

SOLUTION: During a recording mode or a recording standby mode, an explanatory voice input from a voice input means is filtered by a filtering means, and the output level of the filtering means is then detected. When the output level remains equal to or higher than a preset output level for a period equal to or longer than a preset output period, a voice tag containing the time at which input of the explanatory voice started is generated. The voice tag is associated with the time information of the explanatory voice and is recorded on an information storage medium as metadata.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a content creation device that creates audio metadata, through voice recognition, from event information input by voice during shooting.

通常、映画やドラマでは絵コンテ等を元にシナリオを作成するが、プロフェッショナルによるニュース、ドキュメンタリー、バラエティ番組、生本番の舞台、およびコンサート等の撮影においてはシナリオが存在するのは稀である。このような状態は、アマチュアによる運動会や入学式、卒業式、音楽の発表会、および結婚式等の撮影においても同様である。このような状態では、撮影者は自身の予測の範囲内または予測の範囲を超えてリアルタイムに発生するイベントを撮影することになる。 Normally, scenarios are created based on storyboards in movies and dramas, but scenarios rarely exist for shooting professional news, documentaries, variety shows, live performances, concerts, and the like. This is the same for amateur sports events, entrance ceremonies, graduation ceremonies, music recitals, weddings, and other photography. In such a state, the photographer captures an event that occurs in real time within or beyond the prediction range of the photographer.

In other words, editing after shooting requires a flexible response to the events that occurred one after another. Events at live stage performances, concerts, sports days, entrance ceremonies, graduation ceremonies, music recitals, weddings, and the like do follow the progress of a program, but no scenario exists for the events themselves. The events therefore occur in real time with content beyond the photographer's expectations. For this reason, the information needed when editing the captured images must be created as metadata and stored together with the captured image data.

To do so, as events occur during shooting, the photographer or another person writes down on memo paper information about points of interest, for example the shooting date (date, morning, noon, evening, night, etc.), the shooting method (lens, camera, shot, light source, etc.), the event participants (line of sight, movement, facial expression, tension, makeup, costume condition, etc.), dialogue (keywords such as ad libs), and sound. After shooting, metadata must then be created from these notes.

However, it is no easy task to keep handwritten, real-time notes on the circumstances (content) of the events and on people's facial expressions during shooting. Consequently, a dedicated assistant is needed to create metadata from the shooting records and attach it to the captured images.

As a metadata input method and editing system that requires neither real-time note-taking nor a dedicated assistant, the one described in Patent Document 1 is known. Specifically, when metadata related to content is created or tagged, keywords extracted in advance from the scenario or the like of the produced content are input by voice. Dictionary fields are set and keywords are prioritized based on the scenario, and the metadata is created by voice recognition means. With this method, voice recognition allows metadata to be attached efficiently even at intervals of a few seconds, which is difficult by key input. Scenes can also be searched by using the metadata as search keywords.

Patent Document 1: Japanese Patent No. 3781715

しかしながら上述の従来の構成では、リアルタイムで音声入力されるキーワードをリアルタイムに認識できる音声認識エンジンをカメラに実装する必要がある。音声認識エンジンを実装するためには、カメラに音声認識用マイコン処理系の追加など基本設計の変更が必要となりコストが増大する。また、コスト増の主な原因であるリアルタイム音声認識エンジンは、撮影時のキーワード入力時以外は無用であり、コスト効率が悪い。 However, in the above-described conventional configuration, it is necessary to mount a voice recognition engine in the camera that can recognize keywords input in real time in real time. In order to implement a speech recognition engine, it is necessary to change the basic design, such as adding a speech recognition microcomputer processing system to the camera, which increases costs. In addition, the real-time speech recognition engine, which is the main cause of the cost increase, is useless except when a keyword is input at the time of shooting, and is not cost effective.

従って本発明は、音声入力されたイベント情報を、コストを増大させることなく、音声認識によりメタデータに変換できるコンテンツ撮影装置を提供することを目的とする。 Therefore, an object of the present invention is to provide a content photographing apparatus capable of converting event information inputted by voice into metadata by voice recognition without increasing cost.

To achieve the above object, the content photographing apparatus of the present invention is a content photographing apparatus that converts content, which includes any of video, audio, or data and in which an arbitrary position can be accessed by time information, into a stream and records the stream on an information storage medium in combination with metadata related to the content, the apparatus comprising:
means for filtering a commentary voice that contains additional information on the content and is input from voice input means while in a recording mode or a recording standby mode;
means for detecting the output level of the filtering means;
means for generating a voice tag containing information on the time at which input of the commentary voice started, when the output level of the filtering means is equal to or higher than a preset output level over a period equal to or longer than a preset output period; and
means for associating the voice tag containing the time information with the time information of the commentary voice and recording the voice tag on the information storage medium as metadata.

本発明のコンテンツ撮影装置において、
前記音声タグを用いて前記解説音声にアクセスし、前記解説音声を入力しテキストデータに変換する音声認識手段と、
前記音声認識手段で変換されたテキストデータを撮影されたコンテンツの時刻情報と関連付ける手段とをさらに具備することが好ましい。 In the content photographing apparatus of the present invention,
Voice recognition means for accessing the commentary voice using the voice tag, inputting the commentary voice and converting it into text data;
It is preferable to further comprise means for associating the text data converted by the voice recognition means with time information of the photographed content.

また本発明のコンテンツ撮影装置において、前記解説音声のテキストデータへの変換は、記録モードでも再生モードでもない場合にノンリアルタイムに実行されることが好ましい。 In the content photographing apparatus of the present invention, it is preferable that the conversion of the commentary sound into text data is performed in non-real time when neither the recording mode nor the reproduction mode is used.

The content photographing apparatus of the present invention preferably further comprises means for reproducing, from the recording medium, content that contains a search keyword input from search means when the search keyword matches at least part of the metadata.

本発明のコンテンツ撮影装置において、
前記音声タグを用いて前記解説音声にアクセスして該解説音声を装置外に出力する手段と、
装置外で変換された前記解説音声のテキストデータを入力する手段と、
前記テキストデータを撮影されたコンテンツの時刻情報と関連付ける手段とをさらに具備することが好ましい。 In the content photographing apparatus of the present invention,
Means for accessing the commentary voice using the voice tag and outputting the commentary voice outside the device;
Means for inputting the text data of the commentary speech converted outside the device;
It is preferable to further comprise means for associating the text data with time information of the captured content.

The content photographing apparatus of the present invention preferably further comprises control means that receives from camera control means any of the following: operation data of the lens unit of the shooting camera unit, namely at least one of the zoom state, aperture value, focal length, shutter speed, horizontal or vertical tilt angle of the lens unit, angular velocity of the lens unit rotating in the horizontal or vertical direction, and acceleration of the lens unit moving in the front-rear, left-right, or vertical direction; data input by the user; and data obtained by applying predetermined arithmetic processing to the operation data. The control means temporarily stores the received data as metadata in association with the corresponding video frame.

本発明のコンテンツ撮影装置において、記録されるメタデータのそれぞれに優先度を設定し、当該優先度も各メタデータの付加情報として前記情報記録媒体に記録する手段をさらに具備することが好ましい。 The content photographing apparatus of the present invention preferably further includes means for setting a priority for each of the recorded metadata and recording the priority on the information recording medium as additional information of each metadata.

本発明のコンテンツ撮影装置において、
読み出し優先度の設定手段と、
前記読み出し優先度の設定手段で設定された優先度よりも高い優先度を有するメタデータ情報を出力する手段とをさらに具備することが好ましい。 In the content photographing apparatus of the present invention,
Read priority setting means,
It is preferable to further comprise means for outputting metadata information having a higher priority than the priority set by the read priority setting means.

With the configuration of the present invention, the CPU load required to run the voice recognition engine can be reduced. As a result, even a consumer camcorder equipped with an inexpensive CPU of about several tens of MIPS or more can convert audio metadata into text metadata, and the introduction cost can be kept low.

Before the embodiments of the present invention are described in detail with reference to the drawings, the basic technical features of the present invention are described. In the method proposed in Patent Document 1, metadata is created in real time from the audio input during shooting, so the processing load is enormous and the demands on the resources of the content photographing apparatus inevitably become severe.

本発明においては、音声入力に対する音声認識処理を撮影時にリアルタイム処理として行うのではなく、撮影後あるいは撮影に対するリソース要求が緩和された時にバッチ的に処理する。つまり、本発明においては、撮影時にリアタイムに入力される音声において音声認識対象部分の位置を示す位置情報である一次メタデータが作成される。位置情報としては、入力音声データのタイムコードやファイル内でのバイト／ビット位置情報が利用される。 In the present invention, the voice recognition processing for voice input is not performed as real-time processing at the time of shooting, but is processed batchwise after shooting or when resource requirements for shooting are relaxed. In other words, in the present invention, primary metadata that is position information indicating the position of the speech recognition target portion is created in the sound input in real time during shooting. As the position information, the time code of the input audio data or the byte / bit position information in the file is used.

As a result, based on the primary metadata, the voice recognition engine can access the recognition target portions of the input audio data at a time when resource demands are not tight. Keywords are detected as text from the voice recognition result, and secondary metadata needed for editing the content after shooting is created.

The CPU can therefore process voice recognition in non-real time, where the load is small, rather than in real time, where the load is large. That is, because the CPU load required to run the voice recognition engine can be reduced, even a consumer camcorder equipped with an inexpensive CPU of about several tens of MIPS or more can convert audio metadata into text metadata, which has the major advantage of lowering the introduction cost. Needless to say, voice recognition may be performed in real time if the CPU has sufficient power.

図１を参照して、本発明の実施の形態に係るコンテンツ撮影装置について説明する。図１においては、コンテンツ撮影装置Ａｃｃは、カメラ１０１において記録媒体（またはバッファメモリ）上に映像データと音声データとメタデータを作成するシステムモデルの一例として表されている。 A content photographing apparatus according to an embodiment of the present invention will be described with reference to FIG. In FIG. 1, the content photographing device Acc is represented as an example of a system model that creates video data, audio data, and metadata on a recording medium (or buffer memory) in the camera 101.

In FIG. 1, reference numeral 101 denotes a camera, reference numeral 102 denotes the lens unit of the camera 101, reference numeral 103 denotes a microphone of the camera 101, and reference numeral 104 denotes the shooting target of the camera 101. The shooting target 104 is, for example, a landscape, a person, an animal such as a pet, or an object such as a car or a building.

Reference numeral 114 denotes a metadata input button, reference numeral 105 denotes data shot by the camera 101, and reference numeral 106 denotes an AV stream data file containing metadata. Reference numeral 107 denotes metadata such as shooting scene information IS (scene number, cut number, take number, and whether the recorded take is adopted, rejected, or on hold). Reference numeral 109 denotes a remote controller for the camera 101. The user operates the metadata input button 114 and the remote controller 109 to input the metadata 107 to the camera 101. The image sensor used in the camera 101 is preferably a CCD, a CMOS sensor, or the like.

Reference numeral 108 denotes a data sequence shot by the camera 101. In the data sequence 108, video data, audio data, and the metadata 107 are arranged on a time axis. The metadata 107 is handled as character data in text format, but may also be data in binary format. The data sequence 108 includes clip #1 to clip #5 of a specific scene.

Reference numeral 110 denotes a data sequence in which clip #1 to clip #5 have been joined by editing. Reference numeral 111 denotes a television that can be connected to the camera 101. Reference numeral 112 denotes a connection cable that carries signals from the camera 101 to the television 111, and reference numeral 113 denotes a connection cable that carries signals from the television 111 to the camera 101. By operating the remote controller 109 from a location away from the camera 101, the user can have the scenes listed on the television 111, via the connection cable 112, in the order of the edited data sequence.

Reference numeral 115 denotes a microphone that, like the microphone 103, detects sound and inputs it to the camera 101 as an audio signal. Reference numeral 117 denotes a microphone built into the camera 101. Whereas the microphones 103 and 117 are attached directly to the camera 101 and record sound near the camera 101, the microphone 115 is connected to the camera 101 by a cable or the like and is used to record sound far from the camera 101. As described later, an optical sensor can be used in place of the microphone 115.

テレビ１１１による一覧表示について簡単に説明する。テレビ１１１の画面において、横軸は時間の経過を表しており、それぞれのクリップを有効部と無効部に分けて示している。テレビ１１１の一覧表示において３つある有効部それぞれの代表クリップを代表サムネイルで画面上に表示している様子が例示されている。この代表クリップは、それぞれの有効部の先頭フレームであってもよいし、有効部分の途中にある代表フレームであってもよい。 A list display by the television 111 will be briefly described. In the screen of the television 111, the horizontal axis represents the passage of time, and each clip is divided into an effective part and an invalid part. In the list display of the television 111, a state in which representative clips of each of the three effective portions are displayed on the screen as representative thumbnails is illustrated. This representative clip may be the first frame of each effective portion or a representative frame in the middle of the effective portion.

The metadata input button 114 described above preferably consists of three buttons. By operating the metadata input button 114 at an important moment during shooting, the user can mark that important scene (clip); this is called the marking function. The mark indicating an important clip is also metadata 107. Using this metadata 107, a marked clip (the image of the first or representative frame of the clip, or its thumbnail) can be called up quickly by a mark search after shooting. As an example of how the three buttons are used, the first button registers an important clip, the second button switches modes, for instance enabling button operation or switching to a character input mode, and the third button cancels a registration.

また、１つ目のボタンを押している期間を重要クリップとして登録するモードに切替えることもできる。さらに、１つ目のボタンを押した時点の前後５秒、あるいは前５秒、後１０秒の合計１５秒を重要クリップとして登録するモードに切替えることもできる。ボタンが３つあれば、押すボタンの種類、タイミング、押す長さの組み合わせにより、多くの機能を実現できる。 It is also possible to switch to a mode in which the period during which the first button is pressed is registered as an important clip. Further, the mode can be switched to a mode in which 5 seconds before and after the first button is pressed, or a total of 15 seconds, 5 seconds before and 10 seconds after, are registered as important clips. If there are three buttons, many functions can be realized by combining the type of button to be pressed, timing, and pressing length.
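The registration modes described above can be sketched as a small helper that turns a button press time into the time range registered as an important clip. The function names and the clamping to the clip boundaries are assumptions for illustration; the 5-second and 10-second defaults are the example values given in the text.

```python
def important_clip_window(press_s, pre_s=5.0, post_s=10.0,
                          clip_start_s=0.0, clip_end_s=None):
    """Return (start, end) in seconds for a window around a button press.

    The window spans pre_s before and post_s after the press and is clamped
    to the clip boundaries (the clamping is an assumption, not from the source).
    """
    start = max(clip_start_s, press_s - pre_s)
    end = press_s + post_s
    if clip_end_s is not None:
        end = min(end, clip_end_s)
    return start, end


def hold_window(press_s, release_s):
    """Return (start, end) for the mode that registers the period the button is held."""
    return press_s, release_s
```

For example, `important_clip_window(20.0)` yields the 15-second range from 15.0 s to 30.0 s, and `important_clip_window(20.0, pre_s=5.0, post_s=5.0)` yields the symmetric range of 5 seconds before and after the press.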

The shooting scene information IS input as the metadata 107 is associated with the time code of the clip. Within the body of the camera 101, the metadata 107 associated with the time code is then electronically associated with the clapperboard sound and the recorded content and is generated as new metadata 107. This allows not only immediate access to the time at which the clapperboard was sounded, but also easy deletion of unneeded recorded data before that time and easy rearrangement of the scenes and cuts whose takes were adopted. For example, when shooting a sports day, the frame images at the start of a footrace (short-distance race), a long-distance race such as a relay, a tug of war, or a ball-toss game can be called up immediately.

From the data sequence, which is the material shot by the camera, the user can select the start position (time) and end position (time), or the length, of each clip and rearrange the clips. When a clip is displayed on a TV monitor or the like, the most characteristic frame of the clip can be shown as the image representing it, for example the frame (or field) image at the head of the clip or at any point from the head to the end, or a fixed image before or after a pan or zoom.

Button operations on the camcorder such as record, pause, and stop, as well as the photographer's voice, can be detected by the microphone 115 and registered as metadata associated with (marked at) a specific time code of the clip. One example of the photographer's voice is meta-information about the shooting target. Specific examples are the information on content shooting mentioned earlier: the shooting date (date, morning, noon, evening, night, etc.), the shooting method (lens, camera, shot, light source, etc.), the event participants (line of sight, movement, facial expression, tension, makeup, costume condition, etc.), dialogue (keywords such as ad libs), sound, and other points of interest.

When the audio signal input from the microphone 103, 115, or 117 is at or above a preset level, the camera 101 determines that audio has been input at that point and associates the input audio with the time code of the clip (file) being shot. As the primary metadata described above, the camera 101 creates a list containing the information associated with the time code and thereby generates audio metadata (also called a voice tag). The camera 101 then files the audio metadata together with the input audio and records it in the camera body.

以上のように、事前に設定した特定レベル以上の音声として入力された音声は音声メタデータにおいてクリップのタイムコードと関連付けられる。また入力音声は、音声メタデータとともに、タイムコードと１対１対応の音声メタデータ専用チャネルに記録される。 As described above, audio input as audio of a predetermined level or higher set in advance is associated with the clip time code in audio metadata. The input audio is recorded together with the audio metadata in a channel dedicated to audio metadata that has a one-to-one correspondence with the time code.
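The level-based tagging described above can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: it assumes the audio has already passed through the filtering means, uses per-frame RMS as the level detector, and applies the minimum-duration condition stated in the summary of the invention. All names, the frame-based processing, and the threshold values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class VoiceTag:
    start_s: float  # time at which the commentary voice started
    end_s: float    # time at which the level dropped below the threshold


def frame_levels(samples, frame_len):
    """RMS level of each complete frame of frame_len samples (assumed level detector)."""
    levels = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def detect_voice_tags(levels, frame_s, min_level, min_duration_s):
    """Emit a tag for every run of frames at or above min_level lasting min_duration_s or more."""
    tags = []
    run_start = None
    # A trailing sentinel below any finite threshold closes a run that reaches the end.
    for i, level in enumerate(list(levels) + [float("-inf")]):
        if level >= min_level:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_s >= min_duration_s:
                tags.append(VoiceTag(run_start * frame_s, i * frame_s))
            run_start = None
    return tags


levels = [0.0, 0.5, 0.6, 0.7, 0.0, 0.9, 0.0]
tags = detect_voice_tags(levels, frame_s=0.1, min_level=0.4, min_duration_s=0.25)
```

In this example the three-frame run starting at frame 1 produces one tag, while the single loud frame at index 5 is too short and is ignored. The resulting list of tags plays the role of the primary metadata: it records where speech is, not what was said.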

Therefore, by referring to the audio metadata list, the audio portions recorded in the channel dedicated to audio metadata can be accessed. The voice recognition engine built into the camera uses the audio position information in the file to access each audio portion quickly and converts the audio into text. The converted text data is then added to the metadata list, generating the secondary metadata described above.
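The deferred conversion from primary to secondary metadata can be sketched as below. The recognizer is passed in as a plain callable and is only a stand-in; no real speech recognition API is implied. The mode names "record", "playback", and "idle", the sample-indexed audio buffer, and all identifiers are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TagEntry:
    start_s: float
    end_s: float
    text: Optional[str] = None  # filled in later; this is the secondary metadata


def slice_audio(samples, sample_rate, start_s, end_s):
    """Use the tag's time information to jump straight to the tagged audio portion."""
    return samples[int(start_s * sample_rate):int(end_s * sample_rate)]


def convert_pending(tags, samples, sample_rate, recognize, camera_mode):
    """Transcribe untranscribed tags, but only when the camera is neither recording nor playing."""
    if camera_mode in ("record", "playback"):
        return 0  # defer: the CPU is busy with real-time work
    done = 0
    for tag in tags:
        if tag.text is None:
            segment = slice_audio(samples, sample_rate, tag.start_s, tag.end_s)
            tag.text = recognize(segment)
            done += 1
    return done


def fake_recognize(segment):
    """Stand-in recognizer that only reports the segment length."""
    return f"<{len(segment)} samples>"


samples = [0] * 80
tags = [TagEntry(1.0, 3.0)]
skipped = convert_pending(tags, samples, 10, fake_recognize, camera_mode="record")
converted = convert_pending(tags, samples, 10, fake_recognize, camera_mode="idle")
```

Because only the tagged portions are passed to the recognizer, and only while the camera is idle, the recognizer never has to keep up with the live audio stream.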

以上により、撮影イベントに関するメタデータをテキストデータとして記録できるため、編集や視聴などのアプリケーションにおいて、撮影内容の検索性を向上できる。 As described above, since the metadata related to the shooting event can be recorded as text data, it is possible to improve the searchability of shooting contents in applications such as editing and viewing.
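As a rough illustration of the improved searchability, text metadata keyed by time code can be searched with a simple substring match. The pair layout, the case-insensitive matching rule, and the sample entries are assumptions; the source states only that a search keyword is matched against at least part of the metadata.

```python
def search_metadata(entries, keyword):
    """Return the (timecode_s, text) pairs whose text contains keyword, ignoring case."""
    key = keyword.casefold()
    return [(t, text) for t, text in entries if key in text.casefold()]


entries = [
    (12.0, "Relay race start"),
    (95.5, "Tug of war"),
    (180.0, "relay final lap"),
]
hits = search_metadata(entries, "relay")
```

Each hit carries the time code of the matching commentary, so a player could seek directly to that position in the recorded content.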

The voice recognition engine is normally realized by executing voice recognition software on a CPU. In the present embodiment, the input audio is filed together with the audio metadata and recorded in the camera body, so the voice recognition engine can use the management information of the shot content (for example, the time code or the byte/bit position in the file) to access the audio metadata whenever necessary. The CPU can therefore execute voice recognition in non-real time when the load is small because neither shooting nor playback is in progress, rather than in real time when the load is large, as during shooting.

That is, according to the present embodiment, the CPU load incurred when running the voice recognition engine can be reduced. As a result, even a consumer camcorder equipped with an inexpensive CPU of about several tens of MIPS or more can run the voice recognition engine in non-real time and convert the input audio into text metadata, which is the secondary metadata. The voice recognition engine need not be built into the camera. It may instead be built into a PC, in which case the file containing the input audio and the audio metadata is passed to the PC and processed there, or the camera is connected to the PC by a cable for processing.

The internal configuration of the camera 101 is described with reference to FIG. 2. The camera 101 contains a zoom control unit 201, a focus control unit 202, an exposure control unit 203, an image sensor 204, a shutter speed control unit 205, a camera microcomputer 206, an absolute tilt sensor 207, an angular velocity sensor 208, a front-rear/left-right/vertical acceleration sensor 209, a user input system 210, a camera signal processing unit 211, an audio processing system 212, an H.264 encoder 213, a recording medium 214, and an output interface 215.

ズーム制御部２０１はレンズ部１０２のズーム動作を制御する。フォーカス制御部２０２は、レンズ部１０２のフォーカス動作を制御する。露出制御部２０３は、レンズ部１０２の露出調整動作を制御する。シャッタ速度制御部２０５は、撮像素子２０４のシャッタ速度調整動作を制御する。絶対傾きセンサ２０７は、カメラ１０１の水平／垂直方向の絶対傾きを検出する。角速度センサ２０８は、カメラ１０１の水平／垂直方向の角速度を検出する。加速度センサ２０９は、カメラ１０１の前後／左右／垂直の加速度を検出する。 The zoom control unit 201 controls the zoom operation of the lens unit 102. The focus control unit 202 controls the focus operation of the lens unit 102. The exposure control unit 203 controls the exposure adjustment operation of the lens unit 102. A shutter speed control unit 205 controls the shutter speed adjustment operation of the image sensor 204. The absolute tilt sensor 207 detects the absolute tilt of the camera 101 in the horizontal / vertical direction. An angular velocity sensor 208 detects the angular velocity of the camera 101 in the horizontal / vertical direction. The acceleration sensor 209 detects the front / rear / left / right / vertical acceleration of the camera 101.

The user input system 210 accepts user operations through buttons and the like and generates instruction signals. The audio processing system 212 accepts input from the built-in microphone 117, the external microphone 103, or the microphone 115. The H.264 encoder 213 detects the sound of a clapperboard in the audio input to the audio processing system 212 and generates clapperboard sound detection metadata.

The operating parameters of the image sensor 204 comprise image sensor operation data that is at least one of: chromaticity space information of the three primary color points, the white coordinates, gain information for at least two of the three primary colors, color temperature information, Δuv (delta uv), and gamma information of the three primary colors or of the luminance signal. In the present embodiment, as an example, the chromaticity space information of the three primary color points, the gain information for R (red) and B (blue) among the three primary colors, and the gamma curve information for G (green) are handled as metadata. The chromaticity space information of the three primary color points indicates the range of colors that can be reproduced in the color space. The gain information for R (red) and B (blue) indicates the color temperature. The gamma curve information for G (green) indicates the gradation characteristics.

レンズのズーム情報、レンズのフォーカス情報、レンズの露出情報、撮像素子のシャッタ速度情報、水平／垂直方向の絶対傾き情報、水平／垂直方向の角速度情報、前後／左右／垂直の加速度情報、ユーザの入力したボタン情報やシーン番号、カット番号、テーク番号、その収録テークの採用、不採用、保留などに関する情報、３原色点の色度空間情報、３原色のうちＲ（赤）とＢ（青）のゲイン情報、およびＧ（緑）のガンマカーブ情報は、カメラマイコン２０６においてメタデータ１０７（カメラメタと呼ぶ）として取り扱われる。 Lens zoom information, lens focus information, lens exposure information, image sensor shutter speed information, horizontal / vertical absolute tilt information, horizontal / vertical angular velocity information, front / rear / left / right / vertical acceleration information, user's information Input button information, scene number, cut number, take number, information on adoption, non-adoption, hold, etc. of the recorded take, chromaticity space information of the three primary colors, R (red) and B (blue) of the three primary colors Gain information and G (green) gamma curve information are handled as metadata 107 (referred to as camera meta) in the camera microcomputer 206.

Information (image data) captured by the image sensor 204 undergoes per-pixel processing such as pixel defect correction and gamma correction in the camera signal processing unit 211, is compressed by the H.264 encoder 213, and is then stored on the recording medium 214 together with the camera meta described above. The AV output of the H.264 encoder 213 and the camera meta output of the camera microcomputer 206 are each output through the output interface 215.

Next, the AVC video compression scheme and the AAC audio compression scheme employed in the present embodiment will be described with reference to FIG. 3. FIG. 3 shows the detailed configuration of the video and audio compression engines, and their peripheral processing means, in the AV signal compression and recording control unit inside the camera 101. In the figure, reference numeral 301 denotes a video encoding unit, reference numeral 302 denotes a VCL (Video Coding Layer)-NAL (Network Abstraction Layer) unit buffer, and reference numeral 303 denotes an AAC audio encoding unit.

Reference numeral 304 denotes a PS (Parameter Set) buffer, reference numeral 305 denotes a VUI (Video Usability Information) buffer, reference numeral 306 denotes an SEI (Supplemental Enhancement Information) buffer, reference numeral 307 denotes a non-VCL-NAL unit buffer, and reference numeral 308 denotes an MPEG-PES packet generation unit. Reference numeral 309 denotes an MPEG-TS (MPEG Transport Packet) generation unit, reference numeral 310 denotes an ATS (Arrival Time Stamp) packet generation unit, and reference numeral 311 denotes an EP-map generation unit.

As shown in FIG. 3, the video signal is converted into VCL-NAL unit format data by the video encoding unit 301 and then held temporarily in the VCL-NAL unit buffer 302. The audio signal, externally input PS data, and externally input VUI data are converted into non-VCL-NAL unit format data by the audio encoding unit 303, the PS buffer 304, and the VUI buffer 305, respectively, and then held temporarily in the non-VCL-NAL unit buffer 307. Likewise, real-time metadata is converted into non-VCL-NAL unit format data by the SEI buffer 306 and then held temporarily in the non-VCL-NAL unit buffer 307.

The MPEG-PES packet generation unit 308 generates MPEG-PES packets (labeled "PES" in FIG. 3) from the VCL-NAL unit format data output by the VCL-NAL unit buffer 302 and the non-VCL-NAL unit format data output by the non-VCL-NAL unit buffer 307. The MPEG-TS generation unit 309 then generates 188-byte MPEG-TS packets (labeled "TS" in FIG. 3) from the MPEG-PES packets output by the MPEG-PES packet generation unit 308.

The ATS packet generation unit 310 prepends a 4-byte header containing a time stamp to each 188-byte MPEG-TS packet output by the MPEG-TS generation unit 309, producing a 192-byte ATS packet (labeled "ATS" in FIG. 3). The time stamp indicates the time at which each MPEG-TS packet arrived at the ATS packet generation unit 310. The time stamp clock runs at 27 MHz. All four bytes may be used for the time stamp. Alternatively, 30 of the 32 bits may be used for the time stamp, with the remaining 2 bits used for purposes such as content protection flags.
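
The 30-bit/2-bit variant of the ATS header can be illustrated with a minimal Python sketch. The description does not specify the bit order or byte order of the header, so placing the two flag bits in the most significant positions and using big-endian packing are assumptions made here for illustration only; the function names are likewise hypothetical.

```python
import struct

TS_PACKET_SIZE = 188               # fixed MPEG-TS packet length in bytes
ATS_HEADER_SIZE = 4                # header prepended by the ATS packet generation unit
TIMESTAMP_MASK = (1 << 30) - 1     # low 30 bits: arrival time in 27 MHz ticks (assumed layout)


def make_ats_packet(ts_packet: bytes, arrival_ticks: int, flags: int = 0) -> bytes:
    """Prepend a 4-byte header (2 flag bits + 30-bit time stamp) to a TS packet."""
    if len(ts_packet) != TS_PACKET_SIZE:
        raise ValueError("a TS packet must be exactly 188 bytes")
    if not 0 <= flags <= 0b11:
        raise ValueError("flags must fit in 2 bits")
    # The time stamp wraps around when the 30-bit counter overflows.
    header = (flags << 30) | (arrival_ticks & TIMESTAMP_MASK)
    return struct.pack(">I", header) + ts_packet


def parse_ats_packet(ats_packet: bytes):
    """Return (flags, arrival_ticks, ts_packet) from a 192-byte ATS packet."""
    (header,) = struct.unpack(">I", ats_packet[:ATS_HEADER_SIZE])
    return header >> 30, header & TIMESTAMP_MASK, ats_packet[ATS_HEADER_SIZE:]
```

With a 27 MHz clock, a 30-bit counter wraps roughly every 39.8 seconds, so a reader of such a stream would need to track wrap-arounds when reconstructing absolute arrival times.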

The EP-map generation unit 311 outputs, as the EP-MAP, pairs consisting of the PTS (Presentation Time Stamp) of the first picture of each GOP (Group of Pictures) in the stream and the serial number of the first ATS packet of that picture. Because the PTS and DTS (Decode Time Stamp) are contained in the PES packet header, they are easy to extract. The serial number of the first ATS packet of the first picture of a GOP is obtained by counting ATS packets from the start of the stream, with the first ATS packet of the stream numbered 1. The relationship between the EP-MAP, which pairs the PTS of the first picture of each GOP with an ATS serial number, and stream editing and playlists is described later.
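
The pairing performed by the EP-map generation unit can be sketched as follows. This is an illustrative Python model, not the patent's implementation: the `AtsPacketInfo` record and the idea that each packet already carries a "GOP start" marker are assumptions introduced to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class AtsPacketInfo:
    # PTS of the picture if this ATS packet is the first packet of the first
    # picture of a GOP; None for every other packet (hypothetical annotation).
    gop_start_pts: Optional[int]


def build_ep_map(packets: Iterable[AtsPacketInfo]) -> List[Tuple[int, int]]:
    """Return (PTS, ATS serial number) pairs, one per GOP."""
    ep_map = []
    # Serial numbers count from 1 at the head of the stream.
    for serial, packet in enumerate(packets, start=1):
        if packet.gop_start_pts is not None:
            ep_map.append((packet.gop_start_pts, serial))
    return ep_map
```

Because each entry maps a presentation time to a packet position, a player can seek to a GOP boundary without parsing the stream from its beginning.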

The H.264/AVC scheme is described in detail in, for example, "H.264/AVC Textbook" (supervised by Sakae Okubo, published by Impress Corporation). The MPEG-TS (Moving Picture Experts Group, Transport Stream) signal is specified in IEC 61883-4. An MPEG-TS is a sequence of MPEG transport packets (abbreviated "TS packets"). A TS packet is a fixed-length packet of 188 bytes. This length was chosen for compatibility with the ATM cell length (47 bytes of ATM payload within a 53-byte cell) and for suitability when applying error correction coding such as Reed-Solomon codes.

A TS packet consists of a fixed-length 4-byte packet header, a variable-length adaptation field (adaptation_field), and a payload. The packet header defines the PID (packet identifier) and various flags. The PID identifies the type of the TS packet. A packet may contain only the adaptation_field, only the payload, or both; which of these are present is indicated by a flag in the packet header (adaptation_field_control). The adaptation_field carries information such as the PCR (Program_Clock_Reference) and also provides stuffing within the TS packet so that the packet is always exactly 188 bytes long.
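
As an illustration of how the PID and adaptation_field_control are read from the 4-byte header, the following Python sketch parses a single TS packet. The bit positions are taken from the MPEG-2 Systems standard (ISO/IEC 13818-1) rather than from this description, and the function name and returned dictionary keys are hypothetical.

```python
def parse_ts_header(packet: bytes) -> dict:
    """Extract the main fields of the 4-byte TS packet header."""
    if len(packet) != 188:
        raise ValueError("a TS packet must be exactly 188 bytes")
    if packet[0] != 0x47:
        raise ValueError("missing sync byte 0x47")
    # 13-bit PID: low 5 bits of byte 1 followed by all 8 bits of byte 2.
    pid = ((packet[1] & 0x1F) << 8) | packet[2]
    # adaptation_field_control: bits 5-4 of byte 3.
    afc = (packet[3] >> 4) & 0x03
    return {
        "pid": pid,
        "payload_unit_start": bool(packet[1] & 0x40),
        "has_adaptation_field": bool(afc & 0x02),
        "has_payload": bool(afc & 0x01),
        "continuity_counter": packet[3] & 0x0F,
    }
```

A demultiplexer would typically call such a function on every packet and route the packet by its PID before looking at the adaptation field or payload.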

In MPEG-2, the PCR is a 27 MHz time stamp, and the PCR value is referenced so that the decoder's STC (System Time Clock) can reproduce the reference time used at encoding. The clock of the time stamp added to each TS packet is, for example, equal to the MPEG system clock frequency. In addition, the packet transmission device includes clock recovery means that receives TS packets and, using the time stamps added to the received TS packets, removes the transmission jitter imposed on the PCR by network transmission of the MPEG-TS and recovers the MPEG system clock.

In an MPEG-2 TS, the decoder's STC has a PLL synchronization function driven by the PCR. To keep this PLL synchronization stable, the MPEG standard specifies that PCRs be transmitted at intervals of no more than 100 msec. An MPEG-PES packet containing an individual stream such as video or audio is divided across the payloads of multiple TS packets that share the same PID and is transmitted in that form. The start of a PES packet is arranged to coincide with the start of a TS packet.

A transport stream can carry multiple programs multiplexed together. To make this possible, table information is used that describes the relationship between the programs contained in the stream and the program elements, such as video and audio streams, that make up each program. This table information is called PSI (Program Specific Information) and includes tables such as the PAT (Program Association Table) and the PMT (Program Map Table). PSI such as the PAT and PMT is placed in the payload of TS packets in units called sections and transmitted.

The PAT specifies, among other things, the PID of the PMT corresponding to each program number. The PMT lists the PIDs of the video, audio, and additional data contained in the corresponding program, as well as the PID of the PCR. By referring to the PAT and PMT, the TS packets that make up a desired program can therefore be extracted from the stream. A reference on the TS is, for example, CQ Publishing, TECH I Vol. 4, "All About Image & Audio Compression Technology (Essential Technology for the Era of the Internet, Digital Television, and Mobile Communication)", supervised by Hiroshi Fujiwara, Chapter 6, "MPEG Systems for Multiplexing Images and Audio".
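
The two-step lookup through the PAT and PMT can be sketched in Python as follows. The tables are modeled here as already-parsed dictionaries; the dictionary layout and the names `pids_for_program`, `pcr_pid`, and `elementary_pids` are assumptions for illustration and do not reflect the binary section syntax.

```python
from typing import Dict, List, Set


def pids_for_program(pat: Dict[int, int],
                     pmts: Dict[int, dict],
                     program_number: int) -> Set[int]:
    """Return the PIDs needed to extract one program from a transport stream."""
    pmt_pid = pat[program_number]      # PAT: program number -> PID of its PMT
    pmt = pmts[pmt_pid]                # PMT: PIDs of the program's elements
    return {pmt_pid, pmt["pcr_pid"], *pmt["elementary_pids"]}


def filter_program(packet_pids: List[int], wanted: Set[int]) -> List[int]:
    """Return the indices of the packets whose PID belongs to the program."""
    return [i for i, pid in enumerate(packet_pids) if pid in wanted]
```

In a real demultiplexer the PAT itself is carried on PID 0, so the receiver first collects PID 0 packets, parses the PAT, and only then knows which PID carries the PMT.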

The logical hierarchical structure of PSI and SI, examples of processing procedures, and examples of channel selection processing are explained in "Channel Selection Technology in Digital Broadcast Receivers", Miyake et al., Sanyo Technical Review, VOL. 36, JUNE 2004, No. 74, pages 31 to 44.

The real-time metadata input to the SEI buffer 306 described above also includes video and audio format information, time codes identifying video frames, and types other than the metadata described so far. Specifically, it includes metadata created from general data, metadata obtained from the SI (Service Information; program arrangement information) of received digital broadcasts, metadata such as EPG information obtained from an EPG provider, metadata such as EPG obtained from the Internet, and metadata associated with AV content (still images, audio, and moving images such as clips) shot by individuals with a camcorder. It further includes camera meta such as lens zoom information, lens focus information, lens exposure information, image sensor shutter speed information, horizontal/vertical absolute tilt information, horizontal/vertical angular velocity information, front-rear/left-right/vertical acceleration information, button information entered by the user, the scene number, cut number, and take number, information on whether the recorded take is adopted, rejected, or on hold, and the previously described chromaticity space information of the three primary color points, white point coordinates, gain information for at least two of the three primary colors, color temperature information, Δuv (delta uv), and gamma information for the three primary colors or the luminance signal.

As for the metadata format, the standard specifications of UPnP and UPnP-AV, for example, define properties (property) and attributes (attribute). These are published at http://upnp.org and can be expressed in description languages such as XML (Extensible Markup Language) and BML (Broadcast Markup Language).

At http://upnp.org, specifications such as "Device Architecture V1.0", "ContentDirectory:1 Service Template Version 1.01", and, in relation to "MediaServer V1.0 and MediaRenderer V1.0", "MediaServer V1.0", "MediaRenderer V1.0", "ConnectionManager V1.0", "ContentDirectory V1.0", "RenderingControl V1.0", "AVTransport V1.0", and "UPnP-AV Architecture V.83" are published. As for metadata standards, metadata formats are defined by EBU P/Meta, the SMPTE KLV scheme, TV-Anytime, MPEG-7, and others, and are explained in, for example, "Standardization Trends of Metadata for Information Retrieval", Journal of the Institute of Image Information and Television Engineers, Vol. 55, No. 3.

So that the person shooting with a camcorder or similar device, the content creator, or the copyright holder of the content can assign a value to each item of metadata and collect usage fees according to how, and how often, users use the content, metadata that assigns a value can be associated with each item of metadata. This value-assigning metadata may be given as an attribute of the metadata concerned or as an independent property.

As a concrete example, consider information about the recording device and the recording conditions. When metadata such as the device ID of the camcorder, or metadata created and registered by the photographer, the content creator, or the copyright holder of the content, is considered valuable enough to require a license, the present invention may incorporate a configuration that executes an authentication-based licensing process before the metadata can be used.

In this case, the photographer creates an encrypted file of the captured video content and uploads the encrypted file to a server on the Internet. A description of the encrypted file and some of its images can then be made public so that interested people can purchase it. If a valuable news source has been recorded, it can also be put up for auction among the news departments of multiple broadcasting stations.

Using this metadata makes it possible to use large amounts of AV content efficiently. Specific uses include searching for desired content among many items, library classification, extending recording time, automatic display, and content sales. Recording time can be extended by, for example, lowering the resolution of low-value video content, reducing it to audio and still images only (for example, by extracting MPEG I pictures or H.264 IDR pictures), or reducing it to still images only.

Next, the H.264 stream structure will be described with reference to FIG. 4. FIG. 4(A) shows the GOP structure of video composed of I (including IDR), B, and P pictures.

FIG. 4(B) shows that each picture is composed of VCL and non-VCL NAL units. NAL (video) is a video NAL unit, NAL (Audio) is an audio NAL unit, and NAL (SEI) is an SEI NAL unit. Metadata generated in real time can be inserted into NAL (SEI).

Metadata generated in real time includes time codes synchronized with video frames and marking information added by pressing a button during important scenes. Available time codes include the SMPTE time code (SMPTE 12M), MTC (MIDI Time Code), LTC (Longitudinal Time Code), VITC (Vertical Interval Time Code), and the time codes defined for DV (IEC 61834, IEC 61883) and DVCPRO (SMPTE 314M). Time codes derived from these can also be used as metadata.

FIG. 4(C) shows the structure of a PES packet. A PES packet is formed by adding a PES packet header to the picture data shown in FIG. 4(B). The PES packet header can include the MPEG PTS/DTS as a header option. From the H.264 point of view, a PES packet is handled as one AU (Access Unit).

As shown in FIG. 4(D), in this example the PES packet is divided into 188-byte units to generate MPEG-TS packets.

As shown in FIG. 4(E), in this example an ATS packet is formed by adding a 4-byte header containing a time code to each MPEG-TS packet.

Next, the relationship between a playlist and a stream will be described with reference to FIG. 5. As described above with reference to FIG. 3, the ATS packets are output from the ATS packet generation unit 310 together with the EP-MAP (an example is shown in FIG. 5(B)), which pairs the PTS of the first picture of each GOP with the serial number of its first ATS packet, and both are used for stream editing and playlist creation. FIG. 5(A) shows an example of a playlist. The playlist object is "name_2005年運動会", which carries the name "2005年運動会" (2005 Sports Day).

The playlist "name_2005年運動会" (2005 Sports Day) consists of two play items (PlayItem): the play item objects "iname_演技" and "iname_かけっこ", named "演技" (Performance) and "かけっこ" (Footrace), respectively. The IN point and OUT point of each play item are expressed as a pair consisting of the PTS of the picture concerned and the ATS serial number counted from the start of the stream (FIG. 5(B)). A play item identifies a stream and, by means of the ATS serial number, identifies a position from the start of that stream in units of 192 bytes. In FIG. 5(B) and FIG. 5(C), "iname_演技" corresponds to the span from (1) to (2) on the stream, and "iname_かけっこ" to the span from (3) to (4).
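
Because every ATS packet is 192 bytes long and serial numbers start at 1, an ATS serial number converts directly to a byte position in the stream file. The following Python sketch shows the arithmetic; treating the OUT point as inclusive is an assumption, since the description does not state whether the packet at the OUT point belongs to the play item.

```python
ATS_PACKET_SIZE = 192  # 4-byte header + 188-byte TS packet


def ats_serial_to_offset(serial: int) -> int:
    """Byte offset of the ATS packet with the given 1-based serial number."""
    if serial < 1:
        raise ValueError("ATS serial numbers start at 1")
    return (serial - 1) * ATS_PACKET_SIZE


def play_item_byte_range(in_serial: int, out_serial: int) -> tuple:
    """Return (start, end) byte offsets of a play item, end exclusive.

    The packet at the OUT point is included (assumption for illustration).
    """
    if out_serial < in_serial:
        raise ValueError("OUT point must not precede IN point")
    start = ats_serial_to_offset(in_serial)
    end = ats_serial_to_offset(out_serial) + ATS_PACKET_SIZE
    return start, end
```

For example, a play item running from ATS serial number 1 to 10 would cover bytes 0 up to, but not including, 1920 of the stream file.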

Each clip shot as described above with reference to FIG. 1 can be handled in association with a play item. For example, "iname_演技" (Performance) can be linked to the shooting scene information IS (scene number, cut number, take number, and whether the recorded take is adopted, rejected, or on hold), and the start time of the important part of the play item can be set to, for example, the time at which the clapperboard was sounded.

For "iname_かけっこ" (Footrace), the detection time of the starting pistol shot can be used to access the point at which the runners set off. The program of an event, such as "iname_演技" and "iname_かけっこ", can also be registered in the device in advance, and at shooting time the registered information can be selected from a menu displayed in the camera viewfinder and registered as metadata 107 representing a kind of shooting scenario. Because the metadata 107 is electronic data, the program can be registered, or the registered contents corrected, even after the event is over.

Next, an example of the directory structure used to record moving image files, still image files, and metadata on the information recording medium will be described with reference to FIG. 6. In FIG. 6, the "Movie", "StillPicture", and "Metadata" directories exist under root.

Under the "Movie" directory there are a group of management files, a "PLAYLIST" directory, a "CLIPINF" directory, and a "STREAM" directory. Under the "PLAYLIST" directory there are "*.rpls" files, which are real-time playlists (files), and "*.vpla" files, which are virtual-time playlists (files). The "CLIPINF" (clip information) directory contains "*.clpi" files, which are clip information files. Under the "STREAM" directory there are "*.m2ts" files, which are stream files composed of ATS packets (192 bytes). Under the "StillPicture" directory there are "*.jpeg" files, which are still images.

Under the "Metadata" directory there are a "META_PLAYLIST" directory and a "USER_METADATA" directory. Under the "META_PLAYLIST" directory there are "*.mtdt" files, which hold metadata selected from the metadata present in a playlist (file). The "USER_METADATA" directory contains a "MENU_INF" directory for the camcorder's menu settings. An EDL (Edit Decision List) produced by simple editing from the camcorder's menu can be saved here. There is also a "USER_PRIVATE" directory, which stores private metadata defined by the user. Items such as representative thumbnails and time codes for identifying clips can be recorded here.

In FIG. 6, each playlist file associates a clip information file with a metadata file. Each clip information file, in turn, is associated with a stream file composed of ATS packets (192 bytes). A major feature not found in conventional systems is that each playlist file is associated not only with a clip information file but also with a metadata file. As a result, a search using the metadata 107 can locate the playlists, play items, and streams associated with that metadata 107.
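
The chain of associations (playlist to clip information and metadata, clip information to stream) and the resulting metadata search can be modeled with a short Python sketch. The class names, field names, and the dictionary form of the metadata are assumptions for illustration; the actual file formats are not described at this level of detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClipInfo:
    filename: str          # e.g. a "*.clpi" file
    stream_file: str       # the associated "*.m2ts" stream file


@dataclass
class Playlist:
    filename: str                                  # e.g. a "*.rpls" file
    clips: List[ClipInfo] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)  # from a "*.mtdt" file


def find_streams_by_metadata(playlists: List[Playlist],
                             key: str, value: str) -> List[str]:
    """Return the stream files reachable from playlists whose metadata matches."""
    streams = []
    for playlist in playlists:
        if playlist.metadata.get(key) == value:
            streams.extend(clip.stream_file for clip in playlist.clips)
    return streams
```

The point of the sketch is that the search starts from metadata and follows the playlist's links, so no stream file has to be opened to answer the query.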

The metadata 107 will be described with reference to FIG. 7. Examples of real-time metadata input to the SEI buffer 306 include button information for buttons pressed by the user during important scenes, shooting data (the operating color temperature of the image sensor, the zoom value of the lens system, focus, elevation angle data, angular velocity, acceleration, and so on), time codes, and position data. Examples of non-real-time metadata include menu information, the scene number, cut number, and take number, information on whether the recorded take is adopted, rejected, or on hold, title lists, image recognition data, voice recognition data, shooting data such as the chromaticity space information of the three primary color points of the image sensor, external input files (files in XML or binary data format, containing text such as a scenario, that are input through an external interface), index information, format information, still images, and thumbnails. Any of these may be selected. For example, the thumbnail of a representative picture, a description of the scene, and the time code are selected.

Other metadata includes budget data, shooting schedules, shooting location data, shooting costs, lists of equipment and props, lists of vendors, and the schedules and employment costs (hourly rates) of cast, extras, and staff. Each item of metadata can be given, as an attribute, a priority that the user can set freely for each category through the camera microcomputer, so that when metadata conflicts arise in an application, the items with higher priority are processed first.
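
Priority-based handling of competing metadata can be sketched in Python as follows. The representation of a metadata item as a dictionary with a numeric `priority` attribute, and the convention that a larger number means higher priority, are assumptions made for this example.

```python
from typing import Dict, List


def order_by_priority(items: List[Dict]) -> List[Dict]:
    """Return metadata items sorted so that higher-priority items come first.

    sorted() is stable, so items with equal priority keep their input order.
    """
    return sorted(items, key=lambda item: item["priority"], reverse=True)


def resolve_conflict(items: List[Dict]) -> Dict:
    """Pick the single item to process first when several items compete."""
    if not items:
        raise ValueError("no metadata items to choose from")
    return order_by_priority(items)[0]
```

An application that can display only one annotation per scene, for instance, could call `resolve_conflict` on all annotations attached to that scene.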

The operation of searching for a specific shooting scene (picture) will be described with reference to FIG. 8. The purposes of searching include cueing, rough editing, playlist creation, and the creation and re-creation of a metadata map for searching. Based on the algorithm shown in FIG. 8, the desired picture data and the like can be found as search results in both keyword searches and event searches. Because the search results also contain the operating information of the image sensor, shooting information that reflects the photographer's intent can be conveyed to the display device for video display.

A method of changing a registered picture will be described with reference to FIG. 9. This method is useful for re-registering the correct picture when the picture found by the search method described with reference to FIG. 8 is offset from the desired picture. Specifically, a group of representative images is displayed along the time axis at a coarse interval of about 1 second, centered on the picture returned by the search. When the closest picture is selected, a group of representative images is displayed along the time axis at an interval of about 5 frames, centered on the selected picture. When one of these images is selected in turn, a group of representative images is displayed along the time axis at an interval of 1 frame, centered on the selected picture. The desired frame can be obtained at this point and re-registered as the representative image or thumbnail of a clip or playlist.
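
The three-stage coarse-to-fine refinement can be sketched in Python as follows. The frame rate of 30 frames per second and the number of representative images shown per stage (five) are assumptions chosen for illustration; the description specifies only the approximate intervals.

```python
from typing import List, Optional


def refinement_steps(frames_per_second: int = 30) -> List[int]:
    """Frame intervals for the three stages: about 1 second, 5 frames, 1 frame."""
    return [frames_per_second, 5, 1]


def candidate_frames(center: int, step: int, count: int = 5,
                     total_frames: Optional[int] = None) -> List[int]:
    """Frame numbers of the representative images shown around `center`.

    Candidates before frame 0 or past the end of the clip are dropped.
    """
    half = count // 2
    frames = [center + i * step for i in range(-half, half + 1)]
    return [f for f in frames
            if f >= 0 and (total_frames is None or f < total_frames)]
```

In use, the frame the user selects at one stage becomes the `center` for the next stage, so three selections narrow a roughly 4-second window down to a single frame.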

An application using the content photographing apparatus of the present invention will be described at the system level with reference to FIG. 10. In FIG. 10, video data, audio data, and metadata are recorded on the SD card memory 1002, which is the recording medium of the movie camera 1001. The SD card memory 1002 is inserted into the personal computer 1003, and the recorded data is transferred. If the metadata 107, such as the important scenes and chapters described above, has already been recorded on the SD card memory 1002, the data is transferred to the personal computer 1003 and the playlist is checked. If the playlist is acceptable, rough editing and non-linear editing can be executed automatically at that point to generate a finished package file. The edited file can then be recorded and saved, almost automatically, on a medium 1004 such as a DVD-R or DVD-RAM. By playing the medium 1004 on the DVD player 1005, the edited file can be viewed on the TV 1006.

As for the method of conveying metadata, the metadata can accompany the content as text data or as binary data. The metadata can also accompany the content as a watermark. Furthermore, the metadata can be embedded in the image data as a watermark and encoded, and the resulting image data can be recorded and played back, or transmitted and received, and then decoded for use.

また、上記実施の形態では、メタデータ１０７と映像データを同一のメディアへ記録、蓄積した例について説明したが、関連付けが行われた２つ以上のメディアにメタデータ１０７と映像データを別々に保存しても良い。また、関連付けが行われたメディアであれば、メタデータのみの保存、更には映像データのみの保存、またはメタデータと映像データの保存、のいずれかを行っても良い。 In the above embodiment, an example was described in which the metadata 107 and the video data are recorded and stored on the same medium. However, the metadata 107 and the video data may be stored separately on two or more media that have been associated with each other. In addition, as long as the media are associated, any one of the following may be stored on a given medium: metadata only, video data only, or both metadata and video data.

なお、撮影装置から表示装置に撮影した映像信号を出力する際に、確認手段によって表示装置が撮影メタデータを用いて色再現を含む表示処理を最適化する高画質化機能を持っているかどうかを確認する。この確認手段が取り扱う信号は、前述の実施の形態において説明したMPEG−TS、NTSC／PALのベースバンド信号（ＳＭＰＴＥ２５９Ｍと同等のデジタル、アナログ信号）およびＨＤＴＶのベースバンド信号（ＳＭＰＴＥ２９２Ｍと同等のデジタル、アナログ信号）のいずれでもよい。すなわち、４８０／６０ｉ、４８０・３０Ｐ、４８０・２４Ｐ、７２０／３０Ｐ、７２０・２４Ｐ，１０８０／６０ｉ、１０８０／６０Ｐの映像信号でもよい。 When the photographed video signal is output from the photographing apparatus to a display device, checking means confirms whether the display device has an image quality enhancement function that uses the shooting metadata to optimize display processing, including color reproduction. The signal handled by this checking means may be any of the MPEG-TS signal described in the above embodiment, an NTSC/PAL baseband signal (a digital or analog signal equivalent to SMPTE 259M), or an HDTV baseband signal (a digital or analog signal equivalent to SMPTE 292M). That is, the video signal may be 480/60i, 480/30P, 480/24P, 720/30P, 720/24P, 1080/60i, or 1080/60P.

以上のように本発明のコンテンツ撮影装置は、カメラを用いたプロフェッショナルによるニュース、ドキュメンタリー、バラエティ番組やアマチュアによる運動会や入学式、卒業式、音楽の発表会、結婚式等の撮影で、カメラの撮影者や撮影補助者がイベント情報をマイクにより音声でカメラに入力する手段と、カメラ内で該入力音声のレベルを検出して撮影コンテンツのタイムコードなどの管理情報と関連付けた音声メタデータを生成する手段と、メタデータをリスト化した後にファイル化する手段とを備えている。本発明のコンテンツ撮影装置によれば、数十ＭＩＰＳ程度以上の安価なＣＰＵを搭載した民生用ムービーなどでも、音声メタデータをテキストメタデータに変換することが可能となる。 As described above, the content photographing apparatus of the present invention is intended for shooting with a camera, such as professional shooting of news, documentaries, and variety programs, and amateur shooting of sports days, entrance ceremonies, graduation ceremonies, music recitals, weddings, and the like. It comprises means by which the camera operator or a shooting assistant inputs event information to the camera by voice through a microphone, means for detecting the level of the input voice inside the camera and generating audio metadata associated with management information such as the time code of the photographed content, and means for listing the metadata and then storing the list as a file. According to the content photographing apparatus of the present invention, audio metadata can be converted into text metadata even in a consumer movie camera equipped with an inexpensive CPU of about several tens of MIPS or more.
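The level-detection step can be sketched as follows. The sketch assumes that the filtered microphone signal has already been reduced to one level value per fixed-length frame, and it applies the rule stated in the abstract and claim 1: a voice tag is generated when the level stays at or above a preset level for at least a preset period. The class name, function name, parameter names, and the recorded end time are hypothetical; the specification only requires the tag to carry the start time.

```python
from dataclasses import dataclass


@dataclass
class VoiceTag:
    start_time: float  # seconds from the start of the recording
    end_time: float    # assumption: kept so the audio can be located later


def detect_voice_tags(levels, frame_sec, min_level, min_duration):
    """Return a VoiceTag for every run of frames whose level is >= min_level
    and whose length is >= min_duration seconds.

    levels:    per-frame output level of the filtering stage
    frame_sec: duration of one level frame in seconds
    """
    tags = []
    run_start = None
    # The trailing None closes a run that lasts until the end of the input.
    for i, level in enumerate(list(levels) + [None]):
        loud = level is not None and level >= min_level
        if loud:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_sec >= min_duration:
                tags.append(VoiceTag(run_start * frame_sec, i * frame_sec))
            run_start = None
    return tags
```

Because the loop only compares levels and counts frames, its cost is negligible next to speech recognition, which is consistent with the stated aim of running on an inexpensive CPU and deferring recognition to a later stage.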

また本発明のコンテンツ撮影装置は、カメラ内蔵またはカメラ外の音声認識エンジンにより該ファイルの音声位置情報より音声部分に高速にアクセスして、音声をテキスト変換した後に前記メタデータのリストに付加情報として追加する手段を備えている。 The content photographing apparatus of the present invention also comprises means by which a voice recognition engine, either built into the camera or external to it, quickly accesses the audio portions using the audio position information in the file, converts the audio into text, and then adds the text to the metadata list as additional information.
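A minimal sketch of this step is given below. It assumes that each entry in the metadata list is a dictionary with `start` and `end` times in seconds and that the recognition engine is supplied as a callable; both are assumptions for illustration, and `recognize` stands in for whatever built-in or external engine is used.

```python
def add_text_metadata(tags, samples, sample_rate, recognize):
    """Attach recognized text to each entry of a voice-tag list.

    tags:        list of dicts with "start" and "end" in seconds (assumed layout)
    samples:     the recorded audio as a sequence of samples
    sample_rate: samples per second
    recognize:   hypothetical callable mapping an audio segment to text
    """
    enriched = []
    for tag in tags:
        first = int(tag["start"] * sample_rate)
        last = int(tag["end"] * sample_rate)
        # Only the tagged segment is passed to the engine, not the whole recording.
        text = recognize(samples[first:last])
        enriched.append({**tag, "text": text})
    return enriched
```

Jumping straight to the tagged positions means the engine processes only the commentary segments, which is what makes the access fast compared with scanning the entire audio track.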

本発明は、音声メタデータを用いて、コストを増大させることなくテキストメタデータの生成を行うことができるため、コンテンツ撮影装置として利用価値の高いものである。 Since the present invention can generate text metadata using audio metadata without increasing the cost, it is highly useful as a content photographing apparatus.

FIG. 1: 本発明の実施の形態に係るコンテンツ撮影装置のモデル図 Model diagram of the content photographing apparatus according to an embodiment of the present invention
FIG. 2: 図１に示すカメラの内部構成の説明図 Explanatory diagram of the internal configuration of the camera shown in FIG. 1
FIG. 3: Ｈ．２６４圧縮におけるメタデータの取り扱いの説明図 Explanatory diagram of the handling of metadata in H.264 compression
FIG. 4: Ｈ．２６４圧縮のピクチャ構造とＭＰＥＧ−ＴＳへの変換方法の説明図 Explanatory diagram of the picture structure of H.264 compression and the method of conversion to MPEG-TS
FIG. 5: プレイリストとストリームオブジェクトの関係の説明図 Explanatory diagram of the relationship between playlists and stream objects
FIG. 6: ストリームとメタデータを記録するディレクトリ構造の説明図 Explanatory diagram of the directory structure for recording streams and metadata
FIG. 7: メタデータの分類例を示す図 Diagram showing an example of metadata classification
FIG. 8: メタデータを用いた検索アルゴリズムのモデル図 Model diagram of the search algorithm using metadata
FIG. 9: ピクチャ設定方法のモデル図 Model diagram of the picture setting method
FIG. 10: アプリケーション例の説明図 Explanatory diagram of an application example

Explanation of symbols

１０１カメラ
１０２カメラのレンズ部
１０３カメラのマイク
１０４カメラの撮影対象
１０５カメラで撮影したデータ
１０６ＡＶストリームファイルデータ
１０７メタデータ
１０８カメラで撮影されたデータシーケンス
１０９リモコン
１１０編集によりシーン＃１から＃５までを繋いだデータシーケンス
１１１テレビ（ＴＶ）
１１２信号接続ケーブル
１１３信号接続ケーブル
１１４メタデータ入力用ボタン（重要シーン登録ボタン、静止画撮影ボタン）
１１５カチンコ
１１６マイク
２０１ズーム制御部
２０２フォーカス制御部
２０３露出制御部
２０４撮像素子
２０５シャッタ速度制御部
２０６カメラ部マイコン
２０７絶対傾きセンサ
２０８角速度センサ
２０９加速度センサ
２１０ユーザ入力系
２１１カメラ信号処理部
２１２音声処理系
２１３Ｈ．２６４方式エンコーダ
２１４記録メディア
２１５出力インタフェース
３０１映像符号化部
３０２ＶＣＬ−ＮＡＬユニットバッファ
３０３音声符号化部
３０４ＰＳバッファ
３０５ＶＵＩバッファ
３０６ＳＥＩバッファ
３０７ＮｏｎＶＣＬ−ＮＡＬユニットバッファ
３０８ＭＰＥＧ−ＰＥＳパケット生成部
３０９ＭＰＥＧ−ＴＳ生成部
３１０ＡＴＳパケット生成部
３１１ＥＰ−ｍａｐ生成部
１００１ムービーカメラ
１００２ＳＤカードメモリ
１００３パソコン
１００４メディア
１００５ＤＶＤプレーヤ
１００６ＴＶ

101 Camera
102 Camera lens unit
103 Camera microphone
104 Subject photographed by the camera
105 Data photographed by the camera
106 AV stream file data
107 Metadata
108 Data sequence photographed by the camera
109 Remote control
110 Data sequence connecting scenes #1 to #5 by editing
111 Television (TV)
112 Signal connection cable
113 Signal connection cable
114 Metadata input button (important scene registration button, still image shooting button)
115 Clapperboard
116 Microphone
201 Zoom control unit
202 Focus control unit
203 Exposure control unit
204 Image sensor
205 Shutter speed control unit
206 Camera unit microcomputer
207 Absolute tilt sensor
208 Angular velocity sensor
209 Acceleration sensor
210 User input system
211 Camera signal processing unit
212 Audio processing system
213 H.264 encoder
214 Recording medium
215 Output interface
301 Video encoding unit
302 VCL-NAL unit buffer
303 Audio encoding unit
304 PS buffer
305 VUI buffer
306 SEI buffer
307 Non-VCL-NAL unit buffer
308 MPEG-PES packet generation unit
309 MPEG-TS generation unit
310 ATS packet generation unit
311 EP-map generation unit
1001 Movie camera
1002 SD card memory
1003 Personal computer
1004 Media
1005 DVD player
1006 TV

Claims

A content photographing apparatus that converts content including any of video, audio, or data, accessible at an arbitrary position by means of time information, into a stream, and records the stream on an information storage medium in combination with metadata related to the content, the apparatus comprising:
filtering means for filtering, in a recording mode or a recording standby mode, a commentary voice that is input from voice input means and includes additional information about the content;
output level detection means for detecting an output level of the filtering means;
means for generating a voice tag including time information of when input of the commentary voice started, when the output level of the filtering means is equal to or higher than a preset output level for a period equal to or longer than a preset output period; and
means for associating the voice tag including the time information with time information of the commentary voice, and recording the voice tag as metadata on the information storage medium.

The content photographing apparatus according to claim 1, further comprising:
voice recognition means for accessing the commentary voice using the voice tag, receiving the commentary voice as input, and converting it into text data; and
means for associating the text data converted by the voice recognition means with time information of the photographed content.

The content photographing apparatus according to claim 2, wherein the conversion of the commentary voice into text data is executed in non-real time when the apparatus is in neither the recording mode nor the reproduction mode.

The content photographing apparatus according to claim 2, further comprising means for reproducing, from the recording medium, content that includes a search keyword when the search keyword input from search means matches at least a part of the metadata.

The content photographing apparatus according to claim 1, further comprising:
means for accessing the commentary voice using the voice tag and outputting the commentary voice outside the apparatus;
means for inputting text data of the commentary voice converted outside the apparatus; and
means for associating the text data with time information of the photographed content.

The content photographing apparatus according to claim 1, further comprising control means for receiving, from a camera control unit, at least one of: lens unit motion data, namely the zoom state, aperture value, focal length, shutter speed, horizontal or vertical tilt angle of the lens unit, angular velocity of the lens unit rotating in the horizontal or vertical direction, or acceleration of the lens unit moving forward, backward, left, right, or vertically; input data from the user; and data obtained by performing predetermined calculation processing on the motion data; and for temporarily storing the received data as metadata in association with the corresponding video frame.

The content photographing apparatus according to claim 1, further comprising means for setting a priority for each item of the recorded metadata and recording the priority on the information recording medium as additional information of that metadata.

The content photographing apparatus according to claim 7, further comprising:
read priority setting means; and
means for outputting metadata information having a priority higher than the priority set by the read priority setting means.