JP2018129008A

JP2018129008A - Image compositing device, image compositing method, and computer program

Info

Publication number: JP2018129008A
Application number: JP2017023667A
Authority: JP
Inventors: Kazuki Okami; Kota Takeuchi; Hideaki Kimata
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-02-10
Filing date: 2017-02-10
Publication date: 2018-08-16
Anticipated expiration: 2037-02-10
Also published as: JP6730204B2

Abstract

PROBLEM TO BE SOLVED: To create an image with an arbitrary field of view at an arbitrary time with higher quality by using a video obtained without imposing a strict constraint condition, such as limitation of the background to a uniform color.
SOLUTION: An image composition device comprises: an input part that receives an input of a plurality of dynamic images, camera parameters related to a plurality of cameras that photograph the dynamic images, and parameters related to the three-dimensional shape of a subject photographed in the dynamic images; a three-dimensional joint information estimation part that estimates three-dimensional joint information on the subject for a frame at every time; a shape changing part that acquires information indicating the three-dimensional shape of the subject at every time on the basis of a result of estimation of the three-dimensional joint information in the frame at every time and the parameters related to the three-dimensional shape of the subject; and an image composition part that creates an image including an image of the subject with a designated field of view at a designated time on the basis of the information indicating the three-dimensional shape of the subject.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for generating free-viewpoint video.

In free-viewpoint video, video from an arbitrary viewpoint is synthesized from videos captured by cameras arranged at a plurality of positions. This synthesis makes it possible to view the scene from any viewpoint. Free-viewpoint video technology has long been studied as a next-generation video medium. By reconstructing the three-dimensional shape of a subject in the scene, free-viewpoint video makes it possible to generate video from a viewpoint at which no camera is actually placed.

One representative study that achieves high-quality free-viewpoint video is that of Collet et al. (see Non-Patent Document 1). That study proposed a complete pipeline for capturing, synthesizing, and distributing free-viewpoint video, and its technique can synthesize high-quality free-viewpoint video. However, it requires a large number of cameras, including infrared cameras. In addition, the background must be limited to a uniform color in order to extract the subject region. Furthermore, calibration specialized for this particular environment is required. The capture environment is thus subject to very strict constraints, which makes the technique difficult to use in real scenes.

In another study, a free-viewpoint video synthesis method that works under comparatively realistic constraints by using distance sensors has been proposed (see Non-Patent Document 2). With that technique, however, the synthesis quality is not sufficiently high. Two major factors that degrade the synthesis quality are occlusion and temporal flicker. Occlusion requires reproducing information that could not be acquired, so in principle it cannot be resolved without using prior information or the like. Temporal flicker arises because the distance sensors are affected by infrared interference and similar effects, so the acquired information varies from frame to frame. Attempts have been made to resolve this by, for example, filtering the distance information. Although this brings some improvement, the problem has not yet been sufficiently eliminated. Occlusion and temporal flicker are problems that must be solved, because even a slight occurrence gives viewers a strong sense of unnaturalness.

Non-Patent Document 1: A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, S. Sullivan, “High-quality streamable free-viewpoint video,” ACM Transactions on Graphics, 34(4), 2015.
Non-Patent Document 2: D. Alexiadis, D. Zarpalas, P. Daras, “Fast and smooth 3D reconstruction using multiple RGB-Depth sensors,” in Visual Communications and Image Processing Conference 2014, pp.173-176.

As described above, problems remain to be solved in conventional free-viewpoint video technology, and generating images of sufficient quality under constraints that are practical for real scenes has not been achieved.
In view of the above circumstances, an object of the present invention is to provide a technique for generating an image with an arbitrary field of view at an arbitrary time with higher quality, by using video obtained without imposing strict constraints such as limiting the background to a uniform color.

One aspect of the present invention is an image composition device comprising: an input unit that receives input of a plurality of moving images, camera parameters relating to the fields of view of the plurality of cameras that captured the moving images, and parameters relating to the three-dimensional shape of a subject captured in the moving images; a three-dimensional joint information estimation unit that estimates three-dimensional joint information of the subject for the frame at each time; a shape deformation unit that acquires information indicating the three-dimensional shape of the subject at each time, based on the estimation result of the three-dimensional joint information for the frame at each time and on the parameters relating to the three-dimensional shape of the subject; and an image composition unit that generates, based on the information indicating the three-dimensional shape of the subject, an image that includes an image of the subject in a designated field of view at a designated time.

One aspect of the present invention is the above image composition device, wherein the parameters relating to the three-dimensional shape of the subject include reference shape information indicating the three-dimensional shape of the subject, and reference joint information indicating information on each joint in the three-dimensional shape indicated by the reference shape information.

One aspect of the present invention is the above image composition device, further comprising a frame classification unit that determines, among the frames at each time constituting the moving images, estimation frames, which are the frames for which joint information of the subject is to be estimated based on the image of the frame, wherein the three-dimensional joint information estimation unit, for an estimation frame, estimates the three-dimensional joint information of the subject based on the image of the estimation frame and, for frames other than the estimation frames, estimates the three-dimensional joint information by interpolation using the estimation results of the three-dimensional joint information for the estimation frames.

One aspect of the present invention is the above image composition device, further comprising a two-dimensional joint information estimation unit that estimates two-dimensional joint information of the subject for the frame at each time and acquires the reliability of the estimation result, wherein the three-dimensional joint information estimation unit, for an estimation frame, estimates the three-dimensional joint information based only on information about those frames for which a high reliability was obtained by the two-dimensional joint information estimation unit.

One aspect of the present invention is an image composition method comprising: an input step of receiving input of a plurality of moving images, camera parameters relating to the fields of view of the plurality of cameras that captured the moving images, and parameters relating to the three-dimensional shape of a subject captured in the moving images; a three-dimensional joint information estimation step of estimating three-dimensional joint information of the subject for the frame at each time; a shape deformation step of acquiring information indicating the three-dimensional shape of the subject at each time, based on the estimation result of the three-dimensional joint information for the frame at each time and on the parameters relating to the three-dimensional shape of the subject; and an image composition step of generating, based on the information indicating the three-dimensional shape of the subject, an image that includes an image of the subject in a designated field of view at a designated time.

One aspect of the present invention is a computer program for causing a computer to function as the above image composition device.

According to the present invention, an image with an arbitrary field of view at an arbitrary time can be generated with higher quality by using video obtained without imposing strict constraints such as limiting the background to a uniform color.

A schematic block diagram showing a configuration example of the image composition device 10 in the embodiment.
A diagram showing a configuration example of the frame classification unit 12.
A diagram showing a configuration example of the two-dimensional joint information estimation unit 13.
A diagram showing a configuration example of the three-dimensional joint information estimation unit 14.
A diagram showing a configuration example of the shape deformation unit 15.
A flowchart showing a specific example of the processing of the image composition device 10.
A flowchart showing details of the processing of step S106.

FIG. 1 is a schematic block diagram showing a configuration example of the image composition device 10 in the embodiment. The image composition device 10 includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes an image composition program. By executing the image composition program, the image composition device 10 functions as a device including an input unit 11, a frame classification unit 12, a two-dimensional joint information estimation unit 13, a three-dimensional joint information estimation unit 14, a shape deformation unit 15, and an image composition unit 16. All or some of the functions of the image composition device 10 may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The image composition program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The image composition program may be transmitted via a telecommunication line.

First, the input unit 11 will be described. The input unit 11 receives input of a plurality of moving images, the parameters of the cameras that captured the respective moving images (hereinafter referred to as “camera parameters”), reference shape information indicating the three-dimensional shape of a predetermined subject, reference joint information indicating the three-dimensional joint information of the predetermined subject, and deformation parameters of the predetermined subject. The plurality of moving images are obtained by cameras arranged at a plurality of positions capturing the same scene at the same time. For example, moving images obtained by capturing a field such as a soccer field over the same period (for example, the one hour from 13:00 to 14:00 on the same day) with a plurality of cameras arranged so as to surround the field are input. Each moving image may be a color, grayscale, or binary moving image. The moving image data may be input from each camera in real time, or moving images recorded on a recording medium such as a hard disk drive (HDD) may be input. The moving images need not have been captured at exactly the same time; some time offset between them is acceptable. Each frame of a moving image is preferably given the time at which the frame was captured (hereinafter referred to as the “frame time”). When no frame time is given, the input unit 11 may assign a frame time to each frame based on the capture start time and the playback time of the moving image. In the following description, for simplicity, it is assumed that only one person appears in the moving images and that this person is the subject to be rendered using the reference shape information and the like (hereinafter referred to as the “subject of interest”). When there are a plurality of subjects of interest, the input unit 11 may divide the moving images by subject of interest by cutting out the person regions in the moving images. Applying to the moving images of each subject of interest the same processing as in the single-subject case makes the same processing possible even when there are a plurality of subjects of interest.
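The assignment of frame times from the capture start time and the playback time can be sketched as follows. This is only an illustration: the function name `assign_frame_times` and the assumption that frames are evenly spaced over the playback duration are not taken from the source.

```python
def assign_frame_times(start_time_s, duration_s, frame_count):
    """Return one frame time (in seconds) per frame.

    Assumption for illustration: frames are evenly spaced, so frame i is
    stamped start_time_s + i * (duration_s / frame_count).
    """
    if frame_count <= 0:
        return []
    step = duration_s / frame_count
    return [start_time_s + i * step for i in range(frame_count)]


if __name__ == "__main__":
    # A 2-second clip of 4 frames that started at t = 100 s.
    print(assign_frame_times(100.0, 2.0, 4))
```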

The camera parameters are parameters such as the viewpoint position, line-of-sight direction, and viewing angle of each camera that captured a moving image. The camera parameters may include both intrinsic parameters and extrinsic parameters of the camera. The camera parameters may be acquired by performing camera calibration on each camera before the moving images are acquired.
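As an illustration of how intrinsic and extrinsic parameters relate an image point to the three-dimensional scene, the following sketch back-projects a pixel to a ray in world coordinates. It assumes a distortion-free pinhole model with focal lengths `fx`, `fy`, principal point `cx`, `cy`, and a world-to-camera transform `x_cam = R * x_world + t`; the source does not specify a camera model, so all of these names and conventions are assumptions.

```python
def pixel_to_ray(u, v, fx, fy, cx, cy, R, t):
    """Back-project pixel (u, v) to a ray (origin, direction) in world coordinates.

    Assumed model (not from the source): pinhole camera without distortion,
    with x_cam = R * x_world + t, where R is a 3x3 rotation given as nested lists.
    """
    # Viewing direction of the pixel in camera coordinates.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)

    def rotate_to_world(vec):
        # Multiply by R transposed, which is the inverse of a rotation matrix.
        return tuple(sum(R[i][j] * vec[i] for i in range(3)) for j in range(3))

    # The camera centre C satisfies R * C + t = 0, so C = -R^T * t.
    origin = tuple(-c for c in rotate_to_world(t))
    direction = rotate_to_world(d_cam)
    return origin, direction
```

The viewpoint position mentioned in the text corresponds to `origin`, and the line-of-sight direction corresponds to the ray through the principal point.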

The reference shape information indicates the three-dimensional shape of a subject (subject of interest) that is determined in advance according to a predetermined criterion from among the subjects captured in the input moving images. For example, reference shape information is input for subjects that are particularly likely to attract attention. For example, when moving images of a soccer match are input, the reference shape information of all players taking part in the match (the starting players and the players on the bench) may be input. The reference shape information may be obtained by performing three-dimensional shape reconstruction on the subject of interest in advance. For example, the reference shape information may be generated based on data obtained by measuring the subject of interest with a measuring device such as a distance sensor. Alternatively, the reference shape information may be generated from still images captured by cameras at a plurality of positions. The reference shape information may be, for example, data (three-dimensional person model data) comprising the positions of the joints of a person, the surface shape of the person, and an image of the person's surface (texture image). Using three-dimensional person model data makes it possible to generate an image of the person in a desired posture in a desired field of view. The posture of the subject of interest in the reference shape information may be a posture such as a T-pose or an A-stance, or may be another posture. When the input reference shape information contains missing regions or noise, the input unit 11 may improve the quality of the surface shape using a technique such as Poisson Surface Reconstruction (Reference 1) or general spatial filtering. With such processing, the shape reconstructed afterward keeps a continuous surface. As a result, missing regions caused by changes in the viewpoint position become less likely.
Reference 1: M. Kazhdan, M. Bolitho, H. Hoppe, “Poisson Surface Reconstruction,” Symposium on Geometry Processing 2006, 61-70.

The reference joint information is the three-dimensional joint information of each joint in the posture of the input reference shape information. Three-dimensional joint information represents the position and the angle of a joint. The position of a joint is expressed, for example, in xyz coordinates. The angle of a joint is expressed, for example, as Euler angles about the x, y, and z axes. The reference joint information may be measured using a measurement technique such as motion capture when the reference shape information is generated, or may be acquired by performing estimation on an image in which the subject of interest is captured.

The deformation parameters determine how the shape of the subject of interest in the moving images deforms when the joints of the subject of interest corresponding to the reference shape information change. The deformation parameters are, for example, parameters that define the change in shape in response to the rotation of a joint. The deformation parameters may be acquired in advance by measurement or the like. For example, the deformation parameters may be determined by a general skinning technique so as to be inversely proportional to the distance between a joint and a vertex of the shape. The deformation parameters may also be set manually using software.
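A minimal sketch of deformation parameters that are inversely proportional to the joint-to-vertex distance is shown below. The function name `skinning_weights`, the small `eps` guard against division by zero, and the normalization of the weights to sum to one are assumptions for illustration; the source only states the inverse proportionality.

```python
import math


def skinning_weights(vertex, joints, eps=1e-9):
    """Return one weight per joint for a single mesh vertex.

    Each raw weight is 1 / distance(vertex, joint). Normalizing the weights
    to sum to 1 is an assumption commonly made in skinning, not a detail
    given in the source.
    """
    inverse = [1.0 / (math.dist(vertex, joint) + eps) for joint in joints]
    total = sum(inverse)
    return [w / total for w in inverse]


if __name__ == "__main__":
    # A vertex 1 unit from the first joint and 3 units from the second.
    print(skinning_weights((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]))
```

With such weights, a vertex follows the rotation of a nearby joint more strongly than that of a distant one.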

Next, the frame classification unit 12 will be described. FIG. 2 is a diagram showing a configuration example of the frame classification unit 12. The plurality of moving images received by the input unit 11 are input to the frame classification unit 12. In the example of FIG. 2, L moving images captured by L cameras at different viewpoint positions are input to the frame classification unit 12. The frame classification unit 12 includes a frame separation unit 121 and a joint information estimation flag assigning unit 122. The frame separation unit 121 separates each moving image into per-frame images. In doing so, the frame separation unit 121 associates with one another frames obtained from different moving images that satisfy, based on their frame times, a predetermined condition indicating that they are images captured at the same time. The predetermined condition may be that, for a frame of a moving image serving as a reference, the frame is the one with the closest frame time among the frames obtained from another moving image. A set of associated frames estimated to have been captured at the same time is called a simultaneous frame set. In the following processing, the frames included in a simultaneous frame set may be treated as having been captured at the same time regardless of their actual frame times.
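The nearest-frame-time association can be sketched as follows. The function name `build_simultaneous_frame_sets`, the representation of each moving image as a list of frame times, and the choice of the first moving image as the reference are assumptions for illustration.

```python
def build_simultaneous_frame_sets(frame_times_per_video, reference_index=0):
    """Group frames from several videos into simultaneous frame sets.

    frame_times_per_video: one list of frame times per video.
    Returns a list of tuples; tuple k holds, for each video, the index of the
    frame whose frame time is closest to reference frame k.
    """
    reference_times = frame_times_per_video[reference_index]
    frame_sets = []
    for ref_frame, t_ref in enumerate(reference_times):
        members = []
        for video, times in enumerate(frame_times_per_video):
            if video == reference_index:
                members.append(ref_frame)
            else:
                # Index of the frame with the smallest time difference.
                nearest = min(range(len(times)), key=lambda i: abs(times[i] - t_ref))
                members.append(nearest)
        frame_sets.append(tuple(members))
    return frame_sets


if __name__ == "__main__":
    videos = [[0.0, 1.0, 2.0], [0.1, 0.9, 2.2, 3.0]]
    print(build_simultaneous_frame_sets(videos))
```

A linear scan per reference frame keeps the sketch short; for long, sorted timestamp lists a binary search would be the more economical choice.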

The joint information estimation flag assigning unit 122 assigns to each simultaneous frame set a value of a flag (hereinafter referred to as the “joint information estimation flag”) indicating whether three-dimensional joint information is to be estimated based on the images of that frame set. The joint information estimation flag takes one of two values: an estimation flag or a non-estimation flag. When the estimation flag is assigned, the three-dimensional joint information for that simultaneous frame set is estimated using the frame images and the camera parameters. When the non-estimation flag is assigned, the three-dimensional joint information for that simultaneous frame set is estimated by interpolation. The joint information estimation flag assigning unit 122 may, for example, assign the estimation flag to simultaneous frame sets at a predetermined period and assign the non-estimation flag to the other simultaneous frame sets. The joint information estimation flag assigning unit 122 may also assign the estimation flag to simultaneous frame sets for which a predetermined condition is satisfied in the images. The predetermined condition may be, for example, that the moving speed of the subject of interest in the image reaches an extremum. The moving speed may be that of the entire subject of interest, or that of some joints or body parts (for example, an arm or the face). In this case, the two-dimensional joint information estimation unit 13 may determine the moving speed of the subject of interest in the image and assign the estimation flag when the moving speed reaches an extremum. The two-dimensional joint information estimation unit 13 may assign the estimation flag to a simultaneous frame set when any one frame in the set satisfies the predetermined condition, or when at least a predetermined number of frames in the set satisfy it. The non-estimation flag is assigned to every simultaneous frame set to which the estimation flag has not been assigned. In the following description, the image of each frame of each simultaneous frame set to which a joint information estimation flag has been assigned is referred to as a classified frame image.
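The two flagging strategies described here, a fixed period and speed extrema, can be sketched together. In this illustration `speeds` holds one moving-speed value per simultaneous frame set, `True` stands for the estimation flag and `False` for the non-estimation flag. Always flagging the first and last sets is an added assumption (so that every remaining set has a flagged neighbour on both sides); it is not stated in the source.

```python
def assign_estimation_flags(speeds, period=None):
    """Return one boolean per simultaneous frame set (True = estimation flag)."""
    n = len(speeds)
    flags = [False] * n
    if n == 0:
        return flags
    # Assumption, not from the source: always estimate the first and last sets.
    flags[0] = flags[-1] = True
    # Strategy 1: flag every `period`-th set (period must be a positive integer).
    if period is not None:
        for i in range(0, n, period):
            flags[i] = True
    # Strategy 2: flag sets where the speed reaches a local maximum or minimum,
    # detected as a sign change between consecutive speed differences.
    for i in range(1, n - 1):
        change_before = speeds[i] - speeds[i - 1]
        change_after = speeds[i + 1] - speeds[i]
        if change_before * change_after < 0:
            flags[i] = True
    return flags


if __name__ == "__main__":
    print(assign_estimation_flags([0, 1, 3, 2, 1, 2, 4]))
```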

Next, the two-dimensional joint information estimation unit 13 will be described. FIG. 3 is a diagram showing a configuration example of the two-dimensional joint information estimation unit 13. The simultaneous frame sets to which the estimation flag has been assigned (the images from the respective viewpoints) are input to the two-dimensional joint information estimation unit 13. The two-dimensional joint information estimation unit 13 estimates two-dimensional joint information (the positions of the joints in the image) in the frame image of each viewpoint included in an input simultaneous frame set. The two-dimensional joint information estimation unit 13 also acquires, for the frame image of each viewpoint, the reliability of the estimation result of the two-dimensional joint information. The reliability is a value indicating how close the estimation result of the two-dimensional joint information is estimated to be to the actual value. For example, a higher reliability indicates that the estimation result is estimated to be closer to the actual value. For this processing, DeepCut, described in Reference 2 below, may be used, or another technique may be used.
Reference 2: L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, B. Schiele, “DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016).

Next, the three-dimensional joint information estimation unit 14 will be described. FIG. 4 is a diagram showing a configuration example of the three-dimensional joint information estimation unit 14. Based on the estimation results of the two-dimensional joint information estimation unit 13, the three-dimensional joint information estimation unit 14 estimates, for each simultaneous frame set, the three-dimensional joint information of the subject of interest at that time. The three-dimensional joint information estimation unit 14 includes a three-dimensional joint information calculation unit 141 and a three-dimensional joint information interpolation unit 142.

The three-dimensional joint information calculation unit 141 receives the two-dimensional joint information and reliabilities acquired by the two-dimensional joint information estimation unit 13 for the simultaneous frame sets to which the estimation flag has been assigned, together with the camera parameters of each camera. The three-dimensional joint information calculation unit 141 repeats the following processing for all joints of the subject of interest. First, for the joint being processed, the three-dimensional joint information calculation unit 141 selects a predetermined number of frames (for example, two) in descending order of reliability. The three-dimensional joint information calculation unit 141 then estimates the three-dimensional information of the joint being processed using the two-dimensional joint information in the selected frames and the camera parameters of the selected frames. For example, the two-dimensional joint information on a pair of frame images may be projected into three-dimensional space as straight lines using the camera parameters, and the midpoint of the line segment connecting the two points at which the distance between the lines is shortest may be estimated as the joint position of the three-dimensional joint information. The predetermined number of frames selected based on the reliability may differ for each joint being processed.
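The selection by reliability and the midpoint construction can be sketched as follows. For brevity, each observation is assumed to already carry the ray obtained from the camera parameters, as a dict with the hypothetical keys `confidence`, `origin`, and `direction`; the function names are likewise illustrative, and the sketch handles only the two-frame case given as an example in the text.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _point_on_line(origin, direction, s):
    return tuple(o + s * d for o, d in zip(origin, direction))


def midpoint_between_lines(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the shortest segment between lines o1 + s*d1 and o2 + t*d2."""
    w = _sub(o1, o2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("lines are (nearly) parallel; no unique closest points")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = _point_on_line(o1, d1, s)
    p2 = _point_on_line(o2, d2, t)
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


def triangulate_joint(observations):
    """Estimate a 3D joint position from the two most reliable views."""
    best = sorted(observations, key=lambda ob: ob["confidence"], reverse=True)[:2]
    if len(best) < 2:
        raise ValueError("at least two observations are required")
    first, second = best
    return midpoint_between_lines(first["origin"], first["direction"],
                                  second["origin"], second["direction"])
```

Because the selection is made per joint, a view in which one joint is occluded can still contribute to other joints that it observes reliably.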

Once the three-dimensional joint information calculation unit 141 has estimated the three-dimensional positions of all the joints, it estimates the angle of each joint based on those results. The joint angles may be obtained by deriving a rotation matrix or by using quaternions. The three-dimensional joint information calculation unit 141 executes the above processing for every same-time frame set to which the estimation flag is assigned. As described above, the three-dimensional joint information calculation unit 141 estimates the three-dimensional joint information using only a predetermined number of pieces of two-dimensional joint information with high reliability. This makes it possible to suppress the influence of errors contained in the two-dimensional joint information.
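As one possible illustration of the quaternion option, the sketch below computes the shortest-arc rotation that takes a bone direction in a reference pose to the bone direction given by the estimated joint positions. The text does not specify how the joint angle is defined, so treating it as a rotation between bone directions is an assumption, and the function names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z)


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bone_direction(parent: Vec3, child: Vec3) -> Vec3:
    """Unit vector from a parent joint position to its child joint position."""
    return _normalize((child[0] - parent[0],
                       child[1] - parent[1],
                       child[2] - parent[2]))


def rotation_between(u: Vec3, v: Vec3) -> Quat:
    """Unit quaternion for the shortest rotation taking direction u to direction v."""
    u, v = _normalize(u), _normalize(v)
    w = 1.0 + (u[0] * v[0] + u[1] * v[1] + u[2] * v[2])
    if w < 1e-9:
        # Opposite directions: rotate 180 degrees about any axis perpendicular to u.
        helper = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = _normalize(_cross(u, helper))
        return (0.0, axis[0], axis[1], axis[2])
    x, y, z = _cross(u, v)
    return _normalize((w, x, y, z))


def rotation_angle_deg(q: Quat) -> float:
    """Rotation angle, in degrees, encoded by a unit quaternion."""
    return math.degrees(2.0 * math.acos(max(-1.0, min(1.0, q[0]))))
```

A shortest-arc rotation leaves the twist about the bone axis undetermined; a fuller treatment would need additional joints or markers to fix it.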

The three-dimensional joint information interpolation unit 142 repeats the following processing for every same-time frame set to which the non-estimation flag is assigned. First, for each frame of the same-time frame set being processed, the three-dimensional joint information interpolation unit 142 selects the temporally nearest frames to which the estimation flag is assigned, one at an earlier time and one at a later time. It then acquires the estimation result (three-dimensional joint information) produced by the three-dimensional joint information calculation unit 141 for each selected frame. Based on the acquired estimation results, it interpolates the joint positions and angles according to the differences in frame time, thereby estimating the three-dimensional joint information of each frame to which the non-estimation flag is assigned. The coefficients used for the interpolation may be weighted in some way. In this way, the estimation processing by the three-dimensional joint information calculation unit 141 is not performed on the frames at every time. Instead, it is executed only on a thinned-out subset of frames (the frames to which the estimation flag is assigned), and interpolation based on those estimation results is performed for the remaining frames. Such processing makes it possible to suppress frame-to-frame flickering of the shape.
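The interpolation step can be sketched as below, under stated assumptions: joint positions are interpolated linearly, joint angles are represented as unit quaternions and interpolated with spherical linear interpolation (slerp), and the interpolation coefficient is the plain ratio of time differences with no additional weighting. The text does not prescribe these choices, and the names `lerp_position`, `slerp`, and `interpolate_joint` are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), unit length


def lerp_position(p0: Vec3, p1: Vec3, alpha: float) -> Vec3:
    """Linear interpolation between two joint positions."""
    return tuple((1.0 - alpha) * a + alpha * b for a, b in zip(p0, p1))


def slerp(q0: Quat, q1: Quat, alpha: float) -> Quat:
    """Spherical linear interpolation between two unit quaternions."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # take the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:  # nearly identical: fall back to normalized lerp
        q = tuple((1.0 - alpha) * a + alpha * b for a, b in zip(q0, q1))
        n = math.sqrt(sum(c * c for c in q))
        return tuple(c / n for c in q)
    theta = math.acos(dot)
    s0 = math.sin((1.0 - alpha) * theta) / math.sin(theta)
    s1 = math.sin(alpha * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))


def interpolate_joint(t: float, t0: float, joint0: dict,
                      t1: float, joint1: dict) -> dict:
    """Interpolate one joint at time t from estimates at t0 < t < t1."""
    if not t0 < t1:
        raise ValueError("t0 must be earlier than t1")
    alpha = (t - t0) / (t1 - t0)  # coefficient from the frame-time differences
    return {
        "position": lerp_position(joint0["position"], joint1["position"], alpha),
        "rotation": slerp(joint0["rotation"], joint1["rotation"], alpha),
    }
```

Because every non-estimated frame lies on a smooth path between two estimated frames, independent per-frame estimation noise cannot appear in those frames.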

Next, the shape deformation unit 15 will be described. The shape deformation unit 15 receives the reference shape information, the reference joint information, the estimation results (three-dimensional joint information) of the three-dimensional joint information estimation unit 14, and the deformation parameters. The shape deformation unit 15 compares the three-dimensional joint information estimated by the three-dimensional joint information estimation unit 14 with the reference joint information, and deforms the reference shape based on the comparison result and the deformation parameters. The shape deformation unit 15 performs this processing on the three-dimensional joint information obtained for every same-time frame set.

FIG. 5 is a diagram illustrating a configuration example of the shape deformation unit 15. The shape deformation unit 15 is described in detail below, taking FIG. 5 as an example. The shape deformation unit 15 includes a joint displacement calculation unit 151 and a deformation unit 152.

The joint displacement calculation unit 151 calculates the difference between the three-dimensional joint information estimated by the three-dimensional joint information estimation unit 14 and the reference joint information. For example, the joint displacement calculation unit 151 calculates the difference between the three-dimensional coordinates in the estimated three-dimensional joint information and the three-dimensional coordinates in the reference joint information, and thereby acquires the positional deviation along the x-axis, y-axis, and z-axis. The joint displacement calculation unit 151 also calculates the difference between the three-dimensional angles in the estimated three-dimensional joint information and the three-dimensional angles in the reference joint information, and thereby acquires the deviation in the rotation angles about the x-axis, y-axis, and z-axis.
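A minimal sketch of this difference calculation follows. It assumes, purely for illustration, that each joint carries a position and a triple of rotation angles in degrees about the x-, y-, and z-axes, and that angle differences are wrapped into the range [-180, 180); the function and key names are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def wrap_degrees(angle: float) -> float:
    """Wrap an angle difference into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def joint_displacement(estimated: dict, reference: dict) -> dict:
    """Per-axis position and rotation-angle deviation of one joint.

    Both arguments are dicts with "position" (x, y, z) and
    "angles_deg" (rotation about x, y, z in degrees).
    """
    position_delta = tuple(
        e - r for e, r in zip(estimated["position"], reference["position"])
    )
    angle_delta = tuple(
        wrap_degrees(e - r)
        for e, r in zip(estimated["angles_deg"], reference["angles_deg"])
    )
    return {"position": position_delta, "angles_deg": angle_delta}
```

Wrapping matters because an estimated angle of 170 degrees and a reference angle of -170 degrees differ by 20 degrees of actual rotation, not 340.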

The deformation unit 152 deforms the reference shape based on the difference between the estimated three-dimensional joint information and the reference joint information, and on the deformation parameters. Through this processing, the deformation unit 152 deforms the reference shape so that it takes the same posture as the subject of interest had at the frame time of the same-time frame set being processed. The deformation unit 152 outputs information on the shape after this deformation as deformed shape information. By executing this processing for every same-time frame set, deformed shape information corresponding to each same-time frame set is acquired. The deformation unit 152 may record the deformed shape information in a storage device in association with each frame time.

The image composition unit 16 generates a free viewpoint video for a designated field of view and time. The field of view and time may be designated, for example, by a device that plays back the free viewpoint video or by a person viewing it. The image composition unit 16 acquires the deformed shape information for the frame time corresponding to the designated time. The acquired deformed shape information indicates the position and posture of the subject of interest at that time. Using the acquired deformed shape information, the image composition unit 16 renders an image in the designated field of view. The image of the subject of interest is obtained by this rendering with the deformed shape information. The image composition unit 16 may render the background of the subject of interest based on model data representing the background obtained in advance, or by applying image processing such as an affine transformation to one or more frame images with a nearby field of view in the corresponding same-time frame set. The image composition unit 16 generates the free viewpoint video at the designated field of view and time by combining the background image with the image of the subject of interest. When a moving image is requested, the image composition unit 16 may generate a video from the free viewpoint by repeating the above processing along the time axis.
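The final combination of the subject image with the background image can be illustrated with a simple per-pixel blend. This sketch assumes that the renderer also outputs a coverage mask for the subject (1.0 where the subject is visible, 0.0 elsewhere), which the text does not state, and it uses plain nested lists rather than any particular image library; the function name `composite` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def composite(subject: Image, mask: List[List[float]], background: Image) -> Image:
    """Blend a rendered subject image over a background image.

    `mask` holds the subject coverage in [0.0, 1.0] for each pixel.
    All three inputs must have the same height and width.
    """
    if not (len(subject) == len(mask) == len(background)):
        raise ValueError("inputs must have the same number of rows")
    out: Image = []
    for subj_row, mask_row, back_row in zip(subject, mask, background):
        if not (len(subj_row) == len(mask_row) == len(back_row)):
            raise ValueError("inputs must have the same number of columns")
        row = []
        for subj_px, alpha, back_px in zip(subj_row, mask_row, back_row):
            # Weighted sum per color channel, rounded back to integers.
            row.append(tuple(
                round(alpha * s + (1.0 - alpha) * b)
                for s, b in zip(subj_px, back_px)
            ))
        out.append(row)
    return out
```

For a moving image, the same function would be applied once per output frame after rendering the subject and the background for that frame time.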

FIG. 6 is a flowchart illustrating a specific example of the processing of the image composition device 10. A specific example of the processing flow of the image composition device 10 is described below. First, the input unit 11 receives input of a plurality of moving images, the camera parameters of the cameras that captured the moving images, reference shape information indicating the three-dimensional shape of the subject of interest, reference joint information indicating the three-dimensional joint information of the subject of interest, and deformation parameters of the subject of interest (steps S101, S102). Next, the frame classification unit 12 assigns an estimation flag or a non-estimation flag to each same-time frame set (step S103).

A control unit (not shown) selects the same-time frame set to be processed from among the unprocessed same-time frame sets (step S104). When the same-time frame set being processed is one to which the estimation flag is assigned, the two-dimensional joint information estimation unit 13 estimates the two-dimensional joint information of the subject of interest in that same-time frame set (step S105). Next, the three-dimensional joint information estimation unit 14 estimates the three-dimensional joint information of the subject of interest in the same-time frame set being processed (step S106).

FIG. 7 is a flowchart showing details of the processing in step S106. When the same-time frame set being processed is one to which the estimation flag is assigned (step S201-YES), the three-dimensional joint information calculation unit 141 of the three-dimensional joint information estimation unit 14 performs the processing. Specifically, the three-dimensional joint information calculation unit 141 receives input of the two-dimensional joint information and reliability, and the camera parameters of each camera (step S202). For the joint being processed, the three-dimensional joint information calculation unit 141 selects a predetermined number of frames (for example, two) in descending order of reliability (step S203). The three-dimensional joint information calculation unit 141 then estimates the three-dimensional information of the joint being processed using the images of the selected frames and the camera parameters corresponding to the selected frames (step S204).

In step S201, when the same-time frame set being processed is one to which the non-estimation flag is assigned (step S201-NO), the three-dimensional joint information interpolation unit 142 of the three-dimensional joint information estimation unit 14 performs the processing. Specifically, the three-dimensional joint information interpolation unit 142 acquires the three-dimensional joint information by performing interpolation using the three-dimensional joint information of the same-time frame sets to which the estimation flag is assigned (step S205). This concludes the description of FIG. 7.
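The branch in FIG. 7 can be sketched as a small dispatch routine. The sketch makes two assumptions that go beyond the text: the flags are given as a list of booleans (True for the estimation flag), and the flagged frame sets are all estimated in a first pass so that both neighbours are available when the remaining frame sets are interpolated. The callables `estimate_2d`, `estimate_3d`, and `interpolate_3d` are hypothetical placeholders for the units 13, 141, and 142.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def estimate_all_frame_sets(
    frame_sets: Sequence[Any],
    flags: Sequence[bool],
    estimate_2d: Callable[[Any], Any],
    estimate_3d: Callable[[Any], Any],
    interpolate_3d: Callable[[int, Optional[int], Optional[int], Dict[int, Any]], Any],
) -> List[Any]:
    """Return three-dimensional joint information for every frame set, in time order."""
    if len(frame_sets) != len(flags):
        raise ValueError("one flag is required per frame set")
    results: Dict[int, Any] = {}

    # Pass 1 (S201-YES): estimate from the images for flagged frame sets.
    for index, (frame_set, flagged) in enumerate(zip(frame_sets, flags)):
        if flagged:
            results[index] = estimate_3d(estimate_2d(frame_set))
    estimated_indices = sorted(results)

    # Pass 2 (S201-NO): interpolate from the nearest earlier and later estimates.
    for index, flagged in enumerate(flags):
        if flagged:
            continue
        earlier = max((k for k in estimated_indices if k < index), default=None)
        later = min((k for k in estimated_indices if k > index), default=None)
        # `earlier` or `later` is None at the ends of the sequence; the
        # interpolation callable decides how to handle that case.
        results[index] = interpolate_3d(index, earlier, later, results)

    return [results[i] for i in range(len(frame_sets))]
```

Only the flagged frame sets incur the cost of image-based estimation; the rest are filled in from the stored results.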

Returning to the description of FIG. 6, the processing of steps S104 to S106 is executed for every same-time frame set (step S107). The shape deformation unit 15 then generates deformed shape information by deforming the reference shape based on the estimation results of the three-dimensional joint information for each same-time frame set, the reference joint information, and the deformation parameters (step S108).

Thereafter, at the time the free viewpoint video is to be generated, the image composition unit 16 renders the image of the subject of interest at the designated time and field of view using the deformed shape acquired in advance by the shape deformation unit 15 (step S109). The image composition unit 16 then generates and outputs a composite image by combining the obtained image with the background image (step S110).

In the image composition device 10 configured as described above, the following processing is performed to generate a free viewpoint video. Based on a plurality of actually captured moving images, the three-dimensional joint information of each joint of the subject of interest is estimated at each frame time. Based on the estimation results, the reference shape of the subject of interest obtained in advance is deformed, and deformed shape information of the subject of interest at each frame time is obtained. When the free viewpoint video is actually generated, the video of the subject of interest is produced by rendering in the designated field of view using the deformed shape information of the subject of interest at the designated time. Consequently, occlusion problems can be suppressed even for parts that were hidden and therefore not captured in any of the actually captured moving images (for example, the area under the arms or under the chin of the subject of interest).

In addition, when the deformed shape information at each frame time is acquired, the three-dimensional joint information of the subject of interest is not estimated independently from the moving images at every frame time. Instead, it is estimated from the moving images only for the same-time frame sets at some of the frame times (the frame times to which the estimation flag is assigned). For the same-time frame sets at the remaining frame times (the frame times to which the non-estimation flag is assigned), the three-dimensional joint information is obtained not from the moving images but by interpolation based on the estimation results for the same-time frame sets to which the estimation flag is assigned. As a result, flickering in the time direction is unlikely to occur, at least between one same-time frame set to which the estimation flag is assigned and the next. Such processing therefore makes it possible to suppress flickering in the time direction.

(Modification)
Although the object processed by the image composition device 10 in the above description is a person, the object of processing need not be limited to a person. It may be any living thing or object that has joints. For example, an animal may be the object of processing. In this case, the reference shape information, reference joint information, and deformation parameters obtained in advance are all information about the animal. As another example, a robot may be the object of processing. In this case, the reference shape information, reference joint information, and deformation parameters obtained in advance are all information about the robot.

The image composition device 10 described above may be configured as a system combining a plurality of information processing devices. For example, a system may be constructed that comprises a device including the input unit 11, the frame classification unit 12, the two-dimensional joint information estimation unit 13, the three-dimensional joint information estimation unit 14, and the shape deformation unit 15, and a device including the input unit 11 and the image composition unit 16.

An embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment, and designs and the like within a scope that does not depart from the gist of the present invention are also included.

DESCRIPTION OF SYMBOLS: 10 ... image composition device, 11 ... input unit, 12 ... frame classification unit, 121 ... frame separation unit, 122 ... joint information estimation flag assignment unit, 13 ... two-dimensional joint information estimation unit, 14 ... three-dimensional joint information estimation unit, 141 ... three-dimensional joint information calculation unit, 142 ... three-dimensional joint information interpolation unit, 15 ... shape deformation unit, 151 ... joint displacement calculation unit, 152 ... deformation unit, 16 ... image composition unit, 202 ... image generation unit, 211 ... network construction unit, 212 ... parameter learning unit, 21 ... image acquisition unit, 22 ... image processing unit

Claims

An image composition device comprising:
an input unit that receives input of a plurality of moving images, camera parameters relating to the fields of view of the plurality of cameras that captured the moving images, and parameters relating to the three-dimensional shape of a subject captured in the moving images;
a three-dimensional joint information estimation unit that estimates three-dimensional joint information of the subject for the frame at each time;
a shape deformation unit that acquires information indicating the three-dimensional shape of the subject at each time based on the estimation result of the three-dimensional joint information in the frame at each time and the parameters relating to the three-dimensional shape of the subject; and
an image composition unit that generates an image including an image of the subject in a designated field of view at a designated time based on the information indicating the three-dimensional shape of the subject.

The image composition device according to claim 1, wherein the parameters relating to the three-dimensional shape of the subject include reference shape information indicating the three-dimensional shape of the subject, and reference joint information indicating information on each joint in the three-dimensional shape indicated by the reference shape information.

The image composition device according to claim 1, further comprising a frame classification unit that determines, among the frames at each time constituting the moving images, estimation frames for which joint information of the subject is to be estimated based on the images of the frames,
wherein the three-dimensional joint information estimation unit estimates, for each estimation frame, the three-dimensional joint information of the subject based on the image of the estimation frame, and estimates, for frames other than the estimation frames, the three-dimensional joint information by interpolation processing using the estimation results of the three-dimensional joint information in the estimation frames.

The image composition device according to claim 3, further comprising a two-dimensional joint information estimation unit that estimates two-dimensional joint information of the subject for the frame at each time and acquires a reliability of the estimation result,
wherein, for the estimation frames, the three-dimensional joint information estimation unit estimates the three-dimensional joint information based only on information about the subset of frames for which a high reliability was acquired by the two-dimensional joint information estimation unit.

An image composition method comprising:
an input step of receiving input of a plurality of moving images, camera parameters relating to the fields of view of the plurality of cameras that captured the moving images, and parameters relating to the three-dimensional shape of a subject captured in the moving images;
a three-dimensional joint information estimation step of estimating three-dimensional joint information of the subject for the frame at each time;
a shape deformation step of acquiring information indicating the three-dimensional shape of the subject at each time based on the estimation result of the three-dimensional joint information in the frame at each time and the parameters relating to the three-dimensional shape of the subject; and
an image composition step of generating an image including an image of the subject in a designated field of view at a designated time based on the information indicating the three-dimensional shape of the subject.

A computer program for causing a computer to function as the image composition device according to any one of claims 1 to 4.