JP2013012045A

JP2013012045A - Image processing method, image processing system, and computer program

Info

Publication number: JP2013012045A
Application number: JP2011144417A
Authority: JP
Inventors: Jin Niigaki; 仁新垣; Ai Isogai; 愛磯貝; Hajime Noto; 肇能登; Harumi Kawamura; 春美川村; Nobuhiko Matsuura; 宣彦松浦
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-06-29
Filing date: 2011-06-29
Publication date: 2013-01-17

Abstract

PROBLEM TO BE SOLVED: To synthesize a high-quality virtual viewpoint image while suppressing depth estimation errors, even when correspondences between images are difficult to establish. SOLUTION: An image processing method synthesizes an image of a subject viewed from a given virtual viewpoint position by referring to multi-viewpoint images obtained by imaging the subject from a plurality of different viewpoints. The image processing method comprises: calculating a likelihood for the depth of each pixel of the multi-viewpoint images; estimating the depth of each pixel according to the likelihood; calculating an estimation function for estimating the likelihood for the depth from image features, using the depth estimation results of high-accuracy estimation pixels; correcting the likelihood of correction target pixels using the estimation function; re-estimating the depth of the whole image using the corrected likelihood; and synthesizing the image of the subject corresponding to the virtual viewpoint position by referring to the re-estimated depth and the multi-viewpoint images.

Description

The present invention relates to a technique that is effective when a subject contains regions with little texture or occlusions, so that correspondences are difficult to establish by the stereo matching method.

Synthesizing an image viewed from a virtual viewpoint position using multi-viewpoint images captured by a plurality of cameras is called virtual viewpoint image synthesis. FIG. 10 shows the processing flow of a conventional technique for synthesizing an image at an arbitrary viewpoint position from multi-viewpoint images. This flow is described below. First, the multi-viewpoint images and the camera parameters are input (step Sa1). Next, three-dimensional information (depth) is estimated from the group of two-dimensional images (step Sa2). Then, a virtual viewpoint image is synthesized based on the multi-viewpoint images, the camera parameters, and the depth (step Sa3). If the depth estimation accuracy is low, the quality of the synthesized virtual viewpoint image deteriorates.

One method for estimating depth is the stereo matching method. Stereo matching uses pixel correspondences between the multi-viewpoint images together with the intrinsic and extrinsic parameters of the cameras. The position of the pixel of interest in real space is then computed by the principle of triangulation. FIG. 11 outlines the processing of the stereo matching method. For example, as shown in FIG. 11, suppose that a point of interest A is viewed from points P1 and P2. In this case, if the length of the straight line connecting P1 and P2 and the angles at the vertices of the triangle formed by A, P1, and P2 are obtained, the distance from P1 (or P2) to A can be calculated.
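The triangulation described above can be sketched with the law of sines. This is an illustrative example only: the function name `distance_from_p1` and its parameters are assumptions introduced here, not names from the patent, and it covers the planar triangle of FIG. 11 rather than a full stereo pipeline.

```python
import math

def distance_from_p1(baseline, angle_p1, angle_p2):
    """Distance from P1 to the point of interest A (hypothetical helper).

    baseline: length of the segment P1-P2
    angle_p1, angle_p2: interior angles (radians) of triangle A-P1-P2
                        at the vertices P1 and P2
    """
    # The interior angles of a triangle sum to pi, which fixes the angle at A.
    angle_a = math.pi - angle_p1 - angle_p2
    if angle_a <= 0:
        raise ValueError("the two viewing rays do not meet at a point A")
    # Law of sines: |P1 A| / sin(angle at P2) = |P1 P2| / sin(angle at A)
    return baseline * math.sin(angle_p2) / math.sin(angle_a)
```

With a baseline of 1 and both angles equal to 60 degrees, the triangle is equilateral and the function returns a distance of 1 (up to floating-point rounding).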

However, when there is a region with little pattern (texture), a region with periodic texture, or a region affected by occlusion, it is difficult to establish correspondences for the pixels in that region. FIG. 12 illustrates this problem in the conventional technique. For example, as shown in FIG. 12, consider a case where occlusion occurs because a subject C, such as a bird, passes in front. The point of interest A cannot be seen from point P1, so no correspondence can be established.

In this situation, the stereo matching method tends to erroneously associate the point of interest A seen from point P2 with a point B that appears to have a similar shape when seen from point P1. Pixels whose depth is estimated incorrectly, such as the one at point A, then produce unnatural images (artifacts) in the synthesized image. This is an important problem that directly affects the quality of virtual viewpoint image synthesis.

Conventional virtual viewpoint image synthesis methods include approaches that handle such hard-to-match pixels by segmenting the image.

For example, the image is finely segmented based on pixel color (R, G, B) information, and the pixels in the same segment are assumed to belong to the same subject, that is, to lie on the same plane (or curved surface). Under this assumption, there is a method that corrects the depth of a pixel of interest so that it matches the depth of the plane of the segment to which the pixel belongs (see, for example, Non-Patent Document 1).

Another depth correction technique separates the foreground and the background using the color information of the image. On the premise that there are only two kinds of subject, foreground and background, this technique takes a pixel that is difficult to match by the stereo matching method, detects the subject (foreground or background) whose color is similar to that pixel, and corrects the depth likelihood of the pixel using the depth information of that subject (see, for example, Non-Patent Documents 2 and 3).

Non-Patent Document 1: A. Klaus, M. Sormann, K. Karner: Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure, in Proc. of ICPR, pp. 15-18 (2006)
Non-Patent Document 2: V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, C. Rother: Bi-layer segmentation of binocular stereo video, in Proc. of CVPR, vol. 2, pp. 407-414 (2005)
Non-Patent Document 3: Ishii, Takahashi, Naemura: Combining synthesis and segmentation for free viewpoint images, 3D Image Conference, 5-1, pp. 49-52 (2009)

In the virtual viewpoint image synthesis research described above, the depth information of a pixel is corrected using the depth information of the pixels in the same segment as that pixel. This approach is effective in environments where cameras can be placed densely and correspondence errors occur only in small regions. However, when correspondences are wrong over a wide area around the pixel of interest, that is, when most of the pixels in the same segment are mismatched, the depth estimation accuracy of most pixels in the segment is low. Consequently, it is difficult to correct the depth of the pixel of interest correctly even by using the depth estimation result of that segment.

In the technique that separates foreground and background subjects, using the color features of each subject is effective. In virtual viewpoint image synthesis, however, depth is multi-valued, and it is difficult to approximate it with only two values, a foreground depth and a background depth.

The present invention has been made in view of these circumstances. Its object is to provide a technique that suppresses depth estimation errors and can synthesize a high-quality virtual viewpoint image even when correspondences between images are difficult to establish because of regions with little texture or the influence of occlusion.

One aspect of the present invention is an image processing method for synthesizing an image of a subject viewed from an arbitrary virtual viewpoint position based on multi-viewpoint images obtained by photographing the subject from a plurality of different viewpoints, the method including: a first step of calculating, by a stereo matching method, a likelihood for the depth of each pixel of the multi-viewpoint images; a second step of estimating the depth of each pixel based on the likelihood obtained in the first step; a third step of calculating an estimation function for estimating the likelihood for the depth from image features, using the depth estimation results of high-accuracy estimation pixels, which satisfy a condition under which the depth estimation accuracy is presumed to be high; a fourth step of correcting the likelihood of correction target pixels, which satisfy a condition under which the depth estimation accuracy is presumed to be low, using the estimation function calculated in the third step; a fifth step of re-estimating the depth of the entire image using the likelihood corrected in the fourth step; and a sixth step of synthesizing the image of the subject corresponding to the virtual viewpoint position based on the depth re-estimated in the fifth step and the multi-viewpoint images.

One aspect of the present invention is an image processing apparatus that synthesizes an image of a subject viewed from an arbitrary virtual viewpoint position based on multi-viewpoint images obtained by photographing the subject from a plurality of different viewpoints, the apparatus including: a likelihood calculation unit that calculates, by a stereo matching method, a likelihood for the depth of each pixel of the multi-viewpoint images; a depth estimation unit that estimates the depth of each pixel based on the likelihood obtained by the likelihood calculation unit; a likelihood estimation function calculation unit that calculates an estimation function for estimating the likelihood for the depth from image features, using the depth estimation results of high-accuracy estimation pixels, which satisfy a condition under which the depth estimation accuracy is presumed to be high; a likelihood correction unit that corrects the likelihood of correction target pixels, which satisfy a condition under which the depth estimation accuracy is presumed to be low, using the estimation function calculated by the likelihood estimation function calculation unit; a depth re-estimation unit that re-estimates the depth of the entire image using the likelihood corrected by the likelihood correction unit; and an image synthesis unit that synthesizes the image of the subject corresponding to the virtual viewpoint position based on the depth re-estimated by the depth re-estimation unit and the multi-viewpoint images.

One aspect of the present invention is a computer program that causes a computer of an image processing apparatus, which synthesizes an image of a subject viewed from an arbitrary virtual viewpoint position based on multi-viewpoint images obtained by photographing the subject from a plurality of different viewpoints, to execute: a likelihood calculation step of calculating, by a stereo matching method, a likelihood for the depth of each pixel of the multi-viewpoint images; a depth estimation step of estimating the depth of each pixel based on the likelihood obtained in the likelihood calculation step; a likelihood estimation function calculation step of calculating an estimation function for estimating the likelihood for the depth from image features, using the depth estimation results of high-accuracy estimation pixels, which satisfy a condition under which the depth estimation accuracy is presumed to be high; a likelihood correction step of correcting the likelihood of correction target pixels, which satisfy a condition under which the depth estimation accuracy is presumed to be low, using the estimation function calculated in the likelihood estimation function calculation step; a depth re-estimation step of re-estimating the depth of the entire image using the likelihood corrected in the likelihood correction step; and an image synthesis step of synthesizing the image of the subject corresponding to the virtual viewpoint position based on the depth re-estimated in the depth re-estimation step and the multi-viewpoint images.

According to the present invention, depth estimation errors can be suppressed and a high-quality virtual viewpoint image can be synthesized even when correspondences between images are difficult to establish because of regions with little texture or the influence of occlusion.

FIG. 1 is a block diagram showing the configuration of the virtual viewpoint image synthesis system.
FIG. 2 is a flowchart explaining the virtual viewpoint image synthesis method according to the embodiment.
FIG. 3 is a conceptual diagram showing an example arrangement of the cameras used in the virtual viewpoint image synthesis method according to the embodiment.
FIG. 4 is a conceptual diagram explaining the method of calculating the likelihood for depth according to the embodiment.
FIG. 5 is a conceptual diagram explaining the epipolar lines (EL1, EL2) between a plurality of images.
FIG. 6 is a conceptual diagram explaining the method of calculating the likelihood estimation function F for depth from image features.
FIG. 7 is a conceptual diagram explaining the method of calculating the likelihood estimation function F for depth from image features.
FIG. 8 is a conceptual diagram explaining image synthesis at the virtual viewpoint position.
FIG. 9 is a conceptual diagram explaining the 3D warping method.
FIG. 10 is a diagram showing the processing flow of a conventional technique for synthesizing an image at an arbitrary viewpoint position using multi-viewpoint images.
FIG. 11 is a diagram outlining the processing of the stereo matching method.
FIG. 12 is a diagram showing a problem in the conventional technique.

<Outline>
First, an outline of the virtual viewpoint image synthesis system according to an embodiment of the present invention (hereinafter simply the "virtual viewpoint image synthesis system") is described.
The virtual viewpoint image synthesis system synthesizes, with high quality, realistic images that make viewers feel as if they were present in the environment where the footage was shot, for uses such as watching sports like table tennis and tennis or distance-learning materials made from recordings of university classes. The system therefore achieves high-quality virtual viewpoint image synthesis even outside shooting environments in which cameras are densely arranged, as in the ray space method or the visual volume intersection method, and outside environments in which the subject can be shot from all directions. That is, targeting footage shot at actual stadiums, event venues, and the like, the system achieves high-quality virtual viewpoint image synthesis for sports scenes such as table tennis and tennis and for event scenes such as live concerts.

To achieve such synthesis, the virtual viewpoint image synthesis system calculates, for a pixel or a segmented region where correspondence is difficult, a function that corrects the depth likelihood of that pixel or segmented region from image features, and performs the correction using the output of that function. Here, an image feature means color information, texture information, or motion information.

Specifically, the procedure is as follows. First, the virtual viewpoint image synthesis system uses pixels detected in advance as having high matching accuracy (hereinafter, high-accuracy estimation pixels) to extract image features for each depth value. Next, it compares the image features obtained for each depth value with the features of a pixel in a region where correspondence is difficult (hereinafter, correction target pixel). It then corrects the depth of the correction target pixel using the depth value whose image features are most similar. A hard-to-match pixel and a hard-to-match small region (a segmented small region) differ only in scale (spatial size), not in essence. The following description therefore covers only the correction method for hard-to-match pixels.
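The three steps above can be illustrated with a minimal sketch. It assumes that the image feature is an RGB color, that the per-depth feature is the mean color of the high-accuracy estimation pixels at that depth, and that similarity is squared Euclidean distance. None of these choices is stated in the text at this point, and the function names are hypothetical.

```python
def mean_feature_per_depth(samples):
    """Average feature for each depth value.

    samples: iterable of (depth, feature) pairs taken from high-accuracy
             estimation pixels; feature is a sequence such as (R, G, B).
    """
    sums, counts = {}, {}
    for depth, feature in samples:
        acc = sums.setdefault(depth, [0.0] * len(feature))
        for i, value in enumerate(feature):
            acc[i] += value
        counts[depth] = counts.get(depth, 0) + 1
    return {d: [s / counts[d] for s in acc] for d, acc in sums.items()}


def most_similar_depth(feature, depth_features):
    """Depth value whose mean feature is closest to the given feature."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(depth_features,
               key=lambda d: squared_distance(feature, depth_features[d]))


# Example: reddish pixels were reliably matched at depth 1, bluish ones at depth 5.
reliable = [(1, (255, 0, 0)), (1, (250, 10, 0)), (5, (0, 0, 255))]
depth_features = mean_feature_per_depth(reliable)
corrected_depth = most_similar_depth((240, 5, 5), depth_features)
```

Picking a single best depth is a simplification. The method described in this document corrects the likelihood over depth rather than the depth itself, so a fuller version would return a score for every depth value.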

<Details>
Next, the virtual viewpoint image synthesis system is described in detail.
FIG. 1 is a block diagram showing the configuration of the virtual viewpoint image synthesis system. The subject photographing unit 101 is a multi-viewpoint image acquisition system made up of a plurality of cameras. It supplies the captured video signal S1 to the camera image acquisition unit 102. The camera parameter input unit 103 is a device for inputting the calibrated camera parameters P1. The virtual viewpoint position input unit 105 is a device for inputting the viewpoint position desired by the user. The camera parameter input unit 103 and the virtual viewpoint position input unit 105 are, for example, user interfaces such as a keyboard, mouse, or touch input device, or external storage devices such as a DVD (Digital Versatile Disc) or USB (Universal Serial Bus) memory.

The virtual viewpoint image synthesis apparatus 100 includes a camera image acquisition unit 102, a depth estimation unit 104, a virtual viewpoint position determination unit 106, an image data storage unit 107, an image synthesis unit 108, and a synthesized image output unit 109. The camera image acquisition unit 102 acquires the video signal S1 from the subject photographing unit 101 and supplies it to the image data storage unit 107 as image data D1. The virtual viewpoint position determination unit 106 determines the camera parameters for the virtual viewpoint position given by the virtual viewpoint position input unit 105 and supplies them to the image synthesis unit 108.

The image data storage unit 107 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. It includes a camera image / camera parameter storage unit 107a, a depth storage unit 107b, and a synthesized image storage unit 107c. These storage units may be configured on the same storage device or on different storage devices. The camera image / camera parameter storage unit 107a stores the image data D1 from the camera image acquisition unit 102. The depth storage unit 107b stores the estimated depth data D2 output from the depth estimation unit 104, described later. The synthesized image storage unit 107c stores the image data D3 output from the image synthesis unit 108, described later. The images of a scene captured in advance by the cameras of the subject photographing unit 101 and the camera parameters P1 obtained by calibration are stored in the camera image / camera parameter storage unit 107a, and the output D2 of the depth estimation unit 104 is stored in the depth storage unit 107b. Image synthesis can then be executed independently in response to the input of the virtual viewpoint position desired by the user.

The depth estimation unit 104 reads the camera parameters P1 and the image data D1 from the camera image / camera parameter storage unit 107a, outputs the depth estimation result D2, and supplies it to the depth storage unit 107b.
The image synthesis unit 108 reads the camera parameters P1 and the image data D1 from the camera image / camera parameter storage unit 107a, reads the depth estimation result D2 from the depth storage unit 107b, and outputs the synthesized image (the image viewed from the virtual viewpoint) as data D3.

The synthesized image output unit 109 reads the synthesized image data D3 stored in the synthesized image storage unit 107c as output image data and outputs it to the synthesized image display unit 110 as a video signal S2 for display. The synthesized image display unit 110 is a display device, such as a CRT (Cathode Ray Tube), LCD (Liquid Crystal Display), or PDP (Plasma Display Panel), connected to the synthesized image output unit 109 through a display terminal or the like. The synthesized image display unit 110 displays the synthesized image according to the video signal S2 from the synthesized image output unit 109. The synthesized image display unit 110 may be, for example, a flat two-dimensional device or a curved display device that surrounds the user.

(Description of the image synthesis method)
Next, the virtual viewpoint image synthesis method performed by the virtual viewpoint image synthesis apparatus 100 of this embodiment is described. FIG. 2 is a flowchart explaining the method. In virtual viewpoint image synthesis, the cameras may in principle be arranged freely. In this embodiment, however, the cameras are arranged in a grid or along a straight line so that a common field of view is easy to secure across multiple cameras. FIG. 3 is a conceptual diagram showing an example camera arrangement used by the method. As shown in FIG. 3, the cameras C_{n-2}, C_{n-1}, C_n, C_{n+1}, ... are oriented either in parallel or radially so that a specific subject is at the gazing point M, and all of the cameras are synchronized.

[Input of multi-viewpoint images and camera parameters]
First, as preprocessing, the camera parameters of each camera are obtained by calibration and input through the camera parameter input unit 103 (step S1). Let n (= 1, 2, 3, ..., N) be the camera number, A_n the intrinsic parameters of the camera, R_n and T_n the extrinsic parameters, and m_n the position of a pixel in the image of camera C_n. The relationship among the position m_n = [x_n, y_n] on the image of camera C_n, the position M_c = [X_c, Y_c, Z_c] in the coordinate system of camera C_n, and the position M = [X, Y, Z] in the world coordinate system is given by the following equations (1) and (2).
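The equations themselves are not reproduced in this text. Under the standard pinhole camera model they would plausibly take a form such as the following. Two things here are assumptions rather than facts from the source: the convention that R_n and T_n map world coordinates to camera coordinates (the opposite convention is equally possible), and the number (3) for the combined relation, which is inferred from the later reference to the homography as Equation (4).

```latex
\begin{align}
s_n \tilde{m}_n &= A_n M_c \tag{1} \\
M_c &= R_n M + T_n \tag{2} \\
s_n \tilde{m}_n &= A_n \left[ R_n \mid T_n \right] \tilde{M} \tag{3}
\end{align}
```

With an intrinsic matrix A_n whose last row is [0, 0, 1], the third component of (1) gives s_n = Z_c, which is how a known depth fixes the scale s_n.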

The following relation is obtained from Equations (1) and (2).

Here, s_n is a positive constant that determines the scale in the depth direction, the superscript T denotes the transpose, and ~m_n and ~M (m_n and M with a tilde) are augmented vectors: ~m_n = [x_n, y_n, 1]^T and ~M = [X, Y, Z, 1]^T.
If the depth of the image is known, the constant s_n is determined by Equation (1), which gives the position M_c in the coordinate system of camera C_n. The position M in the world coordinate system can then be obtained from Equation (2).
Furthermore, when the depth of pixel m_n of camera C_n is Z = d, the corresponding pixel m_{n-1} on the image of camera C_{n-1} can be obtained using the homography matrix H_{n,n-1}.
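As a rough illustration of how a pixel with known depth maps into a neighboring camera, the sketch below chains a back-projection and a projection instead of forming the homography matrix H_{n,n-1} explicitly. For a point at the given depth the two approaches yield the same pixel. It assumes the convention M_c = R M + T, an upper-triangular intrinsic matrix A with last row [0, 0, 1], and that the depth d is the camera-frame Z_c. All function names are hypothetical.

```python
def mat_vec(m, v):
    """3x3 matrix times 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def inv_intrinsics(a):
    """Closed-form inverse of A = [[fx, skew, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, skew, cx = a[0]
    fy, cy = a[1][1], a[1][2]
    return [[1.0 / fx, -skew / (fx * fy), (skew * cy - cx * fy) / (fx * fy)],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def back_project(pixel, depth, a, r, t):
    """Pixel (x, y) at camera-frame depth -> world point (assumes M_c = R M + T)."""
    ray = mat_vec(inv_intrinsics(a), [pixel[0], pixel[1], 1.0])
    m_c = [depth * c for c in ray]           # ray[2] == 1, so s_n equals depth
    diff = [m_c[i] - t[i] for i in range(3)]
    r_t = [list(col) for col in zip(*r)]     # R is a rotation: inverse = transpose
    return mat_vec(r_t, diff)


def project(point, a, r, t):
    """World point -> pixel (x, y) in the camera described by A, R, T."""
    m_c = [c + t[i] for i, c in enumerate(mat_vec(r, point))]
    u = mat_vec(a, m_c)
    return (u[0] / u[2], u[1] / u[2])
```

Calling `project(back_project(m_n, d, A_n, R_n, T_n), A_o, R_o, T_o)` for each candidate depth d traces out the candidate positions along the epipolar line in the neighboring camera.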

[Calculation of the likelihood for depth]
Next, the depth estimation unit 104 obtains the likelihood for each depth for the image I_n of camera C_n by the stereo matching method (step S2). The depth can be estimated in the same way for the images of all cameras other than camera C_n. Because multi-viewpoint images are assumed, the likelihood calculation uses SSSD (Sum of SSDs) from the multiple-baseline stereo matching method (Reference 1: Okutomi, Kanade: A stereo matching method using multiple baseline lengths, IEICE Trans., vol. J75-D-II, no. 8, pp. 1317-1327 (1992)), which extends the SSD (Sum of Squared Differences) used in two-camera stereo.

The following shows the likelihood calculation when NCC (Normalized Cross Correlation) is used. For the pixel of interest p in the image I_n of camera C_n, the likelihood L_p(d) for depth d is expressed by the following equation (6).

Here, O is the set of cameras neighboring camera C_n, r is the position of the pixel in the image I_o of camera C_o obtained with the homography matrix of Equation (4), and ν_γ is the vector formed by arranging the R, G, and B luminance values of the local region around pixel r in image I_o. ν_p · ν_γ denotes the inner product of the vectors, and norm denotes the magnitude of a vector, such as the 1-norm or the 2-norm. Γ_p is a normalization coefficient that makes the likelihoods L_p(d) sum to 1 as the depth d is varied.
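Equation (6) itself is not reproduced in this text. The sketch below is one reading of the definitions above, not the patent's exact formula: for each candidate depth it sums, over the neighboring cameras, the inner product of the two patch vectors divided by the product of their norms, then divides by the total over all depths so that the likelihoods sum to 1. The 2-norm, the treatment of Γ_p as a divisor, and the function names are assumptions.

```python
import math


def ncc(u, v):
    """Inner product of u and v divided by the product of their 2-norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # an all-zero patch carries no evidence
    return dot / (norm_u * norm_v)


def depth_likelihoods(v_p, candidates):
    """Likelihood L_p(d) for every candidate depth d.

    v_p:        patch vector around the pixel of interest p
    candidates: {depth: [patch vector in each neighboring camera o in O]}
    """
    raw = {d: sum(ncc(v_p, v_r) for v_r in vectors)
           for d, vectors in candidates.items()}
    total = sum(raw.values())           # plays the role of the normalization
    if total == 0:
        return {d: 1.0 / len(raw) for d in raw}
    return {d: score / total for d, score in raw.items()}
```

Because R, G, and B luminance values are non-negative, every inner product is non-negative, so the normalized values behave like a probability distribution over depth.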

FIG. 4 is a conceptual diagram for explaining the method of calculating the likelihood for depth according to this embodiment. FIG. 5 is a conceptual diagram for explaining the epipolar lines (EL1, EL2) between a plurality of images. As shown in FIG. 4, the local region is a region of 3 × 3, 5 × 5, 7 × 7 pixels, or the like around the pixel of interest p. ν_p and ν_γ can be expressed as vectors obtained by raster-scanning the R, G, and B component values. For example, when the local region is 3 × 3, each component has 9 dimensions, so ν_p is a 27-dimensional vector (= 9 dimensions × 3 components).
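The NCC-based likelihood described above can be sketched as follows. Since the body of equation (6) does not survive in this text, the sketch assumes that the likelihood is the sum, over the neighboring cameras, of the normalized inner product of the patch vectors, rescaled to sum to 1 over the depth candidates. The function names, the 2-norm, and the pre-extracted input array `v_r_per_depth` are illustrative assumptions, not part of the patent.

```python
import numpy as np

def patch_vector(image, x, y, half=1):
    """Raster-scan the R, G, B values of the (2*half+1)^2 patch centred at (x, y)."""
    patch = image[y - half:y + half + 1, x - half:x + half + 1, :]
    # Channel-major order: all R values, then all G values, then all B values.
    return patch.transpose(2, 0, 1).reshape(-1).astype(np.float64)

def ncc_likelihood(v_p, v_r_per_depth):
    """Likelihood over depth candidates for one pixel of interest p.

    v_p:           (K,) patch vector around p.
    v_r_per_depth: (num_cameras, num_depths, K) patch vectors around the
                   corresponding pixel r in each neighboring camera, one per
                   depth candidate (assumed to be extracted beforehand).
    """
    dots = v_r_per_depth @ v_p                       # inner products
    norms = np.linalg.norm(v_r_per_depth, axis=2) * np.linalg.norm(v_p)
    ncc = dots / np.maximum(norms, 1e-12)            # guard against zero vectors
    score = np.clip(ncc.sum(axis=0), 0.0, None)      # sum over neighboring cameras
    total = score.sum()
    if total == 0.0:
        return np.full(score.shape, 1.0 / score.size)
    return score / total                             # plays the role of 1 / Gamma_p
```

With a 3 × 3 patch (`half=1`), `patch_vector` returns the 27-dimensional vector mentioned in the text.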

With equation (6), the likelihood for pixel p of camera C_n is obtained by computing the correlation of the local regions along the epipolar lines between the images (see FIG. 5). How the neighboring cameras C_o are chosen for camera C_n depends on the shooting environment. Choosing cameras that share as much of the field of view as possible makes matching easier. It is therefore preferable to choose two or more cameras close to camera C_n.

[Depth estimation]
Next, the depth estimation unit 104 estimates the depth based on the likelihood of each pixel (step S3). This embodiment estimates the depth of the multi-viewpoint images by solving a minimization problem for an energy function defined by the per-pixel likelihood and a smoothing term. In this method, the energy function is defined by the likelihood for the depth of each pixel together with a smoothing term that favors depth estimates close to those of the neighboring pixels. With the stereo-matching likelihood alone, the subject surface tends to be estimated with an uneven depth. Adding a smoothing term smooths the estimation result, and its effectiveness has been reported (Reference 2: Li Hong, George Chen: Segment-based Stereo Matching Using Graph Cuts, in Proc. of CVPR, vol. 1, pp. 74-81 (2004)).
For the image I_n of camera C_n, with the pixel of interest denoted by p and a neighboring pixel by q, the energy function is defined by the following equations (7), (8), and (9).

Here, the capital D(p) is the estimated depth of pixel p; E_Likelihood is a function that outputs the cost incurred when the depth of pixel p is estimated to be D(p); E_smooth is the smoothing term; and λ is the ratio that balances the two functions. The larger the likelihood, the smaller the cost. E_smooth outputs a smaller cost as the difference between the estimated depths D(p) and D(q) of pixel p and its neighboring pixel q becomes smaller.
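As a rough illustration of how such an energy could be evaluated for a candidate depth map, the sketch below assumes a data cost of 1 − L_p(D(p)) and an absolute-difference smoothing cost over 4-connected neighbors. These concrete forms are assumptions made for illustration (the bodies of equations (7) to (9) do not survive in this text); only the overall structure, a likelihood cost plus λ times a smoothing cost, follows the description.

```python
import numpy as np

def total_energy(likelihood, depth, lam=1.0):
    """Energy of a depth labelling.

    likelihood: (H, W, D) array holding L_p(d) for every pixel and depth layer.
    depth:      (H, W) integer array holding the layer index D(p) of each pixel.
    lam:        weight of the smoothing term (lambda in the text).
    """
    h, w, _ = likelihood.shape
    rows, cols = np.indices((h, w))
    # Assumed data cost: the higher the likelihood, the lower the cost.
    data_cost = 1.0 - likelihood[rows, cols, depth]
    # Assumed smoothing cost: |D(p) - D(q)| over vertical and horizontal
    # neighbor pairs, each pair counted once.
    d = depth.astype(np.float64)
    smooth_cost = np.abs(np.diff(d, axis=0)).sum() + np.abs(np.diff(d, axis=1)).sum()
    return data_cost.sum() + lam * smooth_cost
```

A solver such as Graph Cuts or Belief Propagation would search for the labelling `depth` that makes this value small; the function here only scores a given labelling.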

Besides equation (9), the smoothing term may take the form of the following equation (10), which varies the cost according to the color difference from the neighboring pixel q, or the following equation (11), which yields a constant cost whenever the depths of pixels p and q differ.

Here, I(p) and I(q) are the color information of pixels p and q of camera C_n, that is, vectors of the [R, G, B] components at the positions of pixels p and q, and ||I(p) − I(q)|| denotes the 2-norm. With the smoothing term of equation (10), the depths of neighboring pixels can change more easily where the color changes.
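The three kinds of smoothing term mentioned above might look like the following. The exact bodies of equations (9) to (11) are not available here, so the exponential color weighting, the parameter `sigma`, and the `penalty` constant are hypothetical choices that merely reproduce the described behavior: a cost that grows with the depth difference, a cost that is relaxed across color changes, and a constant cost for any depth difference.

```python
import numpy as np

def smooth_linear(d_p, d_q):
    """Cost proportional to the depth difference (in the spirit of equation (9))."""
    return abs(d_p - d_q)

def smooth_color_weighted(d_p, d_q, i_p, i_q, sigma=10.0):
    """Depth-difference cost attenuated where the colors of p and q differ.

    The exp(-||I(p) - I(q)|| / sigma) weight is an assumed form; sigma is a
    hypothetical tuning parameter.
    """
    color_diff = np.linalg.norm(np.asarray(i_p, dtype=np.float64)
                                - np.asarray(i_q, dtype=np.float64))
    return abs(d_p - d_q) * float(np.exp(-color_diff / sigma))

def smooth_constant(d_p, d_q, penalty=1.0):
    """Constant cost whenever the two depths differ, zero otherwise."""
    return penalty if d_p != d_q else 0.0
```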

Finally, the depth that minimizes the energy function E_total of equation (7) is obtained. An approximate solution to this minimization problem can be obtained with an algorithm such as Simulated Annealing, Graph Cuts, or Belief Propagation.

[Evaluation of depth estimation results]
Next, the depth estimation unit 104 detects the pixels whose depth is to be corrected (correction target pixels) and the pixels whose depth estimation accuracy is high (high-precision estimated pixels) (step S4). The depth estimation unit 104 selects, as correction target pixels, pixels that are difficult to match by the stereo matching method. Two evaluation methods are described below.
(1) Evaluation using the likelihood of the stereo matching method
For pixels in regions with little texture and in occluded regions, the maximum value over depth of the likelihood function of pixel p in equation (6) tends to be small. In addition, when the depth is estimated incorrectly, artifacts appear in a virtual viewpoint image synthesized with that depth.

Therefore, the depth estimation unit 104 may select, as a correction target pixel, a pixel whose maximum likelihood is smaller than the threshold Th_like and for which the difference between the image synthesized with the estimated depth and the real camera image is larger than the threshold Th_diff. Conversely, the depth estimation unit 104 may select, as a high-precision estimated pixel, a pixel whose maximum likelihood is larger than the threshold Th_like and for which that difference is smaller than the threshold Th_diff.
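A minimal sketch of this two-threshold selection is given below. It assumes that the per-pixel maximum likelihood and the per-pixel difference between the synthesized image and the real camera image are already available as arrays; the function and argument names are illustrative. When no thresholds are passed, the image-wide means are used, which is the choice reported for this embodiment.

```python
import numpy as np

def classify_pixels(max_likelihood, synth_diff, th_like=None, th_diff=None):
    """Return boolean masks (correction_target, high_precision).

    max_likelihood: (H, W) maximum of L_p(d) over d for each pixel.
    synth_diff:     (H, W) difference between the image synthesized with the
                    estimated depth and the real camera image.
    """
    if th_like is None:
        th_like = max_likelihood.mean()
    if th_diff is None:
        th_diff = synth_diff.mean()
    # Weak likelihood peak and large re-synthesis error: needs correction.
    correction_target = (max_likelihood < th_like) & (synth_diff > th_diff)
    # Strong likelihood peak and small re-synthesis error: trusted estimate.
    high_precision = (max_likelihood > th_like) & (synth_diff < th_diff)
    return correction_target, high_precision
```

Pixels that satisfy neither condition fall into neither set.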

These thresholds are parameters determined, for example, by prior experiments. In this embodiment, the average likelihood over the whole image and the average difference are used as the thresholds Th_like and Th_diff, respectively. Hereinafter, a high-precision estimated pixel is denoted by u, and the set of high-precision estimated pixels by U.
(2) Evaluation by comparison with neighboring camera images
The evaluation of the estimation accuracy for pixel p of the image I_i of camera C_n is described.
Let D_i(p) be the estimated depth of pixel p, let q be the position of the pixel obtained by projecting p into the neighboring camera C_o with the homography matrix of equation (4) on the basis of that depth, and let D_o(q) be the estimated depth of pixel q of camera C_o. The evaluation is then performed with the following expressions.

For pixel p, thresholds Th_SD and Th_SI are set for SD and SI, which compare its depth and color with those of pixel q of the neighboring camera C_o, and pixels at or below both thresholds are judged to have high estimation accuracy. These thresholds are parameters determined experimentally.
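The cross-view check can be sketched as below. Because the expressions for SD and SI are not reproduced in this text, the sketch assumes SD = |D_i(p) − D_o(q)| and SI = ||I_i(p) − I_o(q)|| (2-norm), and it assumes the two depth maps are expressed in comparable units. The correspondence arrays `qx` and `qy` stand in for the projection by the homography of equation (4), which is not reimplemented here; all names are illustrative.

```python
import numpy as np

def cross_view_high_precision(depth_i, image_i, depth_o, image_o,
                              qx, qy, th_sd, th_si):
    """Mask of pixels p whose depth and color agree with the neighboring view.

    depth_i, depth_o: (H, W) depth maps of the reference and neighboring camera.
    image_i, image_o: (H, W, 3) color images of the two cameras.
    qx, qy:           (H, W) integer coordinates of the projected pixel q.
    """
    h, w = depth_o.shape
    inside = (qx >= 0) & (qx < w) & (qy >= 0) & (qy < h)
    # Clip only so that indexing is safe; out-of-view pixels are rejected below.
    qx_c = np.clip(qx, 0, w - 1)
    qy_c = np.clip(qy, 0, h - 1)
    sd = np.abs(depth_i - depth_o[qy_c, qx_c])
    si = np.linalg.norm(image_i.astype(np.float64)
                        - image_o[qy_c, qx_c].astype(np.float64), axis=-1)
    return inside & (sd <= th_sd) & (si <= th_si)
```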

[Calculation of the depth estimation function f from image features]
Next, a depth estimation function is calculated for each pixel p of camera C_n that is to be corrected (step S5). The calculation of the depth estimation function is described below with reference to FIGS. 6 and 7, which are conceptual diagrams for explaining how the depth estimation function f is calculated from image features. The depth estimation function is calculated using the high-precision estimated pixels u (∈ U) within a radius R of the correction target pixel p (see FIGS. 6 and 7). Here, the pixel set U also includes the pixels of the cameras C_o (o = ..., n−2, n−1, n, n+1, ...) near camera C_n projected onto camera C_n. Denoting a high-precision estimated pixel of camera C_o by u_o, and the pixel at the coordinates obtained by projecting u_o onto camera C_n by u^o_n, the set U of high-precision estimated pixels u used in calculating the likelihood estimation function is obtained by the following equations (12) and (13).

Next, D multilayer planes are set in the depth direction of camera C_n, and image features are extracted from the high-precision estimated pixels u (∈ U) belonging to each layer d (= 1, 2, ..., D). An image feature is extracted from an N × N local region containing a high-precision estimated pixel u whose depth is d. Examples include the three-dimensional vector of the color (R, G, B) components of u with N = 1; a vector containing texture information obtained by raster-scanning the R, G, and B values of a 5 × 5 region, as shown in FIG. 4; HOG (Histograms of Oriented Gradients) features; and SURF (Speeded-Up Robust Features) features.

The set of high-precision estimated pixels u whose depth is estimated to be d is denoted by U_d, and the feature vector of such a pixel u by v_d. The depth estimation function estimates the depth from the similarity or distance between these dictionary vectors and the image feature vector of the correction pixel. The similarity is calculated, for example, from the Mahalanobis distance between the dictionary vectors and the image feature vector of the correction pixel, from the distance between the nearest-neighbor vector found by a nearest-neighbor search and the image feature vector of the correction pixel, or from the angle between the subspace generated from the dictionary vectors and the feature vector of the correction pixel.
The following shows how the depth estimation function f, when the Mahalanobis distance is used, calculates the likelihood F(d) that the correction pixel belongs to depth d. Denoting the feature vector of the correction pixel by x_p, F(d) is expressed by the following equations (14), (15), (16), and (17).

Here, Γ_F is a normalization coefficient that makes the likelihoods over the depths d (= 1, 2, ..., D) sum to 1; num(v_d) is the number of dictionary vectors v_d; dist(x_p, μ_d) is the Mahalanobis distance; μ_d is the mean of the dictionary vectors v_d at depth d; S_d is the covariance matrix; and ε is a small value for avoiding division by zero. The radius R is a parameter determined experimentally; in this embodiment, R = 10 to 40 and ε = 0.1.
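One way to realize a Mahalanobis-based depth likelihood of this kind is sketched below. The bodies of equations (14) to (17) are not available in this text, so the sketch assumes F(d) ∝ 1 / (dist(x_p, μ_d) + ε), with μ_d and S_d taken as the sample mean and sample covariance of the dictionary vectors of layer d. The use of a pseudo-inverse for a possibly singular covariance and the skipping of layers with fewer than two vectors are additional assumptions.

```python
import numpy as np

def depth_likelihood_from_features(x_p, dictionary, eps=0.1):
    """Likelihood F(d) that the correction pixel belongs to each depth layer.

    x_p:        (K,) feature vector of the correction pixel.
    dictionary: list with one (num_d, K) array of dictionary vectors per layer d.
    eps:        small value that avoids division by zero.
    """
    x_p = np.asarray(x_p, dtype=np.float64)
    scores = np.zeros(len(dictionary))
    for d, v_d in enumerate(dictionary):
        v_d = np.asarray(v_d, dtype=np.float64)
        if len(v_d) < 2:
            continue  # too few samples to estimate a covariance; score stays 0
        mu_d = v_d.mean(axis=0)
        s_d = np.atleast_2d(np.cov(v_d, rowvar=False))
        diff = x_p - mu_d
        # Pseudo-inverse keeps the distance defined when S_d is singular.
        dist = np.sqrt(max(float(diff @ np.linalg.pinv(s_d) @ diff), 0.0))
        scores[d] = 1.0 / (dist + eps)
    total = scores.sum()
    if total == 0.0:
        return np.full(len(dictionary), 1.0 / len(dictionary))
    return scores / total  # division by the total plays the role of Gamma_F
```

In use, `dictionary[d]` would hold the features of the high-precision estimated pixels within radius R of the correction pixel that were assigned to layer d.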

[Likelihood correction]
Next, for each correction target pixel, the likelihood is corrected using the depth information of the subject to which the pixel belongs (step S6). For the likelihood L_p(d) of the correction target pixel p obtained by the stereo matching method, the corrected likelihood L'_p(d) is expressed by the following equation (18).

Here, w (0 < w < 1) is a parameter that adjusts the relative weight given to the likelihood calculated by the stereo matching method and to the output of the depth estimation function. A larger w places more weight on the stereo-matching likelihood. The value of w is determined experimentally.
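A simple form consistent with this description is a convex combination of the two likelihoods, L'_p(d) = w · L_p(d) + (1 − w) · F(d). This form is an assumption, since the body of equation (18) does not survive in this text; the sketch only illustrates how a larger w favors the stereo-matching result.

```python
import numpy as np

def correct_likelihood(l_stereo, f_feature, w=0.5):
    """Blend the stereo-matching likelihood with the feature-based likelihood.

    l_stereo:  (D,) likelihood L_p(d) from stereo matching.
    f_feature: (D,) likelihood F(d) from the depth estimation function.
    w:         weight of the stereo-matching likelihood, 0 < w < 1.
    """
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    l_stereo = np.asarray(l_stereo, dtype=np.float64)
    f_feature = np.asarray(f_feature, dtype=np.float64)
    # If both inputs sum to 1, the blend also sums to 1.
    return w * l_stereo + (1.0 - w) * f_feature
```

For a textureless pixel whose stereo likelihood is nearly flat, the feature-based term then decides which depth layer has the highest corrected likelihood.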

[Re-estimation of image depth]
Next, the depth is re-estimated by solving the minimization problem for the energy function defined by the per-pixel likelihood and the smoothing term (step S7). That is, the depth is re-estimated by substituting the corrected likelihood into equation (7).

[Image synthesis at the virtual viewpoint position]
Next, the image composition unit 108 selects cameras close to the virtual viewpoint position and synthesizes an image by the 3D warping method from the N selected camera images and the estimated depth information (step S8). When blending colors, a weighted average is taken according to how close each camera is to the virtual viewpoint and how strong the likelihood of the estimated depth is.

FIG. 8 is a conceptual diagram for explaining image synthesis at the virtual viewpoint position. The 3D warping method determines the color I_v(m_v) of pixel m_v in the image of the camera C_v at the virtual viewpoint position on the basis of the multi-viewpoint images and their depths (depth maps). FIG. 8 shows an example with two cameras. Any cameras within a suitable distance of the virtual viewpoint may be selected, so more than two cameras may also be used.

Let the internal and external parameters of cameras C_1 and C_2 be A_1, R_1, T_1 and A_2, R_2, T_2, respectively, and let the depths of their images be D_1 and D_2. The color of point M is then projected from each of cameras C_1 and C_2 to the virtual viewpoint camera C_v by equation (3). With the internal parameters of the virtual viewpoint camera denoted by A_v and its external parameters by R_v and T_v, the projected positions are obtained as follows.

Here, m̃_v^1 and m̃_v^2 (m with a tilde) are the extended vectors of the positions obtained when pixels m_1 and m_2 of cameras C_1 and C_2 are projected by equation (3).

The color I(m_v) of pixel m_v in the virtual viewpoint image is obtained as a weighted average based on the ratio of the distances between the virtual viewpoint and cameras C_1 and C_2 and on the likelihoods of the depths of pixels m_1 and m_2. Let the ratio of the distances from the virtual viewpoint to cameras C_1 and C_2 be α : (1 − α) (0 < α < 1), and let the likelihoods be L(Dm_1) : L(Dm_2). The color is then given as follows.

Here, L(Dm_1) and L(Dm_2) are the likelihoods calculated during depth estimation for pixels m_1 and m_2 of the images of cameras C_1 and C_2. Although the weights w_1 and w_2 are obtained here by adding the distance ratio and the likelihood ratio, they may instead be obtained from only one of the two or by multiplying the ratios.
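The additive weighting described above could be realized as in the following sketch. The concrete weights, w_1 = (1 − α) + L_1 / (L_1 + L_2) and w_2 = α + L_2 / (L_1 + L_2), are assumptions chosen so that the closer camera and the more confident depth estimate both receive a larger weight; the body of equation (21) is not reproduced in this text, and the function name is illustrative.

```python
import numpy as np

def blend_two_views(color_1, color_2, alpha, like_1, like_2):
    """Weighted average of the colors warped from cameras C_1 and C_2.

    alpha:          distance share of C_1, i.e. the distances from the virtual
                    viewpoint to C_1 and C_2 are in the ratio alpha : (1 - alpha).
    like_1, like_2: depth likelihoods of the contributing pixels m_1 and m_2.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    total = like_1 + like_2
    share_1 = like_1 / total if total > 0 else 0.5
    # Assumed additive combination: proximity term plus likelihood term.
    w_1 = (1.0 - alpha) + share_1
    w_2 = alpha + (1.0 - share_1)
    c_1 = np.asarray(color_1, dtype=np.float64)
    c_2 = np.asarray(color_2, dtype=np.float64)
    return (w_1 * c_1 + w_2 * c_2) / (w_1 + w_2)
```

Replacing the sums in `w_1` and `w_2` with products would give the multiplicative variant mentioned in the text.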

FIG. 9A and FIG. 9B are conceptual diagrams for explaining the 3D warping method according to this embodiment. When pixels are projected according to their depths by equation (4), different points P and Q may lie on a single straight line as seen from the virtual viewpoint camera C_v, as shown in FIG. 9. In that case, of the points P and Q, the point P with the smaller depth in the coordinate system of camera C_v is the one visible from the virtual camera C_v. For example, for a point P visible from camera C_1 and a point Q visible from camera C_2 whose depths in the coordinate system of camera C_v are D_v(P) and D_v(Q), respectively, if (D_v(Q) − D_v(P)) > δ, then the following holds.

Here, δ is a threshold parameter determined in advance by preliminary experiments. When the difference is at or below the threshold δ, the colors are mixed according to equation (21).
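The visibility rule can be sketched as a depth test with tolerance δ. The sketch assumes that, when one point is clearly in front, its color is used unchanged, and it also applies the rule symmetrically when Q is in front of P, which the text does not state explicitly. The fixed blend weight `w_p` is a simplification standing in for the weighted mixing of equation (21).

```python
import numpy as np

def resolve_color(color_p, depth_p, color_q, depth_q, delta, w_p=0.5):
    """Pick or mix the colors of two points that project to the same pixel.

    depth_p, depth_q: depths D_v(P), D_v(Q) in the virtual camera's coordinates.
    delta:            depth tolerance below which both points are mixed.
    """
    color_p = np.asarray(color_p, dtype=np.float64)
    color_q = np.asarray(color_q, dtype=np.float64)
    if depth_q - depth_p > delta:
        return color_p          # P is clearly in front and hides Q
    if depth_p - depth_q > delta:
        return color_q          # symmetric case (assumption): Q hides P
    # Depths agree within delta: treat both as the same surface and mix.
    return w_p * color_p + (1.0 - w_p) * color_q
```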

Next, the effects of the virtual viewpoint image composition device 100 according to the embodiment of the present invention are described.

In one conventional method, regions (pixels) that are difficult to match are corrected using the depth information of the pixels in the same segment. Another conventional method assumes that the subject depth is binary, that is, foreground or background, and corrects a hard-to-match pixel using the depth information of the subject (foreground or background) whose color is similar to that of the pixel.

The former method, however, cannot correct properly unless the depth estimation accuracy is high for most of the pixels in the same segment. That is, when regions with little texture or regions affected by occlusion are extensive, the depth estimation error may become large. The method also presupposes that the same subject falls into the same segment, but segmenting an image with high accuracy is difficult.

The latter method presupposes that the subject is in either the foreground or the background, that is, that the depth is approximated with two levels. In virtual viewpoint image synthesis, however, the depth is multi-valued, so the method is difficult to apply. The latter method also separates the background from the foreground on the basis of color information, so the separation becomes difficult when the foreground and the background contain similar colors.

In contrast, the virtual viewpoint image composition device 100 described above can suppress depth estimation errors even when matching between images is difficult, and can therefore synthesize a high-quality virtual viewpoint image even in such cases. This prevents artifacts from appearing in parts of the subject (face, feet, hands, etc.) and improves the quality of the synthesized image.

Cases where matching between images is difficult include, for example, cases where regions with little texture are extensive and cases where regions affected by occlusion are extensive. Matching between images is also difficult when another subject with a similar color lies near the boundary of the subject. An artifact in a part of the subject refers to, for example, an image in which a portion of the part is missing or in which a portion of the part has been enlarged or reduced.

<Modification>
The process for selecting the correction target pixels need not be limited to the one described above. For example, a pixel of interest may be selected as a correction target pixel when there is little texture around it, when there is a repetitive texture around it, or when its surroundings are affected by occlusion. Whether there is little texture can be determined, for example, as follows. First, a Sobel filter (a differential filter for luminance values in the horizontal and vertical directions) is applied to the image of interest. The filtered value of each pixel is then used as its edge strength, and whether the texture is rich or poor is determined on the basis of the edge strength.
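The Sobel-based texture check can be sketched with plain NumPy as follows. The 3 × 3 Sobel kernels are standard; combining the horizontal and vertical responses into a gradient magnitude, the edge-replicating border, and the per-pixel `threshold` value are assumptions made for illustration.

```python
import numpy as np

def sobel_edge_strength(gray):
    """Gradient magnitude from 3x3 Sobel filters (edge-replicated border)."""
    g = np.pad(np.asarray(gray, dtype=np.float64), 1, mode="edge")
    # Horizontal derivative: right column minus left column, weights 1-2-1.
    gx = (g[:-2, 2:] + 2.0 * g[1:-1, 2:] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2.0 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical derivative: bottom row minus top row, weights 1-2-1.
    gy = (g[2:, :-2] + 2.0 * g[2:, 1:-1] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2.0 * g[:-2, 1:-1] + g[:-2, 2:])
    return np.hypot(gx, gy)

def low_texture_mask(gray, threshold=10.0):
    """True where the edge strength falls below a (hypothetical) threshold."""
    return sobel_edge_strength(gray) < threshold
```

Pixels flagged by `low_texture_mask` would be candidates for correction target pixels under this modification.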

The embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration, however, is not limited to this embodiment and also includes designs and the like within a scope that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS 100 ... virtual viewpoint image composition device, 101 ... subject photographing unit, 102 ... camera image acquisition unit, 103 ... camera parameter input unit, 104 ... depth estimation unit, 105 ... virtual viewpoint position input unit, 106 ... virtual viewpoint position determination unit, 107 ... image data storage unit, 107a ... camera image / camera parameter storage unit, 107b ... depth storage unit, 107c ... composite image storage unit, 108 ... image composition unit, 109 ... composite image output unit, 110 ... composite image display unit

Claims

An image processing method for synthesizing an image of a subject viewed from an arbitrary virtual viewpoint position based on multi-viewpoint images obtained by photographing the subject from a plurality of different viewpoints, the method comprising:
a first step of calculating, by a stereo matching method, a likelihood for the depth of each pixel of the multi-viewpoint images;
a second step of estimating the depth of each pixel based on the likelihood obtained in the first step;
a third step of calculating an estimation function for estimating the likelihood for a depth from image features, using the depth estimation results of high-precision estimation pixels that satisfy a condition under which the depth estimation accuracy is presumed to be high;
a fourth step of correcting, using the estimation function calculated in the third step, the likelihood of correction target pixels that satisfy a condition under which the depth estimation accuracy is presumed to be low;
a fifth step of re-estimating the depth of the entire image using the likelihood corrected in the fourth step; and
a sixth step of synthesizing the image of the subject corresponding to the virtual viewpoint position based on the depth re-estimated in the fifth step and the multi-viewpoint images.

An image processing apparatus that synthesizes an image of a subject viewed from an arbitrary virtual viewpoint position based on multi-viewpoint images obtained by photographing the subject from a plurality of different viewpoints, the apparatus comprising:
a likelihood calculation unit that calculates, by a stereo matching method, a likelihood for the depth of each pixel of the multi-viewpoint images;
a depth estimation unit that estimates the depth of each pixel based on the likelihood obtained by the likelihood calculation unit;
a likelihood estimation function calculation unit that calculates an estimation function for estimating the likelihood for a depth from image features, using the depth estimation results of high-precision estimation pixels that satisfy a condition under which the depth estimation accuracy is presumed to be high;
a likelihood correction unit that corrects, using the estimation function calculated by the likelihood estimation function calculation unit, the likelihood of correction target pixels that satisfy a condition under which the depth estimation accuracy is presumed to be low;
a depth re-estimation unit that re-estimates the depth of the entire image using the likelihood corrected by the likelihood correction unit; and
an image composition unit that synthesizes the image of the subject corresponding to the virtual viewpoint position based on the depth re-estimated by the depth re-estimation unit and the multi-viewpoint images.

A computer program for causing a computer of an image processing apparatus, which synthesizes an image of a subject viewed from an arbitrary virtual viewpoint position based on multi-viewpoint images obtained by photographing the subject from a plurality of different viewpoints, to execute:
a likelihood calculation step of calculating, by a stereo matching method, a likelihood for the depth of each pixel of the multi-viewpoint images;
a depth estimation step of estimating the depth of each pixel based on the likelihood obtained in the likelihood calculation step;
a likelihood estimation function calculation step of calculating an estimation function for estimating the likelihood for a depth from image features, using the depth estimation results of high-precision estimation pixels that satisfy a condition under which the depth estimation accuracy is presumed to be high;
a likelihood correction step of correcting, using the estimation function calculated in the likelihood estimation function calculation step, the likelihood of correction target pixels that satisfy a condition under which the depth estimation accuracy is presumed to be low;
a depth re-estimation step of re-estimating the depth of the entire image using the likelihood corrected in the likelihood correction step; and
an image synthesis step of synthesizing the image of the subject corresponding to the virtual viewpoint position based on the depth re-estimated in the depth re-estimation step and the multi-viewpoint images.