1 Introduction

New camera designs—and new types of imaging sensors—have been instrumental in driving the field of computer vision in exciting new directions. In the last decade alone, time-of-flight cameras [1, 2] have been widely adopted for vision [3] and computational photography tasks [4,5,6,7]; event cameras [8] that support asynchronous imaging have led to new vision techniques for high-speed motion analysis [9] and 3D scanning [10]; high-resolution sensors with dual-pixel [11] and assorted-pixel [12] designs are defining the state of the art for smartphone cameras; and sensors with pixel-wise coded-exposure capabilities are starting to appear [13, 14] for compressed sensing applications [15].

Against this backdrop, we introduce a new type of computational video camera to the vision community—the coded two-bucket (C2B) camera (Fig. 1). The C2B camera is a pixel-wise coded-exposure camera that never blocks the incident light. Instead, each pixel in its sensor contains two charge-collection sites—two “buckets”—as well as a one-bit writeable memory that controls which bucket is active. The camera outputs two images per video frame—one per bucket—and performs exposure coding by rapidly controlling the active bucket of each pixel, via a programmable sequence of binary 2D patterns. Key to this unique functionality is a novel programmable CMOS sensor that we designed from the ground up, fabricated in a CMOS image sensor (CIS) process technology [16] for the first time, and turned into a working camera system.

Fig. 1. The C2B camera. Left: Our prototype’s sensor outputs video at 20 frames per second and consists of two arrays: a \(244 \times 160\)-pixel array that supports relatively slow bucket control (up to 4 sub-frames per frame) and a \(35 \times 48\) array with much faster control (up to 120 sub-frames per frame). Right: Each frame is divided into sub-frames during which the pixel’s SRAM memory remains unchanged. A user-specified sequence of 2D binary patterns determines the SRAM’s value at each pixel and sub-frame. Note that the two buckets of a pixel are never in the same state (i.e., both active or both inactive) as this would degrade imaging performance—see [32] for a discussion of this and other related CMOS design issues. The light-generated charges of both buckets are read, digitized and cleared only once, at the end of each frame.

The light efficiency and electronic per-pixel coding capabilities of C2B cameras open up a range of applications that go well beyond what is possible today. This potentially includes compressive acquisition of high-speed video [17] with optimal light efficiency; simultaneous acquisition of both epipolar-only [18] and non-epipolar video streams; fully-electronic acquisition of high-dynamic-range AC-flicker videos [19]; conferring EpiScan3D-like functionality [20] to non-rectified imaging systems; and performing many other coded-exposure imaging tasks [15, 21, 22] with a compact camera platform.

Our focus in this first paper, however, is to highlight the novel capabilities of C2B cameras for live dense one-shot 3D reconstruction: we show that from just one grayscale C2B video frame of a dynamic scene under active illumination, it is possible to reconstruct the scene’s 3D snapshot (i.e., per-pixel disparity or normal, plus albedo) at a resolution comparable to the sensor’s pixel array. We argue that C2B cameras allow us to reduce this very difficult 3D reconstruction problem  [23,24,25,26,27,28] to the potentially much easier 2D problems of image demosaicing [29, 30] and illumination multiplexing [31].

Fig. 2. Dense one-shot reconstruction with C2B cameras. The procedure runs in real time and is illustrated for structured-light triangulation. Please zoom in to the electronic copy to see individual pixels of the C2B frame and refer to the listed sections for notation and details. Photometric stereo is performed in an analogous way, by replacing the structured-light projector with a set of directional light sources (a reconstruction example is shown in the lower right).

In particular, we show that C2B cameras can acquire—in one frame—images of a scene under several linearly-independent illuminations, multiplexed across the buckets of neighboring pixels. We call such a frame a two-bucket illumination mosaic. In this setting, reconstruction at full sensor resolution involves four steps (Fig. 2): (1) control bucket activities and light sources to pack distinct low-resolution images of the scene into one C2B frame (half of them in each bucket); (2) upsample these images to full resolution by demosaicing; (3) demultiplex all the upsampled images jointly, to obtain a set of linearly-independent full-resolution images; and (4) use these images to solve for shape and albedo at each pixel independently. We demonstrate the effectiveness of this procedure by recovering dense 3D shape and albedo from one shot with two of the oldest and simplest active 3D reconstruction algorithms available—multi-pattern cosine phase shifting [33, 34] and photometric stereo [35].

Fig. 3. Comparison of basic sensor abilities. Coded-exposure sensors can rapidly mask individual pixels but cannot collect all the incident light; continuous-wave ToF sensors always collect all the incident light but they cannot mask pixels individually; C2B sensors can do both. The column vectors \({\mathbf {c}}^{}_{fs}\) and \({\mathbf {\overline{c}}}^{}_{fs}\) denote bucket-1 masks/activities and their binary complements, respectively.

From a hardware perspective, we build on previous attempts to fabricate sensors with C2B-like functionality [36,37,38], which did not rely on a CMOS image sensor process technology. More broadly, our prototype can be thought of as generalizing three families of sensors. Programmable coded-exposure sensors [13] allow individual pixels to be “masked” for brief periods during the exposure of a video frame (Fig. 3, left). Just like the C2B sensor, they have a writeable one-bit memory inside each pixel to control masking, but their pixels lack a second bucket, so light falling onto “masked” pixels is lost. Continuous-wave time-of-flight sensors [1, 2] can be thought of as having complementary functionality to coded-exposure sensors: their pixels have two buckets whose activity can be toggled programmatically (so no light is lost), but they have no in-pixel writeable memory. As such, the active bucket is constrained to be the same for all pixels (Fig. 3, middle). This makes programmable per-pixel coding—and acquisition of illumination mosaics in particular—impossible without specialized optics (e.g., [17]). Multi-bucket (a.k.a. “multi-tap”) sensors [39,40,41,42] have more than two buckets in each pixel, but they have no writeable memory either, so per-pixel coding is not possible. In theory, a sensor with \(K\) buckets per pixel would be uniquely suited for dense one-shot reconstruction because it can acquire, in each frame, \(K\) full-resolution images corresponding to any set of \(K\) illuminations [43]. In practice, however, C2B sensors have several advantages: they are scalable, because they can pack any number of linearly-independent images into one frame without hard-wiring that number into the pixel’s CMOS design; they are much more light efficient, because each extra bucket significantly reduces the pixel’s photo-sensitive region for a given pixel size; and they have a broader range of applications because they enable per-pixel coding. To our knowledge, 2D sensors with more than four buckets have not been fabricated in a standard CMOS image sensor process, and it is unclear whether they could offer acceptable imaging performance.

On the conceptual side, our contributions are the following: (1) we put forth a general model for the C2B camera that opens up new directions for coded-exposure imaging with active sources; (2) we formulate its control as a novel multiplexing problem [31, 44,45,46,47,48,49] in the bucket and pixel domains; (3) we draw a connection between two-bucket imaging and algorithms that operate directly on intensity ratios [50]; and (4) we provide an algorithm-independent framework for dense one-shot reconstruction that is simpler than earlier attempts [18] and is compatible with standard image processing pipelines.

Last but not least, we demonstrate all the above experimentally, on the first fully-operational C2B camera prototype.

2 Coded Two-Bucket Imaging

We begin by introducing an image formation model for C2B cameras. We consider the most general setting in this section, where a whole sequence of C2B frames may be acquired instead of just one.

C2B cameras output two images per video frame—one for each bucket (Fig. 2). We refer to these images as the bucket-\(1\) image and bucket-\(0\) image.

Fig. 4. (a) Structure of the code tensor \(\mathsf {C}\). (b) Image formation model for pixel \(p\). We show the transport vector \({\mathbf {t}}^{p}\) for a structured-light setting, where one element of the vector holds the contribution of ambient illumination, the corresponding projector pixel is \(l\), and the scene point’s albedo is \({\mathbf {t}}^{p}[{l}] = a\).

The Code Tensor. Programming a C2B camera amounts to specifying the time-varying contents of its pixels’ memories at two different timescales: (1) at the scale of sub-frames within a video frame, which correspond to the updates of the in-pixel memories (Fig. 1, right), and (2) at the scale of frames within a video sequence. For a video sequence with \(F\) frames, acquired by a camera that has \(P\) pixels and supports \(S\) sub-frames per frame, bucket activities can be represented as a three-dimensional binary tensor \(\mathsf {C}\) indexed by pixel, frame and sub-frame. We call \(\mathsf {C}\) the code tensor (Fig. 4a).

We use two specific 2D “slices” of the code tensor in our analysis below, and have special notation for them. For a specific pixel \(p\), slice \({\mathbf {C}}^{p}\) describes the activity of pixel \(p\)’s buckets across all frames and sub-frames. Similarly, for a specific frame \(f\), slice \({\mathbf {C}}^{}_{f}\) describes the bucket activity of all pixels across all sub-frames of \(f\):

$$\begin{aligned} {\mathbf {C}}^{p}~\overset{\text {def}}{=}~\begin{bmatrix} {\mathbf {c}}^{p}_{1} \\ \vdots \\ {\mathbf {c}}^{p}_{F} \end{bmatrix} \ \ , \qquad {\mathbf {C}}^{}_{f}~\overset{\text {def}}{=}~\begin{bmatrix} {\mathbf {c}}^{}_{f1}&\cdots&{\mathbf {c}}^{}_{fS} \end{bmatrix} \ \ , \end{aligned}$$
(1)

where \({\mathbf {c}}^{p}_{f}\) is an \(S\)-dimensional row vector that specifies the active bucket of pixel \(p\) in the \(S\) sub-frames of frame \(f\); and \({\mathbf {c}}^{}_{fs}\) is a \(P\)-dimensional column vector that specifies the active bucket of all \(P\) pixels in sub-frame \(s\) of frame \(f\).
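For concreteness, the following NumPy sketch (ours, not the authors’ code) stores a toy code tensor as an array indexed as (pixel, frame, sub-frame)—an ordering chosen only for illustration—and extracts the two slices defined in Eq. (1):

```python
import numpy as np

P, F, S = 6, 3, 4                        # pixels, frames, sub-frames (toy sizes)
rng = np.random.default_rng(0)
C = rng.integers(0, 2, size=(P, F, S))   # code tensor: C[p, f, s] = bucket-1 activity

def pixel_slice(C, p):
    """Slice C^p: an F x S matrix whose rows are the vectors c^p_f."""
    return C[p]

def frame_slice(C, f):
    """Slice C_f: a P x S matrix whose columns are the vectors c_{fs}."""
    return C[:, f, :]

print(pixel_slice(C, 0).shape, frame_slice(C, 1).shape)   # (3, 4) (6, 4)
```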

The Illumination Matrix. Although C2B cameras can be used for passive imaging applications [15], we model the case where illumination is programmable at sub-frame timescales too. In particular, we represent the scene’s time-varying illumination condition as an illumination matrix \({\mathbf {L}}\) that applies to all frames:

$$\begin{aligned} {\mathbf {L}}~\overset{\text {def}}{=}~\begin{bmatrix} {\mathbf {l}}_{1} \\ \vdots \\ {\mathbf {l}}_{S} \end{bmatrix} \ \ , \end{aligned}$$
(2)

where row vector \({\mathbf {l}}_{s}\) denotes the scene’s illumination condition in sub-frame \(s\) of every frame. We consider two types of scene illumination in this work: a set of directional light sources whose intensity is given by vector \({\mathbf {l}}_{s}\); and a projector that projects a pattern specified by the leading elements of \({\mathbf {l}}_{s}\) in the presence of ambient light, which we treat as one extra source that is “always on” (i.e., its corresponding element of \({\mathbf {l}}_{s}\) is equal to 1 for all \(s\)).

Two-Bucket Image Formation Model for Pixel \(p\). Let \({\mathbf {i}}^{p}_{}\) and \({\mathbf {\hat{i}}}^{p}_{}\) be column vectors holding the intensities of pixel \(p\)’s bucket 1 and bucket 0, respectively, in all \(F\) frames. We model these intensities as the result of light transport from the light sources to the pixel’s two buckets (Fig. 4b):

$$\begin{aligned} \begin{bmatrix} {\mathbf {i}}^{p}_{} \\ {\mathbf {\hat{i}}}^{p}_{} \end{bmatrix} ~=~ \begin{bmatrix} {\mathbf {C}}^{p} \\ \overline{{\mathbf {C}}^{p}} \end{bmatrix} {\mathbf {L}}\,{\mathbf {t}}^{p} \ \ , \end{aligned}$$
(3)

where \(\overline{{\mathbf {b}}}\) denotes the binary complement of a matrix or vector \({\mathbf {b}}\), \({\mathbf {C}}^{p}\) is the slice of the code tensor corresponding to \(p\), and \({\mathbf {t}}^{p}\) is the pixel’s transport vector. Element \({\mathbf {t}}^{p}[{l}]\) of this vector describes the transport of light from source \(l\) to pixel \(p\) in the timespan of one sub-frame, across all light paths.
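The model of Eq. (3) is easy to simulate. The sketch below is our own illustration with toy sizes and assumed variable names (Cp for \({\mathbf {C}}^{p}\), Lmat for \({\mathbf {L}}\), t for \({\mathbf {t}}^{p}\)); it also checks that, together, the two buckets collect all the light arriving during a frame:

```python
import numpy as np

F, S, L = 3, 4, 4                        # frames, sub-frames, light sources (toy sizes)
rng = np.random.default_rng(1)

Cp   = rng.integers(0, 2, size=(F, S)).astype(float)   # C^p: bucket-1 activities of pixel p
Lmat = rng.random((S, L))                               # illumination matrix (one row per sub-frame)
t    = rng.random(L)                                    # transport vector t^p

# Eq. (3): bucket 1 integrates the sub-frames where Cp is 1, bucket 0 the complement.
i_b1 = Cp @ (Lmat @ t)                   # F bucket-1 intensities, one per frame
i_b0 = (1.0 - Cp) @ (Lmat @ t)           # F bucket-0 intensities

# Sanity check: no light is ever blocked, so the buckets sum to the total per-frame exposure.
assert np.allclose(i_b1 + i_b0, np.ones(S) @ (Lmat @ t))
```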

To gain some intuition about Eq. (3), consider the buckets’ intensities in frame \(f\):

$$\begin{aligned} {\mathbf {i}}^{p}_{}[{f}]~=~{\mathbf {c}}^{p}_{f}\,{\mathbf {L}}\,{\mathbf {t}}^{p} \ \ , \qquad {\mathbf {\hat{i}}}^{p}_{}[{f}]~=~{\mathbf {\overline{c}}}^{p}_{f}\,{\mathbf {L}}\,{\mathbf {t}}^{p} \ \ . \end{aligned}$$
(4)

In effect, the two buckets of pixel \(p\) can be thought of as “viewing” the scene under two potentially different illuminations given by the vectors \({\mathbf {c}}^{p}_{f}{\mathbf {L}}\) and \({\mathbf {\overline{c}}}^{p}_{f}{\mathbf {L}}\), respectively. Moreover, if \({\mathbf {c}}^{p}_{f}\) varies from frame to frame, these illumination conditions may vary as well.

Bucket Ratios as Albedo “Quasi-Invariants”. Since the two buckets of pixel \(p\) generally represent different illumination conditions, the two ratios

$$\begin{aligned} r~=~\frac{{\mathbf {i}}^{p}_{}[{f}]}{{\mathbf {i}}^{p}_{}[{f}]+{\mathbf {\hat{i}}}^{p}_{}[{f}]} \ \ , \qquad \hat{r}~=~\frac{{\mathbf {\hat{i}}}^{p}_{}[{f}]}{{\mathbf {i}}^{p}_{}[{f}]+{\mathbf {\hat{i}}}^{p}_{}[{f}]} \ \ , \end{aligned}$$
(5)

defined by \(p\)’s buckets are illumination ratios [50,51,52]. Moreover, we show in [32] that under zero-mean Gaussian image noise, these ratios are well approximated by Gaussian random variables whose means are the ideal (noiseless) ratios and whose standard deviations depend weakly on albedo. In effect, C2B cameras provide two “albedo-invariant” images per frame. We exploit this feature of C2B cameras for both shape recovery and demosaicing in Sects. 3 and 5, respectively.

2.1 Acquiring Two-Bucket Illumination Mosaics

A key feature of C2B cameras is that they offer an important alternative to multi-frame acquisition: instead of capturing \(F\) frames in sequence, they can capture a spatially-multiplexed version of them in a single C2B frame (Fig. 2). We call such a frame a two-bucket illumination mosaic in analogy to the RGB filter mosaics of color image sensors [12, 53, 54]. Unlike filter mosaics, however, which are attached to the sensor and cannot be changed, acquisition of illumination mosaics is programmable for any number of multiplexed frames \(F\).

The Bucket-1 and Bucket-0 Image Sequences. Collecting the two buckets’ intensities in Eq. (4) across all frames and pixels, we define two matrices \({\mathbf {I}}\) and \({\mathbf {\hat{I}}}\) that hold all this data:

$$\begin{aligned} {\mathbf {I}}~\overset{\text {def}}{=}~\begin{bmatrix} {\mathbf {i}}^{1}_{}&\cdots&{\mathbf {i}}^{P}_{} \end{bmatrix} \ \ , \qquad {\mathbf {\hat{I}}}~\overset{\text {def}}{=}~\begin{bmatrix} {\mathbf {\hat{i}}}^{1}_{}&\cdots&{\mathbf {\hat{i}}}^{P}_{} \end{bmatrix} \ \ , \end{aligned}$$
(6)

i.e., column \(p\) of \({\mathbf {I}}\) (respectively \({\mathbf {\hat{I}}}\)) holds the \(F\) bucket-1 (respectively bucket-0) intensities of pixel \(p\).

Code Tensor for Mosaic Acquisition. Formally, a two-bucket illumination mosaic is a spatial sub-sampling of the sequences \({\mathbf {I}}\) and \({\mathbf {\hat{I}}}\) in Eq. (6). Acquiring it amounts to specifying a one-frame code tensor \(\widetilde{\mathsf {C}}\) that spatially multiplexes the corresponding \(F\)-frame tensor in Fig. 4(a). We do this by (1) defining a regular tiling of the sensor plane and (2) specifying a correspondence \(p_i \rightarrow f_i\) between the pixels \(p_i\) in a tile and the \(F\) frames. The per-pixel rows of \(\widetilde{\mathsf {C}}\) are then defined to be

$$\begin{aligned} \widetilde{{\mathbf {c}}}^{p_i}_{1}~\overset{\text {def}}{=}~{\mathbf {c}}^{p_i}_{f_i} \ \ .\end{aligned}$$
(7)

Mosaic Acquisition Example. The C2B frames in Fig. 2 were captured using a \(2\times 2\) pixel tile to spatially multiplex a three-frame code tensor. The tensor assigned identical illumination conditions to all pixels within a frame and different conditions across frames. Pixels within each tile were assigned to individual frames using the correspondence \(\{(1,1) \rightarrow 1,~(1,2) \rightarrow 2,~(2,1) \rightarrow 2,~(2,2) \rightarrow 3\}\).
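To make Eq. (7) concrete, the sketch below (our own illustration, with hypothetical array shapes) builds a one-frame code tensor from a three-frame tensor using the \(2\times 2\) tile correspondence of the example above:

```python
import numpy as np

H, W, F, S = 4, 4, 3, 4                       # toy sensor size, frames, sub-frames
rng = np.random.default_rng(3)
C = rng.integers(0, 2, size=(H, W, F, S))     # F-frame code tensor: one F x S slice per pixel

# Correspondence between positions in a 2x2 tile and the F frames (example above).
tile_to_frame = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2}

# Eq. (7): every pixel of the one-frame tensor copies the code row of its assigned frame.
C_mosaic = np.empty((H, W, 1, S), dtype=C.dtype)
for y in range(H):
    for x in range(W):
        f = tile_to_frame[(y % 2, x % 2)]
        C_mosaic[y, x, 0] = C[y, x, f]        # tilde{c}^{p_i}_1 = c^{p_i}_{f_i}
```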

3 Per-Pixel Estimation of Normals and Disparities

Let us now turn to the problem of normal and disparity estimation using photometric stereo and structured-light triangulation, respectively. We consider the most basic formulation of these tasks, where all computations are done independently at each pixel and the relation between observations and unknowns is expressed as a system of linear equations. These formulations should be treated merely as examples that showcase the special characteristics of two-bucket imaging; as with conventional cameras, using advanced methods to handle more general settings [55, 56] is certainly possible.

Table 1. The two basic multi-image reconstruction techniques considered in this work.

From Bucket Intensities to Demultiplexed Intensities. As a starting point, we expand Eq. (3) to get a relation that involves only intensities:

$$\begin{aligned} \begin{bmatrix} {\mathbf {i}}^{p}_{} \\ {\mathbf {\hat{i}}}^{p}_{} \end{bmatrix} ~=~ \underbrace{\begin{bmatrix} {\mathbf {C}}^{p} \\ \overline{{\mathbf {C}}^{p}} \end{bmatrix}}_{{\mathbf {W}}} \begin{bmatrix} i^{p}_{1} \\ \vdots \\ i^{p}_{S} \end{bmatrix} \ \ , \qquad i^{p}_{s}~\overset{\text {def}}{=}~{\mathbf {l}}_{s}\,{\mathbf {t}}^{p} \ \ . \end{aligned}$$
(8)

Each scalar \(i^{p}_{s}\) in the right-hand side of Eq. (8) is the intensity that a conventional camera pixel would record if the scene’s illumination condition was \({\mathbf {l}}_{s}\). Therefore, Eq. (8) tells us that as far as a single pixel \(p\) is concerned, C2B cameras capture the same measurements a conventional camera would capture for 3D reconstruction—except that these measurements are multiplexed over bucket intensities. To retrieve them, these intensities must be demultiplexed by inverting Eq. (8):

$$\begin{aligned} \begin{bmatrix} i^{p}_{1} \\ \vdots \\ i^{p}_{S} \end{bmatrix} ~=~ ({\mathbf {W}}'{\mathbf {W}})^{-1}{\mathbf {W}}' \begin{bmatrix} {\mathbf {i}}^{p}_{} \\ {\mathbf {\hat{i}}}^{p}_{} \end{bmatrix} \ \ , \end{aligned}$$
(9)

where \({}'\) denotes matrix transpose. This inversion is only possible if \({\mathbf {W}}'{\mathbf {W}}\) is non-singular. Moreover, the signal-to-noise ratio (SNR) of the demultiplexed intensities depends heavily on \(F\) and on the choice of \({\mathbf {W}}\) (Sect. 4). Setting aside this issue for now, we consider below the task of shape recovery from already-demultiplexed intensities. For notational simplicity, we drop the pixel index \(p\) from the equations below.
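For concreteness, the sketch below (ours; the code slice used is just one valid choice, not the optimal matrix of Sect. 4) simulates Eqs. (8) and (9) for a single pixel:

```python
import numpy as np

F, S = 3, 4
rng = np.random.default_rng(4)

# A valid (not necessarily optimal) code slice; the resulting W below has full column rank.
Cp = np.array([[1., 0., 0., 0.],
               [0., 1., 0., 0.],
               [0., 0., 1., 0.]])
W = np.vstack([Cp, 1.0 - Cp])                 # 2F x S bucket-multiplexing matrix of Eq. (8)

i_true = rng.random(S)                        # intensities under the S illuminations
meas = W @ i_true + 0.01 * rng.standard_normal(2 * F)    # noisy bucket measurements

# Eq. (9): least-squares demultiplexing, equivalent to applying (W'W)^{-1} W'.
i_hat, *_ = np.linalg.lstsq(W, meas, rcond=None)
print(np.abs(i_hat - i_true).max())           # small residual error
```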

Per-Pixel Constraints on 3D Shape. The relation between demultiplexed intensities and the pixel’s unknowns takes exactly the same form in both photometric stereo and structured-light triangulation with cosine patterns:

$$\begin{aligned} \begin{bmatrix} i^{}_{1} \\ \vdots \\ i^{}_{S} \end{bmatrix} ~=~ a_{}\,{\mathbf {D}}\,{\mathbf {x}} + {\mathbf {e}} \ \ , \end{aligned}$$
(10)

where \({\mathbf {D}}\) is a known \(S \times 3\) matrix whose rows are the vectors \({\mathbf {d}}_{1},\ldots ,{\mathbf {d}}_{S}\); \({\mathbf {x}}\) is a 3D vector that contains the pixel’s shape unknowns; \(a_{}\) is the unknown albedo; and \({\mathbf {e}}\) is observation noise. See Table 1 for a summary of each problem’s assumptions and for the mapping of problem-specific quantities to Eq. (10).

There are (at least) three ways to turn Eq. (10) into a constraint on normals and disparities under zero-mean Gaussian noise. The resulting constraints are not equivalent when combining measurements from small pixel neighborhoods—as we implicitly do—because they are not equally invariant to spatial albedo variations:

  1. Direct method (DM): treat Eq. (10) as providing independent constraints on the vector \(a_{}{\mathbf {x}}\) and solve for both \(a_{}\) and \({\mathbf {x}}\). The advantage of this approach is that errors are Gaussian by construction; its disadvantage is that Eq. (10) depends on albedo.

  2. Ratio constraint (R): divide individual intensities by their total sum to obtain an illumination ratio, as in Eq. (5). This yields the following constraint on \({\mathbf {x}}\):

     $$\begin{aligned} \left( {\mathbf {d}}_{l} - r_{l}\,{\mathbf {1}}{\mathbf {D}} \right) {\mathbf {x}}~=~0 \ \ , \end{aligned}$$
     (11)

     where \(r_{l} = i^{}_{l}/\sum _{k} i^{}_{k}\) and \({\mathbf {1}}\) is a row vector of all ones. The advantage here is that both \(r_{l}\) and Eq. (11) are approximately invariant to albedo.

  3. Cross-product constraint (CP): instead of computing an explicit ratio from Eq. (10), eliminate \(a_{}\) to obtain

    $$\begin{aligned} i^{}_{l}{\mathbf {d}}_{k}{\mathbf {x}}= i^{}_{k}{\mathbf {d}}_{l}{\mathbf {x}}. \end{aligned}$$
    (12)

    Since Eq. (12) has intensities \(i^{}_{l},i^{}_{k}\) as factors, it does implicitly depend on albedo.

Solving for the Unknowns. Both structured light and photometric stereo require at least three independent constraints per pixel for a unique solution. In the DM method, we use least-squares to solve for \(a_{}{\mathbf {x}}\); when using the R or CP constraints, we apply singular-value decomposition to solve for \({\mathbf {x}}\).
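The sketch below illustrates how the DM and R estimates might be computed at a single pixel, using made-up directional-lighting data in place of the Table 1 quantities; it is our own illustration of the constraints above, not the authors’ implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
S = 4
D = np.array([[ 0.5,  0.0, 0.87],             # known S x 3 matrix (here: light directions)
              [-0.5,  0.0, 0.87],
              [ 0.0,  0.5, 0.87],
              [ 0.0, -0.5, 0.87]])
n_true = np.array([0.2, -0.3, 0.93]); n_true /= np.linalg.norm(n_true)
albedo = 0.7
i = albedo * D @ n_true + 1e-3 * rng.standard_normal(S)   # demultiplexed intensities, Eq. (10)

# Direct method (DM): least squares for a*x, then split into magnitude and direction.
ax, *_ = np.linalg.lstsq(D, i, rcond=None)
a_dm, n_dm = np.linalg.norm(ax), ax / np.linalg.norm(ax)

# Ratio constraint (R): one row (d_l - r_l * 1 D) per intensity, nullspace via SVD (Eq. 11).
r = i / i.sum()
A = D - np.outer(r, np.ones(S) @ D)
n_r = np.linalg.svd(A)[2][-1]                 # right singular vector of smallest singular value
n_r *= np.sign(n_r @ n_true)                  # resolve the sign ambiguity for comparison
print(np.allclose(n_dm, n_true, atol=1e-2), np.allclose(n_r, n_true, atol=1e-2))
```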

4 Code Matrices for Bucket Multiplexing

The previous section gave ways to solve for 3D shape when we have enough independent constraints per pixel. Here we consider the problem of controlling a C2B camera to actually obtain them for a pixel \(p\). In particular, we show how to choose (1) the number of frames \(F\), (2) the number of sub-frames per frame \(S\), and (3) the pixel-specific slice \({\mathbf {C}}^{p}\) of the code tensor, which defines the multiplexing matrix \({\mathbf {W}}\) in Eq. (8).

Determining these parameters can be thought of as an instance of the optimal multiplexing problem [31, 44,45,46,47,48,49]. This problem has been considered in numerous contexts before, as a one-to-one mapping from desired measurements to actual, noisy observations. In the case of coded two-bucket imaging, however, the problem is slightly different because each frame yields two measurements instead of just one.

The results below provide further insight into this particular multiplexing problem (see [32] for proofs). Observation 1 implies that even though a pixel’s two buckets provide \(2F\) measurements in total across \(F\) frames, at most \(F+1\) of them can be independent because the multiplexing matrix \({\mathbf {W}}\) is rank-deficient:

Observation 1

\(\mathrm{rank}\,{\mathbf {W}} \le F + 1\).

Intuitively, a C2B camera should not be thought of as being equivalent to two coded-exposure cameras that operate completely independently. This is because the activities of a pixel’s two buckets are binary complements of each other, and thus not independent.
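Observation 1 is easy to check numerically: the bucket-0 rows of the multiplexing matrix are complements of the bucket-1 rows, so they add at most one new dimension (the all-ones vector) to the row space. A small self-contained check (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(6)
F, S = 3, 6
for _ in range(1000):
    Cp = rng.integers(0, 2, size=(F, S)).astype(float)
    W = np.vstack([Cp, 1.0 - Cp])             # bucket-multiplexing matrix of Eq. (8)
    assert np.linalg.matrix_rank(W) <= F + 1  # Observation 1
```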

Corollary 1

Multiplexing \(S\) intensities requires \(F \ge S - 1\) frames.

Corollary 2

The minimal configuration for fully-constrained reconstruction at a pixel \(p\) is \(F = 2\) frames, \(S = 3\) sub-frames per frame, and three linearly-independent illumination vectors. The next-highest configuration is 3 frames, 4 sub-frames/illumination vectors.

Table 2. Optimal code matrices for small \(S\). Note that the lower bound given by Eq. (13) is attained only for the smallest Hadamard-based construction of Proposition 1.

We now seek the optimal multiplexing matrix \({\mathbf {W}}\), i.e., the matrix that maximizes the SNR of the demultiplexed intensities in Eq. (9). Lemma 1 extends the lower-bound analysis of Ratner et al. [45] to obtain a lower bound on the mean-squared error (MSE) of two-bucket multiplexing [32]:

Lemma 1

For every multiplexing matrix \({\mathbf {W}}\), the MSE of the best unbiased linear estimator satisfies the lower bound

(13)

Although Lemma 1 does not provide an explicit construction, it does ensure the optimality of matrices whose MSEs achieve the lower bound. We used this observation to prove the optimality of matrices derived from the standard Hadamard construction [31]:

Proposition 1

Let \({\mathbf {C}}^{p}\) be the binary \((S-1) \times S\) matrix obtained from the \(S \times S\) Hadamard matrix by removing its row of ones and mapping the remaining entries \(\pm 1\) to \(\{0,1\}\). The bucket-multiplexing matrix \({\mathbf {W}}\) defined by this \({\mathbf {C}}^{p}\) is optimal.

Proposition 1 applies only to values of \(S\) for which an \(S \times S\) Hadamard matrix exists. Since our main goal is one-shot acquisition, optimal matrices for other small values of \(S\) are also of significant interest. To find them, we conducted a brute-force search over the space of small binary matrices \({\mathbf {C}}^{p}\) to find the ones with the lowest MSE. These matrices are shown in Table 2. See Fig. 6(a), (b) and [32] for an initial empirical SNR analysis.
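A search of this kind can be sketched as follows; here we score candidate code slices by \(\mathrm{tr}\,(({\mathbf {W}}'{\mathbf {W}})^{-1})\), which is proportional to the demultiplexing MSE under i.i.d. Gaussian noise and serves only as our stand-in for the paper’s exact objective:

```python
import itertools
import numpy as np

def mse_score(Cp):
    """Proportional to the demultiplexing MSE under i.i.d. Gaussian noise (lower is better)."""
    W = np.vstack([Cp, 1.0 - Cp])
    G = W.T @ W
    if np.linalg.matrix_rank(G) < G.shape[0]:
        return np.inf                         # degenerate code: cannot demultiplex
    return np.trace(np.linalg.inv(G))

F, S = 2, 3                                   # smallest configuration (Corollary 2)
best, best_score = None, np.inf
for bits in itertools.product([0.0, 1.0], repeat=F * S):
    Cp = np.asarray(bits).reshape(F, S)
    score = mse_score(Cp)
    if score < best_score:
        best, best_score = Cp, score
print(best, best_score)
```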

Fig. 5. Processing ratio mosaics. Left to right: intermediate results of the BRD reconstruction procedure of Sect. 5, starting from the raw C2B frame shown in Fig. 2, Step 1. In contrast to the result of Steps 2 and 3 in Fig. 2, the images above are largely unaffected by albedo variations.

5 One-Shot Shape from Two-Bucket Illumination Mosaics

We use three different ways of estimating shape from a two-bucket illumination mosaic:

  1. Intensity demosaicing (ID): treat the intensities in a mosaic tile as separate “imaging dimensions” for the purpose of demosaicing; upsample these intensities to full resolution by applying either an RGB demosaicing algorithm to three of these dimensions at a time or a more general assorted-pixel procedure [12, 54] that takes all of them into account; demultiplex the upsampled images using Eq. (9); and apply any of the estimation methods in Sect. 3 to the result. Figure 2 illustrates this approach, and a sketch of the full pipeline appears after this list.

  2. Bucket-ratio demosaicing (BRD): apply Eq. (5) to each pixel in the mosaic to obtain an albedo-invariant two-bucket ratio mosaic; demosaic and demultiplex the ratio images; and compute 3D shape using the ratio constraint of Sect. 3. See Fig. 5 for an example.

  3. No demosaicing (ND): instead of upsampling, treat each mosaic tile as a “super-pixel” whose unknowns (i.e., normal, disparity, etc.) do not vary within the tile; then compute one shape estimate per tile using any of the methods of Sect. 3.
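The sketch below walks through steps 1–3 of the ID pipeline on a synthetic mosaic (our own toy example: nearest-neighbour upsampling stands in for demosaicing, and the code rows are an arbitrary valid choice rather than the optimal ones):

```python
import numpy as np

rng = np.random.default_rng(7)
H, W, S = 8, 8, 4
tile_to_frame = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2}   # Sect. 2.1 example, F = 3

# Per-frame code rows (a toy choice that keeps the multiplexing matrix invertible).
C = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])

# Piecewise-constant toy scene (constant over each 2x2 tile) under S illuminations.
scene = np.repeat(np.repeat(rng.random((H // 2, W // 2, S)), 2, axis=0), 2, axis=1)

# Simulate the two bucket images of one C2B illumination mosaic.
b1 = np.zeros((H, W)); b0 = np.zeros((H, W))
for y in range(H):
    for x in range(W):
        c = C[tile_to_frame[(y % 2, x % 2)]]
        b1[y, x] = c @ scene[y, x]
        b0[y, x] = (1 - c) @ scene[y, x]

# Steps 1-2: de-interleave by tile position and upsample (nearest neighbour stands in
# for demosaicing here).
def upsample(img, dy, dx):
    return np.repeat(np.repeat(img[dy::2, dx::2], 2, axis=0), 2, axis=1)

channels, rows = [], []
for (dy, dx), f in tile_to_frame.items():
    channels.append(upsample(b1, dy, dx)); rows.append(C[f])
    channels.append(upsample(b0, dy, dx)); rows.append(1 - C[f])
channels = np.stack(channels, axis=-1)        # H x W x 8 upsampled bucket images
Wmat = np.array(rows)                         # 8 x S multiplexing matrix

# Step 3: joint per-pixel demultiplexing (Eq. 9). Step 4 would proceed as in Sect. 3.
demux = channels @ np.linalg.pinv(Wmat).T     # H x W x S full-resolution images
print(np.abs(demux - scene).max())            # ~0 for this piecewise-constant scene
```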

Performance Evaluation of One-Shot Photometric Stereo on Synthetic Data. Figures 6(c) and (d) analyze the effective resolution and albedo invariance of normal maps computed by several combinations of methods from Sects. 3 and 5, plus two more—Baseline, which applies basic photometric stereo to three full-resolution images; and Color, the one-shot color photometric stereo technique in [23]. To generate synthetic data, we (1) generated scenes with random spatially-varying normal maps and RGB albedo maps, (2) applied a spatial low-pass filter to the albedo maps and to the spherical coordinates of the normal maps, (3) rendered them to create three sets of images—a grayscale C2B frame; three full-resolution grayscale images; and a Bayer color mosaic—and (4) added zero-mean Gaussian noise to each pixel, corresponding to a peak SNR of 30 dB. Since all calculations except demosaicing are done per pixel, any frequency-dependent variations in performance must be due to this upsampling step. Our simulation results do match the intuition that performance should degrade for very high normal-map frequencies regardless of the type of neighborhood processing. For spatial frequencies up to 0.3 times the Nyquist limit, however, one-shot C2B imaging confers a substantial performance advantage. A similar evaluation for structured-light triangulation can be found in [32].

Fig. 6. (a) Optimal versus sub-optimal multiplexing. We applied bucket multiplexing to the scene shown in (b) and empirically measured the average SNR of the demultiplexed images when (1) the code matrix is the optimal one from Table 2 and (2) the code matrix is replaced by a non-degenerate but sub-optimal alternative built from the identity matrix (sub-optimal according to its MSE from Eq. (13)). The ratios of these SNRs are shown in blue, suggesting that SNR gains are possible. (b) One of the demultiplexed images obtained with each code matrix. The optimal matrix yielded visibly less noisy images (please zoom in to the electronic copy). (c) Angular root-mean-squared error (RMSE) of normal estimates as a function of the normal map’s highest spatial frequency. Frequency 1.0 corresponds to the Nyquist limit. The highest spatial frequency of the albedos was set to 0.3 times the Nyquist limit. (d) Angular error as a function of the spatial frequency of the albedo map, with the maximum spatial frequency of the normal map set to 0.3 times the Nyquist limit. Line colors are as indicated in (c). (Color figure online)

6 Live 3D Imaging with a C2B Camera

Experimental Conditions. Both C2B frame acquisition and scene reconstruction run at 20 Hz for all experiments, using \(S = 4\) sub-frames per frame, the corresponding optimal code matrix from Table 2, and the \(2\times 2\) mosaic tile defined in Sect. 2.1. C2B frames are always processed by the same sequence of steps—demosaicing, demultiplexing and per-pixel reconstruction. For structured light, we fit an 8 mm Schneider Cinegon f/1.4 lens to our camera with its aperture set to f/2, and use a TI LightCrafter to project \(684 \times 608\)-pixel, 24-gray-level patterns in sync with the sub-frames. The stereo baseline was approximately 20 cm, the scene was 1.1–1.5 m away, and the cosine frequency was 5 for all patterns and experiments. For photometric stereo we switch to a 23 mm Schneider APO-Xenoplan f/1.4 lens to approximate orthographic imaging conditions, and illuminate a scene 2–3 m away with four sub-frame-synchronized Luxdrive 7040 Endor Star LEDs, fitted with 26.5 mm Carclo Technical Plastics lenses.

Fig. 7. Quantitative experiments for photometric stereo (Row 1) and structured light (Rows 2, 3). Per-pixel unit normals \({\mathbf {n}}_{}\) are visualized by assigning to them the RGB color vector \(0.5{\mathbf {n}}_{}+0.5\).

Fig. 8. Live 3D acquisition experiments for photometric stereo (top) and structured light (bottom). Scenes were chosen to exhibit significant albedo, color, normal and/or depth variations, as well as discontinuities. For reference, color photos of these scenes are shown as insets in Column 1. Qualitatively, the reconstructions appear consistent with the scenes’ actual 3D geometry except in regions of low albedo (e.g., hair) or cast shadows. (Color figure online)

Quantitative Experiments. Our goal was to compare the 3D accuracy of one-shot C2B imaging against that of full-resolution sequential imaging—using the exact same system and algorithms. Figure 7 shows the static scenes used for these experiments, along with example reconstructions for photometric stereo and structured light, respectively. The “ground truth,” which served as our reference, was computed by averaging 1000 sequentially-captured, bucket-1 images per illumination condition and applying the same reconstruction algorithm to the lower-noise, averaged images. To further distinguish the impact of demosaicing from that of sensor-specific non-idealities, we also compute shape from a simulated C2B frame; to create it we spatially multiplex the averaged images computationally in a way that simulates the operation of our C2B sensor. Row 3 of Fig. 7 shows some of these comparisons for structured light. The BRD-R method, coupled with OpenCV’s demosaicing algorithm, yields the best performance in this case, corresponding to a disparity error of \(4\%\). See [32] for more details and additional results.

Reconstructing Dynamic Scenes. Figure 8 shows several examples.

7 Concluding Remarks

Our experiments relied on some of the very first images from a C2B sensor. Issues such as fixed-pattern noise; slight variations in gain across buckets and across pixels; and other minor non-idealities do still exist. Nevertheless, we believe that our preliminary results support the claim that 3D data are acquired at near-sensor resolution.

We intentionally used raw, unprocessed intensities and the simplest possible approaches for demosaicing and reconstruction. There is no doubt that denoised images and more advanced reconstruction algorithms could improve reconstruction performance considerably. Our use of generic RGB demosaicing software is also clearly sub-optimal, as their algorithms do not take into account the actual correlations that exist across C2B pixels. A prudent approach would be to train an assorted-pixel algorithm on precisely such data.

Last but certainly not least, we are particularly excited about C2B cameras sparking new vision techniques that take full advantage of their advanced imaging capabilities.