Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Yilun Du¹, Cameron Smith¹, Ayush Tewari¹†, Vincent Sitzmann¹†

¹ MIT † indicates equal advising.

CVPR 2023

Abstract

We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and due to the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing both rendering time and memory footprint. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
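The image-space epipolar line sampling mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `epipolar_samples`, the `near`/`far` depth bounds, the shared intrinsics `K`, and the pose conventions (camera-to-world for the target, world-to-camera for the context view) are all assumptions made for illustration. The idea shown is to project the near and far ends of a target ray into a context image and then sample uniformly along the resulting 2D segment rather than along the 3D ray.

```python
import numpy as np

def epipolar_samples(pixel, K, target_c2w, context_w2c,
                     near=0.5, far=100.0, n_samples=8):
    """Sample pixel locations in a context image along the epipolar line
    of one target pixel. All names and defaults here are hypothetical.

    pixel:        (u, v) coordinates in the target image
    K:            3x3 intrinsics, assumed shared by both cameras
    target_c2w:   4x4 camera-to-world pose of the target view
    context_w2c:  4x4 world-to-camera pose of the context view
    """
    # Back-project the target pixel to a ray in world coordinates.
    dir_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    origin = target_c2w[:3, 3]
    direction = target_c2w[:3, :3] @ dir_cam

    # Take the two ends of the ray segment (assumed to lie in front of
    # the context camera) and move them into the context camera frame.
    endpoints = np.stack([origin + near * direction,
                          origin + far * direction])
    pts_cam = (context_w2c[:3, :3] @ endpoints.T).T + context_w2c[:3, 3]

    # Perspective projection of both endpoints into the context image.
    proj = (K @ pts_cam.T).T
    uv = proj[:, :2] / proj[:, 2:3]

    # Sample uniformly in image space between the projected endpoints.
    w = np.linspace(0.0, 1.0, n_samples)[:, None]
    return (1.0 - w) * uv[0] + w * uv[1]
```

Features would then be read from the context image at these locations (for example by bilinear interpolation) and passed to the renderer. Sampling in image space keeps the number of samples tied to image resolution rather than to scene depth range.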

Results

Below, we illustrate novel view rendering results of our approach from different wide-baseline image pairs. Our approach consistently renders novel views across these wide-baseline pairs.

Indoor Scene Baseline Comparisons

Next, we provide baseline comparisons of our approach with IBRNet, pixelNeRF, and GPNR on different challenging indoor scenes in RealEstate10K.

Outdoor Scene Baseline Comparisons

Next, we provide baseline comparisons of our approach with IBRNet, pixelNeRF, and GPNR on different outdoor scenes in ACID.

Novel View Synthesis from Unposed Images

Our approach can also render novel views from unposed images from the internet. We use 2D correspondences inferred by SuperGlue to obtain relative pose estimates between images.
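The page does not say which solver turns the 2D correspondences into a relative pose. One common route, sketched here purely as an assumption, is to estimate the essential matrix with the eight-point algorithm and decompose it into a rotation and a translation direction afterwards. The function name `essential_from_matches` is hypothetical, and the inputs are assumed to be matched points already converted to normalized camera coordinates (pixel coordinates multiplied by the inverse intrinsics).

```python
import numpy as np

def essential_from_matches(x1, x2):
    """Eight-point estimate of an essential matrix E (hypothetical helper).

    x1, x2: (N, 2) arrays of matched points in normalized camera
            coordinates, N >= 8, satisfying x2h^T E x1h = 0.
    Returns a 3x3 matrix defined up to scale.
    """
    n = len(x1)
    x1h = np.hstack([x1, np.ones((n, 1))])
    x2h = np.hstack([x2, np.ones((n, 1))])

    # Each match contributes one linear constraint on the nine entries
    # of E (row-major): sum_ij x2h[i] * E[i, j] * x1h[j] = 0.
    A = np.einsum("ni,nj->nij", x2h, x1h).reshape(n, 9)

    # The least-squares solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    E = vt[-1].reshape(3, 3)

    # Enforce the essential-matrix structure: two equal singular values
    # and one zero singular value.
    u, s, vt = np.linalg.svd(E)
    sigma = 0.5 * (s[0] + s[1])
    return u @ np.diag([sigma, sigma, 0.0]) @ vt
```

In practice, learned matches contain outliers, so an estimate like this is usually wrapped in a robust scheme such as RANSAC. The recovered translation has no absolute scale, which is inherent to two-view pose estimation.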

Limitations

In this paper, we've presented a new approach to render novel views from wide-baseline stereo pairs. While our approach outperforms the existing state of the art, this is a very challenging problem, and there are many test scenes in which our approach either fails to adequately estimate the depth of the scene or fails to obtain consistent multi-view renderings. We believe both problems open directions for future work.

Paper

Our Related Projects

Check out our related projects on neural rendering and neural fields!

Neural Radiance Flow for 4D View Synthesis and Video Processing

We present a method to capture a dynamic scene utilizing a spatial-temporal radiance field. We enforce consistency in this field utilizing a continuous flow field. We show that such an approach enables us to synthesize novel views in dynamic scenes captured using as little as a single monocular video, and further show that our radiance field can be utilized to denoise and super-resolve input images.

Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement

We propose a method that maps a single image of a scene to a 3D neural scene representation that captures and disentangles the movable and immovable parts of the scene. Each scene component is parameterized with 2D neural ground plans, which are grids of features aligned with the ground plane that can be locally decoded into 3D neural radiance fields. We learn ground plans in a self-supervised manner through neural rendering and demonstrate their utility for tasks such as extraction of object-centric 3D representations, novel view synthesis, instance segmentation, and 3D bounding box prediction.

Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation

We present Neural Descriptor Fields (NDFs), an SE(3)-equivariant object representation that enables manipulation of different categories of objects at arbitrary poses from a limited number (5-10) of demonstrations.

Learning Signal-Agnostic Manifolds of Neural Fields

We present a method to capture the underlying structure of arbitrary data signals by representing each data point as a neural field. This enables us to interpolate and generate new samples in image, shape, audio, and audiovisual domains, all using the same architecture.

Citation

@inproceedings{du2023cross,
  title={Learning to Render Novel Views from Wide-Baseline Stereo Pairs},
  author={Du, Yilun and Smith, Cameron and Tewari, Ayush and Sitzmann, Vincent},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
