We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail because they recover incorrect 3D geometry and because the high cost of differentiable rendering precludes scaling them to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient image-space epipolar line sampling scheme to assemble image features for a target ray, and introducing a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing both rendering time and memory footprint. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
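To make the image-space epipolar line sampling idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a pinhole context camera with intrinsics `K` and world-to-camera pose `(R, t)`, and the function names, the `near`/`far` bounds, and the uniform pixel-space spacing are illustrative assumptions. The target ray segment is projected into the context image, and sample locations are spaced evenly along the resulting epipolar line segment rather than evenly in depth.

```python
import numpy as np

def project(K, R, t, X):
    """Project world points X (N, 3) with a pinhole camera x ~ K (R X + t).

    Returns pixel coordinates (N, 2) and camera-space depths (N,).
    """
    Xc = X @ R.T + t
    uv = Xc @ K.T
    return uv[:, :2] / uv[:, 2:3], Xc[:, 2]

def epipolar_samples(K, R, t, origin, direction, near, far, n_samples):
    """Pixel locations spaced uniformly along the epipolar line of a target ray.

    Hypothetical helper: the ray is clipped to [near, far], its two endpoints
    are projected into the context view, and samples are interpolated
    linearly in image space between the projected endpoints.
    """
    ends = np.stack([origin + near * direction, origin + far * direction])
    px, depth = project(K, R, t, ends)
    if np.any(depth <= 0):
        raise ValueError("ray segment must lie in front of the context camera")
    w = np.linspace(0.0, 1.0, n_samples)[:, None]
    return (1.0 - w) * px[0] + w * px[1]
```

Image features would then be looked up (for example, bilinearly interpolated) at the returned pixel locations and passed to the renderer. Sampling in image space ties the number of samples to the pixel length of the epipolar segment instead of to a dense set of 3D depth samples.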
Below, we illustrate novel view rendering results of our approach from different wide-baseline image pairs. Our approach consistently renders novel views across these wide baselines.
Next, we provide baseline comparisons of our approach with IBRNet, pixelNeRF, and GPNR on challenging indoor scenes in RealEstate10K.
Next, we provide baseline comparisons of our approach with IBRNet, pixelNeRF, and GPNR on different outdoor scenes in ACID.
Our approach can also render novel views from unposed images from the internet. We utilize 2D correspondences inferred from SuperGlue to obtain relative pose estimates between images.
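The page does not state which solver turns the 2D correspondences into a relative pose. One common route is to estimate the essential matrix and then decompose it. The sketch below shows only the first step, using the plain eight-point algorithm in NumPy. It assumes the matched points have already been converted to normalized camera coordinates (pixels multiplied by the inverse intrinsics), and the function name is hypothetical.

```python
import numpy as np

def essential_from_matches(x1, x2):
    """Eight-point estimate of E such that x2_h^T E x1_h = 0.

    x1, x2: (N, 2) matched points in normalized camera coordinates, N >= 8.
    The result is defined only up to sign and scale.
    """
    if len(x1) < 8 or len(x1) != len(x2):
        raise ValueError("need at least 8 matched points of equal count")
    x1h = np.column_stack([x1, np.ones(len(x1))])
    x2h = np.column_stack([x2, np.ones(len(x2))])
    # Each match contributes one row of the linear system A @ vec(E) = 0.
    A = np.einsum("ni,nj->nij", x2h, x1h).reshape(len(x1), 9)
    _, _, vt = np.linalg.svd(A)
    E = vt[-1].reshape(3, 3)
    # Enforce the essential-matrix structure: singular values (1, 1, 0).
    u, _, vt = np.linalg.svd(E)
    return u @ np.diag([1.0, 1.0, 0.0]) @ vt
```

With real, noisy matches this step would normally sit inside a robust estimator such as RANSAC, followed by decomposing `E` into a rotation and a unit-scale translation with a cheirality check. OpenCV's `cv2.findEssentialMat` and `cv2.recoverPose` cover both steps.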
In this paper, we've presented a new approach to render novel views from wide-baseline stereo pairs. While our approach outperforms the existing state of the art, this is a very challenging problem, and there are many test scenes in which our approach either fails to adequately estimate the depth of the scene or fails to obtain consistent multi-view renderings. We believe both problems open directions for future work.
Check out our related projects on neural rendering and neural fields!
@inproceedings{du2023cross,
title={Learning to Render Novel Views from Wide-Baseline Stereo Pairs},
author={Du, Yilun and Smith, Cameron and Tewari, Ayush and Sitzmann, Vincent},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}