1 MIT CSAIL 2 Stanford University
We present a method, Neural Radiance Flow (NeRFlow), to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when input images are captured with only two separate cameras. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.
We present NeRFlow, which learns a 4D spatial-temporal representation of a dynamic scene. In a scene with water pouring, we are able to render novel images (left), infer the depth map (middle left) and the underlying continuous flow field (middle right), and denoise input observations (right).
Our approach also works on the iGibson robot scene shown below, with rendered images on the left and the inferred depth map in the middle. We are further able to synthesize dynamic animations (right).
Our approach is able to represent a dynamic scene when given only a limited number of views of the scene, as well as a sparse set of timestamps at which the scene is captured. We illustrate novel view synthesis on a dynamic scene captured in a single real monocular video.
Next, we assess the ability of NeRFlow to synthesize scenes that are captured at a sparse set of timestamps. We fit the pouring scene using only 1 in 10 frames of the animation. We then render the resulting pouring animation across the frames of the animation and find that NeRFlow with consistency (left) performs significantly better than an approach without consistency (right).
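To give a rough intuition for what a temporal consistency term can measure, the toy sketch below compares the density at a point with the density at the same point after it has been carried along a flow field. This is not the NeRFlow implementation: the 1D setting, the analytic `flow` and `density` functions, the Euler integrator, and all names are assumptions made purely for illustration.

```python
import math

def flow(x, t):
    """Hypothetical 1D flow field: constant velocity to the right."""
    return 0.5

def density(x, t):
    """Hypothetical density: a Gaussian blob that moves with the flow."""
    center = 0.5 * t
    return math.exp(-((x - center) ** 2) / 0.02)

def advect(x, t0, t1, steps=100):
    """Carry point x from time t0 to t1 by Euler-integrating the flow."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x += flow(x, t) * dt
        t += dt
    return x

def consistency_loss(points, t0, t1):
    """Mean squared density difference between points and their advected copies."""
    total = 0.0
    for x in points:
        x1 = advect(x, t0, t1)
        total += (density(x, t0) - density(x1, t1)) ** 2
    return total / len(points)

points = [i / 10 for i in range(-5, 6)]

# Correspondences obtained by following the flow: the density matches.
loss_consistent = consistency_loss(points, 0.0, 1.0)

# Comparing the same fixed locations at two times ignores the motion.
loss_static = sum((density(x, 0.0) - density(x, 1.0)) ** 2 for x in points) / len(points)
```

In this toy setup the density is transported exactly by the flow, so `loss_consistent` is essentially zero while `loss_static` is not. A term of this kind ties unobserved timestamps to observed ones, which is one plausible reading of why consistency helps when only a sparse set of frames is available.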
NeRFlow is able to capture and aggregate radiance information across different viewpoints and timesteps. By rendering from this aggregate representation, NeRFlow can serve as a scene prior for video processing tasks. We show that when NeRFlow is fit on noisy input images, renders from NeRFlow denoise and super-resolve the input images.
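The statistical intuition behind aggregation as a denoiser can be illustrated with a deliberately simplified analogy. The sketch below uses plain averaging of already-aligned noisy observations, which is not what NeRFlow does; the signal, the Gaussian noise model, the noise level, and all names are assumptions chosen for illustration.

```python
import random

random.seed(0)

# Hypothetical ground-truth intensities for a row of 100 pixels.
truth = [0.5 + 0.4 * ((i % 20) / 20) for i in range(100)]

def observe(signal, sigma=0.1):
    """Return one noisy observation of the signal (independent Gaussian noise)."""
    return [v + random.gauss(0.0, sigma) for v in signal]

# 50 noisy observations, standing in for different views or timesteps.
observations = [observe(truth) for _ in range(50)]

# Aggregate by averaging each pixel over all observations.
aggregate = [
    sum(obs[i] for obs in observations) / len(observations)
    for i in range(len(truth))
]

def mean_abs_error(estimate, reference):
    """Mean absolute difference between an estimate and the reference."""
    return sum(abs(a - b) for a, b in zip(estimate, reference)) / len(reference)

single_error = mean_abs_error(observations[0], truth)
aggregate_error = mean_abs_error(aggregate, truth)
```

Here `aggregate_error` comes out well below `single_error`, because independent noise averages out while the shared signal does not. The analogy is limited: the observations in this toy are already aligned pixel by pixel, whereas a real dynamic scene requires relating observations across viewpoints and time before any such aggregation is possible.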