Neural Radiance Flow for 4D View Synthesis and Video Processing

Yilun Du1    Yinan Zhang2    Hong-Xing Yu2    Joshua B. Tenenbaum1    Jiajun Wu2

1 MIT CSAIL    2 Stanford University

Paper | Code (Coming Soon)


We present a method, Neural Radiance Flow (NeRFlow), to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when input images are captured with only two separate cameras. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.



arXiv, 2020.


Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B. Tenenbaum, Jiajun Wu. "Neural Radiance Flow for 4D View Synthesis and Video Processing", arXiv:2011.13084.

Code (Coming Soon)

Full View Synthesis

We present NeRFlow, which learns a 4D spatial-temporal representation of a dynamic scene. In a scene with water pouring, we are able to render novel images (left), infer the depth map (middle left) and the underlying continuous flow field (middle right), and denoise input observations (right).

Our approach also works on the Gibson robot scene, with synthesis results shown below: rendered images on the left and the inferred depth map on the right.
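To make the link between rendered images and depth maps concrete, the sketch below composites per-sample density and color along a single camera ray. It assumes the standard NeRF-style volume rendering quadrature; this page does not spell out NeRFlow's exact renderer, and the function name `render_ray` and the toy ray are illustrative only.

```python
import numpy as np

def render_ray(densities, colors, t_vals):
    """Composite samples along one ray into a pixel color and an expected depth."""
    # Distance between consecutive samples; the last interval is treated as very large.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each segment of the ray.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches each sample unoccluded.
    transmittance = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)   # rendered pixel color
    depth = (weights * t_vals).sum()                # expected ray termination distance
    return rgb, depth, weights

# Toy ray (an assumption for illustration): empty space, then a dense red surface at distance 2.0.
t_vals = np.linspace(0.0, 4.0, 9)
densities = np.where(t_vals >= 2.0, 50.0, 0.0)
colors = np.tile(np.array([1.0, 0.0, 0.0]), (len(t_vals), 1))
rgb, depth, weights = render_ray(densities, colors, t_vals)
```

The same weights produce both outputs, which is why a depth map can be read off a radiance field without separate depth supervision.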

Sparse View Synthesis

Our approach is able to represent a dynamic scene given only a limited number of views of the scene.

Stereo Cameras

We consider a dynamic scene of a cup pouring liquid, captured by two stereo cameras moving over it, as illustrated below:

In such a setup, we are able to render the entire dynamic animation from a fixed viewpoint (illustrated below), even though the images are captured at opposite viewpoints over time.

Opposite Cameras

We further consider the Gibson scene, captured by two opposite cameras that move (illustrated below) as a robot travels.

In such a setup, we are also able to render the entire dynamic animation from a fixed viewpoint (illustrated below), even though the cameras move across different viewpoints over time.

Temporal Interpolation

Our spatial-temporal representation allows us to interpolate and render intermediate timesteps inside a video. NeRFlow, with consistency, can generate a smooth rendering of pouring when trained on only 1 in 10 frames of the animation. We compare NeRFlow with consistency (left) against an approach without consistency (right).
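One way to picture how a continuous flow field supports interpolation is to carry a scene point from a training timestep to an intermediate one by integrating the flow. The sketch below is a minimal illustration under stated assumptions: `toy_flow` is a hypothetical stand-in (a rigid rotation) for a learned flow network, and the midpoint integrator is a generic choice, not necessarily the one used by NeRFlow.

```python
import numpy as np

def toy_flow(points, t):
    """Hypothetical stand-in for a learned flow field: rigid rotation about the z-axis."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    return np.stack([-y, x, np.zeros_like(z)], axis=-1)

def advect(points, t_start, t_end, flow_fn, num_steps=1000):
    """Move points from t_start to t_end with midpoint (second-order) integration."""
    dt = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        # Evaluate the flow at the half-step position, then take the full step.
        mid = points + 0.5 * dt * flow_fn(points, t)
        points = points + dt * flow_fn(mid, t + 0.5 * dt)
        t += dt
    return points

# A point at (1, 0, 0.5) rotated by a quarter turn should land near (0, 1, 0.5).
p0 = np.array([[1.0, 0.0, 0.5]])
p_quarter = advect(p0, 0.0, np.pi / 2, toy_flow)
```

A consistency term could then, for example, encourage the radiance at the advected point and time to match the radiance at the starting point and time; the precise losses are described in the paper rather than on this page.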

Monocular Video

Our approach also allows us to synthesize real dynamic scenes captured by a monocular camera.

Video Processing

NeRFlow is able to capture and aggregate radiance information across different viewpoints and timesteps. By rendering from this aggregate representation, NeRFlow can serve as a scene prior for video processing tasks. We present results showing that, when NeRFlow is fit to noisy input images, its renderings denoise and super-resolve those inputs.
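The statistical intuition behind denoising by aggregation can be shown with a toy example: many independently noisy observations of the same scene point, combined into one estimate, have lower error than a typical single observation. This is only an illustration of that effect with a plain average; NeRFlow fits a neural field rather than averaging pixels, and all values below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up ground-truth color of one scene point and its noisy observations
# from different viewpoints and timesteps.
true_color = np.array([0.2, 0.5, 0.8])
num_observations = 50
noise_std = 0.1
observations = true_color + rng.normal(0.0, noise_std, size=(num_observations, 3))

# Aggregating the observations into a single estimate suppresses independent noise.
aggregated = observations.mean(axis=0)

single_error = np.abs(observations - true_color).mean()      # typical per-observation error
aggregated_error = np.abs(aggregated - true_color).mean()    # error after aggregation
```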