How Neural Radiance Fields (NeRF) and Instant Neural Graphics Primitives work

Neural radiance fields (NeRFs) are slowly becoming the next hot topic in the world of Deep Learning. Since they were originally proposed in 2020, there has been an explosion of papers, as can be seen from CVPR’s 2022 submissions. Time magazine recently included a variation of NeRFs, called instant neural graphics primitives, in their best inventions of 2022 list. But what exactly are NeRFs and what are their applications?

In this article, I will try to demystify the different terms such as neural fields, NeRFs, neural graphics primitives etc. To give you a preview, they all stand for the same thing, depending on who you ask. I will also present an explanation of how they work by analyzing the two most influential papers.

What is a neural field?

The term neural field was popularized by Xie et al. and describes a neural network that parametrizes a signal. This signal is usually a single 3D scene or object, but that’s not mandatory. We can also use neural fields to represent any type of signal (discrete or continuous) such as audio or images.

Their most popular use is in computer graphics applications such as image synthesis and 3D reconstruction, which is the main topic of this article.

Please note that neural fields have also been applied in other areas such as generative modeling, 2D image processing, robotics, medical imaging and audio parameterization.

In most neural field variations, fully connected neural networks encode objects or scenes’ properties. Importantly, one network needs to be trained to encode (capture) a single scene. Note that in contrast with standard machine learning, the goal is to overfit the neural network to a particular scene. In essence, neural fields embed the scene into the weights of the network.

Why use neural fields?

3D scenes are typically stored using voxel grids or polygon meshes. On the one hand, voxels are usually very expensive to store. On the other hand, polygon meshes can represent only hard surfaces and aren’t suitable for applications such as medical imaging.


Voxels vs Polygon meshes. Source: Wikipedia on Voxels, Wikipedia on Polygon Meshes

Neural fields have gained increasing popularity in computer graphics applications as they are very efficient and compact 3D representations of objects or scenes. Why? In contrast with voxels or meshes, they are differentiable and continuous. One other advantage is that they can also have arbitrary dimensions and resolutions. Plus they are domain agnostic and do not depend on the input for each task.

At that point, you may ask: where does the name neural fields come from?

What do fields stand for?

In physics, a field is a quantity defined for all spatial and/or temporal coordinates. It can be represented as a mapping from a coordinate $x$ to a quantity $y$, typically a scalar, a vector, or a tensor. Examples include gravitational fields and electromagnetic fields.

The next question you may ask: what are the steps to “learn” a neural field?

Steps to train a neural field

Following Xie et al., the typical process of computing neural fields can be formulated as follows:

  1. Sample coordinates of a scene.

  2. Feed them to a neural network to produce field quantities.

  3. Sample the field quantities from the desired reconstruction domain of the problem.

  4. Map the reconstruction back to the sensor domain (e.g. 2D RGB images).

  5. Calculate the reconstruction error and optimize the neural network.
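
To make the five steps concrete, here is a minimal sketch in NumPy. It is a toy under stated assumptions, not the method of Xie et al.: the "scene" is a 1D signal on [0, 1], the network is a one-hidden-layer MLP with hand-written backpropagation, and the forward map is the identity because the sensor observes the field directly. The function name `train_neural_field` and all hyperparameters are illustrative choices of mine.

```python
import numpy as np

def train_neural_field(signal, steps=3000, hidden=32, lr=0.02, batch=64, seed=0):
    """Overfit a one-hidden-layer MLP (the field) to a single 1D signal on [0, 1]."""
    rng = np.random.default_rng(seed)
    # Network parameters: these are what ends up "storing" the signal.
    w1 = rng.normal(0.0, 1.0, (1, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(steps):
        x = rng.uniform(0.0, 1.0, (batch, 1))   # step 1: sample coordinates
        h = np.tanh(x @ w1 + b1)                 # step 2: network -> field quantities
        y = h @ w2 + b2
        pred = y                                 # steps 3-4: forward map (identity here)
        err = pred - signal(x)                   # step 5: error against the measurements
        losses.append(float(np.mean(err ** 2)))
        # Manual backpropagation of the mean squared error.
        g_y = 2.0 * err / batch
        g_h = (g_y @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_y)
        b2 -= lr * g_y.sum(axis=0)
        w1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)

    def field(x):
        """Query the trained field at arbitrary (continuous) coordinates."""
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        return (np.tanh(x @ w1 + b1) @ w2 + b2).ravel()

    return field, losses

# Usage: overfit the field to one sine wave, then query it anywhere in [0, 1].
field, losses = train_neural_field(lambda x: np.sin(2.0 * np.pi * x))
```

Note that there is no train/test split: the network is deliberately overfitted to one signal, and the returned `field` can be queried at any coordinate, not only at the sampled ones.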


A typical neural field algorithm. Source: Xie et al.

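For clarity, let’s use some mathematical terms to denote the process. The reconstruction is a neural field, denoted as $\Phi : X \rightarrow Y$, that maps the world coordinates $x_{recon} \in X$ to field quantities $y_{recon} \in Y$. A sensor observation is also a field, $\Omega : S \rightarrow T$, that maps the sensor coordinates $x_{sens} \in S$ to measurements $t_{sens} \in T$. The forward map $F : (X \rightarrow Y) \rightarrow (S \rightarrow T)$ connects the two and is differentiable.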

As a result, we can solve the following optimization problem to calculate the neural field $\Phi$, where $\Theta$ denotes its parameters:

$$\mathrm{argmin}_{\Theta} \int_{(x_{recon}, x_{sens}) \in (X, S)} \| F(\Phi(x_{recon})) - \Omega(x_{sens}) \|$$

The table below (Xie et al.) illustrates different applications of neural fields alongside the reconstruction and sensor domains.


Examples of forward maps. Source: Xie et al.

Let’s analyze the most popular architecture of neural fields called NeRFs that solves the problem of view synthesis.

Neural Radiance Fields (NeRFs) for view synthesis

The most prominent neural field architecture is called Neural Radiance Fields or NeRFs. They were originally proposed in order to solve view synthesis. View synthesis is the task where you generate a 3D object or scene given a set of pictures from different angles (or views). View synthesis is almost equivalent to 3D reconstruction.


Multi-view 3D reconstruction. Source: Convex Variational Methods for Single-View and Space-Time Multi-View Reconstruction

Note that in order to fully understand NeRFs, one has to be familiar with many computer graphics concepts such as volumetric rendering and ray casting. In this section, I will try to explain them as efficiently as possible but also leave a few extra resources to extend your research. If you are looking for a structured course to get started with computer graphics, Computer Graphics by UC San Diego is the best one as far as I know.

NeRFs and Neural fields terminology side by side

As I already mentioned, NeRFs are a special case of neural fields. For that reason, let’s see a side-by-side comparison. Feel free to revisit this table once we explain NeRFs in order to draw the connection between them and neural fields.

| Neural Fields | Neural Radiance Fields (NeRF) |
| --- | --- |
| World coordinate $x_{recon} \in X$ | Spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$ |
| Field quantities $y_{recon} \in Y$ | Color $c = (r, g, b)$ and volume density $\sigma$ |
| Field $\Phi : X \rightarrow Y$ | MLP |
| Sensor coordinates $x_{sens} \in S$ | 2D images |
| Measurements $t_{sens} \in T$ | Radiance |
| Sensor $\Omega : S \rightarrow T$ | Digital camera |
| Forward mapping $F : (X \rightarrow Y) \rightarrow (S \rightarrow T)$ | Volume rendering |

The reason I decided to present neural fields first and NeRFs second is to make clear that neural fields are a far more general framework.

NeRFs explained

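NeRFs, as proposed by Mildenhall et al., accept a single continuous 5D coordinate as input, which consists of a spatial location $(x, y, z)$ and a viewing direction $(\theta, \phi)$. This coordinate is fed into an MLP, which outputs the color $c = (r, g, b)$ and the volume density $\sigma$ at that location.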

The (probability) volume density indicates how much radiance (or luminance) is accumulated by a ray passing through $(x, y, z)$.
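
To see the shape of this mapping in code, here is a minimal sketch of a radiance field query in NumPy. It assumes the 5D input is a position plus two viewing angles, uses random untrained weights, and skips details of the actual architecture such as its depth and input encoding. The names `init_params` and `query_field` and all layer sizes are hypothetical. What it does keep from the paper is that density depends on position only, while color also depends on the viewing direction.

```python
import numpy as np

def init_params(hidden=32, seed=0):
    """Random weights for a toy radiance field (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    return {
        "w_pos": rng.normal(0.0, 0.3, (3, hidden)),       # position -> features
        "w_sigma": rng.normal(0.0, 0.3, (hidden, 1)),     # features -> density
        "w_rgb": rng.normal(0.0, 0.3, (hidden + 2, 3)),   # features + view -> color
    }

def query_field(params, xyz, view):
    """Map positions (N, 3) and viewing angles (N, 2) to colors (N, 3) and densities (N,)."""
    xyz = np.asarray(xyz, dtype=float)
    view = np.asarray(view, dtype=float)
    feat = np.maximum(xyz @ params["w_pos"], 0.0)             # ReLU hidden layer
    sigma = np.maximum(feat @ params["w_sigma"], 0.0)[:, 0]   # density >= 0, position only
    logits = np.concatenate([feat, view], axis=-1) @ params["w_rgb"]
    rgb = 1.0 / (1.0 + np.exp(-logits))                       # sigmoid keeps colors in [0, 1]
    return rgb, sigma

# Usage: the same two points seen from two different directions.
params = init_params()
xyz = np.array([[0.1, 0.2, 0.3], [0.5, -0.4, 0.9]])
rgb_a, sigma_a = query_field(params, xyz, np.zeros((2, 2)))
rgb_b, sigma_b = query_field(params, xyz, np.array([[1.0, 0.5], [1.0, 0.5]]))
```

Changing the viewing direction changes the returned color but leaves the density untouched, which is what lets the field model view-dependent effects on top of a consistent geometry.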


Neural Radiance Fields. Source: Mildenhall et al.

The power of the neural field is that it can output different representations for the same point when viewed from different angles. As a result, it can capture various lighting effects such as reflections and transparencies, making it ideal to render different views of the same scene. This makes it a much better representation compared to voxel grids or meshes.

Training NeRFs

The problem with training these architectures is that the target density and color are not known. Therefore, we need a (differentiable) method to map them back to 2D images. These images are then compared with the ground truth images, forming a rendering loss against which we can optimize the network.


NeRFs training process. Source: Mildenhall et al.

As shown in the image above, volume rendering is used to map the neural field output back to the 2D image. The standard L2 loss can be computed using the input image/pixel in an autoencoder fashion. Note that volume rendering is a very common process in computer graphics. Let’s see in short how it works.
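
As a small illustration of the loss itself, here is a sketch of an L2 rendering loss in NumPy. The function name `rendering_loss` is my own, and averaging over pixels and channels (rather than summing) is an assumption that only rescales the objective.

```python
import numpy as np

def rendering_loss(rendered_rgb, target_rgb):
    """Mean squared error between rendered and ground-truth pixel colors."""
    rendered_rgb = np.asarray(rendered_rgb, dtype=float)
    target_rgb = np.asarray(target_rgb, dtype=float)
    return float(np.mean((rendered_rgb - target_rgb) ** 2))

# Usage: one pure-red rendered pixel compared against a black ground-truth pixel.
loss = rendering_loss([[1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]])
```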

Volume rendering

When sampling coordinates from the original images, we emit rays at each pixel and sample at different timesteps, a process known as ray marching. Each sample point has a spatial location, which together with the viewing direction is the input of the neural field, and a color and a volume density, which are its outputs.

A ray is a function of its origin $\mathbf{o}$, its direction $\mathbf{d}$, and its samples at timesteps $t$. It can be formulated as $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$.
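
The ray formula translates almost directly into code. The sketch below samples points at evenly spaced values of t between a near and a far bound. The even spacing, the normalization of the direction, and the name `sample_along_ray` are simplifying assumptions of mine, not necessarily the sampling scheme used in the paper.

```python
import numpy as np

def sample_along_ray(origin, direction, t_near, t_far, n_samples):
    """Return t values (N,) and points r(t) = o + t * d (N, 3), evenly spaced in t."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)   # use a unit-length direction
    t = np.linspace(t_near, t_far, n_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]
    return t, points

# Usage: a ray leaving the origin along +z, sampled at t = 1, 2, 3.
t, points = sample_along_ray([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], 1.0, 3.0, 3)
```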


Ray Marching. Source: Creating a Volumetric Ray Marcher by Ryan Brucks

To map them back to the image, all we have to do is integrate these rays and acquire the color of each pixel.

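$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \, \sigma(\mathbf{r}(t)) \, c(\mathbf{r}(t), \mathbf{d}) \, dt$$

Here $t_n$ and $t_f$ are the near and far bounds of the ray, and $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right)$ is the accumulated transmittance, i.e. the probability that the ray travels from $t_n$ to $t$ without being blocked.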
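
In practice the integral is approximated with a sum over the samples along the ray. Below is a minimal sketch of that quadrature in NumPy: each sample contributes its color weighted by its opacity $1 - e^{-\sigma_i \delta_i}$ and by the transmittance accumulated before it, where $\delta_i$ is the distance to the next sample. The function name `render_ray` is hypothetical, and treating the last interval as very long is a convention I assume here.

```python
import numpy as np

def render_ray(sigma, rgb, t):
    """Approximate C(r) from densities (N,), colors (N, 3) and sample positions t (N,)."""
    sigma = np.asarray(sigma, dtype=float)
    rgb = np.asarray(rgb, dtype=float)
    t = np.asarray(t, dtype=float)
    # Distance between consecutive samples; the last interval is treated as very long.
    delta = np.diff(t, append=t[-1] + 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)                 # opacity of each segment
    # Transmittance: fraction of light that survives all segments before sample i.
    optical_depth = np.concatenate([[0.0], np.cumsum(sigma[:-1] * delta[:-1])])
    weights = np.exp(-optical_depth) * alpha
    return weights @ rgb, weights

# Usage: a dense red sample in front of a dense blue one; the pixel comes out red.
color, weights = render_ray(
    sigma=[1e3, 1e3],
    rgb=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    t=[0.0, 1.0],
)
```

Every operation here is differentiable with respect to `sigma` and `rgb`, which is what allows the rendering loss to be propagated back to the network that produced them.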


Multiresolution Hash Encoding. Source: Müller et al.

But what do we gain from this somewhat complicated encoding?

  1. By training the encoding parameters alongside the network, we get a big boost in the quality of the final result.

  2. By using multiple resolutions, we gain an automatic level of detail, meaning that the network learns both coarse and fine features.

  3. By using hashing to associate the 3D space with feature vectors, the encoding process becomes entirely task-agnostic.
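
For readers who prefer code, here is a rough sketch of a multiresolution hash encoding for a single 3D point in NumPy. It is a simplified illustration, not the instant-ngp implementation: the function names, the default `n_min` and `growth` values, and the choice to hash at every level are assumptions of mine. The hashing constants are the ones given by Müller et al., and the feature tables are the parameters that would be trained alongside the network.

```python
import numpy as np

PRIMES = (1, 2654435761, 805459861)   # per-axis hashing constants from Müller et al.

def hash_corner(ix, iy, iz, table_size):
    """Spatial hash: XOR the prime-scaled integer coordinates, then wrap into the table."""
    h = ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) & 0xFFFFFFFF
    return h % table_size

def encode(point, tables, n_min=16, growth=1.5):
    """Encode a point in [0, 1]^3 as the concatenation of per-level interpolated features."""
    point = np.asarray(point, dtype=float)
    features = []
    for level, table in enumerate(tables):
        res = int(np.floor(n_min * growth ** level))   # grid resolution at this level
        scaled = point * res
        base = np.floor(scaled).astype(int)            # lower corner of the enclosing voxel
        frac = scaled - base                           # position inside the voxel
        level_feat = np.zeros(table.shape[1])
        for corner in range(8):                        # the 8 corners of the voxel
            offset = np.array([(corner >> k) & 1 for k in range(3)])
            ix, iy, iz = (int(v) for v in base + offset)
            # Trilinear interpolation weight of this corner.
            weight = np.prod(np.where(offset == 1, frac, 1.0 - frac))
            level_feat += weight * table[hash_corner(ix, iy, iz, len(table))]
        features.append(level_feat)
    return np.concatenate(features)

# Usage: 4 levels, each with a table of 1024 two-dimensional feature vectors.
rng = np.random.default_rng(0)
tables = [rng.normal(0.0, 1e-2, (1024, 2)) for _ in range(4)]
encoded = encode([0.3, 0.7, 0.2], tables)   # 4 levels x 2 features = 8 values
```

The encoded vector, not the raw coordinate, is what gets passed to the (much smaller) MLP. Coarse levels change slowly across space and fine levels change quickly, which is where the automatic level of detail comes from.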


Conclusion

In my opinion, NeRFs are one of the most exciting applications of neural networks of the last few years. Being able to render 3D models in a matter of seconds was simply inconceivable a couple of years ago. It won’t be long before we see these architectures enter the gaming and simulation industries.

To experiment with NeRFs, I recommend visiting the instant-ngp repo by Nvidia, installing the necessary dependencies and playing around by creating your own models.


