Recreating real-world locations as 3D models has historically been painstaking work for human artists, particularly when applications call for photorealistic accuracy. But Google researchers have developed an alternative that could simultaneously automate the 3D modeling process and improve its results, using a neural network fed with crowdsourced photos of a location to convincingly replicate landmarks and lighting in 3D.
The idea behind neural radiance fields (NeRF) is to extract 3D depth data from 2D images by determining where light rays terminate, a technique that on its own can create plausible textured 3D models of landmarks. Google's NeRF in the Wild (NeRF-W) system goes further in several ways. First, it uses "in-the-wild photo collections" as inputs, expanding the computer's ability to see a landmark from multiple angles. Next, it evaluates the images to find the underlying structures, separating out photographic and environmental variations such as image exposure, scene lighting, post-processing, and weather conditions, as well as shot-to-shot object differences such as people who appear in one image but not another. Then it recreates scenes as a combination of static elements (structure geometry and textures) and transient ones that contribute their own volumetric radiance.
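To make the static/transient split concrete, here is a minimal sketch of how radiance from the two components can be alpha-composited along a single camera ray. This is illustrative only: the real system uses learned MLPs and per-image latent embeddings to produce the densities and colors, and all function and variable names here are assumptions for demonstration.

```python
import numpy as np

def composite_ray(static_density, static_color,
                  transient_density, transient_color, deltas):
    """Alpha-composite static and transient radiance along one ray.

    Densities have shape (n_samples,), colors (n_samples, 3), and
    `deltas` holds the distances between consecutive ray samples.
    """
    # Opacity at each sample comes from the combined density.
    total_density = static_density + transient_density
    alpha = 1.0 - np.exp(-total_density * deltas)
    # Transmittance: how much light survives to reach each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    # Each component contributes color in proportion to its own density.
    weights_s = trans * (1.0 - np.exp(-static_density * deltas))
    weights_t = trans * (1.0 - np.exp(-transient_density * deltas))
    return (weights_s[:, None] * static_color
            + weights_t[:, None] * transient_color).sum(axis=0)
```

Because the transient field has its own density, a passerby caught in one photo can occupy space in that image's reconstruction without leaving any trace in the shared static geometry.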
As a result, NeRF-W’s 3D models of landmarks can be viewed smoothly from multiple angles without looking jittery or artifact-ridden, while the lighting system uses the detected variations to guide scene lighting and shadowing. NeRF-W also treats image-to-image object differences as an uncertainty field, either eliminating or de-emphasizing them, whereas the standard NeRF system lets those differences appear as cloudlike occluding artifacts because it doesn’t separate them from structures during training.
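The uncertainty field works by letting the model down-weight pixels it flags as unreliable, roughly in the spirit of an uncertainty-weighted reconstruction loss. The sketch below is a loose, simplified illustration of that idea, not Google's actual training objective; the function name, the regularizer weight, and the exact terms are all assumptions.

```python
import numpy as np

def uncertainty_weighted_loss(pred_rgb, true_rgb, beta,
                              transient_density, lam=0.01):
    """Toy reconstruction loss with per-ray uncertainty down-weighting.

    `pred_rgb` and `true_rgb` have shape (n_rays, 3); `beta` (n_rays,)
    is the predicted uncertainty; `transient_density` collects the
    transient field's density samples.
    """
    # Pixels with high predicted uncertainty contribute less to the error.
    data_term = ((pred_rgb - true_rgb) ** 2
                 / (2.0 * beta[:, None] ** 2)).sum()
    # Penalize uncertainty itself, so the model can't inflate beta everywhere.
    beta_term = np.log(beta).sum()
    # Encourage the transient field to stay sparse.
    transient_term = lam * transient_density.mean()
    return data_term + beta_term + transient_term
```

The trade-off is the key design point: marking a region as uncertain reduces the reconstruction penalty there, but the log term makes that marking costly, so the model only "spends" uncertainty on genuinely transient content such as pedestrians.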
Google’s video comparison of standard NeRF results with NeRF-W suggests that the new neural system can recreate landmarks in 3D convincingly enough that virtual reality and augmented reality device users will be able to experience complex architecture as it actually looks, including time-of-day and weather variations, stepping beyond the company’s prior work with 3D models. It’s also an improvement on a similar alternative disclosed last year, Neural Rerendering in the Wild, because it separates 3D structures from lighting more cleanly and remains more temporally smooth as objects are viewed from different angles.
It’s worth noting that Google certainly isn’t the only company researching ways to use photos as input for 3D modeling; Intel researchers, for instance, are advancing their own work in generating synthesized versions of real-world locations, using multiple photographs plus a recurrent encoder-decoder network to interpolate uncaptured angles. While Intel’s system appears to outperform numerous alternatives, including standard NeRF, on pixel-level sharpness and temporal smoothness, it doesn’t appear to offer the variable lighting capabilities of NeRF-W or share its focus on using randomly sourced photos to recreate real-world locations.
Google’s NeRF-W is discussed in detail in this paper, which arrives just ahead of the European Conference on Computer Vision 2020, beginning August 23. A video showing its performance with landmarks such as Berlin’s Brandenburg Gate and Rome’s Trevi Fountain is available here.