In-Place Scene Labelling and Understanding with Implicit Scene Representation

  • Published 28 Mar 2021
  • Project Page: shuaifengzhi.com/Semantic-NeRF/
    Paper: arxiv.org/abs/2103.15875
    Authors: Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger and Andrew Davison
    Organisation: Dyson Robotics Laboratory, Imperial College London
    Abstract:
    Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing as they do not require prior training data, but the same fully self-supervised approach is not possible for semantics because labels are human-defined properties.
    We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved using a small amount of in-place annotations specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to efficiently propagate. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes.
    We demonstrate its advantageous properties in various interesting applications such as an efficient scene labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation and multi-view semantic label fusion in visual semantic mapping systems.
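    As a minimal illustration of the idea (my own PyTorch sketch with assumed layer sizes, not the authors' released code), the extension amounts to adding a view-invariant semantic head alongside NeRF's density and colour heads:

    import torch
    import torch.nn as nn

    class SemanticNeRF(nn.Module):
        # NeRF-style MLP trunk with density, colour and semantic heads.
        def __init__(self, pos_dim=63, dir_dim=27, hidden=256, num_classes=28):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(pos_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.sigma = nn.Linear(hidden, 1)               # density (geometry)
            self.semantic = nn.Linear(hidden, num_classes)  # view-invariant logits
            self.rgb = nn.Sequential(                       # view-dependent colour
                nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
                nn.Linear(hidden // 2, 3), nn.Sigmoid(),
            )

        def forward(self, x, d):
            # x: positionally encoded 3D point, d: encoded viewing direction
            h = self.trunk(x)
            sigma = torch.relu(self.sigma(h))
            logits = self.semantic(h)  # semantics do not see the view direction
            rgb = self.rgb(torch.cat([h, d], dim=-1))
            return sigma, rgb, logits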
  • Science & Technology

Comments • 9

  • @kunzhang7654 • 2 years ago

    Dear Dr. Zhi, great work and congratulations on being accepted to ICCV 2021 with an oral presentation! I was trying to contact you by e-mail, but it seems that your address could not be reached. Could you provide the camera trajectories you used in the Replica dataset? Meanwhile, is there any plan for releasing the code? Thanks a lot, and looking forward to your reply!

    • @zhishuaifeng3342 • 2 years ago

      Hi Kun, thank you for your interest in our work. I am sorry, I have been busy writing my thesis. My email address should work fine right now; I am not sure if it was some weird server issue. If you cannot reach me via my Imperial email, you can also drop me a message at z.shuaifeng@foxmail.com if you like. I will release the rendered Replica sequences after the upcoming thesis deadline, and sorry for the delay.

  • @kwea123 • 3 years ago

    Too much information on each slide, and the slides switch too quickly... it makes the viewer constantly pause the video to read.
    1. The pixel-wise and region-wise denoising results are counter-intuitive to me. With a 90% chance of corruption, the same 3D point has very little chance of being "consistent" across views. How can the model fuse information that is essentially random in each view? Region-wise denoising is much more plausible, because only a few images are perturbed, so the same chair has a higher probability of keeping the same label across views. The quantitative results for pixel-wise denoising are therefore intriguing: how can they be better than region-wise denoising, despite having more noise? With 90% pixel noise I'd expect the chairs to also be 90% wrong, resulting in a lot more noise than in the region-wise experiment...
    2. The results for super-resolution and label propagation are also confusing. Sparse labels with S=16 basically mean 1/256 ≈ 0.4% of pixels per frame, and in this case the ground/floor class is likely to dominate, while some small classes might not be sampled at all. Why is the mIoU better than for label propagation, where at least every class is sampled once, with 1% of pixels?
    Did I misunderstand anything? Thank you
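    For concreteness, the pixel-wise corruption described in point 1 can be simulated roughly like this (my sketch of the assumed protocol, with illustrative names):

    import numpy as np

    def corrupt_labels(label_map, noise_ratio=0.9, num_classes=28, rng=None):
        # Replace a random noise_ratio fraction of pixels with random class labels.
        rng = rng or np.random.default_rng()
        out = label_map.copy()
        mask = rng.random(label_map.shape) < noise_ratio
        out[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
        return out

    noisy = corrupt_labels(np.zeros((480, 640), dtype=np.int64))  # e.g. one frame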

    • @zhishuaifeng3342 • 3 years ago

      Hi kwea123 (AI葵), thank you for your interest and feedback. I have also learned a lot from your NeRF tutorial videos, which are very helpful.
      I agree that the information in this video is a bit dense; we tried to strike a balance between video length and presentation experience. I could possibly put a longer version on the project page so that people can follow the details better.

    • @zhishuaifeng3342 • 3 years ago

      About pixel-wise denoising:
      The performance on the pixel-wise denoising task is quite surprising at first glance, especially since some fine structures can be well preserved. In this task, we randomly change the labels of a randomly selected 90% of pixels in each training label image.
      In my opinion, several factors make this possible:
      (1) The coherent consistency and smoothness within NeRF, together with the view-invariant property of semantics, are the key.
      (2) The underlying geometry and appearance play a very important role, so pixels with similar texture and geometry tend to share the same class. The photometric loss is important here as an auxiliary loss.
      I personally think the denoising task here acts like a "NeRF-CRF", given that a CRF also refines semantics by explicitly modelling similarity in geometry and appearance.
      (3) On average, 10% of pixels per frame are still unchanged, and in addition a 3D position may have a corrupted label in one view but a correct label in another. I also tried a 95% or even higher noise ratio, and as expected the fine structures become much harder to recover, with less accurate boundaries, etc.
      The quantitative results are not meant to show which task is easier or harder in any sense, but mainly that Semantic-NeRF has the ability to recover from these noisy labels. Note that the evaluation is computed on full label frames, including chairs and all other classes.
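      To make factor (2) concrete, the joint objective being described can be sketched as follows (my illustration; the semantic weighting lam is a placeholder, not the paper's value). The rendered colour and logits come from standard volume rendering along each ray:

      import torch.nn.functional as F

      def joint_loss(rendered_rgb, gt_rgb, rendered_logits, noisy_labels, lam=0.04):
          # Photometric term ties semantics to texture/geometry via the shared trunk.
          photometric = F.mse_loss(rendered_rgb, gt_rgb)
          # Cross-entropy against the (possibly corrupted) per-ray labels.
          semantic = F.cross_entropy(rendered_logits, noisy_labels)
          return photometric + lam * semantic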

    • @zhishuaifeng3342 • 3 years ago

      It is true that a larger scaling factor (x16, x32) risks missing tiny structures. Indeed we observe, for example, that the prediction of window frames (red) around blinds (purple) is more accurate in SPx8 than in SPx16. Again, the tables are not meant to compare these two tasks, but to show the capability of Semantic-NeRF.
      A better way to think about super-resolution and label propagation is how they sample the sparse/partial labels. Super-resolution (e.g., SPx16) sparsely decimates label maps following a regular grid pattern with a spacing of 16 pixels, while label propagation (LP) selects a "seed" randomly from each class per frame, as sketched below.
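      The two sampling patterns can be contrasted in code (my sketch, illustrative names):

      import numpy as np

      def super_resolution_mask(h, w, stride=16):
          # SPx16: keep labels on a regular grid, one pixel every `stride` pixels.
          mask = np.zeros((h, w), dtype=bool)
          mask[::stride, ::stride] = True
          return mask

      def label_propagation_mask(label_map, rng=None):
          # LP: keep one randomly chosen "seed" pixel per class present in the frame.
          rng = rng or np.random.default_rng()
          mask = np.zeros(label_map.shape, dtype=bool)
          for c in np.unique(label_map):
              ys, xs = np.nonzero(label_map == c)
              i = rng.integers(len(ys))
              mask[ys[i], xs[i]] = True
          return mask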

    • @zhishuaifeng3342 • 3 years ago

      In SP, a class/instance spanning more than the 16-pixel grid spacing is very likely to be sampled at least once (i.e., to receive one or more seeds on the class/instance). Therefore I think the main difference is the coverage of seeds: SP spreads the seeds across each class, while LP learns from labels concentrated in a local proximity.
      This is also one of the reasons why the prediction of the light (pink) on the ceiling (yellow) has better quality in SP (Fig. 7 and 10) than in LP (Fig. 8): partly because the appearance and geometry of the light and the ceiling are too similar for LP to interpolate between, and the spread of seeds in SP helps.
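      As a rough back-of-envelope (my numbers, not from the paper): on a stride-16 grid, a region of A pixels receives about A/256 seeds in expectation, so anything covering more than one 16x16 cell is usually hit at least once.

      def expected_seeds(area_pixels, stride=16):
          # Expected number of regular-grid seeds landing on a region of this area.
          return area_pixels / (stride * stride)

      print(expected_seeds(1000))  # ~3.9 seeds for a ~1000-pixel object
      print(expected_seeds(200))   # ~0.8: small objects may receive no seed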

    • @zhishuaifeng3342 • 3 years ago

      I hope this information and my understanding are helpful. If you have any further questions, please feel free to discuss via email.