Rodolfo Corona Rodriguez
University of California Berkeley
Berkeley, CA
Using Multi-view 3D Reconstruction for Understanding Descriptions of 2D Images
Language provides a natural interface for humans to interact with robotic systems. Because robots are embodied in the world, they must be able to understand the links between language and their visual perception for this communication to work. Additionally, although the visual signals from cameras are 2-dimensional, they capture views of a fundamentally 3-dimensional world. We hypothesize that explicitly reasoning about 3D structure is beneficial for this type of language learning. This presentation will describe the Voxel-informed Language Grounder (VLG), a machine learning system that uses a multi-view reconstruction model to derive explicit 3D voxel representations of objects, which it leverages to reason about the language used to describe them. VLG employs a multimodal neural network to extract joint features from images and text, which it supplements with features from a transformer that correlates the text with the predicted voxel map. We use the ShapeNet Annotated with Referring Expressions (SNARE) benchmark as a testbed. SNARE tasks agents with selecting a target object over a confounder, given images of both objects and a text caption describing the target. VLG attains state-of-the-art reference-game accuracy on SNARE (a 2.0% improvement over the closest baseline), improving over systems that do not employ explicit 3D representations. Further, we show that the largest improvements are gained on a split of captions that focus on describing object geometry, highlighting the contribution of using explicit 3D representations when learning language in the context of 2D images.
SACNAS National Diversity in STEM Conference, San Juan, Puerto Rico, October 27-29, 2022
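Supplementary sketch. To give a more concrete picture of the fusion described in the abstract, below is a minimal PyTorch sketch of a VLG-style scorer. The module names, feature dimensions, mean pooling, and two-candidate setup are illustrative assumptions, not the authors' implementation: it combines a joint image-text feature with the pooled output of a transformer that attends jointly over caption tokens and features of the predicted voxel map, and the candidate object with the higher score is selected.

import torch
import torch.nn as nn

class VoxelLanguageScorer(nn.Module):
    """Scores how well a caption matches one candidate object (illustrative only)."""

    def __init__(self, joint_dim=512, voxel_feat=64, d_model=128):
        super().__init__()
        # Project voxel-map features and per-token caption features into a shared space.
        self.voxel_proj = nn.Linear(voxel_feat, d_model)
        self.text_proj = nn.Linear(joint_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fuse the joint image-text feature with the pooled voxel-text feature.
        self.score_head = nn.Sequential(
            nn.Linear(joint_dim + d_model, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, joint_feat, text_tokens, voxel_tokens):
        # joint_feat:   (B, joint_dim)     joint image-text embedding per candidate
        # text_tokens:  (B, T, joint_dim)  per-token caption embeddings
        # voxel_tokens: (B, V, voxel_feat) features of the predicted 3D voxel map
        seq = torch.cat([self.text_proj(text_tokens),
                         self.voxel_proj(voxel_tokens)], dim=1)
        pooled = self.transformer(seq).mean(dim=1)   # correlate caption with voxels
        fused = torch.cat([joint_feat, pooled], dim=-1)
        return self.score_head(fused).squeeze(-1)    # one score per candidate

# Reference-game usage: score the target and the confounder, select the higher score.
scorer = VoxelLanguageScorer()
joint = torch.randn(2, 512)         # two candidate objects
text = torch.randn(2, 20, 512)
voxels = torch.randn(2, 128, 64)
scores = scorer(joint, text, voxels)
predicted = scores.argmax().item()  # index of the object the model selects

The transformer here lets caption tokens attend directly to voxel features, which is one plausible way geometry-focused language could be matched against explicit 3D structure rather than 2D images alone.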