SRI authors: Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar
Abstract: Understanding the geometric relationships between objects in a scene is a core capability that enables both humans and autonomous agents to navigate in new environments. A sparse, unified representation of the scene topology will allow agents to act efficiently as they move through their environment, communicate the […]
SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments.
This paper describes a system that provides general head-worn outdoor AR capability for the user inside a moving vehicle.
SRI authors: Han-Pang Chiu, Supun Samarasekera
Abstract: Understanding the perceived scene during navigation enables intelligent robot behaviors. Current vision-based semantic SLAM (Simultaneous Localization and Mapping) systems provide these capabilities. However, their performance decreases in visually-degraded environments, which are common in critical robotic applications such as search-and-rescue missions. In this paper, we present […]
Proper occlusion-based rendering is essential for achieving realism in indoor and outdoor Augmented Reality (AR) applications. This paper addresses the problem of fast and accurate dynamic occlusion reasoning by real objects in the scene for large-scale outdoor AR applications. Conceptually, proper occlusion reasoning requires an estimate of depth for every point in the augmented scene, which is technically hard to achieve for outdoor scenarios, especially in the presence of moving objects. We propose a method to detect and automatically infer the depth of real objects in the scene without explicit detailed scene modeling or depth sensing (e.g., without sensors such as 3D LiDAR). Specifically, we employ instance segmentation of color image data to detect real dynamic objects in the scene, and use either a top-down terrain elevation model or a deep-learning-based monocular depth estimation model to infer their metric distance from the camera for proper occlusion reasoning in real time. The realized solution is implemented in a low-latency real-time framework for video-see-through AR and is directly extendable to optical-see-through AR. We minimize latency in depth reasoning and occlusion rendering through semantic object tracking and prediction in video frames.
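The compositing step described above reduces to a per-pixel depth test between the virtual depth buffer and the real-scene depth inferred for segmented objects. The following is a minimal sketch of that test, assuming instance masks and per-instance metric distances have already been computed upstream (by segmentation and either the terrain model or the monocular depth network); all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def composite_with_occlusion(camera_rgb, virtual_rgba, virtual_depth,
                             instance_masks, instance_depths):
    """Composite virtual content over the camera frame, hiding virtual
    pixels that lie behind detected real objects.

    camera_rgb      : HxWx3 uint8 camera frame
    virtual_rgba    : HxWx4 uint8 rendered virtual content (alpha = coverage)
    virtual_depth   : HxW float depth buffer of the virtual content (meters)
    instance_masks  : list of HxW boolean masks from instance segmentation
    instance_depths : per-instance metric distance from the camera (meters),
                      assumed given here (terrain model or monocular depth)
    """
    # Build a dense "real depth" map: background is infinitely far,
    # each segmented object contributes its estimated distance.
    real_depth = np.full(virtual_depth.shape, np.inf, dtype=np.float32)
    for mask, d in zip(instance_masks, instance_depths):
        real_depth[mask] = np.minimum(real_depth[mask], d)

    # A virtual pixel is visible only where it is closer than the real scene.
    alpha = virtual_rgba[..., 3:4] / 255.0
    visible = (virtual_depth < real_depth)[..., None] & (alpha > 0)

    out = camera_rgb.astype(np.float32)
    blend = alpha * virtual_rgba[..., :3] + (1.0 - alpha) * out
    return np.where(visible, blend, out).astype(np.uint8)
```

In a real-time system this test would run every frame, with the masks supplied by the tracked/predicted object segments rather than a fresh segmentation pass.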
By using our novel attention schema and auxiliary rewards to better utilize scene semantics, we outperform multiple baselines trained with only raw inputs or implicit semantic information, while requiring 80% less agent experience.
We study an important yet largely unexplored problem: large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LiDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LiDAR depth images. We propose a novel joint-embedding-based method that effectively combines appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in both performance and scale. We conclude with qualitative results that highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
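The abstract does not spell out the architecture, but the core idea of a joint embedding for cross-modal retrieval can be sketched as two encoders mapping each modality into a shared metric space trained with a ranking loss. The snippet below is such a sketch under our own assumptions (ResNet-18 backbones, 256-D embeddings, a triplet margin loss); it is not the paper's published model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CrossModalEmbedding(nn.Module):
    """Two-branch joint embedding: ground RGB vs. aerial-LiDAR depth renders.
    A simplified stand-in; backbones, dimensions, and loss are illustrative."""
    def __init__(self, dim=256):
        super().__init__()
        self.rgb_net = models.resnet18(weights=None)
        self.rgb_net.fc = nn.Linear(self.rgb_net.fc.in_features, dim)
        # Depth renders are treated as 3-channel images here for simplicity.
        self.depth_net = models.resnet18(weights=None)
        self.depth_net.fc = nn.Linear(self.depth_net.fc.in_features, dim)

    def embed_rgb(self, x):
        return nn.functional.normalize(self.rgb_net(x), dim=1)

    def embed_depth(self, x):
        return nn.functional.normalize(self.depth_net(x), dim=1)

model = CrossModalEmbedding()
loss_fn = nn.TripletMarginLoss(margin=0.3)

rgb   = torch.randn(8, 3, 224, 224)   # ground-level query images
d_pos = torch.randn(8, 3, 224, 224)   # matching LiDAR depth renders
d_neg = torch.randn(8, 3, 224, 224)   # non-matching depth renders

# Pull matching pairs together, push non-matching pairs apart in the
# shared embedding space; retrieval then ranks by nearest neighbor.
loss = loss_fn(model.embed_rgb(rgb),
               model.embed_depth(d_pos),
               model.embed_depth(d_neg))
loss.backward()
```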
We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. Building on this intuition, we propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets, reporting an average (absolute) improvement of 19% over the current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, with 8–15% improvement from adding semantic information and 4% from our proposed attention module. Finally, we show that the predicted attention maps offer useful insights into our model.
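The sketch below illustrates the attention idea as described in the abstract, namely one channel attention predicted jointly from both modalities plus a separate spatial attention per modality. The pooling choices, layer sizes, and squeeze-and-excitation-style design are our assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MultimodalAttention(nn.Module):
    """Shared channel attention + modality-specific spatial attentions
    over appearance and semantic feature maps (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        # Channel attention predicted jointly from both modalities.
        self.channel_fc = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        # One spatial attention per modality (1x1 conv -> sigmoid).
        self.spatial_app = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.spatial_sem = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, feat_app, feat_sem):
        # Global-average-pool both modalities, predict shared channel weights.
        pooled = torch.cat([feat_app.mean(dim=(2, 3)),
                            feat_sem.mean(dim=(2, 3))], dim=1)
        ch = self.channel_fc(pooled)[:, :, None, None]
        # Re-weight each modality with the shared channel attention and
        # its own spatial attention map.
        app = feat_app * ch * self.spatial_app(feat_app)
        sem = feat_sem * ch * self.spatial_sem(feat_sem)
        return app, sem

att = MultimodalAttention(channels=512)
app, sem = att(torch.randn(2, 512, 14, 14), torch.randn(2, 512, 14, 14))
```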
Accurate motion estimation using low-cost sensors for autonomous robots in visually-degraded environments is critical to applications such as infrastructure inspection and indoor rescue missions. This paper analyzes the feasibility of utilizing multiple low-cost on-board sensors for ground robots or drones navigating in visually-degraded environments. We select four low-cost, small-size sensors for evaluation: an IMU, EO stereo cameras with LED lights, active IR cameras, and a 2D LiDAR. We adapt and extend state-of-the-art multi-sensor motion estimation techniques, including a factor graph framework for sensor fusion, to operate under poor illumination conditions. We evaluate different sensor combinations within the factor graph framework and benchmark the accuracy of each combination on two representative datasets acquired in totally dark environments. Our results show the potential of this sensor fusion approach towards an improved ego-motion solution in challenging dark environments.
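As a concrete illustration of the factor-graph fusion idea, the toy example below lets relative-pose measurements from two sensors (e.g., stereo visual odometry and LiDAR scan matching) constrain the same pair of poses, each with its own noise model, and solves for the fused estimate. It uses the GTSAM library purely for illustration; the abstract does not name the underlying implementation, and the noise values and measurements here are made up.

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
X = lambda i: gtsam.symbol('x', i)  # pose variable keys

# Anchor the first pose at the origin.
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01] * 6))
graph.add(gtsam.PriorFactorPose3(X(0), gtsam.Pose3(), prior_noise))

# Sensor-specific noise: trust LiDAR scan matching more than dark-scene VO.
vo_noise    = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.10] * 6))
lidar_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05] * 6))

# Two (slightly disagreeing) relative-pose measurements for the same step.
vo_delta    = gtsam.Pose3(gtsam.Rot3(), np.array([1.00, 0.00, 0.0]))
lidar_delta = gtsam.Pose3(gtsam.Rot3(), np.array([0.95, 0.05, 0.0]))
graph.add(gtsam.BetweenFactorPose3(X(0), X(1), vo_delta, vo_noise))
graph.add(gtsam.BetweenFactorPose3(X(0), X(1), lidar_delta, lidar_noise))

initial = gtsam.Values()
initial.insert(X(0), gtsam.Pose3())
initial.insert(X(1), vo_delta)

# Nonlinear least-squares over the graph yields the fused pose estimate.
result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose3(X(1)))
```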