Class imbalance is a fundamental problem in computer vision applications such as semantic segmentation.
2D-3D reasoning and augmented reality publications
This paper describes a system that provides general head-worn outdoor AR capability for the user inside a moving vehicle.
We present SIGNAV, a real-time semantic SLAM system designed to operate in perceptually challenging situations.
We present a method to estimate global camera heading by associating directional information from road segments in the camera view with annotated satellite imagery.
This paper addresses the problem of fast and accurate dynamic occlusion reasoning by real objects in the scene for large scale outdoor AR applications.
We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint-embedding-based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
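For readers unfamiliar with the median rank metric quoted above, the following is a minimal sketch of how it can be computed from paired embeddings. It assumes cosine similarity in the joint embedding space and that query i corresponds to gallery item i; the function names and the toy vectors are illustrative only and are not taken from the paper.

```python
import math
from statistics import median

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def median_rank(query_embs, gallery_embs):
    """Median position of the true match in each query's ranked gallery.

    Assumption: query_embs[i] (e.g. an RGB image embedding) matches
    gallery_embs[i] (e.g. a LIDAR depth image embedding).
    """
    ranks = []
    for i, query in enumerate(query_embs):
        sims = [cosine(query, g) for g in gallery_embs]
        # Rank 1 means no gallery item scores higher than the true match.
        ranks.append(1 + sum(1 for s in sims if s > sims[i]))
    return median(ranks)

# Toy 2-D embeddings (hypothetical values, for illustration only).
rgb_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth_embs = [[0.9, 0.1], [0.8, 0.6], [0.1, 1.0]]
print(median_rank(rgb_embs, depth_embs))
```

A lower median rank is better: a value of 5 on a 50K-pair test set means that for half of the queries the correct location appears among the top 5 retrieved candidates.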
We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets. We report an average (absolute) improvement of 19% over current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing 8–15% and 4% improvement from adding semantic information and our proposed attention module, respectively. We finally show the predicted attention maps to offer useful insights into our model.
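The data flow of an attention module with one shared channel attention and separate per-modality spatial attentions can be sketched as follows. This is a toy, parameter-free stand-in and not the paper's architecture: in the actual model the attentions are predicted by learned layers, whereas here the channel weights come from a sigmoid over pooled activations and the spatial weights from a softmax over channel-averaged activations. All function names and input values are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(app, sem):
    """One weight per channel, shared by both modalities (toy version).

    app, sem: feature maps given as C channels, each a list of N spatial values.
    """
    weights = []
    for a_ch, s_ch in zip(app, sem):
        pooled = (sum(a_ch) / len(a_ch) + sum(s_ch) / len(s_ch)) / 2.0
        weights.append(sigmoid(pooled))
    return weights

def spatial_attention(feat):
    """One weight per spatial location, computed from a single modality."""
    n_locations = len(feat[0])
    scores = [sum(ch[i] for ch in feat) / len(feat) for i in range(n_locations)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

def attended_pool(feat, ch_weights, sp_weights):
    """Scale channels and locations, then pool over space into a C-dim vector."""
    return [w * sum(v * s for v, s in zip(ch, sp_weights))
            for ch, w in zip(feat, ch_weights)]

def embed(app, sem):
    """Concatenate the attended appearance and semantic vectors, L2-normalized."""
    ch_w = channel_attention(app, sem)               # shared across modalities
    emb = (attended_pool(app, ch_w, spatial_attention(app)) +   # appearance branch
           attended_pool(sem, ch_w, spatial_attention(sem)))    # semantic branch
    norm = math.sqrt(sum(x * x for x in emb))
    return [x / norm for x in emb]

# Hypothetical 2-channel, 3-location feature maps.
appearance = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
semantic = [[0.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
embedding = embed(appearance, semantic)
print(len(embedding))
```

The point of the sketch is the sharing pattern: both branches reuse the same channel weights, while each branch decides for itself which image locations to emphasize.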
Accurate motion estimation using low-cost sensors for autonomous robots in visually-degraded environments is critical to applications such as infrastructure inspection and indoor rescue missions. This paper analyzes the feasibility of utilizing multiple low-cost on-board sensors for ground robots or drones navigating in visually-degraded environments. We select four low-cost and small-size sensors for evaluation: IMU, EO stereo cameras with LED lights, active IR cameras, and 2D LiDAR. We adapt and extend state-of-the-art multi-sensor motion estimation techniques, including a factor graph framework for sensor fusion, under poor illumination conditions. We evaluate different sensor combinations using the factor graph framework, and benchmark the accuracy of each combination on two representative datasets acquired in totally dark environments. Our results show the potential of this sensor fusion approach towards an improved ego-motion solution in challenging dark environments.
We propose a new approach that utilizes semantic information to register 2D monocular video frames to the world using 3D georeferenced data, for augmented reality driving applications. The geo-registration process uses our predicted vehicle pose to generate a rendered depth map for each frame, allowing 3D graphics to be convincingly blended with the real-world view. We also estimate absolute depth values for dynamic objects, up to 120 meters, based on the rendered depth map and update the rendered depth map to reflect scene changes over time. This process also creates opportunistic global heading measurements, which are fused with measurements from other sensors to improve estimates of the 6 degrees-of-freedom global pose of the vehicle over state-of-the-art outdoor augmented reality systems. We evaluate the navigation accuracy and depth map quality of our system on a driving vehicle within various large-scale environments for producing realistic augmentations.
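One common way a per-frame depth map lets graphics blend with the real view is a per-pixel depth test: a virtual pixel is drawn only where it is closer to the camera than the real scene. The sketch below illustrates that general idea under simplifying assumptions (depths in meters, images as nested lists, None marking pixels the virtual object does not cover). It is not the system's rendering pipeline, and the function name and values are hypothetical.

```python
def composite(frame, scene_depth, overlay, overlay_depth):
    """Blend a virtual overlay into a camera frame using a per-pixel depth test.

    frame, overlay: 2-D grids of pixel values (overlay uses None where empty).
    scene_depth, overlay_depth: 2-D grids of distances from the camera in meters.
    """
    output = []
    for f_row, d_row, o_row, od_row in zip(frame, scene_depth, overlay, overlay_depth):
        out_row = []
        for real, real_d, virt, virt_d in zip(f_row, d_row, o_row, od_row):
            # Draw the virtual pixel only if it exists and is in front of the scene.
            visible = virt is not None and virt_d < real_d
            out_row.append(virt if visible else real)
        output.append(out_row)
    return output

# Toy 2x2 example: a car 10 m away occludes a virtual arrow placed at 20 m.
frame = [["road", "car"], ["road", "road"]]
scene_depth = [[50.0, 10.0], [50.0, 50.0]]
overlay = [["arrow", "arrow"], [None, None]]
overlay_depth = [[20.0, 20.0], [float("inf"), float("inf")]]
print(composite(frame, scene_depth, overlay, overlay_depth))
```

In this toy case the arrow appears over the distant road but stays hidden behind the nearer car, which is why updating the depth map for dynamic objects matters for convincing augmentations.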