We study the important yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint-embedding-based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
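The median rank metric quoted above can be illustrated with a minimal sketch. It assumes, hypothetically, that query i and reference i form the true pair and that matching is done by cosine similarity between embeddings; the function name `median_rank` and this evaluation protocol are illustrative assumptions, not the authors' published evaluation code.

```python
import numpy as np

def median_rank(query_emb, ref_emb):
    """Median rank of the true match, assuming query i pairs with reference i.

    query_emb, ref_emb: float arrays of shape (n, d). Hypothetical inputs,
    e.g. RGB embeddings and depth-image embeddings from a joint embedding model.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = q @ r.T  # shape (n, n): similarity of every query to every reference
    true_sim = np.diag(sim)[:, None]  # similarity to the correct reference
    # Rank = 1 + number of references that score strictly higher than the true one.
    ranks = 1 + (sim > true_sim).sum(axis=1)
    return float(np.median(ranks))
```

Under this convention, a median rank of 5 means that for half of the queries the correct reference appears among the top five retrieved candidates.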
We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets. We report an average (absolute) improvement of 19% over current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing 8–15% and 4% improvement from adding semantic information and our proposed attention module, respectively. Finally, we show the predicted attention maps to offer useful insights into our model.
Accurate motion estimation using low-cost sensors for autonomous robots in visually-degraded environments is critical to applications such as infrastructure inspection and indoor rescue missions. This paper analyzes the feasibility of utilizing multiple low-cost on-board sensors for ground robots or drones navigating in visually-degraded environments. We select four low-cost and small-size sensors for evaluation: IMU, EO stereo cameras with LED lights, active IR cameras, and 2D LiDAR. We adapt and extend state-of-the-art multi-sensor motion estimation techniques, including a factor graph framework for sensor fusion, to poor illumination conditions. We evaluate different sensor combinations using the factor graph framework, and benchmark the accuracy of each combination on two representative datasets acquired in totally dark environments. Our results show the potential of this sensor fusion approach towards an improved ego-motion solution in challenging dark environments.
We propose a new approach that utilizes semantic information to register 2D monocular video frames to the world using 3D georeferenced data, for augmented reality driving applications. The geo-registration process uses our predicted vehicle pose to generate a rendered depth map for each frame, allowing 3D graphics to be convincingly blended with the real world view. We also estimate absolute depth values for dynamic objects, up to 120 meters, based on the rendered depth map and update the rendered depth map to reflect scene changes over time. This process also creates opportunistic global heading measurements, which are fused with other sensors, to improve estimates of the 6 degrees-of-freedom global pose of the vehicle over state-of-the-art outdoor augmented reality systems. We evaluate the navigation accuracy and depth map quality of our system on a driving vehicle within various large-scale environments for producing realistic augmentations.
This paper presents a new approach for integrating semantic information for vision-based vehicle navigation. Although vision-based vehicle navigation systems using pre-mapped visual landmarks are capable of achieving submeter-level accuracy in large-scale urban environments, a typical error source in this type of system is the presence of visual landmarks or features from temporal objects in the environment, such as cars and pedestrians. We propose a gated factor graph framework that uses semantic information associated with visual features to make outlier/inlier decisions from three perspectives: the feature tracking process, the geo-referenced map building process, and the navigation system using pre-mapped landmarks. The class category that a visual feature belongs to is extracted from a pre-trained deep learning network trained for semantic segmentation. The feasibility and generality of our approach are demonstrated by our implementations on top of two vision-based navigation systems. Experimental evaluations validate that the injection of semantic information associated with visual landmarks using our approach achieves substantial improvements in accuracy on GPS-denied navigation solutions for large-scale urban scenarios.
This paper presents a vehicle navigation system that is capable of achieving sub-meter GPS-denied navigation accuracy in large-scale urban environments, using pre-mapped visual landmarks. Our navigation system tightly couples IMU data with local feature track measurements, and fuses each observation of a pre-mapped visual landmark as a single global measurement. This approach propagates precise 3D global pose estimates for longer periods. Our mapping pipeline leverages a dual-layer architecture to construct high-quality pre-mapped visual landmarks in real time. Experimental results demonstrate that our approach provides sub-meter GPS-denied navigation solutions in large-scale urban scenarios.
This paper proposes a novel vision-aided navigation approach that continuously estimates precise 3D absolute pose for aerial vehicles, using only inertial measurements and monocular camera observations. Our approach is able to provide accurate navigation solutions under long-term GPS outage, by tightly incorporating absolute geo-registered information into two kinds of visual measurements: 2D-3D tie-points and geo-registered feature tracks. 2D-3D tie-points are established by finding feature correspondences to align an aerial video frame to a 2D geo-referenced image rendered from the 3D terrain database. These measurements provide global information to correct accumulated error in navigation estimation. Geo-registered feature tracks are generated by associating features across consecutive frames. They enable the propagation of 3D geo-referenced values to further improve the pose estimation. All sensor measurements are fully optimized in a smoother-based inference framework, which achieves efficient relinearization and real-time estimation of navigation states and their covariances over a constant-length sliding window. Experimental results demonstrate that our approach provides accurate and consistent aerial navigation solutions on several large-scale GPS-denied scenarios.
Our goal is to circumvent one of the roadblocks of using existing bundle adjustment algorithms for achieving satisfactory large-area structure from motion over long video sequences, namely, the need for sufficient visual features tracked across consecutive frames. We accomplish this by using a novel “virtual insertion” scheme, which constructs virtual points and virtual frames to adapt to visual landmark link outages, or “visual breaks”, which occur when no common features are observed from neighboring camera views in challenging environments. We show how to insert virtual point correspondences at each break position and its neighboring frames, by transforming initial motion estimates from non-vision sensors into 3D-to-2D projection constraints of virtual scene landmarks. We also show how to add virtual frames to bridge the gap of non-overlapping fields of view (FOV) across sequential frames. Experiments are conducted on several challenging real-world video sequences, collected by multi-sensor-based visual odometry systems. We demonstrate that our proposed scheme significantly improves bundle adjustment performance in both drift correction and reconstruction accuracy.
Bundle Adjustment (BA) can be seen as an inference process over a factor graph. From this perspective, the Schur complement trick can be interpreted as an ordering choice for elimination. The elimination of a single point in the BA graph induces a factor over the set of cameras observing that point. This factor has a very low information content (a point observation enforces a low-rank constraint on the cameras). In this work we show that, when using conjugate gradient solvers, there is a computational advantage in “grouping” factors corresponding to sets of points (fragments) that are co-visible by the same set of cameras. Intuitively, we collapse many factors with low information content into a single factor that imposes a high-rank constraint among the cameras. We provide a grounded way to group factors: the selection of points that are co-observed by the same camera patterns is a data mining problem, and standard tools for frequent pattern mining can be applied to reveal the structure of BA graphs. We demonstrate the computational advantage of grouping in large BA problems and we show that it enables a consistent reduction of BA time with respect to state-of-the-art solvers (Ceres).
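The grouping idea above can be sketched in a simplified form. The snippet below groups points by the exact set of cameras that observe them, which is a deliberately reduced stand-in for the frequent pattern mining the abstract mentions; the function name `group_points_by_cameras` and the `(point_id, camera_id)` input format are assumptions made for illustration.

```python
from collections import defaultdict

def group_points_by_cameras(observations):
    """Group point ids into fragments that share the same observing cameras.

    observations: iterable of (point_id, camera_id) pairs (hypothetical format).
    Returns a dict mapping frozenset(camera ids) -> sorted list of point ids.
    Note: this uses exact set equality, not frequent pattern mining.
    """
    # Collect the set of cameras that see each point.
    cams_per_point = defaultdict(set)
    for point_id, camera_id in observations:
        cams_per_point[point_id].add(camera_id)

    # Points with an identical camera set fall into the same fragment.
    fragments = defaultdict(list)
    for point_id, cams in cams_per_point.items():
        fragments[frozenset(cams)].append(point_id)
    return {cams: sorted(points) for cams, points in fragments.items()}
```

Each resulting fragment corresponds to one candidate grouped factor over its camera set. A frequent pattern miner would additionally merge points whose camera sets merely share a common subset.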
This paper proposes a real-time navigation approach that is able to integrate many sensor types while fulfilling performance needs and system constraints. Our approach uses a plug-and-play factor graph framework, which extends factor graph formulation to encode sensor measurements with different frequencies, latencies, and noise distributions. It provides a flexible foundation for plug-and-play sensing, and can incorporate new evolving sensors. A novel constrained optimal selection mechanism is presented to identify the optimal subset of active sensors to use, during initialization and when any sensor condition changes. This mechanism constructs candidate subsets of sensors based on heuristic rules and a ternary tree expansion algorithm. It quickly decides the optimal subset among candidates by maximizing observability coverage on state variables, while satisfying resource constraints and accuracy demands. Experimental results demonstrate that our approach selects subsets of sensors to provide satisfactory navigation solutions under various conditions, on large-scale real data sets using many sensors.
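The selection objective described above (maximize observability coverage subject to resource constraints) can be illustrated with a toy sketch. It substitutes a brute-force search over subsets for the heuristic rules and ternary tree expansion named in the abstract, and it models coverage simply as the number of distinct state variables observed. The function `select_sensors`, the cost model, and the sensor table are all hypothetical.

```python
from itertools import combinations

def select_sensors(sensors, budget):
    """Pick the sensor subset that covers the most state variables within budget.

    sensors: dict name -> (cost, set of state variables the sensor observes).
    budget: maximum total cost allowed (e.g. power or compute; an assumption).
    Returns a tuple of sensor names; ties are broken in favor of lower cost.
    """
    names = sorted(sensors)
    best, best_key = (), (0, 0.0)
    # Brute-force enumeration; only practical for a small number of sensors.
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            cost = sum(sensors[n][0] for n in subset)
            if cost > budget:
                continue  # violates the resource constraint
            covered = set().union(*(sensors[n][1] for n in subset))
            key = (len(covered), -cost)  # more coverage first, then cheaper
            if key > best_key:
                best, best_key = subset, key
    return best
```

For example, with an IMU (cost 1.0) observing orientation and velocity, a GPS (cost 2.0) observing position, and a camera (cost 3.0) observing position and orientation, a budget of 3.0 selects the GPS and IMU together, since that pair covers all three variables within the budget.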
This paper proposes a navigation algorithm that provides a low-latency solution while estimating the full nonlinear navigation state. Our approach uses Sliding-Window Factor Graphs, which extend existing incremental smoothing methods to operate on the subset of measurements and states that exist inside a sliding time window. We split the estimation into a fast short-term smoother, a slower but fully global smoother, and a shared map of 3D landmarks. A novel three-stage visual feature model is presented that takes advantage of both smoothers to optimize the 3D landmark map, while minimizing the computation required for processing tracked features in the short-term smoother. This three-stage model is formulated based on the maturity of the estimation of the 3D location of the underlying landmark in the map. Long-range associations are used as global measurements from matured landmarks in the short-term smoother and as loop closure constraints in the long-term smoother. Experimental results demonstrate that our approach provides highly accurate solutions on large-scale real data sets using multiple sensors in GPS-denied settings.
There is a need within the military to enhance its training capability to provide more realistic and timely training, but without incurring excessive costs in time and infrastructure. This is especially true in preparing for urban combat. Unfortunately, the creation of facility-based training centers that provide sufficient realism is time-consuming and costly. Many supporting actors are needed to provide opponent forces and civilians. Elaborate infrastructure is needed to create a range of training scenarios, and to record and review training sessions. In this paper we describe the technical methods and experimental results for building an Augmented Reality Training system for dismounts performing maneuver operations, which addresses the above shortcomings. The augmented reality system uses computer graphics and special head-mounted displays to insert virtual actors and objects into the scene as viewed by each trainee wearing augmented reality eyewear. The virtual actors respond in realistic ways to the actions of the Warfighters, taking cover, firing back, or milling as crowds.
Perhaps most importantly, the system is designed to be infrastructure-free. The primary hardware needed to implement augmented reality is worn by the individual trainees. The system worn by a trainee includes helmet-mounted sensors, see-through eyewear, and a compact computer in his backpack. The augmented reality system tracks the actions, locations, and head and weapon poses of each trainee in detail so the system can appropriately position virtual objects in his field of view. Synthetic actors, objects, and effects are rendered by a game engine on the eyewear display. Stereo-based 3D reasoning is used to occlude all or parts of synthetic entities obscured by real-world three-dimensional structures, based on the location of the synthetic entity. We present implementation details for each of the modules and experimental results for both daytime and nighttime operations.