2D-3D reasoning and augmented reality

SRI has a strong portfolio in 2D-3D reasoning, including navigation and mapping with 2D and 3D sensors such as video and LIDAR.

In recent years, machine learning has significantly improved the semantic understanding of 2D and 3D data. Incorporating semantics enables a new class of algorithms for navigation, Simultaneous Localization and Mapping (SLAM), geo-registration, wide-area search, augmented reality, data compression, 3D modeling, and surveillance.

Semantic and GPS-denied navigation

CVT has developed highly efficient low-drift localization and mapping methods that exploit visual and inertial sensors. SRI has supported a large portfolio of programs and spin-offs using this technology. CVT has also incorporated high-level learning-based semantic information (recognition of objects and scene layouts) into dynamic maps and scene graphs, improving accuracy, efficiency, and robustness in our state-of-the-art navigation systems.

Figure: Map with color-coded route overlays

Changes in lighting and weather significantly affect vision algorithms. CVT has developed a novel deep-embedding approach that projects image data into a high-dimensional feature space with geo-spatial coherence, learning features that are invariant to weather and time of day. Using this approach, CVT processed two million images from thousands of webcams worldwide to learn how a scene changes over time (i.e., across day, night, and seasonal changes). The learned embeddings incorporate scene semantics for contextual reasoning, which enables highly reliable image retrieval across extremely large reference image databases.
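To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor lookup in an embedding space. It assumes every image has already been mapped to a feature vector by some learned model. The three-dimensional vectors, the image names, and the `retrieve` helper are hypothetical and only illustrate the idea; they are not CVT's system.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, database, top_k=1):
    """Return the names of the top_k reference images most similar to query."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical reference embeddings (real ones would have hundreds of dimensions).
reference_db = {
    "plaza_noon": [0.9, 0.1, 0.2],
    "harbor_dusk": [0.1, 0.8, 0.3],
    "bridge_snow": [0.2, 0.2, 0.9],
}

# Hypothetical embedding of the same plaza photographed at night. If the
# embedding is invariant to lighting, it should land close to "plaza_noon".
query_embedding = [0.85, 0.15, 0.25]
best_match = retrieve(query_embedding, reference_db)
```

An exhaustive scan like this one is only practical for small databases; a very large reference set would normally call for an approximate nearest-neighbor index instead.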

Geo-registration is the process of matching video to previously collected geo-referenced data sources such as satellite imagery or LIDAR. CVT has worked across multiple government programs to perform high-precision geo-registration with and without GPS for aerial and ground platforms. CVT has also leveraged recent advances in machine learning to extract semantic features that can be matched across large viewpoint variations and changes in sensing modalities.

Long-range, wide-area, augmented reality

CVT has combined the localization and geo-registration methods described above with low-power, compact, ruggedized hardware to create wide-area augmented reality applications. CVT has extended its augmented reality capabilities to work over multiple square kilometers in GPS-challenged environments. These capabilities also include long-range 3D occlusion reasoning for augmented reality applications.

3D scene classification and modeling

CVT has developed extremely robust 3D scene classification methods over the last decade. These methods have now transitioned to Department of Defense (DoD) programs of record and commercially available software packages. Working with the Office of Naval Research (ONR), the U.S. Army, and the National Geospatial-Intelligence Agency (NGA), CVT is now developing next-generation 3D scene-understanding methods using machine learning. These methods incorporate top-down and bottom-up contextual reasoning and human-specified geographic rules within the learning process.


These robust scene-understanding methods have enabled us to revisit the 3D compression methods that are widely available today. By incorporating knowledge about different feature classes (such as ground, building, and foliage), we can achieve significantly better bit rates in the compression of 3D data.
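One way class knowledge can lower the bit rate is by quantizing each class at a different precision. The sketch below assumes points arrive already labeled and that coarse steps are acceptable for ground and foliage while buildings need fine detail. The step sizes and function names are hypothetical, and this is an illustration of the general idea rather than the method CVT uses.

```python
import math

# Hypothetical quantization step per class, in meters.
STEP_BY_CLASS = {"ground": 0.50, "building": 0.05, "foliage": 0.25}

def quantize(points):
    """Snap each (x, y, z, label) point to an integer grid sized for its class."""
    quantized = []
    for x, y, z, label in points:
        step = STEP_BY_CLASS[label]
        quantized.append((round(x / step), round(y / step), round(z / step), label))
    return quantized

def bits_per_coordinate(extent_m, step_m):
    """Bits needed to index every grid cell along one axis of the given extent."""
    return math.ceil(math.log2(extent_m / step_m + 1))

# Over a 100 m extent, compare one fine step for all points against class-aware steps.
uniform_bits = bits_per_coordinate(100.0, STEP_BY_CLASS["building"])
ground_bits = bits_per_coordinate(100.0, STEP_BY_CLASS["ground"])
foliage_bits = bits_per_coordinate(100.0, STEP_BY_CLASS["foliage"])
```

Under these assumed step sizes, ground and foliage points need fewer bits per coordinate than the uniform fine grid, so a scene dominated by those classes compresses to a smaller size.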


CVT’s work in change detection supports the detection of deployed improvised explosive devices (IEDs). These algorithms compare multiple passes of video data over the same route to detect the change signatures of buried roadside IEDs. The recent integration of machine learning-based road-detection methods has significantly improved our change detection performance. SRI is developing novel anomaly detection and anomaly-guided change detection methods for next-generation systems. Specifically, CVT is developing a transformer-based joint spatiotemporal model encompassing multiple space and time resolutions. Transformer networks retain the properties of the various data modalities (geography, weather, seasonal variations, and knowledge of typical events and activities of interest), providing modularity that enables multimodality, scalability, and explainability.
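The role of road detection in multi-pass change detection can be sketched with a simple pixel-difference test. The example assumes two passes that are already co-registered and a binary road mask produced by some road detector. The tiny grids, the threshold, and `detect_changes` are hypothetical stand-ins, not the deployed algorithm.

```python
def detect_changes(pass_a, pass_b, road_mask, threshold=30):
    """Return (row, col) cells on the road whose intensity changed by more than threshold."""
    changes = []
    for r, (row_a, row_b, row_m) in enumerate(zip(pass_a, pass_b, road_mask)):
        for c, (a, b, on_road) in enumerate(zip(row_a, row_b, row_m)):
            if on_road and abs(a - b) > threshold:
                changes.append((r, c))
    return changes

# Hypothetical grayscale intensities from two passes over the same area.
earlier_pass = [[100, 100, 100],
                [100, 100, 100]]
later_pass = [[100, 160, 100],
              [170, 100, 100]]

# 1 marks road cells; the left column is off-road and is ignored.
road_mask = [[0, 1, 1],
             [0, 1, 1]]

changed_cells = detect_changes(earlier_pass, later_pass, road_mask)
```

Restricting the comparison to the road mask discards the off-road change at row 1, column 0, which is how a road detector can cut false alarms in this simplified setting.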

CVT has developed an end-to-end pipeline that fuses multi-modal data in a deep embedding space for specific tasks, such as target detection and recognition, by directly optimizing target metrics and learning the optimal contribution of each modality to the results. This pipeline has been applied to different modalities, including electro-optic/infrared images, hyperspectral imaging, and LIDAR/RADAR data. CVT has also incorporated scene-contextual information to further improve performance on the target task.
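A common way to express a learned per-modality contribution is a softmax-weighted sum of the modality embeddings. The sketch below assumes each modality has already been encoded into a vector of the same length and that the logits would be learned jointly with the task loss. Here they are fixed by hand, and all names and values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings, logits):
    """Weighted sum of per-modality embeddings; returns (fused_vector, weights)."""
    names = list(embeddings)
    weights = softmax([logits[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for weight, name in zip(weights, names):
        for i, value in enumerate(embeddings[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))

# Hypothetical two-dimensional embeddings for two modalities.
modal_embeddings = {"eo": [1.0, 0.0], "lidar": [0.0, 1.0]}

# Equal logits give equal contributions; training would adjust these values.
fused_vector, modality_weights = fuse(modal_embeddings, {"eo": 0.0, "lidar": 0.0})
```

Because the weights come from a softmax, they stay positive and sum to 1, so they can be read directly as each modality's share of the fused representation.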

CVT created a new approach to continually acquire, fine-tune, and transfer knowledge to optimize tasks such as target classification. Our approach advances state-of-the-art transfer learning and continual learning methods to create an in-situ algorithm-training environment that streamlines the training of classifiers on new, previously unseen sensor data in real time.

Our work

  • A new augmented reality system delivers a smoother, more immersive experience

    By combining ground and aerial views with computer-generated elements, users on the ground view a more accurate augmented reality experience.

  • A modern approach to building inspections

    Using augmented reality and mobile technology to reduce construction overhead.

  • 75 Years of Innovation: augmented reality binoculars

    The first mobile, precision, jitter-free augmented reality binoculars.

Featured publications