
CVPR 2021 tutorial on Cross-view and Cross-modal Visual Geo-Localization

Cross-view and cross-modal visual geo-localization has become a new research field to address the problem of image-based geo-localization. We describe state-of-the-art methods utilizing hand-designed feature descriptors, pre-trained CNN-based features, and learning-based approaches.

A new research field to address image-based geo-localization

Image-based geo-localization is the problem of estimating the precise geo-location of a newly captured image by searching and matching this image against a geo-referenced 2D-3D database. Localizing a ground image within a large-scale environment is crucial to many applications, including autonomous vehicles, robotics, and wide-area augmented reality. It typically involves two steps: (1) a coarse search (or geo-tagging) of the 2D input image to find a set of candidate matches from the database, and (2) a fine alignment that performs 2D-3D verification for each candidate and returns the best match with a refined 3D pose. Most works approach this problem by matching the input image to a database collected from similar ground viewpoints and the same sensor modality (camera). Although these methods show good localization performance, their applications are limited by the difficulty of collecting and updating reference ground images that cover a large area from all ground viewpoints and under all weather, time-of-day, and seasonal conditions.
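The two-step pipeline described above can be sketched in a few lines. Everything below is a toy illustration under stated assumptions: the descriptors, database entries, and verification score are placeholders, not outputs of any real geo-localization system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def coarse_search(query, database, k=2):
    """Step 1: rank geo-referenced entries by descriptor similarity,
    returning the top-k candidate matches."""
    ranked = sorted(database, key=lambda e: cosine(query, e["desc"]), reverse=True)
    return ranked[:k]

def fine_alignment(query, candidates, verify):
    """Step 2: run a (placeholder) 2D-3D verification score on each
    candidate and keep the best match."""
    return max(candidates, key=lambda e: verify(query, e))

# Toy geo-referenced database: each entry pairs a geo-location with a descriptor.
db = [
    {"geo": (40.0, -74.0), "desc": [1.0, 0.0, 0.0]},
    {"geo": (41.0, -73.0), "desc": [0.9, 0.1, 0.0]},
    {"geo": (50.0, 10.0),  "desc": [0.0, 1.0, 0.0]},
]
query = [1.0, 0.05, 0.0]
cands = coarse_search(query, db, k=2)
best = fine_alignment(query, cands, verify=lambda q, e: cosine(q, e["desc"]))
print(best["geo"])  # the refined best match's geo-location
```

In a real system the descriptors would come from a learned embedding network and the verification step would estimate and refine a 3D pose; the structure of retrieve-then-verify is the part this sketch preserves.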

Cross-view and cross-modal visual geo-localization has emerged as a new research field to address this problem. The real world can be represented in many data modalities, sensed by disparate devices from different viewpoints. For example, the same scene perceived by a ground camera can also be captured as an RGB image or a set of 3D point clouds from an aerial vehicle or satellite, using LiDAR or motion imagery. Localizing a ground image using an aerial/overhead geo-referenced database has gained noticeable momentum in recent years, due to significant growth in the availability of public aerial/overhead data in multiple modalities (such as aerial images from Google Maps, USGS 2D and 3D data, aerial LiDAR data, and satellite 3D data). Matching a ground image to aerial/overhead data, whose acquisition is simpler and faster, also opens up more opportunities for industrial and consumer applications. However, cross-view and cross-modal visual geo-localization comes with additional technical challenges due to dramatic changes in appearance between the ground image and the aerial database, which capture the same scene from different viewpoints and/or with different sensor modalities.

This tutorial will offer (a) an overview of cross-view and cross-modal visual geo-localization and will stress the related algorithmic aspects, such as (b) ground-to-aerial image matching and (c) image-to-3D coarse search and fine alignment. For each topic, the tutorial will describe state-of-the-art methods utilizing hand-designed feature descriptors, pre-trained CNN-based features, and learning-based approaches (such as deep embeddings, graph-based matching techniques, semantic segmentation, and generative adversarial networks). For practical applications, vision-based geo-localization techniques must work for images taken at different times of day and year, and across different seasons and weather conditions.
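Among the learning-based approaches mentioned above, deep embeddings for ground-to-aerial matching are commonly trained with a triplet objective: pull a ground image's embedding toward its matching aerial view (the positive) and push it away from a non-matching one (the negative). The sketch below shows only the loss arithmetic on toy hand-written vectors; the embeddings, margin value, and example data are assumptions for illustration, not part of any specific published method.

```python
def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: zero once the positive is closer to the
    anchor than the negative by at least the margin."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Toy embeddings standing in for network outputs.
ground = [0.2, 0.9]          # embedding of the ground-view query
aerial_match = [0.25, 0.85]  # embedding of the matching aerial patch
aerial_other = [0.9, 0.1]    # embedding of a non-matching aerial patch

loss = triplet_loss(ground, aerial_match, aerial_other)
```

When the match is already much closer than the non-match (as here), the loss is zero and the triplet contributes no gradient; swapping the positive and negative produces a large loss, which is what drives the embedding spaces of the two views together during training.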

Teddy is responsible for leading research and development of innovative end-to-end vision solutions, from image capture to situational understanding, that translate into real-world applications such as robotics, intelligence extraction, and human-computer interaction.

Rakesh “Teddy” Kumar

Vice President, Information and Computing Sciences and Director of the Center for Vision Technologies

Mubarak’s research interests include video surveillance, visual tracking, human activity recognition, visual analysis of crowded scenes, video registration, and UAV video analysis.

Mubarak Shah

Trustee Chair Professor of Computer Science, founding director of the Center for Research in Computer Vision at UCF

Han-Pang leads a research group developing innovative solutions for real-world applications in navigation, mobile augmented reality, and robotics.

Han-Pang Chiu

Senior Technical Manager of the Center for Vision Technologies

Program agenda

Times and dates to be announced.

Time TBD

Introduction to Visual Geo-Localization across Viewpoints, Time periods & Modalities

Time TBD

Cross-weather-time, long term Geo-Localization: Ground-to-Ground image matching across weather & long-term time changes

Time TBD

Cross-View Geo-Localization: Ground-to-Aerial Image Matching

Time TBD

Cross-Modal Geo-Localization: Image-to-3D Coarse Search & Fine alignment


The tutorial will consist of the four lectures listed above.

Virtual CVPR | June 19-25th, 2021
