

Cross-view and cross-modal visual geo-localization has emerged as a new research field addressing the problem of image-based geo-localization. We describe state-of-the-art methods utilizing hand-designed feature descriptors, pre-trained CNN-based features, and learning-based approaches.

A new research field to address image-based geo-localization

Image-based geo-localization is the problem of estimating the precise geo-location of a newly captured image by searching and matching this image against a geo-referenced 2D-3D database. Localizing a ground image within a large-scale environment is crucial to many applications, including autonomous vehicles, robotics, and wide-area augmented reality. It typically involves two steps: (1) coarse search (or geo-tagging) of the 2D input image to find a set of candidate matches from the database, and (2) fine alignment, which performs 2D-3D verification for each candidate and returns the best match with a refined 3D pose. Most works approach this problem by matching the input image to a database collected from similar ground viewpoints and the same sensor modality (camera). Although these methods show good localization performance, their applications are limited by the difficulty of collecting and updating reference ground images that cover a large area from all ground viewpoints and for all weather, time-of-day, and seasonal conditions.
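The coarse-search step described above can be sketched as a nearest-neighbor lookup over global image descriptors. The sketch below is a minimal, hedged illustration in NumPy: the descriptors are toy vectors invented for the example, and descriptor extraction itself (e.g. from a CNN) is assumed to happen elsewhere.

```python
import numpy as np

def coarse_search(query_desc, db_descs, k=5):
    """Step 1 (coarse search): rank geo-referenced database entries by
    cosine similarity of global image descriptors, returning the top-k
    candidate matches for later 2D-3D verification."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q                 # cosine similarity to each reference
    return np.argsort(-sims)[:k]  # indices of the k best candidates

# Toy usage: 4 reference descriptors; the query is closest to entry 2.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.4], [-1.0, 0.0]])
query = np.array([0.9, 0.35])
candidates = coarse_search(query, db, k=2)
# Step 2 (fine alignment) would then run 2D-3D verification on each
# candidate and return the best match with a refined 3D pose.
```

In practice the database holds millions of descriptors, so the brute-force ranking above is replaced by an approximate nearest-neighbor index, but the two-step structure stays the same.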

Cross-view and cross-modal visual geo-localization has emerged as a new research field to address this problem. The real world can be represented in many data modalities, sensed by disparate devices from different viewpoints. For example, the same scene perceived from a ground camera can be captured as an RGB image, or as a 3D point cloud from an aerial vehicle using LiDAR or motion imagery, or from a satellite. Localizing a ground image against an aerial/overhead geo-referenced database has gained noticeable momentum in recent years, due to significant growth in the availability of public aerial/overhead data in multiple modalities (such as aerial images from Google Maps, USGS 2D and 3D data, aerial LiDAR data, and satellite 3D data). Matching a ground image to aerial/overhead data, whose acquisition is simpler and faster, also opens more opportunities for industrial and consumer applications. However, cross-view and cross-modal visual geo-localization comes with additional technical challenges due to dramatic changes in appearance between the ground image and the aerial database, which capture the same scene from different viewpoints and/or with different sensor modalities.

This tutorial will offer (a) an overview of cross-view and cross-modal visual geo-localization, and will stress the related algorithmic aspects, including (b) ground-to-aerial image matching and (c) image-to-3D coarse search and fine alignment. For each topic, the tutorial will describe state-of-the-art methods utilizing hand-designed feature descriptors, pre-trained CNN-based features, and learning-based approaches (such as deep embeddings, graph-based matching techniques, semantic segmentation, and Generative Adversarial Networks). For practical applications, vision-based geo-localization techniques must work for images taken at different times of the day and year, and across different seasons and weather conditions.

Program agenda

TUTORIALS TO TAKE PLACE ON JUNE 20th, 2021

Cross-view and Cross-modal Visual Geo-Localization

2:00 PM – 2:15 PM USA EST

Visual Localization Cross time CVPR 2021 Tutorial

2:15 PM – 3:30 PM USA EST

Cross-View Geo-Localization: Ground-to-Aerial Image Matching

3:30 PM – 4:15 PM USA EST

Large Scale Cross View Image Geo-localization

4:15 PM – 4:45 PM USA EST

Cross-Modal Geo-Localization: Image-to-3D Coarse Search & Fine Alignment

4:45 PM – 6:00 PM USA EST

Tutorials

The tutorials will take place on June 20th, 2021, and will consist of 5 lectures, as detailed below:

CVPR 2021 tutorial on Cross-view and Cross-modal Visual Geo-Localization

2:00 PM – 2:15 PM USA EST

Abstract: This lecture will provide the general context for this new and emerging topic. It presents the aims of image-based geo-localization, and the challenges and important issues to be tackled when matching images to geo-referenced databases across viewpoints, weather and long-term time differences, and modalities. It will also describe a typical visual geo-localization system in detail, for both industrial and consumer applications such as autonomous navigation and wide-area, outdoor augmented reality.

Speakers: Han-Pang Chiu and Rakesh (Teddy) Kumar

View video
View presentation

Visual Localization Cross time CVPR 2021 Tutorial

2:15 PM – 3:30 PM USA EST

Abstract: This lecture will discuss techniques for long-term visual localization in real-world environments, which must operate under (1) extreme perceptual changes such as weather and illumination (e.g. matching day images to night-time references), and (2) dynamic scene changes and occlusions, such as from moving vehicles and people. The lecture will focus on state-of-the-art matching techniques using deep learning based methods, including semantic scene segmentation, attention networks, triplet loss, and embeddings.
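The triplet loss mentioned in this abstract can be illustrated in a few lines of NumPy. This is a hedged sketch, not the lecture's implementation: the margin value and the toy embeddings below are invented for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull the anchor embedding toward a
    matching (positive) reference and push it away from a non-matching
    (negative) one by at least `margin` in Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: e.g. a day image (anchor), the same place at night
# (positive), and a different place (negative).
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.2])
loss = triplet_loss(a, p, n)  # zero: positive already far closer than negative
```

Training an embedding network with this loss drives images of the same place (across day/night or weather changes) closer together than images of different places, which is what makes nearest-neighbor retrieval robust to appearance change.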

Speaker: Rakesh (Teddy) Kumar

View video
View presentation

Cross-View Geo-Localization: Ground-to-Aerial Image Matching

3:30 PM – 4:15 PM USA EST

Abstract: This lecture covers the essential knowledge of how to estimate the geo-location of a query street-view image against a structured database of city-wide aerial reference images with known GPS coordinates. It will start with a brief review of traditional geo-localization involving DOQ, DEM, and metadata. It will then present computer vision methods employing either traditional hand-crafted features or learned view-invariant descriptors for ground-to-aerial image matching. It will end with recent methods employing deep learning for cross-view image geo-localization, including Generative Adversarial Networks.

Speaker: Mubarak Shah

View video
View presentation

Large Scale Cross View Image Geo-localization

4:15 PM – 4:45 PM USA EST

Abstract: Image-based geo-localization aims to provide image-level GPS location by matching a query street/ground image against GPS-tagged images in a reference dataset. I will present our recent work on cross-view image geo-localization using advanced deep learning algorithms, along with a visual explanation method to interpret the learning outcome for cross-view image matching. I will then present cross-view image geo-localization in a more realistic and challenging setting that breaks the one-to-one retrieval assumption of existing datasets. To bridge the gap between this realistic setting and existing datasets, we propose a new large-scale benchmark, VIGOR (cross-View Image Geo-localization beyond One-to-one Retrieval). We benchmark existing state-of-the-art methods and present a novel end-to-end framework that localizes the query in a coarse-to-fine manner.

Speaker: Chen Chen

View video
View presentation



Rakesh “Teddy” Kumar

Vice President, Information and Computing Sciences and Director of the Center for Vision Technologies

Teddy is responsible for leading research and development of innovative end-to-end vision solutions from image capture to situational understanding that translate into real-world applications such as robotics, intelligence extraction and human computer interaction.

View profile


Mubarak Shah

Trustee Chair Professor of Computer Science, founding director of the Center for Research in Computer Vision at UCF

Mubarak’s research interests include video surveillance, visual tracking, human activity recognition, visual analysis of crowded scenes, video registration, and UAV video analysis.

View profile


Han-Pang Chiu

Senior Technical Manager of the Center for Vision Technologies

Han-Pang leads a research group to develop innovative solutions for real-world applications to navigation, mobile augmented reality and robotics.

View profile


Chen Chen

Department of Electrical and Computer Engineering, University of North Carolina at Charlotte

Dr. Chen Chen’s research interests include signal and image processing, computer vision, and deep learning. He has published over 80 papers in refereed journals and conferences in these areas.

View profile

Virtual CVPR | June 19-25th, 2021

Relevant publications