CVPR 2023: A Comprehensive Tour and Recent Advancements toward Real-world Visual Geo-Localization

Tutorial description

Image-based geo-localization is the problem of estimating the precise geo-location of a newly captured image by searching and matching this image against a geo-referenced 2D-3D database. Localizing a ground image within a large-scale environment is crucial to many applications, including autonomous vehicles, robotics, and wide-area augmented reality. It typically involves two steps: (1) coarse search (or geo-tagging) of the 2D input image to find a set of candidate matches from the database, and (2) fine alignment, which performs 2D-3D verification for each candidate and returns the best match with a refined 3D pose. Most works approach this problem by matching the input image to a database collected from similar ground viewpoints and with the same sensor modality (camera). Although these methods show good localization performance, their applications are limited by the difficulty of collecting and updating reference ground images that cover a large area from all ground viewpoints and under all weather, time-of-day, and time-of-year conditions.
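
The two-step pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method presented in the tutorial: it assumes every image has already been encoded as a global descriptor vector, uses cosine similarity for the coarse search, and stands in a caller-supplied `verify` function for the 2D-3D verification step. The names `coarse_search`, `fine_alignment`, and `verify` are hypothetical and chosen only for this example.

```python
import numpy as np


def coarse_search(query_desc, db_descs, k=5):
    """Step 1: return indices of the k database descriptors closest to the query."""
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q
    # Sort by descending similarity and keep the top k candidates.
    return np.argsort(-sims)[:k]


def fine_alignment(candidates, verify):
    """Step 2: score each candidate and return the best (index, pose, score).

    `verify` is a hypothetical callback that maps a database index to a
    (pose, score) pair; a real system would run 2D-3D verification here.
    """
    best = None
    for idx in candidates:
        pose, score = verify(int(idx))
        if best is None or score > best[2]:
            best = (int(idx), pose, score)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db_descs = rng.normal(size=(100, 32))  # toy database descriptors
    # Toy query: a slightly perturbed copy of database entry 42.
    query = db_descs[42] + 0.01 * rng.normal(size=32)

    def toy_verify(idx):
        # Placeholder verification: identity pose, score = negative distance.
        distance = float(np.linalg.norm(db_descs[idx] - query))
        return np.eye(4), -distance

    candidates = coarse_search(query, db_descs, k=5)
    index, pose, score = fine_alignment(candidates, toy_verify)
    print("best match:", index, "score:", score)
```

Separating the two steps keeps the expensive verification limited to a handful of candidates, which is the main reason the coarse search comes first.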

Cross-view and cross-modal visual geo-localization has emerged as a new research field to address this problem. The real world can be represented in many data modalities that are sensed by disparate sensing devices from different viewpoints. For example, the same scene perceived from a ground camera can also be captured as an RGB image or a 3D point cloud from an aerial vehicle using LiDAR or motion imagery, or from a satellite. Localizing a ground image using an aerial/overhead geo-referenced database has gained noticeable momentum in recent years, due to significant growth in the availability of public aerial/overhead data in multiple modalities (such as aerial images from Google Maps, USGS 2D and 3D data, aerial LiDAR data, and satellite 3D data). Matching a ground image to aerial/overhead data, whose acquisition is simpler and faster, also opens more opportunities for industrial and consumer applications. However, cross-view and cross-modal visual geo-localization comes with additional technical challenges due to dramatic changes in appearance between the ground image and the aerial database, which capture the same scene from different viewpoints and/or with different sensor modalities.

Same-view/cross-time, cross-view, and cross-modal visual geo-localization are highly related research topics, but they are usually studied separately, and researchers in these areas have few opportunities to connect. In addition, recent publications have made much progress toward more realistic settings and more comprehensive benchmarks, but it can be difficult for new researchers to understand the big picture of this field and the different datasets and settings. This tutorial aims to bridge these topics and provide a comprehensive review of the research problem of visual geo-localization, including same-view/cross-time, cross-view, and cross-modal settings, for both new and experienced researchers. It also provides opportunities for researchers in visual geo-localization and related fields to connect.

Tutorial schedule

The tutorial will be held on Sunday, June 18th, from 8:30 AM to 5:30 PM (local time in Vancouver, Canada) in conference room East #6, Vancouver Convention Center.

Cross-time/Same-view Geo-localization

8:30 – 9:30 am: Introduction of Generic Visual Geo-localization (Han-Pang Chiu, in-person)

This lecture will present the goals of image-based geo-localization and the challenges to be tackled when matching images to geo-referenced databases across viewpoints, weather and long-term time differences, and modalities. It will also describe a typical visual geo-localization system in detail, for applications such as autonomous navigation and wide-area outdoor augmented reality.

9:30 – 10:30 am: Recent Image Geo-localization Benchmark and Large-scale Real-world Scenarios (Carlo Masone/Gabriele Berton, in-person)

This lecture will present an overview of previous works on same-view image geo-localization through a benchmark that helps to understand challenges and best practices. It will then describe the latest state-of-the-art methods, and investigate the open challenges and possible future areas of research.

10:30 – 10:45 am: Coffee break

10:45 – 11:15 am: R2Former: Unified Retrieval and Reranking Transformer for Place Recognition (Sijie Zhu, in-person)

This lecture will present a unified place recognition framework that handles both retrieval and reranking of candidates with a novel transformer model, named R2Former. Unlike RANSAC-based geometry-only approaches, the reranking module takes feature correlation, attention values, and xy coordinates (geometry) into account, and learns to determine whether an image pair is from the same location.

11:15 – 11:45 am: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes (Mubarak Shah, in-person)

This lecture will introduce a novel approach to worldwide visual geo-localization of images, inspired by human experts. It will present an end-to-end transformer-based architecture that exploits the relationship between different geographic levels (referred to as hierarchies) and the corresponding visual scene information in an image through hierarchical cross-attention. Furthermore, the new approach learns a separate representation for each type of environmental scene, since different scenes in the same location are often defined by completely different visual features.

11:45 – 12:30 pm: Lunch break

Cross-view and Cross-modal Geo-localization

12:30 – 1:30 pm: Cross View and Cross-Modal Coarse Search and Fine alignment for Augmented Reality, Navigation and other applications (Rakesh Kumar, in-person)

Matching a ground image to overhead/aerial 2D and 3D reference data, which are simpler to collect over a large area, opens more opportunities and possibilities for visual geo-localization. This lecture will offer an overview of cross-view and cross-modal geo-localization from images and will stress the related algorithmic aspects. We will present techniques for metric estimation of pose and uncertainty by incorporating cross-view matching into real-time navigation systems for applications in augmented reality and robotics.


1:30 – 2:30 pm: Toward Real-world Cross-view Geo-localization (Chen Chen/Sijie Zhu, in-person)

This lecture will introduce cross-view geo-localization from early works to recent new settings. It will cover real-world scenarios where the query and reference are not perfectly aligned in terms of spatial location and orientation. It will introduce visual explanation techniques to gain a better understanding of this task. It will further compare recent transformer models to CNN methods.

2:30 – 3:00 pm: Vision-based Metric Cross-view Geo-localization (Florian Fervers, in-person)

This lecture will introduce the metric cross-view geo-localization (CVGL) task and its potential use for autonomous vehicles. It will cover the progress from early hand-crafted solutions to later end-to-end differentiable models and will discuss the advantages and drawbacks of each type of method.

3:00 – 3:30 pm: Coffee break

3:30 – 4:30 pm: Geometry-based Cross-view Geo-localization and Metric Localization for Vehicle (Yujiao Shi, in-person)

This lecture will introduce different methods for bridging the cross-view domain gap with geometric transformation. It will also describe how to leverage a continuous video for vehicle localization to increase the discriminativeness of the query location.

4:30 – 5:30 pm: Learning Disentangled Geometric Layout Correspondence for Cross-View Geo-localization (Waqas Sultani, virtual)

This lecture will discuss how to explicitly disentangle geometric information from raw features and learn the spatial correlations among visual features from aerial and ground pairs for better cross-view geo-localization. Since panoramic images are less readily available than videos of limited Field-of-View (FOV) images, we will also discuss cross-view geo-localization for a sequence of limited-FOV images.


Rakesh “Teddy” Kumar

SRI International, Menlo Park, CA, USA

Dr. Rakesh “Teddy” Kumar is Vice President, Information and Computing Sciences and Director of the Center for Vision Technologies at SRI International. In this role, he is responsible for leading research and development of innovative end-to-end vision solutions, from image capture to situational understanding, that translate into real-world applications such as robotics, intelligence extraction, and human-computer interaction. He has received the Outstanding Achievement in Technology Development award from his alma mater, University of Massachusetts Amherst, the Sarnoff President's Award, and Sarnoff Technical Achievement awards for his work in registration of multi-sensor, multi-dimensional medical images and alignment of video to three-dimensional scene models. The paper “Stable Vision-Aided Navigation for Large-Area Augmented Reality”, which he co-authored, received the best paper award at the IEEE Virtual Reality 2011 conference. The paper “Augmented Reality Binoculars”, which he co-authored, received the best paper award at the IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2013 conference. Kumar has served on NSF review and DARPA ISAT panels. He has also been an associate editor for IEEE Transactions on Pattern Analysis and Machine Intelligence. He has co-authored more than 60 research publications and received more than 50 patents. A number of spin-off companies have been created based on the research done at the Center for Vision Technologies. Kumar received his Ph.D. in Computer Science from the University of Massachusetts at Amherst in 1992. His M.S. in Electrical and Computer Engineering is from State University of New York at Buffalo in 1985, and his B.Tech in Electrical Engineering is from Indian Institute of Technology, Kanpur, India in 1983.

Chen Chen

University of Central Florida, Orlando, FL, USA

Dr. Chen Chen is an Assistant Professor at the Center for Research in Computer Vision, University of Central Florida. He received his Ph.D. degree from the Department of Electrical Engineering, University of Texas at Dallas in 2016, where he received the David Daniel Fellowship (Best Doctoral Dissertation Award). His research interests include computer vision, efficient deep learning, and federated learning. He has been actively involved in several NSF-sponsored research projects, focusing on ubiquitous machine vision on the edge and federated learning over-the-air for large-scale camera networks. Dr. Chen is an Area Chair for CVPR 2022. He was an Area Chair for ACM Multimedia 2019-2021, ICME 2021, and WACV 2019. He was an organizer of the CVPR 2021 tutorial on Cross-view and Cross-modal Visual Geo-Localization. He was the lead organizer of the First Workshop on Federated Learning for Computer Vision (FedVision) in conjunction with CVPR 2022. His paper entitled “Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning” was one of the finalists for the CVPR 2022 Best Paper.

Mubarak Shah

University of Central Florida, Orlando, FL, USA

Dr. Mubarak Shah, Trustee Chair Professor of Computer Science, is the founding director of the Center for Research in Computer Vision at UCF. His research interests include video surveillance, visual tracking, human activity recognition, visual analysis of crowded scenes, video registration, and UAV video analysis. Dr. Shah is a fellow of the National Academy of Inventors, IEEE, AAAS, IAPR and SPIE. In 2006, he was awarded a Pegasus Professor award, the highest award at UCF. He is an ACM Distinguished Speaker. He was an IEEE Distinguished Visitor speaker for 1997-2000 and received the IEEE Outstanding Engineering Educator Award in 1997. He received the Harris Corporation’s Engineering Achievement Award in 1999; the TOKTEN awards from UNDP in 1995, 1997, and 2000; the Teaching Incentive Program award in 1995 and 2003; the Research Incentive Award in 2003 and 2009; Millionaires’ Club awards in 2005 and 2006; the University Distinguished Researcher award in 2007; and an honorable mention for the ICCV 2005 Where Am I? Challenge Problem. He was also nominated for the best paper award at the ACM Multimedia Conference in 2005. He is an editor of an international book series on Video Computing, editor-in-chief of the Machine Vision and Applications journal, and an associate editor of the ACM Computing Surveys journal. He was an associate editor of the IEEE Transactions on PAMI and a guest editor of the special issue of the International Journal of Computer Vision on Video Computing.

Han-Pang Chiu

SRI International, Menlo Park, CA, USA

Dr. Han-Pang Chiu is a Technical Director of the Center for Vision Technologies at SRI International. He leads a research group that develops innovative solutions for real-world applications in navigation, mobile augmented reality, and robotics. He has been chief scientist and technical lead in many DARPA, ONR, and US Army research programs. His research in GPS-denied navigation also supports a few spin-off companies from SRI International. The paper “Stable Vision-Aided Navigation for Large-Area Augmented Reality”, which he co-authored and presented, received the best paper award at the IEEE Virtual Reality 2011 conference. Before he joined SRI International, he was a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT). He also received his Ph.D. in Computer Science from MIT in 2009.

Sijie Zhu

ByteDance Inc., Mountain View, CA, USA, Research Scientist

Dr. Sijie Zhu is a research scientist at ByteDance Inc. He received his Ph.D. degree in 2022 from the Department of Computer Science, University of Central Florida. Prior to that, he obtained his M.S. from the University of Chinese Academy of Sciences in 2018 and his B.S. from the University of Science and Technology of China in 2015. His research interests include visual geo-localization, metric learning, image retrieval, and visual explanation/interpretation. He serves as a reviewer for major computer vision conferences including CVPR, ICCV, ECCV, ICLR, and NeurIPS. He has a series of publications on cross-view image geo-localization.

Invited speakers

Carlo Masone

Politecnico di Torino

Turin, Italy

Dr. Carlo Masone received his B.S. and M.S. degrees in control engineering from the Sapienza University, Rome, Italy, in 2006 and 2010 respectively, and his Ph.D. degree in control engineering from the University of Stuttgart in collaboration with the Max Planck Institute for Biological Cybernetics (MPI-Kyb), Stuttgart, Germany, in 2014. From 2014 to 2017 he was a postdoctoral researcher at MPI-Kyb, within the Autonomous Robotics & Human-Machine Systems group. From 2017 to 2020 he worked in industry on the development of self-driving cars. From 2020 to 2022 he was a senior researcher at the Visual and Multimodal Applied Learning laboratory at Politecnico di Torino. Since 2022 he has been an Assistant Professor at Politecnico di Torino.

Gabriele Berton

Politecnico di Torino

Turin, Italy

Gabriele Berton is a Ph.D. student in Computer Vision at the Politecnico di Torino, as a member of the Visual Learning and Multimodal Applications Laboratory (VANDAL), supervised by Prof. Barbara Caputo. He received the MSc degree in CE from the Polytechnic University of Turin in 2020, and then he worked for a year in deep learning Research and Development at the Italian Institute of Technology. His current research is focused on Visual Geo-localization, Visual Place Recognition and large-scale Image Retrieval. He serves as a reviewer of major computer vision conferences such as CVPR, ICCV and ECCV.

Yujiao Shi 

Australian National University

Canberra, Australia

Dr. Yujiao Shi is a postdoctoral research fellow in the College of Engineering and Computer Science, Australian National University. She received her Ph.D. degree from Australian National University in 2022. Prior to that, she received her BE and MS degrees in automation from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2014 and 2017, respectively. Her research interests include satellite-image-based geo-localization, novel view synthesis, and scene understanding. She has a series of publications on cross-view image geo-localization and image synthesis.

Waqas Sultani

Information Technology University of Punjab

Lahore, Pakistan

Dr. Waqas Sultani is an Assistant Professor in the Computer Science Dept. at the Information Technology University of Punjab. He completed his Ph.D. in Computer Science under Professor Mubarak Shah at the Center for Research in Computer Vision, University of Central Florida. He has worked on several projects related to human action recognition, anomaly detection, crowd tracking, object segmentation, complex event detection, and automatic weakly labeled annotations.

Florian Fervers

Fraunhofer Institute of Optronics

Karlsruhe, Germany

Florian Fervers is a Ph.D. student working at the Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB) in cooperation with the Computer Vision for Human-Computer Interaction Lab (CV:HCI) at Karlsruhe Institute of Technology (KIT), under the supervision of Prof. Rainer Stiefelhagen. He received his M.Sc. and B.Sc. in computer science from KIT in 2020 and 2017, respectively. His Ph.D. focuses on vision-based self-localization for autonomous vehicles by exploiting globally available aerial imagery that is matched against the vehicle’s sensor readings.

Previous years' CVPR tutorial on cross-view and cross-modal localization

Full list of covered publications