Problem statement: Visual Place Recognition (VPR) is the problem of recognizing one’s location given a database of images of previously visited places. VPR is closely related to image retrieval, i.e., retrieving relevant images from a large database, but with the additional context that the images depict places. VPR poses many challenges, including perceptual aliasing, where two distinct places look strikingly similar. VPR supports a wide range of downstream tasks, including localization and navigation in robotics, and intelligent augmentation.
Many feature description and feature matching techniques have been proposed for VPR. VPR methods can be divided into those that describe an image using a single feature vector (global feature descriptors) and those that describe local regions of the image (local feature descriptors). Global feature descriptors can be compared using Euclidean or cosine distance in a nearest-neighbor search, while local feature descriptors are cross-matched, with the matches verified using, for example, RANSAC. Alternatively, methods can be divided into those that are viewpoint invariant (NetVLAD being the most popular) and those that assume a similar viewpoint at query and reference time (such as HybridNet and patch-normalized images).
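As a minimal sketch of the global-descriptor case, the nearest-neighbor search with cosine distance can be written as follows. The descriptors are made-up three-dimensional toy vectors, and the function names `cosine_distance` and `nearest_place` are hypothetical; real descriptors such as those from NetVLAD would be far higher-dimensional.

```python
import math

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: a zero vector matches nothing
    return 1.0 - dot / (norm_a * norm_b)

def nearest_place(query, references):
    """Return (index, distance) of the reference descriptor closest to query."""
    distances = [cosine_distance(query, ref) for ref in references]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

# Toy descriptors (invented for illustration): three reference places.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
query = [0.5, 0.9, 0.1]
best_index, best_distance = nearest_place(query, references)
```

Here the query is matched to the third reference place, whose descriptor points in nearly the same direction. Swapping in Euclidean distance only requires replacing `cosine_distance`.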
Approach: Our first aim is to evaluate the suitability of these methods for spike-based encoding. In the context of nearest-neighbor search on Intel’s Pohoiki Springs, it was suggested to sparsely encode images using a PCA-ICA combination. HybridNet and NetVLAD can provide low-dimensional encodings of images using PCA, while patch-normalized images are already of sufficiently low dimension; we have recently shown successful robot navigation with a patch-normalized image of size 23x8x1. We will use different coding schemes, including rate coding, time-to-first-spike (TTFS) coding, phase coding and burst coding, and evaluate the accuracy of the resulting methods in terms of precision and recall. This will reveal which existing VPR methods perform well when images are compressed to a very low-dimensional space, and how compatible they are with the different spike coding mechanisms.
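To make two of these coding schemes concrete, the sketch below encodes a single intensity value as a spike train using rate coding and TTFS coding. It assumes intensities normalized to [0, 1], a Bernoulli spike generator for rate coding, and a linear intensity-to-latency mapping for TTFS coding; these are common choices rather than the ones this project necessarily adopts, and the function names are hypothetical.

```python
import random

def rate_code(intensity, n_steps, rng):
    """Bernoulli rate coding: spike at each step with probability `intensity`."""
    return [1 if rng.random() < intensity else 0 for _ in range(n_steps)]

def ttfs_code(intensity, n_steps):
    """TTFS coding: brighter inputs spike earlier; zero intensity never spikes."""
    train = [0] * n_steps
    if intensity > 0.0:
        # Linear latency mapping (an assumption): 1.0 -> step 0, near 0 -> last step.
        spike_time = round((1.0 - intensity) * (n_steps - 1))
        train[spike_time] = 1
    return train

rng = random.Random(0)     # fixed seed so the sketch is repeatable
pixels = [0.0, 0.25, 1.0]  # toy normalized intensities
rate_trains = [rate_code(p, 100, rng) for p in pixels]
ttfs_trains = [ttfs_code(p, 10) for p in pixels]
```

Rate coding spreads information over many spikes and time steps, whereas TTFS coding emits at most one spike per input, which is why the two can differ substantially in latency and spike count for the same image.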
We recently proposed Brisbane-Event-VPR, a dataset for VPR captured with an event camera, which allows spikes to be produced directly from the event stream, in addition to converting conventional images as above. We will also experiment with short spatio-temporal sequences, where the evolution of spikes over a brief time window is used to accumulate evidence using probabilistic inference.
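One possible form of such evidence accumulation is a discrete Bayes filter over the places along a route, sketched below. This is an illustration under stated assumptions, not the method proposed here: the motion model (advance one place with probability `p_move`, otherwise stay), the function name `bayes_update`, and the per-frame likelihood values are all invented for the example.

```python
def bayes_update(belief, likelihood, p_move=0.8):
    """One discrete Bayes filter step over N places along a route."""
    n = len(belief)
    # Predict: with probability p_move advance one place, otherwise stay.
    # The last place absorbs any motion past the end of the route.
    predicted = [0.0] * n
    for i, b in enumerate(belief):
        predicted[i] += (1.0 - p_move) * b
        predicted[min(i + 1, n - 1)] += p_move * b
    # Correct: weight by the observation likelihood and renormalize.
    posterior = [p * l for p, l in zip(predicted, likelihood)]
    total = sum(posterior)
    if total == 0.0:
        return predicted  # observation rules out everything; keep the prediction
    return [p / total for p in posterior]

# Hypothetical per-frame match likelihoods for four places.
observations = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.1, 0.1, 0.4, 0.4],  # last frame alone cannot separate places 2 and 3
]
belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior
for likelihood in observations:
    belief = bayes_update(belief, likelihood)
best_place = max(range(len(belief)), key=belief.__getitem__)
```

In this toy run the final frame is equally consistent with the last two places, but the accumulated belief favors the last one because the earlier frames indicate forward motion along the route. This is the kind of disambiguation that short sequences can offer against perceptual aliasing.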