Geotagging of Objects

Neural Networks based Solutions for Geotagging of Objects

Rozenn Dahyot

IEEE WISP webinar

2025-04-08

Introduction

Some (online data) context:

  • 2001: Google Earth (satellite & aerial imagery)
  • 2004: OpenStreetMap (GIS, crowdsourcing)
  • 2005: Google Maps (GIS)
  • 2006: Twitter (geolocated multimedia posts)
  • 2007: Google Street View (road scenes)
  • 2013: Mapillary (road scenes, crowdsourcing)

Introduction

My research: imagery & GIS

  • 1998-2001: PhD (road scene video analysis)
  • 2014-2018: GraiSearch, FP7 (social media)
  • 2016-2020: Bonseyes, H2020 (A.I., CNNs)
  • 2017: Eir (telegraph poles in ROI, road scene imagery from GSV)
  • 2018: OSI (aerial imagery, rooftops)
  • 2019-present: aimapit.com (collaborator)

Introduction

Motivations:

  • Infrastructure maintenance
  • Infrastructure compliance (safety)
  • Planning (e.g. deployment of public EV chargers)
  • Autonomous robotics
  • Biodiversity monitoring
  • etc.

Focus of this talk:

  • Geotagging of static objects, e.g. traffic signs, poles, trees (but \(\neq\) moving objects such as cars or pedestrians)

Early works

Detection of Changing Objects in Camera-in-Motion Video

Early works

\(\Rightarrow\) low dimensional feature engineering (straight edges)

\(\Rightarrow\) pixel positions \(\color{green}{(x_i,y_i)}\) used in feature computation (\(\cong\) positional encoding)

\(\Rightarrow\) P.d.f. modelled with histograms or KDE.

Shape descriptors at pixel \(i\) with position \((x_i,y_i)\) in image \(I\) with derivatives \(I_x\) and \(I_y\): \[ \left\lbrace \begin{array}{l} \|\nabla I_i\|=\sqrt{I^2_x(x_i,y_i)+I^2_y(x_i,y_i)}\\ \theta_i=\arctan\left( \frac{I_y(x_i,y_i)}{I_x(x_i,y_i)}\right)\\ \rho_i= \color{green}{x_i} \cdot \frac{I_x(x_i,y_i)}{\|\nabla I_i\|} + \color{green}{y_i} \cdot \frac{I_y(x_i,y_i)}{\|\nabla I_i\|} \end{array} \right. \] Hough Transform estimate: \[ \hat{p}(\theta,\rho)=\frac{1}{N}\sum_{i=1}^N \frac{1}{h_{\theta_i}} k_{\theta}\left(\frac{\theta-\theta_i}{h_{\theta_i}}\right)\ R_i(\theta,\rho) \]
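As a concrete illustration, here is a minimal NumPy sketch of these per-pixel descriptors (the finite-difference derivative filter and `arctan2` are illustrative choices, not necessarily those of the original work):

```python
import numpy as np

def shape_descriptors(I, eps=1e-8):
    """Per-pixel shape descriptors (||grad I||, theta, rho) as defined above.
    Assumes a grayscale image I as a 2D float array; finite differences
    stand in for whatever derivative filter the original work used."""
    Iy, Ix = np.gradient(I.astype(float))             # image derivatives I_y, I_x
    mag = np.sqrt(Ix ** 2 + Iy ** 2)                  # gradient magnitude ||grad I||
    theta = np.arctan2(Iy, Ix)                        # edge orientation theta_i (arctan2 keeps full angle range)
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]     # pixel positions (x_i, y_i)
    rho = (xs * Ix + ys * Iy) / (mag + eps)           # signed distance rho_i of the local line
    return mag, theta, rho
```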

Early works

Robust Object Recognition

Early works

Robust Object Detection

Extension to object detection:

  • several robust scores (based on M-estimators) computed with a sliding window
  • Bayesian probability interpretation of these robust scores
  • the method is robust to partial occlusion and cluttered backgrounds
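To make the idea concrete, a hedged sketch of one such robust score (the Cauchy M-estimator and pixelwise residuals are illustrative choices, not the exact appearance model of the original papers):

```python
import numpy as np

def robust_score(window, template, sigma=0.1):
    """Robust matching score between an image window and a template:
    a Cauchy M-estimator bounds the influence of outlier pixels, so
    partial occlusion or background clutter does not dominate the score."""
    r = (np.asarray(window, float) - np.asarray(template, float)).ravel()
    return np.mean(np.log1p((r / sigma) ** 2))   # lower = better match

# Sliding-window detection: evaluate the score at every window position
# and keep local minima below a threshold as detections.
```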

Early works

Viz Unreal engine:

  • NN-based sentiment analysis of tweet text (coloured lights: blue = sad, yellow = happy)
  • tweet image GPS position and orientation inferred by feature matching against GSV imagery
  • 3D city construction using OSM and GSV data.

Geotagging of Objects

Geotagging of Objects

Geotagging of Objects

The MRF energy to minimize is formulated as: \[ \mathcal{U}(\mathbf{z}=\lbrace z_1,\cdots,z_{N_{\mathcal{Z}}}\rbrace,\lbrace\alpha_j\rbrace)=\sum_{i=1}^{N_{\mathcal{Z}}} \sum_{j} \alpha_j \ \underbrace{u_j(z_i) }_{\text{energy term } j} \]

  • Each detection in an image corresponds to a ray whose origin is the camera GPS position and whose direction is derived from the detection’s pixel location and the camera orientation.
  • Any pair of intersecting rays (from a pair of images) defines a site (intersection) \(i\) as a potential candidate for object geolocation (see the sketch after this list). \(N_{\mathcal{Z}}\) is the total number of candidate sites extracted from a collection of images.
  • \(z_i\) at site \(i\) is a binary variable:
    • \(z_i=0\) : object is absent
    • \(z_i=1\) : object is present
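A minimal sketch of the candidate-site construction, assuming 2D rays in a local metric frame (the bearing convention and degeneracy threshold are assumptions):

```python
import numpy as np

def ray_intersection(p1, b1, p2, b2):
    """Intersect two 2D rays: each has its origin at a camera GPS position
    (projected to local east/north metres) and a bearing (radians, clockwise
    from north) derived from the detection's pixel column and camera heading.
    Returns the candidate site, or None if the rays do not meet in front of
    both cameras."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d1 = np.array([np.sin(b1), np.cos(b1)])
    d2 = np.array([np.sin(b2), np.cos(b2)])
    A = np.column_stack([d1, -d2])                 # solve p1 + t*d1 = p2 + s*d2
    if abs(np.linalg.det(A)) < 1e-9:
        return None                                # (near-)parallel rays
    t, s = np.linalg.solve(A, p2 - p1)
    if t <= 0 or s <= 0:
        return None                                # intersection behind a camera
    return p1 + t * d1                             # candidate object location
```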

Geotagging of Objects

Energy terms \(u_j(\cdot)\):

  • One that enforces consistency with the depth estimation.

  • A pairwise energy term, depending on the state \(z_i\) and those of its neighbours \(\lbrace z_k \rbrace\), is introduced to encourage a single detection when multiple candidate sites are clustered together.

  • An energy term that penalizes rays with no positive intersection (these correspond to false positives, or to objects seen from a single camera position).

MRF optimization:

  • Energy minimization is computed with the iterated conditional modes (ICM) algorithm
  • The initial state for ICM is set to \(z_i=0,\forall i\) (all sites are empty); see the sketch below
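A hedged sketch of the ICM loop (`energy_delta` is a hypothetical helper standing in for the weighted sum of the terms \(u_j\)):

```python
import numpy as np

def icm(n_sites, energy_delta, max_iter=20):
    """Iterated Conditional Modes for the binary MRF above.
    energy_delta(i, z) is assumed to return the change in U from setting
    z_i = 1 rather than z_i = 0, given the current labels z of all sites."""
    z = np.zeros(n_sites, dtype=int)               # initial state: all sites empty
    for _ in range(max_iter):
        changed = False
        for i in range(n_sites):
            best = 1 if energy_delta(i, z) < 0 else 0   # greedy local update
            if best != z[i]:
                z[i], changed = best, True
        if not changed:                            # no single flip lowers the energy
            break
    return z
```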

Geotagging of Objects

Preprocessing:

  • Clustering: ray intersections are only considered when within a maximum distance of 25 meters from the origins of the rays (camera GPS positions)

Postprocessing:

  • Clustering: locations of positive sites found in the same vicinity (radius = 1 meter) are averaged to obtain the final unique geotag per object (see the sketch below)
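A small sketch of this post-processing step, assuming positive sites expressed in a local metric frame (the greedy seed-based grouping is an illustrative choice):

```python
import numpy as np

def merge_sites(points, radius=1.0):
    """Greedily group positive sites within `radius` metres of a seed and
    average each group into a single object geotag."""
    pts = [np.asarray(p, float) for p in points]
    geotags = []
    while pts:
        seed, rest = pts[0], pts[1:]
        near = [p for p in rest if np.linalg.norm(p - seed) <= radius]
        geotags.append(np.mean([seed] + near, axis=0))   # averaged geotag
        pts = [p for p in rest if np.linalg.norm(p - seed) > radius]
    return geotags
```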

Geotagging of Objects

Evaluation:

  • Object Detection

Geotagging of Objects

Evaluation:

  • Absolute GPS positions against ground truth

Geotagging of Objects

Geotagging of Objects

Traffic light geolocation using GSV + Lidar (Laefer et al. 2017):
  • Detection in the Lidar point cloud is performed by template matching with a pole-like object template (at a high false alarm rate).
  • The MRF energy is modified by adding a new term that takes the Lidar candidate locations near each MRF site into account; a hedged sketch follows.
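For illustration only, one plausible form for such an extra unary term (the exponential shape and scale are assumptions, not the published formulation):

```python
import numpy as np

def lidar_term(site, lidar_candidates, scale=2.0):
    """Extra unary energy for z_i = 1: low when a Lidar-detected pole-like
    candidate lies near the MRF site, high otherwise, so that Lidar evidence
    favours switching nearby sites on."""
    d = min(np.linalg.norm(np.asarray(site, float) - np.asarray(c, float))
            for c in lidar_candidates)
    return 1.0 - np.exp(-d / scale)   # ~0 next to a candidate, -> 1 far away
```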

Geotagging of Objects

Evaluation was also performed with Mapillary crowdsourced images (bottom pipeline) instead of GSV (top pipeline, for reference). Additional preprocessing is needed to infer the missing image metadata in Mapillary.

Geotagging of Objects

  • Pre-processing: enhancing the quality of the GSV image metadata using Structure from Motion (estimation of the camera translation T and/or rotation R).
  • Post-processing: the predicted object geolocation is further refined by imposing contextual geographic information extracted from OSM.

| Image metadata correction | Actual | Detected | TP | Precision \(\uparrow\) | Recall \(\uparrow\) | F-measure \(\uparrow\) | Error (m) \(\downarrow\) | Error (m, with OSM) \(\downarrow\) |
|---|---|---|---|---|---|---|---|---|
| None | 76 | 94 | 58 | 0.61 | 0.76 | 0.68 | 2.71 | 2.64 |
| T | 76 | 89 | 57 | 0.64 | 0.75 | 0.69 | 2.79 | 2.74 |
| R and T | 76 | 92 | 54 | 0.57 | 0.72 | 0.64 | 2.53 | 2.48 |
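The metrics in the table follow the usual definitions; a small sanity-check sketch:

```python
def detection_metrics(actual, detected, tp):
    """Precision, recall and F-measure from counts of ground-truth objects,
    detections and true positives."""
    precision = tp / detected
    recall = tp / actual
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Row "None": detection_metrics(76, 94, 58) -> (0.617, 0.763, 0.682)
```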

Geotagging of Objects

Research out of the lab!

Improving DNNs

In a CNN, the filter weights \(\lbrace \alpha_i\rbrace\) are learnt: \[ \text{filter for convolution}=\sum_i \alpha_i \ \mathbf{e}_i \\ \text{with} \ \lbrace \mathbf{e}_i\rbrace \equiv \text{natural basis} \]

We propose the harmonic CNN layer, where the natural basis is replaced by the DCT basis, and that:

  • replaces conventional convolutional layers to produce harmonic versions of existing CNN architectures,
  • can be efficiently compressed by truncating high-frequency components,
  • has been validated extensively for image classification, object detection and semantic segmentation applications (see the sketch below).
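A minimal PyTorch sketch of the idea, assuming a fixed 2D DCT-II filter bank followed by learned 1x1 mixing weights (a simplified illustration, not the released implementation):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HarmonicConv2d(nn.Module):
    """Drop-in replacement for a k x k convolution: the input is filtered
    with a fixed DCT basis and only the mixing weights alpha are learned."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.in_ch, self.k = in_ch, k
        f = torch.arange(k, dtype=torch.float32)
        # 1D DCT-II vectors: dct[u, x] = cos(pi/k * (x + 0.5) * u)
        dct = torch.cos(math.pi / k * (f[None, :] + 0.5) * f[:, None])
        # Outer products give the k*k fixed 2D basis filters, shape (k*k, 1, k, k)
        basis = torch.einsum('ux,vy->uvxy', dct, dct).reshape(k * k, 1, k, k)
        self.register_buffer('basis', basis.repeat(in_ch, 1, 1, 1))
        # Learned alphas: a 1x1 conv mixing the k*k responses per input channel;
        # compression = dropping high-frequency filters (large u + v) here.
        self.alpha = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, x):
        # Depthwise filtering with the DCT basis (each channel kept separate)
        y = F.conv2d(x, self.basis, padding=self.k // 2, groups=self.in_ch)
        return self.alpha(y)
```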

Improving DNNs

Graph matching as post-processing of DNN segmentation results. Example: 3D segmentation of the brain (IBSR dataset):

Summary

We presented a modular pipeline for object geotagging:

  • Each module is optimised individually (not end-to-end).

  • Image segmentation and depth estimation are performed with DNNs.

  • DNN-based modules have been updated over time (i.e. architecture changes).

  • Data quality is very important (e.g. metadata) for geotagging accuracy.

  • To create a training dataset for a new object of interest, we have taken advantage of multiple approaches (e.g. vintage computer vision methods, or AI models such as SAM).

  • MRF provides a flexible formalism to take advantage of multiple sources of information.

Thank you! Any Questions?

Picture left to right: V. Krylov, R. Dahyot, J. Connelly & M. Ulicny (2023)

Many thanks to all my collaborators, past and present!

References

Ahmad, W., and V. A. Krylov. 2024. “Roadside Object Geolocation from Street-Level Images with Reduced Supervision.” In 2024 32nd European Signal Processing Conference (EUSIPCO), 641–45. https://doi.org/10.23919/EUSIPCO63174.2024.10715092.
Barron, J. T. 2019. “A General and Adaptive Robust Loss Function.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4326–34. https://doi.org/10.1109/CVPR.2019.00446.
Bulbul, A., and R. Dahyot. 2017. “Social Media Based 3D Visual Popularity.” Computers & Graphics 63: 28–36. https://doi.org/10.1016/j.cag.2017.01.005.
Chopin, J., J.-B. Fasquel, H. Mouchere, R. Dahyot, and I. Bloch. 2022. “QAP Optimisation with Reinforcement Learning for Faster Graph Matching in Sequential Semantic Image Analysis.” In Pattern Recognition and Artificial Intelligence, edited by Mounîm El Yacoubi, Eric Granger, Pong Chi Yuen, Umapada Pal, and Nicole Vincent, 47–58. Paris, France: Springer International Publishing. https://doi.org/10.1007/978-3-031-09037-0_5.
Chopin, J., J.-B. Fasquel, H. Mouchère, R. Dahyot, and I. Bloch. 2023. “Model-Based Inexact Graph Matching on Top of DNNs for Semantic Scene Understanding.” Computer Vision and Image Understanding, 103744. https://doi.org/10.1016/j.cviu.2023.103744.
Dahyot, R. 2009. “Statistical Hough Transform.” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 8 (August): 1502–9. https://doi.org/10.1109/TPAMI.2008.288.
Dahyot, R., P. Charbonnier, and F. Heitz. 2000. “Robust Visual Recognition of Colour Images.” In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1:685–690 vol.1. https://doi.org/10.1109/CVPR.2000.855886.
———. 2001. “Unsupervised Statistical Detection of Changing Objects in Camera-in-Motion Video.” In Proceedings 2001 International Conference on Image Processing, 1:638–41. https://doi.org/10.1109/ICIP.2001.959126.
———. 2004. “A Bayesian Approach to Object Detection Using Probabilistic Appearance-Based Models.” Pattern Analysis and Applications 7 (3): 317–32. https://doi.org/10.1007/s10044-004-0230-5.
Krylov, V. A., and R. Dahyot. 2018. “Object Geolocation Using MRF Based Multi-Sensor Fusion.” In 2018 25th IEEE International Conference on Image Processing (ICIP), 2745–49. https://doi.org/10.1109/ICIP.2018.8451458.
———. 2019. “Object Geolocation from Crowdsourced Street Level Imagery.” In ECML PKDD 2018 Workshops, edited by Carlos Alzate, Anna Monreale, Haytham Assem, Albert Bifet, Teodora Sandra Buda, Bora Caglayan, Brett Drury, et al., 79–83. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-13453-2_7.
Krylov, V., E. Kenny, and R. Dahyot. 2018. “Automatic Discovery and Geotagging of Objects from Street View Imagery.” Remote Sensing 10 (5): 661. https://doi.org/10.3390/rs10050661.
Laefer, Debra F., Saleh Abuwarda, Anh-Vu Vo, Linh Truong-Hong, and Hamid Gharibi. 2017. “2015 Aerial Laser and Photogrammetry Datasets for Dublin, Ireland’s City Center.” New York University, Center for Urban Science and Progress. https://doi.org/10.17609/N8MQ0N.
Liu, C.-J., M. Ulicny, and R. Dahyot. 2023. “Context Aware Object Geotagging.” U.S. Patent Application 18087227. https://patents.google.com/patent/US20230206402A1/en.
Liu, C.-J., M. Ulicny, M. Manzke, and R. Dahyot. 2021. “Context Aware Object Geotagging.” In Irish Machine Vision and Image Processing (IMVIP 2021). https://doi.org/10.48550/arXiv.2108.06302.
Nassar, Ahmed Samy, Stefano D’Aronco, Sébastien Lefèvre, and Jan D. Wegner. 2020. “GeoGraph: Graph-Based Multi-View Object Detection with Geometric Cues End-to-End.” In Computer Vision – ECCV 2020, edited by Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, 488–504. Cham: Springer International Publishing.
Ulicny, M., V. A. Krylov, and R. Dahyot. 2022. “Harmonic Convolutional Networks Based on Discrete Cosine Transform.” Pattern Recognition 129: 1–12. https://doi.org/10.1016/j.patcog.2022.108707.