Paper Walkthrough (14) - GeoCLIP

news2026/2/17 3:23:58

Keep going, keep going!

Original paper:
GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization
(2309.16020 (arxiv.org))

The key point of this paper is that the scope is widened to the entire globe.

Abstract

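The abstract first points out the weakness of current worldwide image geo-localization: the Earth's geography is extremely diverse, so models struggle to generalize. Earlier methods partition the Earth into many regions and then treat localization as a classification problem over those regions.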
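To make the classification-style baseline concrete, here is a minimal sketch that quantizes coordinates into a regular latitude/longitude grid. The fixed 10-degree cell size and the helper names `cell_id` and `cell_center` are hypothetical choices for illustration only; published methods use their own partitioning schemes. The point is simply that a classifier can do no better than return the centre of the predicted cell.

```python
CELL_DEG = 10.0  # hypothetical cell size in degrees (illustration only)
N_COLS = int(360 / CELL_DEG)
N_ROWS = int(180 / CELL_DEG)

def cell_id(lat, lon):
    """Map a (lat, lon) pair in degrees to the index of its grid cell."""
    row = min(int((lat + 90.0) // CELL_DEG), N_ROWS - 1)
    col = min(int((lon + 180.0) // CELL_DEG), N_COLS - 1)
    return row * N_COLS + col

def cell_center(idx):
    """Return the (lat, lon) centre of a cell, the best a classifier can output."""
    row, col = divmod(idx, N_COLS)
    return (-90.0 + (row + 0.5) * CELL_DEG, -180.0 + (col + 0.5) * CELL_DEG)

# Two different cities that fall in the same cell get the same class label.
paris = cell_id(48.8566, 2.3522)
lyon = cell_id(45.7640, 4.8357)
```

Here Paris and Lyon receive the same label, so a classifier over these cells could not tell them apart.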

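GeoCLIP instead works directly with GPS coordinates, which makes localization continuous rather than discretized. It is therefore an Image-to-GPS algorithm.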

GeoCLIP’s location encoder models the Earth as a continuous function by employing positional encoding through random Fourier features and constructing a hierarchical representation that captures information at varying resolutions to yield a semantically rich high-dimensional feature suitable to use even beyond geo-localization.

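(In this passage I do not quite understand what the random Fourier features are for; let's see whether the paper explains it in detail later.)
(In short: GeoCLIP's location encoder applies positional encoding via random Fourier features and builds a hierarchical representation that captures information at different resolutions. The result is a semantically rich, high-dimensional feature that is useful even beyond geo-localization, and the Earth is effectively modeled as a continuous function.)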

1. Intro


The key point is that the model learns a similarity between images and GPS locations.

contribution:

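The first work to tackle worldwide geo-localization as an Image-to-GPS problem
The second point puzzles me a little: the location encoder combines positional encoding with random Fourier features to encode GPS coordinates and mitigate spectral bias in MLPs, and an exponential sigma assignment strategy helps it learn hierarchical features at different resolutions. I will come back to this in the detailed section. The original wording:

Our location encoder incorporates positional encoding with random Fourier features to efficiently encode GPS coordinates and mitigate spectral bias in MLPs. In addition, we use an exponential sigma assignment strategy to facilitate learning hierarchical features at different resolutions.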

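Versatility: the learned features can be used for tasks beyond geo-localization
Supports text queries
Works well in few-shot settings too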

2. Related Works

Global Image Prediction
Learning from GPS Data
Contrastive Learning

3. Proposed Approach

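Goal:

Pinpoint the geographic coordinates (latitude and longitude) of an image as precisely as possible.

As in CLIP, there are two encoders: a Location Encoder (L) and an Image Encoder (V).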
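The two-encoder setup can be sketched as a retrieval step: embed the image, embed a gallery of candidate GPS coordinates, and return the coordinate with the highest cosine similarity. This is only a minimal illustration. The random vectors below stand in for the outputs of V and L, and the function names `l2_normalize` and `retrieve_gps` are hypothetical, not taken from the paper's code.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_gps(image_emb, gallery_embs, gallery_coords):
    """Return the gallery coordinate whose embedding best matches the image."""
    sims = l2_normalize(gallery_embs) @ l2_normalize(image_emb)
    best = int(np.argmax(sims))
    return gallery_coords[best], sims

rng = np.random.default_rng(0)
gallery_coords = np.array([[48.86, 2.35], [40.71, -74.01], [35.68, 139.69]])
gallery_embs = rng.normal(size=(3, 16))                    # stand-in for L(G_i)
image_emb = gallery_embs[1] + 0.01 * rng.normal(size=16)   # stand-in for V(I)

pred, sims = retrieve_gps(image_emb, gallery_embs, gallery_coords)
```

Because the fake image embedding is a slightly perturbed copy of the second gallery entry, the retrieved coordinate is the second one. In a trained model the match would come from the contrastive alignment of the two encoders instead.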

3.1.1 Image Encoder

CLIP already comes with an image encoder, so it is reused here.

3.1.2 Location Encoder

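Several techniques are used:

Represent GPS coordinates with the Equal Earth projection (GPS coordinates using equal earth projection)
Positional encoding with random Fourier features, which are not a Fourier transform (using positional encoding through random Fourier features)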

Equal Earth Projection (EEP):

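The Earth's surface is curved rather than flat, so raw latitude and longitude distort distances and areas when treated as points on a flat 2D plane. I think this is why the paper needs the projection.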

After applying the EEP, we scale the resulting longitude in the range −1 to 1, and the latitude values are scaled proportionally.

So in the end the projected longitude is scaled to the interval [-1, 1], and the latitude is scaled by the same factor.
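Since the formula image did not survive, here is a minimal sketch of the projection and the scaling step. It uses the standard published Equal Earth formulas and coefficients. The function names `equal_earth` and `eep_scaled`, and the choice to divide both axes by the maximum x value, are my own assumptions about how "scaled proportionally" could be implemented, not the authors' code.

```python
import numpy as np

# Polynomial coefficients of the standard Equal Earth projection.
A1, A2, A3, A4 = 1.340264, -0.081106, 0.000893, 0.003796

def equal_earth(lat_deg, lon_deg):
    """Project latitude/longitude in degrees to unscaled Equal Earth (x, y)."""
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    theta = np.arcsin(np.sqrt(3.0) / 2.0 * np.sin(lat))
    denom = 3.0 * (9 * A4 * theta**8 + 7 * A3 * theta**6 + 3 * A2 * theta**2 + A1)
    x = 2.0 * np.sqrt(3.0) * lon * np.cos(theta) / denom
    y = A4 * theta**9 + A3 * theta**7 + A2 * theta**3 + A1 * theta
    return x, y

# The largest |x| is reached on the equator at longitude 180 degrees.
X_MAX = equal_earth(0.0, 180.0)[0]

def eep_scaled(lat_deg, lon_deg):
    """Scale so that x lies in [-1, 1]; y is divided by the same factor."""
    x, y = equal_earth(lat_deg, lon_deg)
    return x / X_MAX, y / X_MAX
```

With this scaling the x axis spans [-1, 1], while the y axis covers a narrower range because the map is wider than it is tall.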

Random Fourier Features (RFF):
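The encoder needs to capture fine-grained, rapidly varying detail, i.e. high-frequency components, which an MLP fed raw 2D coordinates tends to miss (the spectral bias mentioned in the contributions). Note that RFF is not a Fourier transform of the input: it passes the coordinates through sines and cosines at randomly sampled frequencies, so the MLP receives features in which high-frequency variation is easier to fit.
(That seems to be the idea, though I don't follow all of it.)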

We limit the frequencies using a fixed matrix R, whose entries are sampled from a Gaussian distribution with the standard deviation (σ). The matrix R is set at the beginning of training and remains unchanged throughout the training process. The RFF operation γ(·) encodes GPS coordinate G′_i as γ(G′_i) = [cos(2πRG′_i), sin(2πRG′_i)]^T, where the entries of the m-th row and n-th column of matrix R are r_{m,n} ∼ N(0, σ).
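The quoted definition translates almost directly into code. The sketch below is a minimal NumPy version, assuming σ is the standard deviation (as the quote states) and that the input is the 2D projected coordinate. The names `make_rff_matrix` and `rff_encode` and the feature count of 8 are hypothetical.

```python
import numpy as np

def make_rff_matrix(num_features, sigma, seed=0):
    """Sample the fixed matrix R once; it is not updated during training."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=sigma, size=(num_features, 2))

def rff_encode(coords, R):
    """Encode (N, 2) coordinates as [cos(2*pi*R g), sin(2*pi*R g)] per row."""
    proj = 2.0 * np.pi * coords @ R.T          # shape (N, num_features)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

R = make_rff_matrix(num_features=8, sigma=4.0)
coords = np.array([[0.0, 0.0], [0.25, -0.10]])  # already projected and scaled
features = rff_encode(coords, R)                # shape (2, 16)
```

A larger σ draws larger entries in R, so the sinusoids oscillate faster and nearby coordinates become easier to tell apart. A smaller σ gives smoother, coarser features.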

Hierarchical Representation:
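As noted above, the frequency range is controlled by the sigma parameter, so encoding the same coordinates with several different sigma values yields a hierarchy of representations at different resolutions.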
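The formula image for this step is missing, so the following is only a sketch under stated assumptions. I read "exponential sigma assignment" as spacing the sigma values evenly in log2 between a minimum and a maximum, and I assume the outputs of the per-level branches are summed. The range 1 to 256, the three levels, and the random linear layer standing in for each learned MLP are illustrative choices, not values confirmed by these notes.

```python
import numpy as np

def exponential_sigmas(sigma_min, sigma_max, num_levels):
    """Space sigma values evenly in log2 between sigma_min and sigma_max."""
    exponents = np.linspace(np.log2(sigma_min), np.log2(sigma_max), num_levels)
    return 2.0 ** exponents

def hierarchical_encode(coords, sigmas, num_features=8, out_dim=4, seed=0):
    """Sum the outputs of one RFF branch per sigma (assumed combination rule)."""
    rng = np.random.default_rng(seed)
    out = np.zeros((coords.shape[0], out_dim))
    for sigma in sigmas:
        R = rng.normal(0.0, sigma, size=(num_features, 2))   # fixed per level
        proj = 2.0 * np.pi * coords @ R.T
        feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
        # Random linear map as a stand-in for that level's learned MLP.
        W = rng.normal(0.0, 1.0 / np.sqrt(feats.shape[1]),
                       size=(feats.shape[1], out_dim))
        out += feats @ W
    return out

sigmas = exponential_sigmas(1.0, 256.0, 3)
coords = np.array([[0.0, 0.0], [0.25, -0.10]])
embedding = hierarchical_encode(coords, sigmas)   # shape (2, 4)
```

The low-sigma branch changes slowly across the map and captures coarse, continent-scale structure, while the high-sigma branch separates locations that are close together.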