Description: Paper note on LightGlue, an adaptive deep network for efficient sparse local feature matching (ICCV 2023)
My Notion Note ID: K2E-B-F2-2
Created: 2025-01-14
Updated: 2026-06-06
License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io

1. Paper Information
2. Summary
3. Key Contributions
4. Background & Related Work
5. Method Details & Key Equations
6. Training Setup & Datasets
7. Main Experiments & Quantitative Results
8. Ablations, Limitations & Practical Pointers
9. Conclusions & Future Work
References

1. Paper Information

Title: LightGlue: Local Feature Matching at Light Speed
Authors: Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys
Affiliations: ETH Zurich; Microsoft Mixed Reality & AI Lab
Paper: ICCV 2023 / arXiv:2306.13643
Code: github.com/cvg/LightGlue

2. Summary

LightGlue is a deep neural network for matching sparse local features across images, building on and improving SuperGlue. It revisits multiple design decisions — positional encoding, prediction head, supervision strategy, and attention mechanism — and combines simple but effective modifications to achieve better accuracy, lower memory/compute cost, and dramatically easier training. Its defining feature is adaptivity: both network depth (early stopping) and width (point pruning) adjust dynamically based on the difficulty of each image pair, making inference much faster on easy pairs (e.g., large overlap) while retaining full capacity for hard cases.

3. Key Contributions

Adaptive depth and width: A confidence classifier after each layer enables early stopping when predictions are confident; unmatchable points are pruned at each layer, reducing the effective sequence length and quadratic attention cost.
Simpler, faster prediction head: Replaces SuperGlue's expensive Sinkhorn optimal transport with a disentangled combination of pairwise similarity scores and unary matchability scores. The resulting soft partial assignment matrix is faster to compute and produces cleaner gradients.
Rotary positional encoding (RoPE): Uses relative positional encoding instead of SuperGlue's absolute learned encoding, improving accuracy and robustness since the model retains positional information across all layers.
Bidirectional cross-attention: Computes the cross-attention similarity matrix once and reuses it for both directions, saving ~20% runtime with no accuracy loss.
Deep supervision: The lightweight head allows supervision at every layer (not just the last), speeding convergence and enabling meaningful early exits.
Ease of training: Reaches state-of-the-art accuracy in just a few GPU-days (2 days on 2× RTX 3090), vs. SuperGlue's >7 days.

4. Background & Related Work

Local feature matching pipeline: Detect sparse keypoints (SIFT, SuperPoint), describe them with high-dimensional vectors, match via nearest-neighbor search, then filter with heuristics (Lowe's ratio test, mutual check) and robust geometric fitting (RANSAC). This pipeline requires extensive tuning and fails under challenging conditions.
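The classical matching step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the function name `match_nn`, the ratio value 0.8, and the use of Euclidean distance are assumptions made for the example.

```python
import numpy as np

def match_nn(desc_a, desc_b, ratio=0.8, mutual=True):
    """Nearest-neighbor matching with Lowe's ratio test and a mutual check.

    desc_a: (M, d) descriptors, desc_b: (N, d) descriptors with N >= 2.
    Returns a list of (i, j) index pairs. Hypothetical helper for illustration.
    """
    # Pairwise Euclidean distances, shape (M, N).
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc_a))
    # Ratio test: the best neighbor must be clearly closer than the second best.
    keep = dist[rows, best] < ratio * dist[rows, second]
    if mutual:
        # Mutual check: i must also be the nearest neighbor of its match j.
        best_for_b = np.argmin(dist, axis=0)
        keep &= best_for_b[best] == rows
    return [(int(i), int(best[i])) for i in rows[keep]]
```

The ratio and mutual-check settings are exactly the kind of hand-tuned heuristics that learned matchers such as SuperGlue and LightGlue aim to replace.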

Deep matchers: SuperGlue introduced end-to-end joint matching and outlier rejection using a Transformer with optimal transport (Sinkhorn). Highly effective but computationally expensive ( $O(MN)$ per Sinkhorn iteration) and notoriously hard to train. Follow-ups (SGMNet, ClusterGNN) reduce attention cost by restricting to subsets, but sacrifice robustness.

Dense matchers: LoFTR, MatchFormer, ASpanFormer match on dense grids — more robust but much slower, limited by resolution.

Efficient Transformers: Linear attention, bottleneck tokens, selective checkpointing, and adaptive depth explored in NLP/vision. LightGlue adapts these ideas specifically to 2D feature matching.

5. Method Details & Key Equations

5.1 Problem Formulation

Given two sets of local features from images $A$ and $B$ , each feature $i$ has position $\mathbf{p}_i := (x, y)_i \in [0,1]^2$ (normalized) and descriptor $\mathbf{d}_i \in \mathbb{R}^d$ . Goal: predict a set of correspondences $\mathcal{M} = \{(i,j)\} \subset \mathcal{A} \times \mathcal{B}$ via a soft partial assignment matrix $\mathbf{P} \in [0,1]^{M \times N}$ .

5.2 Transformer Backbone

Each local feature $i$ in image $I$ is associated with a state $\mathbf{x}_i^I \in \mathbb{R}^d$ , initialized from its descriptor. The network is a stack of $L$ identical layers, each containing one self-attention and one cross-attention unit.

Attention unit — updates state via an MLP on the concatenation of current state and aggregated message:

$$\mathbf{x}_i^I \leftarrow \mathbf{x}_i^I + \text{MLP}\big([\mathbf{x}_i^I \mid \mathbf{m}_i^{I \leftarrow S}]\big)$$

$[\cdot \mid \cdot]$ = concatenation; residual connection ( $+ \mathbf{x}_i^I$ ) preserves original information.

where the message is a weighted average over source states:

$$\mathbf{m}_i^{I \leftarrow S} = \sum_{j \in S} \text{Softmax}_{k \in S}(a_{ik}^{IS})_j\, \mathbf{W}\mathbf{x}_j^S$$
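The update and message equations can be sketched as follows. This is a single-head illustration under stated assumptions: `attention_update` is a hypothetical name, the attention scores are passed in precomputed, and `mlp` is any callable mapping a `(M, 2d)` array to `(M, d)`, since the notes do not specify the MLP's internals.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_update(x, src, scores, W, mlp):
    """One residual attention update (illustrative sketch, single head).

    x: (M, d) states of image I, src: (N, d) states of source S,
    scores: (M, N) attention scores a_ij, W: (d, d) value projection,
    mlp: callable taking (M, 2d) and returning (M, d) (assumed interface).
    """
    weights = softmax(scores, axis=1)            # normalize over source points
    message = weights @ (src @ W.T)              # weighted average of W x_j
    return x + mlp(np.concatenate([x, message], axis=1))  # residual update
```

For self-attention the source is the image itself (`src` equals `x`); for cross-attention it is the other image.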

Self-attention uses rotary positional encoding (RoPE). Attention score between points $i$ and $j$ :

$$a_{ij} = \mathbf{q}_i^\top \mathbf{R}(\mathbf{p}_j - \mathbf{p}_i)\, \mathbf{k}_j$$

$\mathbf{q}_i = W_q \mathbf{x}_i$ , $\mathbf{k}_j = W_k \mathbf{x}_j$ — query and key projections. $\mathbf{R}(\mathbf{p}_j - \mathbf{p}_i)$ is a $d \times d$ block-diagonal rotation matrix encoding 2D relative position into $d/2$ independent rotations. Each angle $\theta_k = \mathbf{b}_k^\top(\mathbf{p}_j - \mathbf{p}_i)$ is determined by a learned 2D frequency vector $\mathbf{b}_k$ . Translation-equivariant (relative positions, not absolute coordinates).

$$\mathbf{R}(\mathbf{p}) = \begin{pmatrix} \hat{\mathbf{R}}(\mathbf{b}_1^\top \mathbf{p}) & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \hat{\mathbf{R}}(\mathbf{b}_{d/2}^\top \mathbf{p}) \end{pmatrix}, \quad \hat{\mathbf{R}}(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

$d/2$ rotation blocks: $d$ dimensions partitioned into $d/2$ pairs, each undergoing an independent 2D rotation. RoPE is applied at every layer, unlike SuperGlue's one-shot absolute encoding.
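A short NumPy sketch can make the relative-position property concrete. It rotates the query and key by their own absolute positions and relies on the identity $\mathbf{R}(\mathbf{p}_i)^\top \mathbf{R}(\mathbf{p}_j) = \mathbf{R}(\mathbf{p}_j - \mathbf{p}_i)$, so the score depends only on the offset. The function names and the interleaved pairing of dimensions are assumptions for illustration, not the layout of the official implementation.

```python
import numpy as np

def rotate(x, pos, freqs):
    """Apply the block-diagonal rotation R(pos) to a vector x.

    x: (d,) float vector, pos: (2,) position, freqs: (d/2, 2) frequency
    vectors b_k. Dimensions (0,1), (2,3), ... form the rotated pairs
    (an assumed pairing for this sketch).
    """
    theta = freqs @ pos                      # (d/2,), one angle per 2D block
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = cos * x1 - sin * x2
    out[1::2] = sin * x1 + cos * x2
    return out

def rope_score(q, k, p_i, p_j, freqs):
    """Score q^T R(p_j - p_i) k, computed by rotating q and k separately."""
    return rotate(q, p_i, freqs) @ rotate(k, p_j, freqs)
```

Shifting both positions by the same translation leaves `rope_score` unchanged, which is the translation equivariance noted above.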

Cross-attention attends to all points in the other image, exploiting bidirectional symmetry:

$$a_{ij}^{IS} = \mathbf{k}_i^{I\top} \mathbf{k}_j^S \stackrel{!}{=} a_{ji}^{SI}$$

Cross-attention omits the positional encoding, since relative positions are not meaningful across different images. It also projects each state to a key only (no separate query), which makes the similarity matrix symmetric. The matrix is therefore computed once and reused for both directions $A \to B$ and $B \to A$ , halving the cost of that step.
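The reuse of the similarity matrix can be sketched as below. This is a simplified single-head version: the function name is hypothetical, and the usual scaling of scores by $\sqrt{d}$ is left out for brevity.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_cross_attention(k_a, k_b, v_a, v_b):
    """Messages in both directions from one similarity matrix (sketch).

    k_a: (M, d), k_b: (N, d) keys; v_a: (M, d), v_b: (N, d) values.
    Returns (m_a, m_b): messages for A from B and for B from A.
    """
    sim = k_a @ k_b.T                        # (M, N), computed only once
    m_a = softmax(sim, axis=1) @ v_b         # A attends to B
    m_b = softmax(sim.T, axis=1) @ v_a       # B attends to A, reusing sim
    return m_a, m_b
```

Only the softmax normalization differs between the two directions; the expensive matrix product is shared.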

5.3 Correspondence Prediction (Lightweight Head)

Assignment scores — pairwise similarity between projected states:

$$\mathbf{S}_{ij} = \text{Linear}(\mathbf{x}_i^A)^\top \text{Linear}(\mathbf{x}_j^B) \quad \forall (i,j) \in \mathcal{A} \times \mathcal{B}$$

The linear projection $W\mathbf{x} + b$ is shared between the two images, with separate weights at each layer. It maps states into a matching subspace.

Matchability scores — unary likelihood that point $i$ has a valid correspondence:

$$\sigma_i = \text{Sigmoid}(\text{Linear}(\mathbf{x}_i)) \in [0, 1]$$

$\sigma_i \approx 1$ : likely has a correspondence; $\sigma_i \approx 0$ : likely unmatchable. Replaces SuperGlue's dustbin mechanism with an explicit per-point matchability prediction.

Soft partial assignment matrix — combines both:

$$\mathbf{P}_{ij} = \sigma_i^A\, \sigma_j^B\, \underset{k \in \mathcal{A}}{\text{Softmax}}(\mathbf{S}_{kj})_i\, \underset{k \in \mathcal{B}}{\text{Softmax}}(\mathbf{S}_{ik})_j$$

Four multiplicative terms: (1) $\sigma_i^A$ — is point $i$ in $A$ matchable? (2) $\sigma_j^B$ — is point $j$ in $B$ matchable? (3) Softmax over $\mathcal{A}$ — competition within $A$ for $j$ . (4) Softmax over $\mathcal{B}$ — competition within $B$ for $i$ . Product is high only when both points are matchable AND mutually prefer each other — effectively a soft mutual nearest neighbor with matchability gating.

Replaces Sinkhorn iterations. Disentangles "is this point matchable?" from "which point does it match?", yielding cleaner gradients.

Correspondences are extracted where $\mathbf{P}_{ij} > \tau$ and $\mathbf{P}_{ij}$ is the maximum along both its row and its column.
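The prediction head and the match extraction rule can be sketched together. The function names are hypothetical, and the scores and matchabilities are taken as given inputs rather than produced by learned layers; the default `tau=0.1` follows the filter threshold listed under Practical Pointers.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def assignment(S, sigma_a, sigma_b):
    """Soft partial assignment P from scores and matchabilities (sketch).

    S: (M, N) pairwise scores; sigma_a: (M,), sigma_b: (N,) in [0, 1].
    """
    # axis=0 normalizes over points of A for each j; axis=1 over B for each i.
    double_softmax = softmax(S, axis=0) * softmax(S, axis=1)
    return sigma_a[:, None] * sigma_b[None, :] * double_softmax

def extract_matches(P, tau=0.1):
    """Keep (i, j) where P_ij > tau and is the max of its row and column."""
    row_best = P.argmax(axis=1)
    col_best = P.argmax(axis=0)
    rows = np.arange(P.shape[0])
    mutual = col_best[row_best] == rows
    keep = mutual & (P[rows, row_best] > tau)
    return [(int(i), int(row_best[i])) for i in rows[keep]]
```

Unlike Sinkhorn, this needs no iterations: two softmax passes and an elementwise product give the full matrix.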

5.4 Adaptive Depth and Width

Confidence classifier — at each layer $\ell$ , an MLP predicts per-point confidence:

$$c_i = \text{Sigmoid}(\text{MLP}(\mathbf{x}_i)) \in [0, 1]$$

Early stopping — halt if sufficient fraction $\alpha$ of all points are confident:

$$\text{exit} = \left(\frac{1}{N+M} \sum_{I \in \{A,B\}} \sum_{i \in I} \llbracket c_i^\ell > \lambda_\ell \rrbracket \right) > \alpha$$

Confidence threshold decays exponentially across layers:

$$\lambda_\ell = 0.8 + 0.1\, e^{-4\ell/L}$$

Earlier layers need higher thresholds (less calibrated). $\alpha = 95\%$ yields ~25% runtime reduction with virtually no accuracy loss.

Point pruning — points confidently predicted as unmatchable ( $c_i^\ell > \lambda_\ell$ and $\sigma_i^\ell < \beta$ , $\beta = 0.01$ ) are removed from subsequent layers. After a few layers, >30% of keypoints are pruned.
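The exit criterion and pruning rule reduce to a few array comparisons, sketched below. The function names are hypothetical, and the layer index is passed through directly to the threshold formula; whether layers are counted from 0 or 1 is an assumption left to the caller.

```python
import numpy as np

def confidence_threshold(layer, num_layers):
    """lambda_l = 0.8 + 0.1 * exp(-4 l / L): decays from about 0.9 toward 0.8."""
    return 0.8 + 0.1 * np.exp(-4.0 * layer / num_layers)

def should_exit(conf_a, conf_b, layer, num_layers, alpha=0.95):
    """Stop when more than a fraction alpha of all points are confident."""
    lam = confidence_threshold(layer, num_layers)
    conf = np.concatenate([conf_a, conf_b])
    return bool(np.mean(conf > lam) > alpha)

def keep_mask(conf, sigma, layer, num_layers, beta=0.01):
    """Boolean mask of points to keep for the next layer.

    A point is pruned only if it is both confident (conf > lambda_l)
    and predicted unmatchable (sigma < beta).
    """
    lam = confidence_threshold(layer, num_layers)
    prune = (conf > lam) & (sigma < beta)
    return ~prune
```

Applying `keep_mask` to each image shrinks the attention sequence length, which is where the width savings come from.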

5.5 Supervision

Two-stage training: first train matching, then train the confidence classifier.

Matching loss — average over all layers (deep supervision):

$$\text{loss} = -\frac{1}{L}\sum_\ell \left( \frac{1}{|\mathcal{M}|}\sum_{(i,j)\in\mathcal{M}} \log\, {}^\ell\mathbf{P}_{ij} + \frac{1}{2|\bar{\mathcal{A}}|}\sum_{i\in\bar{\mathcal{A}}} \log(1 - {}^\ell\sigma_i^A) + \frac{1}{2|\bar{\mathcal{B}}|}\sum_{j\in\bar{\mathcal{B}}} \log(1 - {}^\ell\sigma_j^B) \right)$$

Term 1: $\log P_{ij}$ pushes $P_{ij} \to 1$ for true matches $\mathcal{M}$ . Terms 2 & 3: $\log(1-\sigma)$ pushes $\sigma \to 0$ for unmatchable points ( $\bar{\mathcal{A}}, \bar{\mathcal{B}}$ ). $1/L$ average over layers = deep supervision — every layer trains to produce good predictions, enabling early exit.

Confidence classifier loss — binary cross-entropy at each layer, predicting whether that layer's assignment matches the final layer's assignment.
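The matching loss can be sketched in NumPy as follows. This is a simplified illustration: the function names are hypothetical, a small `eps` is added inside the logarithms for numerical safety, and the unmatchable sets are taken to be all points without a ground-truth match, which is an assumption of this sketch. It also assumes at least one ground-truth match per pair.

```python
import numpy as np

def layer_loss(P, sigma_a, sigma_b, matches, eps=1e-9):
    """Negative log-likelihood for one layer's predictions (sketch).

    P: (M, N) assignment, sigma_a: (M,), sigma_b: (N,) matchabilities,
    matches: non-empty list of ground-truth (i, j) pairs.
    """
    M, N = P.shape
    matched_a = {i for i, _ in matches}
    matched_b = {j for _, j in matches}
    un_a = [i for i in range(M) if i not in matched_a]   # unmatchable in A
    un_b = [j for j in range(N) if j not in matched_b]   # unmatchable in B
    pos = np.mean([np.log(P[i, j] + eps) for i, j in matches])
    neg_a = 0.5 * np.mean(np.log(1 - sigma_a[un_a] + eps)) if un_a else 0.0
    neg_b = 0.5 * np.mean(np.log(1 - sigma_b[un_b] + eps)) if un_b else 0.0
    return -(pos + neg_a + neg_b)

def deep_supervision_loss(per_layer):
    """Average layer_loss over layers; per_layer holds one argument tuple each."""
    return float(np.mean([layer_loss(*args) for args in per_layer]))
```

Because every layer contributes equally to the average, intermediate layers are pushed to produce usable assignments, which is what makes early exit meaningful.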

6. Training Setup & Datasets

Pre-training on synthetic homographies:

170k images from Oxford-Paris 1M distractors (150k/10k/10k split)
Random homographies + strong photometric augmentations (blur, hue, saturation, illumination, noise, occlusion)
512/1024/1024 keypoints for SuperPoint/SIFT/DISK; batch size 64, lr 0.0001
Achieves >99% recall and >90% precision; 2 GPU-days on 2× RTX 3090

Fine-tuning on MegaDepth:

1M crowd-sourced images of 196 landmarks with SfM poses and dense depth
Image pairs binned by covisibility: [0.1, 0.3], [0.3, 0.5], [0.5, 0.7]
2048 keypoints/image; batch size 32; lr 1e-5 with exponential decay
50 epochs (~2 days on 2× RTX 3090)
Large epipolar-error points labeled unmatchable (incomplete MegaDepth depth maps)

Implementation: $L=9$ layers, 4 attention heads, $d=256$ . Gradient checkpointing + mixed-precision for 24GB VRAM.

7. Main Experiments & Quantitative Results

7.1 Homography Estimation (HPatches)

Features + Matcher	Recall (R)	Precision (P)	AUC-RANSAC @1px / @5px	AUC-DLT @1px / @5px
LoFTR (dense)	—	92.7	41.5 / 78.8	38.5 / 70.6
NN+mutual (SP)	72.7	67.2	35.0 / 75.3	0.0 / 2.0
SuperGlue (SP)	94.9	87.4	38.3 / 79.3	33.8 / 76.7
SGMNet (SP)	95.5	83.0	38.6 / 79.0	31.7 / 76.0
LightGlue (SP)	94.3	88.9	38.3 / 79.6	35.9 / 78.6

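Among sparse matchers, LightGlue achieves the highest precision (88.9; only the dense LoFTR is higher at 92.7) and competitive AUC. Its DLT AUC @5px (78.6) nearly equals its RANSAC AUC (79.6), making simple DLT competitive with expensive MAGSAC.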

7.2 Relative Pose Estimation (MegaDepth-1500)

Features + Matcher	RANSAC AUC 5°/10°/20°	LO-RANSAC AUC 5°/10°/20°	Time (ms)
LoFTR (dense)	52.8 / 69.2 / 81.2	66.4 / 78.6 / 86.5	181
SuperGlue (SP)	49.7 / 67.1 / 80.6	65.8 / 78.7 / 87.5	70.0
LightGlue (SP)	49.9 / 67.0 / 80.1	66.7 / 79.3 / 87.9	44.2
↳ adaptive	49.4 / 67.2 / 80.1	66.3 / 79.0 / 87.9	31.4
NN+ratio (DISK)	38.1 / 55.4 / 69.6	57.2 / 69.5 / 78.6	7.4
LightGlue (DISK)	43.5 / 61.0 / 75.3	61.3 / 74.3 / 83.8	44.5

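With RANSAC, LightGlue is on par with SuperGlue; with LO-RANSAC it is slightly better at all three thresholds, at ~37% lower inference time (44.2 vs. 70.0 ms). The adaptive variant (31.4 ms) is >2× faster than SuperGlue.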

7.3 Outdoor Visual Localization (Aachen Day-Night)

SuperPoint + Matcher	Day (0.25m,2°) / (0.5m,5°) / (1.0m,10°)	Night	Pairs/sec
SuperGlue	88.2 / 95.5 / 98.7	86.7 / 92.9 / 100	6.5
LightGlue	89.2 / 95.4 / 98.5	87.8 / 93.9 / 100	17.2 / 26.1

Matches SuperGlue accuracy at about 2.6× higher throughput (17.2 vs. 6.5 pairs/sec), or 4× with the optimized variant (26.1 pairs/sec).

7.4 Image Matching Challenge

IMC 2020: LightGlue yields better poses than SuperGlue on multi-view track, reduces matching time significantly
IMC 2021: Outperforms sparse baselines, surpasses tuned baselines on public leaderboard
IMC 2023: SuperPoint+LightGlue reaches 38.4 / 46.1 (public/private) — +2.3% over SuperPoint+SuperGlue

8. Ablations, Limitations & Practical Pointers

Ablation study (synthetic homographies):

Architecture	Precision	Recall	Time (ms)
SuperGlue	74.6	90.5	29.1
LightGlue (full)	86.8	96.3	19.4
↳ no matchability	67.4	97.0	18.9
↳ absolute positions	84.2	94.7	18.7
↳ full cross-attention	86.6	96.1	22.8
↳ early layer (#5/9)	78.1	92.7	11.9

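Matchability is critical for precision: removing it drops precision from 86.8 to 67.4, while the full model, which disentangles similarity from matchability, is +12.2 points over SuperGlue
Rotary encoding outperforms absolute positions (+2.6 points precision, +1.6 points recall)
Bidirectional cross-attention cuts time from 22.8 ms to 19.4 ms with no accuracy drop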

Adaptive depth & width impact:

Difficulty	Avg. stopping layer	Unmatchable pruned (%)	Speedup
Easy	4.7	19.8	1.86×
Medium	5.5	23.4	1.33×
Hard	6.9	27.9	1.16×
Average	5.7	23.7	1.45×

Limitations:

On InLoc (indoor), LightGlue is competitive but not clearly better: failures stem from matching repeated textures rather than geometric structure
Sparse features: accuracy bounded by keypoint detector quality
Softmax over both dimensions remains a computational bottleneck

Practical Pointers:

Compatible with SuperPoint, SIFT, and DISK features (plug-and-play)
Optimized variant uses FlashAttention-style efficient self-attention
Pre-training on synthetic homographies is critical — do not skip
Filter threshold $\tau = 0.1$ for correspondences; tune RANSAC threshold per method for fair evaluation
Training with 2k keypoints/image instead of 1k improves accuracy

9. Conclusions & Future Work

LightGlue is a drop-in replacement for SuperGlue that is faster, more accurate, and far easier to train. Key innovation: adaptivity — depth (early stopping when confident) and width (pruning unmatchable points) adjust to each image pair's difficulty. The disentangled prediction head (similarity + matchability) replaces expensive Sinkhorn with a simpler formulation yielding cleaner gradients and enabling deep supervision. On easy pairs, LightGlue can stop after 2–3 layers and run at ~100 FPS. Future directions: a dedicated bidirectional softmax kernel to eliminate the remaining bottleneck, and extending the adaptive framework to dense matching.

References

Lindenberger, P., Sarlin, P.-E., & Pollefeys, M. (2023). LightGlue: Local Feature Matching at Light Speed. ICCV 2023. arXiv:2306.13643 — paper
