LightGlue: Local Feature Matching at Light Speed


  • Description: Paper note on LightGlue, an adaptive deep network for efficient sparse local feature matching (ICCV 2023)
  • My Notion Note ID: K2E-B-F2-2
  • Created: 2025-01-14
  • Updated: 2026-06-06
  • License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io

1. Paper Information

Title: LightGlue: Local Feature Matching at Light Speed
Authors: Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys
Affiliations: ETH Zurich; Microsoft Mixed Reality & AI Lab
Paper: ICCV 2023 / arXiv:2306.13643
Code: github.com/cvg/LightGlue

2. Summary

LightGlue is a deep neural network for matching sparse local features across images, building on and improving SuperGlue. It revisits multiple design decisions — positional encoding, prediction head, supervision strategy, and attention mechanism — and combines simple but effective modifications to achieve better accuracy, lower memory/compute cost, and dramatically easier training. Its defining feature is adaptivity: both network depth (early stopping) and width (point pruning) adjust dynamically based on the difficulty of each image pair, making inference much faster on easy pairs (e.g., large overlap) while retaining full capacity for hard cases.

3. Key Contributions

  • Adaptive depth and width: A confidence classifier after each layer enables early stopping when predictions are confident; unmatchable points are pruned at each layer, reducing the effective sequence length and quadratic attention cost.
  • Simpler, faster prediction head: Replaces SuperGlue's expensive Sinkhorn optimal transport with a disentangled combination of pairwise similarity scores and unary matchability scores, yielding a soft partial assignment matrix that is both faster and produces cleaner gradients.
  • Rotary positional encoding (RoPE): Uses relative positional encoding instead of SuperGlue's absolute learned encoding, improving accuracy and robustness since the model retains positional information across all layers.
  • Bidirectional cross-attention: Computes the cross-attention similarity matrix once and reuses it for both directions, saving ~20% runtime with no accuracy loss.
  • Deep supervision: The lightweight head allows supervision at every layer (not just the last), speeding convergence and enabling meaningful early exits.
  • Ease of training: Reaches state-of-the-art accuracy in just a few GPU-days (2 days on 2× RTX 3090), vs. SuperGlue's >7 days.

4. Background & Related Work

Local feature matching pipeline: Detect sparse keypoints (SIFT, SuperPoint), describe with high-dimensional vectors, match via nearest-neighbor search, filter with heuristics (Lowe's ratio test, mutual check) and robust geometric fitting (RANSAC). Requires extensive tuning; fails under challenging conditions.
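As a rough illustration of the heuristic filtering that learned matchers aim to replace, here is a minimal NumPy sketch of nearest-neighbor matching with Lowe's ratio test and a mutual check. The function name `match_nn`, the ratio of 0.8, and the toy descriptors are assumptions made for this example, not values from the paper.

```python
import numpy as np

def match_nn(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbor matching with a ratio test and a mutual check.

    desc_a: (M, d) descriptors, desc_b: (N, d) descriptors with N >= 2.
    Returns a list of (i, j) index pairs.
    """
    # Pairwise Euclidean distances, shape (M, N).
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_ab = dist.argmin(axis=1)  # best j for every i
    nn_ba = dist.argmin(axis=0)  # best i for every j
    matches = []
    for i, j in enumerate(nn_ab):
        # Smallest and second-smallest distance in row i.
        best, second = np.partition(dist[i], 1)[:2]
        passes_ratio = best < ratio * second
        is_mutual = nn_ba[j] == i
        if passes_ratio and is_mutual:
            matches.append((i, int(j)))
    return matches

# Toy data: the third descriptor of A is ambiguous and fails the ratio test.
desc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
desc_b = np.array([[0.0, 1.0], [1.0, 0.1]])
print(match_nn(desc_a, desc_b))
```

Both the ratio and the choice of distance need tuning per descriptor, which is the kind of hand-tuning the note refers to.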

Deep matchers: SuperGlue introduced end-to-end joint matching and outlier rejection using a Transformer with optimal transport (Sinkhorn). Highly effective but computationally expensive ($O(MN)$ per Sinkhorn iteration) and notoriously hard to train. Follow-ups (SGMNet, ClusterGNN) reduce attention cost by restricting to subsets, but sacrifice robustness.

Dense matchers: LoFTR, MatchFormer, ASpanFormer match on dense grids — more robust but much slower, limited by resolution.

Efficient Transformers: Linear attention, bottleneck tokens, selective checkpointing, and adaptive depth explored in NLP/vision. LightGlue adapts these ideas specifically to 2D feature matching.

5. Method Details & Key Equations

5.1 Problem Formulation

Given two sets of local features from images $A$ and $B$, with $M$ and $N$ features respectively, each feature $i$ has position $\mathbf{p}_i := (x, y)_i \in [0,1]^2$ (normalized) and descriptor $\mathbf{d}_i \in \mathbb{R}^d$. Goal: predict a set of correspondences $\mathcal{M} = \{(i,j)\} \subset \mathcal{A} \times \mathcal{B}$ via a soft partial assignment matrix $\mathbf{P} \in [0,1]^{M \times N}$.

5.2 Transformer Backbone

Each local feature $i$ in image $I$ is associated with a state $\mathbf{x}_i^I \in \mathbb{R}^d$, initialized from its descriptor. The network is a stack of $L$ identical layers, each containing one self-attention and one cross-attention unit.

Attention unit: updates the state via an MLP on the concatenation of the current state and an aggregated message:

$$\mathbf{x}_i^I \leftarrow \mathbf{x}_i^I + \text{MLP}\big([\mathbf{x}_i^I \mid \mathbf{m}_i^{I \leftarrow S}]\big)$$

$[\cdot \mid \cdot]$ denotes concatenation; the residual connection ($+\,\mathbf{x}_i^I$) preserves the original information.

The message is a weighted average over the states of the source image $S$:

$$\mathbf{m}_i^{I \leftarrow S} = \sum_{j \in S} \text{Softmax}_{k \in S}\big(a_{ik}^{IS}\big)_j\, \mathbf{W}\mathbf{x}_j^S$$
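The update and message equations can be sketched in a few lines of NumPy. This is a single-head toy version under stated assumptions: `attention_update`, `toy_mlp`, the hidden width of 8, and the random inputs are hypothetical names and values for illustration, and the attention scores are simply passed in rather than computed.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_update(x_tgt, x_src, scores, W_v, mlp):
    """Residual state update for all points of the target image I.

    x_tgt:  (n_tgt, d) states of image I
    x_src:  (n_src, d) states of the source image S
    scores: (n_tgt, n_src) attention scores a_ik
    W_v:    (d, d) projection W applied to the source states
    mlp:    callable mapping (n_tgt, 2d) -> (n_tgt, d)
    """
    weights = softmax(scores, axis=1)      # each row sums to 1 over the source points
    message = weights @ (x_src @ W_v.T)    # m_i = sum_j weights_ij * W x_j
    return x_tgt + mlp(np.concatenate([x_tgt, message], axis=1))

rng = np.random.default_rng(0)
d, n_tgt, n_src = 4, 5, 3
x_tgt = rng.normal(size=(n_tgt, d))
x_src = rng.normal(size=(n_src, d))
scores = rng.normal(size=(n_tgt, n_src))
W_v = rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(2 * d, 8)), rng.normal(size=(8, d))

def toy_mlp(z):
    """Hypothetical two-layer ReLU MLP standing in for the learned one."""
    return np.maximum(z @ W1, 0.0) @ W2

updated = attention_update(x_tgt, x_src, scores, W_v, toy_mlp)
print(updated.shape)
```

For self-attention the source is the same image as the target; for cross-attention it is the other image.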

Self-attention uses rotary positional encoding (RoPE). The attention score between points $i$ and $j$ is:

$$a_{ij} = \mathbf{q}_i^\top \mathbf{R}(\mathbf{p}_j - \mathbf{p}_i)\, \mathbf{k}_j$$

$\mathbf{q}_i = W_q \mathbf{x}_i$ and $\mathbf{k}_j = W_k \mathbf{x}_j$ are the query and key projections. $\mathbf{R}(\mathbf{p}_j - \mathbf{p}_i)$ is a $d \times d$ block-diagonal rotation matrix that encodes the 2D relative position as $d/2$ independent rotations. Each angle $\theta_k = \mathbf{b}_k^\top(\mathbf{p}_j - \mathbf{p}_i)$ is determined by a learned 2D frequency vector $\mathbf{b}_k$. The score depends only on relative positions, not absolute coordinates, so it is unchanged when both points are translated together.

$$\mathbf{R}(\mathbf{p}) = \begin{pmatrix} \hat{\mathbf{R}}(\mathbf{b}_1^\top \mathbf{p}) & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \hat{\mathbf{R}}(\mathbf{b}_{d/2}^\top \mathbf{p}) \end{pmatrix}, \quad \hat{\mathbf{R}}(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

The $d$ dimensions are partitioned into $d/2$ pairs, each undergoing an independent 2D rotation. RoPE is applied at every layer, unlike SuperGlue's one-shot absolute encoding.
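A small numerical check can make the relative-position property concrete. The sketch below rotates queries and keys by their own absolute positions and confirms that the resulting score equals the relative form and is unaffected by a common shift. The helper `rotate`, the random frequencies `B`, and the dimension `d = 8` are assumptions for illustration; in the model the frequencies are learned.

```python
import numpy as np

def rotate(x, p, B):
    """Apply the block-diagonal rotation R(p) to a vector x.

    x: (d,) vector, p: (2,) position, B: (d/2, 2) frequency vectors b_k.
    """
    theta = B @ p                        # one angle per 2D block
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]     # the d/2 coordinate pairs
    out = np.empty_like(x)
    out[0::2] = cos * x_even - sin * x_odd
    out[1::2] = sin * x_even + cos * x_odd
    return out

rng = np.random.default_rng(0)
d = 8
B = rng.normal(size=(d // 2, 2))         # stand-in for the learned frequencies
q, k = rng.normal(size=d), rng.normal(size=d)
p_i, p_j = rng.uniform(size=2), rng.uniform(size=2)

# q^T R(p_j - p_i) k, as in the score definition.
score_relative = q @ rotate(k, p_j - p_i, B)
# (R(p_i) q)^T (R(p_j) k): rotate each vector by its own position.
score_factored = rotate(q, p_i, B) @ rotate(k, p_j, B)
# Shifting both points by the same offset leaves the score unchanged.
shift = np.array([0.3, -0.2])
score_shifted = rotate(q, p_i + shift, B) @ rotate(k, p_j + shift, B)

print(np.isclose(score_relative, score_factored), np.isclose(score_factored, score_shifted))
```

The factored form is convenient in practice because each point is rotated once, without building an explicit rotation for every pair.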

Cross-attention attends to all points in the other image, exploiting bidirectional symmetry:

$$a_{ij}^{IS} = \mathbf{k}_i^{I\top} \mathbf{k}_j^S \stackrel{!}{=} a_{ji}^{SI}$$

Cross-attention computes only keys (no separate queries), which makes the score symmetric. It adds no positional encoding, because relative positions are not meaningful across different images. The similarity matrix is therefore computed once and reused for both directions $A \to B$ and $B \to A$, halving the cost of this step.
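The reuse of the similarity matrix can be sketched as follows. This is a single-head toy version that omits any score scaling; the function name `bidirectional_cross_attention` and the separate value arrays `v_a`, `v_b` are assumptions for this illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_cross_attention(k_a, k_b, v_a, v_b):
    """k_a: (M, d), k_b: (N, d) keys; v_a: (M, d), v_b: (N, d) values."""
    sim = k_a @ k_b.T                       # (M, N) similarity, computed once
    msg_a = softmax(sim, axis=1) @ v_b      # A <- B: normalize each row over B
    msg_b = softmax(sim, axis=0).T @ v_a    # B <- A: normalize each column over A
    return msg_a, msg_b

rng = np.random.default_rng(1)
M, N, d = 4, 6, 8
k_a, v_a = rng.normal(size=(M, d)), rng.normal(size=(M, d))
k_b, v_b = rng.normal(size=(N, d)), rng.normal(size=(N, d))
msg_a, msg_b = bidirectional_cross_attention(k_a, k_b, v_a, v_b)

# Reference for the B <- A direction with a second, separate matrix product.
msg_b_reference = softmax(k_b @ k_a.T, axis=1) @ v_a
print(np.allclose(msg_b, msg_b_reference))
```

Only the normalization axis differs between the two directions, so the expensive matrix product is shared.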

5.3 Correspondence Prediction (Lightweight Head)

Assignment scores: pairwise similarity between projected states:

$$\mathbf{S}_{ij} = \text{Linear}(\mathbf{x}_i^A)^\top \text{Linear}(\mathbf{x}_j^B) \quad \forall (i,j) \in \mathcal{A} \times \mathcal{B}$$

The linear projection $W\mathbf{x} + b$ is shared (with independent weights per layer) and maps states into a matching subspace.

Matchability scores: unary likelihood that point $i$ has a valid correspondence:

$$\sigma_i = \text{Sigmoid}(\text{Linear}(\mathbf{x}_i)) \in [0, 1]$$

$\sigma_i \approx 1$: the point likely has a correspondence; $\sigma_i \approx 0$: it is likely unmatchable. This replaces SuperGlue's dustbin mechanism with an explicit per-point matchability prediction.

Soft partial assignment matrix: combines both:

$$\mathbf{P}_{ij} = \sigma_i^A\, \sigma_j^B\, \underset{k \in \mathcal{A}}{\text{Softmax}}(\mathbf{S}_{kj})_i\, \underset{k \in \mathcal{B}}{\text{Softmax}}(\mathbf{S}_{ik})_j$$

Four multiplicative terms: (1) $\sigma_i^A$: is point $i$ in $A$ matchable? (2) $\sigma_j^B$: is point $j$ in $B$ matchable? (3) Softmax over $\mathcal{A}$: competition within $A$ for $j$. (4) Softmax over $\mathcal{B}$: competition within $B$ for $i$. The product is high only when both points are matchable and mutually prefer each other, which is effectively a soft mutual nearest neighbor with matchability gating.

This replaces Sinkhorn iterations and disentangles "is this point matchable?" from "which point does it match?", yielding cleaner gradients.

Correspondences are extracted where $\mathbf{P}_{ij} > \tau$ and $\mathbf{P}_{ij}$ is the maximum along both its row and its column.
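The head and the extraction rule can be sketched together. In this minimal NumPy version the scores `S` and matchabilities are given directly instead of coming from learned projections; `soft_assignment`, `extract_matches`, and the toy values are assumptions for illustration, and the threshold of 0.1 follows the practical pointers later in this note.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_assignment(S, sigma_a, sigma_b):
    """Combine pairwise scores S (M, N) with matchabilities (M,) and (N,)."""
    over_a = softmax(S, axis=0)   # competition within A for each j
    over_b = softmax(S, axis=1)   # competition within B for each i
    return sigma_a[:, None] * sigma_b[None, :] * over_a * over_b

def extract_matches(P, tau=0.1):
    """Keep (i, j) where P_ij exceeds tau and is the row and column maximum."""
    row_best = P.argmax(axis=1)
    col_best = P.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(row_best)
            if col_best[j] == i and P[i, j] > tau]

# Toy example: 2 points in A, 3 in B; the third point of B is unmatchable.
S = np.array([[5.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]])
sigma_a = np.array([0.90, 0.95])
sigma_b = np.array([0.90, 0.90, 0.05])

P = soft_assignment(S, sigma_a, sigma_b)
print(extract_matches(P))
```

Because every factor lies in $[0, 1]$, so does each entry of the assignment, and no iterative normalization is needed.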

5.4 Adaptive Depth and Width

Confidence classifier: at each layer $\ell$, an MLP predicts a per-point confidence:

$$c_i = \text{Sigmoid}(\text{MLP}(\mathbf{x}_i)) \in [0, 1]$$

Early stopping: halt if a sufficient fraction $\alpha$ of all points are confident:

$$\text{exit} = \left(\frac{1}{N+M} \sum_{I \in \{A,B\}} \sum_{i \in I} \llbracket c_i^\ell > \lambda_\ell \rrbracket \right) > \alpha$$

The confidence threshold decays exponentially across layers:

$$\lambda_\ell = 0.8 + 0.1\, e^{-4\ell/L}$$

Earlier layers need higher thresholds because their confidence estimates are less calibrated. $\alpha = 95\%$ yields ~25% runtime reduction with virtually no accuracy loss.
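The threshold schedule and the exit test are simple enough to write out directly. The sketch below uses only the standard library; the function names and the convention that `layer` counts from the start of the stack are assumptions for illustration.

```python
import math

def confidence_threshold(layer, num_layers=9):
    """Threshold lambda for a given layer: 0.8 + 0.1 * exp(-4 * layer / L)."""
    return 0.8 + 0.1 * math.exp(-4.0 * layer / num_layers)

def should_stop(conf_a, conf_b, layer, num_layers=9, alpha=0.95):
    """Exit when more than a fraction alpha of all points exceed the threshold."""
    lam = confidence_threshold(layer, num_layers)
    conf = list(conf_a) + list(conf_b)
    confident = sum(c > lam for c in conf)
    return confident / len(conf) > alpha

# The threshold falls from 0.9 toward 0.8 as depth increases.
for layer in (0, 3, 6, 9):
    print(layer, round(confidence_threshold(layer), 4))

# All 20 points are confident, so inference can stop at this layer.
print(should_stop([0.99] * 10, [0.99] * 10, layer=3))
# Two of 20 points are unsure, which is below the 95% requirement.
print(should_stop([0.99] * 10, [0.99] * 8 + [0.5] * 2, layer=3))
```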

Point pruning: points that are confidently predicted as unmatchable ($c_i^\ell > \lambda_\ell$ and $\sigma_i^\ell < \beta$, with $\beta = 0.01$) are removed from subsequent layers. After a few layers, >30% of keypoints are pruned.
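The pruning rule amounts to a boolean mask over the points of one image. A minimal NumPy sketch follows; `keep_mask`, the threshold `lam=0.85`, and the toy values are hypothetical and chosen only to exercise each case.

```python
import numpy as np

def keep_mask(confidence, matchability, lam, beta=0.01):
    """True for points that stay in the network after this layer."""
    prune = (confidence > lam) & (matchability < beta)
    return ~prune

confidence = np.array([0.99, 0.95, 0.40, 0.97])
matchability = np.array([0.90, 0.001, 0.002, 0.50])
states = np.arange(8.0).reshape(4, 2)   # one row of state per point

keep = keep_mask(confidence, matchability, lam=0.85)
pruned_states = states[keep]

# Attention cost grows with the square of the number of points.
cost_ratio = keep.sum() ** 2 / len(keep) ** 2
print(keep, pruned_states.shape, cost_ratio)
```

Only the second point is dropped: the third also looks unmatchable, but its prediction is not yet confident, so it stays. Removing one of four points already reduces the quadratic self-attention cost to 9/16 of the original.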

5.5 Supervision

Two-stage training: first train matching, then train the confidence classifier.

Matching loss: average over all layers (deep supervision):

$$\text{loss} = -\frac{1}{L}\sum_\ell \left( \frac{1}{|\mathcal{M}|}\sum_{(i,j)\in\mathcal{M}} \log\, {}^\ell\mathbf{P}_{ij} + \frac{1}{2|\bar{\mathcal{A}}|}\sum_{i\in\bar{\mathcal{A}}} \log(1 - {}^\ell\sigma_i^A) + \frac{1}{2|\bar{\mathcal{B}}|}\sum_{j\in\bar{\mathcal{B}}} \log(1 - {}^\ell\sigma_j^B) \right)$$

Term 1: $\log \mathbf{P}_{ij}$ pushes $\mathbf{P}_{ij} \to 1$ for true matches $\mathcal{M}$. Terms 2 and 3: $\log(1-\sigma)$ pushes $\sigma \to 0$ for unmatchable points ($\bar{\mathcal{A}}$, $\bar{\mathcal{B}}$). The $1/L$ average over layers is the deep supervision: every layer is trained to produce good predictions, which enables early exit.

Confidence classifier loss — binary cross-entropy at each layer, predicting whether that layer's assignment matches the final layer's assignment.
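To make the three terms of the matching loss concrete, here is a minimal NumPy sketch for a single layer plus the average over layers. It assumes that the match set and both unmatchable sets are non-empty; the function names, the `eps` stabilizer, and the toy predictions are assumptions for illustration.

```python
import numpy as np

def layer_loss(P, sigma_a, sigma_b, matches, unmatched_a, unmatched_b, eps=1e-9):
    """Negative log-likelihood of one layer's predictions.

    P: (M, N) soft assignment, sigma_a: (M,), sigma_b: (N,) matchabilities,
    matches: list of ground-truth (i, j), unmatched_*: lists of point indices.
    """
    pos = np.mean([np.log(P[i, j] + eps) for i, j in matches])
    neg_a = np.mean(np.log(1.0 - sigma_a[unmatched_a] + eps))
    neg_b = np.mean(np.log(1.0 - sigma_b[unmatched_b] + eps))
    return float(-(pos + 0.5 * neg_a + 0.5 * neg_b))

def deep_supervision_loss(per_layer_losses):
    """Average the per-layer losses (the 1/L factor)."""
    return float(np.mean(per_layer_losses))

matches, unmatched_a, unmatched_b = [(0, 0)], [1], [1]

# A layer with good predictions: high P on the true match, low sigma elsewhere.
good = layer_loss(np.array([[0.9, 0.0], [0.0, 0.0]]),
                  np.array([0.95, 0.05]), np.array([0.95, 0.05]),
                  matches, unmatched_a, unmatched_b)
# A layer with poor predictions, as might be seen early in the network.
bad = layer_loss(np.array([[0.2, 0.1], [0.1, 0.1]]),
                 np.array([0.5, 0.6]), np.array([0.5, 0.6]),
                 matches, unmatched_a, unmatched_b)

total = deep_supervision_loss([bad, good])
print(round(good, 3), round(bad, 3), round(total, 3))
```

Averaging over layers means an early layer with poor predictions still contributes gradient, which is what makes its output usable for early exit.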

6. Training Setup & Datasets

Pre-training on synthetic homographies:

  • 170k images from Oxford-Paris 1M distractors (150k/10k/10k split)
  • Random homographies + strong photometric augmentations (blur, hue, saturation, illumination, noise, occlusion)
  • 512/1024/1024 keypoints for SuperPoint/SIFT/DISK; batch size 64, lr 0.0001
  • Achieves >99% recall and >90% precision; 2 GPU-days on 2× RTX 3090

Fine-tuning on MegaDepth:

  • 1M crowd-sourced images of 196 landmarks with SfM poses and dense depth
  • Image pairs binned by covisibility: [0.1, 0.3], [0.3, 0.5], [0.5, 0.7]
  • 2048 keypoints/image; batch size 32; lr 1e-5 with exponential decay
  • 50 epochs (~2 days on 2× RTX 3090)
  • Large epipolar-error points labeled unmatchable (incomplete MegaDepth depth maps)

Implementation: $L=9$ layers, 4 attention heads, $d=256$. Gradient checkpointing and mixed precision keep training within 24 GB of VRAM.

7. Main Experiments & Quantitative Results

7.1 Homography Estimation (HPatches)

| Features + Matcher | Recall (R) | Precision (P) | AUC-RANSAC @1px / @5px | AUC-DLT @1px / @5px |
|---|---|---|---|---|
| LoFTR (dense) | - | 92.7 | 41.5 / 78.8 | 38.5 / 70.6 |
| NN+mutual (SP) | 72.7 | 67.2 | 35.0 / 75.3 | 0.0 / 2.0 |
| SuperGlue (SP) | 94.9 | 87.4 | 38.3 / 79.3 | 33.8 / 76.7 |
| SGMNet (SP) | 95.5 | 83.0 | 38.6 / 79.0 | 31.7 / 76.0 |
| LightGlue (SP) | 94.3 | 88.9 | 38.3 / 79.6 | 35.9 / 78.6 |

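LightGlue achieves the highest precision among the sparse (SuperPoint-based) matchers and competitive AUC. Its correspondences are clean enough that a simple DLT solver becomes competitive with the more expensive MAGSAC.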

7.2 Relative Pose Estimation (MegaDepth-1500)

| Features + Matcher | RANSAC AUC 5°/10°/20° | LO-RANSAC AUC 5°/10°/20° | Time (ms) |
|---|---|---|---|
| LoFTR (dense) | 52.8 / 69.2 / 81.2 | 66.4 / 78.6 / 86.5 | 181 |
| SuperGlue (SP) | 49.7 / 67.1 / 80.6 | 65.8 / 78.7 / 87.5 | 70.0 |
| LightGlue (SP) | 49.9 / 67.0 / 80.1 | 66.7 / 79.3 / 87.9 | 44.2 |
| ↳ adaptive | 49.4 / 67.2 / 80.1 | 66.3 / 79.0 / 87.9 | 31.4 |
| NN+ratio (DISK) | 38.1 / 55.4 / 69.6 | 57.2 / 69.5 / 78.6 | 7.4 |
| LightGlue (DISK) | 43.5 / 61.0 / 75.3 | 61.3 / 74.3 / 83.8 | 44.5 |

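With SuperPoint features, LightGlue is on par with SuperGlue under RANSAC and slightly ahead under LO-RANSAC, at roughly 37% lower inference time (44.2 ms vs. 70.0 ms). The adaptive variant is >2× faster than SuperGlue (31.4 ms).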
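The relative speed figures follow directly from the timing column of the table. A quick standard-library check, with helper names chosen for this example:

```python
# Matching times in milliseconds, copied from the SuperPoint rows of the table.
times_ms = {"SuperGlue": 70.0, "LightGlue": 44.2, "LightGlue adaptive": 31.4}

def time_reduction(baseline_ms, new_ms):
    """Fraction of the baseline time that is saved."""
    return 1.0 - new_ms / baseline_ms

def speedup(baseline_ms, new_ms):
    """How many times faster the new method is than the baseline."""
    return baseline_ms / new_ms

baseline = times_ms["SuperGlue"]
for name in ("LightGlue", "LightGlue adaptive"):
    t = times_ms[name]
    print(f"{name}: {time_reduction(baseline, t):.1%} less time, "
          f"{speedup(baseline, t):.2f}x speedup")
```

These timings are averages over the benchmark, so the gain on any single pair depends on how early the adaptive variant exits.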

7.3 Outdoor Visual Localization (Aachen Day-Night)

| SuperPoint + Matcher | Day (0.25m,2°) / (0.5m,5°) / (1.0m,10°) | Night (same thresholds) | Pairs/sec |
|---|---|---|---|
| SuperGlue | 88.2 / 95.5 / 98.7 | 86.7 / 92.9 / 100 | 6.5 |
| LightGlue | 89.2 / 95.4 / 98.5 | 87.8 / 93.9 / 100 | 17.2 (26.1 optimized) |

Matches SuperGlue's accuracy at about 2.6× the throughput (17.2 vs. 6.5 pairs/sec), or 4× with the optimized variant (26.1 pairs/sec).

7.4 Image Matching Challenge

  • IMC 2020: LightGlue yields better poses than SuperGlue on multi-view track, reduces matching time significantly
  • IMC 2021: Outperforms sparse baselines, surpasses tuned baselines on public leaderboard
  • IMC 2023: SuperPoint+LightGlue reaches 38.4 / 46.1 (public/private) — +2.3% over SuperPoint+SuperGlue

8. Ablations, Limitations & Practical Pointers

Ablation study (synthetic homographies):

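| Architecture | Precision | Recall | Time (ms) |
|---|---|---|---|
| SuperGlue | 74.6 | 90.5 | 29.1 |
| LightGlue (full) | 86.8 | 96.3 | 19.4 |
| ↳ no matchability | 67.4 | 97.0 | 18.9 |
| ↳ absolute positions | 84.2 | 94.7 | 18.7 |
| ↳ full cross-attention | 86.6 | 96.1 | 22.8 |
| ↳ early layer (#5/9) | 78.1 | 92.7 | 11.9 |

  • Matchability is critical for precision: removing it drops precision from 86.8 to 67.4. The full model, which disentangles similarity from matchability, is 12.2 points above SuperGlue (86.8 vs. 74.6)
  • Rotary encoding outperforms absolute positions (+2.6 points precision, +1.6 points recall)
  • Bidirectional cross-attention saves time (19.4 ms vs. 22.8 ms with full cross-attention) with no accuracy drop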

Adaptive depth & width impact:

| Difficulty | Avg. stopping layer | Unmatchable pruned (%) | Speedup |
|---|---|---|---|
| Easy | 4.7 | 19.8 | 1.86× |
| Medium | 5.5 | 23.4 | 1.33× |
| Hard | 6.9 | 27.9 | 1.16× |
| Average | 5.7 | 23.7 | 1.45× |

Limitations:

  • On InLoc (indoor), competitive but not clearly better — failures from repeated-texture matching instead of geometric structure
  • Sparse features: accuracy bounded by keypoint detector quality
  • Softmax over both dimensions remains a computational bottleneck

Practical Pointers:

  • Compatible with SuperPoint, SIFT, and DISK features (plug-and-play)
  • Optimized variant uses FlashAttention-style efficient self-attention
  • Pre-training on synthetic homographies is critical — do not skip
  • Filter threshold $\tau = 0.1$ for correspondences; tune RANSAC threshold per method for fair evaluation
  • Training with 2k keypoints/image instead of 1k improves accuracy

9. Conclusions & Future Work

LightGlue is a drop-in replacement for SuperGlue that is faster, more accurate, and far easier to train. Key innovation: adaptivity — depth (early stopping when confident) and width (pruning unmatchable points) adjust to each image pair's difficulty. The disentangled prediction head (similarity + matchability) replaces expensive Sinkhorn with a simpler formulation yielding cleaner gradients and enabling deep supervision. On easy pairs, LightGlue can stop after 2–3 layers and run at ~100 FPS. Future directions: a dedicated bidirectional softmax kernel to eliminate the remaining bottleneck, and extending the adaptive framework to dense matching.

References

  • Lindenberger, P., Sarlin, P.-E., & Pollefeys, M. (2023). LightGlue: Local Feature Matching at Light Speed. ICCV 2023. arXiv:2306.13643