Description: Paper note on ALIKED, an efficient keypoint and descriptor extraction network using sparse deformable descriptor heads (IEEE TIM 2023)
My Notion Note ID: K2E-B-F2-1
Created: 2024-08-11
Updated: 2026-06-06
License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io

1. Paper Information
2. Summary
3. Key Contributions
4. Background & Related Work
5. Method Details & Key Equations
6. Training Setup & Datasets
7. Main Experiments & Quantitative Results
8. Ablations, Limitations & Practical Pointers
9. Conclusions & Future Work
References

1. Paper Information

Title: ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation
Authors: Xiaoming Zhao, Xingming Wu, Weihai Chen*, Peter C. Y. Chen, Qingsong Xu, Zhengguo Li
Affiliations: Beihang University; National University of Singapore; University of Macau; A*STAR Institute for Infocomm Research
Paper: IEEE TIM 2023 / arXiv:2304.03608
Code: github.com/Shiaoming/ALIKED

2. Summary

ALIKED is an efficient network for joint keypoint detection and descriptor extraction that builds on the lightweight ALIKE architecture. Its core innovation is the Sparse Deformable Descriptor Head (SDDH), which extracts descriptors only at detected sparse keypoints rather than computing a dense descriptor map, and uses deformable sampling positions to model arbitrary geometric transformations. The paper also relaxes the Neural Reprojection Error (NRE) loss from dense to sparse, enabling training without dense descriptor maps. ALIKED achieves state-of-the-art matching accuracy at significantly lower computational cost: the tiny variant, ALIKED-T(16), has only 0.192M parameters and runs at 125.87 FPS.

3. Key Contributions

Sparse Deformable Descriptor Head (SDDH): Extracts deformable descriptors at sparse keypoint locations instead of computing dense descriptor maps, greatly reducing redundant computation while allowing modeling of any geometric transformation (not just affine).
Sparse NRE loss: Relaxes the Neural Reprojection Error loss from dense to sparse by constructing sparse probability vectors from descriptor similarity, enabling descriptor training without dense descriptor maps while reducing GPU memory.
Reliable loss: A reliability loss that constrains the score map based on matching similarity, ensuring that areas of low texture are suppressed.
Efficient and scalable: Three model variants (Tiny, Normal, Large) offering different speed-accuracy trade-offs; ALIKED-T(16) achieves real-time performance (125.87 FPS) with only 0.192M parameters and 1.37 GFLOPs.

4. Background & Related Work

Geometric invariant descriptors: Classical methods (SIFT, ORB) achieve scale/orientation invariance through explicit estimation. Learning-based methods use either patch-based approaches or data augmentation. Some methods (LIFT, AffNet, LF-Net) explicitly estimate affine parameters, while others (GIFT, HDD-Net) use spatial transformers or rotating kernels. Deformable convolution networks (DCN) can model arbitrary geometric transformations by learning per-pixel offsets.

Joint keypoint and descriptor learning: Map-based methods (SuperPoint, R2D2, ASLFeat, DISK, ALIKE) jointly estimate a score map and descriptor map via a Score Map Head (SMH) and Descriptor Map Head (DMH). Most extract dense descriptor maps, which is computationally expensive. D2-Net detects keypoints directly on the feature map but sacrifices localization accuracy. ALIKE uses differentiable keypoint detection (DKD) for accurate and lightweight extraction — ALIKED builds upon ALIKE.

Deformable convolutions: DCN learns offsets for convolution kernels, enabling modeling of geometric transformations. Used in object detection, segmentation, and action recognition. ASLFeat uses DCN for dense features, but ALIKED's SDDH is specifically designed for efficient sparse descriptor extraction.

5. Method Details & Key Equations

5.1 Network Architecture Overview

ALIKED consists of three components:

Feature encoding: Four encoding blocks transform the input image $I \in \mathbb{R}^{H \times W \times 3}$ into multi-scale features $\{F_1, F_2, F_3, F_4\}$. Block 1 uses two vanilla conv3×3 layers. Block 2 uses pooling2 + conv3×3. Blocks 3 & 4 use pooling4 + $3 \times 3$ deformable convolutions (DCN) for geometric invariance. SELU activation throughout.
Notation: Pooling2/Pooling4 = 2×2/4×4 average pooling, reducing spatial resolution by 2×/4×. DCN = Deformable Convolution Network, i.e. a standard convolution with learned per-pixel offsets.

Feature aggregation: Four ublocks (conv1×1 + upsample) align and concatenate $\{F_1^u, F_2^u, F_3^u, F_4^u\}$ into the aggregated feature $F$.
Each ublock uses conv1×1 to unify channel dimensions and bilinear interpolation to unify spatial resolution.

Keypoint and descriptor extraction:
- SMH (Score Map Head): conv1×1 → 2× conv3×3 → conv3×3 + sigmoid → score map $S \in \mathbb{R}^{H \times W}$
- DKD (Differentiable Keypoint Detection): NMS → threshold → softargmax on local patches → subpixel keypoints $P = \{p_1, p_2, \ldots\}$
- SDDH: Extracts deformable descriptors at keypoints (see § 5.2)
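
The softargmax step of DKD can be illustrated with a small dependency-free sketch. It assumes a square score patch centered on an NMS maximum and a softmax temperature equal to $t_{det}$; the function name, the example patch, and the integer keypoint location are hypothetical and not taken from the ALIKED code.

```python
import math

def soft_argmax_offset(patch, temperature=0.1):
    """Expected (dx, dy) offset of the score peak inside a square patch.

    patch: (2r+1) x (2r+1) nested list of scores centered on an NMS maximum.
    temperature: softmax temperature (assumed here to play the role of t_det).
    """
    radius = len(patch) // 2
    peak = max(max(row) for row in patch)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp((s - peak) / temperature) for s in row] for row in patch]
    total = sum(sum(row) for row in weights)
    dx = sum(w * (j - radius) for row in weights for j, w in enumerate(row)) / total
    dy = sum(w * (i - radius) for i, row in enumerate(weights) for w in row) / total
    return dx, dy

# 5x5 patch (radius 2) whose right neighbor is almost as strong as the center.
patch = [[0.0] * 5 for _ in range(5)]
patch[2][2] = 0.9
patch[2][3] = 0.8
dx, dy = soft_argmax_offset(patch)

# Hypothetical integer NMS location (x=10, y=20) refined to subpixel accuracy.
keypoint = (10 + dx, 20 + dy)
```

Because the offset is an expectation over softmax weights, it is differentiable with respect to the scores, which is what lets a reprojection loss on keypoint positions train the score map.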

Model variants (Table II):

Model	$c_1$	$c_2$	$c_3$	$c_4$	dim
Tiny (-T)	8	16	32	64	64
Normal (-N)	16	32	64	128	128
Large (-L)	32	64	128	128	128*

* The Large model's descriptor head has two $1 \times 1$ conv layers.

5.2 Sparse Deformable Descriptor Head (SDDH)

Motivation: Traditional Descriptor Map Head (DMH) computes a dense descriptor map over the entire feature map — expensive and representationally limited. SDDH operates only at sparse keypoints.

Deformable transformation — per-pixel offset instead of 6-DoF affine:

\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}

$[\Delta x, \Delta y]^T$ = learned per-pixel offset; DoF equals number of pixels (far more flexible than affine's 6 DoF).

SDDH pipeline — for keypoint $p \in \mathbb{R}^2$ :

Extract feature patch $F_{K \times K}$ centered at $p$ .
Estimate $M$ deformable sample positions:

p^s = \text{conv1x1}(\text{SELU}(\text{convKxK}(F_{K \times K})))

The conv1x1 has $2M$ output channels; convKxK uses no padding, so the $K \times K$ patch collapses to a $1 \times 1$ output, which is reshaped to $p^s \in \mathbb{R}^{M \times 2}$.

Sample $M$ supporting features from feature map via bilinear sampling.
Compute descriptor:

d = \sum_{i=1}^{M} w_M(p_i)\, \Phi(F(p + p_i^s))

where $\Phi(x) = \text{SELU}(\text{conv1x1}(x))$ and $w_M$ = weights over $M$ flexible positions.

Key differences from DCN: (1) SDDH operates on sparse keypoints, not dense maps; (2) SDDH estimates $M$ arbitrary positions, DCN is limited to $K \times K$ grid; (3) more sophisticated position estimation network.
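
The sampling and aggregation steps of the SDDH pipeline can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the offsets $p^s$ are passed in as constants instead of being predicted by the convKxK → SELU → conv1x1 network, $w_M$ is reduced to one scalar weight per sample position, and `proj` is a hypothetical stand-in for the conv1x1 inside $\Phi$.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so the 2x2 neighborhood stays inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * feat[y0][x0][c] + ax * feat[y0][x0 + 1][c])
        + ay * ((1 - ax) * feat[y0 + 1][x0][c] + ax * feat[y0 + 1][x0 + 1][c])
        for c in range(len(feat[0][0]))
    ]

def sddh_descriptor(feat, keypoint, offsets, weights, proj):
    """d = sum_i w_i * SELU(proj @ F(p + p_i^s)) with scalar w_i (simplification)."""
    px, py = keypoint
    d = [0.0] * len(proj)
    for (ox, oy), w in zip(offsets, weights):
        f = bilinear_sample(feat, px + ox, py + oy)
        phi = [selu(sum(a * b for a, b in zip(row, f))) for row in proj]
        d = [di + w * p for di, p in zip(d, phi)]
    return d

# Toy 4x4 map with 2 channels storing each pixel's own (x, y) coordinates,
# so every sampled feature equals the sampled position.
feat = [[[float(x), float(y)] for x in range(4)] for y in range(4)]
proj = [[1.0, 0.0], [0.0, 1.0]]          # identity projection (hypothetical)
offsets = [(-0.5, 0.0), (0.5, 0.5)]      # M = 2 hand-picked offsets
weights = [0.5, 0.5]
d = sddh_descriptor(feat, (1.5, 1.5), offsets, weights, proj)
```

Only $M$ bilinear lookups and $M$ small projections are needed per keypoint, which is where the saving over a dense descriptor map comes from.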

5.3 Loss Functions

Reprojection loss — differentiable 3D projection:

p_{AB} = \pi(d_A R_{AB} \pi^{-1}(p_A) + t_{AB})

\mathcal{L}_{rp}(p_A, p_B) = \frac{1}{2}(\|p_A - p_{BA}\| + \|p_B - p_{AB}\|)

$d_A$ = depth; $R_{AB}, t_{AB}$ = relative rotation and translation; bidirectional (A→B and B→A) for stability.
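
As a worked example of the reprojection loss, the sketch below warps keypoints between two views and averages the two residuals. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)` for $\pi$ and $\pi^{-1}$, which the formula leaves implicit; all function names and numeric values are hypothetical.

```python
import math

def backproject(p, depth, K):
    """pi^{-1} scaled by depth: pixel (u, v) -> 3D point in the camera frame."""
    fx, fy, cx, cy = K
    u, v = p
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def project(X, K):
    """pi: 3D point in the camera frame -> pixel (u, v)."""
    fx, fy, cx, cy = K
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)

def warp(p_src, depth_src, R, t, K_src, K_dst):
    """p_dst = pi(d * R * pi^{-1}(p_src) + t)."""
    X = backproject(p_src, depth_src, K_src)
    X_dst = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    return project(X_dst, K_dst)

def reprojection_loss(p_a, p_b, p_ab, p_ba):
    """Symmetric loss: 0.5 * (||p_A - p_BA|| + ||p_B - p_AB||)."""
    return 0.5 * (math.dist(p_a, p_ba) + math.dist(p_b, p_ab))

K = (500.0, 500.0, 320.0, 240.0)             # shared intrinsics (assumed)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_ab, t_ba = [0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]  # pure sideways translation

p_a, p_b = (320.0, 240.0), (346.0, 240.0)    # detected keypoints, depth 2.0 each
p_ab = warp(p_a, 2.0, I3, t_ab, K, K)        # A projected into B
p_ba = warp(p_b, 2.0, I3, t_ba, K, K)        # B projected into A
loss = reprojection_loss(p_a, p_b, p_ab, p_ba)
```

In this toy setup each keypoint lands one pixel away from its counterpart, so both residuals are about 1 px and the loss is about 1.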

Dispersity peak loss — maximizes score peakiness at keypoints:

\mathcal{L}_{pk}(p) = \text{mean}(\text{softmax}(s_p) \cdot \|p - c\|)

$s_p \in S_p$ = scores in a local patch; $c$ = pixel coordinates.
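
A small sketch can make the effect of the dispersity peak loss concrete. It follows the formula above on a single patch, with one addition that is an assumption of this sketch: the softmax uses a temperature (default 0.1, mirroring $t_{det}$). The function name and the two example patches are hypothetical.

```python
import math

def peak_loss(patch, keypoint, temperature=0.1):
    """mean(softmax(s_p) * ||p - c||) over all pixels c of a local score patch.

    patch: square nested list of scores; keypoint: (x, y) in patch coordinates.
    """
    cells = [(s, math.dist((x, y), keypoint))
             for y, row in enumerate(patch) for x, s in enumerate(row)]
    top = max(s for s, _ in cells)
    # Stable softmax over all scores in the patch.
    exps = [math.exp((s - top) / temperature) for s, _ in cells]
    z = sum(exps)
    return sum(e / z * dist for e, (_, dist) in zip(exps, cells)) / len(cells)

flat = [[0.5] * 5 for _ in range(5)]      # no distinct peak
peaky = [[0.0] * 5 for _ in range(5)]     # all score mass at the keypoint
peaky[2][2] = 1.0

loss_flat = peak_loss(flat, (2.0, 2.0))
loss_peaky = peak_loss(peaky, (2.0, 2.0))
```

The flat patch spreads its softmax mass over distant pixels and is penalized, while the peaky patch concentrates its mass at zero distance and yields a loss close to zero.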

Sparse NRE loss — relaxes Dense Neural Reprojection Error to work without a dense descriptor map:

\text{sim}(d_A, D_B) = D_B d_A

q_m(d_A, D_B) = \text{softmax}((\text{sim}(d_A, D_B) - 1) / t_{des})

\mathcal{L}_{ds}(p_A, \mathcal{I}_B) = CE(q_r(p_A, P_B) \| q_m(d_A, D_B)) = -\ln(q_m(d_A, d_B))

$q_r$ = reprojection probability (binary ground truth); $q_m$ = descriptor-similarity matching probability; $t_{des} = 0.1$ (temperature, lower → sharper distribution). Loss is cross-entropy pushing the correct match to dominate.
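
The sparse NRE loss for one keypoint reduces to a softmax cross-entropy over the sparse descriptors of the other image. The sketch below assumes unit-norm descriptors (so that the similarity is at most 1) and that the index of the ground-truth match in $D_B$ is known from reprojection; the function name and the toy descriptors are hypothetical.

```python
import math

def sparse_nre_loss(d_a, D_b, match_index, t_des=0.1):
    """-ln q_m(d_A, d_B) with q_m = softmax((D_B d_A - 1) / t_des)."""
    sims = [sum(x * y for x, y in zip(d_b, d_a)) for d_b in D_b]
    logits = [(s - 1.0) / t_des for s in sims]
    # log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[match_index] - log_z)

d_a = [1.0, 0.0]
D_b = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # three sparse descriptors in image B

loss_correct = sparse_nre_loss(d_a, D_b, match_index=0)  # true match is most similar
loss_wrong = sparse_nre_loss(d_a, D_b, match_index=1)    # true match is dissimilar
```

Since the softmax runs over the detected keypoints of image B only, the cost scales with the number of keypoints instead of the number of pixels, which is the memory saving the sparse relaxation is after.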

Reliable loss — suppresses low-texture detections:

r(p_A, \mathcal{I}_B) = \text{softmax}(\text{sim}(d_A, D_B) / t_{rel})

\mathcal{L}_{re}(S_A, \mathcal{I}_B) = \frac{1}{\hat{S}_A} \sum_{\substack{p_A \in P_A, \\ s_A = S_A(p_A)}} (1 - r(p_A, \mathcal{I}_B)) \cdot s_A

$t_{rel} = 1.0$ . Penalizes high-scoring keypoints with low matching reliability.
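
The reliable loss can be sketched as follows, with two assumptions that go beyond what is written above and should be checked against the paper: the reliability of a keypoint is read from the softmax vector at the index of its ground-truth match, and $\hat{S}_A$ is taken to be the sum of the keypoint scores. All names and numbers are hypothetical.

```python
import math

def reliable_loss(scores_a, sims, match_indices, t_rel=1.0):
    """Score-weighted average of (1 - reliability) over the keypoints of image A.

    scores_a[k]:      detection score s_A of keypoint k
    sims[k]:          similarities between d_A[k] and every descriptor in D_B
    match_indices[k]: index of the ground-truth match in D_B (assumption)
    """
    total_score = sum(scores_a)  # assumed definition of the normalizer
    loss = 0.0
    for s, sim_row, m in zip(scores_a, sims, match_indices):
        top = max(sim_row)
        exps = [math.exp((v - top) / t_rel) for v in sim_row]
        reliability = exps[m] / sum(exps)
        loss += (1.0 - reliability) * s
    return loss / total_score

sims = [
    [1.0, 0.0, 0.0],   # distinctive keypoint: one clear match
    [0.4, 0.4, 0.4],   # low-texture keypoint: equally similar to everything
]
matches = [0, 1]

loss_high_score = reliable_loss([0.9, 0.9], sims, matches)
loss_low_score = reliable_loss([0.9, 0.1], sims, matches)
```

Lowering the score of the ambiguous keypoint reduces the loss, which is the mechanism that pushes the score map down in low-texture regions.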

Overall loss:

\mathcal{L} = \omega_{rp} \mathcal{L}_{rp} + \omega_{pk} \mathcal{L}_{pk} + \omega_{ds} \mathcal{L}_{ds} + \omega_{re} \mathcal{L}_{re}

$\omega_{pk} = 0.5$ , $\omega_{ds} = 5$ , $\omega_{rp} = \omega_{re} = 1$ .
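
As a quick arithmetic check of the weighting, the snippet below combines four loss terms with the weights listed above. The individual loss values are made up for illustration only.

```python
# Weights from the overall loss definition.
WEIGHTS = {"rp": 1.0, "pk": 0.5, "ds": 5.0, "re": 1.0}

def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of the reprojection, peak, sparse NRE and reliable losses."""
    return sum(weights[name] * terms[name] for name in weights)

# Hypothetical per-term values for a single training step.
terms = {"rp": 0.8, "pk": 0.2, "ds": 0.3, "re": 0.5}
loss = total_loss(terms)   # 0.8 + 0.1 + 1.5 + 0.5
```

With $\omega_{ds} = 5$, the descriptor term contributes the largest share here even though its raw value is smaller than the reprojection term.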

6. Training Setup & Datasets

Training data:

MegaDepth: 135 scenes of tourist landmarks with COLMAP SfM depth and poses; 10k pairs/scene; excludes IMW2020 validation/test scenes
R2D2 homographic dataset: Synthetic pairs on Oxford-Paris retrieval + Aachen datasets
Perspective and homographic datasets trained together

Training details:

$800 \times 800$ images; batch size 2 with gradients accumulated over 6 batches (effective batch size 12)
DKD radius 2 px; reprojection threshold 5 px
$t_{det} = 0.1$ , $t_{des} = 0.1$ , $t_{rel} = 1$ ; Adam optimizer with $(\beta_1, \beta_2) = (0.9, 0.999)$
Top 400 + 400 random keypoints; NMS to avoid repeats
100K steps; 3 trained variants: ALIKED-T(16), ALIKED-N(16), ALIKED-N(32)

Hardware: Intel i7-10700F, RTX 2060, CUDA 10.2, PyTorch 1.11.0

7. Main Experiments & Quantitative Results

7.1 Real-time Performance (Table IV)

Model	Params (M)	GFLOPs	FPS	MMA@3	MHA@3
SuperPoint	1.301	26.11	52.63	65.37%	70.19%
R2D2(MS)	0.484	464.55	4.10	75.77%	71.48%
ASLFeat(MS)	0.823	44.24	7.10*	72.44%	73.52%
DISK	1.092	98.97	11.81	77.59%	70.56%
ALIKE-N	0.318	7.91	84.96	70.78%	75.74%
ALIKE-L	0.653	19.68	56.66	70.50%	76.85%
ALIKED-T(16)	0.192	1.37	125.87	72.99%	78.70%
ALIKED-N(16)	0.677	4.05	77.40	74.43%	77.22%
ALIKED-N(32)	0.980	4.62	75.64	75.23%	74.44%

ALIKED-T(16): fewest params (0.192M), lowest GFLOPs (1.37), highest FPS (125.87), best MHA@3 (78.70%) on HPatches.

7.2 3D Reconstruction (IMW Test Set, Table V)

Up to 2048 keypoints:

Stereo: ALIKED-N(16) best mAA(5°) = 39.53%; ALIKED-T(16) best PPC = 36.77
Multiview: DISK best mAA(5°) = 63.25%; ALIKED-N(32) close at 61.67%; ALIKED-T(16) best PPC = 51.74
ALIKED-N(16) outperforms DISK on stereo: +1.5% Rep, +0.81% mAA(5°), +1.06% mAA(10°)

7.3 Feature Matching (FM-Bench, Table VI)

TUM SLAM, KITTI, Tanks & Temples, CPC datasets:

ALIKED-N(16): best recall on TUM and T&T
ALIKED-N(32): best recall on KITTI and best overall matching performance

7.4 Visual Relocalization (Aachen Day-Night, Table VII)

Method	1024 kpts: 0.25m,2° / 0.5m,5° / 5m,10°	2048 kpts: 0.25m,2° / 0.5m,5° / 5m,10°
SuperPoint	58.2 / 66.3 / 72.4	69.4 / 78.6 / 87.8
ALIKE-N	59.2 / 73.5 / 83.7	68.4 / 84.7 / 96.9
DISK	60.2 / 72.4 / 81.6	70.4 / 82.7 / 94.9
ALIKED-T(16)	70.4 / 87.8 / 98.0	78.6 / 87.8 / 98.0
ALIKED-N(16)	73.5 / 85.7 / 98.0	80.6 / 87.8 / 99.0
ALIKED-N(32)	77.6 / 88.8 / 100.0	76.5 / 87.8 / 100.0

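ALIKED-N(32) is best at all three thresholds with 1024 keypoints. With 2048 keypoints the picture is mixed: ALIKED-N(16) leads at the tightest threshold (80.6 vs. 76.5 for ALIKED-N(32)), all three ALIKED variants tie at 87.8 for the middle threshold, and ALIKED-N(32) leads at the loosest (100.0).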

8. Ablations, Limitations & Practical Pointers

NRE Loss and Training Data (Table VIII):

Dense NRE slightly better, but SDDH requires sparse NRE; increasing resolution 480→800 closes the gap
Homography dataset improves HPatches but slightly degrades IMW-validation

Network Architecture (Table IX) — progressive design ablation:

AVG+SELU (vs max-pool+ReLU): +3.4% mAA(10°) on IMW-val
Score Head SH3: best cost vs. quality balance
2×DCN in blocks 3&4: +0.1 GFLOPs, +0.8ms → +2.81% MS@3 on HPatches, +4.36% mAA(10°) on IMW
SDDH K=3, M=16: best trade-off — 65.39% mAA(10°) at 12.92ms total
SDDH K=3, M=32: best overall (67.78% mAA(10°)) at slightly higher cost

Limitations:

Struggles with large simultaneous scale + viewpoint differences (shared with SOTA)
SDDH has one deformable layer only — complex deformations not fully captured
32-bit floating-point descriptors are not ideal for mobile/embedded deployment
Evaluated with a mutual nearest neighbor (mNN) matcher rather than a learned one, so results reflect descriptor quality alone
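
For reference, mutual nearest neighbor (mNN) matching keeps a pair only when each descriptor is the other's best match. The sketch below is a generic version using dot-product similarity on toy descriptors; it is not the evaluation code used in the paper, and the function name is hypothetical.

```python
def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where a_i and b_j are each other's best match."""
    def sim(u, v):
        return sum(x * y for x, y in zip(u, v))

    # Best match in B for every descriptor in A, and vice versa.
    nn_ab = [max(range(len(desc_b)), key=lambda j: sim(a, desc_b[j])) for a in desc_a]
    nn_ba = [max(range(len(desc_a)), key=lambda i: sim(b, desc_a[i])) for b in desc_b]
    # Keep only the pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

desc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
desc_b = [[0.9, 0.1], [0.1, 0.9]]
matches = mutual_nearest_neighbors(desc_a, desc_b)
```

The third descriptor in `desc_a` is ambiguous between both candidates and is dropped by the mutual check, leaving the pairs (0, 0) and (1, 1).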

Practical Pointers:

ALIKED-T(16): real-time use — 125.87 FPS, 0.192M params, best MHA on HPatches
ALIKED-N(16) K=3 M=16: best speed/accuracy balance
ALIKED-N(32): best overall when compute allows
Compatible with learned matchers (SuperGlue, LightGlue) for further gain

9. Conclusions & Future Work

ALIKED introduces SDDH for efficient, geometrically invariant descriptor extraction at sparse keypoints. Combined with the sparse NRE loss and the reliable loss, it achieves strong performance across image matching, 3D reconstruction, and visual relocalization at very low computational cost. Future work: (1) joint training with learned matchers, (2) hardware-friendly networks for mobile deployment, (3) better handling of large scale and viewpoint changes.

References

Zhao, X., Wu, X., Chen, W., Chen, P. C. Y., Xu, Q., & Li, Z. (2023). ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation. IEEE Transactions on Instrumentation and Measurement. arXiv:2304.03608
