ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation
- Description: Paper note on ALIKED, an efficient keypoint and descriptor extraction network using sparse deformable descriptor heads (IEEE TIM 2023)
- My Notion Note ID: K2E-B-F2-1
- Created: 2024-08-11
- Updated: 2026-06-06
- License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io
Table of Contents
- 1. Paper Information
- 2. Summary
- 3. Key Contributions
- 4. Background & Related Work
- 5. Method Details & Key Equations
- 6. Training Setup & Datasets
- 7. Main Experiments & Quantitative Results
- 8. Ablations, Limitations & Practical Pointers
- 9. Conclusions & Future Work
- References
1. Paper Information
Title: ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation
Authors: Xiaoming Zhao, Xingming Wu, Weihai Chen*, Peter C. Y. Chen, Qingsong Xu, Zhengguo Li
Affiliations: Beihang University; National University of Singapore; University of Macau; A*STAR Institute for Infocomm Research
Paper: IEEE TIM 2023 / arXiv:2304.03608
Code: github.com/Shiaoming/ALIKED
2. Summary
ALIKED is an efficient network for joint keypoint detection and descriptor extraction that builds on the lightweight ALIKE architecture. Its core innovation is the Sparse Deformable Descriptor Head (SDDH), which extracts descriptors only at the detected sparse keypoints rather than computing a dense descriptor map, using learned deformable sample positions to model arbitrary geometric transformations. The paper also relaxes the Neural Reprojection Error (NRE) loss from dense to sparse, enabling training without dense descriptor maps. ALIKED achieves state-of-the-art matching accuracy at significantly lower computational cost: the tiny variant, ALIKED-T(16), has only 0.192M parameters and runs at 125.87 FPS.
3. Key Contributions
- Sparse Deformable Descriptor Head (SDDH): Extracts deformable descriptors at sparse keypoint locations instead of computing dense descriptor maps, greatly reducing redundant computation while allowing modeling of any geometric transformation (not just affine).
- Sparse NRE loss: Relaxes the Neural Reprojection Error loss from dense to sparse by constructing sparse probability vectors from descriptor similarity, enabling descriptor training without dense descriptor maps while reducing GPU memory.
- Reliable loss: A reliability loss that constrains the score map based on matching similarity, ensuring that areas of low texture are suppressed.
- Efficient and scalable: Three model variants (Tiny, Normal, Large) offering different speed-accuracy trade-offs; ALIKED-T(16) achieves real-time performance (125.87 FPS) with only 0.192M parameters and 1.37 GFLOPs.
4. Background & Related Work
Geometric invariant descriptors: Classical methods (SIFT, ORB) achieve scale/orientation invariance through explicit estimation. Learning-based methods use either patch-based approaches or data augmentation. Some methods (LIFT, AffNet, LF-Net) explicitly estimate affine parameters, while others (GIFT, HDD-Net) use spatial transformers or rotating kernels. Deformable convolution networks (DCN) can model arbitrary geometric transformations by learning per-pixel offsets.
Joint keypoint and descriptor learning: Map-based methods (SuperPoint, R2D2, ASLFeat, DISK, ALIKE) jointly estimate a score map and descriptor map via a Score Map Head (SMH) and Descriptor Map Head (DMH). Most extract dense descriptor maps, which is computationally expensive. D2-Net detects keypoints directly on the feature map but sacrifices localization accuracy. ALIKE uses differentiable keypoint detection (DKD) for accurate and lightweight extraction — ALIKED builds upon ALIKE.
Deformable convolutions: DCN learns offsets for convolution kernels, enabling modeling of geometric transformations. Used in object detection, segmentation, and action recognition. ASLFeat uses DCN for dense features, but ALIKED's SDDH is specifically designed for efficient sparse descriptor extraction.
5. Method Details & Key Equations
5.1 Network Architecture Overview
ALIKED consists of three components:
- Feature encoding: Four encoding blocks transform the input image into multi-scale features. Block 1 uses two vanilla conv3×3 layers. Block 2 uses pooling2 + conv3×3. Blocks 3 and 4 use pooling4 + deformable convolutions (DCN) for geometric invariance. SELU activation is used throughout.
  Notation: Pooling2/Pooling4 = 2×2/4×4 average pooling, reducing spatial resolution by 2×/4×. DCN = Deformable Convolution Network, which works like a standard convolution but with learned per-pixel offsets.
- Feature aggregation: Four ublocks (conv1×1 + upsample) align the multi-scale features and concatenate them into one aggregated feature map.
  Each ublock uses conv1×1 to unify channel dimensions and bilinear interpolation to unify spatial resolution.
- Keypoint and descriptor extraction:
  - SMH (Score Map Head): conv1×1 → 2× conv3×3 → conv3×3 + sigmoid → score map
  - DKD (Differentiable Keypoint Detection): NMS → threshold → softargmax on local patches → subpixel keypoints
  - SDDH: Extracts deformable descriptors at keypoints (see § 5.2)
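The softargmax step of DKD can be illustrated with a small standard-library sketch: given a local score patch around an NMS peak, the subpixel position is the softmax-weighted mean of the pixel coordinates. The function name `soft_argmax_patch` and the temperature value are assumptions made for illustration and are not taken from the paper or its code.

```python
import math

def soft_argmax_patch(patch, temperature=0.1):
    """Return the softmax-weighted (row, col) position inside a square score patch.

    `patch` is a list of rows of scores; `temperature` is an assumed value.
    """
    peak = max(max(row) for row in patch)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp((s - peak) / temperature) for s in row] for row in patch]
    total = sum(sum(row) for row in weights)
    row_pos = sum(i * w for i, row in enumerate(weights) for w in row) / total
    col_pos = sum(j * w for row in weights for j, w in enumerate(row)) / total
    return row_pos, col_pos

# 5x5 patch (radius 2) whose score mass leans to the right of the center pixel.
patch = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.9, 0.6, 0.0],
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
row, col = soft_argmax_patch(patch)
```

Because every operation is a smooth function of the scores, gradients can flow from the keypoint position back to the score map, which is what makes the detection differentiable.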
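Model variants (Table II); c1 to c4 are the channel counts of encoding blocks 1 to 4, and dim is the descriptor dimension:
| Model | c1 | c2 | c3 | c4 | dim |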
|---|---|---|---|---|---|
| Tiny (-T) | 8 | 16 | 32 | 64 | 64 |
| Normal (-N) | 16 | 32 | 64 | 128 | 128 |
| Large (-L) | 32 | 64 | 128 | 128 | 128* |
128*: the Large model's descriptor head has two conv layers.
5.2 Sparse Deformable Descriptor Head (SDDH)
Motivation: Traditional Descriptor Map Head (DMH) computes a dense descriptor map over the entire feature map — expensive and representationally limited. SDDH operates only at sparse keypoints.
Deformable transformation: a per-pixel offset replaces the 6-DoF affine model. Each position is mapped as $p' = p + \Delta p$, where $\Delta p$ = learned per-pixel offset; the DoF scales with the number of pixels (far more flexible than affine's 6 DoF).
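SDDH pipeline, for a keypoint $p$:
- Extract a $K \times K$ feature patch centered at $p$.
- Estimate $M$ deformable sample positions with a small network (convK×K → SELU → conv1×1): output channels $2M$; convK×K has no padding → output $1 \times 1 \times 2M$ → reshaped to $M \times 2$.
- Sample supporting features from the feature map at these positions via bilinear sampling.
- Compute the descriptor by aggregating the $M$ sampled features with learned weights over the flexible positions.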
Key differences from DCN: (1) SDDH operates on sparse keypoints, not dense maps; (2) SDDH estimates arbitrary sample positions, whereas DCN offsets are tied to a regular kernel grid; (3) SDDH uses a more sophisticated position estimation network.
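The sampling and aggregation steps of SDDH can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the offsets are fixed by hand instead of being predicted by the position estimation network, the aggregation weights are assumed to be one small matrix per sample position, and the names `bilinear_sample` and `sparse_descriptor` are hypothetical.

```python
import math

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at (x, y)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp so that the four neighbours always lie inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][c] + ax * fmap[y0][x1][c])
        + ay * ((1 - ax) * fmap[y1][x0][c] + ax * fmap[y1][x1][c])
        for c in range(len(fmap[0][0]))
    ]

def sparse_descriptor(fmap, keypoint, offsets, weights):
    """Aggregate features sampled at keypoint + offset into one unit-norm descriptor.

    offsets: M pairs (dx, dy); weights: M matrices of shape dim x C (assumed layout).
    """
    kx, ky = keypoint
    dim = len(weights[0])
    desc = [0.0] * dim
    for (dx, dy), w_m in zip(offsets, weights):
        feat = bilinear_sample(fmap, kx + dx, ky + dy)
        for d in range(dim):
            desc[d] += sum(w * f for w, f in zip(w_m[d], feat))
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]

# Toy 4x4 map with C = 2 channels: channel 0 stores x, channel 1 stores y.
fmap = [[[float(x), float(y)] for x in range(4)] for y in range(4)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, 0.25)]   # M = 3, fixed for illustration
identity = [[1.0, 0.0], [0.0, 1.0]]                # dim = C = 2
weights = [identity] * len(offsets)
desc = sparse_descriptor(fmap, (1.5, 1.5), offsets, weights)
```

The cost of this loop scales with the number of keypoints times the number of sample positions, not with the image area, which is the point of extracting descriptors sparsely.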
5.3 Loss Functions
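Reprojection loss: a differentiable 3D projection warps each keypoint from one image into the other using its depth $d$ and the relative rotation $R$ and translation $t$ between the views. The loss is the distance between the warped keypoint and its detected counterpart, applied bidirectionally (A→B and B→A) for stability.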
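The warp behind the reprojection loss can be sketched with a pinhole camera model: back-project the pixel with its depth, apply the relative pose, and project again. The helper names `warp_keypoint` and `reprojection_error`, the intrinsics tuple, and the simplification that both images share one set of intrinsics are assumptions for illustration only.

```python
import math

def warp_keypoint(p, depth, K, R, t):
    """Project pixel p = (u, v) from image A into image B.

    K: (fx, fy, cx, cy) pinhole intrinsics, shared by both images (assumption).
    R: 3x3 rotation (nested lists), t: 3-vector, mapping camera A to camera B.
    """
    fx, fy, cx, cy = K
    u, v = p
    # Back-project the pixel to a 3D point in camera A using its depth.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Rigidly transform the point into camera B.
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective-project back to pixels.
    return (fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy)

def reprojection_error(p_a, p_b, depth_a, depth_b, K, R_ab, t_ab, R_ba, t_ba):
    """Mean of the A->B and B->A pixel distances for one keypoint pair."""
    p_ab = warp_keypoint(p_a, depth_a, K, R_ab, t_ab)
    p_ba = warp_keypoint(p_b, depth_b, K, R_ba, t_ba)
    return 0.5 * (math.dist(p_ab, p_b) + math.dist(p_ba, p_a))

K = (500.0, 500.0, 320.0, 240.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera B sits 0.1 units to the right of A, so points shift left in B.
p_ab = warp_keypoint((320.0, 240.0), 2.0, K, I3, [-0.1, 0.0, 0.0])
err = reprojection_error((320.0, 240.0), (295.0, 240.0), 2.0, 2.0,
                         K, I3, [-0.1, 0.0, 0.0], I3, [0.1, 0.0, 0.0])
```

In training, the keypoint coordinates come from the differentiable detector, so this distance can be back-propagated to the score map.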
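Dispersity peak loss: maximizes score peakiness at keypoints by penalizing score mass that lies far from the keypoint. It is computed from the scores in a local patch around the keypoint and their pixel coordinates.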
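One way to read the dispersity peak loss is as a softmax-weighted mean distance between each pixel of the local patch and the soft-argmax position. The sketch below assumes that form, up to a constant normalization factor; the function name and the temperature value are illustrative assumptions rather than the paper's exact definition.

```python
import math

def dispersity_peak_loss(patch, temperature=0.1):
    """Softmax-weighted mean distance from each pixel to the soft-argmax position."""
    n = len(patch)
    peak = max(max(row) for row in patch)
    w = [[math.exp((s - peak) / temperature) for s in row] for row in patch]
    total = sum(map(sum, w))
    w = [[v / total for v in row] for row in w]
    # Soft-argmax centre of the patch.
    ci = sum(i * w[i][j] for i in range(n) for j in range(n))
    cj = sum(j * w[i][j] for i in range(n) for j in range(n))
    return sum(w[i][j] * math.hypot(i - ci, j - cj)
               for i in range(n) for j in range(n))

flat = [[0.5] * 5 for _ in range(5)]      # no clear peak
peaky = [[0.0] * 5 for _ in range(5)]     # one sharp peak at the centre
peaky[2][2] = 1.0
loss_flat = dispersity_peak_loss(flat)
loss_peaky = dispersity_peak_loss(peaky)
```

A flat patch spreads its weight over all pixels and gets a large value, while a sharply peaked patch gets a value near zero, so minimizing the loss encourages peaky scores.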
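Sparse NRE loss: relaxes the dense Neural Reprojection Error so that it works without a dense descriptor map. It compares two sparse probability vectors: $q_r$ = reprojection probability (binary ground truth) and $q_m$ = descriptor-similarity matching probability, a softmax over descriptor similarities scaled by a temperature $t_{des}$ (lower → sharper distribution). The loss is the cross-entropy between them, which pushes the correct match to dominate.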
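A minimal sketch of the sparse NRE idea: with a binary (one-hot) reprojection target, the cross-entropy reduces to the negative log of the softmax matching probability at the ground-truth match. The function name, argument layout, and the temperature value below are assumptions for illustration, not values taken from the paper.

```python
import math

def sparse_nre_loss(desc_a, descs_b, match_index, temperature=0.02):
    """Cross-entropy between a one-hot target and a softmax over similarities.

    desc_a: unit-norm descriptor from image A.
    descs_b: unit-norm descriptors of the sparse keypoints in image B.
    match_index: index in descs_b of the ground-truth reprojection of desc_a.
    temperature: assumed value for illustration only.
    """
    sims = [sum(a * b for a, b in zip(desc_a, d)) for d in descs_b]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    q_match = exps[match_index] / sum(exps)
    # One-hot target, so the cross-entropy reduces to -log of the matched entry.
    return -math.log(q_match)

descs_b = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
good = sparse_nre_loss([1.0, 0.0], descs_b, match_index=0)   # correct match is most similar
bad = sparse_nre_loss([1.0, 0.0], descs_b, match_index=1)    # correct match is least similar
```

Only the descriptors at the sparse keypoints of image B enter the softmax, so no dense descriptor map has to be kept in memory.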
Reliable loss: suppresses low-texture detections by penalizing high-scoring keypoints with low matching reliability.
Overall loss: a weighted sum of the reprojection, dispersity peak, sparse NRE, and reliable losses.
6. Training Setup & Datasets
Training data:
- MegaDepth: 135 scenes of tourist landmarks with COLMAP SfM depth and poses; 10k pairs/scene; excludes IMW2020 validation/test scenes
- R2D2 homographic dataset: Synthetic pairs on Oxford-Paris retrieval + Aachen datasets
- The perspective (MegaDepth) and homographic datasets are used together for training
Training details:
- Images resized to a fixed training resolution; batch size 2 with gradients accumulated over 6 batches
- DKD radius 2 px; reprojection threshold 5 px
- Adam optimizer with betas (0.9, 0.999)
- Top 400 + 400 random keypoints; NMS to avoid repeats
- 100K steps; 3 trained variants: ALIKED-T(16), ALIKED-N(16), ALIKED-N(32), where the number in parentheses is M, the number of SDDH deformable sample positions
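Two of the training details above reduce to simple bookkeeping, sketched here with the standard library. The helper `select_training_keypoints` is hypothetical and assumes the candidate list has already been deduplicated (for example by NMS); it is not the authors' sampling code.

```python
import random

def select_training_keypoints(scores, n_top=400, n_random=400, seed=0):
    """Return indices of the n_top best-scoring candidates plus n_random others."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)
    # Sample without replacement so no candidate is picked twice.
    extra = rng.sample(rest, min(n_random, len(rest)))
    return top + extra

# Batch size 2 with gradients accumulated over 6 batches.
effective_batch = 2 * 6

rng = random.Random(1)
scores = [rng.random() for _ in range(1000)]   # stand-in for detector scores
picked = select_training_keypoints(scores)
```

Mixing random keypoints with the top-scoring ones exposes the descriptor and reliability losses to low-score regions as well, instead of only the detector's current favorites.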
Hardware: Intel i7-10700F, RTX 2060, CUDA 10.2, PyTorch 1.11.0
7. Main Experiments & Quantitative Results
7.1 Real-time Performance (Table IV)
| Model | Params (M) | GFLOPs | FPS | MMA@3 | MHA@3 |
|---|---|---|---|---|---|
| SuperPoint | 1.301 | 26.11 | 52.63 | 65.37% | 70.19% |
| R2D2(MS) | 0.484 | 464.55 | 4.10 | 75.77% | 71.48% |
| ASLFeat(MS) | 0.823 | 44.24 | 7.10* | 72.44% | 73.52% |
| DISK | 1.092 | 98.97 | 11.81 | 77.59% | 70.56% |
| ALIKE-N | 0.318 | 7.91 | 84.96 | 70.78% | 75.74% |
| ALIKE-L | 0.653 | 19.68 | 56.66 | 70.50% | 76.85% |
| ALIKED-T(16) | 0.192 | 1.37 | 125.87 | 72.99% | 78.70% |
| ALIKED-N(16) | 0.677 | 4.05 | 77.40 | 74.43% | 77.22% |
| ALIKED-N(32) | 0.980 | 4.62 | 75.64 | 75.23% | 74.44% |
ALIKED-T(16): fewest params (0.192M), lowest GFLOPs (1.37), highest FPS (125.87), best MHA@3 (78.70%) on HPatches.
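To make the efficiency gap concrete, the ratios between ALIKED-T(16) and a few baselines can be computed directly from the Table IV rows. The dictionary below copies only numbers that appear in the table; the helper name `ratio_vs` is an illustrative choice.

```python
# (params in millions, GFLOPs, FPS) copied from the Table IV rows above.
table_iv = {
    "SuperPoint": (1.301, 26.11, 52.63),
    "DISK": (1.092, 98.97, 11.81),
    "ALIKE-N": (0.318, 7.91, 84.96),
    "ALIKED-T(16)": (0.192, 1.37, 125.87),
}

def ratio_vs(baseline, model="ALIKED-T(16)"):
    """Return (param ratio, GFLOPs ratio, speed-up) of `model` relative to `baseline`."""
    bp, bg, bf = table_iv[baseline]
    mp, mg, mf = table_iv[model]
    return bp / mp, bg / mg, mf / bf

for name in ("SuperPoint", "DISK", "ALIKE-N"):
    params, flops, speed = ratio_vs(name)
    print(f"{name}: {params:.1f}x params, {flops:.1f}x GFLOPs, {speed:.2f}x FPS")
```

The FPS figures were measured on the hardware listed in the training setup, so the speed-up ratios are only indicative for other devices.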
7.2 3D Reconstruction (IMW Test Set, Table V)
Up to 2048 keypoints:
- Stereo: ALIKED-N(16) best mAA(5°) = 39.53%; ALIKED-T(16) best PPC = 36.77
- Multiview: DISK best mAA(5°) = 63.25%; ALIKED-N(32) close at 61.67%; ALIKED-T(16) best PPC = 51.74
- ALIKED-N(16) outperforms DISK on stereo: +1.5% Rep, +0.81% mAA(5°), +1.06% mAA(10°)
7.3 Feature Matching (FM-Bench, Table VI)
TUM SLAM, KITTI, Tanks & Temples, CPC datasets:
- ALIKED-N(16): best recall on TUM and T&T
- ALIKED-N(32): best recall on KITTI and best overall matching performance
7.4 Visual Relocalization (Aachen Day-Night, Table VII)
| Method | 1024 kpts: 0.25m,2° / 0.5m,5° / 5m,10° | 2048 kpts: 0.25m,2° / 0.5m,5° / 5m,10° |
|---|---|---|
| SuperPoint | 58.2 / 66.3 / 72.4 | 69.4 / 78.6 / 87.8 |
| ALIKE-N | 59.2 / 73.5 / 83.7 | 68.4 / 84.7 / 96.9 |
| DISK | 60.2 / 72.4 / 81.6 | 70.4 / 82.7 / 94.9 |
| ALIKED-T(16) | 70.4 / 87.8 / 98.0 | 78.6 / 87.8 / 98.0 |
| ALIKED-N(16) | 73.5 / 85.7 / 98.0 | 80.6 / 87.8 / 99.0 |
| ALIKED-N(32) | 77.6 / 88.8 / 100.0 | 76.5 / 87.8 / 100.0 |
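With 1024 keypoints, ALIKED-N(32) is best at all three thresholds. With 2048 keypoints the picture is mixed: ALIKED-N(16) leads at the tightest threshold (80.6), all three ALIKED variants tie at 87.8 on the middle one, and ALIKED-N(32) leads at the loosest (100.0).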
8. Ablations, Limitations & Practical Pointers
NRE Loss and Training Data (Table VIII):
- Dense NRE slightly better, but SDDH requires sparse NRE; increasing resolution 480→800 closes the gap
- Homography dataset improves HPatches but slightly degrades IMW-validation
Network Architecture (Table IX) — progressive design ablation:
- AVG+SELU (vs max-pool+ReLU): +3.4% mAA(10°) on IMW-val
- Score Head SH3: best cost vs. quality balance
- 2×DCN in blocks 3&4: +0.1 GFLOPs, +0.8ms → +2.81% MS@3 on HPatches, +4.36% mAA(10°) on IMW
- SDDH K=3, M=16: best trade-off — 65.39% mAA(10°) at 12.92ms total
- SDDH K=3, M=32: best overall (67.78% mAA(10°)) at slightly higher cost
Limitations:
- Struggles with large simultaneous scale and viewpoint differences (a weakness shared with other state-of-the-art methods)
- SDDH has only one deformable layer, so complex deformations are not fully captured
- Descriptors are 32-bit floating point, which is not ideal for mobile/embedded deployment
- Evaluation uses a mutual nearest-neighbor (mNN) matcher rather than a learned one, so results reflect descriptor quality alone
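For reference, mutual nearest-neighbor (mNN) matching keeps a pair only when each descriptor is the other's most similar candidate. The sketch below is a plain-Python illustration using dot-product similarity on unit-norm descriptors; the function name is an assumption and the code is not taken from the ALIKED repository.

```python
def mutual_nearest_neighbors(descs_a, descs_b):
    """Return (i, j) pairs where a_i and b_j are each other's most similar descriptor."""
    def sim(u, v):
        return sum(x * y for x, y in zip(u, v))

    # Best match in B for every descriptor in A, and vice versa.
    best_for_a = [max(range(len(descs_b)), key=lambda j: sim(a, descs_b[j]))
                  for a in descs_a]
    best_for_b = [max(range(len(descs_a)), key=lambda i: sim(descs_a[i], b))
                  for b in descs_b]
    # Keep only pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]

descs_a = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
descs_b = [[0.0, 1.0], [1.0, 0.0]]
matches = mutual_nearest_neighbors(descs_a, descs_b)
```

Here the third descriptor of A prefers the second descriptor of B, but that descriptor prefers the first one in A, so the pair is rejected by the mutual check.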
Practical Pointers:
- ALIKED-T(16): real-time use — 125.87 FPS, 0.192M params, best MHA on HPatches
- ALIKED-N(16) K=3 M=16: best speed/accuracy balance
- ALIKED-N(32): best overall when compute allows
- Compatible with learned matchers (SuperGlue, LightGlue) for further gain
9. Conclusions & Future Work
ALIKED introduces SDDH for efficient, geometrically invariant descriptor extraction at sparse keypoints. Combined with the sparse NRE loss and the reliable loss, it achieves strong performance across image matching, 3D reconstruction, and visual relocalization at very low computational cost. Future work: (1) joint training with learned matchers, (2) hardware-friendly mobile networks, (3) better handling of large scale and viewpoint changes.
References
- Zhao, X., Wu, X., Chen, W., Chen, P. C. Y., Xu, Q., & Li, Z. (2023). ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation. IEEE Transactions on Instrumentation and Measurement. arXiv:2304.03608