Modular Quantization-Aware Training for 6D Object Pose Estimation

MQAT teaser image.

In contrast to uniform and mixed-precision quantization, MQAT accounts for the modularity of typical 6D object pose estimation frameworks. MQAT not only significantly reduces the model's memory footprint but also leads to an accuracy boost (measured by ADI).

Abstract

Edge applications, such as collaborative robotics and spacecraft rendezvous, demand efficient 6D object pose estimation on resource-constrained embedded platforms. Existing 6D object pose estimation networks are often too large for such deployments, necessitating compression while maintaining reliable performance.

To address this challenge, we introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy that exploits the modular structure of modern 6D object pose estimation architectures. MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.

Our experiments showcase the generality of MQAT across datasets, architectures, and quantization algorithms. Additionally, we observe that MQAT quantized models can achieve an accuracy boost (> 7% ADI-0.1d) over the baseline full-precision network while reducing model size by a factor of 4x or more.

Key Points (TL;DR)

Modular Quantization-Aware Training (MQAT): MQAT leverages the modular structure of 6D object pose estimation networks to apply mixed-precision quantization, optimizing performance while minimizing computational costs.

Enhanced Edge Efficiency: Specifically designed for resource-constrained edge devices, MQAT supports applications like collaborative robotics and spacecraft rendezvous, providing efficient and accurate 6D object pose estimation.

Quantization Flow Control: MQAT introduces a unique quantization flow control that optimally sequences the quantization of different modules (e.g., Backbone, Rotation Head, and PnP-Patch), ensuring minimal performance loss during model compression.

Superior Compression with High Accuracy: Compared to traditional quantization techniques like LSQ and HAWQ-V3, MQAT achieves higher accuracy (up to a 7% boost in ADI-0.1d) with greater compression, reducing model size by 4x or more.

Broad Applicability Across Datasets and Architectures: MQAT demonstrates versatility by achieving improvements in both 6D pose estimation and object detection tasks, validated across multiple datasets.
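The modular scheme outlined in the points above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the symmetric fake-quantizer, the module names, and the omission of the fine-tuning step are all simplifying assumptions.

```python
# Minimal sketch of gradated modular quantization (illustrative only).
def fake_quantize(weights, bits):
    """Symmetric uniform fake quantization with a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1e-8
    return [round(w / scale) * scale for w in weights]

def mqat(modules, flow, bit_widths):
    """Quantize one module at a time in the order given by `flow`,
    keeping already-quantized modules quantized while the remaining
    ones stay at full precision (QAT fine-tuning omitted)."""
    for name in flow:
        modules[name] = fake_quantize(modules[name], bit_widths[name])
        # ...a full-network QAT fine-tuning pass would run here...
    return modules

# Example: aggressively quantize an FPN-like module to 2 bits.
out = mqat({"FPN": [0.3, -0.7, 0.1]}, ["FPN"], {"FPN": 2})
```

At 2 bits the symmetric quantizer has qmax = 1, so each weight snaps to 0 or ±(largest magnitude); in practice MQAT pairs such aggressive bit-widths with quantization-aware fine-tuning to recover accuracy.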

Quantization Sequence Matters

A heuristic search over quantization flow sequences, demonstrating the optimality of our chosen quantization flow for the K=3 module WDR network.
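For K modules there are K! candidate quantization orderings, and the heuristic search above simply evaluates each of them. A minimal sketch, using the B-F-H module abbreviations (Backbone, FPN, Head) from Table 2:

```python
from itertools import permutations

# All candidate quantization flows for K = 3 modules:
# Backbone (B), FPN (F), and Head (H).
flows = list(permutations(["B", "F", "H"]))
print(len(flows))  # 6 orderings to evaluate, e.g. ('F', 'H', 'B')
```

With K = 3 the search is cheap (3! = 6 sequences); the candidate count grows factorially with the number of modules.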


Performance Comparison between Methods

Comparison between our proposed MQAT, uniform QAT (LSQ), and layer-wise mixed-precision QAT (HAWQ-V3).


Architecture Generality

MQAT demonstrates its adaptability and effectiveness across various single-stage and multi-stage 6D object pose estimation architectures.

Single-stage Architecture

In Table 1, we compare several single-stage PnP architectures on the SwissCube dataset. To demonstrate the generality of our performance enhancement, we apply MQAT to aggressively quantize the FPN of both CA-SpaceNet and WDR. On CA-SpaceNet, we observe accuracy improvements of 4.5%, 4.4%, and 4.5% for Near, Medium, and Far images, respectively, yielding a 3.3% improvement over the full test set. Recall the 5.0% total test set improvement already reported for WDR. At full precision, CA-SpaceNet outperformed WDR, but WDR sees greater gains from the application of MQAT.

Network        Near   Medium   Far    All
SegDriven-Z    41.1   22.9     7.1    21.8
DLR            52.6   45.4     29.4   43.2
CA-SpaceNet    91.0   86.3     61.7   79.4
CA-SpaceNet*   95.5   90.7     66.2   82.7
WDR            92.4   84.2     61.3   78.8
WDR*           96.1   91.5     68.2   83.8

Table 1: Comparison with the state-of-the-art on SwissCube. We report ADI-0.1d scores for three different depth ranges. A '*' indicates applying MQAT with 2-bit precision FPN to the model.

Figure 1 illustrates the WDR network with MQAT applied in an 8-2-8 quantization scheme. For comparison, the full-precision model (without quantization) is also shown.


Figure 1: Comparison between our proposed MQAT paradigm and the full-precision network. The model applying MQAT yields predictions that are on par with, or more concentrated than, its full-precision counterpart.

In addition, Table 2 shares published accuracy results for a uniformly QAT-quantized CA-SpaceNet. Specifically, CA-SpaceNet explored three quantization modes (B, BF, and BFH), corresponding to quantizing the backbone; the backbone and FPN (paired); and the whole network (uniformly), respectively.

Quantization Method   ADI-0.1d   Compression   Bit-Precisions (B-F-H)
LSQ                   79.4       1x            32-32-32
LSQ B                 76.2       2.2x          8-32-32
LSQ BF                75.0       3.2x          8-8-32
LSQ BFH               74.7       4.0x          8-8-8
MQAT (Ours)           82.7       4.7x          8-2-8
MQAT (Ours)           80.2       8.2x          4-2-4
LSQ BFH               68.7       10.6x         3-3-3

Table 2: CA-SpaceNet Published Quantization vs. MQAT. We report ADI scores on the SwissCube dataset, sorted by the compression factor of the network. For the MQAT methods, we use quantization flow F→H→B.
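Compression factors like those in Table 2 follow directly from the per-module bit-widths, weighted by each module's parameter count. A sketch with hypothetical parameter counts (the real CA-SpaceNet split differs, which is why the table reports 4.7x rather than the value below):

```python
def compression_factor(params, bits, fp_bits=32):
    """Full-precision size over quantized size; `params` and `bits`
    map module name -> parameter count and bit-width, respectively."""
    fp_size = sum(n * fp_bits for n in params.values())
    q_size = sum(params[m] * bits[m] for m in params)
    return fp_size / q_size

# Hypothetical 20M/5M/5M parameter split across B, F, H modules.
params = {"B": 20e6, "F": 5e6, "H": 5e6}
print(round(compression_factor(params, {"B": 8, "F": 2, "H": 8}), 2))  # 4.57
print(round(compression_factor(params, {"B": 8, "F": 8, "H": 8}), 2))  # 4.0
```

The uniform 8-8-8 assignment recovers the 4.0x of the LSQ BFH row regardless of the parameter split, while mixed assignments depend on where the parameters sit; quantizing a parameter-light module such as the FPN to 2 bits buys extra compression cheaply.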

Multi-stage Architecture

In Table 3, we demonstrate the performance of our method on the state-of-the-art 6D object pose estimation network, the two-stage ZebraPose network.

Quantization Method   ADD-0.1d   Compression   Bit-Precisions       Quantization Flow
Full Precision        76.90      1×            Full precision       N/A
HAWQ-V3               71.11      4×            Mixed (layer-wise)   N/A
HAWQ-V3               69.87      4.60×         Mixed (layer-wise)   N/A
MQAT                  72.54      4.62×         8-4 (B-D)            D → B

Table 3: Quantization of ZebraPose. We report ADD scores on the LM-O dataset and compare MQAT to mixed-precision quantization (HAWQ-V3).

In Table 4, we evaluate GDR-Net using existing uniform and mixed-precision quantization methods alongside our MQAT approach. The ADD-0.1d metric was used for 6D object pose estimation on the O-Linemod dataset, following the methodology of Wang et al. MQAT outperforms both LSQ and HAWQ-V3, achieving superior results even with a slightly more compressed network.

Quantization Method   ADD-0.1d   Compression   Bit-Precisions       Quantization Flow
Full Precision        56.1       1×            Full precision       N/A
LSQ                   50.7       4.57×         Uniform (7-bit)      N/A
HAWQ-V3               50.3       4.9×          Mixed (layer-wise)   N/A
MQAT                  51.8       4.97×         8-4-4 (B-R-P)        R → P → B

Table 4: Quantization of GDR-Net. We report ADD scores on the LM-O dataset and compare MQAT to uniform (LSQ) and mixed-precision (HAWQ-V3) quantization. B, R, and P indicate the Backbone, Rotation Head, and PnP-Patch modules.

Generality of MQAT on Object Detection

While our primary focus is on 6D pose estimation, where our method's efficacy is already demonstrated, we further extend our evaluation to object detection tasks. This extension is aimed at underscoring the generality of our approach.

To this end, we applied our quantization technique to the Faster R-CNN network with a ResNet-50 backbone, a widely recognized model in object detection, and conducted our evaluation on the COCO dataset, a comprehensive object detection benchmark.

Network           QAT Method       mAP     Compression
FasterRCNN        Full Precision   37.9    1x
                  FQN              32.4    8x
                  INQ              33.4    8x
                  LSQ              33.8    8x
                  MQAT (ours)      35.1    8x
EfficientDet-D0   Full Precision   33.16   1x
                  N2UQ             20.11   10x
                  MQAT (ours)      21.67   10x

Table 5: Quantization for Object Detection. We evaluate the given networks on the COCO dataset and report mAP.

Result visualization

Figure 2: Visualization of the Difference in Object Detection Performance on MSCOCO between N2UQ and MQAT at the same compression ratio.

Acknowledgements

We thank Ziqi Zhao for assisting with the experiments for Multi-stage architecture. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 945363. Moreover, this work was funded in part by the Swiss National Science Foundation and the Swiss Innovation Agency (Innosuisse) via the BRIDGE Discovery grant No. 194729.

BibTeX

@article{javed2024modular,
  title={Modular Quantization-Aware Training for 6D Object Pose Estimation},
  author={Saqib Javed and Chengkun Li and Andrew Lawrence Price and Yinlin Hu and Mathieu Salzmann},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=lIy0TEUou7}
}