Weakly aligned visible-infrared object detection poses significant challenges due to the imprecise alignment between visible and infrared images. Most existing methods explore alignment strategies between the two modalities, incurring heavy computation costs. This paper proposes the first end-to-end alignment-free architecture, the Cross-modal Fusion Deformable DEtection TRansformer (``CF-Deformable DETR''), for weakly aligned visible-infrared object detection. Abandoning traditional image alignment, CF-Deformable DETR introduces a simple yet effective cross-modal deformable attention mechanism that directly performs automatic cross-modal point mapping, generating well-aligned bimodal features with high efficiency. Moreover, we design a Point-level Feature Consistency Loss to guide the cross-modal point mapping, ensuring the consistency of paired features to support the subsequent fusion. Extensive experiments on three benchmark datasets demonstrate that CF-Deformable DETR achieves comparable accuracy on weakly aligned and strictly aligned data and maintains largely stable performance across varying degrees of offset in weakly aligned data. Code is available at https://github.com/116508/CF-Deformable-DETR.
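The abstract does not give implementation details, so the following PyTorch-style sketch illustrates one plausible reading of the two named components: queries anchored on visible-modality reference points predict sampling offsets into the infrared feature map (cross-modal point mapping via deformable attention), and a point-level loss pulls each pair of mapped features together. All module names, tensor shapes, and the cosine-based loss form are assumptions for illustration, not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDeformableAttention(nn.Module):
    """Hypothetical sketch: each visible-modality query, anchored at a
    normalized reference point, predicts offsets that map it onto the
    infrared feature map, so paired features are gathered without any
    explicit image alignment step."""

    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        # one (dx, dy) offset and one attention weight per sampled point
        self.offset_proj = nn.Linear(d_model, n_points * 2)
        self.weight_proj = nn.Linear(d_model, n_points)
        self.value_proj = nn.Conv2d(d_model, d_model, kernel_size=1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, ref_points, ir_feat):
        # query:      (B, N, C)    visible-modality queries
        # ref_points: (B, N, 2)    normalized (x, y) in [0, 1]
        # ir_feat:    (B, C, H, W) infrared feature map
        B, N, C = query.shape
        value = self.value_proj(ir_feat)

        # predict per-query sampling locations in normalized coordinates
        offsets = self.offset_proj(query).view(B, N, self.n_points, 2)
        locs = ref_points.unsqueeze(2) + offsets           # (B, N, P, 2)
        grid = 2.0 * locs - 1.0                            # [-1, 1] for grid_sample

        # bilinear sampling of infrared features at the mapped points
        sampled = F.grid_sample(
            value, grid, mode="bilinear", align_corners=False
        )                                                  # (B, C, N, P)

        # attention-weighted aggregation over the sampled points
        attn = F.softmax(self.weight_proj(query), dim=-1)  # (B, N, P)
        out = (sampled * attn.unsqueeze(1)).sum(-1)        # (B, C, N)
        return self.out_proj(out.transpose(1, 2))          # (B, N, C)


def point_level_consistency_loss(vis_feat, ir_feat_mapped):
    """Hypothetical point-level feature consistency term: encourage each
    visible point feature to agree with its cross-modal counterpart."""
    return 1.0 - F.cosine_similarity(vis_feat, ir_feat_mapped, dim=-1).mean()
```

Under this reading, the deformable sampling itself performs the alignment, so the cost of a separate image-registration stage is avoided; the consistency loss supervises the predicted offsets so that the sampled infrared features actually correspond to their visible anchors before fusion.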