The vision transformer has been widely applied to remote sensing image scene classification because of its strong ability to capture global features. However, remote sensing scene images pose challenges such as scene complexity and small inter-class differences, and directly using all global tokens of the transformer for feature learning may increase computational complexity. Constructing a distinguishable transformer network that adaptively selects tokens can therefore improve classification performance on remote sensing scene images while keeping computational complexity in check. Based on this, a second-order differentiable token transformer network (SDT²Net) is proposed to exploit both the distinguishable statistical features and the non-redundant learnable tokens of remote sensing scene images. A novel transformer block, comprising an efficient attention block (EAB) and a differentiable token compression (DTC) mechanism, is inserted into SDT²Net to acquire selectable token features for each scene image, guided by sparse shift local features and by learning of the token compression rate. Furthermore, a fast token fusion (FTF) module is developed to acquire more distinguishable token feature representations. This module uses the fast global covariance pooling algorithm to obtain high-order visual tokens, and it validates the effectiveness of combining classification tokens with high-order visual tokens for scene classification. Compared with other recent methods, SDT²Net achieves state-of-the-art performance with comparable FLOPs (floating-point operations). The code will be available at https://github.com/RSIP-NJUPT/SDT2Net.
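To make the notion of high-order visual tokens more concrete, the following is a minimal sketch of second-order (covariance) pooling over a set of token vectors, followed by an approximate matrix square root computed with a Newton-Schulz iteration. It is an illustration under stated assumptions, not the SDT²Net implementation: the function names `covariance` and `newton_schulz_sqrt`, the 1/n normalization, the iteration count, and the toy token values are all hypothetical, and the use of Newton-Schulz as the "fast" normalization step is an assumption about how fast global covariance pooling is commonly realized. Plain Python lists are used only to keep the sketch dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def covariance(tokens):
    """Return the d x d covariance of n token vectors (1/n normalization, an assumption)."""
    n, d = len(tokens), len(tokens[0])
    mean = [sum(t[j] for t in tokens) / n for j in range(d)]
    centered = [[t[j] - mean[j] for j in range(d)] for t in tokens]
    centered_t = [list(col) for col in zip(*centered)]
    return [[v / n for v in row] for row in matmul(centered_t, centered)]


def newton_schulz_sqrt(a, iterations=20):
    """Approximate the square root of a symmetric positive semi-definite matrix."""
    d = len(a)
    trace = sum(a[i][i] for i in range(d))
    identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    # Pre-normalize by the trace so that the iteration converges.
    y = [[v / trace for v in row] for row in a]
    z = identity
    for _ in range(iterations):
        zy = matmul(z, y)
        t = [[0.5 * (3.0 * identity[i][j] - zy[i][j]) for j in range(d)]
             for i in range(d)]
        y, z = matmul(y, t), matmul(t, z)
    # Post-compensate for the trace normalization.
    scale = trace ** 0.5
    return [[v * scale for v in row] for row in y]


# Hypothetical toy input: 5 tokens with 3 channels each.
tokens = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 2.0, 4.0],
]
second_order = newton_schulz_sqrt(covariance(tokens))  # 3 x 3 matrix
```

In a network, a matrix like `second_order` would typically be flattened (often only its upper triangle) and passed to a classifier, possibly alongside the classification token; how SDT²Net fuses the two is described in the paper and is not reproduced here.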