Building Change Detection (BCD) aims to identify newly appeared or disappeared buildings from bi-temporal images. However, the varied scales and appearances of buildings, together with interference from pseudo-changes caused by complex backgrounds, make it difficult to extract complete changes accurately. To address these challenges in BCD, a U-shaped hybrid Siamese network that combines a convolutional neural network and a vision Transformer (CNN-ViT) with learnable mask guidance, called U-Conformer, is designed. Firstly, a new hybrid U-Conformer architecture is proposed. It integrates the strengths of CNNs and ViTs to build a robust, multi-scale heterogeneous representation that aids in detecting buildings of various sizes. Secondly, a Learnable Mask Guidance Module (LMGM) is designed specifically for U-Conformer; it guides the multi-scale heterogeneous representation toward changes at the relevant scales while progressively suppressing pseudo-changes. Furthermore, a class-balanced joint loss function incorporating mask information, which combines the Binary Cross-Entropy (BCE) loss and the Dice loss, is devised for the U-Conformer architecture, significantly mitigating the class-imbalance problem. Experimental results on three publicly available change detection datasets (LEVIR-CD, WHU-CD, and GZ-CD) demonstrate that U-Conformer surpasses previous methods, achieving F1 scores of 91.5%, 94.6%, and 86.7% and IoU scores of 84.3%, 89.7%, and 76.5% on the three datasets, respectively.
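
The abstract does not give the exact form of the class-balanced joint loss. As a minimal sketch, assuming a PyTorch implementation and an equal weighting between the two terms (the function name, weighting, and smoothing constant below are assumptions, not details from the paper), a combined BCE + Dice loss for binary change masks could look like the following.

```python
import torch
import torch.nn.functional as F


def bce_dice_loss(logits: torch.Tensor,
                  target: torch.Tensor,
                  bce_weight: float = 0.5,
                  eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical combined BCE + Dice loss for binary change masks.

    logits: raw network outputs, shape (N, 1, H, W)
    target: ground-truth change mask in {0, 1}, same shape (float)
    The 0.5/0.5 weighting is an assumption; the paper's exact
    mask-guided, class-balanced formulation may differ.
    """
    # Pixel-wise binary cross-entropy computed on the raw logits.
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # Soft Dice loss: an overlap-based term that is less sensitive to the
    # foreground/background imbalance typical of building change detection.
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * intersection + eps) / (union + eps)

    return bce_weight * bce + (1.0 - bce_weight) * dice.mean()
```

In this sketch, the Dice term directly rewards overlap with the sparse changed-pixel mask, which is what counteracts the class imbalance the abstract refers to; how the paper's full formulation incorporates mask information is not specified in the abstract and may differ from this illustration.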