Complete 3D scene understanding is crucial for robotic visual perception and underpins tasks such as motion planning and map localization. However, because sensors have a limited field of view and scenes contain occlusions, inferring complete scene geometry and semantics from restricted observations is challenging. In this work, we propose a novel Multimodal Representation Fusion Transformer framework (MRFTrans) that robustly fuses semantic, geometric occupancy, and depth representations for monocular-image-based scene completion. MRFTrans centers on an affinity representation fusion transformer that integrates geometric occupancy and semantic relationships within a transformer architecture, allowing it to model long-range dependencies within a scene and infer missing information. Additionally, we present a depth representation fusion method that efficiently extracts reliable depth knowledge from biased monocular estimates. Extensive experiments demonstrate the superiority of MRFTrans, setting a new benchmark on the SemanticKITTI and NYUv2 datasets. It significantly enhances completeness and accuracy, particularly for large structures, movable objects, and heavily occluded scene components. The results underscore the benefits of the affinity-aware transformer and robust depth fusion for monocular-image-based scene completion.