Sleep staging is fundamental to assessing sleep quality and diagnosing sleep disorders. Although current deep learning approaches have integrated multi-modal sleep signals and improved the accuracy of automatic sleep staging, three challenges remain: 1) fully exploiting the complementarity of multi-modal information, 2) effectively extracting both long- and short-range temporal features of sleep signals, and 3) addressing the class imbalance problem in sleep data. To address these challenges, this paper proposes a two-stream encoder-decoder network, named TSEDSleepNet, inspired by the depth-sensitive attention and automatic multi-modal fusion (DSA2F) framework. In TSEDSleepNet, a two-stream encoder extracts multiscale features from electrooculogram (EOG) and electroencephalogram (EEG) signals, and a self-attention mechanism fuses these multiscale features to generate multi-modal saliency features. Subsequently, a coarser-scale construction module (CSCM) extracts and constructs multi-resolution features from the multiscale features and the saliency features. A Transformer module then captures both long- and short-range temporal features from the multi-resolution features. Finally, these temporal features are restored with low-layer details and mapped to the predicted classification results. Additionally, the Lovász loss function is applied to alleviate the class imbalance problem in sleep datasets. The proposed method was evaluated on the Sleep-EDF-39 and Sleep-EDF-153 datasets, achieving classification accuracies of 88.9% and 85.2% and Macro-F1 scores of 84.8% and 79.7%, respectively, and outperforming conventional baseline models. These results highlight the efficacy of the proposed method in fusing multi-modal information.
This method has potential for application as an adjunct tool for diagnosing sleep disorders.
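
To illustrate why Macro-F1 is reported alongside accuracy when classes are imbalanced, the following is a minimal sketch of both metrics using their standard definitions. It is not the evaluation code used for TSEDSleepNet; the function names, the stage labels, and the toy predictions are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of epochs whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given labels."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy data: a majority stage ("N2") and a minority stage ("N1").
toy_true = ["N2"] * 8 + ["N1"] * 2
# A degenerate classifier that always predicts the majority stage.
toy_pred = ["N2"] * 10

toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred, labels=["N1", "N2"])
```

On this toy data the always-majority classifier reaches an accuracy of 0.8 but a Macro-F1 of only about 0.44, because the minority class contributes an F1 of zero. This gap is why a per-class averaged metric is informative for imbalanced sleep data.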