Abstract

Motivation: Spatially resolved transcriptomics (SRT) has demonstrated impressive power in yielding biological insights for neuroscience, disease study, and even plant biology. However, current methods do not sufficiently exploit the expressiveness of multi-modal SRT data, leaving considerable room for performance improvement. Moreover, current deep-learning-based methods lack interpretability due to their “black box” nature, impeding their application in areas that require explanation.

Results: We propose conST, a powerful and flexible SRT data analysis framework utilizing contrastive learning techniques. conST can learn low-dimensional embeddings by effectively integrating multi-modal SRT data, i.e., gene expression, spatial information, and morphology (if applicable). The learned embeddings can then be used for various downstream tasks, including clustering, trajectory and pseudotime inference, and cell-to-cell interaction (CCI) analysis. Extensive experiments on various datasets demonstrate the effectiveness and robustness of the proposed conST, which achieves up to a 10% improvement in clustering ARI on a commonly used benchmark dataset. We also show that the learned embeddings can be used in complicated scenarios, such as predicting cancer progression by analyzing the tumour microenvironment and CCI of breast cancer. Our framework is interpretable in that it can identify the correlated spots that support the clustering; these spots also match the CCI pairs, giving clinicians more confidence when making clinical decisions.