Abstract
Objective diagnostic and stratification biomarkers developed from resting-state functional magnetic resonance imaging (rs-fMRI) data are expected to contribute to more effective treatment of mental disorders. Unfortunately, there are currently no widely accepted biomarkers, partly because of the large variety of analysis pipelines used to develop them. In this study, we comprehensively evaluated analysis pipelines using a large-scale, multi-site fMRI dataset for major depressive disorder (MDD) (1162 participants from eight imaging sites). We explored combinations of options in four subprocesses of the analysis pipeline: six types of brain parcellation, four types of functional connectivity (FC) estimation, three types of site-difference harmonization, and five types of machine learning methods. In total, 360 different MDD diagnostic biomarkers (6 × 4 × 3 × 5 pipelines) were constructed using the SRPBS dataset acquired with unified protocols (713 participants from four imaging sites) as the discovery dataset, and were evaluated for independent validation with datasets from other projects acquired with heterogeneous protocols (449 participants from four imaging sites). To identify the optimal options regardless of the discovery dataset, we repeated the same procedure after swapping the roles of the two datasets. We found that pipelines that included Glasser’s parcellation, tangent-covariance, no harmonization, and non-sparse machine learning methods tended to yield high classification performance. The diagnosis results of the top 10 biomarkers were highly similar, and weight similarity was also observed among eight of them; the two exceptions used both data-driven parcellation and FC computation.
We applied the top 10 pipelines to datasets of other mental disorders (autism spectrum disorder, ASD; schizophrenia, SCZ). Eight of the 10 biomarkers showed sufficient classification performance for both disorders; the two exceptions were pipelines that combined Pearson correlation, ComBat harmonization, and a random forest classifier.

Highlights
We evaluated analysis pipelines for rsFC biomarker development.
Four subprocesses of these pipelines were investigated with two multi-site datasets.
Glasser’s parcellation, tangent covariance, and non-sparse methods were preferred.
The weight patterns of eight of the top 10 biomarkers showed high commonality.
Eight of the top 10 pipelines were successful for developing SCZ/ASD biomarkers.
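
The count of 360 pipelines follows from crossing the options of the four subprocesses (6 × 4 × 3 × 5). The following is a minimal Python sketch of that enumeration, not the study's implementation. The abstract names only some of the options, so the labels used here (for example `parcellation_1`) and the function name `build_pipeline_grid` are hypothetical placeholders.

```python
from itertools import product

# Number of options per subprocess, as stated in the abstract.
OPTION_COUNTS = {
    "parcellation": 6,
    "fc_estimation": 4,
    "harmonization": 3,
    "classifier": 5,
}


def build_pipeline_grid(option_counts):
    """Return every combination of one option per subprocess.

    Option labels are hypothetical placeholders (e.g. 'parcellation_1'),
    not the actual options evaluated in the study.
    """
    names = list(option_counts)
    axes = [
        [f"{name}_{i}" for i in range(1, option_counts[name] + 1)]
        for name in names
    ]
    # Cartesian product: one option chosen from each subprocess.
    return [dict(zip(names, combo)) for combo in product(*axes)]


pipelines = build_pipeline_grid(OPTION_COUNTS)
print(len(pipelines))  # 360
```

Each entry of `pipelines` is a dictionary describing one candidate pipeline, which could then be trained on the discovery dataset and scored on the validation dataset.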