Abstract We report a surprising phenomenon about identifying differentially expressed genes (DEGs) from population-level RNA-seq data: two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates (FDRs). Via permutation analysis on an immunotherapy RNA-seq dataset, we observed that DESeq2 and edgeR identified even more DEGs after samples’ condition labels were randomly permuted. Motivated by this, we evaluated six DEG identification methods (DESeq2, edgeR, limma-voom, NOISeq, dearseq, and the Wilcoxon rank-sum test) on population-level RNA-seq datasets. We found that the FDR control was often failed by the three popular parametric methods—DESeq2, edgeR, and limma-voom— and the new non-parametric method dearseq. In particular, the actual FDRs of DESeq2 and edgeR sometimes exceeded 20% when the target FDR threshold was only 5%. Although NOISeq, a non-parametric method used by GTEx, controlled the FDR better than the other four methods did, its power was much lower than that of the Wilcoxon rank-sum test, a classic nonparametric test that consistently controlled the FDR and achieved good power in our evaluation. Based on these results, for population-level RNA-seq studies, we recommend the Wilcoxon rank-sum test.