Multilayer network models have been proposed as an effective means to capture the dynamic configuration of distributed neural circuits and to quantitatively describe how communities vary over time. However, the test-retest reliability of multilayer network measures has yet to be fully quantified. Here, we systematically evaluated the impact of code implementation, network parameter selection, scan duration, and task condition on the test-retest reliability of key multilayer network measures (i.e., flexibility, integration, recruitment). We found that each of these factors impacted reliability, although to differing degrees. The choice of parameters is a longstanding difficulty of multilayer modularity-maximization algorithms. As suggested by prior work, we found that optimal parameter selection was a key determinant of reliability. However, because the implementation of the multilayer community detection algorithm has changed, our findings revealed a more complex picture than previously appreciated: the parameter landscape of reliability depended on the software implementation. Consistent with findings from the static functional connectivity literature, scan duration was a much stronger determinant of reliability than scan condition. We found that both passive (i.e., resting state, Inscapes, and movie) and active (i.e., flanker) tasks can be highly reliable when the parameters are optimized and the scan duration is sufficient, although reliability in the movie watching condition was significantly higher than in the other three tasks. Accordingly, the minimal data requirement for achieving reliable measures in the movie watching condition was 20 min, less than the 30 min needed for the other three tasks. Collectively, our results quantify test-retest reliability for multilayer network measures and support the utility of movie fMRI as a reliable context in which to investigate time-invariant network dynamics.
Our practice of using test-retest reliability to optimize free parameters of multilayer modularity-maximization algorithms has the potential to enhance our ability to use these measures for the study of individual differences in cognitive traits.
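
To make the two central quantities concrete, the following is a minimal Python sketch, not the pipeline used in this study. It assumes the standard definition of flexibility (the fraction of consecutive layers across which a node changes its community assignment) and, because the reliability index is not specified in this summary, it uses the two-way random-effects, absolute-agreement, single-measurement intraclass correlation, ICC(2,1), purely as an illustrative choice. The function names `flexibility` and `icc_2_1` and the toy arrays are hypothetical.

```python
import numpy as np


def flexibility(assignments):
    """Per-node flexibility from a (n_nodes, n_layers) array of community labels.

    Flexibility is taken here as the fraction of consecutive layer pairs
    in which a node's community label changes.
    """
    a = np.asarray(assignments)
    changes = a[:, 1:] != a[:, :-1]  # True where the label differs from the previous layer
    return changes.mean(axis=1)


def icc_2_1(scores):
    """ICC(2,1) for a (n_subjects, n_sessions) array of a single measure.

    Assumed reliability index for illustration only: two-way random effects,
    absolute agreement, single measurement (Shrout and Fleiss convention).
    """
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()  # between-subject sum of squares
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()  # between-session sum of squares
    ss_err = ((y - grand) ** 2).sum() - ss_rows - ss_cols  # residual sum of squares
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy example: 3 nodes observed over 4 layers (time windows).
labels = [[1, 1, 2, 2],
          [1, 1, 1, 1],
          [1, 2, 1, 2]]
node_flex = flexibility(labels)

# Toy example: one measure for 3 subjects in 2 sessions, with a constant
# session offset that lowers absolute agreement.
retest = [[1.0, 2.0],
          [2.0, 3.0],
          [3.0, 4.0]]
reliability = icc_2_1(retest)
```

Under these assumptions, parameter optimization amounts to recomputing such a reliability value for each candidate setting of the algorithm's free parameters and retaining the setting that maximizes it.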