Associations between datasets can be discovered through multivariate methods like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). A requisite property for the interpretability and generalizability of CCA/PLS solutions is stability of the feature patterns driving an association. However, empirical characterizations have called the stability of CCA/PLS in high-dimensional datasets into question. To study these issues systematically, we developed a generative modeling framework to simulate synthetic datasets, parameterized by dimensionality, variance structure, and association strength. We found that when sample size is relatively small, but comparable to typical studies, CCA/PLS associations are highly unstable and inaccurate, both in their magnitude and, importantly, in the latent pattern underlying the discovered association. We confirmed these trends across two neuroimaging modalities, functional and diffusion MRI, and in two independent datasets, the Human Connectome Project (n≈1000) and UK Biobank (n≈20000), and found that only the latter comprised sufficient samples for stable mappings between imaging-derived and behavioral features. We further developed a power calculator to provide sample sizes required for stability and reliability of multivariate analyses in future studies.
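The small-sample behavior described above can be illustrated with a minimal sketch. This is not the authors' generative framework or power calculator: the function names `simulate` and `first_canonical_correlation` are hypothetical, and the simulation assumes identity within-set covariance and a single true canonical correlation `rho`, which is far simpler than the variance structure the abstract mentions. It only shows that the in-sample canonical correlation tends to overshoot the true value when the number of samples is not much larger than the number of features.

```python
import numpy as np


def simulate(n, px, py, rho, rng):
    """Draw n samples of X (n x px) and Y (n x py).

    Assumption for illustration: all features have unit variance and are
    uncorrelated, except that the first feature of X and the first feature
    of Y correlate with strength rho (the single true canonical correlation).
    """
    X = rng.standard_normal((n, px))
    Y = rng.standard_normal((n, py))
    # Mix the first column of Y with the first column of X so that their
    # population correlation is rho while the variance of Y[:, 0] stays 1.
    Y[:, 0] = rho * X[:, 0] + np.sqrt(1.0 - rho**2) * Y[:, 0]
    return X, Y


def first_canonical_correlation(X, Y):
    """In-sample first canonical correlation (requires n > px and n > py).

    Uses orthonormal bases of the centered data (via QR); the singular
    values of Qx^T Qy are the sample canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny numerical overshoot above 1


rng = np.random.default_rng(0)
true_rho = 0.3
for n in (50, 200, 1000, 20000):
    X, Y = simulate(n, px=20, py=20, rho=true_rho, rng=rng)
    r = first_canonical_correlation(X, Y)
    print(f"n={n:6d}  in-sample r={r:.2f}  (true rho={true_rho})")
```

Under these assumptions, the printed in-sample estimate should approach the true `rho` only as `n` grows large relative to the 20 features per set; the feature counts and sample sizes here are arbitrary choices for the illustration, not values taken from the study.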