The availability of large-scale biobanks linking rich phenotypes and biological measures is a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive non-random missingness. While missing data prediction is possible, performance is significantly impaired by block-wise missingness inherent to many biobanks. To address this, we developed Missingness Adapted Group-wise Informed Clustered (MAGIC)-LASSO which performs hierarchical clustering of variables based on missingness followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and balance between training and target sets with final models built using stepwise inclusion of features ranked by completeness. This research has been conducted using the UK Biobank (n>500k) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores. The phenotypic correlation between measured and predicted total score was 0.67 while genetic correlations between independent subjects was high >0.86, demonstrating the method has significant accuracy and utility.
Support the authors with ResearchCoin
Support the authors with ResearchCoin