ResearchHub | Open Science Community

Definitions, methods, and applications in interpretable machine learning

William Murdoch et al., Oct 16, 2019

Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides 3 overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.

Genetics

Philosophy


Artificial intelligence and statistics

Bin Yu et al., Jan 1, 2018

Artificial intelligence (AI) is intrinsically data-driven. It calls for the application of statistical concepts through human-machine collaboration during the generation of data, the development of algorithms, and the evaluation of results. This paper discusses how such human-machine collaboration can be approached through the statistical concepts of population, question of interest, representativeness of training data, and scrutiny of results (PQRS). The PQRS workflow provides a conceptual framework for integrating statistical ideas with human input into AI products and research. These ideas include the experimental design principles of randomization and local control, as well as the principle of stability, to gain reproducibility and interpretability of algorithms and data results. We discuss the use of these principles in the contexts of self-driving cars, automated medical diagnoses, and examples from the authors’ collaborative research.

Artificial Intelligence

Law


Iterative random forests to discover predictive and stable high-order interactions

Sumanta Basu et al., Jan 19, 2018

Significance: We developed a predictive, stable, and interpretable tool: the iterative random forest algorithm (iRF). iRF discovers high-order interactions among biomolecules with the same order of computational cost as random forests. We demonstrate the efficacy of iRF by finding known and promising interactions among biomolecules, of up to fifth and sixth order, in two data examples in transcriptional regulation and alternative splicing.

Genetics

Artificial Intelligence


iterative Random Forests to discover predictive and stable high-order interactions

Sumanta Basu et al., Nov 20, 2017

Abstract: Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th- and 6th-order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
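The stability criterion in this abstract (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be illustrated with a small sketch. This is not the iRF implementation: the function names, the `threshold` parameter, and the toy replicate data with placeholder features A, B, and C are all assumptions made for illustration. In practice, each replicate's set of interactions would come from fitting the model on one bootstrap sample.

```python
from collections import Counter

# Hypothetical toy data: each entry is the set of interactions returned by one
# bootstrap replicate. Feature names A, B, C are placeholders, not real results.
TOY_REPLICATES = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"}), frozenset({"B", "C"})},
    {frozenset({"A", "B", "C"}), frozenset({"B", "C"})},
]


def stability_scores(replicates):
    """Return the fraction of replicates in which each interaction appears."""
    replicates = [set(rep) for rep in replicates]
    if not replicates:
        return {}
    # Each replicate is a set, so an interaction is counted at most once per replicate.
    counts = Counter(inter for rep in replicates for inter in rep)
    n = len(replicates)
    return {inter: count / n for inter, count in counts.items()}


def stable_interactions(replicates, threshold=0.5):
    """Keep interactions returned in strictly more than `threshold` of replicates."""
    scores = stability_scores(replicates)
    return {inter: score for inter, score in scores.items() if score > threshold}
```

On the toy data only the A-B pair clears the cutoff; interactions returned in exactly half of the replicates are excluded because the comparison is strict ("more than half").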

Genetics

Molecular Biology


Learning epistatic polygenic phenotypes with Boolean interactions

Merle Behr et al., Nov 25, 2020

Abstract: Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.
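The contrast this abstract draws between multiplicative regression terms and Boolean interactions can be made concrete with a toy comparison. The sketch below is only an illustration under stated assumptions: the abstract does not specify a genotype encoding, so the 0/1/2 allele-count coding, the thresholds `t1` and `t2`, and both function names are hypothetical.

```python
from itertools import product

# Assumed encoding: number of copies of a variant allele (0, 1, or 2).
GENOTYPES = (0, 1, 2)


def multiplicative_term(g1, g2):
    """Pairwise interaction term as used in a regression model: the product."""
    return g1 * g2


def boolean_and(g1, g2, t1=1, t2=1):
    """Boolean interaction: 1 only if both variants reach their thresholds."""
    return int(g1 >= t1 and g2 >= t2)


# One row per genotype pair: (g1, g2, product term, Boolean AND).
TABLE = [
    (g1, g2, multiplicative_term(g1, g2), boolean_and(g1, g2))
    for g1, g2 in product(GENOTYPES, repeat=2)
]
```

Under this coding the product grows with allele count, while the Boolean rule behaves like a switch that is either on or off. That difference in functional form is the kind of mismatch the abstract refers to.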

Genetics

Artificial Intelligence


Dissecting the effects of GTPase and kinase domain mutations on LRRK2 endosomal localization and activity

Capria Rinaldi et al., Oct 27, 2022

Abstract: Parkinson’s disease-causing LRRK2 mutations lead to varying degrees of Rab GTPase hyperphosphorylation. Puzzlingly, LRRK2 GTPase-inactivating mutations, which do not affect intrinsic kinase activity, lead to higher levels of cellular Rab phosphorylation than kinase-activating mutations. Here, we investigated whether mutation-dependent differences in LRRK2 cellular localization could explain this discrepancy. We discovered that blocking endosomal maturation leads to the rapid formation of mutant LRRK2+ endosomes on which LRRK2 phosphorylates substrate Rabs. LRRK2+ endosomes are maintained through positive feedback, which mutually reinforces membrane localization of LRRK2 and phosphorylated Rab substrates. Furthermore, across a panel of mutants, cells expressing GTPase-inactivating mutants formed strikingly more LRRK2+ endosomes than cells expressing kinase-activating mutants, resulting in higher total cellular levels of phosphorylated Rabs. Our study suggests that LRRK2 GTPase-inactivating mutants are more likely than kinase-activating mutants to be retained on intracellular membranes, which leads to higher substrate phosphorylation.

Biochemistry

Biophysics


Identifying FUS amyotrophic lateral sclerosis disease signatures in patient dermal fibroblasts

Karl Kumbier et al., Jun 14, 2024

Amyotrophic lateral sclerosis (ALS) is a rapidly progressing, highly heterogeneous neurodegenerative disease, underscoring the importance of obtaining information to personalize clinical decisions quickly after diagnosis. Here, we investigated whether ALS-relevant signatures can be detected directly from biopsied patient fibroblasts. We profiled familial ALS (fALS) fibroblasts, representing a range of mutations in the fused in sarcoma (FUS) gene and ages of onset. To differentiate FUS fALS and healthy control fibroblasts, machine-learning classifiers were trained separately on high-content imaging and transcriptional profiles. "Molecular ALS phenotype" scores, derived from these classifiers, captured a spectrum from disease to health. Interestingly, these scores negatively correlated with age of onset, identified several pre-symptomatic individuals and sporadic ALS (sALS) patients with FUS-like fibroblasts, and quantified "movement" of FUS fALS and "FUS-like" sALS toward health upon FUS ASO treatment. Taken together, these findings provide evidence that non-neuronal patient fibroblasts can be used for rapid, personalized assessment in ALS.

Genetics

Molecular Biology


A scalable screening platform for phenotypic subtyping of ALS patient-derived fibroblasts

Karl Kumbier et al., Sep 28, 2022

Abstract: A major challenge for understanding and treating Amyotrophic Lateral Sclerosis (ALS) is that most patients have no known genetic cause. Even within defined genetic subtypes, patients display considerable clinical heterogeneity. It is unclear how to identify subsets of ALS patients that share common molecular dysregulation or could respond similarly to treatment. Here, we developed a scalable microscopy and machine learning platform to phenotypically subtype readily available, primary patient-derived fibroblasts. Application of our platform identified robust signatures for the genetic subtype FUS-ALS, allowing cell lines to be scored along a spectrum from FUS-ALS to non-ALS. Our FUS-ALS phenotypic score negatively correlates with age of diagnosis and provides information that is distinct from transcript profiling. Interestingly, the FUS-ALS phenotypic score can be used to identify sporadic patient fibroblasts that have consistent pathway dysregulation with FUS-ALS. Further, we showcase how the score can be used to evaluate the effects of ASO treatment on patient fibroblasts. Our platform provides an approach to move from genetic to phenotypic subtyping and a first step towards rational selection of patient subpopulations for targeted therapies.

Genetics

Molecular Biology


Selection of optimal cell lines for high-content phenotypic screening

Louise Heinrich et al., Jan 12, 2023

Abstract: High-content microscopy offers a scalable approach to screen against multiple targets in a single pass. Prior work has focused on methods to select “optimal” cellular readouts in microscopy screens. However, methods to select optimal cell line models have garnered much less attention. Here, we provide a roadmap for how to select the cell line or lines that are best suited to identify bioactive compounds and their mechanism of action (MOA). We test our approach on compounds targeting cancer-relevant pathways, ranking cell lines in two tasks: detecting compound activity (“phenoactivity”) and grouping compounds with similar MOA by similar phenotype (“phenosimilarity”). Evaluating six cell lines across 3214 well-annotated compounds, we show that optimal cell line selection depends on both the task of interest (e.g., detecting phenoactivity vs. inferring phenosimilarity) and the distribution of MOAs within the compound library. Given a task of interest and a set of compounds, we provide a systematic framework for choosing optimal cell line(s). Our framework can be used to reduce the number of cell lines required to identify hits within a compound library and help accelerate the pace of early drug discovery.

Genetics

Artificial Intelligence


Refining interaction search through signed iterative Random Forests

Karl Kumbier et al., Nov 11, 2018

Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically "black boxes," learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest (iRF) algorithm took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF (s-iRF), describes subsets of rules that frequently occur on RF decision paths. We refer to these "rule subsets" as signed interactions. Signed interactions not only share the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.
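The idea of a signed interaction as a rule subset that recurs on decision paths can be sketched as follows. This is a simplified illustration rather than the s-iRF procedure: the path representation, the "+"/"-" sign convention, the unweighted prevalence count, and all names and toy values are assumptions introduced for the example. The abstract does not give the exact definitions.

```python
# Assumed representation: a decision path is a list of rules
# (feature, threshold, goes_high), where goes_high is True when the path
# follows the "feature > threshold" branch. Toy values are hypothetical.
TOY_PATHS = [
    [("x1", 0.4, True), ("x2", 1.2, True)],
    [("x1", 0.5, True), ("x2", 1.0, True), ("x3", 0.0, False)],
    [("x1", 0.4, False), ("x2", 1.1, True)],
    [("x3", 0.2, True)],
]


def signed_features(path):
    """Reduce a path to its signed features, dropping the exact thresholds.

    A feature split in both directions on one path yields both signs.
    """
    return frozenset(
        (feature, "+" if goes_high else "-") for feature, _threshold, goes_high in path
    )


def prevalence(paths, interaction):
    """Fraction of paths whose signed features include every member of `interaction`."""
    if not paths:
        return 0.0
    target = frozenset(interaction)
    hits = sum(1 for path in paths if target <= signed_features(path))
    return hits / len(paths)
```

In the toy paths, the signed pair (x1 high, x2 high) occurs on two of four paths, whereas (x1 low, x2 high) occurs on one. The two share the same features but describe different functional relationships, which is why the signs are tracked.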

Genetics

Philosophy
