ResearchHub | Open Science Community

A rarefaction-without-resampling extension of PERMANOVA for testing presence-absence associations in the microbiome

Yi‐Juan Hu et al., Apr 8, 2021

Abstract. Background: PERMANOVA [1] is currently the most commonly used method for testing community-level hypotheses about microbiome associations with covariates of interest. PERMANOVA can test for associations that result from changes in which taxa are present or absent by using the Jaccard or unweighted UniFrac distance. However, such presence-absence analyses face a unique challenge: confounding by library size (total sample read count), which occurs when library size is associated with covariates in the analysis. It is known that rarefaction (subsampling to a common library size) controls this bias, but at the potential costs of information loss and the introduction of a stochastic component into the analysis. Methods: Here we develop a non-stochastic approach to PERMANOVA presence-absence analyses that aggregates information over all potential rarefaction replicates without actual resampling, when the Jaccard or unweighted UniFrac distance is used. We compare this new approach to three possible ways of aggregating PERMANOVA over multiple rarefactions obtained from resampling: averaging the distance matrix, averaging the (element-wise) squared distance matrix, and averaging the F-statistic. Results: Our simulations indicate that our non-stochastic approach is robust to confounding by library size and outperforms each of the stochastic resampling approaches. We also show that, when overdispersion is low, averaging the (element-wise) squared distance outperforms averaging the unsquared distance, currently implemented in the R package vegan. We illustrate our methods using an analysis of data on inflammatory bowel disease (IBD) in which samples from case participants have systematically smaller library sizes than samples from control participants.
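
The abstract does not say how information is aggregated over all rarefaction replicates, so the following is only a building block, not the authors' derivation for the Jaccard or unweighted UniFrac distance. It assumes rarefaction draws reads without replacement, in which case the chance that a taxon is still observed after rarefaction follows from the hypergeometric distribution and needs no resampling. The function name and the toy counts are hypothetical.

```python
from math import comb


def prob_present_after_rarefaction(count: int, library_size: int, depth: int) -> float:
    """Probability that a taxon with `count` reads in a sample of
    `library_size` reads is still observed after drawing `depth` reads
    without replacement (assumed rarefaction scheme)."""
    if not 0 <= count <= library_size:
        raise ValueError("count must lie between 0 and library_size")
    if not 0 <= depth <= library_size:
        raise ValueError("depth must lie between 0 and library_size")
    # P(absent) = C(N - c, r) / C(N, r); comb() returns 0 when r > N - c.
    p_absent = comb(library_size - count, depth) / comb(library_size, depth)
    return 1.0 - p_absent


# Same rare taxon (2 reads) in a shallow and in a deep sample, both rarefied to 5 reads.
shallow = prob_present_after_rarefaction(count=2, library_size=10, depth=5)
deep = prob_present_after_rarefaction(count=2, library_size=100, depth=5)
print(f"shallow sample: {shallow:.4f}, deep sample: {deep:.4f}")
```

Averaging a presence indicator over every possible rarefied table gives exactly this probability, which is the sense in which an expectation over replicates can be obtained without any random draws.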

Genetics

Ecology

Testing microbiome associations with censored survival outcomes at both the community and individual taxon levels

Yingtian Hu et al., Mar 14, 2022

Abstract. Background: Finding microbiome associations with possibly censored survival times is an important problem, especially as specific taxa could serve as biomarkers for disease prognosis or as targets for therapeutic interventions. The two existing methods for survival outcomes, MiRKAT-S and OMiSA, are restricted to testing associations at the community level and do not provide results at the individual taxon level. An ad hoc approach that tests each taxon against a survival outcome using the Cox proportional hazards model may not perform well in the microbiome setting, with its sparse count data and small sample sizes. Methods: We have previously developed the linear decomposition model (LDM), which unifies community-level and taxon-level tests into one framework. Here we extend the LDM to test survival outcomes. We propose to use the martingale residuals or the deviance residuals obtained from the Cox model as continuous covariates in the LDM. We further construct tests that combine the results of analyzing each set of residuals separately. Finally, we extend PERMANOVA, the most commonly used distance-based method for testing community-level hypotheses, to handle survival outcomes in a similar manner. Results: Using simulated data, we showed that the LDM-based tests preserved the false discovery rate for testing individual taxa and had good sensitivity. The LDM-based community-level tests and PERMANOVA-based tests had comparable or better power than MiRKAT-S and OMiSA. An analysis of data on the association of the gut microbiome and the time to acute graft-versus-host disease revealed several dozen associated taxa, a result that no community-level test could have produced, as well as community-level results from the LDM and PERMANOVA that improved on those obtained using MiRKAT-S and OMiSA. Availability and implementation: The new methods described here have been added to our R package LDM, which is available on GitHub at https://github.com/yijuanhu/LDM.
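
To make the two residual types concrete, here is a minimal sketch that computes martingale and deviance residuals under a strong simplifying assumption: a Cox model with no covariates, so that the baseline cumulative hazard reduces to the Nelson-Aalen estimate. A Cox model with adjustment covariates would multiply the hazard by each subject's fitted risk score, which is not shown. The function names and the toy data are hypothetical and are not taken from the LDM package.

```python
from math import copysign, log, sqrt


def null_martingale_residuals(times, events):
    """Martingale residuals event_i - H(t_i), with H the Nelson-Aalen
    cumulative hazard (Cox model with no covariates assumed)."""
    cumulative = 0.0
    hazard_at = {}
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        cumulative += deaths / at_risk
        hazard_at[t] = cumulative
    return [e - hazard_at[t] for t, e in zip(times, events)]


def deviance_residuals(martingale, events):
    """Deviance residuals: sign(M) * sqrt(-2 * (M + event * log(event - M)))."""
    residuals = []
    for m, e in zip(martingale, events):
        inner = m + (log(e - m) if e == 1 else 0.0)
        # max() guards against tiny negative values from floating-point rounding.
        residuals.append(copysign(sqrt(max(0.0, -2.0 * inner)), m))
    return residuals


times = [1, 2, 3]   # follow-up times
events = [1, 0, 1]  # 1 = event observed, 0 = censored
mart = null_martingale_residuals(times, events)
dev = deviance_residuals(mart, events)
print(mart)
print(dev)
```

Either list of residuals is a single continuous value per subject, which is what allows it to be passed to a method that expects a continuous covariate.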

Ecology

Oncology

Integrative analysis of relative abundance data and presence-absence data of the microbiome using the LDM

Zhengyi Zhu et al., Jan 14, 2022

Abstract. Summary: We previously developed the LDM, a method for testing hypotheses about the microbiome at both the community level and the individual taxon level. The LDM can be applied to relative abundance data and presence-absence data separately; these analyses work well when the associated taxa are abundant and rare, respectively. Here we propose an omnibus test based on the LDM that allows simultaneous consideration of data at different scales, thus offering optimal power across scenarios with different association mechanisms. The omnibus test is available for the wide range of data types and analyses that are supported by the LDM. Availability and implementation: The omnibus test has been added to the R package LDM, which is available on GitHub at https://github.com/yijuanhu/LDM. Contact: yijuan.hu@emory.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Ecology

Molecular Biology

MIDASim: a fast and simple simulator for realistic microbiome data

Mengyu He et al., Mar 25, 2023

Abstract. Background: Advances in sequencing technology have led to the discovery of associations between the human microbiota and many diseases, conditions, and traits. With the increasing availability of microbiome data, many statistical methods have been developed for studying these associations. The growing number of newly developed methods highlights the need for simple, rapid, and reliable methods to simulate realistic microbiome data, which is essential for validating and evaluating the performance of these methods. However, generating realistic microbiome data is challenging due to the complex nature of microbiome data, which feature correlation between taxa, sparsity, overdispersion, and compositionality. Current methods for simulating microbiome data are deficient in their ability to capture these important features of microbiome data, or can require exorbitant computational time. Methods: We develop MIDASim (MIcrobiome DAta Simulator), a fast and simple approach for simulating realistic microbiome data that reproduces the distributional and correlation structure of a template microbiome dataset. MIDASim is a two-step approach. The first step generates correlated binary indicators that represent the presence-absence status of all taxa, and the second step generates relative abundances and counts for the taxa that are considered to be present in step 1, utilizing a Gaussian copula to account for the taxon-taxon correlations. In the second step, MIDASim can operate in either a nonparametric or a parametric mode. In the nonparametric mode, the Gaussian copula uses the empirical distribution of relative abundances for the marginal distributions. In the parametric mode, an inverse generalized gamma distribution is used in place of the empirical distribution. Results: We demonstrate improved performance of MIDASim relative to other existing methods using gut and vaginal data. MIDASim showed superior performance as measured by PERMANOVA and in terms of alpha diversity and beta dispersion in either parametric or nonparametric mode. We also show how MIDASim in parametric mode can be used to assess the performance of methods for finding differentially abundant taxa in a compositional model. Conclusions: MIDASim is easy to implement, flexible, and suitable for most microbiome data simulation situations. MIDASim has three major advantages. First, MIDASim performs better than other methods in reproducing the distributional features of real data at both the presence-absence level and the relative-abundance level. MIDASim-simulated data are more similar to the template data than data simulated by competing methods, as quantified using a variety of measures. Second, MIDASim makes few distributional assumptions for the relative abundances, and thus can easily accommodate complex distributional features in real data. Third, MIDASim is computationally efficient and can be used to simulate large microbiome datasets.

Philosophy

Oncology

LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control

Yingtian Hu et al., Oct 4, 2021

Abstract. Motivation: Compositional analysis is based on the premise that a relatively small proportion of taxa are “differentially abundant”, while the ratios of the relative abundances of the remaining taxa remain unchanged. Most existing methods of compositional analysis such as ANCOM or ANCOM-BC use log-transformed data, but log-transformation of data with pervasive zero counts is problematic, and these methods cannot always control the false discovery rate (FDR). Further, high-throughput microbiome data such as 16S amplicon or metagenomic sequencing are subject to experimental biases that are introduced in every step of the experimental workflow. McLaren, Willis and Callahan [1] have recently proposed a model for how these biases affect relative abundance data. Methods: Motivated by [1], we show that the (log) odds ratios in a logistic regression comparing counts in two taxa are invariant to experimental biases. With this motivation, we propose LOCOM, a robust logistic regression approach to compositional analysis that does not require pseudocounts. We use a Firth bias-corrected estimating function to account for sparse data. Inference is based on permutation to account for overdispersion and small sample sizes. Traits can be either binary or continuous, and adjustment for continuous and/or discrete confounding covariates is supported. Results: Our simulations indicate that LOCOM always preserved FDR and had much improved sensitivity over existing methods. In contrast, ANCOM often had inflated FDR; ANCOM-BC largely controlled FDR but still had modest inflation occasionally; ALDEx2 generally had low sensitivity. LOCOM and ANCOM were robust to experimental biases in every situation, while ANCOM-BC and ALDEx2 had elevated FDR when biases at causal and non-causal taxa were differentially distributed. The flexibility of our method for a variety of microbiome studies is illustrated by the analysis of data from two microbiome studies. Availability and implementation: Our R package LOCOM is available on GitHub at https://github.com/yijuanhu/LOCOM in formats appropriate for Macintosh or Windows.
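
The invariance claim can be checked numerically with a small sketch. It assumes the bias acts as a taxon-specific multiplicative efficiency followed by renormalization, which is one reading of the model attributed to McLaren, Willis and Callahan, and it works with population proportions directly rather than fitting a logistic regression with Firth correction. All names and numbers are hypothetical and are not part of the LOCOM package.

```python
from math import log


def observed_proportions(true_props, efficiencies):
    """Apply taxon-specific multiplicative bias, then renormalize to sum to 1."""
    weighted = [p * e for p, e in zip(true_props, efficiencies)]
    total = sum(weighted)
    return [w / total for w in weighted]


def log_odds_ratio(props_group1, props_group0, taxon, reference):
    """Log odds of `taxon` versus `reference` in group 1 minus the same in group 0."""
    return (log(props_group1[taxon] / props_group1[reference])
            - log(props_group0[taxon] / props_group0[reference]))


true_controls = [0.5, 0.3, 0.2]
true_cases = [0.2, 0.3, 0.5]
efficiencies = [4.0, 1.0, 0.25]  # assumed measurement efficiency of each taxon

seen_controls = observed_proportions(true_controls, efficiencies)
seen_cases = observed_proportions(true_cases, efficiencies)

lor_true = log_odds_ratio(true_cases, true_controls, taxon=0, reference=1)
lor_seen = log_odds_ratio(seen_cases, seen_controls, taxon=0, reference=1)
print(lor_true, lor_seen)  # the two agree up to floating-point rounding
```

The efficiencies cancel because each one multiplies the same taxon in both groups, whereas the observed proportions themselves are distorted by the bias.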

Artificial Intelligence

Biochemistry

MERIT: controlling Monte-Carlo error rate in large-scale Monte-Carlo hypothesis testing

Yunxiao Li et al., Jan 18, 2022

Abstract. The use of Monte-Carlo (MC) p-values when testing the significance of a large number of hypotheses is now commonplace. In large-scale hypothesis testing, we will typically encounter at least some p-values near the threshold of significance, which require a larger number of MC replicates than p-values that are far from the threshold. As a result, the list of detections can vary when different MC replicates are used, resulting in a lack of reproducibility. The method of Gandy and Hahn (GH) (2014; 2016; 2017) is the only method that has directly addressed this problem, defining a Monte-Carlo error rate (MCER) to be the probability that any decisions on accepting or rejecting a hypothesis based on MC p-values are different from decisions based on ideal p-values, and then making decisions that control the MCER. Unfortunately, GH is frequently very conservative, often making no rejections at all and leaving a large number of hypotheses “undecided”. In this article, we propose MERIT, a method for large-scale MC hypothesis testing that also controls the MCER but is more statistically efficient than the GH method. Through extensive simulation studies, we demonstrated that MERIT controlled the MCER and substantially improved the sensitivity and specificity of detections compared to GH. We also illustrated our method by an analysis of gene expression data from a prostate cancer study.
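
The reproducibility problem described above can be illustrated with a short sketch. It uses the standard add-one MC p-value estimator, which the abstract does not define, and a standard normal null distribution chosen purely for illustration; it demonstrates the problem only and implements neither MERIT nor the GH method. The function names are hypothetical.

```python
import random


def mc_p_value(observed, null_stats):
    """Add-one Monte-Carlo p-value: (1 + #{null >= observed}) / (B + 1)."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


def decisions_across_runs(observed, n_replicates, n_runs, alpha=0.05, seed=0):
    """Repeat the MC test with fresh null draws and record each reject decision."""
    rng = random.Random(seed)
    decisions = []
    for _ in range(n_runs):
        null_stats = [rng.gauss(0.0, 1.0) for _ in range(n_replicates)]
        decisions.append(mc_p_value(observed, null_stats) <= alpha)
    return decisions


# An observed statistic whose ideal one-sided p-value is close to 0.05.
decisions = decisions_across_runs(observed=1.645, n_replicates=199, n_runs=20)
print(sum(decisions), "of", len(decisions), "runs reject at alpha = 0.05")
```

When the ideal p-value sits near the threshold, the reject decision tends to differ from run to run; a statistic far from the threshold gives the same decision almost every time.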

Statistics And Probability

Computer Science

A Rarefaction-Based Extension of the LDM for Testing Presence-Absence Associations in the Microbiome

Yi‐Juan Hu et al., May 30, 2020

Abstract. Background: Many methods for testing association between the microbiome and covariates of interest (e.g., clinical outcomes, environmental factors) assume that these associations are driven by changes in the relative abundance of taxa. However, these associations may also result from changes in which taxa are present and which are absent. Analyses of such presence-absence associations face a unique challenge: confounding by library size (total sample read count), which occurs when library size is associated with covariates in the analysis. It is known that rarefaction (subsampling to a common library size) controls this bias, but at the potential cost of information loss as well as the introduction of a stochastic component into the analysis. Currently, there is a need for robust and efficient methods for testing presence-absence associations in the presence of such confounding, both at the community level and at the individual-taxon level, that avoid the drawbacks of rarefaction. Methods: We have previously developed the linear decomposition model (LDM), which unifies the community-level and taxon-level tests into one framework. Here we present an extension of the LDM for testing presence-absence associations. The extended LDM is a non-stochastic approach that repeatedly applies the LDM to all rarefied taxa count tables, averages the residual sum-of-squares (RSS) terms over the rarefaction replicates, and then forms an F-statistic based on these average RSS terms. We show that this approach compares favorably to averaging the F-statistic from R rarefaction replicates, which can only be calculated stochastically. The flexible nature of the LDM allows discrete or continuous traits or interactions to be tested while allowing confounding covariates to be adjusted for. Results: Our simulations indicate that our proposed method is robust to any systematic differences in library size and has better power than alternative approaches. We illustrate our method using an analysis of data on inflammatory bowel disease (IBD) in which case samples have systematically smaller library sizes than controls. Conclusions: The rarefaction-based extension of the LDM performs well for testing presence-absence associations and should be adopted even when there is no obvious systematic variation in library size.
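
The difference between the two aggregation orders is plain arithmetic, sketched below for a generic F-type ratio of a numerator and a denominator sum of squares. The abstract does not give the exact form of the LDM's RSS terms, and the extended LDM averages over all rarefied tables analytically rather than over a finite list, so the per-replicate values, the degrees of freedom, and the function names here are all hypothetical.

```python
def f_from_average_ss(replicates, df_num, df_den):
    """Average each sum-of-squares term over replicates, then form one ratio."""
    n = len(replicates)
    mean_num = sum(num for num, _ in replicates) / n
    mean_den = sum(den for _, den in replicates) / n
    return (mean_num / df_num) / (mean_den / df_den)


def average_of_f(replicates, df_num, df_den):
    """Form a ratio for every replicate, then average the ratios."""
    ratios = [(num / df_num) / (den / df_den) for num, den in replicates]
    return sum(ratios) / len(ratios)


# (numerator SS, denominator SS) from two made-up rarefaction replicates.
replicates = [(4.0, 2.0), (1.0, 1.0)]
f_avg_ss = f_from_average_ss(replicates, df_num=1, df_den=1)
avg_f = average_of_f(replicates, df_num=1, df_den=1)
print(f_avg_ss, avg_f)
```

The two quantities generally differ because a ratio of averages is not an average of ratios, which is why the order of aggregation matters for the resulting test statistic.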

Ecology

Oncology

Efficient Estimation of Indirect Effects in Case-Control Studies Using a Unified Likelihood Framework

Glen Satten et al., Jul 16, 2021

Abstract. Mediation models are a set of statistical techniques that investigate the mechanisms that produce an observed relationship between an exposure variable and an outcome variable in order to deduce the extent to which the relationship is influenced by intermediate mediator variables. For a case-control study, the most common mediation analysis strategy employs a counterfactual framework that permits estimation of indirect and direct effects on the odds ratio scale for dichotomous outcomes, assuming either binary or continuous mediators. While this framework has become an important tool for mediation analysis, we demonstrate that we can embed this approach in a unified likelihood framework for mediation analysis in case-control studies that leverages more features of the data (in particular, the relationship between exposure and mediator) to improve efficiency of indirect effect estimates. One important feature of our likelihood approach is that it naturally incorporates cases within the exposure-mediator model to improve efficiency. Our approach does not require knowledge of disease prevalence, can model confounders and exposure-mediator interactions, and is straightforward to implement in standard statistical software. We illustrate our approach using both simulated data and real data from a case-control genetic study of lung cancer.

Artificial Intelligence

Law

Testing hypotheses about the microbiome using the linear decomposition model (LDM)

Yi‐Juan Hu et al., Dec 6, 2017

Motivation: Methods for analyzing microbiome data generally fall into one of two groups: tests of the global hypothesis of any microbiome effect, which do not provide any information on the contribution of individual operational taxonomic units (OTUs); and tests for individual OTUs, which do not typically provide a global test of microbiome effect. Without a unified approach, the findings of a global test may be hard to reconcile with the findings at the individual OTU level. Further, many tests of individual OTU effects do not preserve the false discovery rate (FDR). Results: We introduce the linear decomposition model (LDM), which provides a single analysis path that includes global tests of any effect of the microbiome, tests of the effects of individual OTUs while accounting for multiple testing by controlling the FDR, and a connection to distance-based ordination. The LDM accommodates both continuous and discrete variables (e.g., clinical outcomes, environmental factors) as well as interaction terms to be tested either singly or in combination, allows for adjustment of confounding covariates, and uses permutation-based p-values that can control for correlation. The LDM can also be applied to transformed data, and an “omnibus” test can easily combine results from analyses conducted on different transformation scales. We also provide a new implementation of PERMANOVA based on our approach. For global testing, our simulations indicate the LDM provided correct type I error and can have comparable power to existing distance-based methods. For testing individual OTUs, our simulations indicate the LDM controlled the FDR well. In contrast, DESeq2 often had inflated FDR; MetagenomeSeq generally had the lowest sensitivity. The flexibility of the LDM for a variety of microbiome studies is illustrated by the analysis of data from two microbiome studies. We also show that our implementation of PERMANOVA can outperform existing implementations.

Biochemistry

Environmental Engineering

Impact of experimental bias on compositional analysis of microbiome data

Yingtian Hu et al., Feb 9, 2023

Microbiome data are subject to experimental bias caused by DNA extraction and PCR amplification, among other sources, but this important feature is often ignored when developing statistical methods for analyzing microbiome data. McLaren, Willis and Callahan (2019) proposed a model for how such bias affects the observed taxonomic profiles, which assumes main effects of bias without taxon-taxon interactions. Our recently developed method for testing differential abundance of taxa, LOCOM (logistic regression for compositional analysis), is the first method to account for experimental bias and is robust to main-effect biases. However, there is also evidence for taxon-taxon interactions. In this report, we formulated a model for interaction biases and used simulations based on this model to evaluate the impact of interaction biases on the performance of LOCOM as well as other available compositional analysis methods. Our simulation results indicated that LOCOM remained robust to a reasonable range of interaction biases. The other methods tended to have inflated FDR even when there were only main-effect biases. LOCOM maintained the highest sensitivity even when the other methods could not control the FDR. We thus conclude that LOCOM outperforms the other methods for compositional analysis of microbiome data considered here.

Ecology

Artificial Intelligence
