Overcoming the Winner’s Curse: Estimating Penetrance Parameters from Case-Control Data

Authors

Sebastian Zöllner

Published

March 13, 2007

Abstract

Genomewide association studies are now a widely used approach in the search for loci that affect complex traits. After detection of significant association, estimates of penetrance and allele-frequency parameters for the associated variant indicate the importance of that variant and facilitate the planning of replication studies. However, when these estimates are based on the original data used to detect the variant, the results are affected by an ascertainment bias known as the “winner’s curse.” The actual genetic effect is typically smaller than its estimate. This overestimation of the genetic effect may cause replication studies to fail because the necessary sample size is underestimated. Here, we present an approach that corrects for the ascertainment bias and generates an estimate of the frequency of a variant and its penetrance parameters. The method produces a point estimate and confidence region for the parameter estimates. We study the performance of this method using simulated data sets and show that it is possible to greatly reduce the bias in the parameter estimates, even when the original association study had low power. The uncertainty of the estimate decreases with increasing sample size, independent of the power of the original test for association. Finally, we show that application of the method to case-control data can improve the design of replication studies considerably.

Identification of the genetic variants that contribute to complex traits is an important current challenge in the field of human genetics. Although there is a steady stream of reported associations, replication of findings is often inconsistent, even for those associations that do ultimately turn out to be genuine [1: Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med. 2002;4:45-61; 2: Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet. 2003;33:177-182]. In large part, the difficulties of replication occur because most genuine associations have modest effects; hence, there is generally incomplete power to detect associations in any given study.
These challenges will undoubtedly continue as we move into the era of affordable whole-genome association studies, through which it is possible to detect variants of small effect anywhere in the genome [3: Herbert A, Gerry NP, McQueen MB, Heid IM, Pfeufer A, Illig T, Wichmann HE, Meitinger T, Hunter D, Hu FB, et al. A common genetic variant is associated with adult and childhood obesity. Science. 2006;312:279-283; 4: Edwards AO, Ritter R, Abel JK, Manning A, Panhuysen C, Farrer LA. Complement factor H polymorphism and age-related macular degeneration. Science. 2005;308:421-424].

When a study identifies a marker that shows evidence of association with a disease, it is common to estimate the impact of this variant on the phenotype of interest. This impact is often expressed as an odds ratio, that is, the ratio of the odds of manifesting the disease in carriers of the risk allele to the odds of manifesting the disease in noncarriers. A complete description of the impact of a variant affecting a binary phenotype includes two sets of parameters: the frequencies of the genotypes and the penetrances of the genotypes. These parameters can be used to assess the impact of the detected variant, as measured, for example, by the attributable risk.

Estimation of the strength of the effect of a genetic variant on the phenotype is also helpful for planning successful replication studies. Unfortunately, estimation of these parameters with the same data set that was used to identify the variant of interest is not straightforward, since the data set does not constitute a random population sample for two reasons. First, samples that are used for association mapping are usually collected to oversample affected individuals relative to their frequency in the population (e.g., the sample might include equal numbers of cases and controls).
Second, and more seriously, there is a major ascertainment effect that occurs when a variant is of interest specifically because it was significant for association. For a variant that is genuinely (but weakly) associated with disease, there may be only low or moderate power to detect association. Hence, when there is a significant result, it may imply that the genotype counts of cases and controls are more different from each other than expected. Consequently, the estimates of effect size are biased upward. This effect, which is an example of the “winner’s curse” from economics [5: Capen EC, Clapp RV, Campbell WM. Competitive bidding in high-risk situations. J Petrol Technol. 1971;23:641-653], depends strongly on the power of the initial test for association [6: Göring HHH, Terwilliger JD, Blangero J. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet. 2001;69:1357-1369]. If the power is high, most random draws from the distribution of genotype counts will result in a significant test for association; thus, the ascertainment effect is small. On the other hand, if the power is low, conditioning on a successful association scan will result in a big ascertainment effect.

This problem is well appreciated in the field. It has been observed that the odds ratio of a disease variant is usually overestimated in the study that first describes the variant. In a meta-analysis of 301 association studies of 25 putative disease loci, Lohmueller et al. [2] concluded that, even though the replication studies indicated that 11 of the loci were genuinely associated with disease, a striking 24 of the 25 loci were reported in the initial study to have an odds ratio higher than the estimates based on subsequent replication studies. One important implication of the winner’s curse is that it makes it hard to design appropriately powered replication studies. If the sample size of a replication study is chosen on the basis of the odds ratio observed in the initial study, then the replication will almost certainly be underpowered [1].

Göring et al. [6] were the first to highlight the challenges of the winner’s curse for genome scans. Studying the problem for variance-components linkage scans of QTLs, they showed that, for low-powered studies, the naive estimate of the genetic effect at a significant locus is almost uncorrelated with the true underlying genetic effect; this results in a significant overestimation. They concluded that the only method for generating a useful estimate is to collect a large, independent, population-based sample of individuals, without regard to phenotype, and then to phenotype and genotype each individual. Although such a sample would provide unbiased estimates of allele frequencies and penetrance parameters with properly calibrated confidence regions, this may be a prohibitively expensive solution to the problem.
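The dependence of the winner's curse on power can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes a simple allelic model with a hypothetical true odds ratio of 1.2, a two-proportion z-test as the association test, and arbitrary sample sizes and thresholds. It repeatedly simulates a case-control study and compares the average estimated odds ratio over all studies with the average over only those studies that reached significance.

```python
import math
import random


def simulate_winners_curse(true_or=1.2, control_freq=0.3, n_cases=500,
                           n_controls=500, z_crit=2.58, n_studies=1000, seed=1):
    """Return (mean OR over all studies, mean OR over significant studies, number significant).

    All parameter values are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    # Risk-allele frequency in cases implied by the allelic odds ratio.
    case_freq = true_or * control_freq / (1 - control_freq + true_or * control_freq)
    all_or, sig_or = [], []
    for _ in range(n_studies):
        # Each individual contributes two alleles.
        a = sum(rng.random() < case_freq for _ in range(2 * n_cases))
        u = sum(rng.random() < control_freq for _ in range(2 * n_controls))
        pa, pu = a / (2 * n_cases), u / (2 * n_controls)
        pooled = (a + u) / (2 * (n_cases + n_controls))
        se = math.sqrt(pooled * (1 - pooled)
                       * (1 / (2 * n_cases) + 1 / (2 * n_controls)))
        z = (pa - pu) / se
        estimate = (pa / (1 - pa)) / (pu / (1 - pu))
        all_or.append(estimate)
        if abs(z) > z_crit:  # keep only studies that "detected" the variant
            sig_or.append(estimate)
    mean_all = sum(all_or) / len(all_or)
    mean_sig = sum(sig_or) / len(sig_or) if sig_or else float("nan")
    return mean_all, mean_sig, len(sig_or)


mean_all, mean_sig, n_sig = simulate_winners_curse()
print(f"mean OR, all studies: {mean_all:.3f}")
print(f"mean OR, significant studies only ({n_sig}): {mean_sig:.3f}")
```

Under these assumed settings only a minority of simulated studies are significant, and the estimates from that subset are inflated relative to the true odds ratio; raising the sample size (and hence the power) shrinks the gap.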
Because an independent population-based sample may be too costly, it is of interest to develop methods that generate unbiased estimates of the parameters of interest from the same data set used to detect the variant. Several authors have since described methods that correct for the ascertainment bias when estimating genetic effects after a whole-genome scan for linkage. Allison et al. [7: Allison DB, Fernandez JR, Heo M, Zhu S, Etzel C, Beasley TM, Amos CI. Bias in estimates of quantitative-trait-locus effect in genome scans: demonstration of the phenomenon and a method-of-moments procedure for reducing bias. Am J Hum Genet. 2002;70:575-585] noted that, for a specified genetic model, the distribution of the effect, constrained by ascertainment bias, can be analyzed. Using a method-of-moments approach, they calculated an estimate of the genetic effect. Siegmund [8: Siegmund D. Upward bias in estimation of genetic effects. Am J Hum Genet. 2002;71:1183-1188] proposed lowering the confidence limit in the initial test, thus accepting a large number of false-positive results. As indicated above, this leads to a high power for each individual test and only a small ascertainment effect. Siegmund [8] then suggested calculating CIs and accounting for the high number of tests by increasing the stringency of the CIs. To correct for ascertainment, this method also requires specifying the genetic model. Sun and Bull [9: Sun L, Bull SB. Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol. 2005;28:352-367] suggested multiple methods based on randomly splitting the sample into a detection sample and an estimation sample.
By comparing the estimate generated from the detection sample and the estimate from the estimation sample, they were able to calculate a correction factor for the ascertainment effect. However, the resulting estimator is still somewhat biased, and the SE of the corrected estimate is actually higher than the SE of the naive estimator [9]. In a somewhat analogous setting, one study of family-based association tests proposed that the available information be split into two orthogonal components [10: Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, DeMeo DL, Murphy A, Su J, Datta S, Rosenow C, et al. Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005;37:683-691]. Then, one component of the information is used to validate promising signals from the other component. However, that study focused primarily on testing rather than estimation.

Although some of the methods for correcting for the winner’s curse in linkage studies could be extended to association studies, association studies differ from linkage studies in several respects. The sample collected for an association study is more similar to a random population sample, which allows a more precise calculation of the sampling probabilities. Furthermore, the power of an association study should be much higher than the power of a linkage study of the same trait [11: Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516-1517], so it is not clear how well the conclusions of Göring et al. [6] apply to whole-genome scans for complex diseases.

The goal of the present study was to develop a method for generating corrected estimates of genetic-effect size for a locus that was identified in a significant test for association. Instead of calculating an odds ratio or relative-risk parameter, we use information about the population prevalence of the disease to estimate directly the penetrance parameters of the variants of interest. This allows us to perform the estimation for any specific genetic model (e.g., additive or dominant) or for a completely general genetic model. We describe an algorithm for calculating the approximate maximum-likelihood estimates (MLEs) of the frequencies and the penetrance parameters of the genotypes and associated confidence regions. We find that, for a variety of genetic models, our estimator corrects the ascertainment effect and provides reasonably accurate estimates and well-calibrated CIs while slightly underestimating the genetic effect. We show that these corrected estimates provide a far better basis for designing replication studies than do the naive uncorrected estimates. Last, we show an application of the method to the association of the Pro12Ala polymorphism in PPARγ with type 2 diabetes [12: Deeb SS, Fajas L, Nemoto M, Pihlajamäki J, Mykkänen L, Kuusisto J, Laakso M, Fujimoto W, Auwerx J. A Pro12-Ala substitution in PPARγ2 associated with decreased receptor activity, lower body mass index and improved insulin sensitivity. Nat Genet. 1998;20:284-287].

We developed a fairly general model for calculating the likelihood of a set of penetrance parameters and genotype frequencies conditional on having observed a “significant” signal for association at a certain biallelic marker. It is assumed that significance is determined according to a prespecified test and type 1 error rate (α).
In a data set of $n_a$ affected individuals and $n_u$ unaffected individuals, we consider the three genotypes $g_1$, $g_2$, and $g_3$ indicating, respectively, the minor-allele homozygote, the heterozygote, and the major-allele homozygote. Let the data $D=(a_1,\ldots,a_3,u_1,\ldots,u_3)$ be the counts of these genotypes in affected and unaffected individuals that constitute the significant signal for association. Furthermore, let $\phi=(\phi_1,\ldots,\phi_3)$ be the population frequencies of the genotypes, let $\theta=(\theta_1,\ldots,\theta_3)$ be the penetrances, and let $F$ be the population prevalence of the disease phenotype. $F$, which is assumed to be known from independent data, will be used to constrain the sample space for the other parameters as follows:

$$F=\sum_{i=1}^{3}\theta_i\phi_i . \tag{1}$$

We split the ascertainment into two parts. Let $B$ indicate that, as required, the marker of interest shows significant association at level $\alpha$, and let $S$ be the experimental design of sampling $n_a$ affected individuals and $n_u$ unaffected individuals, regardless of the prevalence $F$. We use $\mathrm{Pr}_S(\cdot)$ as a shorthand for $\Pr(\cdot\mid S)$. We calculate the likelihood $L(\theta,\phi)$ and obtain an MLE for $\theta$ and $\phi$, using the equation

$$L(\theta,\phi)=\mathrm{Pr}_S(D\mid B,\theta,\phi)=\frac{\mathrm{Pr}_S(B\mid D,\theta,\phi)\,\mathrm{Pr}_S(D\mid\theta,\phi)}{\mathrm{Pr}_S(B\mid\theta,\phi)}=\frac{\mathrm{Pr}_S(D\mid\theta,\phi)}{\mathrm{Pr}_S(B\mid\theta,\phi)} . \tag{2}$$

This result is obtained using the fact that the data $D$ constitute, by definition, a significant result, so $D$ implies $B$; hence, $\mathrm{Pr}_S(B\mid D,\theta,\phi)=1$. The numerator on the right side of equation (2) is the likelihood of the observed genotype counts, and the denominator is the power of the test used in the initial genome scan. The numerator is maximized at the naive penetrance estimates. Meanwhile, the denominator (power) is made smaller as the penetrance values move closer together. This has the effect of tilting the maximum likelihood toward smaller differences among the penetrances. Notice that equation (2) is undefined when the power is 0 (e.g., if all the $\phi_i=0$); however, power $=0$ implies that observing a significant result is impossible.
Since we condition our estimation on observing a significant result, this case can be ignored.

Under the assumption that the samples of affected and unaffected individuals are only a small proportion of the affected and unaffected individuals in the population, $\mathrm{Pr}_S(D\mid\theta,\phi)$ is the product of two multinomial distributions,

$$\mathrm{Pr}_S(D\mid\theta,\phi)=\frac{n_a!}{\prod_{i=1}^{3}a_i!}\prod_{i=1}^{3}\Pr(g_i\mid A)^{a_i}\;\frac{n_u!}{\prod_{i=1}^{3}u_i!}\prod_{i=1}^{3}\Pr(g_i\mid U)^{u_i}, \tag{3}$$

where $A$ indicates the affected phenotype and $U$ the unaffected phenotype. $\Pr(g_i\mid A)$ is the probability that a randomly selected affected individual carries genotype $g_i$ and can be calculated as

$$\Pr(g_i\mid A)=\frac{\Pr(A\mid g_i)\Pr(g_i)}{\Pr(A)}=\frac{\theta_i\phi_i}{F}\quad\text{and}\quad\Pr(g_i\mid U)=\frac{(1-\theta_i)\phi_i}{1-F} .$$

A general expression for the denominator of equation (2) is

$$\mathrm{Pr}_S(B\mid\theta,\phi)=\sum_{D_i\ \text{significant}}\mathrm{Pr}_S(D_i\mid\theta,\phi) , \tag{4}$$

where the $D_i$ represent all significant realizations of the data vector $D$. For many designs of tests for association, it is possible to calculate the power of the initial test exactly (see appendix A). To apply this algorithm to tests that do not have a simple method of power calculation, equation (4) can be evaluated by sampling $D_i$ conditional on $\theta$ and $\phi$ and approximating $\Pr(B\mid\theta,\phi)$ by Monte Carlo integration.

These calculations assume that controls are selected to not show the phenotype of interest. If random (unphenotyped) controls are used, equation (3) is modified by replacing $\Pr(g_i\mid U)$ with $\Pr(g_i)$. Note that, although, in the initial scan for association, several tests can be performed at each of many markers, the multiple testing affects the estimates only indirectly through the choice of the level of significance, $\alpha$.

The equations can be extended to estimate gene-gene interaction parameters. To assess the interaction of $m$ loci in the genome, all possible combinations of genotypes have to be considered, so there are $k=3^m$ states.
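For the single-locus case, equations (2)-(4) can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not the authors' implementation: the significance test (`allelic_z_test`, an allelic two-proportion z-test with an arbitrary threshold), the function names, the number of Monte Carlo replicates, and the example counts and parameter values are all hypothetical. It evaluates the naive log-likelihood of equation (3), approximates the power in equation (4) by Monte Carlo sampling, and subtracts the log power as in equation (2).

```python
import math
import random


def conditional_genotype_probs(theta, phi):
    """Return F (eq. 1) and the genotype distributions Pr(g|A) and Pr(g|U)."""
    F = sum(t * p for t, p in zip(theta, phi))
    pr_g_A = [t * p / F for t, p in zip(theta, phi)]
    pr_g_U = [(1 - t) * p / (1 - F) for t, p in zip(theta, phi)]
    return F, pr_g_A, pr_g_U


def log_multinomial(counts, probs):
    """Log of a multinomial probability mass (one factor of eq. 3)."""
    out = math.lgamma(sum(counts) + 1) - sum(math.lgamma(c + 1) for c in counts)
    for c, p in zip(counts, probs):
        if c > 0:
            if p <= 0:
                return -math.inf
            out += c * math.log(p)
    return out


def allelic_z_test(cases, controls, z_crit=2.58):
    """Placeholder association test (an assumption, not the test from the text).

    Counts are ordered (minor homozygote, heterozygote, major homozygote).
    """
    na, nu = sum(cases), sum(controls)
    minor_a = 2 * cases[0] + cases[1]
    minor_u = 2 * controls[0] + controls[1]
    pooled = (minor_a + minor_u) / (2 * (na + nu))
    se = math.sqrt(pooled * (1 - pooled) * (1 / (2 * na) + 1 / (2 * nu)))
    return se > 0 and abs(minor_a / (2 * na) - minor_u / (2 * nu)) / se > z_crit


def sample_counts(rng, n, probs):
    """Draw one multinomial vector of genotype counts."""
    draws = rng.choices(range(3), weights=probs, k=n)
    return [draws.count(g) for g in range(3)]


def corrected_log_likelihood(cases, controls, theta, phi,
                             is_significant=allelic_z_test, n_sim=500, seed=0):
    """Return (naive log-likelihood, ascertainment-corrected log-likelihood)."""
    _, pr_g_A, pr_g_U = conditional_genotype_probs(theta, phi)
    naive = log_multinomial(cases, pr_g_A) + log_multinomial(controls, pr_g_U)
    rng = random.Random(seed)
    na, nu = sum(cases), sum(controls)
    hits = sum(
        is_significant(sample_counts(rng, na, pr_g_A), sample_counts(rng, nu, pr_g_U))
        for _ in range(n_sim)
    )
    power = max(hits, 1) / n_sim  # crude floor so that log(power) stays finite
    return naive, naive - math.log(power)


# Hypothetical parameter values and genotype counts, for illustration only.
theta = (0.15, 0.12, 0.10)
phi = (0.09, 0.42, 0.49)          # Hardy-Weinberg proportions for allele frequency 0.3
cases, controls = (70, 230, 200), (40, 205, 255)
F, pr_g_A, pr_g_U = conditional_genotype_probs(theta, phi)
naive, corrected = corrected_log_likelihood(cases, controls, theta, phi)
print(F, naive, corrected)
```

Because the power is at most 1, the corrected log-likelihood is never below the naive one; the correction matters for estimation because the power term changes with $\theta$ and $\phi$, and so shifts where the maximum lies. A Monte Carlo estimate of the power is noisy, so an exact power calculation is preferable where one is available.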
Enumerating the $k=3^m$ genotype combinations as states $1,\ldots,k$, the set of genotype counts can then be expressed as $D=\{u_1,\ldots,u_k,a_1,\ldots,a_k\}$, and the goal is to estimate the vector of population genotype frequencies $\phi=(\phi_1,\ldots,\phi_k)$ and the vector of penetrances $\theta=(\theta_1,\ldots,\theta_k)$ by extending equations (1), (2), and (4).

We designed a two-stage algorithm to estimate the population frequencies of the underlying genotypes and their penetrance parameters conditional on a known disease prevalence $F$, by maximizing $L(\theta,\phi)$. In the first step, we generated an approximate likelihood surface by sampling $m=30{,}000$ independent sets of parameters conditional on $F$ (i.e., that satisfy eq. [1]). The three genotypes were assumed to be in Hardy-Weinberg proportions in the overall population. We then calculated the likelihood $L(D\mid\phi_1,\ldots,\phi_3,\theta_1,\ldots,\theta_3)$ for each parameter set and selected the set with the highest likelihood as a first approximation of the point estimate of the parameters. In the second step, we improved this estimate by perturbing each parameter value by a small value $\varepsilon$ and accepting the new parameter values if the likelihood is higher than the likelihood of the old maximum. By repeating this procedure 3,000 times, reducing $\varepsilon$ with every repetition, we generated highly stable estimates. We assessed the fidelity of this algorithm by analyzing a set of data sets multiple times, and we observed that the parameter estimates differed by a magnitude of only $10^{-5}$. To generate estimates for a known genetic model, we repeated the analysis in parameter spaces that are constrained accordingly.

We generated 95% confidence regions by comparing the likelihood of all initial $m$ parameter points with the likelihood of the point estimate. We included all points for which twice the difference of log-likelihoods was […]

[…] C is large for genetic models with small effect sizes, but it is not very large for models with moderate or large effect sizes.
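The two-stage maximization described earlier in this section (coarse random sampling of parameter sets that satisfy equation (1), followed by perturbations of shrinking size) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the helper names, the uniform proposal distributions, the rejection step used to satisfy equation (1), the linear shrinking of ε, and the example counts are all hypothetical. For speed and determinism, the objective shown is the naive log-likelihood; the ascertainment-corrected likelihood of equation (2) would be passed in the same way.

```python
import math
import random


def params_from(p, t1, t2, F):
    """Map (minor-allele frequency, theta1, theta2) to (phi, theta).

    phi follows Hardy-Weinberg proportions; theta3 is fixed by eq. (1).
    Returns None if any value falls outside its valid range.
    """
    if not (0.0 < p <= 0.5 and 0.0 <= t1 <= 1.0 and 0.0 <= t2 <= 1.0):
        return None
    phi = (p * p, 2 * p * (1 - p), (1 - p) ** 2)
    t3 = (F - t1 * phi[0] - t2 * phi[1]) / phi[2]
    if not 0.0 <= t3 <= 1.0:
        return None
    return phi, (t1, t2, t3)


def two_stage_maximize(objective, F, m=30000, n_refine=3000, eps0=0.05, seed=0):
    """Return (phi, theta, best value, best value after stage 1)."""
    rng = random.Random(seed)
    best, best_val = None, -math.inf
    # Stage 1: coarse surface from m random parameter sets consistent with F.
    for _ in range(m):
        x = (rng.uniform(0.0, 0.5), rng.random(), rng.random())
        point = params_from(*x, F)
        if point is None:
            continue
        val = objective(*point)
        if val > best_val:
            best, best_val = x, val
    if best is None:
        raise ValueError("no admissible parameter set was sampled")
    stage1_val = best_val
    # Stage 2: perturb the current maximum, shrinking eps at every repetition.
    for k in range(n_refine):
        eps = eps0 * (1 - k / n_refine)
        cand = tuple(v + rng.uniform(-eps, eps) for v in best)
        point = params_from(*cand, F)
        if point is None:
            continue
        val = objective(*point)
        if val > best_val:
            best, best_val = cand, val
    phi, theta = params_from(*best, F)
    return phi, theta, best_val, stage1_val


def naive_loglik(phi, theta, cases, controls, F):
    """Log-likelihood kernel of eq. (3), without the constant multinomial terms."""
    ll = 0.0
    for ph, th, a, u in zip(phi, theta, cases, controls):
        pa, pu = th * ph / F, (1 - th) * ph / (1 - F)
        if (a > 0 and pa <= 0) or (u > 0 and pu <= 0):
            return -math.inf
        if a > 0:
            ll += a * math.log(pa)
        if u > 0:
            ll += u * math.log(pu)
    return ll


# Hypothetical genotype counts and prevalence, for illustration only.
F = 0.1
cases, controls = (60, 230, 210), (40, 210, 250)
phi_hat, theta_hat, best_val, stage1_val = two_stage_maximize(
    lambda phi, theta: naive_loglik(phi, theta, cases, controls, F),
    F, m=2000, n_refine=300)
print(phi_hat, theta_hat, best_val)
```

Parameterizing the search by the allele frequency and two penetrances, with the third penetrance solved from equation (1), keeps every candidate on the constraint surface. The defaults of 30,000 samples and 3,000 refinement steps mirror the values quoted in the text; the usage example uses smaller numbers only to run quickly.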
In the models we considered, whenever the test statistic was close to C, a model of low genetic effect had the highest likelihood. Furthermore, under most genetic models, some data sets generated a test statistic near C. In these data sets, the effect size was underestimated after correction for ascertainment. Thus, the corrected estimate has a bias toward underestimating the effect size. We can also observe differences in the variances of the estimates. In general, figure 1 reveals that estimates of genetic effect can vary widely, even if no ascertainment bias is introduced. In comparison with the distribution of estimates generated from the unascertained sample, the uncorrected estimates for low- and moderately powered studies are more tightly clustered around the biased average, whereas estimates generated with the corrected method are more widely dispersed. These results illustrate the problems of the winner’s curse in low-powered association studies and show that our algorithm generates nearly unbiased estimates of the penetrance parameters. To assess the bias of our method in a more systematic fashion, we calculated the relative difference between the estimated and true underlying genetic effect, Δ, for each data set generated in simulation study 2 (see the “Simulation study 2” section) that was simulated with an additive model. Without correction for ascertainment, the genetic effect is overestimated by 20%, on average, over all parameter sets, independent of sample size. After applying the