ResearchHub | Open Science Community

Precision neoantigen discovery using large-scale immunopeptidomes and composite modeling of MHC peptide presentation

Rachel Pyke et al., May 1, 2021

Abstract: Major histocompatibility complex (MHC)-bound peptides that originate from tumor-specific genetic alterations, known as neoantigens, are an important class of anti-cancer therapeutic targets. Accurately predicting peptide presentation by MHC complexes is a key aspect of discovering therapeutically relevant neoantigens. Technological improvements in mass-spectrometry-based immunopeptidomics and advanced modeling techniques have vastly improved MHC presentation prediction over the past two decades. However, improvement in the sensitivity and specificity of prediction algorithms is needed for clinical applications such as the development of personalized cancer vaccines, the discovery of biomarkers for response to checkpoint blockade and the quantification of autoimmune risk in gene therapies. Toward this end, we generated allele-specific immunopeptidomics data using 25 mono-allelic cell lines and created Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), a pan-allelic MHC-peptide algorithm for predicting MHC-peptide binding and presentation. In contrast to previously published large-scale mono-allelic data, we used an HLA-null K562 parental cell line and a stable transfection of HLA alleles to better emulate native presentation. Our dataset includes five previously unprofiled alleles that expand MHC binding pocket diversity in the training data and extend allelic coverage in underprofiled populations. To improve generalizability, SHERPA systematically integrates 128 mono-allelic and 384 multi-allelic samples with publicly available immunoproteomics data and binding assay data. Using this dataset, we developed two features that empirically estimate the propensities of genes and specific regions within gene bodies to engender immunopeptides to represent antigen processing. Using a composite model constructed with gradient boosting decision trees, multiallelic deconvolution and 2.15 million peptides encompassing 167 alleles, we achieved a 1.44-fold improvement in positive predictive value compared to existing tools when evaluated on independent mono-allelic datasets and a 1.15-fold improvement when evaluated on tumor samples. With a high degree of accuracy, SHERPA has the potential to enable precision neoantigen discovery for future clinical applications.
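To make the headline metric concrete, positive predictive value (PPV) is the fraction of peptides predicted as presented that are truly presented, and a fold improvement is the ratio of two PPVs. The sketch below is illustrative only: the top-k evaluation scheme, the function name `ppv_at_k`, and all scores and labels are assumptions, since the abstract does not state how PPV was computed.

```python
def ppv_at_k(scores, labels, k):
    """Fraction of truly presented peptides among the k highest-scoring ones.

    Assumed evaluation scheme (top-k ranking); not taken from the paper.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / k


# Hypothetical toy data: 1 = peptide observed as presented, 0 = decoy.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
model_a = [0.90, 0.20, 0.80, 0.30, 0.10, 0.70, 0.40, 0.05]
model_b = [0.90, 0.85, 0.30, 0.80, 0.10, 0.70, 0.40, 0.05]

ppv_a = ppv_at_k(model_a, labels, k=3)
ppv_b = ppv_at_k(model_b, labels, k=3)
fold_improvement = ppv_a / ppv_b  # ratio of the two PPVs

print(f"PPV A = {ppv_a:.2f}, PPV B = {ppv_b:.2f}, fold = {fold_improvement:.2f}")
```

In this toy case model A ranks all three true peptides first, so its PPV is three times that of model B; the reported 1.44-fold and 1.15-fold figures are ratios of the same kind.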

Genetics

Immunology

Genomic loci susceptible to systematic sequencing bias in clinical whole genomes

Timothy Freeman et al., Jun 22, 2019

Accurate massively parallel sequencing (MPS) of genetic variants is key to many areas of science and medicine, such as cataloguing population genetic variation and diagnosing genetic diseases. Certain genomic positions can be prone to higher rates of systematic sequencing and alignment bias that limit accuracy, resulting in false positive variant calls. Current standard practices to differentiate between loci that can and cannot be sequenced with high confidence utilise consensus between different sequencing methods as a proxy for sequencing confidence. These practices have significant limitations and alternative methods are required to overcome these. We have developed a novel statistical method based on summarising sequenced reads from whole genome clinical samples and cataloguing them in “Incremental Databases” that maintain individual confidentiality. Allele statistics were catalogued for each genomic position that consistently showed systematic biases with the corresponding MPS sequencing pipeline. We found systematic biases present at ∼1-3% of the human autosomal genome across five patient cohorts. We identified which genomic regions were more or less prone to systematic biases, including large homopolymer flanks (odds ratio=23.29-33.69) and the NIST high confidence genomic regions (odds ratio=0.154-0.191). We confirmed our predictions on a gold-standard reference genome and showed that these systematic biases can lead to suspect variant calls within clinical panels. Our results recommend increased caution to address systematic biases in whole genome sequencing and alignment. This study provides the implementation of a simple statistical approach to enhance quality control of clinically sequenced samples by flagging variants at suspect loci for further analysis or exclusion.
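The odds ratios quoted above compare how often biased positions fall inside a region type versus outside it. As a minimal sketch of that arithmetic, the example below computes an odds ratio from a 2x2 table; the counts and the function name `odds_ratio` are hypothetical, and the study's own statistical procedure may differ.

```python
def odds_ratio(biased_inside, unbiased_inside, biased_outside, unbiased_outside):
    """Odds of a position being biased inside a region, relative to outside it."""
    odds_inside = biased_inside / unbiased_inside
    odds_outside = biased_outside / unbiased_outside
    return odds_inside / odds_outside


# Hypothetical counts of genomic positions (not from the study).
ratio = odds_ratio(
    biased_inside=30, unbiased_inside=70,     # e.g. positions in homopolymer flanks
    biased_outside=10, unbiased_outside=990,  # all other positions
)

# A ratio above 1 means the region is enriched for biased positions;
# a ratio below 1 (as reported for the NIST high confidence regions) means depletion.
print(f"odds ratio = {ratio:.2f}")
```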

Genetics

Cancer Research

A variant by any name: quantifying annotation discordance across tools and clinical databases

Jennifer Yen et al., May 18, 2016

Background: Clinical genomic testing is dependent on the robust identification and reporting of variant-level information in relation to disease. With the shift to high-throughput sequencing, a major challenge for clinical diagnostics is the cross-identification of variants called on their genomic position to resources that rely on transcript- or protein-based descriptions. Methods: We evaluated the accuracy of three tools (SnpEff, Variant Effect Predictor and Variation Reporter) that generate transcript and protein-based variant nomenclature from genomic coordinates according to guidelines by the Human Genome Variation Society (HGVS). Our evaluation was based on comparisons to a manually-curated list of 127 test variants of various types drawn from data sources, each with HGVS-compliant transcript and protein descriptors. We further evaluated the concordance between annotations generated by SnpEff and Variant Effect Predictor with those in major germline and cancer databases: ClinVar and COSMIC, respectively. Results: We find that there is substantial discordance between the annotation tools and databases in the description of insertions and/or deletions. Accuracy based on our ground truth set was between 80-90% for coding and 50-70% for protein variants, numbers that are not adequate for clinical reporting. Exact concordance for SNV syntax was over 99.5% between ClinVar and Variant Effect Predictor (VEP) and SnpEff, but less than 90% for non-SNV variants. For COSMIC, exact concordance for coding and protein SNVs was between 65 and 88%, and less than 15% for insertions. Across the tools and datasets, there was a wide range of equivalent expressions describing protein variants. Conclusion: Our results reveal significant inconsistency in variant representation across tools and databases. These results highlight the urgent need for the adoption of and adherence to uniform standards in variant annotation, with consistent reporting on the genomic reference, to enable accurate and efficient data-driven clinical care.
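Exact concordance, as used above, can be read as the share of variants for which two sources emit character-identical annotation strings. The sketch below illustrates that reading with hypothetical variant keys and a hypothetical `exact_concordance` helper; it is not the authors' pipeline. It also shows why equivalent expressions lower the score: `p.Val600Glu` and `p.V600E` describe the same substitution but fail a string comparison.

```python
def exact_concordance(tool_calls, database_calls):
    """Share of shared variants whose annotation strings match exactly."""
    shared = tool_calls.keys() & database_calls.keys()
    if not shared:
        raise ValueError("no shared variants to compare")
    matches = sum(tool_calls[v] == database_calls[v] for v in shared)
    return matches / len(shared)


# Hypothetical annotations keyed by placeholder variant identifiers.
tool = {
    "variant_1": "p.Val600Glu",
    "variant_2": "p.Gly12Asp",
    "variant_3": "p.Arg175His",
    "variant_4": "p.Glu746_Ala750del",
}
database = {
    "variant_1": "p.V600E",  # same change in one-letter amino acid code
    "variant_2": "p.Gly12Asp",
    "variant_3": "p.Arg175His",
    "variant_4": "p.Glu746_Ala750del",
}

concordance = exact_concordance(tool, database)
print(f"exact concordance = {concordance:.0%}")
```

Normalizing both sources to one notation before comparing would count `variant_1` as a match, which is the kind of uniform standard the conclusion calls for.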

Genetics

Molecular Biology
