ResearchHub | Open Science Community

A robust benchmark for germline structural variant detection

Justin Zook et al., Jun 9, 2019

Abstract: New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
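The support rule stated in the abstract (calls of at least 50 bp, discovered by at least 2 technologies or 5 callsets) can be illustrated with a minimal sketch. The `CandidateSV` record, its field names, and the toy candidates are hypothetical and are not taken from the GIAB pipeline; the sketch covers only the discovery-support rule, not the long-read genotyping or isolation criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSV:
    """Hypothetical record for one merged candidate SV (illustration only)."""
    sv_id: str
    size_bp: int
    technologies: frozenset  # technologies whose callsets discovered the SV
    n_callsets: int          # number of callsets that discovered the SV


def passes_support_rule(sv, min_size_bp=50, min_technologies=2, min_callsets=5):
    """Apply the abstract's rule: >=50 bp and (>=2 technologies or >=5 callsets)."""
    if sv.size_bp < min_size_bp:
        return False
    return len(sv.technologies) >= min_technologies or sv.n_callsets >= min_callsets


# Toy candidates (invented values, not real benchmark data).
candidates = [
    CandidateSV("sv1", 120, frozenset({"short-read", "long-read"}), 3),
    CandidateSV("sv2", 300, frozenset({"long-read"}), 6),
    CandidateSV("sv3", 80, frozenset({"long-read"}), 2),
    CandidateSV("sv4", 30, frozenset({"short-read", "linked-read", "long-read"}), 9),
]

kept = [sv.sv_id for sv in candidates if passes_support_rule(sv)]
print(kept)
```

Here `sv3` is dropped for insufficient support and `sv4` for being under the size threshold, regardless of how many callsets report it.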

Genetics

Artificial Intelligence


Linking T cell receptor sequence to transcriptional profiles with clonotype neighbor graph analysis (CoNGA)

Stefan Schattgen et al., Jun 5, 2020

Abstract: Multi-modal single-cell technologies capable of simultaneously assaying gene expression and surface phenotype across large numbers of immune cells have described extensive heterogeneity within these complex populations, in healthy and diseased states. In the case of T cells, these technologies have made it possible to profile clonotype, defined by T cell receptor (TCR) sequence, and phenotype, as reflected in gene expression (GEX) profile, surface protein expression, and peptide:MHC (pMHC) binding, across large and diverse cell populations. These rich, high-dimensional datasets have the potential to reveal new relationships between TCR sequence and T cell phenotype that go beyond identification of features shared by clonally related cells. In order to uncover these connections in an unbiased way, we developed a graph-theoretic approach, clonotype neighbor-graph analysis or “CoNGA”, that identifies correlations between GEX profile and TCR sequence through statistical analysis of a pair of T cell similarity graphs, one in which cells are linked based on gene expression similarity and another in which cells are linked by similarity of TCR sequence. Applying CoNGA across diverse human and mouse T cell datasets uncovered known and novel associations between TCR sequence features and cellular phenotype including the classical invariant T cell subsets; a newly defined population of human blood CD8+ T cells expressing the transcription factors HOBIT and HELIOS, NK-associated receptors, and a biased TCR repertoire, representing a potential previously undescribed lineage of “natural lymphocytes”; a striking association between usage of a specific V-beta gene segment and expression of the EPHB6 gene that is conserved between mouse and human; and TCR sequence determinants of differentiation in developing thymocytes.
As the size and scale of single-cell datasets continue to grow, we expect that CoNGA will prove to be a useful tool for deconvolving complex relationships between TCR sequence and cellular state in single-cell applications.
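The idea of comparing a gene-expression similarity graph with a TCR similarity graph can be sketched as follows. This is a simplified illustration and not the CoNGA implementation: the abstract does not specify the statistic, so a hypergeometric tail probability on neighbor-set overlap is assumed here, and the one-dimensional "coordinates" stand in for real GEX profiles and TCR sequence distances.

```python
from math import comb


def knn_sets(coords, k):
    """For each cell, return the set of its k nearest other cells (1-D toy distance)."""
    n = len(coords)
    neighbors = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (abs(coords[i] - coords[j]), j),  # ties broken by index
        )
        neighbors.append(set(others[:k]))
    return neighbors


def hypergeom_tail(overlap, n_other, k_a, k_b):
    """P(X >= overlap) when k_b cells are drawn from n_other, k_a of which are marked."""
    total = comb(n_other, k_b)
    upper = min(k_a, k_b)
    hits = sum(
        comb(k_a, x) * comb(n_other - k_a, k_b - x) for x in range(overlap, upper + 1)
    )
    return hits / total


# Toy data (invented): six cells whose GEX and TCR neighborhoods coincide.
gex_coords = [0, 1, 2, 10, 11, 12]
tcr_coords = [0, 1, 2, 20, 21, 22]
k = 2

gex_nbrs = knn_sets(gex_coords, k)
tcr_nbrs = knn_sets(tcr_coords, k)

scores = []
for g, t in zip(gex_nbrs, tcr_nbrs):
    overlap = len(g & t)
    p_value = hypergeom_tail(overlap, len(gex_coords) - 1, k, k)
    scores.append((overlap, p_value))

print(scores)
```

A small tail probability for a cell means its two neighborhoods share more cells than expected by chance, which is the kind of GEX/TCR correlation the abstract describes.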

Genetics

Immunology


Benchmarking challenging small variants with linked and long reads

Justin Wagner et al., Jul 25, 2020

Summary: Genome in a Bottle (GIAB) benchmarks have been widely used to help validate clinical sequencing pipelines and develop new variant calling and sequencing methods. Here, we use accurate linked reads and long reads to expand the prior benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are not readily accessible to short reads. Our new benchmark adds more than 300,000 SNVs, 50,000 indels, and 16% new exonic variants, many in challenging, clinically relevant genes not previously covered (e.g., PMS2). For HG002, we include 92% of the autosomal GRCh38 assembly, while excluding problematic regions for benchmarking small variants (e.g., copy number variants and reference errors) that should not have been in the previous version, which included 85% of GRCh38. By including difficult-to-map regions, this benchmark identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We have demonstrated the utility of this benchmark to reliably identify false positives and false negatives across technologies in more challenging regions, which enables continued technology and bioinformatics development.
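How a benchmark set and its benchmark regions turn a query callset into false positive and false negative counts can be sketched as below. This is a minimal illustration under strong assumptions: variants are matched by exact `(chrom, pos, ref, alt)` identity, whereas real comparison tools also reconcile different representations of the same variant. The function names and toy variants are hypothetical.

```python
def in_regions(chrom, pos, regions):
    """True if the position falls inside any half-open (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def compare_to_benchmark(truth, calls, regions):
    """Count TP/FP/FN inside the benchmark regions; calls outside are not assessed."""
    truth_in = {v for v in truth if in_regions(v[0], v[1], regions)}
    calls_in = {v for v in calls if in_regions(v[0], v[1], regions)}
    tp = len(truth_in & calls_in)
    fn = len(truth_in - calls_in)  # benchmark variants the callset missed
    fp = len(calls_in - truth_in)  # extra calls inside benchmark regions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy data (invented): one benchmark region and a handful of variants.
regions = [("chr1", 0, 1000)]
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "GA")}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr1", 400, "T", "C"),   # inside the region, not in truth: false positive
    ("chr1", 5000, "A", "C"),  # outside the region: ignored
}

result = compare_to_benchmark(truth, calls, regions)
print(result)
```

Widening `regions` to cover difficult-to-map sequence changes which misses and extra calls are counted, which is why an expanded benchmark can expose more false negatives in the same callset.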

Genetics

Artificial Intelligence



The regulation of methylation on the Z chromosome and the identification of multiple novel Male Hyper-Methylated regions in the chicken

Anders Höglund et al., Mar 27, 2023

Abstract: DNA methylation is a key regulator of eukaryote genomes, and is of particular relevance in the regulation of gene expression on the sex chromosomes, with a key role in dosage compensation in mammalian XY systems. In the case of birds, dosage compensation is largely absent, with it being restricted to two small Male Hyper-Methylated (MHM) regions on the Z chromosome. To investigate how variation in DNA methylation is regulated on the Z chromosome we utilised a wild x domestic advanced intercross in the chicken, with both hypothalamic methylomes and transcriptomes assayed in 124 individuals. The relatively large number of individuals allowed us to identify additional genomic MHM regions on the Z chromosome that were significantly differentially methylated between the sexes. These regions appear to down-regulate local gene expression in males, but do not remove it entirely (unlike the lncRNAs identified in the initial MHM regions). In addition, quantitative trait loci (QTL) that regulate variation in methylation on the Z chromosome, and those loci that regulate methylation on the autosomes that derive from the Z chromosome were mapped. Trans-effect hotspots were also identified that were based on the autosomes but affected the Z, and also one that was based on the Z chromosome but that affected both autosomal and sex chromosome DNA methylation regulation. Our results highlight how additional MHM regions are actually present on the Z chromosome, and they appear to have smaller-scale effects on gene expression in males. Quantitative variation in methylation is also regulated both from the autosomes to the Z chromosome, and from the Z chromosome to the autosomes.

Genetics

Molecular Biology


RAD sequencing and a hybrid Antarctic fur seal genome assembly reveal rapidly decaying linkage disequilibrium, global population structure and evidence for inbreeding

Emily Humble et al., Feb 22, 2018

Recent advances in high throughput sequencing have transformed the study of wild organisms by facilitating the generation of high quality genome assemblies and dense genetic marker datasets. These resources have the potential to significantly advance our understanding of diverse phenomena at the level of species, populations and individuals, ranging from patterns of synteny through rates of linkage disequilibrium (LD) decay and population structure to individual inbreeding. Consequently, we used PacBio sequencing to refine an existing Antarctic fur seal (Arctocephalus gazella) genome assembly and genotyped 83 individuals from six populations using restriction site associated DNA (RAD) sequencing. The resulting hybrid genome comprised 6,169 scaffolds with an N50 of 6.21 Mb and provided clear evidence for the conservation of large chromosomal segments between the fur seal and dog (Canis lupus familiaris). Focusing on the most extensively sampled population of South Georgia, we found that LD decayed rapidly, reaching the background level of r² = 0.09 by around 26 kb, consistent with other vertebrates but at odds with the notion that fur seals experienced a strong historical bottleneck. We also found evidence for population structuring, with four main Antarctic island groups being resolved. Finally, appreciable variance in individual inbreeding could be detected, reflecting the strong polygyny and site fidelity of the species. Overall, our study contributes important resources for future genomic studies of fur seals and other pinnipeds while also providing a clear example of how high throughput sequencing can generate diverse biological insights at multiple levels of organisation.
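The scaffold N50 quoted in the abstract is a standard assembly contiguity statistic: the length of the scaffold at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch of that calculation follows; the scaffold lengths are invented toy values, not the fur seal assembly.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The first scaffold that pushes the running sum to half the total is the N50.
        if 2 * running >= total:
            return length
    return 0


# Toy scaffold lengths (arbitrary units): total 22, half 11.
toy_lengths = [2, 8, 3, 5, 4]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

In the toy case the longest scaffold (8) covers less than half of the total, and adding the next one (5) passes the halfway mark, so the N50 is 5.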

Genetics

Ecology

