ResearchHub | Open Science Community

Haplotype-resolved diverse human genomes and integrated analysis of structural variation

Peter Ebert et al., Feb 25, 2021

Resolving genomic structural variation

Many human genomes have been reported using short-read technology, but it is difficult to resolve structural variants (SVs) using these data. These genomes thus lack comprehensive comparisons among individuals and populations. Ebert et al. used long-read structural variation calling across 64 human genomes representing diverse populations and developed new methods for variant discovery. This approach allowed the authors to increase the number of confirmed SVs and to describe the patterns of variation across populations. From this dataset, they identified quantitative trait loci affected by these SVs and determined how they may affect gene expression and potentially explain genome-wide association study hits. This information provides insights into patterns of normal human genetic variation and generates reference genomes that better represent the diversity of our species. Science, this issue p. eabf7117

Genetics

Demography


High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

Marta Byrska-Bishop et al., Sep 1, 2022

The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.

Genetics

Molecular Biology


A pangenome reference of 36 Chinese populations

Yang Gao et al., Jun 14, 2023

Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference [1]. The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.

Genetics

Molecular Biology


HiCAT: A tool for automatic annotation of centromere structure

Shenghan Gao et al., Aug 7, 2022

Significant improvements in long-read sequencing technologies have unlocked complex genomic areas, such as centromeres, in the genome and introduced the centromere annotation problem. Currently, centromeres are annotated in a semi-manual way. Here, we propose HiCAT, a generalizable automatic centromere annotation tool, based on hierarchical tandem repeat mining and maximization of tandem repeat coverage to facilitate decoding of centromere architecture. We applied HiCAT to human CHM13-T2T and gapless Arabidopsis thaliana genomes. Our results not only were generally consistent with previous inferences but also greatly improved annotation continuity and revealed additional fine structures, demonstrating HiCAT’s performance and general applicability.

Genetics

Artificial Intelligence


De novo assembly of 64 haplotype-resolved human genomes of diverse ancestry and integrated analysis of structural variation

Peter Ebert et al., Dec 16, 2020

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent–child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average contig N50: 26 Mbp) integrate all forms of genetic variation across even complex loci such as the major histocompatibility complex. We focus on 107,590 structural variants (SVs), of which 68% are inaccessible by short-read sequencing. We identify new SV hotspots (spanning megabases of gene-rich sequence), characterize 130 of the most active mobile element source elements, and find that 63% of all SVs arise by homology-mediated mechanisms, a twofold increase from previous studies. Our resource now enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1,525 expression quantitative trait loci (SV-eQTLs) as well as SV candidates for adaptive selection within the human population.

Genetics

Molecular Biology


High-quality Arabidopsis thaliana Genome Assembly with Nanopore and HiFi Long Reads

Bo Wang et al., Jun 9, 2021

Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of the reference genome still contains a significant number of missing segments. Here, we report a high-quality and almost complete Col-0 genome assembly with two gaps (Col-XJTU) using a combination of Oxford Nanopore Technology ultra-long reads, PacBio high-fidelity long reads, and Hi-C data. The total genome assembly size is 133,725,193 bp, introducing 14.6 Mb of novel sequences compared to the TAIR10.1 reference genome. All five chromosomes of the Col-XJTU assembly are highly accurate with consensus quality (QV) scores > 60 (ranging from 62 to 68), which are higher than those of the TAIR10.1 reference (QV scores ranging from 45 to 52). We have completely resolved chromosome (Chr) 3 and Chr5 in a telomere-to-telomere manner. Chr4 has been completely resolved except for the nucleolar organizing regions, which comprise long repetitive DNA fragments. The Chr1 centromere (CEN1), reportedly around 9 Mb in length, is particularly challenging to assemble due to the presence of tens of thousands of CEN180 satellite repeats. Using cutting-edge sequencing data and novel computational approaches, we assembled about 4 Mb of sequence for CEN1 and a 3.5-Mb-long CEN2. We investigated the structure and epigenetics of centromeres. We detected four clusters of CEN180 monomers, and found that the centromere-specific histone H3-like protein (CENH3) exhibits a strong preference for CEN180 cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. We believe that this high-quality genome assembly, Col-XJTU, would serve as a valuable reference to better understand the global pattern of centromeric polymorphisms, as well as genetic and epigenetic features in plants.

Genetics

Molecular Biology


Haplotype-resolved assemblies and variant benchmark of a Chinese Quartet

Peng Jia et al., Sep 12, 2022

As state-of-the-art sequencing technologies and computational methods enable investigation of challenging regions in the human genome, an updated variant benchmark is needed. Herein, we sequenced a Chinese Quartet, consisting of two monozygotic twin daughters and their biological parents, with multiple advanced sequencing platforms, including Illumina, BGI, PacBio, and Oxford Nanopore Technology. We phased the long reads of the monozygotic twin daughters into paternal and maternal haplotypes using the parent-child genetic map. For each haplotype, we utilized advanced long reads to generate haplotype-resolved assemblies (HRAs) with high accuracy, completeness, and continuity. Based on the ingenious quartet samples, novel computational methods, high-quality sequencing reads, and HRAs, we established a comprehensive variant benchmark, including 3,883,283 SNVs, 859,256 Indels, 9,678 large deletions, 15,324 large insertions, 40 inversions, and 31 complex structural variants shared between the monozygotic twin daughters. In particular, the previously excluded regions, such as repeat regions and the human leukocyte antigen (HLA) region, were systematically examined. Finally, we illustrated how the sequencing depth correlated with the de novo assembly and variant detection, from which we learned that 30× HiFi is a balance between performance and cost. In summary, this study provides high-quality haplotype-resolved assemblies and a variant benchmark for two Chinese monozygotic twin samples. The benchmark expanded the regions of the previous report and adapted to the evolving sequencing technologies and computational methods.

Genetics

Molecular Biology


MSIsensor-pro: fast, accurate and matched-normal-sample-free detection of microsatellite instability

Peng Jia et al., Jan 9, 2020

We developed MSIsensor-pro (https://github.com/xjtu-omics/msisensor-pro), an open-source single-sample microsatellite instability (MSI) scoring method for research and clinical applications. MSIsensor-pro introduces a multinomial distribution model to quantify polymerase slippages for each tumor sample and a discriminative sites selection method to enable MSI detection without matched normal samples. For samples of various sequencing depths and tumor purities, MSIsensor-pro significantly outperformed the current leading methods in terms of both accuracy and computational cost.

Genetics

Artificial Intelligence


Long read sequencing reveals sequential complex rearrangements driven by Hepatitis B virus integration

Songbo Wang et al., Dec 9, 2021

Integration of Hepatitis B virus (HBV) into the human genome disrupts genetic structures and cellular functions. Here, we conducted multiplatform long-read sequencing on two cell lines and five clinical samples of HBV-infected hepatocellular carcinomas (HCC). We resolved three types of viral integration-induced complex genome rearrangements (CGR) and proposed a model of ‘multi-hits and sequential-breaks’ to depict their formation process by differentiating inserted HBV copies with HiFi long reads. We deduced that all three complex types were initialized from focal replacement and that fragile virus-human junctions triggered subsequent rearrangements. We further revealed that such rearrangements caused a prevalent loss-of-heterozygosity at chr4q, accounting for 19.5% of HCC samples in the ICGC cohort and contributing to immune and metabolic dysfunction. Overall, our long-read-based analysis reveals novel sequential rearrangement processes initiated by HBV integration, hinting at its structural and functional impact on HCC.

Genetics

Epidemiology


Mako: a graph-based pattern growth approach to detect complex structural variants

Jiadong Lin et al., Mar 2, 2021

Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. We systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections and pattern growth enables CSV detection without predefined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSVs on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types: adjacent segments swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology in the formation of CSVs. Mako is publicly available at https://github.com/jiadong324/Mako.

Genetics

Artificial Intelligence
