ResearchHub | Open Science Community

High-quality Arabidopsis thaliana Genome Assembly with Nanopore and HiFi Long Reads

Bo Wang et al.Jun 9, 2021

Abstract Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of reference genome still contains significant number of missing segments. Here, we report a high-quality and almost complete Col-0 genome assembly with two gaps (Col-XJTU) using combination of Oxford Nanopore Technology ultra-long reads, PacBio high-fidelity long reads, and Hi-C data. The total genome assembly size is 133,725,193 bp, introducing 14.6 Mb of novel sequences compared to the TAIR10.1 reference genome. All five chromosomes of Col-XJTU assembly are highly accurate with consensus quality (QV) scores > 60 (ranging from 62 to 68), which are higher than those of TAIR10.1 reference (QV scores ranging from 45 to 52). We have completely resolved chromosome (Chr) 3 and Chr5 in a telomere-to-telomere manner. Chr4 has been completely resolved except the nucleolar organizing regions, which comprise long repetitive DNA fragments. The Chr1 centromere (CEN1), reportedly around 9 Mb in length, is particularly challenging to assemble due to the presence of tens of thousands of CEN180 satellite repeats. Using the cutting-edge sequencing data and novel computational approaches, we assembled about 4 Mb of sequence for CEN1 and a 3.5-Mb-long CEN2. We investigated the structure and epigenetics of centromeres. We detected four clusters of CEN180 monomers, and found that the centromere-specific histone H3-like protein (CENH3) exhibits a strong preference for CEN180 cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. We believe that this high-quality genome assembly, Col-XJTU, would serve as a valuable reference to better understand the global pattern of centromeric polymorphisms, as well as genetic and epigenetic features in plants.

Genetics

Molecular Biology

1

Paper

Save

Haplotype-resolved assemblies and variant benchmark of a Chinese Quartet

Peng Jia et al.Sep 12, 2022

Abstract As the state-of-the-art sequencing technologies and computational methods enable investigation of challenging regions in the human genome, an update variant benchmark is demanded. Herein, we sequenced a Chinese Quartet, consisting of two monozygotic twin daughters and their biological parents, with multiple advanced sequencing platforms, including Illumina, BGI, PacBio, and Oxford Nanopore Technology. We phased the long reads of the monozygotic twin daughters into paternal and maternal haplotypes using the parent-child genetic map. For each haplotype, we utilized advanced long reads to generate haplotype-resolved assemblies (HRAs) with high accuracy, completeness, and continuity. Based on the ingenious quartet samples, novel computational methods, high-quality sequencing reads, and HRAs, we established a comprehensive variant benchmark, including 3,883,283 SNVs, 859,256 Indels, 9,678 large deletions, 15,324 large insertions, 40 inversions, and 31 complex structural variants shared between the monozygotic twin daughters. In particular, the preciously excluded regions, such as repeat regions and the human leukocyte antigen (HLA) region, were systematically examined. Finally, we illustrated how the sequencing depth correlated with the de novo assembly and variant detection, from which we learned that 30 × HiFi is a balance between performance and cost. In summary, this study provides high-quality haplotype-resolved assemblies and a variant benchmark for two Chinese monozygotic twin samples. The benchmark expanded the regions of the previous report and adapted to the evolving sequencing technologies and computational methods.

Genetics

Molecular Biology

9

Paper

Save

Long read sequencing reveals sequential complex rearrangements driven by Hepatitis B virus integration

Songbo Wang et al.Dec 9, 2021

Integration of Hepatitis B virus (HBV) into human genome disrupts genetic structures and cellular functions. Here, we conducted multiplatform long read sequencing on two cell lines and five clinical samples of HBV-infected hepatocellular carcinomas (HCC). We resolved three types of viral integration-induced complex genome rearrangements (CGR) and proposed a model of ‘multi-hits and sequential-breaks’ to depict their formation process by differentiating inserted HBV copies with HiFi long reads. We deduced that all three complex types were initialized from focal replacement and fragile virus-human junctions triggered subsequent rearrangements. We further revealed that such rearrangements caused a prevalent loss-of-heterozygosity at chr4q, accounting for 19.5% of HCC samples in ICGC cohort and contributing to immune and metabolic dysfunction. Overall, our long read based analysis reveals novel sequential rearrangement processes initiated by HBV integration, hinting its structural and functional impact on HCC.

Genetics

Epidemiology

1

Paper

Save

Mako: a graph-based pattern growth approach to detect complex structural variants

Jiadong Lin et al.Mar 2, 2021

Abstract Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. We systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections and pattern growth enables CSV detection without predefined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSV on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13bp and 26bp, respectively. Moreover, Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types of adjacent segments swap and tandem dispersed duplication. Further analysis of these CSVs also revealed impact of sequence homology in the formation of CSVs. Mako is publicly available at https://github.com/jiadong324/Mako .

Genetics

Artificial Intelligence

1

Paper

Save

De novo and somatic structural variant discovery with SVision-pro

Songbo Wang et al.Mar 22, 2024

Long-read-based de novo and somatic structural variant (SV) discovery remains challenging, necessitating genomic comparison between samples. We developed SVision-pro, a neural-network-based instance segmentation framework that represents genome-to-genome-level sequencing differences visually and discovers SV comparatively between genomes without any prerequisite for inference models. SVision-pro outperforms state-of-the-art approaches, in particular, the resolving of complex SVs is improved, with low Mendelian error rates, high sensitivity of low-frequency SVs and reduced false-positive rates compared with SV merging approaches.

Genetics

Artificial Intelligence

-1

Paper

Genetics

Artificial Intelligence

0

Save