ResearchHub | Open Science Community

Towards complete and error-free genome assemblies of all vertebrate species

Arang Rhie et al. Apr 28, 2021

Abstract High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species 1–4. To address this issue, the international Genome 10K (G10K) consortium 5,6 has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

Genetics

Molecular Biology

13

Paper

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

Marcela Uliano‐Silva et al. Jul 18, 2023

Abstract Background: PacBio high fidelity (HiFi) sequencing reads are both long (15–20 kb) and highly accurate (>Q20). Because of these properties, they have revolutionised genome assembly, leading to more accurate and contiguous genomes. In eukaryotes the mitochondrial genome is sequenced alongside the nuclear genome, often at very high coverage. A dedicated tool for mitochondrial genome assembly using HiFi reads is still missing. Results: MitoHiFi was developed within the Darwin Tree of Life Project to assemble mitochondrial genomes from the HiFi reads generated for target species. The input for MitoHiFi is either the raw reads or the assembled contigs, and the tool outputs a mitochondrial genome sequence fasta file along with annotation of protein and RNA genes. Variants arising from heteroplasmy are assembled independently, and nuclear insertions of mitochondrial sequences are identified and not used in organellar genome assembly. MitoHiFi has been used to assemble 374 mitochondrial genomes (368 Metazoa and 6 Fungi species) for the Darwin Tree of Life Project, the Vertebrate Genomes Project and the Aquatic Symbiosis Genome Project. Inspection of 60 mitochondrial genomes assembled with MitoHiFi for species that already have reference sequences in public databases showed the widespread presence of previously unreported repeats. Conclusions: MitoHiFi is able to assemble mitochondrial genomes from a wide phylogenetic range of taxa from PacBio HiFi data. MitoHiFi is written in Python and is freely available on GitHub (https://github.com/marcelauliano/MitoHiFi). MitoHiFi is available with its dependencies as a Docker container on GitHub (ghcr.io/marcelauliano/mitohifi:master).

Genetics

Ecology

1

Paper

Complete vertebrate mitogenomes reveal widespread gene duplications and repeats

Giulio Formenti et al. Jul 1, 2020

Abstract Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly. As part of the Vertebrate Genomes Project (VGP) we have developed mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes that incorporates both long (>10 kbp, PacBio or Nanopore) and short (100-300 bp, Illumina) reads. Our pipeline led to successful complete mitogenome assemblies of 100 vertebrate species of the VGP. We have observed that tissue type and library size selection have considerable impact on mitogenome sequencing and assembly. Comparing our assemblies to purportedly complete reference mitogenomes based on short-read sequencing, we have identified errors, missing sequences, and incomplete genes in those references, particularly in repeat regions. Our assemblies have also identified novel gene region duplications, shedding new light on mitochondrial genome evolution and organization.

Genetics

Molecular Biology

63

Paper

Distinct patterns of genetic variation at low-recombining genomic regions represent haplotype structure

Jun Ishigohoka et al. Dec 23, 2021

Abstract Genetic variation of the entire genome represents population structure, yet individual loci can show distinct patterns. Such deviations identified through genome scans have often been attributed to effects of selective factors instead of randomness, assuming that the genomic intervals are long enough to average out randomness in underlying genealogies. However, an alternative explanation to distinct patterns has not been fully addressed: too few genealogies to average out the effect of randomness. Specifically, distinct patterns of genetic variation may be due to reduced local recombination rate, since the number of genealogies in a genomic interval corresponds to the number of ancestral recombination events. Here, we associate distinct patterns of local genetic variation with reduced recombination rate in a songbird, the Eurasian blackcap, using genome sequences and recombination maps. We find that distinct patterns of local genetic variation represent haplotype structure at low-recombining regions present either in all populations or only in a few populations. At the former species-wide low-recombining regions, genetic variation depicts conspicuous haplotypes segregating in multiple populations. On the contrary, at the latter population-specific low-recombining regions, genetic variation primarily represents cryptic haplotype structure among individuals of the low-recombining populations. With simulations, we confirm that reduction in recombination rate alone can cause distinct patterns of genetic variation mirroring our empirical data. Our results highlight that distinct patterns of genetic variation can emerge through evolution of reduced local recombination rate. Recombination landscape as an evolvable trait therefore plays an important role determining the heterogeneous distribution of genetic variation along the genome.

Genetics

Demography

0

Paper

Scalable, accessible, and reproducible reference genome assembly and evaluation in Galaxy

Delphine Larivière et al. Jun 30, 2023

Improvements in genome sequencing and assembly are enabling high-quality reference genomes for all species. However, the assembly process is still laborious, computationally and technically demanding, lacks standards for reproducibility, and is not readily scalable. Here we present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ~500 million years. The pipeline is versatile and combines PacBio HiFi long reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities. We make the pipeline freely accessible through Galaxy, accommodating researchers even without local computational resources and enhancing reproducibility by democratizing the training and assembly process. We demonstrate the flexibility and reliability of the pipeline by assembling reference genomes for 51 vertebrate species from major taxonomic groups (fish, amphibians, reptiles, birds, and mammals).

Genetics

Molecular Biology

6

Paper

Genome assembly and annotation of the tambaqui (Colossoma macropomum): an emblematic fish of the Amazon River basin

Alexandre Hilsdorf et al. Sep 9, 2021

ABSTRACT Colossoma macropomum known as “tambaqui” is the largest Characiformes fish in the Amazon River Basin and a leading species in Brazilian aquaculture and fisheries. Good quality meat and great adaptability to culture systems are some of its remarkable farming features. To support studies into the genetics and genomics of the tambaqui, we have produced the first high-quality genome for the species. We combined Illumina and PacBio sequencing technologies to generate a reference genome, assembled with 39X coverage of long reads and polished to a QV=36 with 130X coverage of short reads. The genome was assembled into 1,269 scaffolds to a total of 1,221,847,006 bases, with a scaffold N50 size of 40 Mb where 93% of all assembled bases were placed in the largest 54 scaffolds that corresponds to the diploid karyotype of the tambaqui. Furthermore, the NCBI Annotation Pipeline annotated genes, pseudogenes, and non-coding transcripts using the RefSeq database as evidence, guaranteeing a high-quality annotation. A Genome Data Viewer for the tambaqui was produced which benefits any groups interested in exploring unique genomic features of the species. The availability of a highly accurate genome assembly for tambaqui provides the foundation for novel insights about ecological and evolutionary facets and is a helpful resource for aquaculture purposes.
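The QV=36 figure quoted above is a Phred-scaled consensus quality, which can be translated into an approximate per-base error rate. The following is a minimal sketch of that standard conversion (QV = -10 · log10 of the error probability); the function names are illustrative and are not taken from the paper or any specific tool.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to an error probability per base."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert an error probability per base to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


# QV 36 is the value reported in the abstract above.
rate = qv_to_error_rate(36)
print(f"error rate per base: {rate:.2e}")
print(f"roughly one error every {round(1 / rate)} bases")
```

Under this conversion, QV 36 corresponds to roughly one consensus error every 4,000 bases.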

Genetics

Immunology

1

Paper

Caecilian genomes reveal the molecular basis of adaptation, and convergent evolution of limblessness in snakes and caecilians

Vladimir Ovchinnikov et al. Feb 25, 2022

Abstract We present genome sequences for the caecilians Geotrypetes seraphini (3.8Gb) and Microcaecilia unicolor (4.7Gb), representatives of a limbless, mostly soil-dwelling amphibian clade with reduced eyes, and unique putatively chemosensory tentacles. More than 69% of both genomes are composed of repeats, with retrotransposons the most abundant. We identify 1,150 orthogroups which are unique to caecilians and enriched for functions in olfaction and detection of chemical signals. There are 379 orthogroups with signatures of positive selection on caecilian lineages with roles in organ development and morphogenesis, sensory perception and immunity amongst others. We discover that caecilian genomes are missing the ZRS enhancer of Sonic Hedgehog which is also mutated in snakes. In vivo deletions have shown ZRS is required for limb development in mice, thus revealing a shared molecular target implicated in the independent evolution of limblessness in snakes and caecilians.

Genetics

Immunology

96

Paper

A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats

Tristan Jong et al. Jan 1, 2023

The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments and reduces base-level errors by approximately 9-fold and increases contiguity by 290-fold compared to its predecessor. Gene annotations are now more complete, significantly improving the mapping precision of genomic, transcriptomic, and proteomics data sets. We jointly analyzed 163 short-read whole genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ~20.0 million sequence variations, of which 18.7 thousand are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.

Genetics

Ecology

6

Paper

The genomes of invasive coral Tubastraea spp. (Dendrophylliidae) as tool for the development of biotechnological solutions

Soares Souza et al. Apr 25, 2020

Abstract Corals have been attracting huge attention due to the impact of climate change and ocean acidification on reef formation and resilience. Nevertheless, some species like Tubastraea coccinea and T. tagusensis have been spreading very fast, replacing the native ones, which affects the local environment and decreases the biodiversity of corals and other organisms associated with them. Despite some focal efforts to understand the biology of these organisms, they remain understudied at the molecular level. This knowledge gap hinders the development of cost-effective strategies for both conservation and management of invasive species. In this circumstance, it is expected that genome sequencing would provide powerful insights that could lead to better strategies for prevention, management, and control of this and other invasive species. Here, we present three genomes of Tubastraea spp. in one of the most comprehensive biological studies of corals, which includes flow cytometry, karyotyping, transcriptomics, genomics, and phylogeny. The genome of T. tagusensis is organized in 23 chromosome pairs and has 1.1 Gb, the T. coccinea genome is organized in 22 chromosome pairs and has 806 Mb, and the Tubastraea sp. genome is organized in 21 chromosome pairs and has 795 Mb. The hybrid assembly of T. tagusensis using short and long reads has an N50 of 227,978 bp, 7,996 contigs and high completeness estimated as 91.6% of BUSCO complete genes; that of T. coccinea has an N50 of 66,396 bp, 17,214 contigs and 88.1% completeness; and that of Tubastraea sp. has an N50 of 82,672 bp, 12,922 contigs and also 88.1% completeness. We inferred that almost half of the genome consists of repetitive elements, mostly interspersed repeats. We provide evidence for exclusive Scleractinia and Tubastraea gene content related to adhesion and immunity. The Tubastraea spp. genomes are a fundamental study which promises to provide insights not only about the genetic basis for the extreme invasiveness of this particular coral genus, but also to understand the adaptation flaws of some reef corals in the face of anthropic-induced environmental disturbances. We expect the data generated in this study will foster the development of efficient technologies for the management of coral species, whether invasive or threatened.
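The N50 values quoted above summarize assembly contiguity: N50 is the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly size. The sketch below shows that common definition on a toy list of lengths; the numbers are hypothetical and are not the contig lengths from this study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical toy assembly: total 290, so half is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 80 + 70 = 150 >= 145, so N50 is 70
```

Because N50 depends on the total assembled length, it is most informative when compared between assemblies of similar size.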

Genetics

Ecology

1

Paper

A high quality chromosome-level genome assembly for the golden mussel (Limnoperna fortunei)

João Ferreira et al. Sep 30, 2022

Abstract The golden mussel (Limnoperna fortunei) is a highly adaptive species that causes environmental and socioeconomic losses in invaded areas. Reference genomes have proven to be a valuable resource for studying the biology of invasive species. While the current golden mussel genome has been useful for identifying new genes, its high fragmentation hinders some applications. In this Data Note, we provide the first chromosome-level reference genome for the golden mussel. The genome was built using Hi-C, PacBio HiFi and 10X sequencing data. The final assembly contains 99.4% of its total length assembled to the 15 chromosomes of the species and a scaffold N50 of 97.05 Mb. Approximately 47% of the genome was annotated as repetitive sequences. A total of 34 862 protein-coding genes were predicted, of which 84.7% were functionally annotated. This new high quality genome is expected to support both basic and applied research on this invasive species. Species taxonomy: Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Protostomia; Spiralia; Lophotrochozoa; Mollusca; Bivalvia; Autobranchia; Pteriomorphia; Mytilida; Mytiloidea; Mytilidae; Arcuatulinae; Limnoperna; Limnoperna fortunei (Dunker, 1857) (NCBI Taxonomy ID: 356393)

Genetics

Ecology

12

Paper
