ResearchHub | Open Science Community

The zebrafish reference genome sequence and its relationship to the human genome

Kerstin Howe et al., Apr 16, 2013

A high-quality sequence assembly of the zebrafish genome reveals the largest gene set of any vertebrate and provides information on key genomic features, and comparison to the human reference genome shows that approximately 70% of human protein-coding genes have at least one clear zebrafish orthologue.
The genome of the zebrafish, a key model organism for the study of development and human disease, has now been sequenced and published as a well-annotated reference genome. Zebrafish turns out to have the largest gene set of any vertebrate so far sequenced, and few pseudogenes. Importantly for disease studies, comparison between human and zebrafish sequences reveals that 70% of human genes have at least one obvious zebrafish orthologue. A second paper reports on an ongoing effort to identify and phenotype disruptive mutations in every zebrafish protein-coding gene. Using the reference genome sequence along with high-throughput sequencing and efficient chemical mutagenesis, the authors report initial results covering 38% of all known protein-coding genes and describe the phenotypic consequences of more than 1,000 alleles. The long-term goal is the creation of a knockout allele in every protein-coding gene in the zebrafish genome. All mutant alleles and data are freely available at go.nature.com/en6mos.
Zebrafish have become a popular organism for the study of vertebrate gene function [1,2]. The virtually transparent embryos of this species, and the ability to accelerate genetic studies by gene knockdown or overexpression, have led to the widespread use of zebrafish in the detailed investigation of vertebrate gene function and, increasingly, the study of human genetic disease [3,4,5]. However, for effective modelling of human genetic disease it is important to understand the extent to which zebrafish genes and gene structures are related to orthologous human genes. To examine this, we generated a high-quality sequence assembly of the zebrafish genome, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map. Detailed automatic and manual annotation provides evidence of more than 26,000 protein-coding genes [6], the largest gene set of any vertebrate so far sequenced. Comparison to the human reference genome shows that approximately 70% of human genes have at least one obvious zebrafish orthologue. In addition, the high quality of this genome assembly provides a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.

Genetics

Molecular Biology

Towards complete and error-free genome assemblies of all vertebrate species

Arang Rhie et al., Apr 28, 2021

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

Genetics

Molecular Biology

Significantly improving the quality of genome assemblies through curation

Kerstin Howe et al., Jan 1, 2021

Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved, goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Whilst working towards improved datasets and fully automated pipelines, assembly evaluation and curation is actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.

Genetics

Ecology

The genomes of four tapeworm species reveal adaptations to parasitism

Isheng Tsai et al., Mar 12, 2013

Tapeworms (Cestoda) cause neglected diseases that can be fatal and are difficult to treat, owing to inefficient drugs. Here we present an analysis of tapeworm genome sequences using the human-infective species Echinococcus multilocularis, E. granulosus, Taenia solium and the laboratory model Hymenolepis microstoma as examples. The 115- to 141-megabase genomes offer insights into the evolution of parasitism. Synteny is maintained with distantly related blood flukes but we find extreme losses of genes and pathways that are ubiquitous in other animals, including 34 homeobox families and several determinants of stem cell fate. Tapeworms have specialized detoxification pathways, metabolism that is finely tuned to rely on nutrients scavenged from their hosts, and species-specific expansions of non-canonical heat shock proteins and families of known antigens. We identify new potential drug targets, including some on which existing pharmaceuticals may act. The genomes provide a rich resource to underpin the development of urgently needed treatments and control.
Genome sequences of human-infective tapeworm species reveal extreme losses of genes and pathways that are ubiquitous in other animals, species-specific expansions of non-canonical heat shock proteins and families of known antigens, specialized detoxification pathways, and metabolism that relies on host nutrients; this information is used to identify new potential drug targets.
Tapeworms cause echinococcosis and cysticercosis, two of the most severe parasitic diseases found in humans, and both on the World Health Organization's list of neglected tropical diseases. The publication of four tapeworm genome sequences (from the human-infective species Echinococcus multilocularis, E. granulosus and Taenia solium, and the laboratory model Hymenolepis microstoma) and the identification of potential new drug targets for treating tapeworm infections is therefore a welcome development. Analysis of the sequences provides insights into the evolution of parasitism and reveals extreme losses of genes and pathways ubiquitous in other animals on one hand and species-specific expansions of genes on the other. More than a thousand E. multilocularis proteins emerge as potential targets, and of these, close to 200 with the highest scores may be targeted with existing pharmaceuticals.

Genetics

Ecology

Comparative genomics of the major parasitic worms

Avril Coghlan et al., Oct 29, 2018

Parasitic nematodes (roundworms) and platyhelminths (flatworms) cause debilitating chronic infections of humans and animals, decimate crop production and are a major impediment to socioeconomic development. Here we report a broad comparative study of 81 genomes of parasitic and non-parasitic worms. We have identified gene family births and hundreds of expanded gene families at key nodes in the phylogeny that are relevant to parasitism. Examples include gene families that modulate host immune responses, enable parasite migration through host tissues or allow the parasite to feed. We reveal extensive lineage-specific differences in core metabolism and protein families historically targeted for drug development. From an in silico screen, we have identified and prioritized new potential drug targets and compounds for testing. This comparative genomics resource provides a much-needed boost for the research community to understand and combat parasitic worms.
Comparative study of 81 genomes of parasitic and non-parasitic worms identifies gene family births and expanded gene families at key nodes in the phylogeny that are relevant to parasitism and proteins historically targeted for drug development.

Genetics

Ecology

The genomic basis of parasitism in the Strongyloides clade of nematodes

Vicky Hunt et al., Feb 1, 2016

Taisei Kikuchi, Mark Viney, Matthew Berriman and colleagues report the genome sequences of six nematode species from the Strongyloides clade, including human and animal pathogens, facultative parasites and a free-living species. They find that expansions of the astacin and SCP/TAPS gene families are associated with parasitism in these species.
Soil-transmitted nematodes, including the Strongyloides genus, cause one of the most prevalent neglected tropical diseases. Here we compare the genomes of four Strongyloides species, including the human pathogen Strongyloides stercoralis, and their close relatives that are facultatively parasitic (Parastrongyloides trichosuri) and free-living (Rhabditophanes sp. KR3021). A significant paralogous expansion of key gene families (those encoding astacin-like and SCP/TAPS proteins) is associated with the evolution of parasitism in this clade. Exploiting the unique Strongyloides life cycle, we compare the transcriptomes of the parasitic and free-living stages and find that these same gene families are upregulated in the parasitic stages, underscoring their role in nematode parasitism.

Genetics

Ecology

Assembled chromosomes of the blood fluke Schistosoma mansoni provide insight into the evolution of its ZW sex-determination system

Sarah Buddenborg et al., Aug 13, 2021

Background: Schistosoma mansoni is a flatworm that causes a neglected tropical disease affecting millions worldwide. Most flatworms are hermaphrodites but schistosomes have genotypically determined male (ZZ) and female (ZW) sexes. Sex is essential for pathology and transmission; however, the molecular determinants of sex remain unknown, and their study is limited by poorly resolved sex chromosomes in previous genome assemblies.
Results: We assembled the 391.4 Mb S. mansoni genome into individual, single-scaffold chromosomes, including Z and W. Manual curation resulted in a vastly improved gene annotation, resolved gene and repeat arrays, trans-splicing, and almost all UTRs. The sex chromosomes each comprise pseudoautosomal regions and single sex-specific regions. The Z-specific region contains 932 genes, but on W all but 29 of these genes have been lost and the presence of five pseudogenes indicates that degeneration of W is ongoing. Synteny analysis reveals an ancient chromosomal fusion corresponding to the oldest part of Z, where only a single gene, encoding the large subunit of pre-mRNA splicing factor U2AF, has retained an intact copy on W. The sex-specific copies of U2AF have divergent N-termini and show sex-biased gene expression.
Conclusion: Our assembly with fully resolved chromosomes provides evidence of an evolutionary path taken to create the Z and W sex chromosomes of schistosomes. Sex-linked divergence of the single U2AF gene, which has been present in the sex-specific regions longer than any other extant gene with distinct male and female specific copies and expression, may have been a pivotal step in the evolution of gonochorism and genotypic sex determination of schistosomes.

Genetics

Ecology

Automated assembly of high-quality diploid human reference genomes

Erich Jarvis et al., Mar 6, 2022

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has greatly benefited society [1,2]. However, it still has many gaps and errors, and does not represent a biological human genome since it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference genome, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a duplicate genome, and is thus nearly homozygous [5]. To address these limitations, the Human Pangenome Reference Consortium (HPRC) recently formed with the goal of creating a collection of high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Approaches that used highly accurate long reads and parent-child data to sort haplotypes during assembly outperformed those that did not. Developing a combination of all the top performing methods, we generated our first high-quality diploid reference assembly, containing only ∼4 gaps (range 0-12) per chromosome, most within ±1% of CHM13’s length. Nearly 1/4th of protein coding genes have synonymous amino acid changes between haplotypes, and centromeric regions showed the highest density of variation. Our findings serve as a foundation for assembling near-complete diploid human genomes at the scale required for constructing a human pangenome reference that captures all genetic variation from single nucleotides to large structural rearrangements.

Genetics

Molecular Biology

Genomics of cold adaptations in the Antarctic notothenioid fish radiation

Iliana Bista et al., Jun 9, 2022

Numerous novel adaptations characterise the radiation of notothenioids, the dominant fish group in the freezing seas of the Southern Ocean. To improve understanding of the evolution of this iconic fish group, we generated and analysed new genome assemblies for 24 species covering all major subgroups of the radiation. We present a new estimate for the onset of the radiation at 10.7 million years ago, based on a time-calibrated phylogeny derived from genome-wide sequence data. We identify a two-fold variation in genome size, driven by expansion of multiple transposable element families, and use long-read sequencing data to reconstruct two evolutionarily important, highly repetitive gene family loci. First, we present the most complete reconstruction to date of the antifreeze glycoprotein gene family, whose emergence enabled survival in sub-zero temperatures, showing the expansion of the antifreeze gene locus from the ancestral to the derived state. Second, we trace the loss of haemoglobin genes in icefishes, the only vertebrates lacking functional haemoglobins, through complete reconstruction of the two haemoglobin gene clusters across notothenioid families. Finally, we show that both the haemoglobin and antifreeze genomic loci are characterised by multiple transposon expansions that may have driven the evolutionary history of these genes.

Genetics

Ecology

Complete vertebrate mitogenomes reveal widespread gene duplications and repeats

Giulio Formenti et al., Jul 1, 2020

Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly. As part of the Vertebrate Genomes Project (VGP) we have developed mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes that incorporates both long (>10 kbp, PacBio or Nanopore) and short (100-300 bp, Illumina) reads. Our pipeline led to successful complete mitogenome assemblies of 100 vertebrate species of the VGP. We have observed that tissue type and library size selection have considerable impact on mitogenome sequencing and assembly. Comparing our assemblies to purportedly complete reference mitogenomes based on short-read sequencing, we have identified errors, missing sequences, and incomplete genes in those references, particularly in repeat regions. Our assemblies have also identified novel gene region duplications, shedding new light on mitochondrial genome evolution and organization.

Genetics

Molecular Biology
