ResearchHub | Open Science Community

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Ruibang Luo et al., Dec 1, 2012

De novo genome assembly from next-generation sequencing (NGS) short reads is increasing rapidly; however, several major challenges must still be overcome for it to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. To overcome these challenges, we have developed its successor, SOAPdenovo2, whose new algorithm design reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes. Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and 50-fold longer than in the first published version. The genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.

Genetics

Molecular Biology

The oyster genome reveals stress adaptation and complexity of shell formation

Guofan Zhang et al., Sep 19, 2012

The Pacific oyster Crassostrea gigas belongs to one of the most species-rich but genomically poorly explored phyla, the Mollusca. Here we report the sequencing and assembly of the oyster genome using short reads and a fosmid-pooling strategy, along with transcriptomes of development and stress response and the proteome of the shell. The oyster genome is highly polymorphic and rich in repetitive sequences, with some transposable elements still actively shaping variation. Transcriptome studies reveal an extensive set of genes responding to environmental stress. The expansion of genes coding for heat shock protein 70 and inhibitors of apoptosis is probably central to the oyster’s adaptation to sessile life in the highly stressful intertidal zone. Our analyses also show that shell formation in molluscs is more complex than currently understood and involves extensive participation of cells and their exosomes. The oyster genome sequence fills a void in our understanding of the Lophotrochozoa. The sequencing and assembly of the highly polymorphic oyster genome through a combination of short reads and fosmid pooling, complemented with extensive transcriptome analysis of development and stress response and proteome analysis of the shell, provides new insight into oyster biology and adaptation to a highly changeable environment. Oysters are keystone species in estuarine ecology and among the most important aquaculture species worldwide. The sequencing and assembly of the genome of the Pacific oyster, Crassostrea gigas, are now reported. Comparisons with other genomes reveal an expansion of defence genes as an adaptation to life as a sessile species in the intertidal zone, a surprisingly complex pathway for shell formation and dramatic evolution of genes related to larval development, highlighting their adaptive significance for marine invertebrates.

Genetics

Ecology

The genome of the mesopolyploid crop species Brassica rapa

Xiaowu Wang et al., Aug 28, 2011

The Brassica rapa Genome Sequencing Project Consortium reports the draft genome of the B. rapa accession Chiifu-401-42, an inbred Chinese cabbage line. The B. rapa genome should provide a useful reference genome for the Brassica species, which include many important oil and vegetable crops. We report the annotation and analysis of the draft genome sequence of Brassica rapa accession Chiifu-401-42, a Chinese cabbage. We modeled 41,174 protein coding genes in the B. rapa genome, which has undergone genome triplication. We used Arabidopsis thaliana as an outgroup for investigating the consequences of genome triplication, such as structural and functional evolution. The extent of gene loss (fractionation) among triplicated genome segments varies, with one of the three copies consistently retaining a disproportionately large fraction of the genes expected to have been present in its ancestor. Variation in the number of members of gene families present in the genome may contribute to the remarkable morphological plasticity of Brassica species. The B. rapa genome sequence provides an important resource for studying the evolution of polyploid genomes and underpins the genetic improvement of Brassica oil and vegetable crops.

Genetics

Molecular Biology

Whole-genome analyses resolve early branches in the tree of life of modern birds

Erich Jarvis et al., Dec 11, 2014

To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.

Genetics

Paleontology

The sequence and de novo assembly of the giant panda genome

Ruiqiang Li et al., Dec 13, 2009

Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes. The genome of the giant panda — specifically of the female Beijing Olympics mascot Jingjing — has been determined using short-read sequencing technology, a first for such a complex genome. It consists of some 2.4 billion DNA base pairs, compared to 3 billion in humans, and contains around 21,000 protein-encoding genes, similar to the human genome. Genomic diversity reflected in the sequence is high, raising hopes that despite a population of only about 2,500, conservation efforts can keep the species from extinction. Intriguingly, the panda appears to have all the genes needed for a carnivorous digestive system but lacks digestive cellulase genes. It may therefore depend on its gut microbiome to handle its famously limited bamboo diet. Taste may be a diet-limiting factor: loss of function of the T1R1 gene means that pandas may not experience the umami taste associated with high-protein foods. 
Technical aspects of this work pave the way for the use of next-generation sequencing for rapid de novo assembly of large eukaryotic genomes. Here, a draft sequence of the giant panda genome is assembled using next-generation sequencing technology alone. Genome analysis reveals a low divergence rate in comparison with dog and human genomes and insights into panda-specific traits; for example, the giant panda's bamboo diet may be more dependent on its gut microbiome than its own genetic composition.

Genetics

Molecular Biology

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Keith Bradnam et al., Jul 22, 2013

Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Genetics

Molecular Biology

Assemblathon 1: A competitive assessment of de novo short read assembly methods

Dent Earl et al., Sep 16, 2011

Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/ .

Genetics

Ecology

Insights into salt tolerance from the genome of Thellungiella salsuginea

Hua‐Jun Wu et al., Jul 9, 2012

Thellungiella salsuginea, a close relative of Arabidopsis, represents an extremophile model for abiotic stress tolerance studies. We present the draft sequence of the T. salsuginea genome, assembled based on ∼134-fold coverage to seven chromosomes with a coding capacity of at least 28,457 genes. This genome provides resources and evidence about the nature of defense mechanisms constituting the genetic basis underlying plant abiotic stress tolerance. Comparative genomics and experimental analyses identified genes related to cation transport, abscisic acid signaling, and wax production prominent in T. salsuginea as possible contributors to its success in stressful environments.

Genetics

Ecology

A Reference Feature based method for Quantification and Identification of LC-MS based untargeted metabolomics

Enhui Luan et al., Mar 29, 2020

Batch inconsistency is a major problem when applying LC-MS based untargeted metabolomics in real-time analysis situations such as clinical diagnosis or health monitoring. In addition, inefficient MS2 collection is a major obstacle to metabolite identification. Here, we developed a reference-feature based quantification and identification strategy (RFQI). In RFQI, samples are individually profiled using a pre-fixed reference feature table. Quantification results show that RFQI significantly improves the feature overlap rate and reduces variance across batches in real-time-analysis mode, and finds more than four times as many features. In addition, RFQI collects MS2 from a consecutively increasing set of samples for metabolite identification of pre-fixed features, so it can effectively compensate for the poor efficiency of MS2 collection in data-dependent acquisition mode. In summary, RFQI takes full advantage of consecutively increasing samples in real-time analysis situations, both for quantification and identification. Key words: batch effect, LC-MS based untargeted metabolomics, metabolite identification

Philosophy

Artificial Intelligence
