ResearchHub | Open Science Community

The Genomes of Oryza sativa: A History of Duplications

Jun Yu et al., Jan 21, 2005

We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000–40,000. Only 2%–3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.

Genetics

Molecular Biology

Topological structure analysis of the protein-protein interaction network in budding yeast

Dongbo Bu et al., Apr 23, 2003

Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology. Here, a spectral method derived from graph theory was introduced to uncover hidden topological structures (i.e., quasi-cliques and quasi-bipartites) of complicated protein-protein interaction networks. Our analyses suggest that these hidden topological structures consist of biologically relevant functional groups. This result motivates a new method to predict the function of uncharacterized proteins based on the classification of known proteins within topological structures. Using this spectral analysis method, 48 quasi-cliques and six quasi-bipartites were isolated from a network involving 11,855 interactions among 2,617 proteins in budding yeast, and 76 uncharacterized proteins were assigned functions.
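
The spectral idea in this abstract can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: it assumes that nodes with large components in the leading eigenvector of the adjacency matrix form a densely connected group, and it uses a hypothetical 7-node graph (a 4-node clique with a short tail) and a hypothetical cutoff ratio of 0.5. The function names are illustrative only.

```python
# Illustrative sketch only: a toy graph, not the yeast network from the paper.

def build_adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = 1
        adj[b][a] = 1
    return adj

def principal_eigenvector(adj, iterations=200):
    """Approximate the leading eigenvector by power iteration on (A + I).

    Adding the identity shifts every eigenvalue by 1 without changing the
    eigenvectors, which keeps the iteration from oscillating.
    """
    n = len(adj)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [vec[i] + sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec

def quasi_clique(adj, ratio=0.5):
    """Return nodes whose eigenvector component is at least ratio * max (assumed cutoff)."""
    vec = principal_eigenvector(adj)
    cutoff = ratio * max(vec)
    return [i for i, x in enumerate(vec) if x >= cutoff]

# Nodes 0-3 form a clique; nodes 4-6 hang off node 3 as a path.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
adjacency = build_adjacency(7, edges)
members = quasi_clique(adjacency)
print(members)
```

On this toy input the clique nodes carry the largest eigenvector components, while the tail nodes fall below the cutoff. Real interaction networks would need a proper eigensolver and more than one eigenvector.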

Genetics

Law

ProALIGN: Directly learning alignments for protein structure prediction via exploiting context-specific alignment motifs

Lupeng Kong et al., Dec 29, 2020

Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and the templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach, ProALIGN, that can predict much more accurate sequence-template alignments. Like protein sequences consisting of sequence motifs, protein alignments are also composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific, as their characteristic patterns are tightly related to the sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces inaccuracies in the handcrafted scoring functions widely used by current threading approaches. For a query protein and a template, we apply the neural network to directly infer the likelihoods of all possible residue pairs in their entirety, which can effectively account for the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent datasets with a total of 6,688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than existing methods, including HHpred, CNFpred, CEthreader and DeepThreader. These results clearly demonstrate the effectiveness of exploiting context-specific alignment motifs by deep learning for protein threading.
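
The binary-matrix representation mentioned in this abstract can be sketched in a few lines. This is a hedged illustration, not ProALIGN code: it assumes the alignment is given as two equal-length gapped strings with "-" as the gap character, and the function name `alignment_to_matrix` is hypothetical.

```python
def alignment_to_matrix(aligned_query, aligned_template, gap="-"):
    """Encode a pairwise alignment as a binary matrix.

    Rows index query residues, columns index template residues, and a 1
    marks an aligned residue pair. Gap columns contribute no entries.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned strings must have the same length")

    query_len = sum(1 for c in aligned_query if c != gap)
    template_len = sum(1 for c in aligned_template if c != gap)
    matrix = [[0] * template_len for _ in range(query_len)]

    qi = ti = 0  # running residue indices in query and template
    for q, t in zip(aligned_query, aligned_template):
        if q != gap and t != gap:
            matrix[qi][ti] = 1
        if q != gap:
            qi += 1
        if t != gap:
            ti += 1
    return matrix

# Toy alignment: query "ACG" against template "ATG" with one gap in each.
matrix = alignment_to_matrix("AC-G", "A-TG")
for row in matrix:
    print(row)
```

In this toy case only the first and last residues of each sequence are paired, so the matrix has ones at positions (0, 0) and (2, 2). A network trained to predict such matrices would output a likelihood per cell rather than hard 0/1 values.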

Genetics

Artificial Intelligence

CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction

Fusong Ju et al., Oct 7, 2020

Protein functions are largely determined by the final details of their tertiary structures, and the structures can be accurately reconstructed from inter-residue distances. Residue co-evolution has become the primary principle for estimating inter-residue distances, since residues in close spatial proximity tend to co-evolve. The widely used approaches infer residue co-evolution using an indirect strategy, i.e., they first extract some handcrafted features, such as the covariance matrix, from the multiple sequence alignment (MSA) of the query protein, and then infer residue co-evolution from these features rather than from the raw information carried by the MSA. This indirect strategy always leads to considerable information loss and inaccurate estimation of inter-residue distances. Here, we report a deep neural network framework (called CopulaNet) to learn residue co-evolution directly from the MSA without any handcrafted features. CopulaNet consists of two key elements: (i) an encoder to model context-specific mutation for each residue, and (ii) an aggregator to model correlations among residues and thereafter infer residue co-evolution. Using the CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrated the successful application of CopulaNet for estimating inter-residue distances and further predicting protein tertiary structure with improved accuracy and efficiency. Head-to-head comparison suggested that for 24 out of the 31 free modeling CASP13 domains, ProFOLD outperformed AlphaFold, one of the state-of-the-art prediction approaches.

Artificial Intelligence

Biochemistry

Accurate prediction of RNA secondary structure including pseudoknots through solving minimum-cost flow with learned potentials

Tiansu Gong et al., Sep 19, 2022

Pseudoknots are key structure motifs of RNA, and pseudoknotted RNAs play important roles in a variety of biological processes. Here, we present KnotFold, an accurate approach to the prediction of RNA secondary structure including pseudoknots. The key elements of KnotFold include a learned potential function and a minimum-cost flow algorithm to find the secondary structure with the lowest potential. KnotFold learns the potential from the RNAs with known structures using a self-attention-based neural network, thus avoiding the inaccuracy of hand-crafted energy functions. The specially designed minimum-cost flow algorithm used by KnotFold considers all possible combinations of base pairs and selects from them the optimal combination. The algorithm breaks the restriction of nested base pairs required by the widely used dynamic programming algorithms, thus facilitating the identification of pseudoknots. Using a total of 1605 RNAs as representatives, we demonstrate the successful application of KnotFold in predicting RNA secondary structures including pseudoknots with accuracy significantly higher than the state-of-the-art approaches. We anticipate that KnotFold, with its superior accuracy, will greatly facilitate the understanding of RNA structures and functionalities.
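
The "nested base pairs" restriction mentioned in this abstract can be made concrete with a small check. The sketch below is not part of KnotFold and does not implement minimum-cost flow; it only illustrates the standard definition that two base pairs (i, j) and (k, l) cross, and therefore form a pseudoknot, when i < k < j < l. The function name and the example positions are hypothetical.

```python
from itertools import combinations

def crossing_pairs(base_pairs):
    """Return every pair of base pairs that cross each other.

    Base pairs are (i, j) position tuples. Two pairs (i, j) and (k, l)
    with i < k cross when k lies inside (i, j) but l lies outside it,
    i.e. i < k < j < l. Nested or disjoint pairs are not reported.
    """
    normalized = sorted(tuple(sorted(pair)) for pair in base_pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings

# (0, 10) and (2, 8) are nested; (5, 14) crosses both of them.
pseudoknotted = crossing_pairs([(0, 10), (2, 8), (5, 14)])
nested_only = crossing_pairs([(0, 10), (2, 8)])
print(pseudoknotted)
print(nested_only)
```

A structure with an empty crossing list is fully nested and is the kind of structure that nested-pair dynamic programming can represent; any non-empty list indicates a pseudoknot.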

Genetics

Biochemistry

Accurate and efficient protein sequence design through learning concise local environment of residues

Bin Huang et al., Jun 29, 2022

Protein sequence design has been widely applied in rational protein engineering, and increasing design accuracy and efficiency is highly desired. Here we present ProDESIGN-LE, an accurate and efficient design approach, which adopts a concise but informative representation of each residue's local environment and trains a transformer to select an appropriate residue at a position from its local environment. ProDESIGN-LE iteratively applies the transformer to the positions in the target structure, eventually acquiring a designed sequence in which all residues fit well with their local environments. ProDESIGN-LE designed sequences for 68 naturally occurring and 129 hallucinated proteins within 20 seconds per protein on average, and the predicted structures from the designed sequences closely resemble the target structures, with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in E. coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein.

Biochemistry

Immunology

Predicting immunogenicity by modeling the positive and negative selection of CD8+ T cells in individual patients

Ngoc Tran et al., Jul 5, 2022

Neoantigens are promising targets for cancer immunotherapy, but their discovery remains challenging, mainly because of the limited sensitivity of current technologies in detecting them and the specificity of our immune system in recognizing them. In this study, we addressed both of those problems and proposed a new approach for neoantigen identification and validation from mass spectrometry (MS) based immunopeptidomics. In particular, we developed DeepNovo Peptidome, a de novo sequencing-based search engine that was optimized for HLA peptide identification, especially non-canonical HLA peptides. We also developed DeepSelf, a personalized model for immunogenicity prediction based on the central tolerance of T cells, which could be used to select candidate neoantigens from non-canonical HLA peptides. Both tools were built on deep learning models that were trained specifically for HLA peptides and for the immunopeptidome of each individual patient. To demonstrate their applications, we presented a new MS-based immunopeptidomics study of native tumor tissues from five patients with cervical cancer. We applied DeepNovo Peptidome and DeepSelf to identify and prioritize candidate neoantigens, and then performed in vitro validation of autologous neoantigen-specific T cell responses to confirm our results. Our MS-based de novo sequencing approach does not depend on prior knowledge of genome, transcriptome, or proteome information. Thus, it provides an unbiased solution to discover neoantigens from any source.

Genetics

Artificial Intelligence

Highly accurate and robust protein sequence design with CarbonDesign

Mingrong Ren et al., Aug 7, 2023

Protein sequence design, the inverse problem of protein structure prediction, plays a crucial role in protein engineering. Although recent deep learning-based methods have shown promising advancements, achieving accurate and robust protein sequence design remains an ongoing challenge. Here, we present CarbonDesign, a new approach that draws inspiration from successful ingredients of AlphaFold for protein structure prediction and makes significant and novel developments tailored specifically for protein sequence design. At its core, CarbonDesign explores Inverseformer, a novel network architecture adapted from AlphaFold’s Evoformer, to learn representations from backbone structures and an amortized Markov Random Fields model for sequence decoding. Moreover, we incorporate other essential AlphaFold concepts into CarbonDesign: an end-to-end network recycling technique to leverage evolutionary constraints in protein language models and a multi-task learning technique to generate side chain structures corresponding to the designed sequences. Through rigorous evaluations on independent testing data sets, including the CAMEO and recent CASP15 data sets, as well as the predicted structures from AlphaFold, we show that CarbonDesign outperforms other published methods, achieving high accuracy in sequence generation. Moreover, it exhibits superior performance on de novo backbone structures obtained from recent diffusion generative models such as RFdiffusion and FrameDiff, highlighting its potential for enhancing de novo protein design. Notably, CarbonDesign also supports zero-shot prediction of the functional effects of sequence variants, indicating its potential application in directed evolution-based design. In summary, our results illustrate CarbonDesign’s accurate and robust performance in protein sequence design, making it a promising tool for applications in bioengineering.

Genetics

Artificial Intelligence

Bi-clustering interpretation and prediction of correlation between gene expression and protein abundance

Xiaojun Wang et al., Feb 23, 2018

In most organisms, transcript and protein levels correlate only moderately, for various reasons such as regulation of transcription and protein degradation. Better prediction and understanding of the correlation between gene expression and protein abundance have become possible by harnessing the matching RNA/protein datasets produced by modern high-throughput RNA-Seq and mass spectrometry methods. In this work, we utilized some well-studied matching RNA/protein datasets and explored, for the first time, a bi-clustering method to cluster genes that have consistent correlation patterns between gene expression and protein abundance. The clustering results were interpreted from the perspective of both transcriptomic and proteomic features, which shows that mRNA half-life, protein half-life and protein structure in concert significantly affect the correlation of gene expression and protein abundance. With these and other carefully selected features, a prediction model based on individual clusters, called the Cluster-based Linear prediction Model (CLM), was built and tested on mouse liver mitochondrial, mouse brainstem mitochondrial, Saccharomyces cerevisiae and Danio rerio datasets. CLM could find genes for which protein abundance can be predicted from mRNA data. In summary, based on bi-clustering, feature selection and the CLM model, we have established a new and valuable cluster-based protein abundance prediction method.
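
The idea of a per-cluster linear predictor described in this abstract can be sketched as follows. This is an assumption-laden illustration rather than the published CLM: cluster labels are taken as given (no bi-clustering is performed), mRNA level is the only feature, ordinary least squares is used for each cluster, and all names and numbers are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_cluster_models(records):
    """Fit one linear model per cluster.

    records: iterable of (cluster_id, mrna_level, protein_abundance).
    Returns a dict mapping cluster_id -> (slope, intercept).
    """
    grouped = {}
    for cluster_id, mrna, protein in records:
        xs, ys = grouped.setdefault(cluster_id, ([], []))
        xs.append(mrna)
        ys.append(protein)
    return {cid: fit_line(xs, ys) for cid, (xs, ys) in grouped.items()}

def predict(models, cluster_id, mrna):
    """Predict protein abundance using the model of the gene's cluster."""
    slope, intercept = models[cluster_id]
    return slope * mrna + intercept

# Made-up values: cluster "A" tracks mRNA closely, cluster "B" only weakly.
records = [
    ("A", 1.0, 2.0), ("A", 2.0, 4.0), ("A", 3.0, 6.0),
    ("B", 1.0, 5.0), ("B", 2.0, 5.5), ("B", 3.0, 6.0),
]
models = fit_cluster_models(records)
print(models)
print(predict(models, "B", 4.0))
```

Fitting each cluster separately lets genes with different mRNA-to-protein relationships get different slopes, which a single global regression would average away.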

Genetics

Artificial Intelligence

Cost-effective DNA storage with DNA movable type

Chenyang Wang et al., Jul 19, 2024

In the face of exponential data growth, DNA-based storage offers a promising solution for preserving big data. However, most existing DNA storage methods, akin to traditional block printing, require costly chemical synthesis for each individual data file, adopting a sequential, one-time-use synthesis approach. To overcome these limitations, we introduce a novel, cost-effective "DNA-Movable-Type Storage" system, inspired by movable type printing. This system utilizes pre-fabricated DNA movable types (DNA-MTs): short, double-stranded DNA oligonucleotides encoding specific payload, address, and checksum data. These DNA-MTs are enzymatically ligated and assembled into cohesive sequences, termed "DNA movable type blocks", and the assembly process is streamlined by the automated BISHENG-1 DNA-MT inkjet printer. Using BISHENG-1, we successfully printed, assembled, stored and accurately retrieved 43.7 KB of data files in diverse formats (text, image, audio, and video) in vitro and in vivo, using only 350 DNA-MTs. Notably, each DNA-MT, synthesized once (2 OD), can be used up to 10,000 times, reducing costs to $121.57/MB and outperforming existing DNA storage methods. This innovation circumvents the need to synthesize entire DNA sequences encoding files from scratch, offering significant cost and efficiency advantages. Furthermore, it has considerable untapped potential to advance a robust DNA storage system, better meeting the extensive data storage demands of the big-data era.

Genetics

Molecular Biology
