ResearchHub | Open Science Community

T4SEpp: a pipeline integrated with protein language models effectively predicting bacterial type IV secreted effectors

Yueming Hu et al., Jul 3, 2023

Abstract: Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing disease. Identifying effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but it remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers that predict T4SEs, and compared their performance. An integrated model, T4SEpp, was assembled from three modules: one that searches for full-length, signal-sequence and effector-domain homologs of known T4SEs; a machine learning module based on hand-crafted features extracted from the signal sequences; and a third containing the three best-performing pre-trained protein language models. T4SEpp outperformed other state-of-the-art (SOTA) software tools, achieving ∼0.95 sensitivity at a high specificity of ∼0.99 on an independent testing dataset. Additionally, we performed a comprehensive search among 8,761 bacterial species, which identified 227 species belonging to 3 phyla and 117 genera that possess T4SSs. Using T4SEpp, we then identified a total of 12,622 plausible T4SEs. Overall, T4SEpp provides a better solution for identifying bacterial T4SEs and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at https://bis.zju.edu.cn/T4SEpp.
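
The sensitivity and specificity figures quoted in the abstract are standard binary-classification metrics. A minimal sketch of how they are computed is shown below; the function name and the toy labels are assumptions made for illustration and are not taken from T4SEpp or its test data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Labels: 1 = effector (positive), 0 = non-effector (negative).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: fraction of real effectors that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of non-effectors correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only; NOT data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting sensitivity at a fixed high specificity, as the abstract does, matters for genome-wide screens, where non-effectors vastly outnumber effectors and even a small false-positive rate produces many spurious candidates.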

Genetics

Artificial Intelligence

Integrated aqueous humor ceRNA and miRNA-TF-mRNA network analysis reveals potential molecular mechanisms governing primary open-angle glaucoma pathogenesis

Xiaoqin Wang et al., Jul 17, 2020

Abstract: Primary open-angle glaucoma (POAG) is the leading cause of blindness globally and develops through complex and poorly understood biological mechanisms. Herein, we conducted an integrated bioinformatics analysis of existing aqueous humor (AH) gene expression datasets to identify key genes and regulatory mechanisms governing POAG progression. We downloaded AH gene expression datasets (GSE101727 and GSE105269) corresponding to healthy controls and POAG patients from the Gene Expression Omnibus. We then identified mRNAs, microRNAs (miRNAs), and long non-coding RNAs (lncRNAs) that were differentially expressed (DE) between controls and POAG patients. DEmRNAs and DElncRNAs were subjected to pathway enrichment analyses, after which a protein-protein interaction (PPI) network was generated. This network was then expanded to establish lncRNA-miRNA-mRNA and miRNA-transcription factor (TF)-mRNA networks. In total, the GSE101727 dataset was used to identify 2746 DElncRNAs and 2208 DEmRNAs, while the GSE105269 dataset was used to identify 45 DEmiRNAs. We ultimately constructed a competing endogenous RNA (ceRNA) network incorporating 37 of these lncRNAs, 5 miRNAs, and 14 mRNAs. The proteins encoded by these 14 hub mRNAs were found to be significantly enriched for activities that may be linked to POAG pathogenesis. In addition, we generated a miRNA-TF-mRNA regulatory network containing 2 miRNAs (miR-135a-5p and miR-139-5p), 5 TFs (TGIF2, TBX5, HNF1A, TCF3, and FOS) and 5 mRNAs (SHISA7, ST6GAC2, TXNIP, FOS, and DCBLD2). The SHISA7, ST6GAC2, TXNIP, FOS, and DCBLD2 genes, which are regulated by the TFs (TGIF2, HNF1A, TCF3, and FOS), may be viable therapeutic targets for the prevention or treatment of POAG.
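
A lncRNA-miRNA-mRNA (ceRNA) network of the kind described above links a lncRNA and an mRNA whenever both are predicted to interact with the same miRNA. The sketch below shows only that joining step, under simplifying assumptions: the function name and all identifiers (`lncA`, `miR-x`, `geneP`, and so on) are hypothetical placeholders, and the interaction pairs are invented for illustration. It does not reproduce the authors' pipeline or any result from the study.

```python
from collections import defaultdict


def cerna_triplets(lnc_mirna_pairs, mirna_mrna_pairs):
    """Join lncRNA-miRNA and miRNA-mRNA pairs on the shared miRNA.

    Returns a sorted list of (lncRNA, miRNA, mRNA) triplets.
    """
    # Index mRNA targets by the miRNA that is predicted to bind them.
    mrnas_by_mirna = defaultdict(set)
    for mirna, mrna in mirna_mrna_pairs:
        mrnas_by_mirna[mirna].add(mrna)

    triplets = set()
    for lnc, mirna in lnc_mirna_pairs:
        # A lncRNA and an mRNA form a ceRNA edge if they share this miRNA.
        for mrna in mrnas_by_mirna.get(mirna, ()):
            triplets.add((lnc, mirna, mrna))
    return sorted(triplets)


# Hypothetical placeholder interactions; NOT taken from the paper.
lnc_mirna = [("lncA", "miR-x"), ("lncB", "miR-y"), ("lncC", "miR-z")]
mirna_mrna = [("miR-x", "geneP"), ("miR-x", "geneQ"), ("miR-y", "geneQ")]

for triplet in cerna_triplets(lnc_mirna, mirna_mrna):
    print(triplet)
```

In practice the input pairs would come from interaction prediction resources and be restricted to differentially expressed RNAs before joining; those steps are omitted here.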

Genetics

Molecular Biology

iSeq: An integrated tool to fetch public sequencing data

Haoyu Chao et al., May 20, 2024

Abstract: High-throughput sequencing technologies (Next Generation Sequencing; NGS) are increasingly used by researchers to address a diverse array of biological questions. Thanks to the scale and efficiency of modern sequencing, significant advances have been made across fields ranging from genome analysis to the dynamics of protein-nucleic acid interactions. Recognizing that NGS data harbors rich biological information, the International Nucleotide Sequence Database Collaboration (INSDC) was established nearly 40 years ago to collect and disseminate public nucleotide sequence data and associated metadata. The National Genomics Data Center (NGDC) has also provided open access to vast amounts of raw sequence data. These databases have greatly enhanced the capacity for reanalyzing NGS data. In recent years, amid the rise of large language models, biological sequences and data have emerged as inputs for training models to address biological challenges. However, methods for programmatically accessing this public sequencing data remain limited. To address this gap, we have developed iSeq, an integrated tool that allows quick and straightforward retrieval of metadata and NGS data via the command-line interface. iSeq is currently the only tool that supports simultaneous retrieval from multiple databases (GSA, SRA, ENA, DDBJ, and GEO). Additionally, iSeq supports a wide range of accession formats as input and features parallel downloads, multi-threaded processes, and FASTQ file merging. It is freely available on Bioconda (https://anaconda.org/bioconda/iseq) and GitHub (https://github.com/BioOmics/iSeq).

Highlights: iSeq supports multiple databases for accessing a wide range of raw sequencing data and metadata. iSeq supports at least 25 different accession formats as input. iSeq supports parallel downloads, multi-threaded processes, FASTQ file merging, and integrity verification.
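
Retrieving data from several archives with one tool implies first deciding which database an accession belongs to. The sketch below illustrates that routing idea using commonly seen accession prefixes. The pattern table and the `classify_accession` helper are assumptions written for this example; they cover only a subset of formats and are not iSeq's internal rules or its command-line interface.

```python
import re

# Illustrative prefix patterns (an assumption for this sketch, not iSeq's
# own rules). Each entry maps a database name to a compiled pattern.
ACCESSION_PATTERNS = [
    ("GEO", re.compile(r"^(GSE|GSM)\d+$")),
    ("GSA", re.compile(r"^(PRJCA|SAMC|CRA|CRX|CRR)\d+$")),
    ("SRA", re.compile(r"^(PRJNA|SAMN|SRP|SRS|SRX|SRR)\d+$")),
    ("ENA", re.compile(r"^(PRJEB|SAMEA|ERP|ERS|ERX|ERR)\d+$")),
    ("DDBJ", re.compile(r"^(PRJDB|SAMD|DRP|DRS|DRX|DRR)\d+$")),
]


def classify_accession(accession):
    """Return the database name guessed from the accession prefix, or None."""
    cleaned = accession.strip().upper()
    for database, pattern in ACCESSION_PATTERNS:
        if pattern.match(cleaned):
            return database
    return None


# Example accessions chosen only to exercise each pattern.
for acc in ["SRR1234567", "PRJEB12345", "DRR000001", "CRR311377", "GSE101727", "XYZ1"]:
    print(acc, "->", classify_accession(acc))
```

A prefix table like this is easy to extend, but records mirrored between SRA, ENA, and DDBJ keep their original prefix, so the guessed database reflects where a record was submitted rather than the only place it can be downloaded from.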

Ecology

Geology

Systematic single-cell analysis reveals dynamic control of transposable element activity orchestrating the endothelial-to-hematopoietic transition

Cong Feng et al., Jun 21, 2023

Abstract: Background: The endothelial-to-hematopoietic transition (EHT) during definitive hematopoiesis is highly conserved in vertebrates. Stage-specific expression of transposable elements (TEs) has been detected during zebrafish EHT and may promote hematopoietic stem cell formation by activating inflammatory signaling. However, little is known about how TEs contribute to the EHT process in human and mouse. Results: We reconstructed the single-cell EHT trajectories of human and mouse and resolved the dynamic expression patterns of TEs during EHT. Most TEs showed a transient co-upregulation pattern along the conserved EHT trajectories. Enhanced TE activation was tightly associated with the temporary relaxation of epigenetic silencing systems. TE products can be sensed by multiple pattern recognition receptors, triggering inflammatory signaling that facilitates the emergence of hematopoietic stem cells. Furthermore, we observed that hypoxia-related signals were enriched in cells with higher TE expression. Additionally, we constructed the hematopoietic cis-regulatory network of accessible TEs and identified potential TE-derived enhancers, which may boost the expression of specific EHT marker genes. Conclusions: Our study provides a systematic view of how TEs are dynamically controlled to promote the hematopoietic fate decision through transcriptional and cis-regulatory networks, and to pre-train the immunity of nascent hematopoietic stem cells.

Genetics

Molecular Biology
