ResearchHub | Open Science Community

Complete sequencing and characterization of 21,243 full-length human cDNAs

Toshio Ota et al., Dec 21, 2003

As a base for human transcriptome and functional genomics, we created the “full-length long Japan” (FLJ) collection of sequenced human cDNAs. We determined the entire sequence of 21,243 selected clones and found that 14,490 cDNAs (10,897 clusters) were unique to the FLJ collection. About half of them (5,416) seemed to be protein-coding. Of those, 1,999 clusters had not been predicted by computational methods. The distribution of GC content of nonpredicted cDNAs had a peak at ∼58% compared with a peak at ∼42% for predicted cDNAs. Thus, there seems to be a slight bias against GC-rich transcripts in current gene prediction procedures. The rest of the cDNAs unique to the FLJ collection (5,481) contained no obvious open reading frames (ORFs) and thus are candidate noncoding RNAs. About one-fourth of them (1,378) showed a clear pattern of splicing. The distribution of GC content of noncoding cDNAs was narrow and had a peak at ∼42%, relatively low compared with that of protein-coding cDNAs.
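The GC-content peaks quoted in this abstract come from a per-sequence GC percentage followed by a histogram. The sketch below shows that arithmetic only; it is not the authors' pipeline. The function names, the 2-point bin width, and the choice to ignore ambiguous bases such as N are all assumptions made for illustration.

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of unambiguous bases (A, C, G, T) that are G or C.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    """
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


def gc_peak(seqs, bin_width=2):
    """Return the (low, high) percent bounds of the most populated GC bin.

    bin_width is a hypothetical choice; the abstract does not state one.
    """
    hist = Counter(int(gc_percent(s) // bin_width) for s in seqs)
    top_bin, _ = hist.most_common(1)[0]
    return top_bin * bin_width, (top_bin + 1) * bin_width


if __name__ == "__main__":
    toy = ["ATGC", "AGCT", "AAAA"]  # toy sequences, not FLJ data
    print([gc_percent(s) for s in toy])
    print(gc_peak(toy))
```

Applied separately to the predicted and nonpredicted cDNA sets, a comparison of the two modal bins is the kind of measurement the abstract summarizes.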

Genetics

Molecular Biology

Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Koki Tsuyuzaki et al., May 20, 2019

Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but large-scale scRNA-seq datasets require long computational times and a large memory capacity. In this work, we review 21 fast and memory-efficient PCA implementations (10 algorithms) and evaluate their application using 4 real and 18 synthetic datasets. Our benchmarking showed that some PCA algorithms are faster, more memory efficient, and more accurate than others. In consideration of the differences in the computational environments of users and developers, we have also developed guidelines to assist with selection of appropriate PCA implementations.

Abbreviations: PCA, principal component analysis; scRNA-seq, single-cell RNA sequencing; sci-RNA-seq, single-cell combinatorial-indexing RNA-sequencing analysis; UML, unsupervised machine learning; QC, quality control; PC, principal component; EVD, eigenvalue decomposition; SVD, singular value decomposition; SimT, similarity transformation-based; DS, downsampling-based; SU, SVD update-based; Krylov, Krylov subspace-based; GD, gradient descent-based; Rand, random projection-based; Sklearn, scikit-learn; SKL, sequential Karhunen-Loeve transform; IRLBA, augmented implicitly restarted Lanczos bidiagonalization; IRAM, implicitly restarted Arnoldi method; GD, gradient descent; SGD, stochastic gradient descent; t-SNE, t-stochastic neighbor embedding; UMAP, uniform manifold approximation and projection; FIt-SNE, Fourier transform-accelerated interpolation-based t-stochastic neighbor embedding; oocPCA, out-of-core PCA; GMM, Gaussian mixture model; ARI, adjusted Rand index; Zstd, Zstandard; UMI, unique molecular identifier; CSV, comma-separated values; HDF5, hierarchical data format 5; 10X-HDF5, HDF5 provided by 10X Genomics; CSC, compressed sparse column format; CSR, compressed sparse row format; CCA, canonical correlation analysis; GLM, generalized linear models; CPMED, count per median; HVGs, highly variable genes.
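To make concrete what a memory-efficient PCA routine computes, here is a minimal sketch of the leading principal component found by power iteration. It is not one of the 21 benchmarked implementations. The function name, the seeded random start, and the stopping tolerance are assumptions chosen for illustration. The covariance matrix is never formed explicitly, which is the property that matters when the number of genes is large.

```python
import math
import random


def first_pc(rows, n_iter=500, tol=1e-12, seed=0):
    """Leading principal axis of `rows` (a list of equal-length lists).

    Power iteration on Xc^T Xc, where Xc is the column-centered data,
    computed as two matrix-vector products so that the d-by-d covariance
    matrix is never materialized.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Seeded random start (an assumption) to avoid an unlucky start vector
    # that is orthogonal to the leading direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(n_iter):
        # scores = Xc v, then w = Xc^T scores
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("iteration collapsed to zero; data may have no variance")
        w = [x / norm for x in w]
        delta = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if delta < tol:
            break
    return v


if __name__ == "__main__":
    toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]  # points on y = 2x
    print(first_pc(toy))
```

The returned vector has unit length, and its sign is arbitrary. Krylov-subspace methods such as IRLBA build on the same matrix-vector products while retaining more than one direction at a time.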

Artificial Intelligence

Molecular Biology

Nonlinear conjugate gradient method for vector optimization on Riemannian manifolds with retraction and vector transport

Kangming Chen et al., Feb 1, 2025

Artificial Intelligence

Numerical Analysis
