ResearchHub | Open Science Community

Lizhen Shi

Author with expertise in RNA Sequencing Data Analysis

Achievements

Cited Author

Open Access Advocate

Key Stats

Publications:

(50% Open Access)

Cited by: 240

Reputation

Biology: < 1%

Chemistry: < 1%

Economics: < 1%

Publications

Critical Assessment of Metagenome Interpretation: the second round of challenges

Fernando Meyer et al., Apr 1, 2022

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and for genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers on other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

Critical Assessment of Metagenome Interpretation - the second round of challenges

Fernando Meyer et al., Jul 12, 2021

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created from ∼1,700 novel and known microbial genomes, as well as ∼600 novel plasmids and viruses. Altogether 5,002 results by 76 program versions were analyzed, representing a 22x increase in results. Substantial improvements were seen in metagenome assembly, some due to using long-read data. The presence of related strains was still challenging for assembly and genome binning, as was assembly quality for the latter. Taxon profilers matured markedly, with taxon profilers and binners excelling at higher bacterial taxonomic ranks, but underperforming for viruses and archaea. Assessment of clinical pathogen detection techniques revealed a need to improve reproducibility. Analysis of program runtimes and memory usage identified highly efficient programs, including some top performers on other metrics. The CAMI II results identify current challenges, but also guide researchers in selecting methods for specific analyses.

Genetics

Ecology

Corrosion behavior of an ultrafine-grained Ti6Al4V-5Cu alloy in artificial saliva containing fluoride ions

Hui Liu et al., Jun 20, 2024

Ti6Al4V alloy has been widely used in dental applications, such as orthodontic mini-implants. However, it has been reported that fluoride ions can markedly accelerate the corrosion of implant materials and affect their performance. This work aimed to improve the F− erosion resistance of Ti6Al4V alloy through a combined strategy of Cu addition and grain refinement. Compared with Ti6Al4V alloy, both the coarse- and ultrafine-grained Ti6Al4V-5Cu alloys effectively mitigated the acceleration of the anodic process by fluoride ions, because Cu substituents blocked the continuous damage of FO· doped in the passive film. Furthermore, grain refinement enhanced the protective ability of the passive film: the passive film of the ultrafine-grained Ti6Al4V-5Cu alloy contained more oxides and fewer adsorbed fluorides than that of the coarse-grained Ti6Al4V-5Cu alloy. With the combination of Cu alloying and grain refinement, the ultrafine-grained Ti6Al4V-5Cu alloy is well suited to the fabrication of orthodontic devices.

SpaRC: Scalable Sequence Clustering using Apache Spark

Lizhen Shi et al., Jan 11, 2018

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. The software is available under the Apache 2.0 license at https://bitbucket.org/LizhenShi/sparc.

Genetics

Artificial Intelligence

Hybrid Clustering of Long and Short-read for Improved Metagenome Assembly

Yakang Lu et al., Jan 26, 2021

Next-generation sequencing has enabled metagenomics, the study of the genomes of microorganisms sampled directly from the environment without cultivation. We previously developed a proof-of-concept, scalable metagenome clustering algorithm based on Apache Spark to cluster sequence reads according to their species of origin. To overcome its under-clustering problem on short-read sequences, in this study we developed a new, two-step Label Propagation Algorithm (LPA) that first forms clusters of long reads and then recruits short reads to these clusters. Compared to alternative label propagation strategies, this hybrid clustering algorithm (hybrid-LPA) yields significantly larger read clusters without compromising cluster purity. We show that adding an extra clustering step before assembly leads to improved metagenome assemblies, predicting more complete genomes or gene clusters from a synthetic metagenome dataset and a real-world metagenome dataset, respectively. These results suggest that hybrid-LPA is a good alternative to current metagenome assembly practice by providing benefits in both scalability and accuracy on large metagenome datasets. Availability and implementation: https://bitbucket.org/zhong_wang/hybridlpa/src/master/ (contact: zhongwang@lbl.gov)
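
The two-step idea in the abstract above (cluster long reads first, then recruit short reads) can be illustrated with a toy sketch. This is not the hybrid-LPA implementation: the overlap graph, the tie-breaking rule, and the function names `label_propagation` and `recruit_short_reads` are all assumptions made for illustration, and the real method runs on Apache Spark at a much larger scale.

```python
from collections import Counter


def _majority_label(labels):
    """Most frequent label; ties go to the smallest label (an assumed rule)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def label_propagation(adj, max_iters=20):
    """Step 1 (sketch): cluster long reads on a hypothetical overlap graph.

    adj maps each long read ID to the set of long reads it overlaps.
    Every read starts with its own label and repeatedly adopts the
    majority label among its neighbours until nothing changes.
    """
    labels = {node: node for node in adj}
    for _ in range(max_iters):
        changed = False
        for node in sorted(adj):  # fixed visiting order keeps the sketch deterministic
            neighbours = [labels[m] for m in adj[node]]
            if not neighbours:
                continue
            new_label = _majority_label(neighbours)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def recruit_short_reads(short_links, long_labels):
    """Step 2 (sketch): give each short read the majority label of the
    long reads it overlaps, or None if it overlaps no labelled long read."""
    assigned = {}
    for read, longs in short_links.items():
        hits = [long_labels[l] for l in longs if l in long_labels]
        assigned[read] = _majority_label(hits) if hits else None
    return assigned


# Toy data: two groups of mutually overlapping long reads.
long_graph = {
    "L1": {"L2", "L3"}, "L2": {"L1", "L3"}, "L3": {"L1", "L2"},
    "L4": {"L5", "L6"}, "L5": {"L4", "L6"}, "L6": {"L4", "L5"},
}
long_labels = label_propagation(long_graph)
short_labels = recruit_short_reads(
    {"S1": ["L1", "L3"], "S2": ["L6"], "S3": []}, long_labels
)
```

In this sketch a short read with no long-read overlap stays unassigned (`None`); how the actual tool treats such reads is not stated in the abstract.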

Effect of Cu content on the properties of laser powder bed fused biomedical titanium alloys

Hui Liu et al., May 1, 2024

Laser powder bed fusion (LPBF) technology enables personalized customization of medical implants. However, the clinical titanium alloy is biologically inert and cannot address bacterial infections. It has been shown that copper (Cu)-bearing titanium alloys exhibit excellent antibacterial ability and potential for clinical application. However, it is still unclear how Cu content affects the comprehensive properties of LPBF-produced titanium alloy, which usually has high strength but poor ductility and corrosion resistance caused by its α' martensite. This study systematically investigates the effects of Cu content (x = 2, 5, 8 wt.%) on the microstructure, mechanical properties, corrosion resistance, antibacterial properties, and cytocompatibility of LPBF-produced Ti-xCu alloys. The results indicate that the LPBF-produced Ti-5Cu alloy possesses the best overall performance and has promising clinical application prospects as an implant material.

Mechanical Engineering

Materials Chemistry

A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

Lizhen Shi et al., Aug 6, 2019

Drawing from the analogy between natural language and "genomic sequence language", we explored the applicability of word embeddings from natural language processing (NLP) to represent DNA reads in metagenomics studies. Here, a k-mer is the equivalent of a word in NLP, and k-mers have been widely used in analyzing sequence data. However, directly replacing word embedding with k-mer embedding is problematic for two reasons. First, the number of k-mers is many times the number of words in NLP, making the model too big to be useful. Second, sequencing errors create many rare k-mers (noise), making the model hard to train. In this work, we leverage Locality Sensitive Hashing (LSH) to overcome these challenges. We then adopted the skip-gram with negative sampling model to learn k-mer embeddings. Experiments on metagenomic datasets with labels demonstrated that LSH can not only accelerate training and reduce the memory required to store the model, but also achieve higher accuracy than alternative methods. Finally, we demonstrate that the trained low-dimensional k-mer embeddings can potentially be used to cluster metagenomic reads accurately and predict their taxonomy, and that this method is robust on reads with high sequencing error rates (12-22%).
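
To make the LSH step above concrete, here is a minimal sketch of one common LSH family, random-hyperplane (SimHash-style) hashing over one-hot encoded k-mers. The abstract does not say which LSH scheme the authors used, so the encoding, the hash family, and the names `one_hot`, `make_hyperplanes`, `lsh_bucket`, and `read_to_buckets` are assumptions for illustration only. The point it shows is that k-mers are mapped to a bounded set of bucket IDs, so an embedding table needs at most 2^n_bits rows instead of up to 4^k.

```python
import random

BASES = "ACGT"


def one_hot(kmer):
    """Encode a k-mer as a 4*k vector; unknown bases (e.g. N) become all zeros."""
    vec = []
    for base in kmer:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def make_hyperplanes(k, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in the 4*k dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(4 * k)] for _ in range(n_bits)]


def lsh_bucket(kmer, planes):
    """Hash a k-mer to an integer bucket: one sign bit per hyperplane."""
    vec = one_hot(kmer)
    bucket = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket


def read_to_buckets(read, k, planes):
    """Turn a read into the sequence of bucket IDs of its overlapping k-mers.

    These IDs would play the role of 'words' for a skip-gram model.
    """
    return [lsh_bucket(read[i:i + k], planes) for i in range(len(read) - k + 1)]


k, n_bits = 5, 8
planes = make_hyperplanes(k, n_bits)
buckets = read_to_buckets("ACGTACGTAC", k, planes)
```

With this family, k-mers that differ in a single base have similar one-hot vectors and so agree on most sign bits, which is why erroneous k-mers tend to fall into the same bucket as their error-free neighbours rather than adding new vocabulary entries.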

Genetics

Artificial Intelligence

Deconvolute individual genomes from metagenome sequences through short read clustering

Kexue Li et al., Apr 29, 2019

Motivation: Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false negative (under-clustering) or false positive (over-clustering) problems. Results: Building on SpaRC, a previously developed scalable read clustering method on Apache Spark that has very low false positives, we extended its capability by adding a new method to further cluster small clusters. This method exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using a synthetic dataset from mouse gut microbiomes, we show that this method has the potential to cluster almost all of the reads from genomes with sufficient sequencing coverage. We also explored several clustering parameters that differentially affect genomes with various sequencing coverage.

Genetics

Artificial Intelligence
