A new SARS-CoV-2 lineage that shares mutations with known Variants of Concern is rejected by automated sequence repository quality control

Bryan Thornlow et al., Apr 6, 2021

We report a SARS-CoV-2 lineage that shares N501Y, P681H, and other mutations with known variants of concern, such as B.1.1.7. This lineage, which we refer to as B.1.x (COG-UK sometimes references similar samples as B.1.324.1), is present in at least 20 states across the USA and in at least six countries. However, a large deletion causes the sequence to be automatically rejected from repositories, suggesting that the frequency of this new lineage is underestimated using public data. Recent dynamics based on 339 samples obtained in Santa Cruz County, CA, USA suggest that B.1.x may be increasing in frequency at a rate similar to that of B.1.1.7 in Southern California. At present the functional differences between this variant B.1.x and other circulating SARS-CoV-2 variants are unknown, and further studies on secondary attack rates, viral loads, immune evasion and/or disease severity are needed to determine if it poses a public health concern. Nonetheless, given what is known from well-studied circulating variants of concern, it seems unlikely that the lineage could pose larger concerns for human health than many already globally distributed lineages. Our work highlights a need for rapid turnaround time from sequence generation to submission and improved sequence quality control that removes submission bias. We identify promising paths toward this goal.

Framework for determining accuracy of RNA sequencing data for gene expression profiling of single samples

Holly Beale et al., Jul 30, 2019

Background: The clinical value of identifying aberrant gene expression in tumors is becoming increasingly evident. In order for multi-gene expression analysis to achieve wider adoption and eventually be developed as a Clinical Laboratory Improvement Amendments (CLIA)-approved test, the input sample requirements, sensitivity, specificity, and reference ranges must be quantified. Methods: We analyzed paired-end Illumina RNA sequencing (RNA-Seq) data from 1088 tumor samples from 29 projects. We categorized reads based on where and how well they map to the genome, as well as their PCR duplicate status. We subsampled 5 deeply sequenced samples and identified exceptionally highly expressed genes and samples with similar gene expression profiles. Results: We addressed variability in RNA-Seq dataset composition by defining reference ranges for four types of reads found in sequencing data: unmapped (0-13%); mapped duplicate (2-66%); mapped non-exonic (0-26%); and mapped, exonic, non-duplicate (MEND, 27-76%). With 20 million MEND reads, we detected over-expressed genes ("up-outlier" genes) with a median sensitivity of 96.1% and specificity of 99.8%; sample similarity had 96.6% sensitivity and 100.0% specificity. Conclusions: This strategy for measuring RNA-Seq data content and identifying thresholds could be applied to a clinical test of a single sample, specifying minimum inputs and defining the sensitivity and specificity. We estimate that a sample sequenced to the depth of 70 million total reads will typically have sufficient data for accurate gene expression analysis.
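The reference ranges in the Results could be applied to a single sample roughly as follows. This is a minimal sketch, not the authors' implementation: the function names, dictionary keys, and example read counts are hypothetical, and treating 20 million MEND reads as a pass/fail minimum is an assumption based on the depth at which the abstract reports its sensitivity and specificity. How reads are counted per category (for example, from alignment flags) is not shown.

```python
# Reference ranges, in percent of total reads, as reported in the abstract.
# The key names are illustrative and do not come from the paper.
REFERENCE_RANGES = {
    "unmapped": (0.0, 13.0),
    "mapped_duplicate": (2.0, 66.0),
    "mapped_non_exonic": (0.0, 26.0),
    "mend": (27.0, 76.0),  # mapped, exonic, non-duplicate
}

# Assumption: use the 20 million MEND reads quoted in the abstract as a minimum.
MIN_MEND_READS = 20_000_000


def read_type_percentages(counts):
    """Convert per-category read counts into percentages of total reads."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {name: 100.0 * n / total for name, n in counts.items()}


def check_sample(counts):
    """Report whether each read type falls inside its reference range."""
    pct = read_type_percentages(counts)
    report = {
        name: low <= pct[name] <= high
        for name, (low, high) in REFERENCE_RANGES.items()
    }
    report["enough_mend_reads"] = counts["mend"] >= MIN_MEND_READS
    return report


# Hypothetical sample with 70 million total reads (made-up counts).
example_counts = {
    "unmapped": 3_500_000,
    "mapped_duplicate": 28_000_000,
    "mapped_non_exonic": 10_500_000,
    "mend": 28_000_000,
}

if __name__ == "__main__":
    for name, ok in check_sample(example_counts).items():
        print(f"{name}: {'ok' if ok else 'outside reference'}")
```

A sample that fails one of these checks would not necessarily be unusable. The sketch only flags where its composition departs from the ranges observed across the 1088 tumor samples.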