ResearchHub | Open Science Community

Karen Miga

Author with expertise in RNA Sequencing Data Analysis

University of California, Santa Cruz; Northeastern University


Achievements

Cited Author

Open Access Advocate

Key Stats

Upvotes received:

Publications:

(74% Open Access)

Cited by:

1,576

h-index:

i10-index:

Reputation

Biology

< 1%

Chemistry

< 1%

Economics

< 1%


Publications

The complete sequence of a human genome

Sergey Nurk et al., Apr 1, 2022

+97

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

Euchromatin

Genome

Human Genome


Towards a Comprehensive Variation Benchmark for Challenging Medically-Relevant Autosomal Genes

Justin Wagner et al., Oct 24, 2023

+34

The repetitive nature and complexity of multiple medically important genes make them intractable to accurate analysis, despite the maturity of short-read sequencing, resulting in a gap in clinical applications of genome sequencing. The Genome in a Bottle Consortium has provided benchmark variant sets, but these excluded some medically relevant genes due to their repetitiveness or polymorphic complexity. In this study, we characterize 273 of these 395 challenging autosomal genes that have multiple implications for medical sequencing. This extended, curated benchmark reports over 17,000 SNVs, 3,600 INDELs, and 200 SVs each for GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically important genes including CBS, CRYAA, and KCNE1. Our proposed solution improves variant recall in these genes from 8% to 100%. This benchmark will significantly improve the comprehensive characterization of these medically relevant genes and guide new method development.


The structure, function, and evolution of a complete human chromosome 8

Glennis Logsdon et al., Oct 24, 2023

+26

The complete assembly of each human chromosome is essential for understanding human biology and evolution. Using complementary long-read sequencing technologies, we complete the first linear assembly of a human autosome, chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08 Mbp centromeric α-satellite array, a 644 kbp defensin copy number polymorphism important for disease risk, and an 863 kbp variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73 kbp hypomethylated region of diverse higher-order α-satellite enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. Using a dual long-read sequencing approach, we complete the assembly of the orthologous chromosome 8 centromeric regions in chimpanzee, orangutan, and macaque for the first time to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved specifically in the great ape ancestor, and the centromeric region evolved with a layered symmetry, with more ancient higher-order repeats located at the periphery adjacent to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated at least 2.2-fold, and this acceleration extends beyond the higher-order α-satellite into the flanking sequence.

Centromere

Biology

Genetics


Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

Ann Cartney et al., Oct 24, 2023

+17

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.

Nanopore Sequencing

Telomere

Genome


From telomere to telomere: the transcriptional and epigenetic state of human repeat elements

Savannah Hoyt et al., Oct 24, 2023

+23

Mobile elements and highly repetitive genomic regions are potent sources of lineage-specific genomic innovation and fingerprint individual genomes. Comprehensive analyses of large, composite or arrayed repeat elements and those found in more complex regions of the genome require a complete, linear genome assembly. Here we present the first de novo repeat discovery and annotation of a complete human reference genome, T2T-CHM13v1.0. We identified novel satellite arrays, expanded the catalog of variants and families for known repeats and mobile elements, characterized new classes of complex, composite repeats, and provided comprehensive annotations of retroelement transduction events. Utilizing PRO-seq to detect nascent transcription and nanopore sequencing to delineate CpG methylation profiles, we defined the structure of transcriptionally active retroelements in humans, including for the first time those found in centromeres. Together, these data provide expanded insight into the diversity, distribution and evolution of repetitive regions that have shaped the human genome.

Genome

Biology

Interspersed Repeat


Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA

Sasha Langley et al., May 6, 2020

Despite critical roles in chromosome segregation and disease, the repetitive structure and vast size of centromeres and their surrounding heterochromatic regions impede studies of genomic variation. We report here large-scale haplotypes (cenhaps) in humans that span the centromere-proximal regions of all metacentric chromosomes, including the arrays of highly repeated α-satellites on which centromeres form. Cenhaps reveal surprisingly deep diversity, including entire introgressed Neanderthal centromeres and equally ancient lineages among Africans. These centromere-spanning haplotypes contain variants, including large differences in α-satellite DNA content, which may influence the fidelity and bias of chromosome transmission. The discovery of cenhaps creates new opportunities to investigate their contribution to phenotypic variation, especially in meiosis and mitosis, as well as to more incisively model the unexpectedly rich evolution of these challenging genomic regions.

One Sentence Summary: Genomic polymorphism across centromeric regions of humans is organized into large-scale haplotypes with great diversity, including entire Neanderthal centromeres.


Gapless assembly of complete human and plant chromosomes using only nanopore sequencing

Zhigui Bao et al., Mar 19, 2024

+17

The combination of ultra-long Oxford Nanopore (ONT) sequencing reads with long, accurate PacBio HiFi reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely-studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the ultra-long reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and has the potential to provide a single-instrument solution for the reconstruction of complete genomes.


From complete genomes to pangenomes

Karen Miga, Sep 12, 2024

Highlighting the Distinguished Speakers Symposium on "The Future of Human Genetics and Genomics," this collection of articles is based on presentations at the ASHG 2023 Annual Meeting in Washington, DC, in celebration of all our field has accomplished in the past 75 years, since the founding of ASHG in 1948.


HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

Sergey Nurk et al., May 7, 2020

Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced PacBio HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a significant modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance towards the complete assembly of human genomes.

Availability: HiCanu is implemented within the Canu assembly framework and is available from .

Contig

Segmental Duplication

Genome
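The HiCanu abstract above reports continuity as an NG50 contig size. As a rough illustration of what that statistic measures, here is a minimal sketch of the conventional N50/NG50 definition. The function name `nx50` and the toy contig lengths are hypothetical and are not taken from the paper or from Canu.

```python
from typing import Iterable, Optional


def nx50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return N50 of the given contig lengths, or NG50 if genome_size is set.

    N50 is the length L such that contigs of length >= L cover at least half
    of the total assembled bases. NG50 uses half of an assumed genome size
    instead of half of the assembly size.
    """
    sorted_lengths = sorted(lengths, reverse=True)
    if not sorted_lengths:
        raise ValueError("at least one contig length is required")

    total = genome_size if genome_size is not None else sum(sorted_lengths)
    target = total / 2

    running = 0
    for length in sorted_lengths:
        running += length
        if running >= target:
            return length
    # The assembly covers less than half of the assumed genome size.
    return 0


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20]  # hypothetical lengths
    print("N50:", nx50(toy_contigs))
    print("NG50:", nx50(toy_contigs, genome_size=400))
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of different total length to be compared against the same target.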


Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

Kishwar Shafin et al., May 6, 2020

+27

Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
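Several abstracts on this page pair a percent identity with a quality value, for example 99.9% with QV30 and 99.999% with QV50. The sketch below shows the standard Phred-scale relationship behind those pairs, QV = -10 · log10(error rate). The function names are hypothetical and are not part of any tool named above.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy in (0, 1) to a Phred-scaled quality value."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value back to a per-base error rate."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    for accuracy in (0.999, 0.99999):
        print(f"accuracy {accuracy} -> QV {accuracy_to_qv(accuracy):.1f}")
    print(f"QV 73.9 -> error rate {qv_to_error_rate(73.9):.2e}")
```

Each increase of 10 in QV corresponds to a tenfold reduction in the per-base error rate.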
