Question

What are the advantages of using the T2T as a reference vs GRCh38 today?

7

12 months ago

onter ▴ 170

What are the advantages in using the T2T data as the reference genome right now? Is it able to provide extra (useful) variant calling information that GRCh38 is not able to?

reference GRCh38 t2t calling variant • 4.4k views

score 19 · Answer 1 · 2023-04-16

Onter,

The advantages of (T2T) CHM13v2 over (GRC) GRCh38.p14 are myriad and extend into over a dozen fields. The second question you ask is a minor subset of that first, huge question. To date, even dedicated peer-reviewed manuscripts on this subject (as well as reviews, perspectives and editorials) have struggled to present all the ways in which science, medicine, etc. can benefit from CHM13. I'll outline a brief answer here and point you to key resources that can help you deepen your knowledge of the area.

N.B. In this answer, I will use the terms 2nd generation sequencing, short-read sequencing and NGS interchangeably; please note that it is this technology that is largely the basis for the most recent GAIB builds, e.g. GRCh38.p14. By contrast, it is understood herein that long-read sequencing, 3rd generation sequencing, and CHM13 "go together". The advantages of CHM13v2 over GRCh38.p14 are largely predicated on the differences in the technical abilities of these technologies, though they are by no means limited to them.

First, it is helpful to begin with Eichler, Surti, and Ophoff's 2002 letter to the NHGRI, in which they propose the use of a complete hydatidiform mole (CHM) as a genomic resource. This proposal, written in the wake of the first-generation sequencing (Sanger) based Human Genome Project, outlines several key problems that first and second generation sequencing are ultimately unable to address, and that indeed remained problematic throughout the era in which NGS rose to hegemony (for argument's sake, let's say 2002-2021). The 3 arguments below follow the logic of that letter closely:

1) Gap closure. Closing gaps in certain regions of the genome, in particular low-complexity/highly repetitive regions beyond a certain length, was a key reason for the proposal of CHM. When such a region (for instance, an alpha satellite repeat) exceeds the length of the longest read in an NGS dataset, one loses the ability to span that region, leading to a gap in the reference. In this specific context the gap hides an unknown number of repeats, along with any structural polymorphisms. Gaps in the reference are "bad" for many other reasons as well, but enumerating all of them is prohibitive here. GRCh38.p14, the last in the series to date, still contains several hundred gaps.

2) Gene-rich segments that contribute to human disease. By the time Eichler et al. wrote the above letter, it was already clear that genes relevant to human disease phenotypes (e.g. autism) lay in segmental duplications (SDs) that are (a) gene rich, (b) polymorphic, and (c) highly copy number variable. As it turns out, SDs also have very high homology and are thus susceptible to additional genomic processes, e.g. gene conversion events. They also tend to be long (a single SD might be 150 kb, far longer than the longest read in an NGS dataset). As such, short-read data alone are essentially unable to provide certain kinds of insight into these loci. If these regions were unimportant, perhaps this would not be a big issue. But we now know (see, for instance, Porubsky et al. 2022) that tandem blocks of SD repeats display several characteristics that provide fundamental insights into multiple distinct fields in human biology. For instance, these loci are rapidly evolving (having achieved remarkable copy number gain within only the last 200,000 years), tend to appear on the edges of large inversion variants, contain genes thought to be involved in the speciation of humans from great apes, contribute to certain human phenotypes (like the aforementioned neurocognitive ones), and so on. To complete the argument: these regions cannot be adequately studied using GRCh38.p14.

3) Chromosomal structural polymorphisms. We now know for the first time the degree of structural variation (SV) that previous builds, such as short-read sequencing based GAIB resources, miss. Inversion polymorphisms have to date been poorly ascertained: more than half have been missed. To return to SDs, they too have until now been poorly understood. The study of SV in regions of heterochromatin, like centromeres, telomeric and subtelomeric regions, and the short arms of acrocentric chromosomes (which contain human rDNA), is essentially in its infancy at present, because NGS, and therefore prior GRC assemblies, are almost totally blind to SV in regions of heterochromatin. One dramatic example pertains to rDNA: CHM13v2 contains about 781% more rDNA sequence than GRCh38.p14. Therefore, biologists whose studies of health, disease, genomic fluidity, human evolution, etc. are impacted by ribosomal kinetics, for instance, have a rationale for using CHM13v2.
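To make argument 1) a little more concrete: in a FASTA reference, unresolved gaps are conventionally written as runs of N, so they can be located with a simple scan. The following is only a minimal sketch on a made-up toy sequence. The function name `find_gaps` and the toy contig are hypothetical, and for real work a dedicated toolkit is the better choice. Note also that the length of an N run in a reference is often a placeholder rather than a measurement, which is exactly the "unknown number of repeats" problem.

```python
import re


def find_gaps(sequence, min_len=1):
    """Return half-open (start, end) coordinates of each run of N/n
    that is at least min_len bases long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Toy contig (hypothetical, not real genomic sequence) with two gaps.
toy_contig = "ACGT" + "N" * 5 + "GGCATT" + "N" * 3 + "TTA"

gaps = find_gaps(toy_contig)
gap_bases = sum(end - start for start, end in gaps)

print("gaps:", gaps)
print("gap bases:", gap_bases)
```

Running the same scan over each chromosome of two assemblies is one simple way to compare how much of each is still unresolved.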

OK, now let's shift gears and start considering implications.

De novo assembly with reference to an available pangenome will improve genomic alignment and variant calling. What does it mean to align genomic material to a reference genome that has gaps, inaccurate SV calls (both false positives and false negatives), etc.? What problems can arise? Well, briefly, consider a read from a human genome that sequences an SV not found in the reference. Where would this read map to? Evidently, nowhere. Thus, if one wanted to use that read as part of a large contiguous sequence, one would need to assemble it de novo. This is the approach that will be used in the Human Pangenome Reference Consortium (HPRC), and there is an emerging consensus that this leads on average to increased sensitivity (recall) without loss of specificity (to simplify, it is slightly better without a downside). This is just one of many examples, but flowing from 3), above, the idea is that because human genomes are structurally polymorphic, no single reference genome can enable perfect alignment of all sequences. To hammer home this idea, consider that even short-read based sequencing studies identified about 700,000 bases of unique sequence per genome. Because this is based on short-read NGS data, it is an underestimate; specifically, SVs in highly repetitive regions are most likely to be missed. It is these regions, which in fact contribute the most to differences between any two people, that are most poorly aligned and represented if alignment to a linear reference genome is used.
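The "where would this read map to?" point can be illustrated with a toy k-mer comparison. This is a deliberately simplified sketch: real aligners use far more sophisticated seeding and scoring, and the sequences, the inserted segment, and the helper names `kmers` and `shared_kmer_fraction` are all made up for illustration. A read drawn entirely from an insertion that is absent from the reference shares no k-mers with it, so there is nothing to seed an alignment on.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(read, reference, k=5):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & kmers(reference, k)) / len(read_kmers)


# Toy linear reference (hypothetical sequence).
reference = "ACGTACGGTCAGCTTAGGCA"

# Toy sample genome: same as the reference, plus a 15-base insertion
# after position 10 that the reference does not contain.
insertion = "TTTTTCCCCCGGGGG"
sample = reference[:10] + insertion + reference[10:]

read_in_shared_sequence = sample[:10]   # lies entirely in sequence the reference has
read_in_insertion = sample[10:25]       # lies entirely inside the insertion

print(shared_kmer_fraction(read_in_shared_sequence, reference))
print(shared_kmer_fraction(read_in_insertion, reference))
```

A read that straddles the insertion breakpoint would fall in between, which is roughly why such reads tend to end up clipped or mismapped against a linear reference.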

Use of gapless assemblies for studies of epistasis and haplotype effects. The T2T consortium has now officially been absorbed into the HPRC. The goal of the HPRC is the creation of a human pangenome, to transcend and supersede the singular reference genome assembled by consortia such as the GRC. A big part of this will be the ability to produce phased, gapless, diploid human genomes on a routine basis. The ability to provide chromosome-length phase information currently relies on short-read sequencing data (Strand-seq; Falconer 2012) and long-read data together, an important present limitation to future goals. As an example of a locus at which phase information could prove extremely important, consider the HLA locus. Here, the importance of haplotype effects is already established, and will likely remain important in transplant biology, the study of immune syndromes, etc. Despite this, the high levels of homology, SV, and CNV in these extensive haplotypes have proved problematic for short-read based genomic references.

Finally, to touch on the variant calling part of your post, consider also increased imputation accuracy of SV based on SNV data, which would lend new life to existing GWAS, and perhaps assist efforts to fine-map causal variation for all diseases studied using DNA microarray and NGS data to date. I don't have a good citation for this yet.

There is so much this answer leaves out. I did not even mention genome graphs, and how they are advancing human genomics and powering third generation assembly algorithms. So, to close, please consider a few other treatments of this topic, e.g. Aganezov et al., or Paten et al.

Hope this helps.

VL

score 8 · Answer 2 · 2023-04-16

8

12 months ago

Gordon Smyth ★ 7.0k

As one small example to supplement Vincent Laufer's excellent answer, we have found in our work that we can correctly identify the isotype of human plasma cells when using T2T-CHM13 as the reference for 10x Genomics Chromium profiling but not when using GRCh38.p14. GRCh38.p14 contains a few errors that result in some sequence reads being assigned to the wrong Ig gene. Using GRCh38, many cells appear to express two or more isotypes whereas with T2T-CHM13 every cell expresses just one isotype.
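As a rough illustration of the kind of check involved (not the method used in our analysis), one could tally how many heavy-chain isotype genes each cell appears to express from a per-cell read count table, once per reference. Everything below is a hypothetical sketch: the counts are toy values, and the function names and the `min_reads` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def isotypes_per_cell(counts_by_cell, min_reads=2):
    """Map each cell to the sorted list of isotype genes that have
    at least min_reads assigned reads."""
    return {
        cell: sorted(gene for gene, n in gene_counts.items() if n >= min_reads)
        for cell, gene_counts in counts_by_cell.items()
    }


def summarize(calls):
    """Count how many cells express 0, 1, 2, ... isotype genes."""
    return Counter(len(genes) for genes in calls.values())


# Toy counts (hypothetical values): reads assigned to three Ig heavy-chain
# constant genes in three cells.
toy_counts = {
    "cell_1": {"IGHG1": 40, "IGHA1": 0, "IGHM": 1},
    "cell_2": {"IGHG1": 25, "IGHA1": 18, "IGHM": 0},
    "cell_3": {"IGHG1": 0, "IGHA1": 0, "IGHM": 33},
}

calls = isotypes_per_cell(toy_counts)
print(calls)
print(summarize(calls))
```

Comparing the resulting summaries for the same cells quantified against each reference would show whether apparent multi-isotype cells (like `cell_2` in the toy data) disappear with the other reference.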

0

Gordon Smyth - fascinating. sounds like 2 papers are in the works!!

ADD REPLY • link 12 months ago by LauferVA 4.2k

3

We've posted a preprint of our result now on bioRxiv:

Nie J, Tellier J, Tarasova I, Nutt SL, Smyth GK (2023). The T2T-CHM13 reference genome has more accurate sequences for immunoglobulin genes than GRCh38. bioRxiv https://doi.org/10.1101/2023.05.24.542206 (Posted 25 May 2023).