How can there be numerous high-quality heterozygous Y chromosome alleles outside the pseudoautosomal regions across chrY in WGS data?
10 months ago
Charles R. ▴ 10

Sorry if this seems ignorant, but that is why one asks questions: to learn. While investigating a WGS sample in IGV, I see numerous heterozygous Y alleles across the full Y chromosome. How can this occur in general? How common is it? At what frequency of such variants within a chromosome does it stop being common? Could it be contamination? DNA damage or mutation? A possible intersex issue? So the basic question is:

How can numerous high-quality heterozygous Y chromosome alleles exist outside the pseudoautosomal regions across chrY in WGS data?

An example Allele quality of a SNP is over 400. The Allele Freq. 0.5, Depth 30, read strands even balanced, Genotype quality 99. What other information would be useful? All seems good. The overall quality of the WGS was very high. Many of these variants made it into the VCF file. Also, looking at the CRAM data file, these variants look very solid, at least from a beginner's point of view.

Another general question: if these were related to intersex issues, how could there be a variant Y, i.e. how could someone ever have a heterozygous Y at all? The parents should have only one Y between them to pass on, correct? Where would a variant Y come from?

Intersex XYY individuals should still have homozygous Y chromosomes, correct? There should be two copies of the same Y chromosome, correct?

WGS, Y, YY, XYY, sex, heterozygous, alleles, non-pseudoautosomal, intersex, mutation, contamination, basics

heterozygous-y • 1.5k views

doi: 10.1093/gigascience/giz074

I am including a journal article that I believe is related to the question, along with a tool that can be used to evaluate the issue. The tool is called XYalign; it is a command-line tool available through Bioconda. The journal the article comes from has an impact factor of around 7.5, which is supposed to be very high, so the information should be more reliable than in most publications. Based on the paper's Fig. 4E, most Y chromosomes are somewhat heterozygous within WGS datasets. Fig. 4E shows an example distribution of heterozygous to homozygous reads within a sampled WGS. This directly implies that Y chromosomes are to some extent heterozygous, which would (perhaps) explain what I have seen in the data. If so, this should be expected within the general population.

Perhaps a formal geneticist can help confirm this notion.

Furthermore, the XYalign tool can be used to create plots similar to Fig. 4 for any WGS dataset. I believe that the plots in Fig. 4C and 4D can be used to infer intersex mosaicism. The relative allele peak around the 50% read line might imply the mosaic percentage of the WGS. If geneticists would like to discuss this more, that would be great! I think this kind of test can help validate and answer WGS dataset questions in the future. Furthermore, the relative height of the peak within Fig. 4E might also imply loss-of-Y-chromosome percentages. This is a hot genetics topic today. Perhaps there are some papers to be considered around these ideas. Are any geneticists interested in exploring these ideas further?


Charles R., I make the most important points in my post below, but I would be careful before completely trusting a tool like the one you have described.

This is because you are totally correct when you say this is a hot area in genetics. The thing is, it is SO hot that understanding of these regions has changed dramatically between the date of publication of the XYalign tool and today.

When I say "dramatically" I am not really exaggerating: in 2018-19 our human reference genome had about 30 Mbp of resolved sequence for ChrY, but we now know that the chromosome is >60 Mbp in length.

It's therefore important to say that, because our understanding was based on a reference genome that was in and of itself both inaccurate and incomplete, tools attempting to work with the regions of genome responsible for this inaccuracy benefit greatly from the newer information.

I am not sure I would trust data generated from NGS (short read, 2nd generation) alone in these very complex areas.

To explain this, I made a much longer post below.

10 months ago
LauferVA 4.2k

Charles R. ,

Thanks for your post, which relates to recent, key findings in genomics. First, to be thorough, I have to acknowledge the possibility that a variety of technical issues could contribute to findings like those you have described, e.g. an early-cycle PCR amplification error could give you reads of which half carry a C and the other half an A. However, for the purposes of this post, I will assume that quality metrics like those you provide are accurate, and that the read-level stats (e.g. quality score) in fact perfectly reflect the primary sequence of the gDNA.

The problem is that, even if those statistics are entirely accurate, the data may still prove very hard to interpret correctly due to specific sequence characteristics of ChrY (as well as other chromosomes). To explain this, we will need a bit of background:

Both the X and the Y chromosomes have unique sequence properties that have complicated their analysis, and thus also our understanding of the variation in them, until very recently. The T2T project concluded with the addition of chrY to T2T-CHM13 v1.1, which thereby became T2T-CHM13 v2.0, the most complete assembly available until T2T merged with the Human Pangenome Reference Consortium (HPRC). At last count they have several hundred gapless, phased human haplotypes, and as such are well on their way to the first human pangenome reference.


ChrY in particular displays types of genomic fluidity that partly overlap with, and partly differ from, those of the autosomes, as the forces imposed by homologous recombination do not act on autosomes and sex chromosomes in precisely the same way. In fact, ChrY has several distinct types of variation that are very hard to deal with using short-read (next-gen) sequencing data. Thus, while GRCh38 was >90% complete in other chromosomes, about half of the sequence for ChrY was missing from even the most recent pre-T2T references, e.g. GRCh38.p13 (!!).

Specifically, complex genomic structures including satellite regions, long palindromic sequences, tandem repeats, and segmental duplications remained opaque to NGS for the entire ~20 year period between the completion of the HGP and T2T-CHM13 v2.0. To understand why NGS fails in these regions, let's take just one example: segmental duplications, or SDs. SDs are repeat regions of 10-200 kb that bear very high homology to one another (95-99.9%), and arise due to specific genomic forces, e.g. the presence of inversion events (Porubsky et al. 2022).

This ultra-high homology alone is a formidable challenge for short-read sequencing technologies. However, additional phenomena, such as gene conversion events, further influence polymorphism at these loci and have greatly complicated their analysis. While with the advent of 3rd generation sequencing techniques (SMRT, nanopore) we now have the ability to obtain phased, gapless assemblies of these regions, the practice of nearly all genomic science lags behind this. What I mean is, insofar as the vast majority of sequencing data are still generated using NGS (short-read, 2nd generation) sequencing, most datasets are still unable to resolve these regions.

We are now at last in a position to address your question.

There are two broad classes of structural variants: copy-number-neutral SVs and copy-number-altering SVs. Segmental duplications are the latter type: they increase copy number at a locus. So, while a given human individual may have only one chrY, this in no way implies that they have only one copy of a given sequence. Rather, consider these two statements:

1) SDs increase copy number at a locus/loci.
2) Polymorphisms account for only 0.1-5% of their sequence; the copies are otherwise identical.

These two conditions, together, strongly suggest that NGS-based sequencing of two homologous SDs will produce reads that are identical except for one or a few NTs. In this case, even in high-quality sequencing data, this could potentially result in 2 different base calls at a locus that, in the reference being used, only has a copy number of 1.
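The scenario above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not a model of any real locus: the two copies, their bases (C and A), the per-copy depth of 15, and the ±3 read jitter are all hypothetical values chosen for the example.

```python
import random

def simulate_collapsed_sd(depth_per_copy=15, seed=0):
    """Pile up reads from two near-identical SD copies onto one reference position."""
    rng = random.Random(seed)
    # Hypothetical paralog-specific difference: copy 1 carries C, copy 2 carries A.
    copy_bases = {"copy1": "C", "copy2": "A"}
    pileup = []
    for base in copy_bases.values():
        # Each physical copy contributes about one copy's worth of reads;
        # a uniform jitter of +/-3 reads stands in for sampling noise.
        n_reads = depth_per_copy + rng.randint(-3, 3)
        pileup.extend([base] * n_reads)
    return pileup

pileup = simulate_collapsed_sd()
depth = len(pileup)
alt_fraction = pileup.count("A") / depth
print(f"depth={depth}, alt allele fraction={alt_fraction:.2f}")
```

Under these assumptions the pile-up looks like a clean heterozygous call (allele fraction near 0.5) even though each read is individually correct, and the total depth is roughly twice what a single copy would give.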

Actually, though, we now know that having just two copies of an SD is not all that surprising. Recent evidence spanning several fields suggests that SDs evolved recently in the speciation of humans from great apes in order to fuel the evolutionary thirst for greater and greater intelligence (as well as other reasons). In this case, SDs have increased copy number of certain genes/gene families from ~2 to ~30 in just the last 200,000 years (200 kya). In loci like these, a person might not only have two alleles; she or he might have MANY. This could show up in a .vcf file in a variety of ways, but detailed studies of the alignment of such reads show that prior assembly structure was a contributor to inaccurate read mapping in cases where an SD or other SV had not yet been ascertained.
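To see why multi-copy loci can show up "in a variety of ways", here is a small arithmetic sketch. It assumes, hypothetically, that all copies collapse onto one reference position and that each copy contributes equal read depth; the copy numbers used are illustrative only.

```python
from fractions import Fraction

def expected_alt_fraction(copies_with_variant, total_copies):
    """Expected fraction of reads showing a variant carried by some of the collapsed copies."""
    if total_copies < 1 or not 0 <= copies_with_variant <= total_copies:
        raise ValueError("need total_copies >= 1 and 0 <= copies_with_variant <= total_copies")
    # Assumption: every copy contributes the same number of reads to the pile-up.
    return Fraction(copies_with_variant, total_copies)

# Hypothetical copy numbers, chosen only to span the ~2 to ~30 range mentioned above.
for total_copies in (2, 4, 30):
    one_copy = expected_alt_fraction(1, total_copies)
    half = expected_alt_fraction(total_copies // 2, total_copies)
    print(f"{total_copies} copies: variant in 1 copy -> {float(one_copy):.3f}, "
          f"in half the copies -> {float(half):.3f}")
```

With two collapsed copies a copy-specific base sits at 0.5, but with more copies the same kind of difference can appear at much lower fractions, where a caller might report it as a low-frequency variant or filter it out.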

The question to consider is, "but if that were true, how would it show up in my data?" Simply put, the case you mention is a possible example. In other words, it is possible that your data contain reads derived from two different SDs that map to the same reference position, even with these characteristics:

Allele quality of a SNP is over 400. The Allele Freq. 0.5, Depth 30, read strands even balanced, Genotype quality 99.

Just one last thing to mention. The reason why these results may be counterintuitive is because we are used to thinking of ourselves as diploid, and because we are not accustomed to thinking about what variant calling accuracy metrics mean in the context of a reference genome that is itself flawed. Metrics like quality score, depth etc. provide information concerning the level of accuracy of the nucleotides on a read and the number of reads generated. They aren't, however, useful for predicting whether the reference genome being used is itself accurate in such an area.

To use different words, what I mean is that having an accurate variant call in such a locus tells you nothing about whether that read maps to one SD, another SD, etc. Thus, you have to be really careful about what you conclude from reads in these regions, irrespective of their quality scores. Why? Because when such reads map to structurally heteromorphic loci (like SDs) their quality scores lose meaning insofar as they neglect higher-level tasks like genotype phasing, haplotype resolution, or read mapping, which ultimately influence most applications clinicians and scientists have for such information...


Thank you for your honest assessment. I believe you have added much value to this discussion. As a layperson, I was hoping for a good, meaty answer.

To a layperson, your discussion is quite disturbing. With the information you provide, one could ask a more direct question: can we create a useful WGS at all? This is disturbing because the medical industry is trying to use WGS data while its basic structures and patterns are still being developed. With that said, I would argue that we are learning a lot via this discussion. So let's explore a few more basic questions at play, beyond the Y issues raised above. This might teach us more about the basics quickly and give us a better understanding of the current state of WGS than the advertised sales pitch of the genetics industry.

For this discussion, let's assume for the moment that one can map a useful WGS. If one looks statistically across a WGS, what should the read distributions be? A reference in this case is meaningless. We are interested in the distribution of reads, which really reflects the base quality of the reading process, given that we are reading either a single-chromosome pattern or a dual-chromosome pattern. In this case, one is only really interested in homogeneous reads versus non-homogeneous reads. I would argue that Fig. 4A and 4B would be a reasonable expectation for a diploid system. Moreover, one could argue that Fig. 4C would be more expected for a haploid system with a high-quality reading process.

Perhaps your comments are well taken for Fig. 4E, and one can't know what to expect yet because Y is different from both diploid and haploid. Is that a fair oversimplification of your discussion? I base this oversimplification on an understanding that your general argument suggests there are far greater levels of mutation and other processes changing the Y chromosome over time. So numerous variants can exist all at once for many reasons.

Nevertheless, using any sampling of chromosomes within WGS data, should there exist a typical pattern of reads? Perhaps one can't discuss what to do with Y yet because it doesn't fit into this simple two-bin system.

But within a diploid or haploid assumption, do Figs. 4C vs 4D (4A/B might be better) form a reasonable pattern for differentiating diploid from haploid systems, regardless of a reference? This argument is not based on a reference but merely on statistics. Mainly, you are sampling either one chromosome or two along the sequence, so the reads should either all be the same or split into two results at each location. Moreover, does it seem reasonable for a mosaic system to be some blend in between Fig. 4C and 4D, irrespective of the "true reference sequence"? So perhaps the question of X versus XX can still be rightfully examined irrespective of the reference sequence, simply by considering the raw differences in each chromosome's read distribution. (I believe we also need to assume one common X within the two systems.)

Given these thoughts, what changes should be expected within Figs. 4 A,B,D? How would one expect the .5 hump to be transformed by a mosaic set of chromosomes made up of mixed diploid and haploid cells? Consider for argument 90% to 10% respectively. How should the .5 hump shift in probability and change its height based on the percentage of mosaicism? Moreover, what would one expect a mosaic system's read distribution pattern to look like? I think those are useful thought questions.
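One rough way to make this thought experiment concrete is the arithmetic sketch below. It rests on strong simplifying assumptions that are not taken from the paper: every cell contributes equal DNA, all cells share one common chromosome copy, only the diploid cells carry the second copy, and there is no mapping or sequencing bias. The function name is hypothetical.

```python
def second_copy_allele_fraction(diploid_cell_fraction):
    """Expected read fraction of an allele found only on the second chromosome copy."""
    f = diploid_cell_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("diploid_cell_fraction must be between 0 and 1")
    # Every cell carries the shared copy (dose 1); only diploid cells carry
    # the second copy (dose f), so its share of the reads is f / (1 + f).
    return f / (1.0 + f)

for f in (1.0, 0.9, 0.5, 0.1, 0.0):
    frac = second_copy_allele_fraction(f)
    print(f"diploid cells {f:.0%}: second-copy allele at {frac:.3f}, "
          f"shared-copy allele at {1 - frac:.3f}")
```

Under these assumptions, a 90%/10% diploid/haploid mix would move the hump from 0.5 to about 0.47 for alleles on the second copy (and about 0.53 for alleles on the shared copy), so the single peak would split into two nearby peaks that drift further apart as the haploid share grows.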

From a lay perspective, I think these are relevant questions for deepening one's understanding of WGS datasets at their most basic level, given current technology and science. Also, there needs to be some type of test that flags questionable WGS results when basic assumptions are violated, for example when mosaicism is present within a WGS.


Charles R., hi again. I am not sure I follow the logic in this post closely.

In my post above, I was using segmental duplications as an example. While NGS does not do well in these regions, Sanger Sequencing (1st generation sequencing) and NGS (2nd generation sequencing) have been incredibly useful, and they validly characterize most of the genome (~90%). That's not nothing.

You asked about Chromosome Y and problems associated with it, and it just happens that certain stretches of it are problematic for NGS. But I would not go from that to a statement like,

"This is disturbing because the medical industry is trying to use WGS data while its basic structures and patterns are still being developed"

NGS is at this point a well oiled machine. It's just that it cannot be used in certain areas without a lot of careful thinking and some potential for error. But, in reality that's no different than any tool - scientific or otherwise. Mitre saws cannot hammer nails, and staplers cannot pour water.
