Question: Is 1 deleted T in a T stretch always detected, assuming enough coverage of course?
gravatar for Marvin
9 months ago by
Marvin160 wrote:

Let's assume the following reference sequence:


The individual has a deletion of 1 T in that stretch of 20 T's. Could be the left-most, the middle, the right-most. We don't know.

Let's assume I have sequenced 100 reads which cover that region. So the mapper has to make a choice for every single one of the 100 reads: which T is deleted? That means that in an unfortunate distribution I might never see the deleted T?

Let's assume the mapper chooses the first T to be deleted for reads 1 to 5. For the other 95 reads it chooses some other position. That means later during the variant valling, the caller will assume "Nothing special here. I see 95 reads which say there's a T and I see 5 reads which say there's nothing. Probably just a T like the reference".

Next position, the second T: Let's assume for read 6 to 10, the mapper has decided that the deleted T is the second T. For the other 95 reads it chooses some other position. Again, the caller will say "Alright I got 95 times a T and 5 times nothing. Probably just a T"

In reads 11 to 15, the 3rd T is deleted. In reads 16 to 20 the 4th T is deleted ... [...] ... and in reads 95 to 100 the 20th T is deleted. So if the mapper chooses the deleted T for every read and the distribution is unfortunate, I will never call that deleted T. If the mapper had instead chosen the first T to be the deleted T for all of the 100 reads, I would have seen the deletion. How is this problem avoided?

I know that a mapper is just an algorithm, so it will probably never come to the situation described above (unless the programmer decided "multiple options for the position of the T deletion? hey, just pick one randomly" -> but for the reason described above that would be a mistake right?). Or is the opposite eventually true: is it 100% certain that a mapper will choose the same T to be the deleted T for ALL (or just some? 90%? 80%? what's the case?) of the 100 reads? That would mean, I would never miss the deletion and later I can left-align, right-align or do whatever with the variant.

mapper caller stretch • 231 views
ADD COMMENTlink modified 9 months ago by Kevin Blighe65k • written 9 months ago by Marvin160

left align indels

ADD REPLYlink written 9 months ago by rbagnall1.7k

Thank you but that is not the problem that I have described. I know the text is a little bit long but I'm talking about unfortunate mapping which might lead to not even seeing a variant. The link instead describes what to do after seeing the same variant in different callers.

ADD REPLYlink written 9 months ago by Marvin160

Homopolymer stretches are common sources of false indel and variant calls. I guess you simply do not have a variant and all you see are artifacts. Plase search around for advice on how to deal with longer homopolymer stretches such as this one. Polymerases seem to have trouble dealing with them so both short read NGS and Sanger sequencing is not reliable there.

ADD REPLYlink written 9 months ago by ATpoint38k

"I guess you simply do not have a variant and all you see are artifacts" -> No, in my made up example the biological truth is that there is 1 T deleted. And all reads are perfetly fine without artifacts. The question is, what does the mapper do with regards to asking the question "which T do I mark as deleted" 100 times. My assumption is that it would choose the same position 100 times but I'm not 100% certain about that.

ADD REPLYlink written 9 months ago by Marvin160

Sorry.., you are right, that was for vcf files, but the principal is the same for BAM files.

You will never which T was deleted. The convention is to left align indels at these positions in each read, and this can be done on BAM files using, for example, the GATK LeftAlignIndels tool.

For some genotyping tools, like GATK Haplotype caller, it is not necessary to post process the BAM file in this way as Haplotype Caller will do this on-th-fly

ADD REPLYlink modified 9 months ago • written 9 months ago by rbagnall1.7k

I totally forgot about the step "indel realignment" in GATK best practices. I think that is exactly the step which avoids the problem I've described above.

Even in a scenario where the programmer had opted for choosing a random T for each read, the probability for that worst-case distribution which I've described above would have been extremely low (1 over 20^100). However in this scenario there would be multiple distributions in which the caller says "the del's are noise, this is a T" at every position (so the overall chances of the swallowed-T-case occuring would be higher than 1 over 20^100).

So basically I think there are 2 factors which come into play: the mapper is an algorithm which always does the same thing for each read, therefore it leads to at least somewhat similar results for each read at that indel (unless of course the programmer chose random positions on purpose). Therefore there is no uniform distribution across the space of possibilities. And then after the "somewhat similar" results are produced, the indel realignment takes care of the rest (second factor).

ADD REPLYlink written 9 months ago by Marvin160
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

These cases are obviously going to be handled differently based on the combination of the aligner and the variant caller that is being used. One would have to delve into the 'inner workings' of the internal algorithms. I don't believe anybody has time to do that, though.

From my own experience, variant calls in 'short' homopolymers (e.g. <15bp) can be consistent IF there is good read coverage that spans the entire homopolymer. For much larger homopolymers and regions of repeat sequence, there really is no hope for short-read NGS technologies.

If you wanted to mask for —or isolate— these problematic regions, then, ideally, you would have a BED file of known homopolymers and utilise that somewhere in your variant calling pipeline. I'm beginning to think that no single-pass pipeline can call all variants with high sensitivity / specificity, and that separate runs of the pipeline will have to be run for special cases like homopolymers, regions of repeat sequence, regions that exhibit sequence homology, etc., with different parameter configurations for each. Take a look here for ideas how to identify homopolymers in your reference genome: Code golf: detecting homopolymers of length N in the (human) genome


ADD COMMENTlink written 9 months ago by Kevin Blighe65k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 825 users visited in the last hour