Question: Variant with AD 14,3 is heterozygote?
2
gravatar for Sharon
15 months ago by
Sharon440
Sharon440 wrote:

Hi everyone

I would like to take a second opinion regarding a variant with DP =17. The variant is in a very high GC content area so the low coverage is reasonable.

I do hard filtering because I have few samples for VQSR, I thinks it is okay according to GATK hard filtering, the values are as follows:

   AC=1;AF=0.500;AN=2;BaseQRankSum=0.892;ClippingRankSum=0.000;DP=17;ExcessHet=3.0103;FS=10.843;
    MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.81;ReadPosRankSum=-1.534;
SOR=2.795       GT:AD:DP:GQ:PL  0/1:14,3:17:76:76,0,506

I am worried because AD is 14,3 which means only 3 reads only support the ALT? is it safe to accept it as a het variant. The variant is very interesting based on our data, so I don't wanna discard without a strong reason.

Any comments, how do you think? with many many thanks

vcf exome • 495 views
ADD COMMENTlink modified 15 months ago by Gabriel R.2.6k • written 15 months ago by Sharon440
2
gravatar for Gabriel R.
15 months ago by
Gabriel R.2.6k
Center for Geogenetik K├Şbenhavns Universitet
Gabriel R.2.6k wrote:

I cannot provide a detailed answer given the information but a few comments.

There are 2 most likely models: 1) The site is homozygous reference

Therefore, the 3 reads are potentially a) mismapped, b) weird duplicates or c) sequencing errors.

a) Did you use a mappability filter? i recommend Heng Li's method lh3lh3.users.sourceforge.net/snpable.shtml b) Did you use rmdup? c) if both a) and b) are satisfied, then they are sequencing errors, assuming a base quality of 38, the probability of 3 seq. errors occurring independently is:

(10^((-1*38)/10))^3 = 3.981072e-12

so one chance in 250 billion.

2) The site is heterozygous reference

It is possible that this is a genuine het site and you haven't sampled the other base, this can happen with probability:

dbinom(3,17,prob=0.5) = 0.005187988

so 1 chance in ~200 (assuming no mismappings or errors).

This is reflected in your PL field where the homo ref model has a score of 76, the het model a score of 0 (most likely) and 506 for the homo alt. While the hetero is more likely that the homo ref, it is not astronomically more likely. However, the homo alt is so unlikely that it can be safely discarded.

ADD COMMENTlink written 15 months ago by Gabriel R.2.6k
1

Nice answer! I wonder if the low number of ALT reads could be because ALT reads have a mismatch with the reference. This mismatch could make some reads to map equally well to multiple locations hence their MAPQ is 0 hence they are filtered out somewhere in the analysis.

Maybe useful to inspect the unfiltered bam files to look for the reads mapped at the incriminated locus.

ADD REPLYlink modified 15 months ago • written 15 months ago by dariober10k
1

Thanks Gabriel for the nice answer. I did not use a nor b, I will give them a try. Just used GATK pipeline. Thanks dariober, that's another good point I did not think of, I will check other reads . Thanks

ADD REPLYlink modified 15 months ago • written 15 months ago by Sharon440
1

good comment. my personal experience with this is if I have to choose between filtering by mapping quality or mappability, pick mappability, it gives greater noise reduction: see figure on page 46 https://media.nature.com/original/nature-assets/nature/journal/v505/n7481/extref/nature12886-s1.pdf

ADD REPLYlink written 15 months ago by Gabriel R.2.6k

Interesting figure. Never thought about this. So I check if the reads are mappable, if yes, that's a sign of less likely for errors, which means trust the call and trust the variant? Do I understand right?

ADD REPLYlink written 15 months ago by Sharon440
1

The reads are not mappable, the genomic region is (or isn't). Mapping quality is defined on a per read basis but mappability is defined per genomic locus. Both aim at reducing the probability that there is a CNV or repeat or a read so badly damaged due to sequencing errors that it could end up in multiple places or a read so short that could have come from multiple genomic loci.

If the genomic region is highly mappable and you have good mapping quality, it likely means that you can discard the "read is mismapped" possibility. now, it is still possible to have sequencing errors or bizarre strand effects due to Illumina even if you are correctly mapped.

ADD REPLYlink written 15 months ago by Gabriel R.2.6k
1

Great, got it now. Thanks so much for the clear explanation. Much appreciated Gabriel :)

ADD REPLYlink written 15 months ago by Sharon440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1017 users visited in the last hour