Question: Calling Of Multi-Allelic SNPs Using GATK
5.9 years ago
Bioscientist wrote:

I have four samples in a trio (they are actually all patients). I used GATK UnifiedGenotyper to call SNPs/indels independently for each of them, and also to call SNPs jointly with all four samples together.

When I checked how the program deals with multi-allelic SNPs, I found something interesting:


2    92306130    rs111843696    C    G    151.67    PASS    AC=2;AF=1.00;AN=2;DB;DP=28;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.8321;MQ=10.38;MQ0=18;QD=5.42;SB=-73.08    GT:AD:DP:GQ:PL    1/1:0,7:28:20.87:152,21,0


2    92306130    rs111843696    C    G    54.73    PASS    AC=2;AF=1.00;AN=2;DB;DP=18;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.9889;MQ=11.71;MQ0=8;QD=3.04;SB=-3.27    GT:AD:DP:GQ:PL    1/1:0,7:18:11.92:87,12,0


2    92306130    rs111843696    C    G    54.73    PASS    AC=2;AF=1.00;AN=2;DB;DP=25;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=2.8210;MQ=9.83;MQ0=15;QD=2.19    GT:AD:DP:GQ:PL    1/1:0,8:25:11.92:87,12,0


2    92306130    rs111843696    C    G    203.19    PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=-0.347;DB;DP=34;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=5.4511;MQ=11.99;MQ0=14;MQRankSum=1.042;QD=5.98;ReadPosRankSum=0.347;SB=-71.78    GT:AD:DP:GQ:PL    1/1:1,16:34:26.85:203,27,0

So at this locus all 4 samples share the same C>G mutation when called independently. However, in the VCF from the combined calling:

2    92306130    rs111843696    C    A,G    529.37    PASS    AC=2,6;AF=0.25,0.75;AN=8;BaseQRankSum=-1.116;DB;DP=105;Dels=0.00;FS=0.000;HaplotypeScore=1.3628;MQ=11.04;MQ0=55;MQRankSum=1.193;QD=5.04;ReadPosRankSum=0.423;SB=-137.50    GT:AD:DP:GQ:PL    1/2:0,9,7:28:24.43:176,131,125,45,0,24    2/2:0,3,7:18:11.92:87,87,87,12,12,0    2/2:0,8,8:25:11.92:87,87,87,12,12,0    1/2:1,10,16:34:47.87:251,176,164,75,0,48

Now the site is called as multi-allelic. I guess this is because, when the samples are called independently, the read depth of the ALT allele "A" is quite low, and when they are combined the read depth may exceed some threshold so that A is called? Thanks.

The easiest way to check would be to adjust the threshold and see whether that is the case. Alternatively, a simple pileup will show you what that position looks like with all the reads stacked on it.
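Short of a pileup, the AD fields in the combined record already give a rough per-sample view of how many reads support each allele. The following is a minimal Python sketch that tabulates them. It assumes the first sample's PL reads 176,131,125,45,0,24 and the second sample's genotype is 2/2 (which is what AC=2,6 implies); the INFO column is replaced by "." for brevity, and the function name `per_sample_depths` is just an illustrative choice.

```python
# Combined-call record from the question, one whitespace-separated column each.
# INFO is replaced by "." here; only the genotype columns are needed.
record = (
    "2 92306130 rs111843696 C A,G 529.37 PASS . GT:AD:DP:GQ:PL "
    "1/2:0,9,7:28:24.43:176,131,125,45,0,24 "
    "2/2:0,3,7:18:11.92:87,87,87,12,12,0 "
    "2/2:0,8,8:25:11.92:87,87,87,12,12,0 "
    "1/2:1,10,16:34:47.87:251,176,164,75,0,48"
)


def per_sample_depths(line):
    """Return GT, per-allele AD and DP for every sample column of one VCF line."""
    cols = line.split()
    alleles = [cols[3]] + cols[4].split(",")  # REF first, then each ALT
    keys = cols[8].split(":")                  # FORMAT keys, e.g. GT, AD, DP
    rows = []
    for sample in cols[9:]:
        fields = dict(zip(keys, sample.split(":")))
        depths = [int(x) for x in fields["AD"].split(",")]
        rows.append({
            "GT": fields["GT"],
            "AD": dict(zip(alleles, depths)),
            "DP": int(fields["DP"]),
        })
    return rows


rows = per_sample_depths(record)
for number, row in enumerate(rows, start=1):
    print(number, row["GT"], row["AD"], "DP =", row["DP"])
```

On these numbers the A allele is reported with 9, 3, 8 and 10 supporting reads in the four samples, so it does not look rare within any single sample. AD and DP can differ because of how reads are filtered, so a real pileup is still the more direct check.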

5.9 years ago
Jorge Amigo
Santiago de Compostela, Spain
Jorge Amigo wrote:

Let me first say that, when dealing with non-biallelic variants, the odds are critical for the calling. GATK tries to address this by allowing multi-sample calling rather than calling each sample individually, because the information from some samples helps in making decisions about the others. Anyway, I guess you are not using the -maxAlleles option in the single-sample calling and are therefore probably forcing biallelic calls; otherwise you would see those As in the 4 examples above. What I see is that you are asking GATK to call variants with no or very limited presence of the reference allele while allowing only a single alternate allele, which forces GATK to report these variants as homozygous for that single allowed alternate allele (although, as the multi-sample calling shows, there are reads carrying another alternate allele).

I'm pretty sure that if you re-analyze those single samples with -maxAlleles 2 (or higher) you will get the same results as in the multi-sample run. Although I don't see anything in the documentation stating it, I'm almost certain that when performing a multi-sample analysis the maxAlleles default value is not set to 1, as this would severely limit any multi-sample calling capabilities.
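The PL values quoted in the question are consistent with this explanation. As a rough check (a sketch only, not a description of GATK internals), the Python below takes the tri-allelic PLs from the combined record, keeps only the genotypes built from C and G, and rescales them so the best genotype is 0. It assumes the standard VCF genotype ordering, where genotype j/k with j <= k sits at index k*(k+1)/2 + j, and that the first sample's PL ends in 0,24. The sample labels and function names are made up for illustration.

```python
def pl_index(j, k):
    """Index of genotype j/k (j <= k) in the VCF PL ordering."""
    return k * (k + 1) // 2 + j


def subset_pl(pl, keep):
    """Keep genotypes made only of the alleles in `keep`, rescaled so the minimum is 0."""
    kept = [pl[pl_index(j, k)] for k in keep for j in keep if j <= k]
    best = min(kept)
    return [value - best for value in kept]


# PLs from the combined (multi-sample) record; alleles are 0=C, 1=A, 2=G.
multi_sample_pl = {
    "sample1": [176, 131, 125, 45, 0, 24],
    "sample2": [87, 87, 87, 12, 12, 0],
    "sample3": [87, 87, 87, 12, 12, 0],
    "sample4": [251, 176, 164, 75, 0, 48],
}

# PLs reported by the four single-sample runs (C/C, C/G, G/G).
single_sample_pl = {
    "sample1": [152, 21, 0],
    "sample2": [87, 12, 0],
    "sample3": [87, 12, 0],
    "sample4": [203, 27, 0],
}

for name, pl in multi_sample_pl.items():
    restricted = subset_pl(pl, keep=[0, 2])  # drop every genotype containing A
    print(name, restricted, restricted == single_sample_pl[name])
```

For all four samples the restricted PLs equal the single-sample PLs, which fits the idea that the single-sample runs only considered C and G and therefore settled on G/G.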

