Incongruencies in GT assignment by Freebayes
0
0
Entering edit mode
25 days ago

I am using Freebayes (v1.3.6) for SNP calling analysis on eggplant population data. While scrutinizing the VCF file, it has come to my attention the heterozygous assignation in some cases. This was the command used to perform the SNP calling:

/usr/bin/freebayes -b all_parents_mapped_BWA_v3_20X_dedup.bam -f /data/users/vbarfon/Genomes/Solgenomics_2019_08_30/Eggplant_V3_Chromosomes.fa -v multithreadfreebayes/subfiles/all_parents_mapped_BWA_v3_20X_dedup_SMEL3Ch01_0_136534347.vcf -r SMEL3Ch01:0-136534347 --limit-coverage 800 --min-coverage 10 -m 20 -q 20

In this case, a homozygous genotype was assigned to sample5, supported by 11 reference reads and 1 alternate read, resulting in an alternate allele frequency of 1/12=0.083. On the other hand, sample8 exhibited a heterozygous genotype, supported by 9 reference reads and 1 alternate read, with an alternate allele frequency of 1/10=0.1. Despite the similarity in alternate allele frequencies between samples, Freebayes assigns different genotypes at the same loci.

SMEL3Ch01       96806   .       TTTGCC  CTGGCG  47.5168 .       AB=0.185185;ABP=49.4959;AC=3;AF=0.1875;AN=16;AO=12;CIGAR=1X1M1X2M1X;DP=131;DPB=131;DPRA=1.18636;EPP=3.73412;EPPR=3.90444;GTI=1;LEN=6;MEANALT=1;MQM=35.5833;MQMR=60;NS=8;NUMALT=1;ODDS=4.88556;PAIRED=1;PAIREDR=0.915966;PAO=0;PQA=0;PQR=0;PRO=0;QA=465;QR=4692;RO=119;RPL=10;RPP=14.5915;RPPR=4.48836;RPR=2;RUN=1;SAF=9;SAP=9.52472;SAR=3;SRF=58;SRP=3.17453;SRR=61;TYPE=complex  GT:GQ:DP:AD:RO:QR:AO:QA:GL  0/1:31.4607:17:13,4:13:520:4:159:-6.03904,0,-42.0287    0/0:101.192:22:22,0:22:872:0:0:0,-6.62266,-78.7923       0/0:71.0896:12:12,0:12:460:0:0:0,-3.61236,-41.7443      0/0:63.2764:21:20,1:20:792:1:38:0,-2.73405,-68.0036     0/0:51.1134:12:11,1:11:437:1:40:0,-1.51775,-37.5857     0/1:41.2006:27:22,5:22:851:5:187:-7.81267,0,-68.7726     0/0:65.069:10:10,0:10:401:0:0:0,-3.0103,-36.4467        0/1:0.0326615:10:9,1:9:359:1:41:-0.735817,0,-29.6608

Moreover, at other loci, sample6, supported by 11 reference reads and 1 alternate read, exhibits a heterozygous genotype instead of a homozygous one, as exhibited in the previous case by sample5.

SMEL3Ch01       82135   .       A       G       885.116 .       AB=0.0833333;ABP=21.1059;AC=5;AF=0.3125;AN=16;AO=30;CIGAR=1X;DP=107;DPB=107;DPRA=1.03535;EPP=10.2485;EPPR=11.1604;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=59.6104;NS=8;NUMALT=1;ODDS=0.82119;PAIRED=1;PAIREDR=0.922078;PAO=0;PQA=0;PQR=0;PRO=0;QA=1174;QR=2974;RO=77;RPL=19;RPP=7.64277;RPPR=9.35551;RPR=11;RUN=1;SAF=11;SAP=7.64277;SAR=19;SRF=35;SRP=4.39215;SRR=42;TYPE=snp  GT:GQ:DP:AD:RO:QR:AO:QA:GL  1/1:30.4148:14:0,14:0:0:14:551:-49.9224,-4.21442,0      0/0:52.3018:15:15,0:15:570:0:0:0,-4.51545,-51.6258       1/1:33.4251:15:0,15:0:0:15:582:-52.7037,-4.51545,0      0/0:40.2606:11:11,0:11:418:0:0:0,-3.31133,-37.9553      0/0:40.2606:11:11,0:11:434:0:0:0,-3.31133,-39.4105      0/1:1.58028:12:11,1:11:398:1:41:-0.482207,0,-32.3809     0/0:58.3224:17:17,0:17:679:0:0:0,-5.11751,-61.4334      0/0:43.2709:12:12,0:12:475:0:0:0,-3.61236,-43.0969

This performance was also observed at higher coverages. For example, in this case, sample1, sample2, sample4, and sample7 carry a heterozygous genotype when their alternate allele frequencies are 4/45=0.088, 4/27=0.148, 6/45=0.133, 3/24=0.125, respectively.

SMEL3Ch01       96344   .       T       C       30.2059 .       AB=0.120567;ABP=179.331;AC=4;AF=0.25;AN=16;AO=24;CIGAR=1X;DP=242;DPB=242;DPRA=0;EPP=4.45795;EPPR=3.82085;GTI=2;LEN=1;MEANALT=1.125;MQM=45.4583;MQMR=54.682;NS=8;NUMALT=1;ODDS=5.42383;PAIRED=1;PAIREDR=0.949309;PAO=0;PQA=0;PQR=0;PRO=0;QA=966;QR=8706;RO=217;RPL=21;RPP=32.3252;RPPR=3.10036;RPR=3;RUN=1;SAF=15;SAP=6.26751;SAR=9;SRF=114;SRP=4.22112;SRR=103;TYPE=snp   GT:GQ:DP:AD:RO:QR:AO:QA:GL  0/1:0.000458462:45:41,4:41:1610:4:164:-1.33939,0,-127.693       0/1:22.7339:27:23,4:23:927:4:164:-6.37746,0,-71.7496     0/0:82.5237:17:16,1:16:643:1:32:0,-1.94407,-52.6745     0/1:29.6502:45:38,6:38:1525:6:246:-8.28092,0,-119.131   0/0:64.3246:27:25,2:25:1013:2:82:0,-0.614318,-82.214    0/0:66.6283:32:29,3:29:1168:3:114:0,-0.956644,-93.4502   0/1:0.0107044:24:21,3:21:853:3:123:-2.78843,0,-68.7308  0/0:96.5417:25:24,1:24:967:1:41:0,-3.63818,-80.9723

Summarizing, those results showed that Freebayes assigns genotypes randomly (two different sites supported by the same number of reads have different genotypes) and it assigns heterozygous genotypes even when the alternate allele frequency is too low, expecting homozygous genotypes at those sites. How can I deal with this issue?

It was also asked for help in Github: https://github.com/freebayes/freebayes/issues/793

depth freebayes heterozygous genotype • 344 views
ADD COMMENT
0
Entering edit mode

Please do not paste screenshots of plain text content, it is counterproductive. You can copy paste the content directly here (using the code formatting option shown below), or use a GitHub Gist if the content volume exceeds allowed length here.

ADD REPLY
0
Entering edit mode

You are right. I have modified the entry by replacing the images with text, which makes it easier to view

ADD REPLY
0
Entering edit mode

(two different sites supported by the same number of reads have different genotypes)

but the number of reads is not enough. What about the genotype quality...

ADD REPLY
0
Entering edit mode

I have redone the SNP calling so that the GQ field is integrated into the resulting file. The quality of the assigned genotypes is indeed very low in the cases I have pointed out in the text, but my question is, why instead of assigning heterozygous genotypes with low quality does it not assign homozygous genotypes, which is what would be expected considering the coverage that supports the reference allele and the alternative allele? As it did in the first case with sample 5. Why has it not done this in the second case with sample 6?

ADD REPLY
0
Entering edit mode

I have visualized the alignments in IGV and have seen a possible reason why the genotype for sample 5 is 0/0 at position 96806 and the genotype for sample 6 is 0/1 at position 82135, both being supported by 11 reads for the reference allele and 1 for the alternative allele. As attached in the images, the mapping quality of the read from sample 5 is low (MAPQ 21), which is why this read is not considered when assigning the genotype.

enter image description here

On the other hand, the mapping quality of the read from sample 6 is high (MAPQ 60), so this read is taken into account, considering the heterozygous genotype.

enter image description here

Based on this, should I believe that indeed the genotype of sample 6 is heterozygous despite the difference in the number of reads supporting each allele being 10, favoring the reference allele? Shouldn't the number of reads also be considered when assigning genotypes? I know that the number of reads may not be enough, but there are studies where sequencing coverage is low (< 10X) and this methodology is used.

ADD REPLY

Login before adding your answer.

Traffic: 2629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6