How to know if a variant is a sequencing or mapping artifact?
3.3 years ago
kspata ▴ 70

Hi All,

I am working with a viral sequence. While inspecting the freebayes output VCF file, I found two positions where the variant is the reverse complement of the reference.

Reference   527 .   CCCGGGCGTCGGGCGAC   GTCGCCCGACGCCCGGG   0   .   AB=0;ABP=0;AC=0;AF=0;AN=1;AO=1837;CIGAR=2X2M2X2M1X2M2X2M2X;DP=8277;DPB=9767.53;DPRA=0;EPP=3415.19;EPPR=9687.22;GTI=0;LEN=17;MEANALT=29;MQM=40.3887;MQMR=39.4471;NS=1;NUMALT=1;ODDS=65195.1;PAIRED=0.997278;PAIREDR=0.968095;PAO=0;PQA=0;PQR=137846;PRO=4415;QA=0;QR=230712;RO=6394;RPL=63;RPP=3463.56;RPPR=11044.7;RPR=1774;RUN=1;SAF=1831;SAP=3940.06;SAR=6;SRF=6101;SRP=11459.1;SRR=293;TYPE=complex  GT:DP:DPR:RO:QR:AO:QA:GL    0:8277:8277,1837:6394:230712:1837:0:0,-30803.9


and

Reference   5586    .   GTCGCCCGACGCCCGGG   CCCGGGCGTCGGGCGAC   1.19962e-12 .   AB=0;ABP=0;AC=0;AF=0;AN=1;AO=1439;CIGAR=2X2M2X2M1X2M2X2M2X;DP=6715;DPB=8094.82;DPRA=0;EPP=2732.86;EPPR=7945.83;GTI=0;LEN=17;MEANALT=27;MQM=39.3489;MQMR=38.6015;NS=1;NUMALT=1;ODDS=54869.5;PAIRED=0.994441;PAIREDR=0.963359;PAO=0;PQA=0;PQR=125375;PRO=3860;QA=0;QR=188745;RO=5240;RPL=1401;RPP=2806.41;RPPR=9203.98;RPR=38;RUN=1;SAF=9;SAP=3050.08;SAR=1430;SRF=271;SRP=9149.39;SRR=4969;TYPE=complex  GT:DP:DPR:RO:QR:AO:QA:GL    0:6715:6715,1439:5240:188745:1439:0:0,-25889.3
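The reverse-complement relationship between REF and ALT can be checked directly. The following is a minimal sketch in Python, with the alleles copied from the two records above; the helper name `revcomp` and the variable names are only illustrative.

```python
# Alleles copied from the two VCF records in the question.
site_527 = {"ref": "CCCGGGCGTCGGGCGAC", "alt": "GTCGCCCGACGCCCGGG"}
site_5586 = {"ref": "GTCGCCCGACGCCCGGG", "alt": "CCCGGGCGTCGGGCGAC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Is ALT the reverse complement of REF at each site?
alt_is_revcomp = [revcomp(s["ref"]) == s["alt"] for s in (site_527, site_5586)]

# Is the REF allele at 527 identical to the ALT allele at 5586?
sites_mirror_each_other = site_527["ref"] == site_5586["alt"]

print(alt_is_revcomp, sites_mirror_each_other)
```

Both comparisons hold for the alleles as posted: the reference sequences at the two positions are reverse complements of each other. This alone does not say whether the call is real.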


When I visualized the alignment in IGV, the variant was present at the specified location. This is the visualization in IGV.

How can I find out for sure:

1. Is this variant a sequencing or mapping artifact or an inversion variant?
2. Is visualization in IGV a correct method of validating a variant?
Variant freebayes vcf • 1.6k views

Hello kspata,

Please have a look at "How to add images to a Biostars post" to see how to add images to your post correctly. Also, the code button (the one with 101 010) is more suitable for showing file contents.

I've formatted your post this time. (Hopefully correctly.)

fin swimmer

3.3 years ago

Hello,

Is this variant a sequencing or mapping artifact or an inversion variant?

This is very likely some kind of artifact. Have a look at the 6th column of your VCF, which holds the quality value (QUAL) for the variant site. Your value is 0 or very close to it. Normally this value should be at least around 20, and with high read depth it is usually much higher.

Also, freebayes gives you the genotype 0, which means REF.
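As a rough illustration of these two checks, the sketch below reads QUAL (column 6) and the GT field from VCF data lines. The two records are copied from the question with the INFO column trimmed to a few keys for readability; `MIN_QUAL` and `summarize` are names made up for this example, and the threshold of 20 is only the rule of thumb mentioned above, not a freebayes setting.

```python
# Records from the question, tab-separated, with INFO shortened to three keys.
records = [
    "Reference\t527\t.\tCCCGGGCGTCGGGCGAC\tGTCGCCCGACGCCCGGG\t0\t.\t"
    "AO=1837;DP=8277;RO=6394\tGT:DP:DPR:RO:QR:AO:QA:GL\t"
    "0:8277:8277,1837:6394:230712:1837:0:0,-30803.9",
    "Reference\t5586\t.\tGTCGCCCGACGCCCGGG\tCCCGGGCGTCGGGCGAC\t1.19962e-12\t.\t"
    "AO=1439;DP=6715;RO=5240\tGT:DP:DPR:RO:QR:AO:QA:GL\t"
    "0:6715:6715,1439:5240:188745:1439:0:0,-25889.3",
]

MIN_QUAL = 20.0  # rule-of-thumb threshold (assumption for this sketch)


def summarize(line):
    """Extract position, QUAL and genotype from one single-sample VCF line."""
    cols = line.rstrip("\n").split("\t")
    pos, qual = int(cols[1]), float(cols[5])  # assumes QUAL is not "."
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
    return {
        "pos": pos,
        "qual": qual,
        "gt": sample["GT"],
        "low_qual": qual < MIN_QUAL,
        "called_ref": sample["GT"] == "0",  # haploid call of the REF allele
    }


summaries = [summarize(r) for r in records]
for s in summaries:
    print(s)
```

For both records the sketch flags a QUAL below the threshold and a genotype equal to the reference allele.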

Is visualization in IGV a correct method of validating a variant?

For a quick check this is absolutely fine. In this case you can also see that the bases which don't match the reference are faded. This normally means they have low base quality values.

For a real validation, each variant that has an impact on your goal has to be confirmed by another method such as Sanger sequencing.

fin swimmer


Hi finswimmer,

Thank you for editing my post to make it more legible, and for replying to the question.

If these variants are artifacts:

1. Is there any way I can check whether they are mapping artifacts using the BAM file, before performing Sanger sequencing?
2. What points should I take into account to verify and explain such variants? For example, is it possible that the reads mapped to multiple locations and hence the mapping quality is zero, or that there is an inverted repeat present in the mate pairs?

Please suggest ways in which I can verify this.

Thanks again!!


Is there any way I can check whether they are mapping artifacts using the BAM file, before performing Sanger sequencing?

These don't look like mapping artifacts. Have a look at the values of MQM and MQMR. The first is the mean mapping quality of the reads supporting the alternate allele, the latter that of the reads supporting the reference. They are nearly the same, and a value around 40 is fine.
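The comparison can be sketched in a few lines of Python. The MQM and MQMR values are copied from the INFO column of the two records in the question; `parse_info` is a helper written for this example, not part of any library.

```python
# MQM/MQMR values copied from the question; the rest of INFO is left out.
info_by_pos = {
    527: "MQM=40.3887;MQMR=39.4471",
    5586: "MQM=39.3489;MQMR=38.6015",
}


def parse_info(info):
    """Turn 'KEY=VALUE;KEY=VALUE' into a dict; flag-only keys map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


# Difference between mean mapping quality of ALT-supporting and
# REF-supporting reads at each site.
mq_gap = {}
for pos, info in info_by_pos.items():
    fields = parse_info(info)
    mq_gap[pos] = float(fields["MQM"]) - float(fields["MQMR"])
    print(pos, fields["MQM"], fields["MQMR"], round(mq_gap[pos], 2))
```

At both sites the two means differ by less than one unit, so the reads carrying the alternate allele do not map noticeably worse than those carrying the reference.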

What points should I take into account to verify and explain such variants? For example, is it possible that the reads mapped to multiple locations and hence the mapping quality is zero, or that there is an inverted repeat present in the mate pairs?

Regarding mapping quality, see above. You could also click on the reads in IGV and check their mapping quality.

Further hints that this is not a true variant:

- The overall QUAL is low.
- As QUAL also depends on the read depth, one can calculate the quality per depth (= QUAL/DP). In regions where you have good coverage this value is > 2.
- Compare the QA and QR values. These are the sums of the qualities of the alternate/reference observations.
- Click on a variant base in your IGV browser and check the base quality value. I guess these will be low.

There are more values in your VCF that stand out. Have a look at the header: it contains a description of each value.
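The first three points can be worked through with the numbers from the question. The sketch below is only an illustration: the values are copied from the two records, `MIN_QD` is the "> 2" rule of thumb from this post, and dividing QA by AO (and QR by RO) to get a mean quality per observation is a convenience added here, not something freebayes reports.

```python
# QUAL, DP, AO, QA, RO and QR copied from the two records in the question.
records = {
    527: {"QUAL": 0.0, "DP": 8277, "AO": 1837, "QA": 0, "RO": 6394, "QR": 230712},
    5586: {"QUAL": 1.19962e-12, "DP": 6715, "AO": 1439, "QA": 0, "RO": 5240, "QR": 188745},
}

MIN_QD = 2.0  # rule-of-thumb threshold for QUAL/DP (assumption for this sketch)


def quality_summary(rec):
    """Compute quality per depth and mean quality per ALT/REF observation."""
    qd = rec["QUAL"] / rec["DP"] if rec["DP"] else 0.0
    mean_alt_q = rec["QA"] / rec["AO"] if rec["AO"] else 0.0
    mean_ref_q = rec["QR"] / rec["RO"] if rec["RO"] else 0.0
    return {
        "qd": qd,
        "qd_ok": qd > MIN_QD,
        "mean_alt_q": mean_alt_q,
        "mean_ref_q": mean_ref_q,
    }


results = {pos: quality_summary(rec) for pos, rec in records.items()}
for pos, res in results.items():
    print(pos, res)
```

With these inputs QUAL/DP is essentially zero at both sites, and QA is 0 although more than a thousand reads carry the alternate allele, while the reference observations average a quality of about 36. That matches the faded bases seen in IGV.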

fin swimmer