I'm calling mitochondrial variants with Mutect2 and one variant looks like an artifact, but I don't understand what could be causing it. From IGV (picture below), it looks like this variant is always at the same position on forward and reverse reads. The artifact might also be caused by the repeat sequence (see the image below).
Here is the vcf line of the variant:
chrM 1620 . A C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=143,154|4,4;DP=307;ECNT=7;FS=0.000;MBQ=28,30;MFRL=0,0;MMQ=60,60;MPOS=43;OCM=0;POPAF=2.40;TLOD=17.31 GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|1:297,8:0.029:305:132,4:143,4:297,8:0|1:1620_A_C:1620:143,154,4,4
What do you think of this? Have you ever seen this type of artifact? How can I filter it out?
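For readers who want to check the numbers in the VCF line above, here is a minimal sketch that unpacks `AS_SB_TABLE`. It assumes the usual layout of reference counts first and alternate counts second, each as forward,reverse; the function name `parse_sb_table` is made up for this example and is not part of GATK.

```python
def parse_sb_table(value):
    """Parse 'refFwd,refRev|altFwd,altRev' into [(fwd, rev), ...], ref allele first."""
    return [tuple(int(n) for n in part.split(",")) for part in value.split("|")]

# Value copied from the VCF line in the question.
table = parse_sb_table("143,154|4,4")
(ref_fwd, ref_rev), (alt_fwd, alt_rev) = table

alt_total = alt_fwd + alt_rev
total = ref_fwd + ref_rev + alt_total
fraction = alt_total / total

print(alt_total, total)    # 8 305
print(round(fraction, 3))  # 0.026
print(alt_fwd == alt_rev)  # True: alt reads are evenly split across strands
```

The raw ratio (about 0.026) is slightly lower than the AF of 0.029 reported in the FORMAT column, and the even 4/4 strand split is consistent with FS=0.000, so a plain strand-bias filter would not flag this call.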
It might be related to soft-clipped bases (see image here). Most of the reads with the variant also have soft-clipped bases at the end or start of the read. From what I understand, there are three parts in the reads supporting the variant:
- a region mapping to "position 1" in the genome
- a repeat region occurring in "position 1" and "position 2" of the genome (very similar except for the two variant bases)
- a region mapping to "position 2" in the genome
I don't know how to interpret this. Most of the read bases map to "position 1", but I don't know why some also map well to "position 2".
Hi. Out of curiosity, why do you think this is an artifact and not a somatic variant?
I think it is strange that the position of the variant on the read is always the same: it is always at the end of forward reads and at the beginning of reverse reads. Shouldn't the probability of seeing this be extremely low? Also, the mutation at base 1620 always seems to be associated with another mutation at base 1623; shouldn't they be distributed independently from each other?
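One way to put a number on "always at the same position on the read" is to convert the reference coordinate into a read offset using each read's POS and CIGAR. The sketch below is pure Python and works on values copied from SAM records; the two example alignments are hypothetical, not taken from the data in this thread, and the function names are made up for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def read_length(cigar):
    """Length of the stored read sequence: ops that consume the query (M, I, S, =, X)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MIS=X")

def read_offset_of_ref_pos(ref_start, cigar, ref_pos):
    """0-based offset in the read aligned to ref_pos, or None if no read base aligns there.

    ref_start is the 1-based leftmost mapped position (the SAM POS field).
    """
    read_i, ref_i = 0, ref_start
    for n, op in CIGAR_RE.findall(cigar):
        n = int(n)
        if op in "M=X":            # consumes both read and reference
            if ref_i <= ref_pos < ref_i + n:
                return read_i + (ref_pos - ref_i)
            read_i += n
            ref_i += n
        elif op in "IS":           # consumes read only
            read_i += n
        elif op in "DN":           # consumes reference only
            ref_i += n
        # H and P consume neither
    return None

def distance_to_nearest_end(ref_start, cigar, ref_pos):
    """Distance (in bases) from the aligned base to the closer end of the read."""
    offset = read_offset_of_ref_pos(ref_start, cigar, ref_pos)
    if offset is None:
        return None
    return min(offset, read_length(cigar) - 1 - offset)

# Hypothetical alignments: (POS, CIGAR)
for pos, cigar in [(1578, "45M6S"), (1618, "8S43M")]:
    print(pos, cigar, distance_to_nearest_end(pos, cigar, 1620))
```

If nearly all alt-supporting reads give the same small distance and carry a soft clip right next to the variant, that points toward an alignment artifact rather than a variant sampled at random positions along the fragments.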
Ah, good observation! (I had missed the red box on the right of your screenshot!) Have you checked the mapping quality of the reads supporting the mutation? And maybe the base quality as well? Regarding the distribution of the two variants: if both variants are colocated (phased together), then a similar distribution is expected, but I'm not sure this is the case here.
well that colocation implies those are due to mitochondrial heteroplasmy, which is just a fancy word for different lineages of mitochondria swimming around in an individual in different fractions. mitochondrial somatic variation can happen in cancer but heteroplasmy happens in everyone.
Jeremy Leipzig, "mitochondrial heteroplasmy" indeed sounds fancy, I did not know about this concept :)
I understand why colocation can imply heteroplasmy, but not the constant distance from the end/start of the read. It might be related to soft-clipping, as suggested in this answer, but I don't know much about this.
baptiste.lac : How were the libraries made? It is possible that we may have PCR duplicates since the reads look identical. If these were no-PCR libraries then there may indeed be a small fraction of genomes that seem to have
I have duplicates but I marked them with GATK
I have MBQ=28,30 (median base quality) and MMQ=60,60 (median mapping quality), which look good. That is why I suspect some sort of artifact occurring at the start/end of the reads in this short-repeat region (mostly A and C bases).
Do MBQ and MMQ consider all the reads at the position, or only the ones supporting the mutation?
MMQ and MBQ represent the median mapping/base quality by allele, so I think they consider all reference reads and all alternate reads separately. As I understand it, the second value in 60,60 corresponds to the reads supporting the mutation. I also checked the reads directly in IGV and they have good alignment and base qualities.
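To make the "by allele" reading concrete, here is a small sketch that splits the INFO column and separates the reference value from the alternate value(s). It assumes MBQ and MMQ are declared with `Number=R` in the VCF header (one value per allele, reference first), which is worth confirming in your own header; `parse_info` and `per_allele` are illustrative names, not library functions.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style entries map to True."""
    out = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
        else:
            out[item] = True
    return out

def per_allele(info, key):
    """Split a comma-separated per-allele integer field into ref and alt values.

    Assumes Number=R ordering: reference allele first, then each alternate allele.
    """
    values = [int(v) for v in info[key].split(",")]
    return {"ref": values[0], "alt": values[1:]}

# INFO column copied from the VCF line in the question.
INFO = ("AS_FilterStatus=SITE;AS_SB_TABLE=143,154|4,4;DP=307;ECNT=7;FS=0.000;"
        "MBQ=28,30;MFRL=0,0;MMQ=60,60;MPOS=43;OCM=0;POPAF=2.40;TLOD=17.31")
info = parse_info(INFO)

print(per_allele(info, "MBQ"))  # {'ref': 28, 'alt': [30]}
print(per_allele(info, "MMQ"))  # {'ref': 60, 'alt': [60]}
```

Under that assumption, the alt-supporting reads have a median base quality of 30 and a median mapping quality of 60, so neither field separates them from the reference reads.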
Are these variants only in a small number of reads (out of the many you seem to have)? There is at least one read where only one of the variants is represented. Are you or are you not showing soft-clipped reads in IGV?
AS_SB_TABLE=143,154|4,4;DP=307, hence 8 reads supporting the variant out of 307 aligned reads. Do you think that is too low? In the picture, soft-clipped bases were not shown. Here is what it looks like when soft-clipped bases are shown, but I don't know how to interpret it:
It would probably be more helpful if you showed the same position as the parent post
Ooops yes I edited my reply to show the same base position
Could the soft clipped sequence be adapter? Otherwise, I'd think it's a misalignment; I'd say those reads don't belong there, and it's not a real variant.
I've BLASTed the soft-clipped parts and they align to another region in the genome where there is a secondary alignment of the read, but the secondary alignment is shorter and does not look better.
A quick thing you can check: grep for, say, 10 bases containing that variant sequence in your fastq, and see if there are more reads with the variant that might be failing to map because they carry even more variants. For instance, that hard stop you observed could be because there are more variants just upstream, but the aligner refused to align those reads because they were too different from your reference sequence.
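A plain `grep` only finds the k-mer in one orientation, so a small script that also checks the reverse complement can be handy. This is a rough sketch, assuming standard 4-line FASTQ records (no wrapped sequences); the reads and the k-mer in the demo are made up, and you would substitute a short sequence spanning the variant taken from your own alt-supporting reads.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_reads_with_kmer(fastq_lines, kmer):
    """Count reads containing kmer or its reverse complement.

    fastq_lines is any iterable of lines (e.g. an open file handle);
    returns (reads_with_kmer, total_reads).
    """
    kmer = kmer.upper()
    rc = reverse_complement(kmer)
    hits = total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:          # the sequence is the 2nd line of each 4-line record
            continue
        total += 1
        seq = line.strip().upper()
        if kmer in seq or rc in seq:
            hits += 1
    return hits, total

# Made-up records for demonstration; r2 is the reverse complement of r1.
demo = [
    "@r1", "ACGTTTGACCA", "+", "IIIIIIIIIII",
    "@r2", "TGGTCAAACGT", "+", "IIIIIIIIIII",
    "@r3", "AAAAAAAAAAA", "+", "IIIIIIIIIII",
]
print(count_reads_with_kmer(demo, "TTTGACC"))  # (2, 3)
```

For a real file, pass an open handle instead of the list, for example `gzip.open(path, "rt")` for a gzipped FASTQ. If the count is clearly higher than the 8 alt reads seen in the alignment, some variant-carrying reads are probably unmapped or mapped elsewhere.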