What sequencing/alignment artifact is this?
18 months ago
lacb ▴ 120

I'm calling mitochondrial variants with Mutect2, and one variant looks like an artifact, but I don't understand what could be causing it. From IGV (picture below), it looks like this variant is always at the same position in the forward and reverse reads. The artifact might also be caused by the repeat sequence (see the image below).

Here is the vcf line of the variant:

chrM    1620    .   A   C   .   PASS    AS_FilterStatus=SITE;AS_SB_TABLE=143,154|4,4;DP=307;ECNT=7;FS=0.000;MBQ=28,30;MFRL=0,0;MMQ=60,60;MPOS=43;OCM=0;POPAF=2.40;TLOD=17.31    GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|1:297,8:0.029:305:132,4:143,4:297,8:0|1:1620_A_C:1620:143,154,4,4
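
For readers less familiar with the Mutect2 annotations, here is a minimal sketch of how the strand counts in this record can be unpacked. It assumes that AS_SB_TABLE lists forward,reverse read counts per allele, separated by `|`, with the reference allele first; the helper name `parse_as_sb_table` is made up for illustration.

```python
def parse_as_sb_table(info: str) -> list[tuple[int, int]]:
    """Return (forward, reverse) read counts per allele from a VCF INFO string."""
    # Split "KEY=VALUE;KEY=VALUE" into a dict, skipping flag-only entries.
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    per_allele = fields["AS_SB_TABLE"].split("|")
    return [tuple(int(n) for n in allele.split(",")) for allele in per_allele]


# INFO values copied from the record above (shortened to the relevant keys).
info = "AS_FilterStatus=SITE;AS_SB_TABLE=143,154|4,4;DP=307;ECNT=7;FS=0.000"
ref_counts, alt_counts = parse_as_sb_table(info)

alt_total = sum(alt_counts)                 # reads supporting the C allele
informative = sum(ref_counts) + alt_total   # reads counted in the table
alt_fraction = alt_total / informative

print(ref_counts, alt_counts, alt_total, informative, round(alt_fraction, 3))
```

The alternate allele is supported by 4 forward and 4 reverse reads, so a strand-bias test alone would not flag it, which is consistent with FS=0.000 in the record.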

What do you think of this? Have you ever seen this type of artifact? How can I filter it out?

EDIT:

It might be related to soft-clipped bases (see image here). Most of the reads with the variant also have soft-clipped bases at the start or end of the read. From what I understand, there are three parts in my reads supporting the variant:

  • a region mapping to "position 1" in the genome
  • a repeat region occurring in "position 1" and "position 2" of the genome (very similar except for the two variant bases)
  • a region mapping to "position 2" in the genome

I don't know how to interpret this now. Most of the read bases map to "position 1", but I don't know why some also map well to "position 2".

IGV

variants mutect sequencing alignment mitochondria • 1.9k views
Hi. Out of curiosity, why do you think this is an artifact and not a somatic variant?

I think it is strange that the position of the variant in the read is always the same: it is always at the end of forward reads and at the beginning of reverse reads. Shouldn't the probability of seeing this be extremely low? Also, the mutation at base 1620 always seems to be associated with another mutation at base 1623; shouldn't they be distributed independently from each other?
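
One way to check the "always the same position in the read" observation outside IGV is to compute, for each supporting read, which read offset aligns to reference position 1620. Below is a rough sketch in plain Python that walks a CIGAR string. The alignment starts and CIGAR strings in the example are invented for illustration and are not taken from the actual BAM; `read_offset` and `distance_from_nearest_end` are hypothetical helper names.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_offset(ref_start: int, cigar: str, ref_pos: int):
    """Return the 0-based offset in the read that aligns to ref_pos.

    ref_start and ref_pos are 1-based reference coordinates (as in SAM/VCF).
    Returns None if ref_pos lies in a deletion or outside the alignment.
    """
    read_i, ref_i = 0, ref_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":  # consumes both read and reference
            if ref_i <= ref_pos < ref_i + n:
                return read_i + (ref_pos - ref_i)
            read_i += n
            ref_i += n
        elif op in "IS":  # consumes read only (insertion, soft clip)
            read_i += n
        elif op in "DN":  # consumes reference only (deletion, skip)
            if ref_i <= ref_pos < ref_i + n:
                return None
            ref_i += n
        # H and P consume neither read nor reference
    return None


def distance_from_nearest_end(ref_start: int, cigar: str, ref_pos: int):
    """Distance (in bases) from the variant base to the closer end of the read."""
    offset = read_offset(ref_start, cigar, ref_pos)
    if offset is None:
        return None
    read_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")
    return min(offset, read_len - 1 - offset)


# Invented examples: a read ending its match at 1620, and one starting there.
print(read_offset(1520, "101M20S", 1620))                # offset in the first read
print(read_offset(1620, "20S101M", 1620))                # offset in the second read
print(distance_from_nearest_end(1520, "101M20S", 1620))  # soft clip counted as read
```

If the offsets cluster right at the boundary between the aligned part and the soft clip for every supporting read, that points toward an alignment effect rather than an independent sampling of a real variant.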

Ah, good observation! (I had missed the red box on the right of your screenshot!) Have you checked the mapping quality of the reads supporting the mutation? And maybe also the base quality? Regarding the distribution of the two variants: if both variants are colocated (phased together), then a similar distribution is expected, but I'm not sure that is the case here.

well that colocation implies those are due to mitochondrial heteroplasmy, which is just a fancy word for different lineages of mitochondria swimming around in an individual in different fractions. mitochondrial somatic variation can happen in cancer but heteroplasmy happens in everyone.

Jeremy Leipzig , "mitochondrial heteroplasmy", indeed sounds fancy, I did not know about this concept :)

I understand why colocation can imply heteroplasmy, but not the constant distance from the start/end of the read. It might be related to soft-clipping, as suggested in this answer, but I don't know much about this.

baptiste.lac : How were the libraries made? It is possible that we may have PCR duplicates since the reads look identical. If these were no-PCR libraries then there may indeed be a small fraction of genomes that seem to have

I have duplicates but I marked them with GATK

I have MBQ=28,30 (median base quality) and MMQ=60,60 (median mapping quality), which look good. That is why I suspect some sort of artifact occurring at the start/end of the reads in this short-repeat region (mostly A and C bases).

Do MBQ and MMQ consider all the reads at the position, or only the ones supporting the mutation?

MMQ and MBQ represent the median mapping/base quality by allele, so I think they consider all reference reads and all alternate reads separately. I understand that the second value in 60,60 is associated with the reads supporting the mutation. I also checked the reads directly in IGV, and they have good alignment and base qualities.
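
To make the "by allele" reading concrete, here is a small sketch that pairs each comma-separated value with its allele. It assumes the values are ordered reference first, then alternate, as described above; `per_allele` is a hypothetical helper name, and the INFO string is trimmed from the record in the question.

```python
def per_allele(info: str, key: str, alleles: list[str]) -> dict[str, int]:
    """Pair the comma-separated values of an INFO key with the given alleles."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    values = [int(v) for v in fields[key].split(",")]
    return dict(zip(alleles, values))


info = "DP=307;MBQ=28,30;MFRL=0,0;MMQ=60,60;MPOS=43"
alleles = ["A", "C"]  # REF, ALT from the VCF line

mbq = per_allele(info, "MBQ", alleles)  # median base quality per allele
mmq = per_allele(info, "MMQ", alleles)  # median mapping quality per allele
print(mbq, mmq)
```

Under that assumption, the reads carrying the C allele have a median base quality of 30 and a median mapping quality of 60, so neither metric separates them from the reference reads.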

Are these variants only in a small number of reads (out of the many you seem to have)? There is at least one read where only one variant is represented. Are you or are you not showing soft-clipped reads in IGV?

I have AS_SB_TABLE=143,154|4,4;DP=307, hence 8 reads supporting the variant out of 307 aligned reads. Do you think that is too low? Soft-clipped bases were not shown in the picture. Here is what it looks like when soft-clipped bases are shown, but I don't know how to interpret it:

soft-clipped bases

It would probably be more helpful if you showed the same position as the parent post

Ooops yes I edited my reply to show the same base position

Could the soft clipped sequence be adapter? Otherwise, I'd think it's a misalignment; I'd say those reads don't belong there, and it's not a real variant.

I've BLASTed the soft-clipped parts, and they align to another region of the genome where the read has a secondary alignment, but the secondary alignment is shorter and does not look better.

A quick thing you can check: grep, say, 10 bases containing that variant sequence in your FASTQ, and see whether there are more reads with the variant that might be failing to map because they have even more variants. For instance, the hard stop you observed could be because there are more variants just upstream, and the aligner refused to align those reads because they were too different from your reference sequence.
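
A plain `grep` only finds the sequence on one strand, so a small script that also checks the reverse complement may be more convenient. The sketch below assumes standard 4-line FASTQ records (no wrapped sequence lines). The 10-mer and the three reads are placeholders invented for the example; they are not the real chrM sequence around position 1620.

```python
import io

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def reads_containing(fastq_handle, kmer: str):
    """Yield (read_name, sequence) for reads containing kmer on either strand."""
    targets = (kmer.upper(), reverse_complement(kmer.upper()))
    while True:
        header = fastq_handle.readline()
        if not header:
            break
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line
        fastq_handle.readline()  # quality line
        if any(t in seq.upper() for t in targets):
            yield header.strip()[1:], seq


# Placeholder data: r1 carries the k-mer, r2 does not, r3 carries its
# reverse complement.
fastq_text = (
    "@r1\nGGACCCACTAAGTT\n+\nIIIIIIIIIIIIII\n"
    "@r2\nTTTTGGGGTTTTGG\n+\nIIIIIIIIIIIIII\n"
    "@r3\nAACTTAGTGGGTCC\n+\nIIIIIIIIIIIIII\n"
)
kmer = "ACCCACTAAG"  # hypothetical 10-mer spanning the variant base

hits = [name for name, _ in reads_containing(io.StringIO(fastq_text), kmer)]
print(hits)
```

For a real run, the `io.StringIO` handle would be replaced by an open file, for example `gzip.open(path, "rt")` for a compressed FASTQ. Comparing the number of hits with the 8 supporting reads in the VCF would show whether additional variant-carrying reads are going unmapped.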
