Somatic variant calling with/without matched normal sample (Mutect2)
1
0
Entering edit mode
15 months ago
newbio17 ▴ 340

Hello,

I'm currently testing GATK's tumor-only variant calling pipeline and came across an issue that I'm stuck on. I wanted to see how different the results would be if I ran Mutect2 with and without matched normal by comparing the two outputs (command used available below) using TCGA data. I'm following the current GATK workflow in order to filter the Mutect2 (i.e. FilterMutectCalls) calls and only consider variants with 'PASS' for comparison.

For a given sample, its results were summarized in venn diagram:

I assume Mutect2 is trying to do its best without paired normal sample to reduce errors in its calls, but the small overlap between the results really concern me as I prepare to perform tumor-only calling on my samples. Does anyone have a suggestion on how to impose more stringent filtering (i.e. retrieve variants that fall in both matched and unmatched workflow from the venn diagram)?

I seeked for assistance on GATK forums, but found out they only handle errors and issues with their tools.

# Mutect2 w/ matching normal

gatk Mutect2 -R hg38.fa \
-I input_tumor.bam -I input_normal.bam \
-tumor tumor_sample -normal normal_sample \
--af-of-alleles-not-in-resource 0.0000025 \
-L exome_autoXYM.intervals -O mt2_matched.vcf.gz

# Mutect2 w/ no matching normal

gatk Mutect2 -R hg38.fa \
-I input_tumor.bam \
-pon gatk4_mutect2_4136_pon.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--genotype-germline-sites \
-L exome_autoXYM.intervals \
-O mt2_unmatched.vcf.gz

mutect2 wes variant calling gatk • 1.1k views
0
Entering edit mode

0
Entering edit mode

Discussion thread can be found here. A developer responded asking how I compiled panel of normals, but not much to go on off of yet.

0
Entering edit mode

Are you seeing this with multiple tumor-normal pairs or just one?

Have you manually checked some of the variants in different categories in IGV (or a similar tool)? Do they make sense? For example, do the unmatched-only variants show up in the normal?

0
Entering edit mode

newbio17?

0
Entering edit mode

Sorry, I didn't get the notifications for the comments. I ran a couple of other matched data and the overlap is still low (10-15%) between the two workflows.

For igor's second comment, I did try running the germline variant calling on the normal samples and tried to see if the germline variants overlap with the unmatched-only variants from tumor-only somatic call. The results show that two sets do not overlap (0-2%).

0
Entering edit mode

I think you misunderstood my second comment. Did you manually inspect the variants? I mean visually inspect the BAM files. A lot of problems become very obvious when you actually see the reads.

0
Entering edit mode

Ah okay. I did briefly take a look at couple of files I tested, but nothing too out of ordinary. However, I only examined a couple cases such as intron variants in order to make sure the region was consistent with Agilent's target region. What do you recommend I look for specifically (aside from potential misalignments)?

Also, one of GATK developers replied to above-mentioned GATK forum post commenting on tumor-only variant calling and issues associated with it for those who are curious.

0
Entering edit mode

I only examined a couple cases such as intron variants in order to make sure the region was consistent with Agilent's target region

Are you targeting non-coding regions?

What do you recommend I look for specifically (aside from potential misalignments)?

Do the variants make sense in terms of where they are on your venn diagram? For example, do the unmatched-only variants show up in the normal?

one of GATK developers replied to above-mentioned GATK forum

Thanks for letting us know.

0
Entering edit mode

Are you targeting non-coding regions?

I'm not specifically interested in non-coding regions, but I examined them since they showed up to make sure they weren't due to some alignment errors.

Do the variants make sense in terms of where they are on your venn diagram? For example, do the unmatched-only variants show up in the normal?

Unmatched variants rarely shows up, if at all in normal. Therefore, they are new variant calls that unique to tumor-only protocol.

If I have any updates regarding this, I will follow up on this post. Thanks everyone.

0
Entering edit mode

they are new variant calls that unique to tumor-only protocol

Just to clarify, you see variants in your tumor sample, but not in your normal sample, but they are not being called when you are using that matched normal.

In that case, you can substantially simplify your original question to something like "I see clear variants in my tumor BAM file and not in the matched normal, but why are they not being called?".

0
Entering edit mode

Just to clarify, you see variants in your tumor sample, but not in your normal sample, but they are not being called when you are using that matched normal.

Let me clarify what I was doing with the tumor-only approach. I first call somatic variants using Mutect2 on the tumor BAM file. Then I call germline variants using HaplotypeCaller on the normal BAM file. Between the VCF files generated by both tools, there was almost no overlap. Therefore, I concluded that 3911 (from above venn diagram) are not germline variants that were failed to get filtered out in tumor-only approach.

To reiterate, I have the option to either provide normal & tumor BAM files or provide only tumor BAM file to get a single VCF file. I asked my original question in the hopes of receiving some feedback after providing evidence that two different workflows for calling somatic variants display significantly different results when they should be (ideally) similar. So I'm not sure if I fully agree with your simplified version of my question since that is not what I intended to ask in the first place (why they are not in normal seems self-explanatory if you consider that they are two different variant callers serving different purposes).

0
Entering edit mode

that is not what I intended to ask in the first place

I realize that is not what you intended to ask, which is why I suggested that should be the first question you should be asking. Your final question has a lot of variables. You should resolve each variable separately. The reason you have poor overlap is because your true somatic variants are not being called in the matched analysis (at least based on what you said in the comments). You need to figure out why they are not being called before you ask why they are not overlapping with variants from a different type of analysis.

why they are not in normal seems self-explanatory if you consider that they are two different variant callers serving different purposes

Now you are adding yet another variable to your question. Why is HaplotypeCaller relevant here? When you say that the variants are not in the normal, did you actually look at the BAM files yourself (as I suggested) or are you looking at HaplotypeCaller results?

0
Entering edit mode

did you actually look at the BAM files yourself

Yeah. I mentioned that I did this for handful of coding regions with no noticeable oddities.

The reason you have poor overlap is because your true somatic variants are not being called in the matched analysis

I agree, but what I'm concerned with aren't the missed calls (false negatives), but the false positives.

Why is HaplotypeCaller relevant here?

This became relevant when I tried to examine the false positives were actually germline variants.

Thanks again for your comments and I am more than interested in other suggestions/recommendations if you have any additional ones.

1
Entering edit mode
15 months ago

4300 somatic calls would be a lot even for WES. Is it possible there is a tumor/normal misalignment?

Otherwise, running it quickly with Mutect 1 would be an easy way to debug. Mutect 1 reports all variants, even if found in PoN or matched normal, so it's easier to figure out the impact of those.

0
Entering edit mode

I did see much fewer somatic variants when I ran the tumor-only pipeline for my samples before testing with the TCGA samples for troubleshooting. I'm not sure what you mean by misalignment. Do you mean to say that the .bam files might have been generated using different alignment strategies (e.g. hg19 vs hg38)? The data available on GDC goes through a harmonized pipeline once submitted, so I'm not sure if that could be the culprit.