Question: Somatic variant calling with/without matched normal sample (Mutect2)
0
gravatar for newbio17
7 months ago by
newbio17240
newbio17240 wrote:

Hello,

I'm currently testing GATK's tumor-only variant calling pipeline and came across an issue that I'm stuck on. I wanted to see how different the results would be if I ran Mutect2 with and without matched normal by comparing the two outputs (command used available below) using TCGA data. I'm following the current GATK workflow in order to filter the Mutect2 (i.e. FilterMutectCalls) calls and only consider variants with 'PASS' for comparison.

For a given sample, its results were summarized in venn diagram:

enter image description here

I assume Mutect2 is trying to do its best without paired normal sample to reduce errors in its calls, but the small overlap between the results really concern me as I prepare to perform tumor-only calling on my samples. Does anyone have a suggestion on how to impose more stringent filtering (i.e. retrieve variants that fall in both matched and unmatched workflow from the venn diagram)?

I seeked for assistance on GATK forums, but found out they only handle errors and issues with their tools.

Thanks in advance.

# Mutect2 w/ matching normal

gatk Mutect2 -R hg38.fa \
-I input_tumor.bam -I input_normal.bam \
-tumor tumor_sample -normal normal_sample \
-pon gatk4_mutect2_4136_pon.vcf.gz --germline-resource af-only-gnomad.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
-L exome_autoXYM.intervals -O mt2_matched.vcf.gz

# Mutect2 w/ no matching normal

gatk Mutect2 -R hg38.fa \
-I input_tumor.bam \
-pon gatk4_mutect2_4136_pon.vcf.gz \
--germline-resource af-only-gnomad.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--genotype-germline-sites \
-L exome_autoXYM.intervals \
-O mt2_unmatched.vcf.gz
ADD COMMENTlink modified 7 months ago by markus.riester500 • written 7 months ago by newbio17240

That is quite a worrying finding, but not unexpected. Can you please paste a link to the thread on GATK Forum, please?

ADD REPLYlink written 7 months ago by Kevin Blighe66k

Discussion thread can be found here. A developer responded asking how I compiled panel of normals, but not much to go on off of yet.

ADD REPLYlink written 7 months ago by newbio17240

Are you seeing this with multiple tumor-normal pairs or just one?

Have you manually checked some of the variants in different categories in IGV (or a similar tool)? Do they make sense? For example, do the unmatched-only variants show up in the normal?

ADD REPLYlink written 7 months ago by igor11k

newbio17?

ADD REPLYlink written 7 months ago by Kevin Blighe66k

Sorry, I didn't get the notifications for the comments. I ran a couple of other matched data and the overlap is still low (10-15%) between the two workflows.

For igor's second comment, I did try running the germline variant calling on the normal samples and tried to see if the germline variants overlap with the unmatched-only variants from tumor-only somatic call. The results show that two sets do not overlap (0-2%).

ADD REPLYlink written 7 months ago by newbio17240

I think you misunderstood my second comment. Did you manually inspect the variants? I mean visually inspect the BAM files. A lot of problems become very obvious when you actually see the reads.

ADD REPLYlink written 7 months ago by igor11k

Ah okay. I did briefly take a look at couple of files I tested, but nothing too out of ordinary. However, I only examined a couple cases such as intron variants in order to make sure the region was consistent with Agilent's target region. What do you recommend I look for specifically (aside from potential misalignments)?

Also, one of GATK developers replied to above-mentioned GATK forum post commenting on tumor-only variant calling and issues associated with it for those who are curious.

ADD REPLYlink written 7 months ago by newbio17240

I only examined a couple cases such as intron variants in order to make sure the region was consistent with Agilent's target region

Are you targeting non-coding regions?

What do you recommend I look for specifically (aside from potential misalignments)?

Do the variants make sense in terms of where they are on your venn diagram? For example, do the unmatched-only variants show up in the normal?

one of GATK developers replied to above-mentioned GATK forum

Thanks for letting us know.

ADD REPLYlink written 7 months ago by igor11k

Are you targeting non-coding regions?

I'm not specifically interested in non-coding regions, but I examined them since they showed up to make sure they weren't due to some alignment errors.

Do the variants make sense in terms of where they are on your venn diagram? For example, do the unmatched-only variants show up in the normal?

Unmatched variants rarely shows up, if at all in normal. Therefore, they are new variant calls that unique to tumor-only protocol.

If I have any updates regarding this, I will follow up on this post. Thanks everyone.

ADD REPLYlink written 7 months ago by newbio17240

they are new variant calls that unique to tumor-only protocol

Just to clarify, you see variants in your tumor sample, but not in your normal sample, but they are not being called when you are using that matched normal.

In that case, you can substantially simplify your original question to something like "I see clear variants in my tumor BAM file and not in the matched normal, but why are they not being called?".

ADD REPLYlink modified 7 months ago • written 7 months ago by igor11k

Just to clarify, you see variants in your tumor sample, but not in your normal sample, but they are not being called when you are using that matched normal.

Let me clarify what I was doing with the tumor-only approach. I first call somatic variants using Mutect2 on the tumor BAM file. Then I call germline variants using HaplotypeCaller on the normal BAM file. Between the VCF files generated by both tools, there was almost no overlap. Therefore, I concluded that 3911 (from above venn diagram) are not germline variants that were failed to get filtered out in tumor-only approach.

To reiterate, I have the option to either provide normal & tumor BAM files or provide only tumor BAM file to get a single VCF file. I asked my original question in the hopes of receiving some feedback after providing evidence that two different workflows for calling somatic variants display significantly different results when they should be (ideally) similar. So I'm not sure if I fully agree with your simplified version of my question since that is not what I intended to ask in the first place (why they are not in normal seems self-explanatory if you consider that they are two different variant callers serving different purposes).

ADD REPLYlink modified 7 months ago • written 7 months ago by newbio17240

that is not what I intended to ask in the first place

I realize that is not what you intended to ask, which is why I suggested that should be the first question you should be asking. Your final question has a lot of variables. You should resolve each variable separately. The reason you have poor overlap is because your true somatic variants are not being called in the matched analysis (at least based on what you said in the comments). You need to figure out why they are not being called before you ask why they are not overlapping with variants from a different type of analysis.

why they are not in normal seems self-explanatory if you consider that they are two different variant callers serving different purposes

Now you are adding yet another variable to your question. Why is HaplotypeCaller relevant here? When you say that the variants are not in the normal, did you actually look at the BAM files yourself (as I suggested) or are you looking at HaplotypeCaller results?

ADD REPLYlink written 7 months ago by igor11k

did you actually look at the BAM files yourself

Yeah. I mentioned that I did this for handful of coding regions with no noticeable oddities.

The reason you have poor overlap is because your true somatic variants are not being called in the matched analysis

I agree, but what I'm concerned with aren't the missed calls (false negatives), but the false positives.

Why is HaplotypeCaller relevant here?

This became relevant when I tried to examine the false positives were actually germline variants.

Thanks again for your comments and I am more than interested in other suggestions/recommendations if you have any additional ones.

ADD REPLYlink written 6 months ago by newbio17240
1
gravatar for markus.riester
7 months ago by
markus.riester500 wrote:

4300 somatic calls would be a lot even for WES. Is it possible there is a tumor/normal misalignment?

Otherwise, running it quickly with Mutect 1 would be an easy way to debug. Mutect 1 reports all variants, even if found in PoN or matched normal, so it's easier to figure out the impact of those.

ADD COMMENTlink written 7 months ago by markus.riester500

Thank you for your comment.

I did see much fewer somatic variants when I ran the tumor-only pipeline for my samples before testing with the TCGA samples for troubleshooting. I'm not sure what you mean by misalignment. Do you mean to say that the .bam files might have been generated using different alignment strategies (e.g. hg19 vs hg38)? The data available on GDC goes through a harmonized pipeline once submitted, so I'm not sure if that could be the culprit.

ADD REPLYlink written 7 months ago by newbio17240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1233 users visited in the last hour