Very few somatic mutations/variants using GATK's best practices (mouse exome sequencing)
0
2
Entering edit mode
6.3 years ago
clfougner ▴ 70

I'm working on somatic variant calling from mouse exome sequencing data, and my pipeline is based off of GATK's best practices. I have the entire pipeline set up, from FASTQ files to VCF files, including all preprocessing. However, when I run it, only 1-3 variants come out that pass MuTect's filters (the vast majority of variants not passing MuTect's filters fail because the variant is found in the normal/non-tumor sample). Given my mouse model and the fact that this occurs in all samples I've analyzed thus far, I find it extremely unlikely that it is actually the case that there are so few variants in the tumor samples.

I'm wondering if anyone with experience in mouse exome sequencing analysis could have a look through my code to see if there are any obvious errors?

I've documented all my work on my GitHub which can be found here: https://github.com/clfougner/MouseExomeSequencing. I'm confident that the error is in the preprocessing, not the variant calling itself. This can be found in lines 102 to 268 in the file EntirePipeline.sh. While this may seem like a lot, note that I've used new lines judiciously in an attempt to make it as readable as possible - there are only 11 functions in these lines.

I've written the README as a tutorial, because the vast majority of tutorials I've found online are not for mice and are outdated. I acknowledge that a review of my pipeline is a lot to ask for, but hopefully the fact that it should be a useful resource for future researchers working through the same issue acts as a small incentive.

Thanks!

Mutect GATK Mouse Exome Sequencing • 4.2k views
2
Entering edit mode

What is the depth of your data, and was there by any chance your tumor samples had normal cell contamination? Do you have any idea about that? This might interfere. The pipeline seems more or less fine to me but cannot judge unless run it but here are some inputs I can give:

1) Did you check with VarScan2 or just GATK variant calling and see the result rather than only Mutect2. Mutect is usually for low sensitive variants that might miss out with other callers.

2) Did you see what is the Ts/Tv ratio of your data ? How were the plots of analyze covariates? May be you have to lower down the sensitivities if your depth of sequencing is low or having contamination.

3) Try other variant callers as well.

4) Put the bams in IGV or other browser to scan for regions you know usually have variants for this strain to see how these qualities show up.

5) Are you also saying the Mutect2 is giving the final vcf file with only few variants or you are using the fpfilter as well?

0
Entering edit mode

Thanks for the quick response!

My data has a 100x coverage. As with any tumor I'm sure there's some normal cell contamination, but the tumors were harvested from mice and the contamination should be low.

1) I ran VarScan overnight, and I got a list of over 4000 variants. Definitely seems more odd now that I'm getting so few variants out of MuTect! I forgot to mention: my sequencing service provider also did some basic preprocessing (cutadapt, sickle, BWA-mem and dedupping, using mm9 instead of mm10), and when I run MuTect on their preprocessed files I get a list of approximately 2000 variants. Makes me think my issue may simply lie in a syntax error somewhere, but I'm not finding any...

2) Ts/Tv is 2.5-2.6 for all samples. Note that this is calculated based off of a comparison with the reference genome and not the matched normal; this should therefore likely be taken with a grain of salt as my samples are from a different mouse strain than the reference genome.

3) Tried VarScan, got results.

4) Have had a look at the bams in IGV, and I'm finding a number of variants (corresponding to those i found from the Varscan result).

5) The output from Mutect2 contains a long list of variants, however in all but 1-3 of them the variant is also found in the matched normal tissue. These variants are likely appearing because of the fact that the reference genome is from a different strain (C57/Bl6) to mine (FVB/N)

0
Entering edit mode

The depth seems adequate and definitely you can do some purity test as well to check that and then rerun the pipeline.

Definitely VarSca gives a long list as you mentioned. You can trim that down to more significant with the script fpfilter.sh which Dan Koboldt and others advice and finally annotate them to see which are the ones that directs to the phenotype. If you run Mutect2 and VarScan on mm9 bams from your service provider how comparable are the variants? I think that will help you understand whether there is a catch in your mm10 workflow for some specific file parser.

The Ts/Tv ratios seems fine. So definitely there is a problem with the mutect call since you can identify the variants in IGV for the ones identified by VarScan2.

To your last point, when you say long list of variants with mutect however you can see some of them in matched normal tissue. Dont you have a reference strain of FVB/N (sorry for my ignorance, not worked in it). Even if it is not , are you not running somatic calls? You need to run somatic calls which I was believing you are , separate the germline and somatic calls and then still if you see a lot of somatic variants you can trim them down with fpfilter.sh . Finally you can annotate them with any standard annotation tool.

Remember one thing the read cut off for both the tools are different in VarScan2 and Mutect2. So try to keep them same by tweaking the parameter and then run both for somatic calls, separate somatic hc output and then trim them down to more significant calls and annotate. Perform this step on both mm9 and mm10 and try to see the discrepancy. It might be due to technical error in workflow only.

1
Entering edit mode

Quick update: I've figured out the (now somewhat embarrassing) issue! I'd been using a placeholder for the read group names (which I figured was inconsequential because it's "just" a name), which was the same for tumor and and normal samples. Once I started using the proper read group names, I've started getting results more along the lines of what I'd expected.

Thanks for the help!