I'm working on somatic variant calling from mouse exome sequencing data, and my pipeline is based on GATK's Best Practices. I have the entire pipeline set up, from FASTQ files to VCF files, including all preprocessing. However, when I run it, only 1-3 variants pass MuTect's filters; the vast majority of the rejected variants fail because the variant is also found in the normal (non-tumor) sample. Given my mouse model and the fact that this occurs in all samples I've analyzed thus far, I find it extremely unlikely that the tumor samples really contain so few variants.
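For anyone who wants to check the same breakdown on their own output, here is a minimal sketch of how the rejection reasons can be tallied. It is not code from the repository. It assumes the caller writes its reasons to the standard FILTER column (column 7) of the VCF, separated by semicolons; the function name `tally_filters` and the filter names in the toy records are made up for illustration.

```python
from collections import Counter


def tally_filters(vcf_lines):
    """Count how often each FILTER value occurs in an iterable of VCF lines.

    Assumption: reasons are stored in the FILTER column (7th tab-separated
    field), with multiple reasons joined by ';' as in the VCF specification.
    """
    counts = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed record, ignore it
        for reason in fields[6].split(";"):
            counts[reason] += 1
    return counts


# Toy records with hypothetical filter names, only to show the call.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t200\t.\tG\tC\t.\tfilter_a;filter_b\t.\n",
    "chr2\t300\t.\tC\tT\t.\tfilter_a\t.\n",
]
for reason, n in tally_filters(example).most_common():
    print(f"{n}\t{reason}")
```

On a real file, the same function can be fed an open file handle instead of the toy list, since it only needs an iterable of lines.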
I'm wondering if anyone with experience in mouse exome sequencing analysis could have a look through my code to see if there are any obvious errors?
I've documented all my work on GitHub: https://github.com/clfougner/MouseExomeSequencing. I'm confident that the error is in the preprocessing, not the variant calling itself. The preprocessing steps are in lines 102 to 268 of the file EntirePipeline.sh. While this may seem like a lot, note that I've used line breaks liberally to make it as readable as possible; there are only 11 functions in these lines.
I've written the README as a tutorial, because the vast majority of tutorials I've found online are not for mice and are outdated. I acknowledge that a review of my pipeline is a lot to ask for, but hopefully the fact that it should be a useful resource for future researchers working through the same issue acts as a small incentive.