I have a whole-genome sequence from a mouse tumor that presumably contains some interesting somatic mutations. Unfortunately, the model is in a Balb/cJ background rather than the C57BL/6 background from which the GRCm38 reference genome is derived. I would like to analyze the sequence in a strain-specific manner, and I was able to find sequences for Balb/cJ mice from the Sanger Institute Mouse Genome Project. I was hoping someone with a little more experience than me could give some feedback on my strategy before I get too far. I have read the prior questions posted here and found some good tips, but not enough to finalize the plan. My plan so far:
- Align to the GRCm38 reference genome using BWA. I am reluctant to use the Balb/cJ reference genome patches or the Sanger Mouse Genome Balb/cJ sequence because annotating variants later will be difficult if the coordinates do not match the standard reference.
- Use the strain-specific SNV and indel VCF files from Sanger (found here) as known sites for base quality score recalibration (BQSR).
- Use the Balb/cJ BAM files from the Mouse Genome Project (hosted here) as the "normal" sample for Mutect2. I know these won't have been prepared the same way as the "tumor" sample, which might be a problem, but the strain is highly inbred, so the result should be close to what I would get from sequencing the germline of my own mouse.
- Use the GRCm38 reference genome, as above, for the reference sequence in Mutect2.
- Annotate with standard tools that use GRCm38 (SnpEff or ANNOVAR; both can use the rich annotation available in GRCm38 coordinates).
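To make the plan concrete, here is a rough sketch of the commands I have in mind, written as a small Python script that only builds the command lines and does not run anything. All file names and the `BALB_cJ` sample name are placeholders I made up, and `GRCm38.86` is my assumption for the SnpEff database name. The `samtools sort`/`index` and `FilterMutectCalls` steps are not in my list above; I added them because, as far as I understand, GATK needs a sorted, indexed BAM and Mutect2 output is meant to be filtered afterwards. Duplicate marking is left out.

```python
import shlex
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    argv: List[str]
    stdout: Optional[str] = None  # file to redirect stdout into, if any


def build_steps(ref, fq1, fq2, known_sites, normal_bam, normal_sample,
                tumor_sample="tumor", snpeff_db="GRCm38.86"):
    """Build (but do not run) the commands for the plan in the list above."""
    # Literal "\t" separators; bwa expands them in the read-group header.
    read_group = f"@RG\\tID:{tumor_sample}\\tSM:{tumor_sample}\\tPL:ILLUMINA"
    steps = [
        # Align tumor reads to GRCm38 so coordinates match standard annotation.
        Step(["bwa", "mem", "-R", read_group, ref, fq1, fq2], stdout="tumor.sam"),
        # Not in the plan itself: coordinate-sort and index for GATK.
        Step(["samtools", "sort", "-o", "tumor.sorted.bam", "tumor.sam"]),
        Step(["samtools", "index", "tumor.sorted.bam"]),
    ]

    # BQSR, using the strain-specific SNV/indel VCFs as known sites.
    recal = ["gatk", "BaseRecalibrator", "-R", ref,
             "-I", "tumor.sorted.bam", "-O", "recal.table"]
    for vcf in known_sites:
        recal += ["--known-sites", vcf]
    steps.append(Step(recal))
    steps.append(Step(["gatk", "ApplyBQSR", "-R", ref, "-I", "tumor.sorted.bam",
                       "--bqsr-recal-file", "recal.table",
                       "-O", "tumor.recal.bam"]))

    # Mutect2 with the strain BAM as "normal". normal_sample must equal the
    # SM tag in that BAM, and the BAM must use the same GRCm38 contig names.
    steps.append(Step(["gatk", "Mutect2", "-R", ref,
                       "-I", "tumor.recal.bam", "-I", normal_bam,
                       "-normal", normal_sample,
                       "-O", "somatic.unfiltered.vcf.gz"]))
    steps.append(Step(["gatk", "FilterMutectCalls", "-R", ref,
                       "-V", "somatic.unfiltered.vcf.gz",
                       "-O", "somatic.filtered.vcf.gz"]))

    # Annotate in GRCm38 coordinates (SnpEff writes the VCF to stdout).
    steps.append(Step(["java", "-jar", "snpEff.jar", snpeff_db,
                       "somatic.filtered.vcf.gz"],
                      stdout="somatic.annotated.vcf"))
    return steps


def to_shell(step):
    """Render one step as a shell line, with a redirect when needed."""
    line = shlex.join(step.argv)
    return f"{line} > {shlex.quote(step.stdout)}" if step.stdout else line


if __name__ == "__main__":
    plan = build_steps("GRCm38.fa", "tumor_R1.fq.gz", "tumor_R2.fq.gz",
                       ["BALB_cJ.snps.vcf.gz", "BALB_cJ.indels.vcf.gz"],
                       "BALB_cJ.bam", "BALB_cJ")
    for step in plan:
        print(to_shell(step))
```

Printing the commands instead of executing them lets me review each step (and swap in ANNOVAR for the last one) before committing compute time.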
I realize I would have a more accurate alignment if I used the Balb/cJ sequence as the reference for BWA and Mutect2, but I am afraid I will get meaningless coordinates since the richest annotations all seem to be based on the standard GRCm38 reference. I am hoping that by using Balb/cJ SNPs and indels for BQSR and a "paired normal" Balb/cJ sequence to filter variant calls, I will account for strain-specific germline variants while preserving the reference coordinates for later annotation.
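One sanity check I could run on the final calls is to count how many of them coincide with known Balb/cJ variants; if the "paired normal" is doing its job, that number should be close to zero. Below is a minimal sketch in plain Python. It assumes both VCFs are in GRCm38 coordinates with the same contig naming and that indels are represented the same way in both files, and the function names are my own. It also loads all strain variants into memory, which may not be practical for a whole-genome VCF.

```python
def variant_keys(vcf_lines):
    """Yield (chrom, pos, ref, alt) for every ALT allele in VCF text lines."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per ALT allele
            yield (chrom, int(pos), ref, alt)


def split_by_strain(call_lines, strain_lines):
    """Split calls into those absent from / present in the strain VCF."""
    strain = set(variant_keys(strain_lines))
    private, shared = [], []
    for key in variant_keys(call_lines):
        (shared if key in strain else private).append(key)
    return private, shared


if __name__ == "__main__":
    # Tiny made-up records, only to show the expected input shape.
    strain_vcf = ["##fileformat=VCFv4.2",
                  "#CHROM\tPOS\tID\tREF\tALT",
                  "1\t100\t.\tA\tG",
                  "2\t500\t.\tCT\tC"]
    calls_vcf = ["#CHROM\tPOS\tID\tREF\tALT",
                 "1\t100\t.\tA\tG",
                 "3\t42\t.\tG\tT"]
    private, shared = split_by_strain(calls_vcf, strain_vcf)
    print("candidate somatic:", private)
    print("known strain variants:", shared)
```

Because everything stays on GRCm38 coordinates, the comparison is a simple key lookup, which is the main reason I want to avoid switching to a strain-specific reference.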
Somehow this mixing of strains feels sketchy, though. Does anyone have experience with this who can comment on why it won't work and, even better, suggest an alternative method? I might just be missing something obvious, but I have not found any clear examples of strain-specific variant calling or annotation on forums or in the literature.
Hi, did you ever come to a conclusion on this?