Hello, I'm trying to run the GATK base recalibration tool (BaseRecalibrator) as a step toward eventually mapping against the mouse mm39 genome. I have already created the reference and index files from my .fa genome, and the tool requires this argument:
--known-sites / NA One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of Feature-containing files (VCF, BCF, BED, etc.) for use as this database. For users wishing to exclude an interval list of known variation, simply use -XL my.interval.list to skip over processing those sites. Please note, however, that the statistics reported by the tool will not accurately reflect those sites skipped by the -XL argument.
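For context, here is a minimal sketch of how the call might be assembled once a known-sites file is available. It only builds the command line and does not run GATK. The file names are hypothetical placeholders, and the flags (-R, -I, --known-sites, -O) are the BaseRecalibrator arguments as described in the GATK4 documentation, so treat this as an illustration rather than a verified pipeline step.

```python
import shlex


def build_bqsr_command(reference, bam, known_sites, output_table):
    """Build a GATK4 BaseRecalibrator command line as a list of arguments."""
    if not known_sites:
        # The tool needs at least one known-sites resource.
        raise ValueError("at least one known-sites file is required")
    cmd = ["gatk", "BaseRecalibrator", "-R", reference, "-I", bam]
    for path in known_sites:
        # --known-sites is repeated once per VCF/BCF/BED file.
        cmd += ["--known-sites", path]
    cmd += ["-O", output_table]
    return cmd


# Hypothetical placeholder file names, not real paths.
command = build_bqsr_command(
    "mm39.fa",
    "sample.sorted.bam",
    ["known_snps.vcf.gz", "known_indels.vcf.gz"],
    "recal.table",
)
print(shlex.join(command))
```

Building the command as a list keeps each path a single argument, so it can be passed to subprocess.run without shell quoting problems.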
Where can I get these database files in the proper format for mm39 specifically (mouse genome)? I found this site: https://www.mousegenomes.org/snps-indels/ which leads to https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-1505-SNPs_Indels/
However, I am not sure whether these are in the correct format. Ultimately, I want to go from this step to a mapped BAM file and then to a VCF.
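One part of "correct format" that can be checked mechanically is whether the contig names in a known-sites VCF match those in the reference index (for example 1 versus chr1). Below is a minimal sketch of that check, assuming a samtools-style .fai index and a plain-text VCF. The function names and the tiny inline data are made up for illustration, and matching names alone do not prove that the VCF was called against the same assembly.

```python
def fai_contigs(fai_lines):
    """Contig names from .fai lines (the name is the first tab-separated column)."""
    return {line.split("\t")[0] for line in fai_lines if line.strip()}


def vcf_contigs(vcf_lines):
    """Contig names used in VCF records (CHROM is the first column)."""
    return {
        line.split("\t")[0]
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_reference(fai_lines, vcf_lines):
    """Contigs that appear in the VCF records but not in the reference index."""
    return sorted(vcf_contigs(vcf_lines) - fai_contigs(fai_lines))


# Made-up example data: lengths and offsets are not real mm39 values.
fai = ["chr1\t1000\t6\t60\t61", "chr2\t900\t1030\t60\t61"]
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr2\t200\t.\tC\tT\t.\tPASS\t.",
]
missing = missing_from_reference(fai, vcf)
print(missing)  # ['1'] because the VCF uses "1" where the reference uses "chr1"
```

For real files, the same functions could be fed the lines of the .fai file and of a decompressed VCF (for example via gzip.open in text mode).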
Sorry if this is a basic question. I am a software engineer working in a biology context, so a lot of this is unfamiliar and I have to learn as I go; my position does not give me time to sit down with a book and learn everything properly.
Thanks!