I am new to whole genome analyses. For guidance, I am referring to 'Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence'.
As of now, I need to call variants in Rhesus Macaque paired-end reads (fasta files).
I mapped the reads to the most recent reference genome available using BWA, then marked duplicates using Picard. The next step is supposed to be base quality score recalibration. According to this link (https://software.broadinstitute.org/gatk/documentation/article?id=2801), it consists of four sub-steps. The suggested command for the first sub-step is the following:
    java -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R reference.fa \
        -I input_reads.bam \
        -L 20 \
        -knownSites dbsnp.vcf \
        -knownSites gold_indels.vcf \
        -o recal_data.table
My question is about the -knownSites options here.
Is a VCF file listing the known sites available for all organisms? On the NCBI website, I can see that this information (several known SNPs) exists for Macaca mulatta, but I cannot figure out how to obtain it in VCF format.
I would appreciate any input.
Thanks in advance!