Hi everyone!
I am new to whole genome analyses. For guidance, I am referring to 'Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence'.
As of now, I need to call variants in Rhesus Macaque paired-end reads (fasta files).
I used the most recent reference genome available to map them using BWA. Then, duplicates were marked using Picard. The next step is supposed to be: Recalibrate Base Quality Scores. According to this link (https://software.broadinstitute.org/gatk/documentation/article?id=2801), it consists of four sub-steps. The command for the first sub-step is suggested to be the following:
java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I input_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table
My question is about the -knownSites options here.
Is a VCF file listing the known sites available for all organisms? On the NCBI website, I can see that the information (several known SNPs) exists for Macaca mulatta, but I am unable to figure out how to obtain it in VCF format.
I would appreciate any input on this.
Thanks in advance!
Take a look at this GATK thread for additional information.
Thanks! I will give feedback once I try out the suggestions. I later also stumbled upon NCBI's dbSNP repository for macaques.
The BBMap package has a faster and easier option for recalibration, which does not need known sites... Usage: