Question: pre-processing whole genome data
I am new to whole genome analyses. For guidance, I am referring to 'Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence'.

As of now, I need to call variants in Rhesus Macaque paired-end reads (fasta files).

I mapped the reads to the most recent available reference genome with BWA and then marked duplicates with Picard. The next step is supposed to be base quality score recalibration (BQSR). According to this link ( ), it consists of four sub-steps. The suggested command for the first sub-step is:

    java -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R reference.fa \
        -I input_reads.bam \
        -L 20 \
        -knownSites dbsnp.vcf \
        -knownSites gold_indels.vcf \
        -o recal_data.table
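
As the command above shows, -knownSites is given once per VCF file. For an organism where only one such file (or a different set of files) is on hand, the same call can be assembled for any number of known-sites files. The following is only a minimal Python sketch of that idea: the function name and its parameters are hypothetical, the file names are the placeholders from the command above, and only flags that already appear in that command are used.

```python
import shlex


def build_base_recalibrator_cmd(reference, bam, known_sites, out_table,
                                interval=None, jar="GenomeAnalysisTK.jar"):
    """Return the argument list for the BaseRecalibrator call quoted above.

    Hypothetical helper: it only rearranges flags that appear in the
    quoted command (-T, -R, -I, -L, -knownSites, -o).
    """
    if not known_sites:
        # Assumption of this sketch: insist on at least one known-sites file.
        raise ValueError("at least one known-sites VCF is required")

    cmd = ["java", "-jar", jar, "-T", "BaseRecalibrator",
           "-R", reference, "-I", bam]
    if interval is not None:
        # -L restricts the run to an interval, e.g. "20" in the quoted command.
        cmd += ["-L", interval]
    for vcf in known_sites:
        # One -knownSites flag per VCF, as in the quoted command.
        cmd += ["-knownSites", vcf]
    cmd += ["-o", out_table]
    return cmd


if __name__ == "__main__":
    cmd = build_base_recalibrator_cmd(
        "reference.fa", "input_reads.bam",
        ["dbsnp.vcf", "gold_indels.vcf"], "recal_data.table",
        interval="20")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run as is. The interval is optional here because "20" is a contig name specific to the reference used in the quoted command and may not exist in another reference.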

My question is about the -knownSites options here.

Is a VCF file of known sites available for every organism? On the NCBI website I can see that several known SNPs are listed for Macaca mulatta, but I cannot figure out how to obtain them in VCF format.

I would appreciate any input.

Thanks in advance!

Take a look at this GATK thread for additional information.

written 3.4 years ago by genomax (89k)

Thanks! I will give feedback once I have tried out the suggestions. I also later stumbled upon the NCBI dbSNP collection for macaques.

written 3.4 years ago by br.tania (40)

The BBMap package has a faster and easier option for recalibration, which does not need known sites... Usage: in=mapped.bam ref=reference.fa ploidy=2 callvariants in=mapped.bam out=recalibrated.bam recalibrate
written 3.4 years ago by Brian Bushnell (17k)
