I am working on identifying SNPs in the genome of an insect that was sequenced in three samples. Following the GATK Best Practices pipeline, I have completed the following tasks:
- Quality-filtered the reads.
- Aligned the reads for each sample (S1, S2, S3).
- Sorted and marked duplicates in each sample (S1, S2, S3).
- Ran RealignerTargetCreator and IndelRealigner on each sample (S1, S2, S3).

Now I am stuck on the next step, base quality score recalibration (BQSR), which requires a database of known SNPs. I don't have any known SNP database, so I am following the alternative steps that GATK describes for non-model organisms:
- I called raw SNPs in GVCF mode for each sample and then ran joint calling to produce a final combined VCF file.
- Next, I applied hard filters to this file and extracted the good-quality SNPs and indels to use as the reference database for the BQSR step.
Now my question: I have three realigned files (S1_realigned.bam, S2_realigned.bam, S3_realigned.bam) from step 4 and a reference database of SNPs and indels (if I am right) from step 6. How should I proceed from here? Should I run the recalibration on each sample separately, or build one combined recalibration table by providing all three BAM files in the same command?
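To make the two options concrete, here is a rough sketch of the commands I have in mind. It only builds and prints the command lines; it does not run GATK. It assumes GATK 3.x syntax (the version that still ships IndelRealigner), and the names `GenomeAnalysisTK.jar`, `reference.fasta`, `filtered_snps.vcf`, `filtered_indels.vcf` and the output table names are placeholders of mine, not files from my actual run.

```python
# Sketch only: prints GATK 3.x BaseRecalibrator command lines, does not execute them.
# All names below except the three *_realigned.bam files are placeholders.
GATK_JAR = "GenomeAnalysisTK.jar"      # assumed location of the GATK 3.x jar
REFERENCE = "reference.fasta"          # placeholder reference genome
KNOWN_SITES = ["filtered_snps.vcf",    # placeholder: hard-filtered SNPs from step 6
               "filtered_indels.vcf"]  # placeholder: hard-filtered indels from step 6
BAMS = ["S1_realigned.bam", "S2_realigned.bam", "S3_realigned.bam"]


def base_recalibrator_cmd(bams, table):
    """Build one BaseRecalibrator command for the given BAMs and output table."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "BaseRecalibrator", "-R", REFERENCE]
    for bam in bams:
        cmd += ["-I", bam]            # one -I per input BAM
    for vcf in KNOWN_SITES:
        cmd += ["-knownSites", vcf]   # one -knownSites per known-variant file
    cmd += ["-o", table]              # output recalibration table
    return cmd


# Option A: one recalibration table per sample.
per_sample_cmds = [
    base_recalibrator_cmd([bam], bam.replace("_realigned.bam", "_recal.table"))
    for bam in BAMS
]

# Option B: a single table built from all three BAMs in one command.
combined_cmd = base_recalibrator_cmd(BAMS, "combined_recal.table")

for cmd in per_sample_cmds + [combined_cmd]:
    print(" ".join(cmd))
```

Option A gives three tables (S1_recal.table, S2_recal.table, S3_recal.table) and option B gives one; which of the two is appropriate is exactly what I am asking.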
Thanks Dr. Deepak