I just want to know if my understanding is correct or not.
So for multiple samples, to use GATK for SNP/indel calling, what I should do is:
1. Independently run BWA for alignment and mark duplicates.
2. Independently realign each BAM file and do the recalibration. That gives me, say, A.recal.bam, B.recal.bam, C.recal.bam, ...
3. For the UnifiedGenotyper step (SNP calling), input all of A.recal.bam, B.recal.bam, C.recal.bam and call SNPs together, so that I eventually get one VCF file integrating the SNP calls across all samples.
Am I correct?
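To make step 3 concrete, here is a minimal sketch of how I picture the joint call: one UnifiedGenotyper run with one -I argument per recalibrated BAM. The helper name, the jar path, ref.fasta and all_samples.vcf are placeholders I made up for illustration, and the flags are the classic GATK (pre-4) ones as I understand them. The script only builds and prints the command, it does not run GATK.

```python
from typing import List


def unified_genotyper_cmd(bams: List[str], reference: str, out_vcf: str,
                          gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build a single UnifiedGenotyper command covering all input BAMs.

    Hypothetical helper: file names and the jar path are placeholders.
    """
    cmd = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper", "-R", reference]
    for bam in bams:
        # One -I per sample, so all samples are genotyped in the same run.
        cmd += ["-I", bam]
    # -glm BOTH asks for SNPs and indels; -o is the single multi-sample VCF.
    cmd += ["-glm", "BOTH", "-o", out_vcf]
    return cmd


cmd = unified_genotyper_cmd(
    ["A.recal.bam", "B.recal.bam", "C.recal.bam"],
    reference="ref.fasta",
    out_vcf="all_samples.vcf",
)
print(" ".join(cmd))
```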
Also, GATK recommends:
Finally, if you really want to get the absolute best results, whatever the computational cost, then we recommend doing multiple sample realignment so that novel indels in one sample help to realign reads in other samples
It seems best to merge all BAM files and realign them together, so that indels in one sample can help realignment in the other samples. But in practice, especially when we have a large number of exome samples, this becomes unrealistic because of the extremely high computational cost, right? Thanks.
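As a rough sketch of what multi-sample realignment would involve, assuming the classic GATK (pre-4) RealignerTargetCreator and IndelRealigner tools: both steps take all BAMs at once, which is where the cost comes from. The file names (A.dedup.bam, joint.intervals, ref.fasta) and the helper are made up for illustration, and the -nWayOut option (one realigned BAM per input) is how I understand IndelRealigner to work, so please check it against the documentation for your GATK version. Nothing is executed here, the commands are only printed.

```python
from typing import List, Tuple


def joint_realign_cmds(bams: List[str], reference: str,
                       intervals: str = "joint.intervals",
                       gatk_jar: str = "GenomeAnalysisTK.jar"
                       ) -> Tuple[List[str], List[str]]:
    """Build the two commands for realigning several samples together.

    Hypothetical helper: all file names are placeholders.
    """
    base = ["java", "-jar", gatk_jar]
    # Every sample is passed to both steps with its own -I.
    inputs = [arg for bam in bams for arg in ("-I", bam)]

    # Step 1: find intervals that need realignment, using all samples.
    targets = (base + ["-T", "RealignerTargetCreator", "-R", reference]
               + inputs + ["-o", intervals])

    # Step 2: realign all samples over those shared intervals.
    # Assumption: -nWayOut writes one output BAM per input BAM.
    realign = (base + ["-T", "IndelRealigner", "-R", reference]
               + inputs
               + ["-targetIntervals", intervals, "-nWayOut", ".realigned.bam"])
    return targets, realign


targets_cmd, realign_cmd = joint_realign_cmds(
    ["A.dedup.bam", "B.dedup.bam", "C.dedup.bam"], reference="ref.fasta"
)
print(" ".join(targets_cmd))
print(" ".join(realign_cmd))
```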
I thought about this for a while, and I would say there's no problem with first getting independent recal.bam files. But the next step can be done in different ways:
1. Merge all BAMs together and call SNPs. (This is impractical when the total sample number is very large, say 200, so let's forget about this.)
2. Merge all BAMs in a trio together and call SNPs.
3. Call SNPs independently for each BAM file, then merge the VCFs of the members of each trio into one big VCF per trio.
I'm just curious: for options 2 and 3, will the final result for each trio be the same or different?
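One place where the two options could diverge is the merge step in option 3. Here is a toy sketch, assuming each per-sample VCF only lists the sites where that sample was called variant (I believe that is the default behaviour, but treat it as an assumption). After merging, a trio member with no record at a site ends up as a no-call ("./."), so you cannot tell homozygous reference from no coverage, whereas a joint call would genotype every trio member at every reported site. The sample names, positions and genotypes below are invented.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def merge_per_sample_calls(
    calls: Dict[str, Dict[Site, str]]
) -> Dict[Site, Dict[str, str]]:
    """Merge per-sample variant calls into one table per site.

    A sample with no record at a site gets "./." (missing genotype),
    mimicking what a plain VCF merge can report.
    """
    samples: List[str] = sorted(calls)
    sites = sorted({site for per_sample in calls.values() for site in per_sample})
    return {
        site: {sample: calls[sample].get(site, "./.") for sample in samples}
        for site in sites
    }


# Invented per-sample calls: only variant sites are listed for each member.
trio_calls = {
    "father": {("chr1", 1000): "0/1"},
    "mother": {("chr1", 1000): "0/1", ("chr1", 2000): "0/1"},
    "child": {("chr1", 2000): "1/1"},
}

merged = merge_per_sample_calls(trio_calls)
for site, genotypes in merged.items():
    print(site, genotypes)
```

In this toy table the child at chr1:1000 and the father at chr1:2000 come out as "./.", which is exactly the information a joint call on the trio would have filled in with a real genotype.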