Question: GATK Multi-Sample Calling
7.1 years ago by
Bioscientist1.7k wrote:

I just want to know if my understanding is correct or not.

So for multi-sample SNP/indel calling with GATK, what I should do is:

1. Run BWA alignment and mark duplicates independently for each sample.
2. Realign each BAM file independently and do the recalibration.
Then I get, say, A.recal.bam, B.recal.bam, C.recal.bam, ...
3. For the UnifiedGenotyper step (SNP calling), I can input A.recal.bam, B.recal.bam, C.recal.bam, ... together and call SNPs jointly, so that I end up with one VCF file integrating the SNP calls across all samples.
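
The per-sample part of this workflow (step 2) could be scripted roughly as follows. This is only a sketch that builds command lines without running them. It assumes GATK 3.x-style syntax (`GenomeAnalysisTK.jar -T ...`), which may not match the GATK version in use, and the reference, known-sites file, and `.dedup.bam` input names are placeholders that do not come from this thread.

```python
# Sketch: build (not run) per-sample realignment + recalibration commands.
# Assumes GATK 3.x-style arguments; all file names are hypothetical placeholders.

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed jar location
REF = "reference.fasta"                           # hypothetical reference
KNOWN = "dbsnp.vcf"                               # hypothetical known-sites file


def per_sample_commands(sample):
    """Return four GATK commands taking <sample>.dedup.bam to <sample>.recal.bam."""
    dedup = f"{sample}.dedup.bam"          # duplicate-marked input (assumed name)
    targets = f"{sample}.intervals"        # realignment targets
    realigned = f"{sample}.realigned.bam"
    table = f"{sample}.recal.table"
    recal = f"{sample}.recal.bam"
    return [
        GATK + ["-T", "RealignerTargetCreator", "-R", REF, "-I", dedup,
                "-o", targets],
        GATK + ["-T", "IndelRealigner", "-R", REF, "-I", dedup,
                "-targetIntervals", targets, "-o", realigned],
        GATK + ["-T", "BaseRecalibrator", "-R", REF, "-I", realigned,
                "-knownSites", KNOWN, "-o", table],
        GATK + ["-T", "PrintReads", "-R", REF, "-I", realigned,
                "-BQSR", table, "-o", recal],
    ]


if __name__ == "__main__":
    for sample in ["A", "B", "C"]:
        for cmd in per_sample_commands(sample):
            print(" ".join(cmd))
```

Each sample is processed on its own here, so the loop over samples can be spread across cluster jobs; only the calling step afterwards needs all the recal.bam files at once.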

Am I correct?

Also, the GATK documentation recommends:

Finally, if you really want to get the absolute best results, whatever the computational cost, then we recommend doing multiple sample realignment so that novel indels in one sample help to realign reads in other samples

It seems it's best to merge all BAM files and realign them together, so that indels in one sample can help realignment in the other samples. But in practice, especially when we have a large number of exome samples, this becomes unrealistic due to the extremely high computational cost, right? Thanks.

Edit: I thought about this for a while, and I'd say there's no problem with first getting independent recal.bam files. After that, though, we can proceed in different ways:

1. Merge all BAM files together and call SNPs. (This is impractical when the total number of samples is very large, say 200, so let's forget about this.)
2. Merge all BAM files in a trio together and call SNPs.
3. Call SNPs independently for each BAM file, then merge the VCFs of the members of each trio into one VCF per trio.

I'm just curious: for options 2 and 3, will the final result for each trio be the same or different?


gatk • 9.9k views
7.1 years ago by
Rochester, NY USA
Alex Paciorkowski3.3k wrote:

Generally people I know run BWA alignment separately on individual fastq files. When it comes to calling SNPs and indels on BAM files, we merge them into related familial trios, if that is the type of experiment being done.

You can put .bam files from a related trio together into a .bam.list file, and run GATK UnifiedGenotyper on that.

The contents of the .bam.list file are:


and so on...
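
For illustration, such a list file is plain text with one BAM path per line, and as far as I know GATK only treats the `-I` argument as a list when the file name ends in `.list`. The sketch below writes a list file and assembles a GATK 3.x-style UnifiedGenotyper command without running it. The BAM names, reference, and output name are made up for the example.

```python
# Sketch: write a trio .bam.list file and assemble a UnifiedGenotyper command.
# Assumes GATK 3.x-style arguments; every path below is a hypothetical placeholder.
from pathlib import Path


def write_bam_list(bams, list_path):
    """Write one BAM path per line and return the list file path."""
    Path(list_path).write_text("\n".join(bams) + "\n")
    return list_path


def unified_genotyper_cmd(bam_list, out_vcf, ref="reference.fasta"):
    """Build (but do not run) a joint SNP + indel calling command."""
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "UnifiedGenotyper",
        "-R", ref,
        "-I", bam_list,
        "-glm", "BOTH",   # genotype likelihoods model: SNPs and indels
        "-o", out_vcf,
    ]


if __name__ == "__main__":
    trio = ["father.recal.bam", "mother.recal.bam", "child.recal.bam"]
    bam_list = write_bam_list(trio, "trio1.bam.list")
    print(" ".join(unified_genotyper_cmd(bam_list, "trio1.vcf")))
```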

If you need to, you can provide -L interval lists to GATK, and parallelize the process.
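
One way to use `-L` for parallelization is to generate one calling command per interval and submit each as its own job. The sketch below assumes GATK 3.x-style arguments and splits by chromosome; the chromosome names, list file, and output prefix are hypothetical and would need to match the actual reference.

```python
# Sketch: one UnifiedGenotyper command per interval, for parallel execution.
# Assumes GATK 3.x-style arguments; paths and chromosome names are hypothetical.

def per_interval_cmds(bam_list, intervals, ref="reference.fasta", prefix="trio1"):
    """Return {interval: command}; each command restricts calling with -L."""
    cmds = {}
    for interval in intervals:
        cmds[interval] = [
            "java", "-jar", "GenomeAnalysisTK.jar",
            "-T", "UnifiedGenotyper",
            "-R", ref,
            "-I", bam_list,
            "-L", interval,                    # restrict calling to this interval
            "-o", f"{prefix}.{interval}.vcf",  # one output VCF per interval
        ]
    return cmds


if __name__ == "__main__":
    chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    for interval, cmd in per_interval_cmds("trio1.bam.list", chroms).items():
        print(" ".join(cmd))
```

The per-interval VCFs then have to be concatenated back into a single file for the trio.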

This will then produce a .vcf file for that trio, which you can annotate further using SeattleSeq. With downstream scripting you can then determine common and unique SNPs between family members, depending on what your hypothesized disease model is.
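
As a rough idea of what that downstream scripting might look like, the sketch below reads a multi-sample VCF as plain text and reports, for each site, which samples carry a non-reference allele. It is a simplified illustration, not a full VCF parser: it ignores FILTER and genotype quality, and the sample names and records in the demo are invented.

```python
# Sketch: group variant sites in a multi-sample VCF by which samples carry
# a non-reference allele. Plain-text parsing only; demo data is hypothetical.

def carriers_by_site(vcf_lines):
    """Yield (chrom, pos, ref, alt, carriers) for each VCF record."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        carriers = set()
        for name, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index].replace("|", "/")
            # A sample is a carrier if any called allele is non-reference.
            if any(allele not in ("0", ".") for allele in gt.split("/")):
                carriers.add(name)
        yield chrom, int(pos), ref, alt, frozenset(carriers)


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tchild",
        "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/1",
        "1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t0/0\t0/1",
    ]
    for chrom, pos, ref, alt, carriers in carriers_by_site(demo):
        print(chrom, pos, ref, alt, sorted(carriers))
```

Under a de novo model you would look for sites carried by the child only; under a recessive model you would filter on the actual genotypes rather than simple carrier status.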


Thanks, Alex, but please see my edit. If I call the BAM files in a trio independently, will the results be different?

Reply written 7.1 years ago by Bioscientist

@bioscientist: Potentially, yes, your results may be different if analyzed separately. This is from GATK v3 best practices for variant detection:

"The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference."

Reply written 7.1 years ago by Alex Paciorkowski



In other words, there are crappy calls that are artifacts of sequencing that will make it into your VCF. Individually, a crap call in one .bam file may not get tossed statistically by GATK, and will make it into your VCF. If you exome sequenced a related trio on the same run, it is likely they will have the same crappy calls generated from that run, and if you group those .bam files together, you increase GATK's chances of tossing a whole lot more of them. Good luck!

Reply written 7.1 years ago by Alex Paciorkowski

Thanks! That makes sense!

Reply written 7.1 years ago by Bioscientist

