GATK Multi-Sample Calling
12.1 years ago
Bioscientist ★ 1.7k

I just want to know if my understanding is correct or not.

For multi-sample SNP/indel calling with GATK, what I should do is:

1. Independently run BWA for alignment and mark duplicates for each sample;
2. Independently realign each BAM file and do the recalibration.
Then I get, say, A.recal.bam, B.recal.bam, C.recal.bam...
3. Then for the UnifiedGenotyper step (SNP calling), I can input all of A.recal.bam, B.recal.bam, C.recal.bam and call SNPs together, so that I eventually get one VCF file integrating SNP calls across all samples.

Am I correct?

Also, GATK recommends:

Finally, if you really want to get the absolute best results, whatever the computational cost, then we recommend doing multiple sample realignment so that novel indels in one sample help to realign reads in other samples

It seems it's best to merge all BAM files and do realignment together, so that indels in one sample can help realignment in the other samples. But in practice, especially when we have many exome samples, this becomes unrealistic due to the extremely high computational cost, right? Thanks.

Edit: I thought about it for a while, and I'd say there's no problem with first getting independent recal.bam files. But next we can proceed in different ways:

1. Merge all BAMs together and call SNPs. (This is impractical when the total sample number is very large, say 200.) So let's forget about this.
2. Merge all BAMs in a trio together and call SNPs.
3. Call SNPs independently for each BAM file, then merge the VCFs of the members of each trio into one big VCF per trio.

I'm just curious: for options 2 and 3, will the final result for each trio be different or the same?

thx

12.1 years ago

Generally, people I know run BWA alignment separately on individual FASTQ files. When it comes to calling SNPs and indels on BAM files, we group them by related familial trio, if that is the type of experiment being done.

You can list the .bam files from a related trio in a .bam.list file and run GATK UnifiedGenotyper on that.

The contents of the .bam.list file are:

/path/to/bamfile1.bam
/path/to/bamfile2.bam

and so on...
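As a rough sketch of the step above, the following Python writes such a list file and assembles a GATK 3-style UnifiedGenotyper command. The helper names, the jar name `GenomeAnalysisTK.jar`, and the choice of `-glm BOTH` (SNPs and indels) are assumptions for illustration, not something prescribed in this thread; adjust them to your installation.

```python
from pathlib import Path


def write_bam_list(bam_paths, list_path):
    """Write one BAM path per line and return the list file's path.

    GATK 3 recognises a file of BAM paths by its .list extension,
    so keep a name such as trio.bam.list.
    """
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in bam_paths))
    return list_path


def unified_genotyper_cmd(bam_list, reference, out_vcf,
                          gatk_jar="GenomeAnalysisTK.jar"):
    """Build (but do not run) a UnifiedGenotyper command as an argument list.

    gatk_jar is an assumed default; point it at your own GATK 3 jar.
    """
    return [
        "java", "-jar", gatk_jar,
        "-T", "UnifiedGenotyper",
        "-R", str(reference),   # reference FASTA
        "-I", str(bam_list),    # .list file naming every BAM in the trio
        "-glm", "BOTH",         # call SNPs and indels (assumed choice)
        "-o", str(out_vcf),     # one multi-sample VCF for the trio
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.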

If you need to, you can pass intervals or interval lists to GATK with -L and parallelize the process by running one job per interval.
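One way to picture the -L parallelization is one UnifiedGenotyper job per interval, each writing its own VCF. The sketch below assumes a GATK 3-style command line; the function names, the jar name, and the output naming scheme are hypothetical, and the per-interval VCFs still have to be combined afterwards with a tool of your choice.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def per_interval_cmds(bam_list, reference, intervals, out_prefix,
                      gatk_jar="GenomeAnalysisTK.jar"):
    """Return one UnifiedGenotyper command per interval (e.g. per chromosome)."""
    cmds = []
    for interval in intervals:
        # Make the interval safe to use in a file name (chr1:1-1000 -> chr1_1-1000).
        out_vcf = f"{out_prefix}.{interval.replace(':', '_')}.vcf"
        cmds.append([
            "java", "-jar", gatk_jar,
            "-T", "UnifiedGenotyper",
            "-R", str(reference),
            "-I", str(bam_list),
            "-L", interval,      # restrict this job to one interval
            "-o", out_vcf,
        ])
    return cmds


def run_all(cmds, max_workers=4):
    """Run the commands concurrently; raises if any job exits non-zero."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c, check=True), cmds))
```

For example, `per_interval_cmds("trio.bam.list", "ref.fasta", ["chr1", "chr2"], "trio")` yields two commands writing `trio.chr1.vcf` and `trio.chr2.vcf`.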

This will then produce a .vcf file for that trio, which you can annotate further using SeattleSeq. With downstream scripting you can then determine common and unique SNPs between family members, depending on what your hypothesized disease model is.
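The "downstream scripting" could look something like the sketch below, which reads a multi-sample VCF and separates sites carried by every family member from sites carried by only one. It is a minimal illustration in plain Python with hypothetical function names; it treats any non-reference allele in the GT field as "carries the variant" and ignores filters and quality, which a real analysis would not.

```python
def carriers_by_site(vcf_lines):
    """Map each site to the set of samples with a non-reference genotype."""
    samples = []
    sites = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        carriers = set()
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            # "0" is the reference allele, "." is a missing call
            if any(a not in ("0", ".") for a in alleles):
                carriers.add(sample)
        sites[(chrom, int(pos), ref, alt)] = carriers
    return samples, sites


def shared_and_private(samples, sites):
    """Split sites into those carried by all samples and those unique to one."""
    everyone = set(samples)
    shared = [site for site, c in sites.items() if c == everyone]
    private = {
        name: [site for site, c in sites.items() if c == {name}]
        for name in samples
    }
    return shared, private
```

With a trio VCF opened as `open("trio.vcf")`, `shared_and_private(*carriers_by_site(fh))` gives the common sites and a per-member list of unique ones, which can then be filtered according to the disease model.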


Thanks, Alex, but please see my edit. If I run the BAM files in a trio independently, will the results be different?


@bioscientist: Potentially, yes, your results may be different if analyzed separately. This is from GATK v3 best practices for variant detection:

"The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference."


In other words, there are crappy calls that are artifacts of sequencing that will make it into your VCF. Individually, a crap call in one .bam file may not get tossed statistically by GATK, and will make it into your VCF. If you exome sequenced a related trio on the same run, it is likely they will have the same crappy calls generated from that run, and if you group those .bam files together, you increase GATK's chances of tossing a whole lot more of them. Good luck!


thanks! that makes sense!
