Question: Order Of GATK Commands
8.4 years ago by
Ashutosh Pandey12k wrote:

I have a mouse genome that was sequenced using 5 different mate-pair libraries, and each library was run on 3 lanes of an Illumina machine. I first aligned the reads at the lane level, resulting in 15 BAM files. Then I merged all the BAM files (lanes) from the same library into a single BAM file, resulting in 5 "single library BAM" files in total (one per mate-pair library). I want to use GATK to perform indel realignment, dedup and base quality score recalibration.

Assuming I have enough computational resources to run the GATK tools even on big BAM files, what is the correct order for performing these steps? I personally think I should:

1) Perform "IndelRealigner" at the library level, that is, on each "single library BAM" file separately.
2) Perform the "Dedup" step at the library level to remove or mark redundant reads.
3) Use the "TableRecalibration" tool to perform quality score recalibration at the single-lane (read group ID) level. The GATK manual mentions that although a "single library BAM" file may contain reads from different read groups or lanes, GATK will perform the recalibration at the lane level if an RGID is provided in the BAM file for each lane.
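To make the lane-versus-library distinction in step 3 concrete, here is a minimal sketch of the read group header lines such a merged design would carry. It assumes a hypothetical naming scheme (`lib1.lane1`, sample name `mouse1`); the tag names (`ID`, `SM`, `LB`, `PL`, `PU`) are the standard SAM `@RG` fields. The idea is that each lane gets its own `ID` (the unit recalibration works on), while the three lanes of one library share the same `LB` (the field that duplicate marking tools such as Picard MarkDuplicates use to identify a library).

```python
# Sketch only: the naming scheme and sample name are hypothetical.
# One @RG line per lane; lanes of the same library share the LB value.

def read_group_lines(sample, n_libraries=5, lanes_per_library=3, platform="ILLUMINA"):
    """Return tab-separated @RG header lines, one per (library, lane) pair."""
    lines = []
    for lib in range(1, n_libraries + 1):
        for lane in range(1, lanes_per_library + 1):
            rg_id = f"lib{lib}.lane{lane}"  # unique per lane
            fields = [
                "@RG",
                f"ID:{rg_id}",     # read group ID: one per lane
                f"SM:{sample}",    # same sample for every read group
                f"LB:lib{lib}",    # library: shared by the lanes of that library
                f"PL:{platform}",
                f"PU:{rg_id}",     # platform unit (hypothetical value)
            ]
            lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    for line in read_group_lines("mouse1"):
        print(line)
```

With 5 libraries and 3 lanes each, this gives 15 distinct read group IDs but only 5 distinct library values, which is what lets dedup act per library and recalibration act per lane within the same BAM file.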

But I read a few recent papers that have exactly the same situation as mine (1 sample -> multiple libraries -> each library run across more than one lane, no barcoding), where indel realignment was performed at the lane level (on each single-lane file), then the recalibration step was performed for each BAM file separately, and finally the lanes coming from the same library were merged together to form five "single library BAM" files.

I just want to make sure I am doing things the correct way.


gatk bam library • 3.9k views
8.4 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

I guess that the best practice would be to follow GATK's advice for best practices, wouldn't it?

I particularly use the "better" suggestion, since the merging step of the "best" suggestion has always given me problems due to internal sample labeling on SOLiD platforms. We would use it only on small targeted resequencing projects, but we've found that all the steps suggested as "better" lead to fairly believable results.


Yeah, I tend to go with GATK's best practices as well; they are pretty straightforward and seem to work. I would use the better option, but I often only have 1-3 exome samples per project, and I've never been sure whether doing VQSR with samples from different projects (different diseases and families) is a good idea or not.

Reply written 8.4 years ago by DG7.2k

That's exactly the point I was trying to make. If you have mixed things, it doesn't seem reasonable to treat them as a mixture. Sure, if you work constantly with the same kits, reagents, sample types, etc., using the merging step of the best practices would be wise, but that is very rarely the case in our lab... to date ;)

Reply written 8.4 years ago by Jorge Amigo12k
8.4 years ago by
United States
Zev.Kronenberg11k wrote:

I don't know if I do it the "correct way", but here is my approach:

Align and de-dup separately.

Sort and merge together with read groups.

Generate indel target intervals.

Run indel realignment.
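The four steps above could be sketched roughly as follows. This is a dry-run helper that only builds and prints the command lines and does not execute anything. It assumes Picard (`MarkDuplicates`, `MergeSamFiles`) for de-duplication and merging, and GATK 3.x-style walkers (`RealignerTargetCreator`, `IndelRealigner`); the answer does not say which tools or versions were actually used. All file names and jar paths are hypothetical, the per-lane BAMs are assumed to already be in a sort order MarkDuplicates accepts, and exact flags may differ between releases.

```python
# Dry-run sketch: builds command lines for the steps above without running them.
# Tool choice (Picard + GATK 3.x-style flags), jar paths and file names are
# assumptions for illustration, not the answerer's actual commands.

REF = "ref.fa"  # hypothetical reference FASTA
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]
PICARD = ["java", "-jar", "picard.jar"]


def build_plan(lane_bams, merged="merged.bam"):
    """Return a list of commands (each a list of arguments) in execution order."""
    plan = []
    dedup_bams = []

    # 1) De-duplicate each aligned BAM separately.
    for bam in lane_bams:
        out = bam.replace(".bam", ".dedup.bam")
        plan.append(PICARD + ["MarkDuplicates", f"I={bam}", f"O={out}",
                              f"M={out}.metrics.txt"])
        dedup_bams.append(out)

    # 2) Sort and merge; read groups are carried over from each input's header.
    plan.append(PICARD + ["MergeSamFiles"]
                + [f"I={b}" for b in dedup_bams]
                + [f"O={merged}", "SORT_ORDER=coordinate", "CREATE_INDEX=true"])

    # 3) Generate indel target intervals.
    plan.append(GATK + ["-T", "RealignerTargetCreator", "-R", REF,
                        "-I", merged, "-o", "targets.intervals"])

    # 4) Run indel realignment over those intervals.
    plan.append(GATK + ["-T", "IndelRealigner", "-R", REF, "-I", merged,
                        "-targetIntervals", "targets.intervals",
                        "-o", merged.replace(".bam", ".realigned.bam")])
    return plan


if __name__ == "__main__":
    for cmd in build_plan(["lane1.bam", "lane2.bam", "lane3.bam"]):
        print(" ".join(cmd))
```

Printing the plan before running it makes it easy to check the order of steps and the intermediate file names against your own layout.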
