Question: Order of GATK Commands
2
4.5 years ago by
Philadelphia
Ashutosh Pandey10k wrote:

I have a mouse genome that was sequenced using 5 different mate-pair libraries, and each library was run on 3 lanes of an Illumina machine. I first aligned the reads at the lane level, resulting in 15 BAM files. Then I merged all the BAM files (lanes) from the same library into a single BAM file, giving 5 "single library BAM" files in total (one per mate-pair library). I want to use GATK to perform indel realignment, dedup and base quality score recalibration.
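To make the layout concrete, here is a minimal Python sketch of the 5 libraries x 3 lanes design described above. It only builds the `@RG` header lines (standard SAM tags `ID`, `SM`, `LB`, `PL`, `PU`) and a per-library merge plan. The sample name, library names and BAM file names are hypothetical placeholders, not taken from the question.

```python
from collections import defaultdict

SAMPLE = "mouse1"                             # hypothetical sample name
LIBRARIES = [f"MP{i}" for i in range(1, 6)]   # 5 mate-pair libraries (names assumed)
LANES = [1, 2, 3]                             # 3 lanes per library


def read_group_line(library, lane, sample=SAMPLE):
    """Build one SAM @RG header line for a single lane of a single library."""
    rg_id = f"{library}.L{lane}"              # one read group per lane
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",
        f"LB:{library}",                      # library tag, shared by its 3 lanes
        "PL:ILLUMINA",
        f"PU:{rg_id}",
    ]
    return "\t".join(fields)


def lanes_by_library():
    """Group the lane-level BAM names (hypothetical) by library for merging."""
    groups = defaultdict(list)
    for lib in LIBRARIES:
        for lane in LANES:
            groups[lib].append(f"{lib}.L{lane}.bam")
    return dict(groups)


rg_lines = [read_group_line(lib, lane) for lib in LIBRARIES for lane in LANES]
merge_plan = lanes_by_library()

for line in rg_lines[:3]:
    print(line)
```

Each lane keeps its own read group ID while sharing the `LB` tag with the other lanes of its library, so lane identity survives the merge into a "single library BAM" file.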

Assuming I have enough computational resources to run the GATK tools even on big BAM files, what is the correct order for performing these steps? I personally think I should:

1) Perform "IndelRealigner" at the library level, i.e. for each "single library BAM" file separately.
2) Perform the "Dedup" step at the library level to remove or mark redundant reads.
3) Use the "TableRecalibration" tool to perform quality score recalibration at the single lane (read group ID) level. The GATK manual mentions that although a "single library BAM" file may contain reads from different read groups or lanes, GATK will perform the recalibration at the lane level if an RGID is provided in the BAM file for the different lanes.
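The point about read groups in step 3 can be illustrated with a toy calculation. The sketch below is a simplified stand-in, not GATK's actual recalibration model: it bins bases only by read group and reported quality, and the +1/+2 smoothing is an assumption made here to avoid taking the log of zero. It shows why two lanes stored in the same merged BAM can still be recalibrated differently.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Toy per-read-group recalibration table.

    observations: iterable of (read_group_id, reported_quality, is_mismatch).
    Returns {(read_group_id, reported_quality): empirical Phred quality}.
    """
    counts = defaultdict(lambda: [0, 0])      # key -> [n_bases, n_mismatches]
    for rg, qual, is_mismatch in observations:
        entry = counts[(rg, qual)]
        entry[0] += 1
        entry[1] += int(is_mismatch)

    table = {}
    for key, (n_bases, n_errors) in counts.items():
        # Smoothed error rate (assumed smoothing, see lead-in).
        rate = (n_errors + 1) / (n_bases + 2)
        table[key] = -10 * math.log10(rate)
    return table


# Two lanes of one hypothetical library, both reporting Q30 for every base.
lane1 = [("MP1.L1", 30, i < 9) for i in range(998)]   # 9 mismatches in 998 bases
lane2 = [("MP1.L2", 30, False) for _ in range(998)]   # no mismatches
table = empirical_quality(lane1 + lane2)

for key in sorted(table):
    print(key, round(table[key], 1))
```

Because the table is keyed by read group, the noisier lane gets a lower empirical quality than the clean one even though both come from the same file.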

But I have read a few recent papers describing exactly the same situation as mine (1 sample -> multiple libraries -> each library run across more than one lane, no barcoding) in which indel realignment was performed at the lane level, on each single file, then the recalibration step was performed for each BAM file separately, and finally the lanes coming from the same library were merged together to form the five "single library BAM" files.

I just want to make sure that I am doing things the correct way.

Thanks.

gatk bam library • 2.3k views
ADD COMMENTlink modified 4.5 years ago by Jorge Amigo9.5k • written 4.5 years ago by Ashutosh Pandey10k
5
4.5 years ago by
Jorge Amigo9.5k
Santiago de Compostela, Spain
Jorge Amigo9.5k wrote:

I guess that the best practice would be to follow GATK's advice for best practices, wouldn't it?

I personally use the "better" suggestion, since the merging step of the "best" suggestion has always given me problems due to internal sample labeling on SOLiD platforms. We would use it only on small targeted resequencing projects, but we've found that all the steps suggested as "better" lead to fairly believable results.

ADD COMMENTlink written 4.5 years ago by Jorge Amigo9.5k

Yeah, I tend to go with GATK's best practices as well; it is pretty straightforward and seems to work. I would use the better option but I often only have 1-3 exome samples per project, and I've never been sure whether doing VQSR with samples from different projects (different diseases and families) is a good idea or not.

ADD REPLYlink written 4.5 years ago by Dan Gaston6.6k

That's exactly the point I was trying to make. If you have mixed things, it doesn't seem reasonable to treat them as a mixture. Sure, if you work constantly with the same kits, reagents, sample types,... using the merging step of the best practices would be wise, but that has very rarely been the case in our lab... to date ;)

ADD REPLYlink written 4.5 years ago by Jorge Amigo9.5k
1
4.5 years ago by
United States
Zev.Kronenberg10k wrote:

I don't know if I do it the "correct way", but here is my approach:

Align and de-dup separately.

Sort and merge together with read groups.

Generate indel target intervals.

Run indel realignment.
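The four steps above can be sketched as a list of command lines. This is only an illustration under stated assumptions: it uses classic Picard `KEY=value` syntax and GATK 3.x-style `-T` walkers (IndelRealigner and RealignerTargetCreator are not part of GATK4), it reads "de-dup separately" as one MarkDuplicates run per input BAM, it skips the alignment step, and every jar path and file name is a hypothetical placeholder. The script builds the commands without executing them.

```python
REF = "mm_ref.fa"                                # hypothetical reference FASTA
PICARD = ["java", "-jar", "picard.jar"]          # assumed jar location
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed GATK 3.x jar


def build_pipeline(lane_bams, prefix):
    """Return the commands (as argv lists) in the order described in the answer."""
    cmds = []

    # 1. De-dup each aligned BAM separately (alignment itself is assumed done).
    deduped = []
    for bam in lane_bams:
        out = bam.replace(".bam", ".dedup.bam")
        cmds.append(PICARD + ["MarkDuplicates", f"INPUT={bam}",
                              f"OUTPUT={out}", f"METRICS_FILE={out}.metrics"])
        deduped.append(out)

    # 2. Sort and merge; read groups already in the inputs are carried over.
    merged = f"{prefix}.merged.bam"
    cmds.append(PICARD + ["MergeSamFiles"]
                + [f"INPUT={b}" for b in deduped]
                + [f"OUTPUT={merged}", "SORT_ORDER=coordinate"])

    # 3. Generate indel target intervals.
    intervals = f"{prefix}.intervals"
    cmds.append(GATK + ["-T", "RealignerTargetCreator", "-R", REF,
                        "-I", merged, "-o", intervals])

    # 4. Run indel realignment over those intervals.
    realigned = f"{prefix}.realigned.bam"
    cmds.append(GATK + ["-T", "IndelRealigner", "-R", REF, "-I", merged,
                        "-targetIntervals", intervals, "-o", realigned])
    return cmds


commands = build_pipeline(["MP1.L1.bam", "MP1.L2.bam", "MP1.L3.bam"], "MP1")
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the commands as data makes it easy to print and review the order before handing them to a scheduler or `subprocess.run`.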

ADD COMMENTlink written 4.5 years ago by Zev.Kronenberg10k