Question

Need Help With GATK Usage

0

10.5 years ago

ivivek_ngs ★ 5.2k

I am new to GATK and want to use it for my exome sequencing data analysis, and I have a few more questions. I have gotten a bit lost reading all the blogs, comments and technical forums, so I will describe what I plan to do; please correct me and guide me through the procedure.

I have downloaded the hg19 files from the UCSC browser and built the reference genome from them. Do I need to switch to the reference that is in the GATK repository and align my samples against that one for the downstream analysis?

I also want to run GATK on my institute's cluster. If I am not wrong, I should create a directory for the latest GATK version and transfer all the necessary files via FileZilla into the cluster directory with the same name. I have already done this. The next step is to download the bundle from the repository, where I see two versions, 2.5 and 2.3. Which one should I download? And once I have the bundle, do I need to download anything else?

So this is what I think I should have on the cluster: the jar file and the resource folder with the .java files, and then, in the main directory of the GATK version folder, the bundle (2.5 or 2.3), with all the files in the bundle directory unzipped. Is that right? Please let me know. After that I should be ready to use GATK for the downstream steps listed below:

Identify target regions for realignment (Genome Analysis Toolkit) -> Realign BAM to get better indel calling (Genome Analysis Toolkit) -> Reindex the realigned BAM (SAMtools) -> Call indels (Genome Analysis Toolkit) -> Call SNPs (Genome Analysis Toolkit) -> View aligned reads in BAM/BAI (Integrative Genomics Viewer)

I would also like to ask about the step of marking PCR duplicates. Is it necessary for the downstream analysis? In some threads I see people doing the variant calling without it. I am trying to set up a pipeline with a single paired-end sample and test it end to end, and I will then use this pipeline once I receive my samples from the facility.

Please let me know whether this looks correct. The VCF files from 1kG and dbSNP are already available in compressed form in the bundle repository on the GATK website, which I am currently downloading, and I assume I can use them directly after unzipping them.

exome-sequencing gatk • 4.8k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 10.5 years ago by ivivek_ngs ★ 5.2k

1

Cross-posted here and here.

ADD REPLY • link 10.5 years ago by Devon Ryan 104k

0

Yes, I am looking for suggestions, so I posted both on SEQanswers and here. People who reply on SEQanswers need not reply here as well; I will follow it up.

ADD REPLY • link 10.5 years ago by ivivek_ngs ★ 5.2k

0

I have removed it from the first post and kept it in the thread "BAM to VCF" on SEQanswers. If anyone would like to make suggestions there or here, they are welcome; an answer in either thread would give me the guidance I need.

ADD REPLY • link 10.5 years ago by ivivek_ngs ★ 5.2k

0

You have asked many questions that I might need to deal with in my own data analysis too. I think you have to keep the GATK version (and the BWA version) consistent from beginning to end, and start with the latest version of each tool, because different versions might produce different results.

Marking PCR duplicates is not always necessary, but there is no harm in just marking them without removing them; the downstream software can still do variant calling. You can choose to remove the duplicates instead, but once you do, those reads are lost.
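To make the mark-versus-remove difference concrete, here is a minimal Python sketch that only assembles a Picard MarkDuplicates command line and prints it; nothing is executed. The jar location and all file names are placeholders I made up, and the `picard.jar MarkDuplicates` invocation style is an assumption that depends on your Picard version (older releases shipped a separate `MarkDuplicates.jar`).

```python
import shlex


def mark_duplicates_cmd(bam_in, bam_out, metrics, remove=False,
                        picard_jar="picard.jar"):
    """Build (but do not run) a Picard MarkDuplicates command line.

    picard_jar and the file names are hypothetical placeholders.
    """
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        "INPUT=" + bam_in,
        "OUTPUT=" + bam_out,
        "METRICS_FILE=" + metrics,
        # false: duplicates stay in the BAM and are only flagged
        # true:  duplicates are dropped from the output BAM
        "REMOVE_DUPLICATES=" + ("true" if remove else "false"),
    ]


if __name__ == "__main__":
    cmd = mark_duplicates_cmd("sample.sorted.bam",
                              "sample.dedup.bam",
                              "sample.dup_metrics.txt")
    # Print a copy-pasteable shell command for inspection
    print(" ".join(shlex.quote(part) for part in cmd))
```

With `remove=False` the flagged reads are still in the output BAM, so you can change your mind later; with `remove=True` they are gone from that file.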

ADD REPLY • link 10.5 years ago by Tonyzeng ▴ 310

score 2 · Answer 1 · 2013-10-12

there are 2 main resources you have to check to improve your knowledge about exome sequencing:

having said that, and not wanting to extend this answer too much, just a few guidelines:

you have to use the same reference in the alignment step and in the GATK analysis. using any of the references that come inside the bundle allows straightforward usage of the rest of the bundle files
if you don't know which software and bundle version to use, just go for the latest one (currently GATK 2.7.4 and bundle 2.5). you don't have to care about where to store them as long as you use full paths when referencing each file. and yes, you have to unzip all the downloaded files before using them.
follow the best practices as far as possible. the typical pipeline would be the following: [duplicates removal] > [2-step indel realignment] > [2-step base quality recalibration] > [variant calling] > [variant recalibration] > [variant annotation]
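
The middle of that pipeline can be sketched as a sequence of GATK 2.x command lines. The Python below only builds the commands so you can inspect them; it does not run anything. It assumes a sorted, indexed, duplicate-marked BAM as input; every file name, the jar path and the `prefix` are hypothetical placeholders, and UnifiedGenotyper is just one possible caller. Variant recalibration and annotation are left out of this sketch.

```python
def gatk(tool, ref, *args, jar="GenomeAnalysisTK.jar"):
    """Build one GATK 2.x style command line (jar path is a placeholder)."""
    return ["java", "-jar", jar, "-T", tool, "-R", ref, *args]


def build_pipeline(ref, bam, known_indels, dbsnp, prefix="sample"):
    """Return the realignment, recalibration and calling commands in order.

    All output names derived from `prefix` are made up for illustration.
    """
    intervals = prefix + ".intervals"
    realigned = prefix + ".realigned.bam"
    recal_table = prefix + ".recal.grp"
    recal_bam = prefix + ".recal.bam"
    raw_vcf = prefix + ".raw.vcf"
    return [
        # 2-step indel realignment: find target intervals, then realign
        gatk("RealignerTargetCreator", ref, "-I", bam,
             "-known", known_indels, "-o", intervals),
        gatk("IndelRealigner", ref, "-I", bam,
             "-targetIntervals", intervals,
             "-known", known_indels, "-o", realigned),
        # 2-step base quality recalibration: build the table, then apply it
        gatk("BaseRecalibrator", ref, "-I", realigned,
             "-knownSites", dbsnp, "-knownSites", known_indels,
             "-o", recal_table),
        gatk("PrintReads", ref, "-I", realigned,
             "-BQSR", recal_table, "-o", recal_bam),
        # variant calling: SNPs and indels in one pass
        gatk("UnifiedGenotyper", ref, "-I", recal_bam,
             "-glm", "BOTH", "--dbsnp", dbsnp, "-o", raw_vcf),
    ]


if __name__ == "__main__":
    steps = build_pipeline("/full/path/reference.fasta",
                           "/full/path/sample.dedup.bam",
                           "/full/path/known_indels.vcf",
                           "/full/path/dbsnp.vcf")
    for step in steps:
        print(" ".join(step))
```

Note that the same `ref` is passed to every step, which is the point of the first guideline, and that full paths are used throughout, as suggested in the second.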