Question: Getting A Large Vcf File On Variant Calling Through Gatk
gravatar for Saket Choudhary
7.6 years ago by
Saket Choudhary80 wrote:


I am using the following set of commands on GATK2.1.13 to generate a VCF file

               echo java -Xmx20g -jar  /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator -o my.intervals -et NO_ET -K /root/sandbox/ 
               echo "Realignment Done at date" echo "Starting IndelRealigner at date"

              echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam -et NO_ET -K /root/sandbox/ 
             echo "Realignment done at date" 
             echo "Starting UnifiedGenotyper at date" echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper -I myrealignedBam.bam -o mygatk_vcf.vcf --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/

When i do a 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam , I get a devcent 10 MB VCF file. But on the last ste of the above pipeline, my " mygatk_vcf.vcf " is goinging into 81GBs !!

Do you know what is wrong ?

vcf gatk variant samtools • 2.6k views
ADD COMMENTlink modified 2.5 years ago by Biostar ♦♦ 20 • written 7.6 years ago by Saket Choudhary80
gravatar for Brad Chapman
7.6 years ago by
Brad Chapman9.5k
Boston, MA
Brad Chapman9.5k wrote:

Your UnifiedGenotyper command line includes the option --output_mode EMIT_ALL_SITES, which will write a call for every site in the genome. This is creating the large file size. The default is to only provide calls at variant sites that differ from the reference, which be a more reasonable size. There is lots of useful documentation to help in choosing the right options:

ADD COMMENTlink written 7.6 years ago by Brad Chapman9.5k

I guess that you are looking for both SNP and indels, so instead of using "--output_mode EMIT_ALL_SITES" you should use "-glm BOTH" (edited). if you are looking for something else then you'll definitely have to browse through Brad's suggestions (I would recommend it anyway, since GATK's docs are constantly changing, and it's good to have always a general plus a more straight vision of what GATK is really doing).

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Jorge Amigo11k

There is some confusion here. In case you need to call for both SNPs and Indels one should use genotypelikelihoodsmodel = BOTH. I dont think --output_mode has "BOTH" option.

ADD REPLYlink written 7.6 years ago by Ashutosh Pandey12k

yes, I probably wrote too fast my comment. it should be "-glm BOTH". the output mode option should be used when you want to get "all confident sites" or "all sites" instead of default "variants only". thanks for pointing it out, I'll edit my previous comment.

ADD REPLYlink written 7.6 years ago by Jorge Amigo11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 741 users visited in the last hour