Question

Question About the DP Value in Multiple Samples Calculated by UG

13.1 years ago

Chris ▴ 40

Hi, enthusiastic people. I have successfully run local realignment, BQSR, and then UG on my data. But I find that, in the VCF file, the DP value is very large (several hundred), whereas each of my samples has only 10x-20x average coverage. The data consist of 50 BAMs. Is the DP value calculated as 50 × (10-20)?

The UG walker also tells me it needs about 5 days to complete the process. The 50 BAMs total 150GB in size. Is that time about right?

Some warnings also come up: WARN 16:49:00,627 ExactAFCalculationModel - this tool is currently set to genotype at most 3 alternate alleles in a given context, but the context at chr1:38228257 has 16 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. Does this matter?

The command I use for UG: java -jar -Djava.io.tmpdir=/data1/tmp /path/GenomeAnalysisTK-1.6-7-g2be5704/GenomeAnalysisTK.jar -R /path/ucsc.hg19.fasta -I /path/bam.list -T UnifiedGenotyper -D /data1/gatk/dbsnp_135.hg19.vcf -o SRR_50bam.raw.vcf -glm BOTH

a sample SNP in result: chrM 152 rs117135796 T C 3176.34 . AC=27;AF=0.307;AN=88;BaseQRankSum=-3.503;DB;DP=530;Dels=0.01;FS=3.082;HRun=1;HaplotypeScore=5.2820;InbreedingCoeff=0.6907;MQ=35.09;MQ0=13;MQRankSum=-8.934;QD=16.90;ReadPosRankSum=1.860;SB=-1671.05 GT:AD:DP:GQ:PL 1/1:0,14:14:42.04:395,42,0 ./. 0/0:5,0:6:12.02:0,12,121 0/0:9,0:9:18.02:0,18,176 0/0:14,0:14:42.06:0,42,428 1/1:0,7:7:21.03:202,21,0 0/0:8,0:9:21.04:0,21,209 0/1:1,3:4:22.54:44,0,23 0/0:6,0:6:18.04:0,18,182 0/0:14,1:15:23.98:0,24,209 0/0:14,1:15:23.98:0,24,209 0/0:13,0:13:36.06:0,36,358 1/1:0,20:20:48.06:454,48,0 0/0:8,0:8:12.03:0,12,132 0/0:21,0:21:57.11:0,57,582 1/1:0,10:10:24.02:223,24,0 1/1:0,20:20:57.06:542,57,0 0/0:26,2:30:56.99:0,57,523 0/0:15,1:17:33.06:0,33,337 0/0:19,0:19:45.11:0,45,479 0/0:8,0:8:24.06:0,24,253 0/0:7,0:7:21.01:0,21,194 0/0:21,0:21:51.07:0,51,490 1/1:0,13:13:32.96:284,33,0 1/1:0,9:9:17.99:153,18,0 0/0:19,0:19:51.07:0,51,505 0/1:4,13:17:75.62:181,0,76 0/0:20,0:20:48.05:0,48,457 0/0:12,0:12:18.01:0,18,166 1/1:0,13:15:26.99:240,27,0 0/0:12,0:12:30.01:0,30,280 1/1:0,14:14:39.05:384,39,0 1/1:0,20:20:9.01:85,9,0 0/0:3,0:3:9.02:0,9,94 0/0:16,0:16:45.06:0,45,441 0/0:26,0:26:69.06:0,69,663 0/1:6,7:13:99:134,0,118 1/1:0,3:3:3.01:31,3,0 0/0:3,0:4:3:0,3,29

I have searched this forum for my question but am still confused. Sorry for my unprofessional question, and I appreciate your help. Thanks.

gatk • 2.7k views

updated 11.7 years ago by Biostar • written 13.1 years ago by Chris

score 1 · Answer 1 · 2012-05-29

The fields in the INFO column of a VCF file describe the entire set of samples analyzed, so if you are calling multiple samples, the DP in the INFO column (defined as "Approximate read depth") is the read depth at that site across all samples. With 50 samples at 10x-20x each, an INFO DP of several hundred is therefore expected. If you look carefully at the sample columns, the ones that use the GT:AD:DP:GQ:PL format, you will find the DP value for each sample there.
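To see the two kinds of DP side by side, here is a minimal Python sketch that pulls the INFO DP and the per-sample DP values out of one VCF data line. The function name `depth_summary` is made up for illustration, and the record is a shortened excerpt of the chrM:152 line from the question (INFO trimmed, only the first five sample columns kept), so the per-sample sum is not expected to reach 530. Even on the full line the two numbers need not match exactly: as far as I know, GATK's per-sample DP counts only reads that pass its filters, while the INFO DP is the unfiltered total.

```python
def depth_summary(vcf_line):
    """Return (INFO DP, list of per-sample DP) for one tab-delimited VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags (e.g. DB)
    info = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in fields[7].split(";")
    )
    # FORMAT (column 9) names the ':'-separated subfields of every sample column
    dp_index = fields[8].split(":").index("DP")
    sample_dps = []
    for sample in fields[9:]:
        values = sample.split(":")
        if len(values) <= dp_index or values[dp_index] in (".", ""):
            sample_dps.append(None)  # uncalled sample such as "./."
        else:
            sample_dps.append(int(values[dp_index]))
    return int(info["DP"]), sample_dps


# Shortened excerpt of the record in the question (assumption: real VCF
# columns are tab-separated, so the line is rebuilt with tabs here).
record = "\t".join([
    "chrM", "152", "rs117135796", "T", "C", "3176.34", ".",
    "AC=27;AF=0.307;AN=88;DB;DP=530;MQ=35.09",
    "GT:AD:DP:GQ:PL",
    "1/1:0,14:14:42.04:395,42,0", "./.", "0/0:5,0:6:12.02:0,12,121",
    "0/0:9,0:9:18.02:0,18,176", "0/0:14,0:14:42.06:0,42,428",
])

info_dp, sample_dps = depth_summary(record)
called = [d for d in sample_dps if d is not None]
print("INFO DP (all samples):", info_dp)
print("per-sample DP:", sample_dps)
print("sum over the samples shown:", sum(called))
```

Running this over a whole VCF, one line at a time, is a quick way to confirm that each sample's own DP stays in the 10x-20x range even when the INFO DP is in the hundreds.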

Regarding time performance, it takes ~2h on our cluster to process ~100 BAMs of ~1G each, so your 50 BAMs of ~150G each would take ~6 days if run time scales linearly with input size. Considering that GATK constantly updates its estimate of the time needed, seeing a 5-day notice at the start would be expected. Keep in mind, though, that UnifiedGenotyper can be run in parallel through the -nt option, which drastically reduces run times. On the wiki there is a page about GATK parallelism, which states that using 8 threads, if possible, gives the best scaling. We simply use 2 and get almost exactly half the original times.
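As a rough check of the arithmetic behind the ~6 day figure, here is a minimal Python sketch. It assumes that run time scales linearly with total input size and divides evenly across threads; the near-halving with 2 threads mentioned above is the only data point for the latter. The function name and its parameters are made up for illustration. Note that the question gives 150GB as the total for all 50 BAMs, whereas the ~6 day figure corresponds to ~150G per BAM, so both readings are computed.

```python
def estimate_hours(n_bams, gb_per_bam, threads=1,
                   ref_hours=2.0, ref_total_gb=100.0):
    """Hypothetical linear estimate: scale a reference run by total input size.

    ref_hours / ref_total_gb describe the reference run from the answer
    (~2h for ~100 BAMs of ~1G each, i.e. ~100G in total).
    """
    total_gb = n_bams * gb_per_bam
    # Assumes ideal speed-up with threads; real scaling is usually worse.
    return ref_hours * (total_gb / ref_total_gb) / threads


per_file_reading = estimate_hours(50, 150)        # 150G per BAM
total_reading = estimate_hours(50, 150 / 50)      # 150G across all 50 BAMs
two_threads = estimate_hours(50, 150, threads=2)  # per-BAM reading, -nt 2

print(f"150G per BAM: {per_file_reading:.0f} h ({per_file_reading / 24:.2f} days)")
print(f"150G in total: {total_reading:.0f} h")
print(f"150G per BAM with 2 threads: {two_threads:.0f} h")
```

Size-based scaling is only a crude guide, since run time also depends on coverage, the number of samples genotyped together, and the hardware, so GATK's own running estimate is the better reference for a specific job.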