Dear all,
I read this paper and do not fully understand the statement "Variant calling was performed on all 2,073 BAM files using the GATK UnifiedGenotyper". In this paper, 2,073 mice were sequenced at ~0.6X coverage and UnifiedGenotyper was employed to call variants. The method is described as:
Variant calling was performed on all 2,073 BAM files using the GATK UnifiedGenotyper with thresholds `-stand_call_conf 30` and `-stand_emit_conf 30`, as well as the following options for building variant quality recalibration tables: `-A QualByDepth -A HaplotypeScore -A BaseQualityRankSumTest -A ReadPosRankSumTest -A MappingQualityRankSumTest -A RMSMappingQuality -A DepthOfCoverage -A FisherStrand -A HardyWeinberg -A HomopolymerRun`. Raw VCF files from the variant calling step for all chromosomes except the Y chromosome were pooled together for VQSR using the GATK VariantRecalibrator under SNP mode. The training, known and true sets for building the positive model are the SNPs that segregate among the classical laboratory strains of the MGP (2011 release REL-1211) on all chromosomes except the Y chromosome.
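For concreteness, the description probably corresponds to something like the following GATK 3.x invocations. This is a hypothetical reconstruction, not the paper's actual commands: file names (`mm10.fa`, `bams.list`, `mgp.REL-1211.snps.vcf`, etc.) and the VQSR annotation/prior choices are placeholders, while the flag spellings follow GATK 3.

```shell
# Hypothetical reconstruction of the paper's GATK 3.x calling step.
# All file names here are placeholders, not from the paper.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R mm10.fa \
    -I bams.list \
    -stand_call_conf 30 \
    -stand_emit_conf 30 \
    -A QualByDepth -A HaplotypeScore -A BaseQualityRankSumTest \
    -A ReadPosRankSumTest -A MappingQualityRankSumTest \
    -A RMSMappingQuality -A DepthOfCoverage -A FisherStrand \
    -A HardyWeinberg -A HomopolymerRun \
    -o raw.vcf

# VQSR on the pooled raw calls in SNP mode, with the MGP REL-1211 SNPs
# serving as the known/training/truth resource as the methods describe.
# The annotation list (-an ...) and prior are illustrative assumptions.
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R mm10.fa \
    -input raw.vcf \
    -mode SNP \
    -resource:mgp,known=true,training=true,truth=true,prior=12.0 mgp.REL-1211.snps.vcf \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    -recalFile raw.recal \
    -tranchesFile raw.tranches
```

Note that `bams.list` is a plain text file with one BAM path per line, which is how GATK accepts many input files in a single `-I` argument.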
In the UnifiedGenotyper step, I saw a limit of 250 in the multi-sample SNP calling settings of GATK. How can more than 2,000 samples be combined in this step? Or have I misunderstood something?
Thank you!
Yuzhe
Why did they use UnifiedGenotyper and not HaplotypeCaller? Also, this paper is from 2016, so in all probability they used GATK <=3.4, not 3.8. Given how much better HC is than UG, they may have been phasing features out of UG for a while.
In this case, they employed very low-coverage sequencing (~0.6X) across many samples. I suspect that at such low coverage the sensitivity of the GVCF calls is not great, and HC's regular mode has been shown to perform poorly in a similar situation (see this paper).
So, to return to my question: how do you combine over 2,000 samples in the GATK variant calling step, whichever caller is used (UG or HC)?
Are you getting the 250 number from the downsampling setting? That is not a limit on the number of samples but a threshold on the depth of reads per sample. See this page on downsampling.
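To make the distinction concrete: in GATK 3, 250 is UnifiedGenotyper's default per-sample depth cap, controlled by `-dcov` (`--downsample_to_coverage`). It says nothing about how many samples you may pass. A minimal sketch, with placeholder file names:

```shell
# -dcov caps the reads considered per sample at each locus (default 250);
# -I accepts any number of samples, e.g. via a list file with one BAM per line.
# mm10.fa, bams.list and raw.vcf are placeholders.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R mm10.fa \
    -I bams.list \
    -dcov 250 \
    -o raw.vcf
```

So 2,073 samples is fine; only the read depth per sample per position is downsampled.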
Thank you for this explanation! I have another question: are the genotyping results likely to be corrected in `CombineGVCFs` and `GenotypeGVCFs`? The problem is still low-depth sequencing (<1X). For example, using the `--min-pruning 1` and `--min-dangling-branch-length 1` options will increase HaplotypeCaller's sensitivity but may output wrong genotype calls in the gVCF file. In this case, is it possible to correct the genotype results in the GenotypeGVCFs step by using thousands of samples?

I think this should be a new question on the forum.