Question: How to combine 2000 samples in multi-sample SNP calling
yuzhe891 wrote, 5 weeks ago:

Dear all,

I read this paper and do not fully understand the statement "Variant calling was performed on all 2,073 BAM files using the GATK UnifiedGenotyper". In this paper, 2,073 mice were sequenced at ~0.6X coverage and UnifiedGenotyper was used to call variants. The method is described as:

Variant calling was performed on all 2,073 BAM files using the GATK UnifiedGenotyper with thresholds -stand_call_conf 30 and -stand_emit_conf 30, as well as the following options for building variant quality recalibration tables: -A QualByDepth -A HaplotypeScore -A BaseQualityRankSumTest -A ReadPosRankSumTest -A MappingQualityRankSumTest -A RMSMappingQuality -A DepthOfCoverage -A FisherStrand -A HardyWeinberg -A HomopolymerRun. Raw VCF files from the variant calling step for all chromosomes except the Y chromosome were pooled together for VQSR using the GATK VariantRecalibrator under SNP mode. Training, known and true sets for building the positive model are the SNPs that segregate among the classical laboratory strains of the MGP (2011 release REL-1211) on all chromosomes except the Y chromosome.

In the UnifiedGenotyper step, I saw that multi-sample SNP calling in GATK has a setting that appears to limit sampling to 250 individuals. How can 2,000 samples be combined in this step? Or have I misunderstood something?

Thank you!

Yuzhe

snp • 164 views
modified 5 weeks ago by RamRS (21k) • written 5 weeks ago by yuzhe891 (0)

Why did they use UnifiedGenotyper and not HaplotypeCaller? Also, this paper is from 2016 so in all probability, they used GATK <=3.4, not 3.8. Given how HC is much better than UG, they may have been phasing features out of UG for a while.

written 5 weeks ago by RamRS (21k)

In this case, they used very low-coverage sequencing (~0.6X) across many samples. I suspect that at such low coverage the sensitivity of GVCF-based calls is poor. HC's regular mode has been shown to perform badly in a similar situation (see this paper).

So, to return to my question: how can 2,000 samples be combined in the GATK variant calling step, whichever method (UG or HC) is used?

modified 4 weeks ago by RamRS (21k) • written 4 weeks ago by yuzhe891 (0)

Are you getting the 250 number from the downsample setting? That is not a measure of the number of samples but a threshold on the depth of reads per sample. See this page on downsampling.

written 4 weeks ago by RamRS (21k)
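
To make the distinction concrete, here is a minimal sketch, assuming GATK 3.x command-line conventions, of how all 2,073 BAMs could be handed to a single UnifiedGenotyper run. The thresholds and -A annotations are the ones quoted from the paper. Everything else is hypothetical and only for illustration: the file names (reference.fa, bams.list, raw.chr1.vcf), the interval name chr1, the jar path, and the helper functions. Note that -dcov 250 caps the read depth used per sample; it does not restrict how many samples are listed after -I.

```python
import os
import tempfile
from typing import List, Optional

# Annotations quoted in the paper's methods section.
ANNOTATIONS = [
    "QualByDepth", "HaplotypeScore", "BaseQualityRankSumTest",
    "ReadPosRankSumTest", "MappingQualityRankSumTest", "RMSMappingQuality",
    "DepthOfCoverage", "FisherStrand", "HardyWeinberg", "HomopolymerRun",
]


def write_bam_list(bam_paths: List[str], list_path: str) -> str:
    """Write one BAM path per line (hypothetical helper).

    GATK 3 accepts a text file ending in .list as the value of -I,
    so thousands of BAMs do not need to be named on the command line.
    """
    with open(list_path, "w") as handle:
        handle.writelines(path + "\n" for path in bam_paths)
    return list_path


def build_ug_command(reference: str, bam_list: str, out_vcf: str,
                     interval: Optional[str] = None, dcov: int = 250,
                     gatk_jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Assemble a UnifiedGenotyper argument list (hypothetical helper)."""
    cmd = [
        "java", "-jar", gatk_jar, "-T", "UnifiedGenotyper",
        "-R", reference,
        "-I", bam_list,             # all samples at once, via the .list file
        "-stand_call_conf", "30",
        "-stand_emit_conf", "30",
        "-dcov", str(dcov),         # per-sample depth cap, not a sample limit
        "-o", out_vcf,
    ]
    if interval is not None:
        cmd += ["-L", interval]     # e.g. one run per chromosome
    for name in ANNOTATIONS:
        cmd += ["-A", name]
    return cmd


if __name__ == "__main__":
    # Hypothetical BAM names standing in for the 2,073 samples.
    bams = ["sample_%04d.bam" % i for i in range(1, 2074)]
    with tempfile.TemporaryDirectory() as tmp:
        bam_list = write_bam_list(bams, os.path.join(tmp, "bams.list"))
        cmd = build_ug_command("reference.fa", bam_list, "raw.chr1.vcf",
                               interval="chr1")
        print(" ".join(cmd))
```

The command is only built and printed here, not executed, so it can be inspected before running it against real data.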

Thank you for this explanation! I have another question: are genotyping errors likely to be corrected by CombineGVCFs and GenotypeGVCFs? The problem still concerns low-depth sequencing (<1X). For example, the --min-pruning 1 and --min-dangling-branch-length 1 options increase HaplotypeCaller's sensitivity but may produce wrong genotypes in the gVCF file. In that case, is it possible to correct the genotypes at the GenotypeGVCFs step by using thousands of samples?

modified 29 days ago by RamRS (21k) • written 4 weeks ago by yuzhe891 (0)
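
For reference, the workflow being asked about can be sketched as below, assuming GATK 4 command-line syntax. The file names and helper functions are hypothetical; the --min-pruning and --min-dangling-branch-length values are the ones mentioned in the question. The sketch only shows where each step sits in the pipeline. It does not say whether joint genotyping would actually correct low-depth genotype errors.

```python
from typing import List


def haplotypecaller_gvcf(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Per-sample calling in GVCF mode with the sensitivity options from the question."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference, "-I", bam, "-O", out_gvcf,
        "-ERC", "GVCF",
        "--min-pruning", "1",
        "--min-dangling-branch-length", "1",
    ]


def combine_gvcfs(reference: str, gvcfs: List[str], out_gvcf: str) -> List[str]:
    """Merge per-sample GVCFs into one multi-sample GVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference, "-O", out_gvcf]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]   # one -V per input GVCF
    return cmd


def genotype_gvcfs(reference: str, combined_gvcf: str, out_vcf: str) -> List[str]:
    """Joint genotyping across all samples in the combined GVCF."""
    return ["gatk", "GenotypeGVCFs", "-R", reference,
            "-V", combined_gvcf, "-O", out_vcf]


if __name__ == "__main__":
    # Three hypothetical samples stand in for the full cohort.
    samples = ["sample_0001", "sample_0002", "sample_0003"]
    gvcfs = [s + ".g.vcf.gz" for s in samples]
    steps = [haplotypecaller_gvcf("reference.fa", s + ".bam", g)
             for s, g in zip(samples, gvcfs)]
    steps.append(combine_gvcfs("reference.fa", gvcfs, "cohort.g.vcf.gz"))
    steps.append(genotype_gvcfs("reference.fa", "cohort.g.vcf.gz", "cohort.vcf.gz"))
    for step in steps:
        print(" ".join(step))
```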

I think this should be a new question on the forum.

written 29 days ago by RamRS (21k)