How to combine 2000 samples in multi-sample SNP calling
5.1 years ago
yuzhe891 • 0

Dear all,

I read this paper and do not fully understand the statement "Variant calling was performed on all 2,073 BAM files using the GATK UnifiedGenotyper". In this paper, 2,073 mice were sequenced at ~0.6X coverage and UnifiedGenotyper was used to call variants. The method is described as:

Variant calling was performed on all 2,073 BAM files using the GATK UnifiedGenotyper with thresholds -stand_call_conf 30 and -stand_emit_conf 30, as well as the following options for building variant quality recalibration tables: -A QualByDepth -A HaplotypeScore -A BaseQualityRankSumTest -A ReadPosRankSumTest -A MappingQualityRankSumTest -A RMSMappingQuality -A DepthOfCoverage -A FisherStrand -A HardyWeinberg -A HomopolymerRun. Raw VCF files from the variant calling step for all chromosomes except the Y chromosome were pooled together for VQSR using the GATK VariantRecalibrator under SNP mode. Training, known and true sets for building the positive model are the SNPs that segregate among the classical laboratory strains of the MGP (2011 release REL-1211) on all chromosomes except the Y chromosome.

For the UnifiedGenotyper step, I saw that the multi-sample SNP calling setting in GATK samples up to 250 individuals. How can 2,000 samples be combined in this step? Or have I misunderstood something?

Thank you!

Yuzhe

SNP • 1.7k views

Why did they use UnifiedGenotyper and not HaplotypeCaller? Also, this paper is from 2016 so in all probability, they used GATK <=3.4, not 3.8. Given how HC is much better than UG, they may have been phasing features out of UG for a while.


In this case, they used very low-coverage sequencing (~0.6X) across many samples. I suspect that at such low coverage the sensitivity of the GVCF calls is poor. HC's regular mode has been shown to perform badly in a similar situation (see this paper).

So, to return to my question: how can 2,000 samples be combined in the GATK variant calling step, whichever method (UG or HC) is used?


Are you getting the 250 number from the downsample setting? That is not a measure of the number of samples but a threshold on the depth of reads per sample. See this page on downsampling.
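
To make the distinction concrete, here is a minimal Python sketch of how a single GATK 3.x UnifiedGenotyper call over all BAM files might be assembled. It is not the paper's actual command: the thresholds and annotations are copied from the quoted methods text, while the jar name, file names, the use of a `.list` file for `-I`, and the explicit `-dcov 250` (the per-sample depth cap discussed above) are assumptions made for illustration.

```python
from pathlib import Path

# Annotations copied from the methods text quoted in the question.
ANNOTATIONS = [
    "QualByDepth", "HaplotypeScore", "BaseQualityRankSumTest",
    "ReadPosRankSumTest", "MappingQualityRankSumTest", "RMSMappingQuality",
    "DepthOfCoverage", "FisherStrand", "HardyWeinberg", "HomopolymerRun",
]


def write_bam_list(bam_paths, list_path):
    """Write one BAM path per line so the whole cohort goes in via one -I."""
    list_path = Path(list_path)
    list_path.write_text("\n".join(str(p) for p in bam_paths) + "\n")
    return list_path


def build_ug_command(reference, bam_list, output_vcf, interval=None,
                     jar="GenomeAnalysisTK.jar"):
    """Return the argument list for one multi-sample UnifiedGenotyper run.

    The jar name and all file paths are hypothetical placeholders.
    """
    cmd = [
        "java", "-jar", jar,
        "-T", "UnifiedGenotyper",
        "-R", str(reference),
        "-I", str(bam_list),          # one list file naming every BAM
        "-stand_call_conf", "30",
        "-stand_emit_conf", "30",
        "-dcov", "250",               # read-depth cap per sample, not a sample count
    ]
    for name in ANNOTATIONS:
        cmd += ["-A", name]
    if interval is not None:
        cmd += ["-L", str(interval)]  # e.g. one chromosome per job
    cmd += ["-o", str(output_vcf)]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    bams = [f"mouse_{i:04d}.bam" for i in range(1, 2074)]
    bam_list = write_bam_list(bams, "all_samples.list")
    print(" ".join(build_ug_command("reference.fa", bam_list, "chr1.raw.vcf", "chr1")))
```

The number of samples is simply the number of lines in the list file; the 250 only limits how many reads per sample are kept at a site, which at ~0.6X coverage would rarely be reached.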


Thank you for this explanation! I have another question: can genotyping errors be corrected at the CombineGVCFs and GenotypeGVCFs steps? The problem again concerns low-depth sequencing (<1X). For example, the --min-pruning 1 and --min-dangling-branch-length 1 options increase HaplotypeCaller's sensitivity but may produce wrong genotypes in the gVCF file. In that case, is it possible to correct the genotypes at the GenotypeGVCFs step using thousands of samples?
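
For reference, the workflow I am asking about could be sketched roughly as follows in Python. This only assembles the three commands and assumes GATK4-style syntax (the `gatk` launcher, `-ERC GVCF`, repeated `-V` inputs); all file names are hypothetical, and the sketch says nothing about whether joint genotyping actually corrects the per-sample calls.

```python
def haplotypecaller_gvcf_cmd(reference, bam, gvcf,
                             min_pruning=1, min_dangling_branch_length=1):
    """Per-sample GVCF call with the relaxed graph settings from the question."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference, "-I", bam, "-O", gvcf,
        "-ERC", "GVCF",
        "--min-pruning", str(min_pruning),
        "--min-dangling-branch-length", str(min_dangling_branch_length),
    ]


def combine_gvcfs_cmd(reference, gvcfs, combined_gvcf):
    """Merge the per-sample GVCFs into one cohort GVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]           # one -V per sample
    cmd += ["-O", combined_gvcf]
    return cmd


def genotype_gvcfs_cmd(reference, combined_gvcf, output_vcf):
    """Joint genotyping across all samples in the combined GVCF."""
    return ["gatk", "GenotypeGVCFs",
            "-R", reference, "-V", combined_gvcf, "-O", output_vcf]


if __name__ == "__main__":
    # Hypothetical sample names, for illustration only.
    samples = ["mouse_0001", "mouse_0002", "mouse_0003"]
    gvcfs = [f"{s}.g.vcf.gz" for s in samples]
    for s, g in zip(samples, gvcfs):
        print(" ".join(haplotypecaller_gvcf_cmd("reference.fa", f"{s}.bam", g)))
    print(" ".join(combine_gvcfs_cmd("reference.fa", gvcfs, "cohort.g.vcf.gz")))
    print(" ".join(genotype_gvcfs_cmd("reference.fa", "cohort.g.vcf.gz", "cohort.vcf.gz")))
```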


I think this should be a new question on the forum.
