GenotypeGVCFs: too many genotypes from pooled samples
6 months ago
Vic ▴ 40


I am trying to create a VCF file using GenotypeGVCFs in GATK4. I have 60 samples, and each sample is a pool, so the ploidy per sample is 60. This is due to the biological system I work in.

The data have been processed with HaplotypeCaller. Below is the command for one pooled sample, going from BAM to g.vcf:

./gatk HaplotypeCaller \
-I /home/novaseq/bams/4_12.bam \
-R /home/novaseq/gatk/genomic_refseq.fna \
-O /home/novaseq/gatk/gvcf_by_sample/4_12_WG.g.vcf \
-ploidy 60 

The single-sample g.vcf files were then merged into a database with GenomicsDBImport:

./gatk GenomicsDBImport \
--genomicsdb-workspace-path /home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
-L /home/novaseq/gatk/gvcf_by_sample/intervals.list \
--sample-name-map /home/novaseq/gatk/gvcf_by_sample/gvcf.sample_map \
--tmp-dir /home/novaseq/gatk/gvcf_by_sample/tmp

The resulting database was used to produce a VCF file with GenotypeGVCFs:

./gatk GenotypeGVCFs \
-R /home/novaseq/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
-V gendb:///home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
--sample-ploidy 60 \
-O /home/novaseq/gatk/pooled_colony.vcf.gz

When writing the VCF file, GenotypeGVCFs reports too many genotypes:

Sample/Callset 4_9( TileDB row idx 59) at Chromosome NC_037638.1 position 60188 (TileDB column 60187) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.

I understand that there is a limit on how many genotypes can be stored per record in the VCF.

Could someone explain why there are so many genotypes for the 3 alleles at this site? Are there really 1891 distinct ways of combining those 3 alleles? In one sample I have over 635000 genotypes for 5 alleles. How is this possible with a ploidy of 60? Is it due to the depth of sequencing?
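As a quick sanity check on those numbers, here is a minimal Python sketch. It assumes (this is an assumption, not something taken from the GATK or GenomicsDB source) that genotypes are counted as unordered combinations of alleles, so that a sample with ploidy P and A alleles has C(P + A - 1, A - 1) possible genotypes. The function names `genotype_count` and `max_alleles_within_limit` are hypothetical helpers introduced only for illustration.

```python
from math import comb


def genotype_count(num_alleles: int, ploidy: int) -> int:
    """Count unordered genotypes: multisets of size `ploidy` drawn
    from `num_alleles` alleles (assumed counting rule, see lead-in)."""
    return comb(ploidy + num_alleles - 1, num_alleles - 1)


def max_alleles_within_limit(ploidy: int, limit: int) -> int:
    """Largest number of alleles (REF included) whose genotype count
    stays at or below `limit` for the given ploidy."""
    n = 1
    while genotype_count(n + 1, ploidy) <= limit:
        n += 1
    return n


if __name__ == "__main__":
    # Ploidy 60, as in the pooled samples described above.
    for n in (2, 3, 5):
        print(n, "alleles ->", genotype_count(n, 60), "genotypes")
    # 1024 is the limit quoted in the message above.
    print("alleles that fit under 1024:", max_alleles_within_limit(60, 1024))
```

Under that assumption, 3 alleles at ploidy 60 give exactly 1891 genotypes and 5 alleles give 635376, which matches both figures above. If the assumption holds, the count depends only on ploidy and the number of alleles, not on sequencing depth, and at ploidy 60 only biallelic sites (61 genotypes) would stay under a limit of 1024.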

Finally, can anyone suggest a way to write the VCF file? Ultimately, what I would like is allele frequencies for my downstream filtering and analysis. Should I have applied additional filters earlier on?

Many thanks.

GenotypeGVCFs GATK VCF
