I am trying to create a VCF file using GenotypeGVCFs in GATK4. I have 60 samples, and each sample is pooled data with a ploidy of 60. This is due to the biological system I work in.
The data have been processed with HaplotypeCaller. Below is an example for one pooled sample, from BAM to g.vcf:
./gatk HaplotypeCaller \
    -I /home/novaseq/bams/4_12.bam \
    -R /home/novaseq/gatk/genomic_refseq.fna \
    -O /home/novaseq/gatk/gvcf_by_sample/4_12_WG.g.vcf \
    -ERC GVCF \
    -ploidy 60
The data were then run through GenomicsDBImport to merge the single-sample g.vcfs into a database:
./gatk GenomicsDBImport \
    --genomicsdb-workspace-path /home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
    -L /home/novaseq/gatk/gvcf_by_sample/intervals.list \
    --sample-name-map /home/novaseq/gatk/gvcf_by_sample/gvcf.sample_map \
    --tmp-dir /home/novaseq/gatk/gvcf_by_sample/tmp
The resulting database was then used to produce a VCF file with GenotypeGVCFs:
./gatk GenotypeGVCFs \
    -R /home/novaseq/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
    -V gendb:///home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
    --sample-ploidy 60 \
    -O /home/novaseq/gatk/pooled_colony.vcf.gz
GenotypeGVCFs reports a "too many genotypes" error when creating the VCF file:
Sample/Callset 4_9( TileDB row idx 59) at Chromosome NC_037638.1 position 60188 (TileDB column 60187) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
I understand there is a limit on how many genotypes can be stored in the VCF.
But could someone explain why there are so many genotypes for the 3 alleles at this site? Are there 1891 distinct ways of combining those 3 alleles? In one sample I have over 635,000 genotypes for 5 alleles. How is this possible with a ploidy of 60? Is it due to the depth of sequencing?
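For what it's worth, both numbers match the count of unordered genotypes for a given ploidy and allele count, i.e. the number of ways to distribute 60 chromosome copies among the alleles: C(ploidy + alleles - 1, alleles - 1). That would make the count a function of ploidy and number of alleles only, not of sequencing depth. A minimal sketch of the arithmetic in plain shell follows; `genotype_count` is a hypothetical helper written for this post, not part of GATK:

```shell
# genotype_count PLOIDY N_ALLELES
# Hypothetical helper (not a GATK command): prints the number of unordered
# genotypes, C(ploidy + alleles - 1, alleles - 1), using integer arithmetic.
genotype_count() {
  ploidy=$1
  alleles=$2
  n=$((ploidy + alleles - 1))
  k=$((alleles - 1))
  result=1
  i=1
  # Build the binomial coefficient step by step; every intermediate value
  # is itself a binomial coefficient, so the division is always exact.
  while [ "$i" -le "$k" ]; do
    result=$((result * (n - k + i) / i))
    i=$((i + 1))
  done
  echo "$result"
}

genotype_count 60 3   # 3 alleles at ploidy 60
genotype_count 60 5   # 5 alleles at ploidy 60
```

The first call gives 1891, the value in the message above, and the second gives 635376, in line with the "over 635,000" seen for 5 alleles. For comparison, a diploid sample with 3 alleles has only 6 genotypes, so the 1024 limit is hit much sooner at high ploidy.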
Finally, can anyone suggest a way to write the VCF file? Ultimately, what I would like is allele frequencies for my downstream filtering and analysis. Should I have applied additional filters earlier on?