GenotypeGVCFs: too many genotypes from pooled samples
6 months ago
Vic ▴ 40

Hello,

I am trying to create a VCF file using GenotypeGVCFs in GATK4. I have 60 samples, and each sample is pooled data with a ploidy of 60. This is due to the biological system I work in.

The data has been processed with HaplotypeCaller. Below is an example of one pooled sample going from BAM to g.vcf:

./gatk HaplotypeCaller \
-I /home/novaseq/bams/4_12.bam \
-R /home/novaseq/gatk/genomic_refseq.fna \
-O /home/novaseq/gatk/gvcf_by_sample/4_12_WG.g.vcf \
-ERC GVCF \
-ploidy 60

The single-sample g.vcf files were then merged into a database with GenomicsDBImport:

./gatk GenomicsDBImport \
--genomicsdb-workspace-path /home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
-L /home/novaseq/gatk/gvcf_by_sample/intervals.list \
--sample-name-map /home/novaseq/gatk/gvcf_by_sample/gvcf.sample_map \
--tmp-dir /home/novaseq/gatk/gvcf_by_sample/tmp

The resulting database was used to produce a vcf file via GenotypeGVCFs:

./gatk GenotypeGVCFs \
-R /home/novaseq/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
-V gendb:///home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
--sample-ploidy 60 \
-O /home/novaseq/gatk/pooled_colony.vcf.gz

When creating the VCF file, GenotypeGVCFs reports a "too many genotypes" message:

Sample/Callset 4_9( TileDB row idx 59) at Chromosome NC_037638.1 position 60188 (TileDB column 60187) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.

I understand that there is a limit on how many genotypes can be stored per record in the VCF.

However, could someone explain why there are so many genotypes for the 3 alleles at this site? Are there 1891 distinct ways of combining those 3 alleles? In one sample I have over 635000 genotypes for 5 alleles. How is this possible with a ploidy of 60? Is it due to the depth of sequencing?
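
For reference, both numbers match the count of unordered genotypes for a given ploidy and number of alleles, C(ploidy + alleles - 1, alleles - 1), which depends only on those two values and not on sequencing depth. Below is a minimal sketch that checks this arithmetic; the function name genotype_count is made up for illustration and is not part of GATK.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a GATK command): counts the unordered genotypes
# possible for a sample, i.e. the number of ways to distribute `ploidy`
# chromosome copies among `alleles` alleles:
#   C(ploidy + alleles - 1, alleles - 1)
genotype_count() {
    local ploidy=$1 alleles=$2
    local n=$(( ploidy + alleles - 1 ))
    local k=$(( alleles - 1 ))
    local result=1 i
    for (( i = 1; i <= k; i++ )); do
        # Multiply first, then divide: each partial result is itself a
        # binomial coefficient, so the integer division is exact.
        result=$(( result * (n - k + i) / i ))
    done
    echo "$result"
}

genotype_count 60 3   # ploidy 60, 3 alleles -> 1891
genotype_count 60 5   # ploidy 60, 5 alleles -> 635376
genotype_count 2 3    # diploid, 3 alleles   -> 6
```

With ploidy 60, even 3 alleles give 1891 genotypes, which is already above the limit of 1024 quoted in the message, so the limit is exceeded at any site with more than one alternate allele.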

Finally, can anyone suggest a way to write the VCF file? Ultimately, what I would like is allele frequencies for my downstream filtering and analysis. Should I have applied additional filters earlier on?

Many thanks.

GenotypeGVCFs GATK VCF • 193 views