GenotypeGVCFs: too many genotypes from pooled samples
2.9 years ago
Vic ▴ 100

Hello,

I am trying to create a VCF file using GenotypeGVCFs in GATK4. I have 60 samples, and each sample is pooled data with a ploidy of 60. This is due to the biological system I work in.

The data were processed with HaplotypeCaller. Below is an example for one pooled sample, from BAM to g.vcf:

./gatk HaplotypeCaller \
-I /home/novaseq/bams/4_12.bam \
-R /home/novaseq/gatk/genomic_refseq.fna \
-O /home/novaseq/gatk/gvcf_by_sample/4_12_WG.g.vcf \
-ERC GVCF \
-ploidy 60 

The data were then run through GenomicsDBImport to merge the single-sample g.vcfs into a database:

./gatk GenomicsDBImport \
--genomicsdb-workspace-path /home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
-L /home/novaseq/gatk/gvcf_by_sample/intervals.list \
--sample-name-map /home/novaseq/gatk/gvcf_by_sample/gvcf.sample_map \
--tmp-dir /home/novaseq/gatk/gvcf_by_sample/tmp

The resulting database was used to produce a vcf file via GenotypeGVCFs:

./gatk GenotypeGVCFs \
-R /home/novaseq/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
-V gendb:///home/novaseq/gatk/gvcf_by_sample/genomic_work_space/ \
--sample-ploidy 60 \
-O /home/novaseq/gatk/pooled_colony.vcf.gz

I get a "too many genotypes" message from GenotypeGVCFs when creating the VCF file:

Sample/Callset 4_9( TileDB row idx 59) at Chromosome NC_037638.1 position 60188 (TileDB column 60187) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.

I know there is a limit on how many genotypes can be in the VCF, but could someone explain why there are so many genotypes for the 3 alleles at this site? Are there 1891 ways of combining those 3 alleles? In one sample I have over 635000 genotypes for 5 alleles. How is this possible with a ploidy of 60? Is it due to the depth of sequencing?

Finally, can anyone suggest a way to write the VCF file? Ultimately, what I would like is allele frequencies for my downstream filtering and analysis. Should I have used additional filters earlier on?

Many thanks.

GenotypeGVCFs GATK VCF • 1.4k views
22 months ago

I have the same issue with GenotypeGVCFs, with two different warning messages.

A warning related to the number of genotypes:

> Sample/Callset 95( TileDB row idx 76) at Chromosome Chr2 position
> 2405066 (TileDB column 168320521) has too many genotypes in the
> combined VCF record : 1081 : current limit :  1024 (num_alleles,
> ploidy) = (46, 2). Fields, such as  PL, with length equal to the
> number of genotypes will NOT be added

A warning related to the number of alleles:

> Chromosome Chr2 position 5276810 (TileDB column 171192265) has too
> many alleles in the combined VCF record : 58 : current limit : 50.
> Fields, such as  PL, with length equal to the number of genotypes will
> NOT be added for this location.

I don't really understand either how we can get that many different alleles and genotypes at a single position, given that I have 80 diploid samples. I will ask the question directly on the tool webpage.


> I don't really understand either how can we get that many different alleles and genotypes at a single position,

Microsatellites, many indels in the context, many different clipped sequences, etc.
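
As for the genotype counts quoted in the warnings above, they look like a plain combinatorial count: a genotype is an unordered set of `ploidy` allele copies drawn from `num_alleles` alleles, which gives C(ploidy + num_alleles - 1, num_alleles - 1) possibilities. Here is a minimal sketch to check the numbers; the function name `genotype_count` is made up for illustration and is not part of GATK or GenomicsDB:

```python
from math import comb


def genotype_count(num_alleles: int, ploidy: int) -> int:
    """Count unordered genotypes (multisets of `ploidy` copies over `num_alleles` alleles)."""
    return comb(ploidy + num_alleles - 1, num_alleles - 1)


# (num_alleles, ploidy) pairs taken from the messages in this thread
print(genotype_count(3, 60))   # 1891
print(genotype_count(5, 60))   # 635376
print(genotype_count(46, 2))   # 1081
```

These match the 1891 and 1081 in the warnings and the "over 635000" mentioned in the question. The count depends only on the number of alleles and the ploidy, not on sequencing depth, so at ploidy 60 even a third allele at a site is enough to exceed the 1024 limit.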


Hello! I am having the same issue with merged tetraploid samples. Is there any update on this? Could you share the link to your question on the GATK site?

Thanks!
