Filtering for individual allelic balance in multisample vcf using GATK4
1
0
Entering edit mode
2.1 years ago
mernster • 0

I have a multisample vcffile and would like to edit heterozygous individuals with a low or high allelic balance (AB). AB is defined as the ratio of reads showing the reference allele to all reads.

If the AB is below 0.25, I would like to change the genotype to 1/1 (homozygous for alternateallele ). If the AB is above 0.75, I would like to change the genotype to 0/0 (homozygous for reference allele).

Unfortunately, I dont have the AB information for each individual in my vcf. Therefore, I was planning to use RO and DP instead. RO is the count of full observations of the reference haplotype and DP is the total number of reads. These tags are specified in the genotype column, meaning that they are specific for every individual.

The way to proceed in these cases apparently is as described here: https://software.broadinstitute.org/gatk/documentation/article?id=12350

I have been trying to use something similar to this to tag the variants that I want to convert: gatk VariantFiltration -R /ref.fasta -V input.vcf -O output.vcf --genotype-filter-expression 'isHet == 1 && RO / DP < 0.25 ' --genotype-filter-name 'lowAB' But somehow this just adds an anotation to all heterozygous positions...

As far as I know, GATK only allows you to convert the calls with the tag to no call. Therefore I think I need a script to manually change the genotype fields if the tags are present.

gatk vcf filtering GATK4 • 1.2k views
0
Entering edit mode

Are you sure that's not a XY Problem?

0
Entering edit mode

I guess so. The original aim is to filter a multisample VCF by AB. The problem is, that if I do

vcffilter -f 'AB > 0.25 & AB < 0.75 | AB < 0.01' input.vcf > output.vcf


on a multisample VCF file, it uses the AB over ALL heterozygous genotypes in that site. However, what I want is to filter for the AB on a genotype level instead. So that, if an individual has 2 reads supporting allele A and 8 reads supporting allele B, that site is filtered since the 2 reads are then assumed to be sequencing errors.

However, I think that automatically removing the genotypes with low or high AB results in a great loss of information. There could be cases where the AB is low or high but when one discards the reads supporting A (because they are assumed to be sequencing errors), there are still sufficient reads showing that that site is homozygous for B.

Therefore, it would be good, if the program could discard the reads supporting A and then recall the genotype instead of converting the genotype under consideration to ./. automatically...

0
Entering edit mode

It probably depends for what you need allelic imbalance and how sensitive your detection of allelic imbalance needs to be. Is this RNA or DNA-seq? If the latter, what's the coverage?

0
Entering edit mode

I am working with very closely related species, thats why I wanted to keep the hets. My plan was to infer the phylogenetic tree using several methods (SNAPP, phylonet, Treemix and a concatenated tree with raxml) and then to compare the results.

0
Entering edit mode

Others are way more qualified to answer here since I'm not a germline person, but with average 10X, you have to accept that you don't have the power to detect heterozygosity for all sites. So you have to find a way to deal with the uncertainty. I'm sure there are standard ways of doing that for your goal. Allelic imbalance seems like the wrong term for a literature research here.

0
Entering edit mode
2.1 years ago

Calling AB=0.24 as homozygous instead of heterozygous is the wrong thing to do ~100% of the time. Until you understand why, GATK is doing you a favor by not making it easy to make this catastrophic mistake.

0
Entering edit mode

thanks very much for the answer! I won't do that. But what do you advise me to do instead?

Would you advise completely removing the genotypes of heterozygous individuals that are below/above the AB threshold? Doesn't that result in a huge loss of information?

I have done a different kind of workflow before, where I called variants for each individual separately and performed the filtering on this individual VCFs

There, I did use vcffilter to filter the genotypes that have low or high AB:

vcffilter -f 'AB > 0.25 & AB < 0.75 | AB < 0.01' input.vcf > output.vcf


I saw that often, vcffilter converts the genotypes to 1/1 or 0/0, instead of ./. and I thought that that is much better than automatically removing all of the genotypes. Only because there are a few reads that support an alternative allele, it doesn't mean that there is not sufficient evidence for that site being homozygous, right?

Do you know any way to do filtering of a multisample VCF the same way I did on individual VCFs using vcffilter? Id like the program to reassess if, when removing the reads with the minor allele, there is still sufficient evidence for coding that site as homozygous. Also, id like it to do that for each genotype not over all samples...