VCF filtering
6 weeks ago
drowl1 ▴ 30

Hi everyone,

I have a multisample VCF file (30 samples) with haploid calls, and I want to filter out all sites where the following condition holds:

'the number of REF (GT=0) plus missing/uncalled (GT=".") genotypes across all samples is 30'

I have tried this in bcftools, but it doesn't seem to work:

bcftools view -e '(sum(GT[*]=".") + sum(GT[*]="0")) == 30' samples.vcf > filtered_samples.vcf


Please advise on how to do this correctly with bcftools or with any other approach.

Thanks!


I think slivar is pretty neat.


bcftools view --min-ac 1 in.vcf


Hi Pierre,

Thanks for your suggestion. I initially excluded all the homozygous REF sites (where GT = 0 across the 30 samples) as well as the homozygous ALT sites (where GT = 1 across all 30 samples) in two steps using bcftools.

I basically want to be left with heterozygous sites only, so I also want to follow that up by filtering out sites where the genotypes are REF or uncalled/missing across all samples (i.e. count of GT = "0" plus count of GT = "." equals 30) and sites where the genotypes are ALT or uncalled/missing across all samples (i.e. count of GT = "1" plus count of GT = "." equals 30).

It looks like the regex and arithmetic functions in bcftools do not work across samples, so I'm stuck. Would you know how to work around this?
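
Whether or not bcftools can express this directly, the per-site counts described above can be checked outside bcftools. The following is only a rough sketch, assuming an uncompressed, tab-delimited VCF with haploid calls, sample columns starting at column 10, and GT as the first FORMAT subfield (which the VCF specification requires when GT is present). The function name `tally_gt` is made up for illustration.

```shell
# Hypothetical helper: print CHROM, POS and the number of REF, ALT and
# missing haploid calls for every site read from standard input.
tally_gt() {
  awk '
    BEGIN { FS = OFS = "\t" }
    /^#/ { next }                      # skip meta and header lines
    {
      ref = 0; alt = 0; mis = 0
      for (i = 10; i <= NF; i++) {     # sample columns start at 10
        split($i, f, ":")              # f[1] holds the GT value
        if (f[1] == "0") ref++
        else if (f[1] == ".") mis++
        else alt++                     # any other allele index counts as ALT
      }
      print $1, $2, ref, alt, mis
    }'
}

# Usage (file name taken from the question):
# tally_gt < samples.vcf
```

A site where the REF and missing counts add up to 30 and the ALT count is 0 is one that the condition above would remove.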


Why are you talking about homozygous and heterozygous sites if those are "haploid calls"?


Apologies for that typo. I'd like to get rid of sites where every sample is either ALT or uncalled/missing, as well as sites where every sample is either REF or uncalled/missing, so that each remaining site has a mix of genotypes (ALT, REF and missing ".") across the samples.
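
For what it's worth, that requirement can be sketched as a plain awk filter. This is not a bcftools feature, just one possible workaround. It assumes an uncompressed, tab-delimited VCF with haploid calls, samples from column 10 onwards, and GT as the first FORMAT subfield. It also assumes that a site should be kept when it has at least one REF call and at least one ALT call, with missing calls allowed but not required. The name `keep_polymorphic` is hypothetical.

```shell
# Hypothetical helper: keep only sites that carry at least one REF and at
# least one ALT haploid call; header lines are passed through unchanged.
keep_polymorphic() {
  awk '
    BEGIN { FS = "\t" }
    /^#/ { print; next }               # keep meta and header lines
    {
      ref = 0; alt = 0
      for (i = 10; i <= NF; i++) {     # sample columns start at 10
        split($i, f, ":")              # f[1] holds the GT value
        if (f[1] == "0") ref++
        else if (f[1] != ".") alt++    # missing calls are ignored
      }
      # Drops REF+missing-only and ALT+missing-only sites in one pass.
      if (ref > 0 && alt > 0) print
    }'
}

# Usage (file names taken from the question):
# keep_polymorphic < samples.vcf > filtered_samples.vcf
```

Because the test is "both REF and ALT present", it does not depend on the sample count being 30, and it also removes sites where every call is missing.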