I have a multisample vcf that I would like to filter based on read depth. More precisely, I would like to modify the vcf so that a snp that is detected where the read depth is more than the average read depth at the sample will change to the reference allele. I know how to filter snps where read depth is higher than a certain value with
However, the value applied to the maxDP parameter would change depending on the sample.
Is there any tool or trick that could help me do this?
Many thanks for your help,
How do you plan to handle if the samples are discordant (i.e. the SNP has more reads than the avg read depth of one sample, but not another?) Are you just going to change the genotype of each sample individually? What's your ultimate goal here?
That is exactly my problem. The average read depth is different for each sample so I would need to find a way to modify the genotypes of each sample individually. My ultimate goal is to have a VCF file with all samples, where SNPs have been filtered for a minimum and a maximum read depth. I would use this VCF to build phylogenies, run population structure analyses, calculating heterozygosity...
Oof, that's an annoying one. You could write a script that uses bedtools
genomecovto get the read-depth at each variant position for each sample, but it's going to be rather annoying as you'll have to get that information from the bam file for each sample while iterating through the VCF.
I guess another (really not ideal) alternative would be to go back and call variants on each sample individually, filter with vcftools as you mention above, and then merge all the samples into a single VCF with bcftools
mergeor GATK's CombineVariants tools.
Is there a particular reason you feel filtering in this way will help your analyses? Most variant callers do a fair amount of filtering or offer separate utilities that can also be used for filtering.
Yes, I thought of re-doing the variant calling for each individual separately and merging the resulting vcf files. The problem with this is that samples that were homozygous for the reference alleles will end up with missing genotypes if other individuals were variant. A way around this would be to print all positions (variants & non variants) to get a gvcf but you end up with quite large files.
I feel I need to do this filtering because I am mapping reads to loci from a gene family with lots of paralogs. I fear that if paralogs are not in my reference, they will map to the same location/locus, falsely inflating the number of SNPs.
I will try Pierre's solution and will let you know if it works for me! Working on this now.
Ah okay. Yes, Pierre's solution should work for you assuming your FORMAT fields include DP. Good luck.
Could you post some example VCF records?
Not sure this is what you want but my multisample vcf looks like this at the moment: