Hello All,
I've got a bunch of SNPs from my ddRAD data, and I'm attempting to follow a set of filtering guidelines written by the good folks who make dDocent (https://github.com/jpuritz/dDocent/blob/master/tutorials/Filtering%20Tutorial.md). This guide (and many others) suggests filtering out individuals that have a lot of missing data. Individual missingness can be easily calculated using VCFtools with the "--missing-indv" parameter. Unfortunately, VCFtools does not support polyploidy, and over half of my samples are polyploids, so I'm trying to find an alternative. I've done a fair amount of googling to see if this can be done using bcftools, vcflib, or something similar, but haven't had any luck. Please let me know if you have any suggestions.
Thanks!
Evan
Can you please post one line of such a VCF?
Here's the first line from my vcf. SNPs were called using freebayes as part of the dDocent pipeline.
Unless I'm wrong, I don't see any "missing data" here.
If I understand correctly (I'm new to all of this), the line I just posted is one site (out of >400,000) from one contig (out of ~22,000). I'm also not exactly sure how "individual missingness" is calculated using VCFtools, but I think it takes each individual and calculates the percentage of sites (out of the >400,000 raw SNP sites provided by freebayes) for which that individual has data. I know that I certainly have missing data, because when I follow the SNP-filtering guide posted above (only works on my diploids, but I tested it) I have a few individuals that have close to 50% missing data.
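A minimal Python sketch of that calculation, which does not care about ploidy. It assumes that a genotype counts as missing only when every allele in the GT field is "." (for example "./." or "./././."), and that the denominator is every site in the file. This is an assumption about the definition, not VCFtools' actual code. The function name and the tiny demo records are made up for illustration.

```python
from collections import defaultdict


def individual_missingness(vcf_lines):
    """Return {sample: fraction of sites where the genotype is missing}.

    Assumption: a call is missing when all alleles in GT are '.',
    regardless of how many alleles (ploidy) the call has.
    """
    samples = []
    missing = defaultdict(int)
    n_sites = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            continue
        gt_idx = fields[8].split(":").index("GT")
        n_sites += 1
        for name, entry in zip(samples, fields[9:]):
            parts = entry.split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            alleles = gt.replace("|", "/").split("/")
            if all(a == "." for a in alleles):
                missing[name] += 1
    return {s: (missing[s] / n_sites if n_sites else 0.0) for s in samples}


# Made-up records: one diploid (dip1) and one tetraploid (tet1) sample.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tdip1\ttet1",
    "c1\t10\t.\tA\tG\t50\t.\t.\tGT:DP\t0/0:12\t0/0/1/1:30",
    "c1\t20\t.\tC\tT\t50\t.\t.\tGT:DP\t./.:0\t./././.:0",
    "c1\t30\t.\tG\tA\t50\t.\t.\tGT:DP\t0/1:9\t.",
    "c1\t40\t.\tT\tC\t50\t.\t.\tGT:DP\t1/1:7\t0/0/0/0:22",
]
result = individual_missingness(demo)
print(result)
```

For a real file, the same function can be fed an open file handle, e.g. `individual_missingness(open("my.vcf"))`, where `my.vcf` is a placeholder for an uncompressed VCF.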
These 0/0 calls are homozygous ref, so they are not exactly missing: this information is just as valuable as knowing that a variant / mutation is present. BCFtools should have some way of calculating these. If not, use my script here (the second one): A: calculate Per variant Heterozygosity from VCF file
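To make the hom-ref versus missing distinction concrete, here is a small Python sketch that labels a single GT string at any ploidy. The function name and the category labels are illustrative only; they are not taken from BCFtools or from the linked script. Treating a call with only some alleles missing as "partial" is an assumption.

```python
def classify_genotype(gt):
    """Label a VCF GT string such as '0/0', '0/0/1/1' or './.'.

    Works for any ploidy and for phased ('|') or unphased ('/') calls.
    """
    alleles = gt.replace("|", "/").split("/")
    if all(a == "." for a in alleles):
        return "missing"   # no allele was called at all
    if "." in alleles:
        return "partial"   # assumption: some alleles called, some not
    if all(a == "0" for a in alleles):
        return "hom_ref"   # real data: every allele matches the reference
    if len(set(alleles)) == 1:
        return "hom_alt"   # every allele is the same non-reference allele
    return "het"           # a mix of different alleles


for gt in ["0/0", "0/0/0/0", "./.", "./././.", "0/0/1/1", "1|1"]:
    print(gt, classify_genotype(gt))
```

Only the calls labeled "missing" would count toward an individual's missingness; "hom_ref" calls are genotyped sites.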