Filtering VCFs and Phasing
2.6 years ago

I have been trying to phase 66 genomes, all contained in chromosome-specific VCF files, using ShapeIt. I have a working pipeline, but it only runs if I use the --force option to override the error described below.

I get the following error:

ERROR: 15611 SNPs with high rates of missing data (>10%). These sites should be removed.

First I tried to use Plink to remove these SNPs, but the resulting VCF had seemingly lost a lot of information. I've since deleted the script, but I could probably figure out what I did if necessary.

Second, I found that VCFtools can remove the SNPs too. I used the following command:

vcftools --vcf $file --max-missing 0.1 --recode --recode-INFO-all --out $OUTDIR/"\$newname"

This step only removes a few hundred SNPs, and the error message from ShapeIt indicates that 15461 of the missing data SNPs are still present. Have I misinterpreted the VCFtools manual, missed a parameter, or approached the problem incorrectly?

Thank you in advance for your help. I am still learning a lot as I go, and bioinformatics is certainly not my forte.

ShapeIt VCFtools filtering VCF SNP • 965 views
2.6 years ago

So I figured it out. It turns out it was simply a misunderstanding of the parameter. In the VCFtools step, --max-missing needs to be 0.9 or higher (I used 0.95 in the end). I believe this means that only variants with at most 5% missing genotypes are kept. After testing a few different values, I found that with values above 0.9 the number of SNPs removed from each chromosome matched the number of SNPs that ShapeIt had reported as having too much missing data for phasing.
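For anyone landing here later, the corrected step might look roughly like the sketch below. The function name `filter_by_call_rate` and the example paths are placeholders I made up for illustration; only the vcftools options themselves come from the command in the question.

```shell
# Hypothetical wrapper around the VCFtools step from the question.
#   $1 = input VCF, $2 = output prefix, $3 = minimum call rate (defaults to 0.95)
# With --max-missing 0.95, sites with more than 5% missing genotypes are dropped.
filter_by_call_rate() {
    vcftools --vcf "$1" \
        --max-missing "${3:-0.95}" \
        --recode --recode-INFO-all \
        --out "$2"
}

# Example call (placeholder paths):
# filter_by_call_rate chr1.vcf filtered/chr1
```

Passing the threshold as an optional third argument makes it easy to try a few values and compare the site counts against what ShapeIt reports.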


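Yes, that is correct. Setting --max-missing to 0.9 means that only variants genotyped in at least 90% of your samples will be kept. In other words, the value is the minimum fraction of non-missing genotypes per site, not the maximum fraction of missing ones, so the original 0.1 only removed sites missing in more than 90% of samples. The name of this parameter does not do justice to its actual usage.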
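To see the effect of the threshold without running the filter, a rough awk sketch like the one below counts how many sites would pass a given value. It is only an approximation of what VCFtools does, under a few assumptions of mine: the VCF is uncompressed and tab-separated, GT is the first FORMAT field, and a genotype counts as missing when it starts with ".". The function name `count_passing_sites` is hypothetical.

```shell
# Count sites whose call rate (fraction of non-missing genotypes) is >= a threshold.
#   $1 = uncompressed VCF, $2 = threshold between 0 and 1 (same scale as --max-missing)
count_passing_sites() {
    awk -F'\t' -v min="$2" '
        /^#/ { next }                          # skip header lines
        NF > 9 {
            called = 0
            for (i = 10; i <= NF; i++)         # sample columns start at field 10
                if ($i !~ /^\./) called++      # assumption: missing GT starts with "."
            if (called / (NF - 9) >= min) kept++
        }
        END { print kept + 0 }' "$1"
}

# Example call (placeholder path):
# count_passing_sites chr1.vcf 0.9
```

Comparing the counts for 0.1 and 0.9 on the same file should show why the low value removed so few SNPs.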