Question: Filtering VCFs and Phasing
7 months ago, miles.thorburn wrote:

I have been trying to phase 66 genomes, stored in chromosome-specific VCF files, using ShapeIt. My pipeline runs, but only if I use the --force option to override the error described below.

I get the following error:

ERROR: 15611 SNPs with high rates of missing data (>10%). These sites should be removed.

First, I tried to use Plink to remove these SNPs, but the resulting VCF seemed to have lost a lot of information. I've since deleted the script, but I could probably work out what I did if necessary.

Second, I found that VCFtools can remove these SNPs too. I used the following command:

vcftools --vcf $file --max-missing 0.1 --recode --recode-INFO-all --out $OUTDIR/"$newname"

This step removes only a few hundred SNPs, and the error message from ShapeIt indicates that 15461 of the SNPs with high rates of missing data are still present. Have I misinterpreted the VCFtools manual, missed a parameter, or approached the problem incorrectly?
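One way to check a filtered file independently of ShapeIt is to count the high-missingness sites directly. The following is a minimal sketch, not part of the original pipeline: the function name `count_high_missing` is hypothetical, and it assumes a tab-separated, uncompressed VCF in which GT is the first FORMAT subfield and any genotype containing `.` counts as missing. ShapeIt's own definition of missing data may differ slightly.

```shell
# Hypothetical helper: count sites whose fraction of missing genotypes
# exceeds a threshold (e.g. 0.1 for the ">10%" in the ShapeIt error).
# Assumes GT is the first FORMAT subfield and "." marks a missing allele.
count_high_missing() {
  vcf=$1
  max_rate=$2
  awk -F'\t' -v thr="$max_rate" '
    /^#/ { next }                       # skip header lines
    {
      n = NF - 9                        # sample columns start at field 10
      miss = 0
      for (i = 10; i <= NF; i++) {
        split($i, sub_fields, ":")      # GT is the first subfield
        if (sub_fields[1] ~ /\./) miss++
      }
      if (n > 0 && miss / n > thr) bad++
    }
    END { print bad + 0 }               # print 0 when no site qualifies
  ' "$vcf"
}

# Example call (file name is a placeholder):
# count_high_missing chr1.recode.vcf 0.1
```

If the count printed for a filtered file is still in the thousands, the filter did not remove the sites ShapeIt is complaining about.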

Thank you in advance for your help. I am still learning a lot as I go, and bioinformatics is certainly not my forte.

Tags: snp, vcftools, filtering, vcf, shapeit
7 months ago, miles.thorburn wrote:

So I figured it out. It was simply a misunderstanding of the parameter. In the VCFtools step, --max-missing needs to be 0.9 or higher (I used 0.95 in the end). I believe this means that only variants with at most 5% missing genotypes are kept. After testing a few different values, I found that with 0.9 or higher the number of SNPs removed from each chromosome matched the number of SNPs that ShapeIt reported as having too much missing data for phasing.
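The relationship between a tolerated missing rate and the --max-missing value can be sketched as below. Both helper names are hypothetical, and the sketch assumes that VCFtools keeps a site when its fraction of non-missing genotypes is at least the value given; the exact boundary behaviour is worth confirming against the VCFtools manual.

```shell
# Hypothetical helper: convert the largest tolerated fraction of missing
# genotypes per site into the value to pass to vcftools --max-missing
# (assumed here to be the minimum fraction of non-missing genotypes).
max_missing_arg() {
  awk -v rate="$1" 'BEGIN { printf "%.2f\n", 1 - rate }'
}

# Hypothetical helper: the largest number of samples that may lack a
# genotype at a site, given the sample count and the tolerated rate.
max_missing_samples() {
  awk -v n="$1" -v rate="$2" 'BEGIN { print int(n * rate + 1e-9) }'
}

# Example with the 66 genomes from the question and a 5% tolerance:
# vcftools --vcf "$file" --max-missing "$(max_missing_arg 0.05)" \
#   --recode --recode-INFO-all --out "$OUTDIR/$newname"
```

Under these assumptions, a 10% tolerance across 66 samples allows at most 6 samples without a genotype at a site, and a 5% tolerance allows at most 3.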


Yes, that is correct. Setting --max-missing to 0.9 means that only variants genotyped in at least 90% of your samples are kept. The parameter name is misleading: the value is the minimum fraction of non-missing genotypes, not the maximum fraction of missing ones.

Please feel free to accept your own answer (I have already up-voted it).

Reply written 7 months ago by Kevin Blighe