Question: Inconsistency in VCF and Plink format after conversion
3.5 years ago by
olavur100
Tórshavn, Faroe Islands
olavur100 wrote:

When I convert a VCF file to Plink (PED and MAP files), using

$ vcftools --gzvcf myfile.vcf.gz --plink --out myfile_plink

there are too few variants in the PED file. The MAP file has the correct number of variants (the same as in the original VCF file), but the PED file has far fewer. If I try to run a Plink command on these files, Plink reports this inconsistency, and I can verify it myself as well.
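One way to verify the mismatch independently of Plink is to count variants in both files directly. The following is a minimal sketch, assuming the standard layout: one variant per line in the MAP file, and six leading columns (family ID, sample ID, father, mother, sex, phenotype) followed by two allele columns per variant in the PED file. The function names and the `myfile_plink` prefix are only illustrative.

```python
from collections import Counter


def count_map_variants(map_path):
    """Count non-empty lines in a MAP file (one variant per line)."""
    with open(map_path) as handle:
        return sum(1 for line in handle if line.strip())


def ped_variant_counts(ped_path):
    """Map 'variants found on a sample line' -> number of such lines.

    Assumes 6 leading columns, then two allele columns per variant.
    """
    counts = Counter()
    with open(ped_path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            n_allele_cols = len(fields) - 6
            if n_allele_cols < 0 or n_allele_cols % 2:
                raise ValueError(f"malformed PED line for sample {fields[:2]}")
            counts[n_allele_cols // 2] += 1
    return counts


def check_ped_map(prefix):
    """Return (variants in MAP, per-line PED counts, whether they agree)."""
    n_map = count_map_variants(f"{prefix}.map")
    per_line = ped_variant_counts(f"{prefix}.ped")
    consistent = set(per_line) == {n_map}
    return n_map, dict(per_line), consistent


if __name__ == "__main__":
    # Hypothetical prefix, matching the --out value used above.
    print(check_ped_map("myfile_plink"))
```

If the per-line counts differ between samples, the PED file itself is malformed; if they all agree with each other but not with the MAP file, whole variants were dropped from the genotype output.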

EDIT:

Working with a file with just chromosome 1 (human genome) seems to work just fine, but when combining all chromosomes (vcf-concat) I get this problem.

EDIT2:

When processing on a per-chromosome basis, this happens sometimes as well.

plink vcftools • 1.8k views
modified 3.5 years ago • written 3.5 years ago by olavur (100)

Does this work if you use plink for the import instead (e.g. "plink --vcf myfile.vcf.gz --recode --out myfile_plink")?

written 3.5 years ago by chrchang523 (7.3k)

Perhaps, but stable Plink (v1.07) does not support the --vcf option. I tried moving over to v1.9, but that gives me additional problems, unfortunately.

written 3.5 years ago by olavur (100)

Can you be more specific about an additional problem you're having with 1.9? I'm actively updating it to remove problems with replacing v1.07 with v1.9 (and this is now the only way in which v1.9 is updated; all new stuff is only going into the v2.0 codebase at this point).

written 3.5 years ago by chrchang523 (7.3k)

Using Plink v1.9 to convert from VCF to Plink ultimately solved the problem. Thanks. Feel free to post it as a solution, otherwise I will.

I think the "additional problems" I was speaking of were just due to the fact that I need to use --const-fid with Plink v1.9, which I didn't need with Plink v1.07.
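The exact command that resolved the issue is not given in the thread. A plausible reconstruction, combining the `plink --vcf ... --recode --out ...` suggestion above with `--const-fid`, is sketched below. It assumes a PLINK 1.9 binary named `plink` is on the PATH; the wrapper functions are hypothetical helpers, not part of PLINK.

```python
import shutil
import subprocess


def plink_vcf_to_ped_cmd(vcf_path, out_prefix, plink="plink"):
    """Build a PLINK 1.9 command that imports a VCF and writes PED/MAP."""
    return [
        plink,
        "--vcf", vcf_path,
        "--const-fid",   # give every sample the same family ID
        "--recode",      # write text PED/MAP output
        "--out", out_prefix,
    ]


def run_conversion(vcf_path, out_prefix, plink="plink"):
    """Run the conversion; raises if PLINK is missing or exits non-zero."""
    if shutil.which(plink) is None:
        raise FileNotFoundError(f"{plink!r} not found on PATH")
    subprocess.run(plink_vcf_to_ped_cmd(vcf_path, out_prefix, plink), check=True)
    return f"{out_prefix}.ped", f"{out_prefix}.map"


if __name__ == "__main__":
    # File names taken from the original question.
    print(" ".join(plink_vcf_to_ped_cmd("myfile.vcf.gz", "myfile_plink")))
```

PLINK also writes a `.log` file next to the output, which is worth reading for warnings about skipped or altered variants.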

written 3.5 years ago by olavur (100)

Maybe it's because PLINK cannot manage multi-allelic sites? Have you checked which variants are being filtered out?

written 3.5 years ago by GabrielMontenegro (560)

I have used Plink with other datasets where multiple chromosomes weren't a problem, so it's probably not that. I don't think it's possible to check which variants are missing either, because only the MAP file has any information about the identity of the variants, and the MAP file is correct.

written 3.5 years ago by olavur (100)

What do you mean by multiple chromosomes? I was talking about multiple alleles at a site. Why can't you check which variants are missing? How do you know they were missing in the first place, then?

written 3.5 years ago by GabrielMontenegro (560)

I only know how many variants there are supposed to be, and the number in the PED file differs from this.

Sorry, my background is math/comp.sci., new to bioinformatics. I wasn't sure what you meant about not supporting multi-allelic sites, I just assumed you were referring to my note about multiple chromosomes.

written 3.5 years ago by olavur (100)

Check your VCF file and look at the TYPE of your variants. If there are sites that are not marked as SNPs (INDELs, for example), PLINK would just omit those and your PLINK bim file will have fewer sites than expected. Also, by multiallelic I meant that you can have multiple alleles at a SNP, and PLINK will also have "problems" with that, although you might have to check PLINK's documentation to see how it handles those. Try what @chrchang523 suggested and read your VCF file with PLINK. You might get useful information in the log file.

written 3.5 years ago by GabrielMontenegro (560)

My VCF files do not have a TYPE field.
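Even without a TYPE annotation, the kind of variant can be inferred from the REF and ALT columns, which every VCF record has. The sketch below counts sites by a deliberately coarse classification: more than one comma-separated ALT allele is treated as multi-allelic, a single-base REF with a single-base ALT as a bi-allelic SNP, and everything else (indels, symbolic alleles, missing ALT) as "other". The function names and category labels are made up for illustration, and how each tool treats these categories should be checked against its own documentation.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def classify_site(ref, alt):
    """Classify one record using only its REF and ALT columns."""
    alts = alt.split(",")
    if len(alts) > 1:
        return "multiallelic"
    single_bases = "ACGTN"
    if (len(ref) == 1 and len(alts[0]) == 1
            and ref.upper() in single_bases and alts[0].upper() in single_bases):
        return "biallelic_snp"
    return "other"  # indels, MNPs, symbolic alleles such as <DEL>, "." etc.


def summarize_vcf(lines):
    """Count records per category from an iterable of VCF text lines."""
    counts = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t", 5)
        kind = classify_site(fields[3], fields[4])  # REF, ALT
        counts[kind] = counts.get(kind, 0) + 1
    return counts


if __name__ == "__main__":
    # File name taken from the original question.
    with open_vcf("myfile.vcf.gz") as handle:
        print(summarize_vcf(handle))
```

Comparing the number of bi-allelic SNPs with the number of variants that actually reach the PED file can show whether the "missing" variants correspond to the multi-allelic or non-SNP sites.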

written 3.5 years ago by olavur (100)