Inconsistency in VCF and Plink format after conversion
7.1 years ago
olavur ▴ 150

When I convert a VCF file to Plink (PED and MAP files), using

$ vcftools --gzvcf myfile.vcf.gz --plink --out myfile_plink

there are too few variants in the PED file. The MAP file has the correct number of variants (the same as in the original VCF file), but the PED file has far fewer. If I try to run a Plink command on these files, Plink reports this inconsistency, and I can verify it myself as well.
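One way to verify the mismatch by hand is sketched below. It relies on the standard PED layout (six leading columns for family ID, individual ID, father, mother, sex and phenotype, then two allele columns per variant) and on the MAP file having one line per variant. The function names are made up for illustration.

```shell
# Print the distinct per-sample variant counts implied by a PED file.
# Each row has 6 leading columns followed by 2 allele columns per variant.
# More than one output line (or a fractional value) means the rows are inconsistent.
ped_variant_counts() {
    awk '{ print (NF - 6) / 2 }' "$1" | sort -un
}

# Print the number of variants listed in a MAP file (one non-empty line each).
map_variant_count() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example calls, using the output prefix from the command above:
# ped_variant_counts myfile_plink.ped
# map_variant_count myfile_plink.map
```

If the two numbers differ, the PED rows are shorter (or longer) than the MAP file says they should be, which is the inconsistency Plink complains about.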

EDIT:

Working with a file with just chromosome 1 (human genome) seems to work just fine, but when combining all chromosomes (vcf-concat) I get this problem.

EDIT2:

When processing on a per-chromosome basis, this happens sometimes as well.

vcftools plink • 3.6k views

Does this work if you use plink for the import instead (e.g. "plink --vcf myfile.vcf.gz --recode --out myfile_plink")?

Perhaps, but stable Plink (v1.07) does not support the --vcf option. I tried moving over to v1.9, but that gives me additional problems, unfortunately.

Can you be more specific about an additional problem you're having with 1.9? I'm actively updating it to remove problems with replacing v1.07 with v1.9 (and this is now the only way in which v1.9 is updated; all new stuff is only going into the v2.0 codebase at this point).

Using Plink v1.9 to convert from VCF to Plink ultimately solved the problem. Thanks. Feel free to post it as a solution, otherwise I will.

I think the "additional problems" I mentioned were just due to the fact that I need to use --const-fid with Plink v1.9, which I didn't need with Plink v1.07.
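For reference, the conversion that worked can be sketched as a small wrapper. The function name is hypothetical, and it assumes the PLINK 1.9 binary is installed under the name `plink`. As far as I know, --const-fid is needed because PLINK 1.9 otherwise tries to split each VCF sample ID into family and individual IDs at an underscore; with --const-fid the family ID is set to a constant and the whole sample ID becomes the individual ID.

```shell
# Hypothetical helper: import a VCF with PLINK 1.9 and write PED/MAP files.
# $1 = input VCF (plain or gzipped), $2 = output prefix
# Assumes the PLINK 1.9 executable is on PATH as "plink".
vcf_to_ped() {
    plink --vcf "$1" --const-fid --recode --out "$2"
}

# Example call, using the file names from the question:
# vcf_to_ped myfile.vcf.gz myfile_plink
```

The .log file written next to the output is worth reading, since it reports how many variants and samples were loaded.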

Maybe it's because PLINK cannot manage multi-allelic sites? Have you checked which variants are being filtered out?

I have used Plink with other datasets where multiple chromosomes weren't a problem, so it's probably not that. I don't think it's possible to check which variants are missing either, because only the MAP file has any information about the identity of the variants, and the MAP file is correct.

What do you mean by multiple chromosomes? I was talking about multiple alleles at a site. How come you cannot check which variants are missing? How do you know they were missing in the first place, then?

I only know how many variants there are supposed to be, and the number in the PED file differs from this.

Sorry, my background is in math/comp. sci. and I'm new to bioinformatics. I wasn't sure what you meant about not supporting multi-allelic sites, so I just assumed you were referring to my note about multiple chromosomes.

Check your VCF file and look at the TYPE of your variants. If there are sites that are not marked as SNPs, for example indels, PLINK would just omit those and your PLINK bim file will have fewer sites than expected. Also, by multiallelic I meant that a SNP can have more than two alleles, and PLINK will have "problems" with those as well, although you might have to check PLINK's documentation to see how they are handled. Try what @chrchang523 suggested and read your VCF file with PLINK. You might get useful information in the log file.

My VCF files do not have a TYPE field.
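Even without a TYPE annotation, the REF and ALT columns are enough for a rough check. The sketch below reads an uncompressed VCF on standard input and counts sites with more than one ALT allele and sites where REF or any ALT allele is longer than one base. The function name is made up, and the length test is only a heuristic: it also flags symbolic alleles such as <DEL> and does not treat * as special.

```shell
# Rough summary of VCF sites that are not plain biallelic SNPs.
# Reads VCF text on stdin; columns 4 and 5 are REF and ALT.
summarize_vcf_sites() {
    awk -F'\t' '
        /^#/ { next }                        # skip header lines
        {
            total++
            n = split($5, alts, ",")         # ALT alleles are comma-separated
            if (n > 1) multi++               # more than one ALT allele
            non_snp = (length($4) != 1)      # REF longer than one base
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != 1) non_snp = 1
            if (non_snp) other++
        }
        END { printf "total=%d multiallelic=%d non_snp=%d\n", total, multi, other }'
}

# Example call, using the file name from the question:
# gzip -cd myfile.vcf.gz | summarize_vcf_sites
```

If either count is non-zero, those sites are candidates for the variants that go missing during conversion.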
