Question

Plink1.9 gives error when converting VCF.gz to PED? "Error read failure"

7.1 years ago

DanielC ▴ 210

Hi Friends,

I am receiving an error when using Plink to convert an "XXX.VCF.gz" file to PED format. The error is "Error read failure". I need this conversion to run the SNP association study (--assoc) in plink and get the P-values of the SNPs. Can you please let me know what the issue could be?

Thanks, DK

VCF Plink • 11k views

ADD COMMENT • link 7.1 years ago by DanielC ▴ 210

Could you post your plink command?

ADD REPLY • link 7.1 years ago by zx8754 12k

Thanks for the response! The command is:

plink --vcf XXX.vcf.gz --make-bed --out XXX.out

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

How did you zip it? I believe that plink expects gzipped. Yours may be bgzipped.

If it helps, just uncompress it and try again. If there's still an error after you do that, then check that your VCF header looks fine.
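
A minimal sketch of those two checks, assuming standard gzip and awk are on the PATH. The helper names check_gzip and vcf_column_header are made up for illustration, and XXX.vcf.gz stands in for your own file. As far as I know, bgzip output is also valid gzip, so a failing integrity test would point to a truncated or damaged file rather than to the compression flavour.

```shell
# Hypothetical helper: test the integrity of a gzip-compressed file
# without writing the uncompressed data anywhere.
check_gzip() {
  if gzip -t "$1" 2>/dev/null; then
    echo "ok"
  else
    echo "corrupt"
  fi
}

# Hypothetical helper: print only the #CHROM column-header line,
# stopping as soon as it is found instead of reading the whole file.
vcf_column_header() {
  gzip -dc "$1" 2>/dev/null | awk '/^#CHROM/ { print; exit }'
}

# Usage (the file name is a placeholder):
#   check_gzip XXX.vcf.gz
#   vcf_column_header XXX.vcf.gz
```

If check_gzip reports "corrupt", re-downloading or re-compressing the file is probably the first thing to try.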

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! When you say that the header looks fine, do you mean to look for missing fields, such as genotype info etc?

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

Yes. You can paste the header here, if you wish?

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks much Kevin! I uncompressed the VCF file and ran plink again. There was a compression error, which is now solved, so I was able to run plink using the command:

plink --vcf XX.vcf --make-bed --out XX

The program started, but then gave an error: "line 242 of vcf file has fewer token than expected"

Interestingly, when I ran vcftools:

vcftools --vcf xx.vcf --plink --out xx

this ran without any issues and made ".ped" and ".map" files. However, when I moved ahead to perform the association study to get the P-values using:

plink --file xx --assoc

it gives the same kind of error: "line 2043 of .ped file has fewer token than expected"

I did some research, and the suggested solution was to add "--no-sex --no-pheno" to avoid this error, but it did not solve the error for me. Could you please tell me what the issue could be and what a reasonable approach to solving it would be?

Thanks much!

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

Oh, what is on line 242 of the VCF?

sed '242!d' VCF.vcf

What is on line 2043 of the PED file?

sed '2043!d' PED.ped
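
Beyond looking at single lines, it may help to list every line whose token count is off. A rough sketch, assuming a tab-delimited VCF with a #CHROM header line, and a whitespace-delimited PED with six leading columns followed by two allele tokens per variant listed in the MAP file. The function names are made up for illustration.

```shell
# Hypothetical helper: report VCF records whose tab-separated field count
# differs from the #CHROM header line.
# Output columns: line number, fields found, fields expected.
vcf_field_mismatches() {
  awk -F'\t' '
    /^##/     { next }                  # skip meta-information lines
    /^#CHROM/ { expected = NF; next }   # the header defines the column count
    NF != expected { print NR, NF, expected }
  ' "$1"
}

# Hypothetical helper: report PED lines that do not have 6 + 2 * M tokens,
# where M is the number of non-empty lines in the MAP file.
# Arguments: $1 = .ped file, $2 = .map file.
ped_token_mismatches() {
  n_variants=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$2")
  awk -v expected=$((6 + 2 * n_variants)) \
    'NF != expected { print NR, NF, expected }' "$1"
}

# Usage (file names are placeholders):
#   vcf_field_mismatches VCF.vcf
#   ped_token_mismatches PED.ped PED.map
```

If nearly every record is reported, the problem is more likely in how the file was produced than in a single bad line.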

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! I did that, but I can't tell what the issue is, as line 242 in the VCF looks the same as line 243. One notable thing is that line 242 is the very next line after the header "CHROM POS ID REF ALT ......". I can't paste the line here due to confidentiality, but it looks like this:

In VCF:
line 242:
1  13245 . G A  95    .             CSQ=A|downstream_gene_variant|..............bunch of info..... GT:AD:DP:GQ:PL

line 243:
1  13256 . G C  759 LowQual CSQ=C|downstream_gene_variant| ......bunch of info...rs123456....GD:AD:DP:GQ:PL:SDP:RD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR

so in line 242 "rsXXXXX" is missing - that's what I noted.

In PED file the 2043 line is like:

TCGA-barcode..... TCGA-barcode.... 0 0 0 0 0 0 0 0 0 0 0 0... 0s till the end of row.

I also found an explanation here but related to PED and MAP file: https://stackoverflow.com/questions/31249227/plink-error-while-converting-to-binary-line-1-of-ped-file-has-fewer-tokens-tha

So, can you please give your suggestion on the solution for the error of "fewer token than expected"? Thanks

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

With your VCF, I first recommend normalising it with the following command:

bcftools norm -Ob -m-any MyVariants.vcf > MyVariants.norm.bcf

Then, index it:

bcftools index MyVariants.norm.bcf

Then, read it into plink:

plink --bcf MyVariants.norm.bcf

Don't use the VCFtools function, as VCFtools is now very old.

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! Could you tell me what the normalization step will do here? Is it related to distribution of the mutations?

And, by reading the normalized bcf to plink, do you mean this?:

plink --bcf MyVariants.norm.bcf --make-bed --out XX

So, instead of reading VCF files, plink reads the normalized BCF file, after which I can do the association study? Thanks much!

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

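Hey, with -m-any the BCFtools command will split multiallelic sites into separate records (left-aligning indels would additionally need a reference FASTA passed via -f), and it throws more informative errors that may help you, if errors do indeed exist in the VCF formatting.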

Regarding plink, you can just as easily do plink --vcf or plink --bcf. It is important to not use the VCFtools function that converts VCF to plink, as that function is out of date.

Obviously I am limited in how I can help from a distance. There should not be any issues with VCF/BCF conversion to plink, provided that the data is managed properly and updated functions are used.

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

plink --bcf MyVariants.norm.bcf --make-bed --out XX

Yes, that is what I meant, to give it the full command.

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! I will work on this and let you know.

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

Thanks Kevin! I used bcftools and performed normalization, but got this error below:

[E:vcf_parse_format] Format column with no sample columns starting at 1:13244

In VCF:
line 242:
1  13244 . G A  95    .             CSQ=A|downstream_gene_variant|..............bunch of info..... GT:AD:DP:GQ:PL

One thing to note is that this is the same line 242 in the VCF file that was giving error with vcftools and plink. As the error says "Format column with no sample columns starting at 1:13244", could you please let me know what the solution could be and what is the problem in this? Thanks!

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

Oh, where are the genotypes for that line? Is there nothing after the GT:AD:DP:GQ:PL?

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! No, there is nothing after "GT:AD:DP:GQ:PL" on this line; however, on the next line, 243, there is "GD:AD:DP:GQ:PL:SDP:RD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR". So I think the genotype for line 242 is missing? Is that what is causing the issue, or could it be something else? Any suggestions on how this can be solved? Thanks much!

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

Well, yes, there should be more data, if even just ./. per sample, after the GT:AD:DP:GQ:PL. I am not sure how the VCF was created like that, but it would definitely cause an error.

Any idea why it would be like that?

To quickly delete a line from a file by its number, you can use:

sed '242d' MyVariants.vcf

That will print the file with line 242 removed; redirect the output to a new file to keep the result.
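
If more than one record turns out to be affected, a filter may be less tedious than deleting line by line. A minimal sketch, assuming a tab-delimited VCF in which any record with fewer than 10 fields (the 9 fixed columns plus at least one sample) should be dropped. The function name is made up for illustration.

```shell
# Hypothetical helper: keep all header lines, and keep only those records
# that have at least one sample column after FORMAT. Writes to stdout.
drop_records_without_samples() {
  awk -F'\t' '/^#/ || NF >= 10' "$1"
}

# Usage (file names are placeholders):
#   drop_records_without_samples MyVariants.vcf > MyVariants.filtered.vcf
```

If the output contains only header lines, the file carries no genotypes at all, and no amount of filtering will make it usable for an association test.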

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! So, yes, this line is the problem since it has missing info. Regarding the generation of the merged VCF, which includes about 10000 samples: I know that the merged VCF files were created using GATK, as the header of the merged VCF file states that. I have access only to the merged files, so it is difficult to say why such an error is present in the merged VCF file.

Does deleting such lines cause any issue in downstream analysis? As this is just one variant record in one merged VCF file of 10000 samples for chromosome 1, what do you suggest should be kept in mind when processing the other merged VCF files for the remaining 21 chromosomes?

Also, I wanted to clarify: in order to draw Manhattan plots showing the association of the SNPs with the traits, can't I use the P-value (from Fisher's exact test) present in the VCF file for each mutation? Is it important to get the P-values by performing the plink association study and then do the Manhattan plot? Thanks much!

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

It depends on the things that are being compared in the Fisher's Test. Usually, Chi-square p-values are used for association studies.
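
For reference, the basic allelic test in a case/control scan is a Pearson chi-square on a 2x2 table of allele counts, chi2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), with 1 degree of freedom. As far as I know, this is what plink --assoc reports in its CHISQ column. A small sketch with awk follows; the function name allelic_chisq and the counts in the usage line are invented for illustration, and no continuity correction is applied.

```shell
# Hypothetical helper: Pearson chi-square for a 2x2 allele-count table.
# Arguments: a = minor alleles in cases,    b = major alleles in cases,
#            c = minor alleles in controls, d = major alleles in controls.
allelic_chisq() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
    n = a + b + c + d
    diff = a * d - b * c
    chisq = n * diff * diff / ((a + b) * (c + d) * (a + c) * (b + d))
    printf "%.4f\n", chisq
  }'
}

# Usage with made-up counts:
#   allelic_chisq 10 20 20 10
```

The allele counts must come from per-sample genotypes grouped by phenotype, which is why a VCF without genotype columns cannot feed this kind of test.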

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! I have an update! To my surprise, when I deleted that line 242 and ran bcftools normalization, it still gives the same error, now for the line 243:

[E:vcf_parse_format] Format column with no sample columns starting at 1:13273

line 243:
1  13273 . G C  759 LowQual CSQ=C|downstream_gene_variant| ......bunch of info...rs123456....GD:AD:DP:GQ:PL:SDP:RD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR

Do you think this error is related to some other issue? Thanks much!

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

That looks empty, too. How many lines in your VCF look empty?

Are you sure that you generated the file correctly?
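
One way to answer the "how many" question without paging through the file is to count the records that stop at or before the FORMAT column. A sketch, assuming a tab-delimited VCF in which the first 9 columns are the fixed ones. The function name is made up for illustration.

```shell
# Hypothetical helper: count records that have no sample columns
# (fewer than 10 tab-separated fields) against all records.
count_records_without_samples() {
  awk -F'\t' '
    /^#/    { next }       # skip all header lines
            { total++ }
    NF < 10 { empty++ }
    END { printf "%d of %d records have no sample columns\n", empty, total }
  ' "$1"
}

# Usage (the file name is a placeholder):
#   count_records_without_samples MyVariants.vcf
```

If the two numbers are equal, the whole file lacks genotypes and a different export of the data is needed.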

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

Thanks Kevin! I will figure this issue out and get back.

ADD REPLY • link 7.1 years ago by DanielC ▴ 210

Hi Kevin! Just wanted to update that I have figured out the issue: the file has no genotype info and hence can't be worked on. I have now got a VCF file with genotype info, and it works. Thanks!