4.7 years ago
DanielC ▴ 160

Hi Friends,

I am getting an error when using plink to convert an "XXX.VCF.gz" file to PED format. The error is "Error read failure". I need this conversion in order to run a SNP association study (--assoc) in plink and obtain P-values for the SNPs. Can you please let me know what the issue could be?

Thanks, DK

0
Entering edit mode

Could you post your plink command?

0
Entering edit mode

Thanks for response! The command is:

plink --vcf XXX.vcf.gz --make-bed --out XXX.out

1
Entering edit mode

How did you zip it? I believe that plink expects gzipped. Yours may be bgzipped.

If it helps, just uncompress it and try again. If there's still an error after you do that, then check that your VCF header looks fine.
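A rough sketch of that check, in case it is useful: it tests the compressed file for damage before decompressing it. The function name `check_and_decompress` is made up for illustration; only standard `gzip` options are used, and bgzip output should also pass, since it is a valid gzip stream.

```shell
# Sketch: test a gzip/bgzip-compressed VCF for damage, then decompress it.
# Usage (hypothetical file name): check_and_decompress XXX.vcf.gz
check_and_decompress() {
    vcf_gz="$1"
    out="${vcf_gz%.gz}"
    # gzip -t reads the whole archive and exits non-zero if it is truncated or corrupt.
    if gzip -t "$vcf_gz" 2>/dev/null; then
        # -d decompresses, -c writes to stdout so the original file is kept.
        gzip -dc "$vcf_gz" > "$out"
        echo "OK: wrote $out"
    else
        echo "Damaged or not gzip-compressed: $vcf_gz" >&2
        return 1
    fi
}
```

If the test fails, the download or transfer of the file is the first thing I would suspect.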

0
Entering edit mode

Thanks Kevin! When you say that the header looks fine, do you mean to look for missing fields, such as genotype info etc?

0
Entering edit mode

Yes. You can paste the header here, if you wish?

0
Entering edit mode

Thanks much Kevin! I uncompressed the VCF file and ran plink again. The compression was the problem, and that is now solved, so I was able to run plink using the command:

plink --vcf XX.vcf --make-bed --out XX

The program started fine but then gave an error: "line 242 of vcf file has fewer token than expected"

Interestingly, when I ran vcftools:

vcftools --vcf xx.vcf --plink --out xx

this ran without any issues and made ".ped" and ".map" files. However, when I moved ahead to perform the association study to get the P-values using:

it gives the same kind of error: "line 2043 of .ped file has fewer token than expected"

I did some research and the suggested solution was to include "--no-sex --no-pheno" to avoid this error, but it did not solve the problem for me. Could you please tell me what the issue could be and what would be a reasonable way to solve it?

Thanks much!

0
Entering edit mode

Oh, what is on line 242 of the VCF?

sed '242!d' VCF.vcf


What is on line 2043 of the PED file?

sed '2043!d' PED.ped
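If more than one line might be affected, a small awk sketch like the one below can list every data line whose number of tab-separated fields differs from the #CHROM header line. The function name `find_short_vcf_lines` is made up, and it assumes an uncompressed, tab-delimited VCF:

```shell
# Sketch: report VCF data lines whose field count differs from the #CHROM header.
# Usage (hypothetical file name): find_short_vcf_lines VCF.vcf
find_short_vcf_lines() {
    awk -F'\t' '
        /^##/     { next }                 # skip meta-information lines
        /^#CHROM/ { expected = NF; next }  # remember the number of header columns
        NF != expected {
            # NR is the line number in the file, header lines included
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
        }
    ' "$1"
}
```

No output means every data line has as many columns as the header.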

0
Entering edit mode

Thanks Kevin! I did that but don't know what the issue is, as line 242 of the VCF looks the same as line 243. One notable thing is that line 242 is the very next line after the header "CHROM POS ID REF ALT ......". I can't paste the line here due to confidentiality, but it looks like this:

In VCF:
line 242:
1  13245 . G A  95    .             CSQ=A|downstream_gene_variant|..............bunch of info..... GT:AD:DP:GQ:PL

line 243:


so in line 242 "rsXXXXX" is missing - that's what I noted.

In PED file the 2043 line is like:

TCGA-barcode..... TCGA-barcode.... 0 0 0 0 0 0 0 0 0 0 0 0... 0s till the end of row.

I also found an explanation here but related to PED and MAP file: https://stackoverflow.com/questions/31249227/plink-error-while-converting-to-binary-line-1-of-ped-file-has-fewer-tokens-tha

So, can you please suggest a solution for the "fewer token than expected" error? Thanks

0
Entering edit mode

With your VCF, I first recommend normalising it with the following command:

bcftools norm -Ob -m-any MyVariants.vcf > MyVariants.norm.bcf


Then, index it:

bcftools index MyVariants.norm.bcf


Then, read it into plink:

plink --bcf MyVariants.norm.bcf


Don't use the VCFtools function, as VCFtools is now very old.

0
Entering edit mode

Thanks Kevin! Could you tell me what the normalization step will do here? Is it related to the distribution of the mutations?

And, by reading the normalized BCF into plink, do you mean this?:

plink --bcf MyVariants.norm.bcf --make-bed --out XX

0
Entering edit mode

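Hey, with -m-any the BCFtools command will split multi-allelic sites into separate records (left-aligning indels additionally requires a reference FASTA supplied via -f), and it will throw more informative errors that may help you, if errors do indeed exist in the VCF formatting.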

Regarding plink, you can just as easily do plink --vcf or plink --bcf. It is important to not use the VCFtools function that converts VCF to plink, as that function is out of date.

Obviously I am limited in how I can help from a distance. There should not be any issues with VCF/BCF conversion to plink, provided that the data is managed properly and updated functions are used.

0
Entering edit mode
plink --bcf MyVariants.norm.bcf --make-bed --out XX


Yes, that is what I meant, to give it the full command.

1
Entering edit mode

Thanks Kevin! I will work on this and let you know.

0
Entering edit mode

Thanks Kevin! I used bcftools and performed normalization, but got this error below:

[E:vcf_parse_format] Format column with no sample columns starting at 1:13244

In VCF:
line 242:
1  13244 . G A  95    .             CSQ=A|downstream_gene_variant|..............bunch of info..... GT:AD:DP:GQ:PL


One thing to note is that this is the same line 242 of the VCF file that was giving an error with plink. As the error says "Format column with no sample columns starting at 1:13244", could you please let me know what the problem is here and what the solution could be? Thanks!

0
Entering edit mode

Oh, where are the genotypes for that line? Is there nothing after the GT:AD:DP:GQ:PL?

0
Entering edit mode

Thanks Kevin! No, there is nothing after "GT:AD:DP:GQ:PL" for this line; however, for the next line, 243, there is "GD:AD:DP:GQ:PL:SDP:RD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR". So I think the genotype for line 242 is missing? Is that what is causing the issue, or could it be something else? Any suggestions on how this can be solved? Thanks much!

0
Entering edit mode

Well, yes, there should be more data, if even just ./. per sample, after the GT:AD:DP:GQ:PL. I am not sure how the VCF was created like that, but it would definitely cause an error.

Any idea why it would be like that?

To quickly delete a line number in a file, you can use:

sed '242d' MyVariants.vcf > MyVariants.fixed.vcf


That will write a copy of the file without line 242 (MyVariants.vcf itself is left unchanged).
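If several lines turn out to be affected, removing them one at a time gets tedious. Here is a sketch that keeps the header plus only those data lines that have at least one genotype column after FORMAT (more than 9 tab-separated fields). The function name `drop_lines_without_genotypes` and the output file name are made up, and this only helps if the header actually declares sample columns:

```shell
# Sketch: copy a VCF, dropping data lines that have no genotype columns.
# Usage (hypothetical names): drop_lines_without_genotypes MyVariants.vcf MyVariants.filtered.vcf
drop_lines_without_genotypes() {
    # Keep every header line (starts with '#') and every data line that has
    # more than the 9 fixed columns (CHROM ... FORMAT), i.e. at least one sample.
    awk -F'\t' '/^#/ || NF > 9' "$1" > "$2"
}
```

Bear in mind that any variant dropped this way is simply absent from the downstream analysis.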

0
Entering edit mode

Thanks Kevin! So, yes, this line is the problem since it has missing info. Regarding the generation of the merged VCF, which includes about 10000 samples: I know that the merged VCF files were created using GATK, as the header of the merged VCF file states that. I only have access to the merged files, so it is difficult to say why such an error is present in the merged VCF file.

Does deleting such lines cause any issue in downstream analysis? As this is just one variant in one merged VCF file of 10000 samples for chromosome 1, what do you suggest I keep in mind when processing the other merged VCF files for the remaining 21 chromosomes?

Also, I wanted to clarify: in order to draw Manhattan plots showing the association of the SNPs with the traits, can't I use the P-value (from Fisher's exact test) present in the VCF file for each mutation? Is it important to get the P-values by performing a plink association study and then do the Manhattan plot? Thanks much!

1
Entering edit mode

It depends on the things that are being compared in the Fisher's Test. Usually, Chi-square p-values are used for association studies.
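To make the Chi-square idea concrete, here is a sketch of the 1-degree-of-freedom Pearson statistic for a 2x2 table of allele counts, which, as far as I know, is the kind of allelic test that plink --assoc reports. The function name `allelic_chisq` and the counts in the usage note are invented for illustration, and no continuity correction is applied:

```shell
# Sketch: Pearson chi-square statistic for a 2x2 allele-count table.
# Arguments: a b c d = case minor, case major, control minor, control major.
allelic_chisq() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # An empty row or column makes the statistic undefined.
        if (denom == 0) { print "NA"; exit }
        printf "%.4f\n", n * (a * d - b * c) ^ 2 / denom
    }'
}
```

For example, `allelic_chisq 10 20 30 40` prints 0.7937. A P-value stored in the VCF by a variant caller usually compares something different (for instance tumour versus normal read counts), so it is not a substitute for a case/control association test.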

0
Entering edit mode

Thanks Kevin! I have an update! To my surprise, when I deleted line 242 and ran the bcftools normalization again, it still gave the same error, now for line 243:

[E:vcf_parse_format] Format column with no sample columns starting at 1:13273

line 243:


Do you think this error is related to some other issue? Thanks much!

0
Entering edit mode

That looks empty, too. How many lines in your VCF look empty?

Are you sure that you generated the file correctly?
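A quick way to answer that, sketched with awk under the assumption of an uncompressed, tab-delimited VCF: count the sample columns declared in the #CHROM line and the data lines that stop at or before the FORMAT column. The function name `summarise_vcf_genotypes` is made up:

```shell
# Sketch: summarise how much genotype data a VCF actually contains.
# Usage (hypothetical file name): summarise_vcf_genotypes MyVariants.vcf
summarise_vcf_genotypes() {
    awk -F'\t' '
        /^##/     { next }                                    # skip meta lines
        /^#CHROM/ { samples = (NF > 9) ? NF - 9 : 0; next }   # columns after FORMAT
                  { total++; if (NF <= 9) empty++ }           # data line without genotypes
        END {
            printf "samples in header: %d\n", samples
            printf "data lines: %d\n", total
            printf "lines without genotypes: %d\n", empty
        }
    ' "$1"
}
```

If the header reports 0 samples, or every data line lacks genotypes, the file is a sites-only VCF and no amount of line deletion will make it usable for an association test.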

0
Entering edit mode

Thanks Kevin! I will figure this issue out and get back.

0
Entering edit mode

Hi Kevin! Just wanted to update that I have figured out the issue: the file has no genotype info and hence can't be worked with. I have now got a VCF file with genotype info, and it works. Thanks!

1
Entering edit mode

Great to hear, DK! Thanks for coming back to update.