6.7 years ago
DanielC
Hi Friends,
I am receiving an error when using plink to convert an "XXX.VCF.gz" file to PED format. The error is "Error read failure". I need this conversion to run the SNP association study (--assoc) in plink and get the P-values of the SNPs. Can you please let me know what the issue could be?
Thanks, DK
Could you post your plink command?
Thanks for the response! The command is:
plink --vcf XXX.vcf.gz --make-bed --out XXX.out
How did you zip it? I believe that plink expects gzipped. Yours may be bgzipped.
If it helps, just uncompress it and try again. If there's still an error after you do that, then check that your VCF header looks fine.
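To see how the file was compressed before uncompressing it, one option is to inspect its first bytes. This is a minimal sketch, assuming a POSIX shell with od and tr; the function name compression_type is made up for illustration. Note that bgzip (BGZF) output is itself a valid gzip stream carrying an extra 'BC' header field, so the check only tells you which kind of tool wrote the file:

```shell
# Hypothetical helper: report whether FILE looks bgzipped, plain gzipped, or neither.
compression_type() {  # usage: compression_type FILE
    # First 16 bytes as one lowercase hex string, e.g. 1f8b0804...
    hex=$(od -An -tx1 -N 16 "$1" | tr -d ' \n')
    case "$hex" in
        # gzip magic (1f8b), deflate (08), FEXTRA flag (04), then 'B','C' at bytes 12-13
        1f8b0804????????????????4243*) echo "bgzip" ;;
        1f8b*)                         echo "gzip" ;;
        *)                             echo "not gzip" ;;
    esac
}
```

For example, compression_type XXX.vcf.gz prints one of the three labels, and gzip -dc XXX.vcf.gz > XXX.vcf uncompresses either kind of file.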
Thanks Kevin! When you say that the header looks fine, do you mean to look for missing fields, such as genotype info etc?
Yes. You can paste the header here, if you wish?
Thanks much Kevin! I uncompressed the VCF file and ran plink again. There was a compression error, which is now solved, so I was able to run plink using the command:
plink --vcf XX.vcf --make-bed --out XX
The program started, but then gave an error: "line 242 of vcf file has fewer token than expected"
Interestingly, when I ran vcftools:
vcftools --vcf xx.vcf --plink --out xx
this ran without any issues and made ".ped" and ".map" files. However, when I moved ahead to perform the association study to get the P-values using:
plink --file xx --assoc
it gives the same kind of error: "line 2043 of .ped file has fewer token than expected"
I did some research and the suggested solution was to include "--no-sex --no-pheno" to avoid this error, but it did not solve the error for me. Could you please tell me what the issue could be and what a reasonable approach to solving it would be?
Thanks much!
Oh, what is on line 242 of the VCF?
What is on line 2043 of the PED file?
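A quick way to answer both questions is to print the offending line together with its token count. This is a minimal sketch, assuming a POSIX awk; show_line is a hypothetical helper name, and tokens are counted as whitespace-separated fields, which is an assumption about how plink counts them:

```shell
# Hypothetical helper: print the token count and the first 200 characters of line N.
show_line() {  # usage: show_line FILE N
    awk -v n="$2" 'NR == n { print NF " tokens: " substr($0, 1, 200); exit }' "$1"
}
```

For example, show_line XX.vcf 242 and show_line xx.ped 2043 can be compared against the same call on a neighbouring line that plink accepts.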
Thanks Kevin! I did that, but I don't know what the issue is, as line 242 in the VCF looks the same as line 243. One notable thing is that line 242 is the very next line after the header "CHROM POS ID REF ALT ......". I can't paste the line here due to confidentiality etc. It looks like this:
So in line 242 "rsXXXXX" is missing - that's what I noted.
In PED file the 2043 line is like:
TCGA-barcode..... TCGA-barcode.... 0 0 0 0 0 0 0 0 0 0 0 0... 0s till the end of row.
I also found an explanation here but related to PED and MAP file: https://stackoverflow.com/questions/31249227/plink-error-while-converting-to-binary-line-1-of-ped-file-has-fewer-tokens-tha
So, can you please give your suggestion on the solution for the error of "fewer token than expected"? Thanks
With your VCF, I first recommend normalising it with the following command:
Then, index it:
Then, read it into plink:
Don't use the VCFtools function, as VCFtools is now very old.
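As a rough illustration of those three steps, here is a sketch wrapped in a shell function. It assumes bcftools and plink 1.9 are installed; the function name, the input file name, and the reference FASTA are placeholders rather than the exact commands intended above:

```shell
# Sketch only: the function name, file names, and reference FASTA are placeholders.
normalise_and_import() {  # usage: normalise_and_import IN.vcf REF.fa OUT_PREFIX
    vcf=$1; ref=$2; prefix=$3
    # Split multiallelic records (-m-any) and left-align indels against the reference (-f)
    bcftools norm -f "$ref" -m-any -Ob -o "$prefix.norm.bcf" "$vcf" || return 1
    # Index the normalised BCF
    bcftools index "$prefix.norm.bcf" || return 1
    # Read the BCF into plink and write .bed/.bim/.fam files
    plink --bcf "$prefix.norm.bcf" --make-bed --out "$prefix"
}
```

Called as normalise_and_import MyVariants.vcf reference.fa MyVariants, this would stop at the first step that fails, which is usually where a formatting problem in the VCF shows up.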
Thanks Kevin! Could you tell me what the normalization step will do here? Is it related to distribution of the mutations?
And, by reading the normalized bcf to plink, do you mean this?:
plink --bcf MyVariants.norm.bcf --make-bed --out XX
So, instead of reading the VCF file, plink reads the normalized BCF file, after which I can do the association study? Thanks much!
Hey, the BCFtools command will left-align indels and throw more 'intellectual' errors that may help you, if errors do indeed exist in the VCF formatting.
Regarding plink, you can just as easily do plink --vcf or plink --bcf. It is important to not use the VCFtools function that converts VCF to plink, as that function is out of date.
Obviously I am limited in how I can help from a distance. There should not be any issues with VCF/BCF conversion to plink, provided that the data is managed properly and updated functions are used.
Yes, that is what I meant, to give it the full command.
Thanks Kevin! I will work on this and let you know.
Thanks Kevin! I used bcftools and performed normalization, but got this error below:
[E:vcf_parse_format] Format column with no sample columns starting at 1:13244
One thing to note is that this is the same line 242 in the VCF file that was giving error with vcftools and plink. As the error says "Format column with no sample columns starting at 1:13244", could you please let me know what the solution could be and what is the problem in this? Thanks!
Oh, where are the genotypes for that line? Is there nothing after the GT:AD:DP:GQ:PL?
Thanks Kevin! No, there is nothing after "GT:AD:DP:GQ:PL" for this line; however, for the next line, 243, there is "GD:AD:DP:GQ:PL:SDP:RD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR". So the genotype for line 242 is missing, I think? Is that what is causing the issue, or could it be something else? Any suggestions on how this can be solved? Thanks much!
Well, yes, there should be more data, if even just ./. per sample, after the GT:AD:DP:GQ:PL. I am not sure how the VCF was created like that, but it would definitely cause an error.
Any idea why it would be like that?
To quickly delete a line number in a file, you can use:
That will delete line 242
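One common way to delete a single line by number is sed's d command. This is a minimal sketch, assuming a POSIX sed; delete_line is a hypothetical helper name, and it writes to standard output rather than editing the file in place so the original VCF stays untouched:

```shell
# Hypothetical helper: write FILE without line N to standard output.
delete_line() {  # usage: delete_line FILE N > NEWFILE
    sed "${2}d" "$1"
}
```

For example, delete_line XX.vcf 242 > XX.fixed.vcf drops line 242 and keeps every other line unchanged.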
Thanks Kevin! So, yes, this line is the problem since it has missing info. Regarding the generation of the merged VCF, which includes about 10000 samples: I know that the merged VCF files were created using GATK, as the header of the merged VCF file states that. I have access only to the merged files, so it is difficult to say why such an error is present in the merged VCF file.
Does deleting such lines cause any issue in downstream analysis? As this is just one variant record in one merged VCF file of 10000 samples for chromosome 1, what do you suggest should be kept in mind when processing the other merged VCF files for the remaining 21 chromosomes?
Also, I wanted to clarify: in order to make Manhattan plots showing the association of the SNPs with the traits, can't I use the P-value (from Fisher's exact test) present in the VCF file for each mutation? Is it necessary to get the P-values by performing the plink association study and then make the Manhattan plot? Thanks much!
It depends on the things that are being compared in the Fisher's Test. Usually, Chi-square p-values are used for association studies.
Thanks Kevin! I have an update! To my surprise, when I deleted that line 242 and ran bcftools normalization, it still gives the same error, now for the line 243:
[E:vcf_parse_format] Format column with no sample columns starting at 1:13273
Do you think this error is related to some other issue? Thanks much!
That looks empty, too. How many lines in your VCF look empty?
Are you sure that you generated the file correctly?
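To count how many records are affected, one option is to compare each record's number of tab-separated fields with the #CHROM header line. This is a minimal sketch, assuming a POSIX awk and an uncompressed, tab-delimited VCF; count_short_records is a hypothetical helper name:

```shell
# Hypothetical helper: count samples in the header and records shorter than the header.
count_short_records() {  # usage: count_short_records FILE.vcf
    awk -F '\t' '
        /^#CHROM/ { expected = NF; next }   # column header: 9 fixed fields + samples
        /^#/      { next }                  # skip meta-information lines
        NF < expected { n_short++ }         # record with fewer fields than the header
        END {
            samples = (expected > 9) ? expected - 9 : 0
            print "samples=" samples " short_records=" (n_short + 0)
        }
    ' "$1"
}
```

If samples is 0, the header itself declares no sample columns, so the file carries no genotypes at all; if short_records is large, the problem is not confined to one or two lines.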
Thanks Kevin! I will figure this issue out and get back.
Hi Kevin! Just wanted to update that I have figured out the issue: the file has no genotype info and hence can't be worked on. I have now got a VCF file with genotype info and it works. Thanks!
Great to hear, DK! Thanks for coming back to update.