Error in converting vcf file to plink format
8 weeks ago

Hi,

I am trying to convert a VCF file to PLINK format. I am using this dataset (ALL.2of4intersection.20100804.genotypes.vcf.gz). This is the command I ran on the command line:

plink2 --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes

However, I get the following error: Error: --vcf file decompression failure: Malformed BGZF block

Then I tried to decompress the .gz file to check whether it is intact: gzip -d ALL.2of4intersection.20100804.genotypes.vcf.gz

That fails as well, with these errors:

gzip: ALL.2of4intersection.20100804.genotypes.vcf.gz: invalid compressed data--crc error
gzip: ALL.2of4intersection.20100804.genotypes.vcf.gz: invalid compressed data--length error
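
Both gzip messages point at a damaged archive (typically an incomplete or corrupted download) rather than at a PLINK problem. A minimal sketch for testing an archive without unpacking it, assuming a POSIX shell with gzip installed; the helper name check_gzip is made up for illustration. BGZF files are gzip-compatible, so the same test should also work on bgzipped VCFs.

```shell
# Hypothetical helper: returns 0 if the gzip stream is intact, non-zero otherwise.
check_gzip() {
    # gzip -t reads the whole archive and verifies its CRC and length fields
    # without writing any decompressed output to disk.
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT: $1" >&2
        return 1
    fi
}
```

For example, check_gzip ALL.2of4intersection.20100804.genotypes.vcf.gz. If it reports CORRUPT, downloading the file again is usually the only fix.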

Any suggestions about these error messages would be very helpful. Thank you!

Plink cmd linux • 729 views

Hello,

I am trying to download the file with the population information for 1000 Genomes using wget.

But I get the following error:

--2022-09-29 11:06:10-- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel
=> ‘20100804.ALL.panel’
Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘ftp.1000genomes.ebi.ac.uk’

Could you please help me check whether I used the correct command to download this file? Thank you!

Try

wget -O 20100804/20100804.ALL.panel http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel


If you continue getting the following error

Temporary failure in name resolution

then it is a local problem on your end: your DNS is not able to resolve the host name. Wait and see if the problem resolves itself.

Thank you for the information. I just tried your command and got the same error, Temporary failure in name resolution. By "wait and see if the problem resolves", do you mean that I should rerun the command after waiting for some time? Kindly let me know. Thank you!

Correct. This looks like a problem with your local network. I was able to access the link without problems.

Thank you so much! I just checked my working directory with "ls" and found a file named "20100804.ALL.panel". Does this mean it was downloaded successfully? Please let me know how I can make sure. Thank you!

As long as the file is not empty it should be good.
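
The emptiness check matters here because wget -O creates the output file before the transfer starts, so a failed download can leave a zero-byte file behind. A small sketch of that check, assuming a POSIX shell; check_download is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds only if the file exists and is not empty.
check_download() {
    # [ -s FILE ] is true only when FILE exists and its size is greater than zero
    if [ -s "$1" ]; then
        echo "$1: $(wc -l < "$1") lines"
    else
        echo "$1 is missing or empty" >&2
        return 1
    fi
}
```

For example, check_download 20100804.ALL.panel, followed by head 20100804.ALL.panel to look at the first few lines.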

Thank you so much! Will try to run the analysis with the file and check it! Thanks again!!

8 weeks ago

 wget -O ALL.2of4intersection.20100804.genotypes.vcf.gz "https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz"

Thank you so much for the information. I had actually downloaded the file three times, which took several hours, and still got that error. I will use the command you gave me to re-download the file. Thank you!

Thank you so much again, I was able to download the file successfully! However, when I try to convert that VCF file to PLINK format, it only produces the FAM file.

I used this command to convert: plink2 --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes

I get this error related to the BIM file:

--vcf: 25488488 variants scanned.
--vcf: ALL.2of4intersection.20100804.genotypes-temporary.pgen + ALL.2of4intersection.20100804.genotypes-temporary.pvar.zst + ALL.2of4intersection.20100804.genotypes-temporary.psam written.
629 samples (0 females, 0 males, 629 ambiguous; 629 founders) loaded from ALL.2of4intersection.20100804.genotypes-temporary.psam.
25488488 variants loaded from ALL.2of4intersection.20100804.genotypes-temporary.pvar.zst.
Note: No phenotype data present.
Writing ALL.2of4intersection.20100804.genotypes.fam ... done.
Writing ALL.2of4intersection.20100804.genotypes.bim ...
Error: ALL.2of4intersection.20100804.genotypes.bim cannot contain multiallelic variants.

Could you kindly let me know whether there is a way to convert this VCF to BED, BIM and FAM files? Thank you once again!

Remove the multiallelic variants with bcftools view -m2 -M2, or normalise with bcftools norm (its -m option can split multiallelic records into biallelic ones).
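
For readers unfamiliar with bcftools: a record is multiallelic when its ALT column (column 5) lists more than one allele, separated by commas. With bcftools installed, the first suggestion would look something like bcftools view -m2 -M2 -Oz -o biallelic.vcf.gz input.vcf.gz, where both file names are placeholders. The sketch below illustrates the same idea on an uncompressed VCF stream using only awk. It is not a replacement for bcftools: filter_biallelic is a made-up helper name, and unlike -m2 it keeps records whose ALT is ".".

```shell
# Hypothetical helper: pass through header lines and records with a single ALT allele.
# Reads the named files, or standard input when no file is given.
filter_biallelic() {
    # Header lines start with '#'; a comma in column 5 (ALT) marks a multiallelic record.
    awk -F'\t' '/^#/ || $5 !~ /,/' "$@"
}
```

A possible use on a compressed file: gzip -dc input.vcf.gz | filter_biallelic | gzip > biallelic.vcf.gz (again with placeholder file names).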

Thank you for the very quick answer and I will try it as you mentioned! Thank you!

I tried to read up on bcftools, but I could not figure it out since I am not familiar with it. However, while searching, I found that "--max-alleles 2" can be used to filter out the multiallelic variants. I just tried it, it worked for me, and I was able to get the BED, BIM and FAM files! I hope this approach is okay, and I really appreciate your kind help once again. Thank you!
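
For reference, the full command with this flag would presumably look like the sketch below. It assumes plink2 is on the PATH; vcf_to_bed_biallelic is a hypothetical wrapper name, and the other flags are the ones from the conversion command earlier in this thread. Note that --max-alleles 2 drops the multiallelic variants rather than splitting them.

```shell
# Hypothetical wrapper: convert a VCF to BED/BIM/FAM, skipping variants with more than 2 alleles.
# $1 = input VCF (plain or gzipped), $2 = output prefix
vcf_to_bed_biallelic() {
    plink2 --vcf "$1" --max-alleles 2 --make-bed --out "$2"
}
```

For example: vcf_to_bed_biallelic ALL.2of4intersection.20100804.genotypes.vcf.gz ALL.2of4intersection.20100804.genotypes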

Hi,

I have another issue with this dataset. After obtaining the BED, BIM and FAM files, I tried to run some QC steps on my computer. However, I get the following memory error:

FATAL ERROR Exhausted system memory
You need a smaller dataset or a bigger computer...
Forced exit now...

Is there a way to extract a smaller subset from this large dataset for practice? If so, could you kindly let me know how? Thank you!

This error is a consequence of using plink 1.07, which has to load the entire dataset into memory, instead of plink 1.9, which is capable of processing the data in a streaming manner.

I just tried plink 1.9 and it worked well! Thank you once again!