Question: How To Uncompress The 1000 Genome Vcf.Gz File
1
7.0 years ago by
GPR320
Mexico
GPR320 wrote:

Hello, can somebody tell me how to uncompress 1000 Genome vcf.gz files? I am performing an RNA-editing analysis and would like to subtract annotated SNPs/INDELs. I have already done so using dbSNP data with bedtools intersect, but am still stuck with the 1000 Genome Project *.vcf.gz files. I downloaded these for each chromosome and then concatenated them. These files are in a format that gunzip/gzip -d won't recognize. I tried using this file unzipped in bedtools intersect, but it wasn't recognized. Many thanks,

• 20k views
modified 7.0 years ago by bbio80 • written 7.0 years ago by GPR320
1

"I downloaded these for each chromosome and then concatenated them" - what files are you referring to here? And can you link to an example compressed file? There is no good reason why gunzip will not work on a .gz file, unless the file is corrupt or not actually a .gz file. What error message does gunzip give?
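The two failure modes mentioned here (a corrupt file, or a file that is not really gzip) can be checked directly. The following is a minimal sketch, assuming only Python's standard library; the function names `looks_like_gzip` and `can_decompress` are hypothetical helpers, not part of any tool discussed in this thread.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def can_decompress(path, chunk_size=1 << 20):
    """Read the whole file through gzip; return (ok, error_message)."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard in chunks so large files do not fill memory.
            while handle.read(chunk_size):
                pass
        return True, ""
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers "not a gzipped file"; EOFError covers truncation.
        return False, str(exc)
```

If `looks_like_gzip` returns False, the download is probably something else entirely (for example an HTML error page or an FTP transfer done in text mode), which would be consistent with a "not in gzip format" message.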

written 7.0 years ago by Neilfws48k

I downloaded the *.vcf.gz files per chromosome from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ and tried unzipping each file separately, but that failed. I therefore concatenated them all and tried to unzip the result, also unsuccessfully. I downloaded them twice to rule out corrupt files. The error messages I get are "not in gunzip or gzip" format.

written 7.0 years ago by GPR320

Don't concatenate zipped files, that will never work. EDIT: I'm wrong, see Pierre's comment below. The *.vcf.gz files should definitely be decompressible with gzip.

Most of my money would be on some type of user error. How exactly are you trying to unzip them? Can you give us an example file that doesn't work for you?

I suppose there's a tiny chance you're on a strange computer system with an unhappy version of gzip. What OS/platform are you on?

modified 7.0 years ago • written 7.0 years ago by matted7.2k
2

"Don't concatenate zipped files, that will never work." : In fact, apart from the current problem, that could work: http://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files

written 7.0 years ago by Pierre Lindenbaum124k

Nice! I didn't know that. I take back my overly broad claim.

It looks like it will concatenate the uncompressed contents into a single file, which is only sometimes what you want... but useful to know nonetheless.
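The behaviour described here is easy to reproduce. This is a small sketch using Python's standard `gzip` module on made-up byte strings (the "chr1"/"chr2" contents are placeholders, not real VCF data): two independently compressed members are joined byte for byte, and decompressing the result yields the two original contents back to back.

```python
import gzip

# Compress two pieces of data independently, as if they were two .gz files.
part_a = gzip.compress(b"chr1 line\n")
part_b = gzip.compress(b"chr2 line\n")

# Byte-level concatenation, equivalent to `cat a.gz b.gz > joined.gz`.
joined = part_a + part_b

# A multi-member gzip stream decompresses to the concatenated contents.
combined = gzip.decompress(joined)
print(combined)
```

For VCF files this means the header lines of every input file end up repeated in the middle of the output, which is why the result is "only sometimes what you want".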

written 7.0 years ago by matted7.2k

I believe that this is the property that makes BGZF block compression (used in BAM compression) possible; see "BGZF - Blocked, Bigger & Better GZIP!"
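To tell a BGZF file apart from plain gzip, one can look for the extra subfield that each BGZF block carries in its gzip header. The sketch below assumes the header layout as I understand it from the SAM/BAM specification (FEXTRA flag set, and a subfield with identifier bytes "B" and "C" and a 2-byte payload); `is_bgzf` is a hypothetical helper name.

```python
import struct


def is_bgzf(header):
    """Return True if `header` (the first bytes of a file) starts a BGZF block."""
    # gzip magic (1f 8b), deflate method (08), FLG with only FEXTRA set (04).
    if len(header) < 12 or header[:4] != b"\x1f\x8b\x08\x04":
        return False
    # XLEN: little-endian length of the extra field, stored at offset 10.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the subfields: SI1, SI2, then a 2-byte payload length.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```

Reading the first 18 bytes of a file in binary mode and passing them to `is_bgzf` is enough for this check. A file that passes is still a valid multi-member gzip stream, so ordinary gzip tools are expected to read it.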

written 7.0 years ago by Istvan Albert ♦♦ 81k
3
7.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

Try using

bgzip -d

from the tabix package. But it's strange, gunzip should work.
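If neither command-line tool is cooperating, the same decompression can be attempted from Python as a cross-check. This is a minimal sketch using only the standard library; `decompress_gz` is a hypothetical helper and the paths in the usage note are placeholders. It streams the data, so the file never has to fit in memory.

```python
import gzip
import shutil


def decompress_gz(src_path, dest_path):
    """Stream-decompress the gzip (or BGZF) file src_path into dest_path."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # copyfileobj copies in fixed-size chunks rather than all at once.
        shutil.copyfileobj(src, dest)
```

A call would look like `decompress_gz("chr22.vcf.gz", "chr22.vcf")`, with the file names standing in for whatever was downloaded. If this also fails with a "not a gzipped file" style error, the problem most likely lies with the downloaded bytes rather than with gunzip or bgzip.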

written 7.0 years ago by Pierre Lindenbaum124k

Thanks for the suggestion Pierre. I just installed tabix, which I didn't have, but once more, I couldn't unzip the 1000 genome files. This time I get the error: invalid block header. The same for the individual files and for the concatenated one.

written 7.0 years ago by GPR320
1
7.0 years ago by
bbio80
United Kingdom
bbio80 wrote:

I'm not sure if I am understanding your description correctly, but if you concatenated the .gz files before trying to unzip them, that would probably be the problem. So if this is what you did, try unzipping them first individually and then concatenating them.
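The "unzip individually, then concatenate" order can be sketched as follows. This assumes, beyond what is stated in the thread, that all per-chromosome files share the same header (same meta lines and sample columns), so the `#` lines are kept from the first file only; `merge_vcf_gz` is a hypothetical helper name.

```python
import gzip


def merge_vcf_gz(paths, dest_path):
    """Decompress each .vcf.gz in `paths` and append it to one plain VCF.

    Assumption: every input has an equivalent header, so header lines
    (those starting with '#') are written from the first file only.
    """
    with open(dest_path, "wt") as dest:
        for index, path in enumerate(paths):
            with gzip.open(path, "rt") as src:
                for line in src:
                    if index > 0 and line.startswith("#"):
                        continue  # skip repeated headers from later files
                    dest.write(line)
```

Passing the files in chromosome order gives a single uncompressed VCF with one header, which is the kind of input a tool like bedtools intersect expects.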

written 7.0 years ago by bbio80