How To Uncompress The 1000 Genomes vcf.gz Files
2
2
9.6 years ago
GPR ▴ 380

Hello, can somebody tell me how to uncompress the 1000 Genomes vcf.gz files? I am performing an RNA-editing analysis and would like to subtract annotated SNPs/INDELs. I have already done so with dbSNP data using bedtools intersect, but I am still stuck on the 1000 Genomes Project *.vcf.gz files. I downloaded one file per chromosome and then concatenated them. These files are in a format that gunzip/gzip -d won't recognize. I also tried passing the file to bedtools intersect without decompressing it, but it wasn't recognized. Many thanks,

• 31k views
1

"I downloaded these for each chromosome and then concatenated them" - what files are you referring to here? And can you link to an example compressed file? There is no good reason why gunzip will not work on a .gz file, unless the file is corrupt or not actually a .gz file. What error message does gunzip give?
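One quick way to tell a corrupt or mislabeled download from genuine gzip data is to look at the first two bytes of the file, which are 1f 8b for any gzip stream, and then run gzip's own integrity test. A minimal sketch, assuming a POSIX-like shell with head, od and gzip available; the helper name is_gzip and the throwaway files are made up for illustration:

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
# (Hypothetical helper, not part of gzip or tabix.)
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Demo on throwaway files; substitute a downloaded .vcf.gz in practice.
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.1\n' | gzip -c > "$tmp/good.vcf.gz"
# Plain text saved under a .gz name, e.g. an error page from a failed download.
printf '<html>404 Not Found</html>\n' > "$tmp/bad.vcf.gz"

for f in "$tmp/good.vcf.gz" "$tmp/bad.vcf.gz"; do
    if is_gzip "$f" && gzip -t "$f" 2>/dev/null; then
        echo "$(basename "$f"): gzip data, passes integrity test"
    else
        echo "$(basename "$f"): not valid gzip"
    fi
done
```

If the magic bytes are wrong, the file was never gzip data in the first place; if they are right but gzip -t fails, the file is more likely truncated or damaged in transit.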

0

I downloaded the per-chromosome *.vcf.gz files from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ I tried unzipping each file separately, but that failed. I then concatenated them all and tried to unzip the result, also without success. I downloaded the files twice to rule out corruption. The error messages I get are "not in gunzip or gzip" format.

0

Don't concatenate zipped files; that will never work. EDIT: I'm wrong, see Pierre's comment below. The *.vcf.gz files should definitely be decompressible with gzip.

Most of my money would be on some type of user error. How exactly are you trying to unzip them? Can you give us an example file that doesn't work for you?

I suppose there's a tiny chance you're on a strange computer system with an unhappy version of gzip. What OS/platform are you on?

2

"Don't concatenate zipped files, that will never work." : In fact, apart from the current problem, that could work: http://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files

0

Nice! I didn't know that. I take back my overly broad claim.

It looks like it will concatenate the uncompressed contents into a single file, which is only sometimes what you want... but useful to know nonetheless.
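The behaviour described here is easy to check on two tiny files. This is only a sketch with made-up file names and contents, assuming a standard gzip:

```shell
tmp=$(mktemp -d)
printf 'chr1\t100\n' | gzip -c > "$tmp/chr1.gz"
printf 'chr2\t200\n' | gzip -c > "$tmp/chr2.gz"

# Byte-level concatenation of the two compressed files.
cat "$tmp/chr1.gz" "$tmp/chr2.gz" > "$tmp/both.gz"

# gzip reads each member in turn and emits their contents back to back.
combined=$(gzip -dc "$tmp/both.gz")
printf '%s\n' "$combined"
```

The result is the same text you would get by decompressing each file and concatenating the outputs, so for VCF files every per-file header would be repeated in the middle of the combined output.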

0

I believe this is the property that makes BGZF block compression (used for BAM files) possible; see "BGZF - Blocked, Bigger & Better GZIP!"
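Because a BGZF file is a series of small gzip members, each one starts with an ordinary gzip header whose extra field carries a "BC" subfield. The sketch below checks for that signature. It assumes the "BC" subfield is the first one in the extra field, which matches the fixed header I understand bgzip to write; the helper name is_bgzf is hypothetical, and the first test file is the 28-byte empty end-of-file block given in the SAM/BAM specification:

```shell
# is_bgzf FILE: succeed if FILE begins with a BGZF-style gzip header.
# Bytes 1-4: gzip magic (1f 8b), deflate (08), flags with FEXTRA set (04).
# Bytes 13-14: the subfield identifier 'B','C' (42 43), assumed to come first.
is_bgzf() {
    head4=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    sub=$(head -c 14 "$1" | tail -c 2 | od -An -tx1 | tr -d ' \n')
    [ "$head4" = "1f8b0804" ] && [ "$sub" = "4243" ]
}

tmp=$(mktemp -d)
# The empty BGZF end-of-file block, written byte by byte in octal.
printf '\037\213\010\004\000\000\000\000\000\377\006\000\102\103\002\000\033\000\003\000\000\000\000\000\000\000\000\000' > "$tmp/eof.bgzf"
# An ordinary gzip file for comparison.
printf 'chr1\t100\n' | gzip -c > "$tmp/plain.gz"

for f in "$tmp/eof.bgzf" "$tmp/plain.gz"; do
    if is_bgzf "$f"; then
        echo "$(basename "$f"): BGZF"
    else
        echo "$(basename "$f"): not BGZF"
    fi
done
```

A file that passes a plain gzip check but fails this one is still valid gzip; it just was not written with block compression.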

7
9.6 years ago

Try using

bgzip -d


from the tabix package. But it's strange, gunzip should work.
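When trying different decompressors on the same download, it helps to write the output to a new file and leave the compressed original untouched. A small sketch using gzip on a toy file (the file name and its two-line header are made up and not a complete VCF); bgzip accepts the same -d and -c flags, so "bgzip -d -c" can be swapped in where tabix is installed:

```shell
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n' | gzip -c > "$tmp/chr1.vcf.gz"

# -d decompresses, -c writes to standard output, so the .gz file stays in place.
gzip -d -c "$tmp/chr1.vcf.gz" > "$tmp/chr1.vcf"

# Both the compressed and the uncompressed file are now present.
ls "$tmp"
```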

0

Thanks for the suggestion, Pierre. I just installed tabix, which I didn't have, but once more I couldn't unzip the 1000 Genomes files. This time I get the error "invalid block header", both for the individual files and for the concatenated one.

0

Hey, thanks! I was having an issue unzipping the files, and bgzip -d worked for me.

1
9.6 years ago
bbio ▴ 90

I'm not sure I understand your description correctly, but if you concatenated the .gz files before trying to unzip them, that is probably the problem. If that is what you did, try unzipping them individually first and then concatenating them.
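The decompress-then-concatenate order suggested here could look like the sketch below, which also keeps the VCF header lines from the first file only so they are not repeated mid-file. The toy files and their names are invented, and the sketch assumes every per-chromosome file carries the same header and sample columns; for real data a dedicated tool such as bcftools concat may be the safer choice.

```shell
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n' | gzip -c > "$tmp/chr1.vcf.gz"
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n2\t200\n' | gzip -c > "$tmp/chr2.vcf.gz"

merged="$tmp/all.vcf"
first=1
for f in "$tmp/chr1.vcf.gz" "$tmp/chr2.vcf.gz"; do
    if [ "$first" -eq 1 ]; then
        # First file: keep header lines and records.
        gzip -dc "$f" > "$merged"
        first=0
    else
        # Later files: drop lines starting with '#', keep records only.
        gzip -dc "$f" | grep -v '^#' >> "$merged"
    fi
done

cat "$merged"
```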