As mentioned, if you have a .tbi file, then the file it indexes (the same name without the .tbi suffix, ending in .gz) was probably made with bgzip.
You can use htslib to check if the file is block-compressed, or use hexdump to check the first two bytes (the gzip magic number), the fourth byte (the flags, to see if the extra-field bit is set) and the 13th and 14th bytes (the identifier of the extra subfield). Note that and() in the second command is a gawk extension, not POSIX awk:
$ hexdump -s 0 -n 2 -e '8/1 "%02x""\n"' some_file.gz
1f8b
$ hexdump -s 3 -n 1 -e '8/1 "%d""\n"' some_file.gz | awk 'and($0,0x04){ print "extra header"; }'
extra header
$ hexdump -s 12 -n 2 -e '8/1 "%c""\n"' some_file.gz
BC
If the first result equals 1f8b, the second result returns extra header, and the third equals BC, then some_file.gz was probably made with bgzip.
If the first result equals 1f8b and the second does not return extra header, then it is likely just a gzip file.
The htslib tool probably does some similar check of the first bytes in the input file, to return a true or false identification.
(Bytes via: https://tools.ietf.org/html/rfc1952#page-5 and http://www.htslib.org/doc/bgzip.html)
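If you would rather wrap the three checks above into one reusable test, here is a minimal sketch as a POSIX shell function. The name is_bgzf is my own (hypothetical, not an htslib tool), and it assumes the BC subfield is the first one in the extra field, which is where the hexdump check above looks for it. It only inspects the first 16 bytes, so treat a match as "probably bgzip", not as proof.

```shell
# is_bgzf FILE: exit 0 if FILE starts with a bgzip-style gzip header, else 1.
# Hypothetical helper for illustration; it is not part of htslib.
is_bgzf() {
    # First 16 bytes as one lowercase hex string (32 characters).
    hdr=$(od -An -v -tx1 -N16 "$1" | tr -d '[:space:]')
    # Pattern, left to right:
    #   1f8b          gzip magic number (bytes 1-2)
    #   ??            compression method (byte 3)
    #   ?[4567cdef]   flag byte with the extra-field bit 0x04 set (byte 4)
    #   16 x ?        mtime, extra flags, OS, extra length (bytes 5-12)
    #   4243          'B' 'C' subfield identifier (bytes 13-14)
    #   ????          subfield length (bytes 15-16)
    case "$hdr" in
        1f8b???[4567cdef]????????????????4243????) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: is_bgzf some_file.gz && echo "probably bgzip" || echo "plain gzip or something else"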
If you have a directory of files that are all gzip-formatted and you want to make block-compressed versions, and you are using a bash shell, then you could use a for loop, using the .bgz convention:
$ for in_fn in *.vcf.gz; do out_fn="${in_fn%.*}.bgz"; echo "${out_fn}"; gunzip -c "${in_fn}" | bgzip > "${out_fn}"; tabix -p vcf "${out_fn}"; done
If you want to follow the .gz convention, you have to do some extra work:
$ for in_fn in *.vcf.gz; do tmp_fn="${in_fn%.*}.tmp.gz"; echo "${tmp_fn}"; gunzip -c "${in_fn}" | bgzip > "${tmp_fn}" && mv "${tmp_fn}" "${in_fn}" && tabix -p vcf "${in_fn}"; done
Note: This second loop is dangerous, as it will overwrite the original gzip file. I would recommend having backups, writing to a separate directory, or just using the .bgz convention.
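Along the lines of that recommendation, here is a sketch of the first loop rewritten to write into a separate directory, so the original gzip files are never touched. The function names bgz_name and recompress_to are hypothetical, and it assumes bgzip and tabix are on your PATH.

```shell
# bgz_name FILE: print FILE's base name with its last suffix replaced by .bgz,
# e.g. sample.vcf.gz -> sample.vcf.bgz (hypothetical helper).
bgz_name() {
    base=${1##*/}
    printf '%s\n' "${base%.*}.bgz"
}

# recompress_to SRC_DIR OUT_DIR: recompress every SRC_DIR/*.vcf.gz with bgzip
# into OUT_DIR and index the result. Files in SRC_DIR are only read.
recompress_to() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for in_fn in "$src_dir"/*.vcf.gz; do
        [ -e "$in_fn" ] || continue          # the glob matched nothing
        out_fn="$out_dir/$(bgz_name "$in_fn")"
        echo "$out_fn"
        # Verify the input first: the pipeline's exit status below
        # only reflects bgzip, not gunzip.
        gzip -t "$in_fn" || return 1
        gunzip -c "$in_fn" | bgzip > "$out_fn" || return 1
        tabix -p vcf "$out_fn" || return 1
    done
    return 0
}
```

For example, recompress_to . bgzipped would leave the .vcf.bgz files and their .tbi indexes in ./bgzipped.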
If you already have the tbi files, your VCFs are already compressed with BGZip, just check with file *vcf.gz. If that is the case, just rename the file.
To add to JC's point, the difference between the file output for a gzip file versus a bgzip file will be that for the latter, it will mention the presence of an extra field.
Beyond that, it is actually unusual to use bgz suffix. Many tools requiring bgzip-compressed data (to my knowledge) actually expect the normal gz suffix. If tabix indexing works then it is bgzip, otherwise it throws an error.
I'm not sure if it is unusual. I have used bgz to hint that the file is likely indexed with tabix or similar. I have seen others use this convention.
gnomAD uses it (Hover over the VCF file links here: https://gnomad.broadinstitute.org/downloads - all .vcf.bgz). It's not unusual, but not necessary either. Anything that can work with a gzipped file can work with a bgzipped file, and tools that need a bgzipped file should be equipped to error out if the file is not bgzipped. If it's just the extension that's causing OP's problem, they can rename or create soft-links.
Good to know, have not seen it myself so far.
@JC @RamRS - Thanks a lot for your suggestions. However, I have an issue. When I execute the command file *vcf.gz, I get the output like Test_t5.chr12.dose.vcf.gz: gzip compressed data, extra field. So, I think all my files are bgzip file. Am I right? So, I renamed the file extension from .gz to .bgz and used hail to import the vcf.bgz file. However, I got an error message which states that file does not conform to block zip format. May I know why does this happen despite it being in a .bgz format. Can I kindly request your help please
your files are not BGzip, are GZip, you will need to recompress them
@JC, May I know how do you say that my files aren't BGzip. The command output has mention of extra head field. Am I right?. The command produces output which is like as below
Can you also let us know how the command output should look like if its a BGzip file (because it already looks like what @RamRS and @Alex Reynolds (based on hexdump) mentioned).
Here is how it should look: