value too large for defined data type using bgzip
1
0
Entering edit mode
4.8 years ago
ricfoz ▴ 80

Hello all,

I am trying to derive a .fasta file from a large .vcf file.

In order to run bcftools' consensus tool I need to bgzip-compress the .vcf file and then index it with tabix. However, when attempting to compress the file largefile.vcf, I get the following error:

"[bgzip] Value too large for defined data type: largefile.vcf"

I tried compressing it with gzip, which worked, but when I tried to index the resulting largefile.vcf.gz, indexing failed because it is a GZIP file and not the BGZIP file that tabix needs.

Does anyone know why the bgzip tool finds the value too large while gzip does not? I need the file bgzip-compressed in order to follow my workflow.

Any help would be greatly appreciated.

Cheers.

bgzip value too large bgzip error compressing • 3.2k views
1
Entering edit mode

https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Value-too-large-for-defined-data-type

It basically means that the bgzip binary on your machine (prebuilt or compiled from source) was not compiled to handle large files. Please read the link above for a fuller explanation of the issue.

copy/pasted from GNU website:

"It means that your version of the utilities were not compiled with large file support enabled. The GNU utilities do support large files if they are compiled to do so. You may want to compile them again and make sure that large file support is enabled. This support is automatically configured by autoconf on most systems. But it is possible that on your particular system it could not determine how to do that and therefore autoconf concluded that your system did not support large files."

0
Entering edit mode

I have been going through the explanation on the GNU website for a while now. I understand the problem: there is an lseek function that reads the length of the file and changes the file offset, using an off_t data type, and I am apparently supposed to change that in my GNU utilities.

I don't know how to do that. I have been trying, and have updated and upgraded the gcc libraries on my computer, but it hasn't worked.

Do you have any idea how to make the file offset 64 bits, or how to compile my GNU utilities so they support large files?

0
Entering edit mode

What is your OS architecture? 64-bit? Try recompiling bgzip from the htslib sources; I could not find a standalone source tarball (.tar.gz) for bgzip alone.
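Since bgzip is built as part of htslib, recompiling it means rebuilding htslib. A sketch, assuming a release tarball (the version number is illustrative) and forcing 64-bit file offsets in case configure does not detect them on your system:

```shell
# Download and unpack an htslib release (version is illustrative).
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -xjf htslib-1.9.tar.bz2
cd htslib-1.9

# Configure and build; -D_FILE_OFFSET_BITS=64 forces large file support.
./configure CFLAGS="-g -O2 -D_FILE_OFFSET_BITS=64"
make
sudo make install   # installs bgzip, tabix, etc.

# Confirm the freshly built binary is the one on your PATH.
which bgzip
```

On a 64-bit system a current htslib build normally has large file support already, so simply building a recent release may be enough.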

In the meantime, try this:

    cat test.vcf | parallel --pipe --recend '' -k bgzip > test.vcf.gz

Then try to index it.

0
Entering edit mode

Hello there,

Thanks! That command at first appeared to run neatly; it took some hours and delivered a binary-looking file, but without the .vcf.gz suffix, just .vcf, and tabix then failed to index it:

    z-VirtualBox:$ tabix Originalfile.hg18.chr4.vcf
    [E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: "c"
    [E::hts_idx_push] unsorted positions
    tbx_index_build failed: Originalfile.hg18.chr4.vcf

In order to follow my workflow, I must get a .gz, tabix-indexable kind of file.

0
Entering edit mode

Rename the file to .vcf.gz and run tabix on it. Let us know if it still fails. By the way, did you not send the output to a file with the .vcf.gz extension?
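For reference, the two steps would look like this (file names are illustrative; tabix only cares that the contents are BGZF, but a .gz name avoids confusion):

```shell
# Give the BGZF output the .gz suffix it should have had.
mv test.vcf test.vcf.gz

# Index as VCF; this writes test.vcf.gz.tbi alongside the file.
tabix -p vcf test.vcf.gz
```

Note that tabix also requires the VCF to be coordinate-sorted, which is what the "unsorted positions" error above is complaining about.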

0
Entering edit mode

Hello, I did give it the .vcf.gz extension. I'll re-run the task to see how it goes and will let you know the result; thank you for the interaction. The task took about 6 hours, so I'll post back in a while.

Cheers

0
Entering edit mode

If it is taking that long, please try something else; I do not want you to experiment endlessly :). Or try with a partial VCF. Check whether the consensus-creating software accepts multiple VCFs; in that case you can break your VCF down per chromosome and pass the pieces to it. By the way, I tried the command on dbSNP chr20 (63 MB) and it worked fine. If it is not much work, you can do the following as well:

1) Break down your VCF per chromosome (there are several tools that do this)
2) Break down your reference fasta per chromosome (likewise)
3) Index all the VCFs and, if required, index your fasta files
4) In a loop, create a consensus fasta for each chromosome and merge the output at the end of the loop, or
5) Get a consensus fasta for each chromosome and cat all the fasta files.


You can write loops for steps 1, 2 and 3: break down per chromosome and then index the files. Make sure the newly created files are stored somewhere other than your reference files (fasta and vcf).
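The steps above might be sketched as the loop below. The chromosome names, file names, and the choice of bcftools/samtools for the splitting are assumptions, not requirements; it presumes whole.vcf.gz is already bgzipped and tabix-indexed and ref.fa is faidx-indexed:

```shell
for chr in chr1 chr2 chr3; do
    # 1) Extract one chromosome from the VCF and recompress with bgzip.
    bcftools view -r "$chr" whole.vcf.gz -O z -o "$chr.vcf.gz"

    # 2) Pull the matching sequence out of the reference fasta.
    samtools faidx ref.fa "$chr" > "$chr.fa"

    # 3) Index the per-chromosome VCF.
    tabix -p vcf "$chr.vcf.gz"

    # 4) Build the per-chromosome consensus.
    bcftools consensus -f "$chr.fa" "$chr.vcf.gz" > "$chr.consensus.fa"
done

# 5) Merge the per-chromosome consensus sequences.
cat chr*.consensus.fa > consensus.fa
```

Working per chromosome also keeps each intermediate file small enough that the large-file issue never comes up.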

0
Entering edit mode

That's a neat workflow suggestion, and I'll try it on some complete genomes I need to work with. But I must say that the 37 GB VCF that brought up the original problem in this post is already a single-chromosome VCF. I think that's huge for just one chromosome, but it actually is, and I'm really interested in seeing what variants it holds.

thanks again

0
Entering edit mode

Hello there,

Well, I re-tried the compression and it all seemed to go well. I have my compressed .vcf.gz file and I can view it correctly using less, so I think it's all good to go. Still, I get the same error when trying to index it with tabix.

I have already posted about that problem. Since it is a different issue from the one in this thread, the bgzip problem I had is solved thanks to your advice.

Cheers !

0
Entering edit mode

Good luck :)

0
Entering edit mode
4.8 years ago
pfs ▴ 280

I would not expect a .fasta file to be inside a .vcf file; I am guessing that your file format is wrong. Please provide a sample of this VCF file.

0
Entering edit mode

I think the OP is trying to reconstruct a fasta file from the VCF (as I understand it).

0
Entering edit mode

pfs, I know that .vcf is a different format from .fasta, and that fasta sequences don't reside within .vcf files. But using the variants in the .vcf it is possible to generate a consensus between the .vcf and the reference genome. My VCF is a variant call file downloaded from the Max Planck Institute server.
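For completeness, the standard consensus route (once bgzip works) is roughly the three commands below; ref.fa and largefile.vcf are placeholders for the actual reference and VCF:

```shell
# Compress with bgzip (BGZF), not plain gzip, so tabix can index it.
bgzip largefile.vcf            # produces largefile.vcf.gz

# Build the tabix index (.tbi).
tabix -p vcf largefile.vcf.gz

# Apply the variants to the reference to get a consensus fasta.
bcftools consensus -f ref.fa largefile.vcf.gz > consensus.fa
```

The VCF's contig names must match the fasta headers in ref.fa, or bcftools consensus will apply nothing.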