Question: Trouble indexing a .vcf.gz file
Asked 7 months ago by ricfoz (National School of Anthropology and History, Mexico City, Mexico):

Hello everyone,

I am trying to index a .vcf.gz file in order to get a FASTA consensus with bcftools.

This is the simple command I run:

tabix myFile.vcf.gz

and I get the following error:

[E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "c"
[E::hts_idx_push] unsorted positions
tbx_index_build failed: myFile.vcf.gz

I should say that the .vcf.gz file is sorted by position. The original file had the chromosome field written as a bare number; in this file I added the prefix "chr" so that it matches the FASTA reference file I need to use in a later step.

Greetings

Tags: compressing • vcf.gz • tabix • vcf

If the file is sorted by position, that's a strange error to get, and I wouldn't think the chromosome field necessarily has much to do with it. As the error suggests, does explicitly adding the -p option help? Like so:

tabix -p vcf myFile.vcf.gz

Beyond that, was it compressed using bgzip?
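One quick way to check (filename assumed; htsfile ships with htslib alongside tabix and bgzip):

$ file myFile.vcf.gz       # should at least say "gzip compressed data"; bgzip output usually also shows an "extra field"
$ htsfile myFile.vcf.gz    # distinguishes BGZF-compressed from plain gzip-compressed data explicitly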

Reply by aays, 7 months ago

I tried the -p flag, but I didn't know what to pass as the "string" option. From your suggestion I see that -p takes "vcf" as its value, so I will try that syntax.

Yes, I compressed it with bgzip, or rather with that tool driven by GNU parallel, since my compiled version of bgzip was unable to compress the 37 GB file on its own; running bgzip through parallel seemed to work fine.

Reply by ricfoz, 7 months ago

There is no easy way to check whether parallel bgzip worked on such a huge file. The output from parallel is probably not correctly bgzipped (i.e. you have an output .gz file, but it may not be valid bgzip as tabix requires). Try with a smaller VCF (as I did before posting this) and if it works, it should scale up.

Unzip the bgzipped VCF into a separate file and compare the md5sum of the old VCF with the newly uncompressed one. They should be identical (this is cumbersome because the VCF is big). On the other hand, if the bgzip compression failed entirely, the file type may tell you: use the file command on Ubuntu (file test.vcf.gz), which reports whether it is gzip-compressed data.
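A minimal sketch of that round-trip check (all filenames here are assumptions):

$ zcat myFile.vcf.gz > roundtrip.vcf     # decompress the suspect bgzip/parallel output
$ md5sum original.vcf roundtrip.vcf      # the two checksums should be identical
$ file myFile.vcf.gz                     # quick sanity check that it is gzip-compressed data at all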

Reply by cpad0112, 7 months ago

I second this - I do not trust parallel combinations with gzip or bgzip. I hope you retained the original file (the "old vcf file" that cpad mentions), OP. If bgzip did something funky, you're going to need that file to restart your processing.

Reply by Ram, 7 months ago

md5sum on a 37 GB VCF file shouldn't be cumbersome; it just takes a while.

Reply by Ram, 7 months ago

It is not the md5summing that is cumbersome; it is bgzipping a big VCF file (~27 GB) and then unzipping it again. Bgzipping such a big file with parallel takes ~6 h (as per OP), and without parallel it fails.

Reply by cpad0112, 7 months ago

I hope OP has access to good computing resources. In the meantime, I'd recommend using bgzip -c instead of plain bgzip.
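For example (hypothetical filenames), bgzip -c writes to stdout and leaves the input file untouched:

$ bgzip -c myFile.vcf > myFile.vcf.gz    # the uncompressed myFile.vcf is kept as a fallback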

Reply by Ram, 7 months ago

Hello there. As expected, I have been following your suggestions. Here is a description of my actual problem, since I think I know the answer; but still, as you say, it is hard without "good" computational resources:

I downloaded a 3.5 GB .vcf.gz file that was actually already indexed, i.e. ready to go. But, as I have come to realize is the state of the art in bioinformatics, the chromosome annotation was written as a bare number, while the UCSC .fa reference file uses the chr notation.

Having realized this, I thought: well, decompress it, awk it, and recompress it (I have been working with bgzip on smaller files and it runs neatly); that might have done it. But once decompressed, the .vcf file was actually 37 GB, and when trying to recompress it I couldn't get bgzip working. I tried parallel and it seemed OK, but then the result was unindexable, as described in the original post.
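One way to sidestep the 37 GB intermediate entirely is to stream the whole decompress/awk/recompress step; a minimal sketch, with hypothetical filenames and the simple data-line-only prefixing the OP describes:

$ zcat original.vcf.gz \
    | awk 'BEGIN{OFS="\t"} /^#/ {print; next} {$1="chr"$1; print}' \
    | bgzip -c > chr_prefixed.vcf.gz
$ tabix -p vcf chr_prefixed.vcf.gz

Nothing ever hits the disk uncompressed, and bgzip only sees a stream, so the size of the decompressed VCF stops being a disk-space problem.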

Browsing around, and thanks to your advice, I realize that I need to recompile my bgzip tool with support for larger offsets (that may take me a while), since parallel didn't get the job done as needed. Alternatively, I could split the file into several pieces and merge them later, or simply sed/awk away the chr notation in the reference .fa file; that would get the job done, but I realize it is neither an elegant nor a perfectly correct way of working.
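If you do go the other way and strip chr from the reference instead, a sed one-liner along these lines would do it (filenames assumed):

$ sed 's/^>chr/>/' ucsc_reference.fa > ucsc_reference.nochr.fa    # rewrites only the FASTA header lines

One caveat worth spot-checking afterwards: mitochondrial naming can still differ between conventions (chrM vs. MT).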

Still, I now think I might just get my hands on a faster, better computer at my university when possible.

I sincerely appreciate the advice you people have provided; for an interested beginner in these matters, this kind of interaction is the only thing that keeps us understanding and moving forward.

Reply by ricfoz, 7 months ago

and a learning experience for all of us here:)

Reply by cpad0112, 7 months ago

Thank you - I had no idea you could recompile bgzip with larger offsets (? pointer sizes maybe?). Good going!

Reply by Ram, 7 months ago

And -@ as well. To my understanding, bgzip uses 1 thread by default; maybe OP should make use of the threads if he has access to good computing resources. For future reference (the VCF is 87 MB; processor i5-6200U (4 cores with 8 threads, 6 used below), 8 GB DDR3 RAM):

$ time (bgzip -@6 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf)

real    0m1.108s
user    0m4.010s
sys 0m0.177s

$ gzip -d Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz 

$ time (bgzip Mills_and_1000G_gold_standard.indels.hg19.sites.vcf)

real    0m2.779s
user    0m2.719s
sys 0m0.060s
Reply by cpad0112, 7 months ago

Alright, I tried running the tool with the -p vcf flag and got exactly the same error. Maybe I should find a way to sort the VCF file before compressing? Thanks for the advice all the same.
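If sorting does turn out to be necessary, one option is bcftools sort, which writes bgzip-compressed output directly (filenames assumed; available in recent bcftools releases):

$ bcftools sort -O z -o sorted.vcf.gz myFile.vcf.gz
$ tabix -p vcf sorted.vcf.gz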

Reply by ricfoz, 7 months ago

Try zcat file.vcf.gz | less -S and eyeball it to check if everything looks OK. Just for starters.
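Since tabix reported the offending line as just "c", it may also be worth scanning for truncated records; a rough sketch (a proper VCF data line has at least 8 tab-separated columns, filename assumed):

$ zcat myFile.vcf.gz | awk -F'\t' '!/^#/ && NF<8 {print "line " NR ": " $0; exit}'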

Reply by Ram, 7 months ago

It does look OK. Because of the errors I thought about sorting the VCF file, but looking at it, all the positions are in order, and I haven't changed anything except adding the chr prefix to the chromosome number column. I think it's just about the file offsets permitted by the build of my bgzip tool; I'll dive into how to get that sorted out.

Reply by ricfoz, 7 months ago

Also, you say you added chromosome prefixes. Are you sure you added them everywhere, including the ##contig lines and in all VCF entries?

Reply by Ram, 7 months ago

I wrote awk code to add the 'chr' prefix to all the lines containing nucleotide information; the header lines, i.e. the lines starting with ## at the beginning of the file, were not annotated with chr.

Reply by ricfoz, 7 months ago

I guess that should be OK. I would change the headers too, but this thread says it might not be necessary: VCF files: Change Chromosome Notation

I'd still change it, though, just because I like to keep things consistent.
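If you do want the ##contig headers to match as well, a hedged awk sketch (filenames assumed) that rewrites both the contig IDs and the data lines:

$ zcat original.vcf.gz \
    | awk 'BEGIN{OFS="\t"}
           /^##contig=<ID=/ {sub(/ID=/, "ID=chr"); print; next}
           /^#/             {print; next}
                            {$1="chr"$1; print}' \
    | bgzip -c > chr_prefixed.vcf.gz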

Reply by Ram, 7 months ago

Please take some time and use the formatting bar (especially the code option) to present your post better. Formatting bar

Reply by Ram, 7 months ago