I am trying to index a .vcf.gz file in order to get a FASTA consensus. This is the simple command I give:
and I get the following error:
```
[E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: "c"
[E::hts_idx_push] unsorted positions
tbx_index_build failed: myFile.vcf.gz
```
I must say the vcf.gz file is sorted by position. The original file had the chromosome field defined just by number; in this file, I added the prefix "chr" to match the FASTA reference file I need to apply in a next step.
If the file is sorted by position, that's a strange error to get, and I wouldn't think that the chromosome field has much to do with it necessarily. Like the error says, does explicitly adding the -p option help? Like so:
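Presumably something like this, using the file name from the error message (the exact path is an assumption):

```shell
tabix -p vcf myFile.vcf.gz
```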
Beyond that, was it compressed using bgzip?
I tried with the -p flag, but I didn't know what input to use with the "string" option... As I see from your suggestion, the -p flag needs "vcf" as the string option; I will try that syntax.
Yes, I compressed it with bgzip; well, with that tool, but through GNU parallel, since my compiled version of bgzip was unable to compress the 37 GB file on its own. Running bgzip with parallel seemed to work all right.
There is no easy way to check whether parallel+bgzip worked on such a huge file. Probably the output from parallel is not correctly bgzipped (i.e. you have output.bgzip, but it may not be correctly bgzipped as per tabix requirements). Try with a smaller VCF (as I did before posting this) and if it works, it should scale up.
Unzip the bgzipped VCF into a separate file and compare the `md5sum` of the old VCF file and the newly uncompressed VCF file (from bgzip, which is cumbersome as the VCF file is big). Both should be the same. On the other hand, if the entire bgzip compression failed, the file type may be able to tell you (the `file` command in Ubuntu, as in `file test.vcf.gz`, tells you whether it is gzip-compressed data).
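To make both checks concrete, here is a sketch (file names are placeholders): an md5 round-trip, plus a look at the first four bytes, since a valid bgzip (BGZF) file starts with `1f 8b 08 04` (the gzip magic with the FEXTRA flag set), while plain gzip output typically shows `1f 8b 08 00` there.

```shell
# 1) Round-trip check: the decompressed bgzip output must match the original.
md5sum old.vcf                    # checksum of the original, uncompressed VCF
gzip -dc new.vcf.gz | md5sum      # checksum of the recompressed file's contents
# The two hashes should be identical.

# 2) Header check: bgzip (BGZF) files begin with 1f 8b 08 04.
od -A n -t x1 -N 4 new.vcf.gz
```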
I second this - I do not trust parallel combinations with gzip or bgzip. I hope you retain the original file (the "old vcf file" that cpad mentions), OP. If bgzip did something funky, you're gonna need that file to restart your processing.
md5sum on a 37G VCF file shouldn't be cumbersome, just takes a while.
It is not the md5summing that is cumbersome; it is bgzipping a big VCF file (~37 GB), followed by unzipping it. Bgzipping such a big file with parallel takes ~6 h (as per OP), and without parallel it fails.
I hope OP has access to good computing resources. In the meantime, I'd recommend using `bgzip -c` instead of plain `bgzip`, so the original file is kept.
Hello there. As expected, I have been following your suggestions. Here is a description of my actual problem, since I think I know the answer, but still, as you say, it is hard without "good" computational resources:
I got a 3.5 GB .vcf.gz file online which was actually already indexed, i.e. ready to go! But, as I've realized is the state of the art in bioinformatics, the chromosome annotation was written just with the number, while the UCSC .fa reference file has the "chr" prefix.
Having realized this, I thought: well, decompress, `awk` it, and recompress (I have been working with bgzip on smaller files and it runs neatly); that might have done it. But once decompressed, the .vcf file was actually 37 GB, and when trying to recompress it I couldn't get bgzip working. I tried parallel and it seemed OK, but then the file was unindexable, as described in the original post.
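For what it's worth, the 37 GB uncompressed intermediate can be avoided entirely by streaming: decompress, edit, and recompress in one pipeline, so only the final .vcf.gz ever hits the disk. A sketch, with hypothetical file names and assuming a tab-separated VCF whose header lines start with #:

```shell
# Stream: decompress -> add "chr" prefix to data lines -> bgzip-compress.
# No uncompressed intermediate file is written to disk.
zcat input.vcf.gz \
  | awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$1="chr"$1; print}' \
  | bgzip -c > output.chr.vcf.gz
```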
As I have been browsing, and thanks to your advice, fellas, I realize that I need to compile my bgzip tool for larger offsets (that might take me a while), since parallel didn't get the job done as needed. Alternatively, I could cut the file into various pieces and merge them later, or simply use sed/awk to suppress the "chr" notation from the reference; I mean, that would get the job done, but I realize it is neither an elegant nor a perfectly correct way of working.
Still, I think I might get my hands on a faster, better computer at my university when possible.
I sincerely appreciate the advice provided by you people; for an interested starter in these matters, only this kind of interaction is what gets us understanding and going.
and a learning experience for all of us here:)
Thank you - I had no idea you could recompile bgzip with larger offsets (? pointer sizes maybe?). Good going!
And `-@` as well. To my understanding, bgzip uses 1 thread by default; maybe OP should make use of the threads if he has access to good computing resources. FFR (the VCF is 87 MB; processor i5-6200U, 4 cores with 8 threads, 6 used below; 8 GB DDR3 RAM):
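The threaded invocation being referred to looks like this (thread count and file name are placeholders):

```shell
# Compress with 6 threads, writing to stdout so the original file is kept.
bgzip -@ 6 -c big.vcf > big.vcf.gz
```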
Alright, I tried to run the tool with the `-p vcf` flag, and I got exactly the same error. Maybe I should find a way to sort the VCF file before compressing? Thanks for the advice, still.
`zcat file.vcf.gz | less -S` and eyeball it to check if everything looks OK. Just for starters.
It does look OK. Because of the errors, I thought about sorting the VCF file, but as I look at it, all positions are in order, and I haven't changed anything but adding the "chr" prefix to the chromosome number column... I think it's just about the file offsets permitted by how my bgzip tool was compiled... I'll dive into how to get that sorted.
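Rather than eyeballing 37 GB, the sort order can be checked mechanically with `sort -c`, which exits non-zero at the first out-of-order record. A sketch (file name is a placeholder; this checks the usual chromosome-then-position order, while tabix strictly needs records grouped by chromosome with ascending positions within each):

```shell
# Check the data lines (headers skipped) for chromosome/position order.
# sort -c exits with status 0 if the input is already sorted.
zcat file.vcf.gz | grep -v '^#' | sort -k1,1 -k2,2n -c \
  && echo "sorted" || echo "NOT sorted"
```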
Also, you say you added chromosome prefixes. Are you sure you added them everywhere, including the `##contig` lines and in all VCF entries?
I wrote awk code to add the "chr" prefix to all the lines with information about the nucleotides; the headers, i.e. lines starting with ## at the beginning of the file, weren't annotated with "chr".
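For reference, the kind of awk one-liner described above might look like this (a sketch; file names are placeholders, and it assumes a tab-separated VCF where all header lines start with #):

```shell
# Prepend "chr" to the CHROM column of data lines; pass header lines through.
awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$1="chr"$1; print}' \
    input.vcf > input.chr.vcf
```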
I guess that should be OK. I would change the headers too, but this thread says it might not be necessary: VCF files: Change Chromosome Notation
I'd still change it though, just because I like to play it consistent.
Please take some time and use the formatting bar (especially the code option) to present your post better.