Trouble indexing a .vcf.gz file
1
1
Entering edit mode
4.8 years ago
ricfoz ▴ 80

Hello everyone,

I am trying to index a .vcf.gz file in order to get a fasta consensus with bcftools

this is the simple command i give:

tabix myFile.vcf.gz


and I get the next error:

[E: :get_intv] failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "c"
[E: :hts_idx_push] unsorted positions tbx_index_build failed: myFile.vcf.gz


I must say the vcf.gz file is sorted by position, and the original one had the chromosome field defined just by number, in this file, i added the preffix "chr", to match the fasta reference file i need to apply on a next step

Greetings

tabix vcf compressing vcf.gz • 11k views
0
Entering edit mode

If the file is sorted by position, that's a strange error to get, and I wouldn't think that the chromosome field has much to do with it necessarily. Like the error says, does explicitly adding the -p option help? Like so:

tabix -p vcf myFile.vcf.gz


Beyond that, was it compressed using bgzip?

0
Entering edit mode

i tried with the -p flag, but i didn't know what input to use with the "string"option ... as i see your suggestion, the flag -p needs "vcf" as string option, i will try that syntax.

Yes , i compressed it with bgzip, well, with that tool, but using GNU-parallel tool, since my compiled version of bgzip was unable to compress the 37gb file, but running bgzip with parallel seemed to work alright

1
Entering edit mode

There is no way to check if parallel bgzip is worked on such a huge file. Probably output from parallel is not correctly bgzipped (i.e you have output.bgzip, but it may not be correctly bgzipped as per tabix requirements). Try with a smaller vcf (as I did before posting it) and if it works, it should scale up.

Unzip the bgzipped vcf in to a separate file and compare the md5sum of old vcf file and newly uncompressed vcf file (from bgzip - which is cumbersome as the vcf file is big). Both should be same. On the other hand, if entire bgzip compression failed, file type (use file command in ubuntu- file test.vcf.gz -- tells you if it is gzip compressed data) may be able to tell you.

0
Entering edit mode

I second this - I do not trust parallel combinations with gzip or bgzip. I hope you retain the original file (the "old vcf file" that cpad mentions), OP. If bgzip did something funky, you're gonna need that file to restart your processing.

0
Entering edit mode

md5sum on a 37G VCF file shouldn't be cumbersome, just takes a while.

0
Entering edit mode

It is not md5summing that is cumbersome. It is with bgzipping a big vcf file (~27 gb), followed by unzipping it. For bgzipping with parallel, it takes ~6h such a big file (as per OP) and without parallel, it fails.

0
Entering edit mode

I hope OP has access to good computing resources. In the meantime, I'd recommend using bgzip -c instead of plain bgzip.

1
Entering edit mode

Hello there, as expected, i have been following your suggestions, here is a description of my actual problem, since i think i know the answer, but still, as you say, it is hard with no "good" computational resources:

I got online a 3.5gb.vcf.gz file which was actually already indexed, say: Ready to go!, but as i've realized that is the state of the art on bioinformatics, the chromosome anotation was written just with number, and the UCSC .fa reference file has got the chr annotation.

Having realized this, i thought well ... decompres, awk it, and recompress ...(i have been working with bgzip in smaller files and it runs neatly) ... that may have done it .. but when decompressing, the .vcf file was actually 37gb, and when trying to bgzip recompress it i couldn't get bgzip working., tried parallel and seemed ok, but then it was unindexable, as described on the original post.

As i have been browsing, and thanks to your advice fellas, i realize that i need to get to compile my bgzip tool for larger offset (that might last a while for me), since parallel, didn't get the job done as needed. Now, i can as well "cut" the file in various files and merge later, or simply sed/awk supressthe reference .fa file chr notation... i mean, to get the job done, but i realize that that is niether an elegant nor perfectly correct way of working.

still, now i think that i just might get hands on a faster better computer at my university when possible.

I sincerely apreciate the advice provided by you people, since for an interested starter on these matters, only this interaction is what get us understanding and going.

1
Entering edit mode

and a learning experience for all of us here:)

0
Entering edit mode

Thank you - I had no idea you could recompile bgzip with larger offsets (? pointer sizes maybe?). Good going!

0
Entering edit mode

and -@ also. To my understanding, bgzip uses 1 thread by default. May be OP should utilize the threads if he has access to good computing resources. FFR (vcf is 87M and Processor i5-6200U (4cores with 8 threads- 6 used below), 8GB DDR3 RAM ),

$time (bgzip -@6 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf) real 0m1.108s user 0m4.010s sys 0m0.177s$ gzip -d Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

\$ time (bgzip Mills_and_1000G_gold_standard.indels.hg19.sites.vcf)

real    0m2.779s
user    0m2.719s
sys 0m0.060s

0
Entering edit mode

alright, i tried to run the tool with the -p vcf flag, and i got exactly the same error. Maybe should i find a way to sort the vcf file before compressing? .. thanks for the advice still.

0
Entering edit mode

Try zcat file.vcf.gz | less -S and eyeball it to check if everything looks OK. Just for starters.

0
Entering edit mode

it does look ok, because of the errors, i thought about sorting the vcf file, but as i look at it, all positions are in order, and i haven't change anything but the chr preffix to the chromosome number column... i think it's just about the offset of files permitted by the compiling of my bgzip tool ... i'll dive into how to get that sorted.

0
Entering edit mode

Also, you say you added chromosome prefixes. Are you sure you added them everywhere, including the ##contig lines and in all VCF entries?

0
Entering edit mode

i wrote code with awk to add the 'chr' prefix to all the lines with information about the nucleotides, headers, say lines starting with ## at the beggining of the file weren't annotated with chr

0
Entering edit mode

I guess that should be OK. I would change the headers too, but this thread says it might not be necessary: VCF files: Change Chromosome Notation

I'd still change it though, just because I like to play it consistent.

0
Entering edit mode

Please take some time and use the formatting bar (especially the code option) to present your post better.

0
Entering edit mode
10 months ago
san • 0

I encountered same problem as you. The answer is DO NOT use awk to rename chromosomes, use tools in bcftools. Prepare a blank separated txt file"rename.txt", the first column is the original name and the second column is your new name. Command is bcftools annotate --rename.txt test.gz | bgzip -c > test_add_chr.gz

0
Entering edit mode

Your approach is accurate but the command is incorrect. See the linked thread from my comment for the correct command.