Editing headers from a vcf.gz file
1
0
Entering edit mode
4.8 years ago
ricfoz ▴ 80

Hello everyone

Do anyone know any tool or script to eddit the header and ad a "chr" prefix to the chromosome name header directly in a .vcf.gz file?? ... i don't want even to uncompress it, since i have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.

I am working with some genomic files in the .bam and .vcf format, i tried to retrieve some genic regions, and i already sorted that out, beggining with a large one human chromosome .bam file.

When working with that whole chromosome file, i realized that the header had the name of the chromosome without the preffix "chr", only the number, and that gave me a hard time when trying to run mpileup just in the middle of the workflow.

Now, following my working path, i got a big one chromosome .vcf.gz file, which i indexed with tabix in order to retrieve the desired region with ease, but i get the same problem as before, the name of the chromosome is lacking the "chr" prefix, which happens to be not compatible with the .fasta reference file it needs to run the command, just now beggining with a .vcf.gz file.

I thought about going back from .vcf.gz to bam, on which i already know the syntax to eddit the headers, but that means doing around four file transformations. That will spend lots of time.

Thanks in advance for any orientation.

0
Entering edit mode

i don't want even to uncompress it,

without uncompressing it ? no, you cannot.

0
Entering edit mode

prefix to the chromosome name header directly in a .vcf.gz file?? ... i

otherwise, you can use sed

gunzip -c input.vcf.gz | sed 's/^##contig=<ID=/##contig=<ID=chr/' | bgzip > output.vcf.gz && tabix -f -p vcf output.vcf.gz

1
Entering edit mode

I don't think the OP realises that reheadering a VCF will do nothing. Like SAM, the chromosome name in plain text string format in every row. It seems they're looking for a VCF equivalent of samtools reheader -i. That trick would only work for a BCF, and even then, it would break variants in breakend notation.

0
Entering edit mode

Thanks for the observation, yes, as you said, headers had nothing to do with the problem, i had to re-name the chromosome column, which is the first one on the VCF file, i solved the problem handling the text in the file using awk.

0
Entering edit mode

alright, i guessed i couldn't do it without decompressing, but i was just describing the trouble. I am trying your script right now, as i read it, it should give an output in the .vcf.gz format, going through an intermediate decompressing... during which preffix chr should be added, right?

... i used that sed command before, in a bam file, just chatting a bit in order to get the grip on what is being done to the files.

greetins

0
Entering edit mode

There was another post on Biostars on this topic too: VCF files: Change Chromosome Notation

You could also avoid decompressing the vcf.gz by just using bcftools:

bcftools view My.vcf.gz | '*script to modify header*' | bcftools view -oz


That will input vcf.gz and output vcf.gz

1
Entering edit mode

Still doesn't avoid decompressing. There is literally no way to read a gzip file without decompressing from the first byte.

EDIT: Unless I'm mistaken on the compression algorithm used by bgzip and it's not LZ77-based.

0
Entering edit mode

You're right but I meant explicitly decompressing with gunzip. gunzip -c is essentially the same as bcftools view though

0
Entering edit mode

In this context, yeah. I don't know why OP wants to avoid decompression. Maybe avoid a decompressed intermediate file by streaming it?

1
Entering edit mode

Decompressing and recompressing with gzip is slow as it is typically limited to one core (although see pigz and pbgzip). If the VCF were block compressed it would technically be possible to decompress just the block(s) containing the VCF header, write it out with the edited header, and then directly concatenate the subsequent blocks without any decompression/recompression of those blocks (this is the trick that samtools reheader uses), but AFAIK no tool currently implements this for VCF.

Edit: bcftools looks to have a reheader command - I don't know if it uses the speed trick though

0
Entering edit mode

That is useful information! Thank you!

0
Entering edit mode

3
Entering edit mode
4.8 years ago
d-cameron ★ 2.7k

Do anyone know any tool or script to eddit the header and ad a "chr" prefix to the chromosome name header directly in a .vcf.gz file?? ... i don't want even to uncompress it, since i have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.

BAM is to SAM as BCF is to VCF. That is, like SAM, VCF is a text file format. Like SAM, it stores chromosome names in every line thus changing the header does not change the chromosome name in each line of the file. To do a global replacement, your variant calls would need to be in BCF file format (.vcf.gz is just like having a .sam.gz file) but even then, that would not change variants in breakend notation as chromosome names are referenced in the string ALT field.

Do anyone know any tool or script to eddit the header and ad a "chr" prefix to the chromosome name header directly in a .vcf.gz file?? ... i don't want even to uncompress it, since i have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.

Is there a reason you can't decompress, rename, then recompress directly within a unix pipe? The compression is block based and should be streamable.

0
Entering edit mode

hello there, well, as i have been searching, the reason i can't compress big files (as a one chromosome 9GB one), is because i am currently working on a 32 bit computer. I can have acces to a 64bit one, but still i'm reading around to get to make a script which adds the "chr" prefix to all the lines of the vcf file i'm using, since the number of chromosome is written as "6" all along.