Question

Editing headers from a vcf.gz file

0

Entering edit mode

6.4 years ago

ricfoz ▴ 100

Hello everyone

Do anyone know any tool or script to edit the header and add a "chr" prefix to the chromosome name header directly in a .vcf.gz file? I don't want even to uncompress it, since I have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.

I am working with some genomic files in the .bam and .vcf format, I tried to retrieve some genic regions, and I already sorted that out, beginning with a large one human chromosome .bam file.

When working with that whole chromosome file, I realized that the header had the name of the chromosome without the prefix "chr", only the number, and that gave me a hard time when trying to run mpileup just in the middle of the workflow.

Now, following my working path, I got a big one chromosome .vcf.gz file, which I indexed with tabix in order to retrieve the desired region with ease, but I get the same problem as before, the name of the chromosome is lacking the "chr" prefix, which happens to be not compatible with the .fasta reference file it needs to run the command, just now beginning with a .vcf.gz file.

I thought about going back from .vcf.gz to bam, on which I already know the syntax to edit the headers, but that means doing around four file transformations. That will spend lots of time.

Thanks in advance for any orientation.

vcf • 7.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 6.4 years ago by ricfoz ▴ 100

0

Entering edit mode

I don't want even to uncompress it,

without uncompressing it ? no, you cannot.

ADD REPLY • link updated 13 months ago by Ram 43k • written 6.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

prefix to the chromosome name header directly in a .vcf.gz file

otherwise, you can use sed

gunzip -c input.vcf.gz | sed 's/^##contig=<ID=/##contig=<ID=chr/' | bgzip > output.vcf.gz && tabix -f -p vcf output.vcf.gz

ADD REPLY • link updated 13 months ago by Ram 43k • written 6.4 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

I don't think the OP realises that reheadering a VCF will do nothing. Like SAM, the chromosome name in plain text string format in every row. It seems they're looking for a VCF equivalent of samtools reheader -i. That trick would only work for a BCF, and even then, it would break variants in breakend notation.

ADD REPLY • link 6.4 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Thanks for the observation, yes, as you said, headers had nothing to do with the problem, i had to re-name the chromosome column, which is the first one on the VCF file, i solved the problem handling the text in the file using awk.

ADD REPLY • link 6.4 years ago by ricfoz ▴ 100

0

Entering edit mode

alright, I guessed I couldn't do it without decompressing, but I was just describing the trouble. I am trying your script right now, as I read it, it should give an output in the .vcf.gz format, going through an intermediate decompressing... during which prefix chr should be added, right?

... I used that sed command before, in a bam file, just chatting a bit in order to get the grip on what is being done to the files.

greetings

ADD REPLY • link updated 13 months ago by Ram 43k • written 6.4 years ago by ricfoz ▴ 100

0

Entering edit mode

There was another post on Biostars on this topic too: VCF files: Change Chromosome Notation

You could also avoid decompressing the vcf.gz by just using bcftools:

bcftools view My.vcf.gz | '*script to modify header*' | bcftools view -oz

That will input vcf.gz and output vcf.gz

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Still doesn't avoid decompressing. There is literally no way to read a gzip file without decompressing from the first byte.

EDIT: Unless I'm mistaken on the compression algorithm used by bgzip and it's not LZ77-based.

ADD REPLY • link 6.4 years ago by Ram 43k

0

Entering edit mode

You're right but I meant explicitly decompressing with gunzip. gunzip -c is essentially the same as bcftools view though

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

In this context, yeah. I don't know why OP wants to avoid decompression. Maybe avoid a decompressed intermediate file by streaming it?

ADD REPLY • link 6.4 years ago by Ram 43k

1

Entering edit mode

Decompressing and recompressing with gzip is slow as it is typically limited to one core (although see pigz and pbgzip). If the VCF were block compressed it would technically be possible to decompress just the block(s) containing the VCF header, write it out with the edited header, and then directly concatenate the subsequent blocks without any decompression/recompression of those blocks (this is the trick that samtools reheader uses), but AFAIK no tool currently implements this for VCF.

Edit: bcftools looks to have a reheader command - I don't know if it uses the speed trick though

ADD REPLY • link 6.4 years ago by Len Trigg ★ 1.6k

0

Entering edit mode

That is useful information! Thank you!

ADD REPLY • link 6.4 years ago by Ram 43k

0

Entering edit mode

( posted as answer )

ADD REPLY • link 6.4 years ago by d-cameron ★ 2.9k

Ram · Answer 1 · 2017-11-26

Do anyone know any tool or script to edit the header and add a "chr" prefix to the chromosome name header directly in a .vcf.gz file? I don't want even to uncompress it, since I have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.

BAM is to SAM as BCF is to VCF. That is, like SAM, VCF is a text file format. Like SAM, it stores chromosome names in every line thus changing the header does not change the chromosome name in each line of the file. To do a global replacement, your variant calls would need to be in BCF file format (.vcf.gz is just like having a .sam.gz file) but even then, that would not change variants in breakend notation as chromosome names are referenced in the string ALT field.

Do anyone know any tool or script to edit the header and add a "chr" prefix to the chromosome name header directly in a .vcf.gz file? I don't want even to uncompress it, since I have the experience that a huge .vcf can't be compressed by bgzip, at least in my computer.

Is there a reason you can't decompress, rename, then recompress directly within a unix pipe? The compression is block based and should be streamable.