Question: Editing headers from a vcf.gz file
23 months ago by
ricfoz30
National School of Antropology and History, Mexico city, Mexico
ricfoz30 wrote:

Hello everyone

Does anyone know of a tool or script to edit the header and add a "chr" prefix to the chromosome name directly in a .vcf.gz file? I don't even want to uncompress it, since in my experience a huge .vcf can't be compressed by bgzip, at least on my computer.

I am working with some genomic files in .bam and .vcf format. I tried to retrieve some genic regions, and I already sorted that out, beginning with a large single-chromosome human .bam file.

When working with that whole-chromosome file, I realized that the header had the chromosome name without the "chr" prefix, only the number, and that gave me a hard time when trying to run mpileup in the middle of the workflow.

Now, further along my workflow, I have a big single-chromosome .vcf.gz file, which I indexed with tabix in order to retrieve the desired region with ease. But I have the same problem as before: the chromosome name lacks the "chr" prefix, which makes it incompatible with the .fasta reference file the command needs, this time starting from a .vcf.gz file.

I thought about going back from .vcf.gz to BAM, for which I already know the syntax to edit the headers, but that means doing around four file conversions, which would take a lot of time.

Thanks in advance for any orientation.

vcf.gz headers • 2.1k views
ADD COMMENTlink modified 23 months ago by d-cameron2.1k • written 23 months ago by ricfoz30

i don't want even to uncompress it,

Without uncompressing it? No, you cannot.

ADD REPLYlink written 23 months ago by Pierre Lindenbaum123k

prefix to the chromosome name header directly in a .vcf.gz file?? ... i

otherwise, you can use sed

gunzip -c input.vcf.gz | sed 's/^##contig=<ID=/##contig=<ID=chr/' | bgzip > output.vcf.gz && tabix -f -p vcf output.vcf.gz
ADD REPLYlink written 23 months ago by Pierre Lindenbaum123k

I don't think the OP realises that reheadering a VCF will do nothing. As in SAM, the chromosome name is stored as a plain-text string in every row. It seems they're looking for a VCF equivalent of samtools reheader -i. That trick would only work for a BCF, and even then, it would break variants in breakend notation.

ADD REPLYlink written 23 months ago by d-cameron2.1k

Thanks for the observation. Yes, as you said, the headers had nothing to do with the problem. I had to rename the chromosome column, which is the first one in the VCF file. I solved the problem by editing the text in the file with awk.

ADD REPLYlink written 22 months ago by ricfoz30
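
The OP's actual awk script was not posted. A minimal sketch of what such an approach might look like is below, assuming a tab-delimited VCF whose chromosome names have no "chr" prefix yet. The file names and the toy records are made up for illustration.

```shell
# Build a tiny VCF to try the idea on (toy data, not from the thread).
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '##contig=<ID=6,length=171115067>' \
  > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf '6\t100\t.\tA\tG\t.\tPASS\t.\n' >> toy.vcf
printf '6\t200\t.\tC\tT\t.\tPASS\t.\n' >> toy.vcf

# Header: prefix the ID in ##contig lines; other header lines pass through.
# Records: prefix column 1 (CHROM), keeping tabs as the separator.
awk 'BEGIN { FS = OFS = "\t" }
     /^##contig=<ID=/ { sub(/^##contig=<ID=/, "##contig=<ID=chr"); print; next }
     /^#/             { print; next }
     { $1 = "chr" $1; print }' toy.vcf > toy.chr.vcf

cat toy.chr.vcf
```

This sketch does not check whether a name already starts with "chr", and it leaves ALT fields untouched, so breakend records would need extra handling.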

Alright, I guessed I couldn't do it without decompressing, but I was just describing the trouble. I am trying your script right now. As I read it, it should produce output in .vcf.gz format, going through an intermediate decompression during which the chr prefix is added, right?

I used that sed command before on a BAM file; I'm just chatting a bit to get a grip on what is being done to the files.

Greetings

ADD REPLYlink written 23 months ago by ricfoz30

There was another post on Biostars on this topic too: VCF files: Change Chromosome Notation

You could also avoid decompressing the vcf.gz by just using bcftools:

bcftools view My.vcf.gz | '*script to modify header*' | bcftools view -Oz -o output.vcf.gz

That will input vcf.gz and output vcf.gz

ADD REPLYlink written 23 months ago by Kevin Blighe49k

Still doesn't avoid decompressing. There is literally no way to read a gzip file without decompressing from the first byte.

EDIT: Unless I'm mistaken on the compression algorithm used by bgzip and it's not LZ77-based.

ADD REPLYlink modified 23 months ago • written 23 months ago by RamRS24k

You're right, but I meant explicitly decompressing with gunzip. gunzip -c is essentially the same as bcftools view, though.

ADD REPLYlink written 23 months ago by Kevin Blighe49k

In this context, yeah. I don't know why OP wants to avoid decompression. Maybe avoid a decompressed intermediate file by streaming it?

ADD REPLYlink written 23 months ago by RamRS24k

Decompressing and recompressing with gzip is slow as it is typically limited to one core (although see pigz and pbgzip). If the VCF were block compressed it would technically be possible to decompress just the block(s) containing the VCF header, write it out with the edited header, and then directly concatenate the subsequent blocks without any decompression/recompression of those blocks (this is the trick that samtools reheader uses), but AFAIK no tool currently implements this for VCF.

Edit: bcftools looks to have a reheader command - I don't know if it uses the speed trick though

ADD REPLYlink modified 23 months ago • written 23 months ago by Len Trigg1.3k
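
The concatenation idea described above rests on the fact that a gzip stream may consist of several members joined end to end, and a BGZF file is a series of such members. The sketch below illustrates only that principle with plain gzip and toy files (all file names are made up). It is not how samtools or bcftools implement reheadering, and a real tool would presumably also have to handle a header that does not end exactly on a block boundary.

```shell
# Toy header and body; real VCFs would be far larger.
printf '##fileformat=VCFv4.2\n##contig=<ID=6>\n' > old_header.txt
printf '6\t100\t.\tA\tG\t.\tPASS\t.\n' > body.txt

# Compress header and body as separate gzip members, then join them.
gzip -c old_header.txt > header.gz
gzip -c body.txt > body.gz
cat header.gz body.gz > joined.vcf.gz

# Swap in an edited header: only the small header member is recompressed,
# while body.gz is reused byte for byte.
sed 's/^##contig=<ID=/##contig=<ID=chr/' old_header.txt | gzip -c > new_header.gz
cat new_header.gz body.gz > reheadered.vcf.gz

# Decompression reads straight through both members.
gzip -dc reheadered.vcf.gz
```

Note that the record line still says "6" afterwards, which is exactly the limitation of header-only edits on VCF raised elsewhere in this thread.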

That is useful information! Thank you!

ADD REPLYlink written 23 months ago by RamRS24k

( posted as answer )

ADD REPLYlink modified 23 months ago • written 23 months ago by d-cameron2.1k
23 months ago by
d-cameron2.1k
Australia
d-cameron2.1k wrote:

Does anyone know of a tool or script to edit the header and add a "chr" prefix to the chromosome name directly in a .vcf.gz file? I don't even want to uncompress it, since in my experience a huge .vcf can't be compressed by bgzip, at least on my computer.

BAM is to SAM as BCF is to VCF. That is, like SAM, VCF is a text file format. Like SAM, it stores the chromosome name in every line, so changing the header does not change the chromosome name in each line of the file. To do a global replacement by reheadering, your variant calls would need to be in BCF format (.vcf.gz is just like having a .sam.gz file), but even then, that would not change variants in breakend notation, as chromosome names are referenced inside the ALT string field.
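
To make the breakend point concrete, here is a rough sketch of a text-based rename that also rewrites mate positions such as "]17:198982]" inside ALT. The helper name fix_alt, the file names, and the toy records are hypothetical. Mates given as angle-bracketed assembly contigs are deliberately left alone, and header ##contig lines are not handled here.

```shell
# Two toy breakend records whose ALT fields name the mate chromosome.
printf '##fileformat=VCFv4.2\n' > bnd.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> bnd.vcf
printf '2\t321681\tbnd_W\tG\tG]17:198982]\t6\tPASS\tSVTYPE=BND\n' >> bnd.vcf
printf '17\t198982\tbnd_Y\tA\tA]2:321681]\t6\tPASS\tSVTYPE=BND\n' >> bnd.vcf

# fix_alt inserts "chr" right after the opening bracket of every
# bracket-delimited "name:position" mate found in an ALT string.
awk 'function fix_alt(alt,    out) {
         out = ""
         while (match(alt, /(\[|\])[^:<]+:[0-9]+(\[|\])/)) {
             out = out substr(alt, 1, RSTART) "chr" substr(alt, RSTART + 1, RLENGTH - 1)
             alt = substr(alt, RSTART + RLENGTH)
         }
         return out alt
     }
     BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }
     { $1 = "chr" $1; $5 = fix_alt($5); print }' bnd.vcf > bnd.chr.vcf

cat bnd.chr.vcf
```

Ordinary SNV or indel ALT values contain no brackets, so they pass through fix_alt unchanged.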

Does anyone know of a tool or script to edit the header and add a "chr" prefix to the chromosome name directly in a .vcf.gz file? I don't even want to uncompress it, since in my experience a huge .vcf can't be compressed by bgzip, at least on my computer.

Is there a reason you can't decompress, rename, then recompress directly within a unix pipe? The compression is block based and should be streamable.

ADD COMMENTlink written 23 months ago by d-cameron2.1k
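
The decompress, rename, recompress pipe suggested above could look roughly like the following. This is a sketch on toy data with made-up file names. It falls back to plain gzip when bgzip is not installed, in which case the output would not be indexable with tabix.

```shell
# Toy compressed input (plain gzip here so the sketch runs anywhere).
{
  printf '##fileformat=VCFv4.2\n##contig=<ID=6>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '6\t100\t.\tA\tG\t.\tPASS\t.\n'
} | gzip -c > toy.in.vcf.gz

# Prefer bgzip when it is installed (tabix needs BGZF); otherwise use gzip.
if command -v bgzip >/dev/null 2>&1; then compress=bgzip; else compress=gzip; fi

# Decompress, rename, and recompress in one stream, so no uncompressed
# intermediate file is written to disk.
# First sed expression: ##contig header lines. Second: every non-header line.
gzip -dc toy.in.vcf.gz \
  | sed -e 's/^##contig=<ID=/##contig=<ID=chr/' -e '/^#/!s/^/chr/' \
  | "$compress" -c > toy.out.vcf.gz

gzip -dc toy.out.vcf.gz
```

Because the data is processed as a stream, memory use should stay small regardless of file size, though the whole file is still decompressed and recompressed once.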

Hello there. Well, from what I have been reading, the reason I can't compress big files (such as a 9 GB single-chromosome one) is that I am currently working on a 32-bit computer. I can get access to a 64-bit one, but I'm still reading around to write a script that adds the "chr" prefix to every line of the VCF file I'm using, since the chromosome is written as "6" throughout.

ADD REPLYlink written 23 months ago by ricfoz30