Detecting chromosome notation in VCF files
6 weeks ago

Hi,

I recently ran into an issue where a pipeline I wrote did not work on a new VCF file. As it turns out, the problem was simply that the VCF file used "chr7" instead of just "7" for chromosome notation, which confused tabix.

Is there any header in vcf files that indicates what annotation is used? I could convert them to a common format but that would take a lot longer than changing the command string for tabix.

Thanks


After indexing the bgzipped VCF with tabix, tabix -l <input.vcf.gz> should print the chromosome names.

6 weeks ago

is there any header in vcf files that indicates what annotation is used?

bcftools view --header-only in.vcf | grep "##contig"


or

bcftools index -s indexed.vcf.gz | cut -f1
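Either command gives you a list of contig names, so the prefix check itself can be scripted. Below is a minimal sketch, not a definitive implementation: chrom_style and region_for are hypothetical helper names, and the check only looks at primary chromosomes (1-22, X, Y, M/MT) so that extra contigs such as GL000191.1 do not skew the result.

```shell
#!/usr/bin/env bash

# Hypothetical helper: classify contig names read from stdin.
# Only primary chromosomes are considered; other contigs are ignored.
# Prints "chr", "plain", "mixed", or "unknown".
chrom_style() {
    awk '
        /^chr([0-9]+|[XYM]|MT)$/ { prefixed++; next }
        /^([0-9]+|[XYM]|MT)$/    { bare++ }
        END {
            if (prefixed && bare) print "mixed"
            else if (prefixed)    print "chr"
            else if (bare)        print "plain"
            else                  print "unknown"
        }'
}

# Hypothetical helper: rewrite a region string (e.g. 7:100-200) to match
# the given style. It does not translate chrM <-> MT.
region_for() {
    local style=$1
    local region=${2#chr}    # strip an existing "chr" prefix, if any
    if [ "$style" = "chr" ]; then
        printf 'chr%s\n' "$region"
    else
        printf '%s\n' "$region"
    fi
}

# Example usage (assumes input.vcf.gz is bgzipped and tabix-indexed):
#   style=$(tabix -l input.vcf.gz | chrom_style)
#   tabix input.vcf.gz "$(region_for "$style" 7:100-200)"
```

If the style comes back as "mixed" or "unknown", it is probably safer to stop and inspect the file than to guess.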

6 weeks ago

Keep in mind that different chromosome names don't mean the variants are annotated differently; they mean the variants were called against a different reference genome. In particular, for the human genome, note that hg19 uses the "chr" prefix and GRCh37 does not, but hg38 and most GRCh38 distributions do (Ensembl's GRCh38 files being a notable exception).

As hg19 and GRCh37 have been used extensively over the years, people tend to assume that they can be used interchangeably simply by adding or removing the "chr" prefix, at least when focusing on autosomes and sex chromosomes only. It is an assumption, though, and should always be treated as one.

I won't claim that I've never relied on this assumption (in fact I have used it a lot), but if you go for it you should do so with caution. The best advice will always be to confirm which reference was used, not just whether the reference carries a particular prefix. If that's what you're really interested in, look for VCF header lines that record the commands used to generate the file, as those commands usually contain the path to the reference file.
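As a rough sketch of that last point, the header can be filtered for lines that tend to name the reference. reference_hints is a hypothetical helper name, and the pattern is an assumption about what callers usually write: it picks up ##reference= and ##assembly= lines plus any header key containing "command" (for example ##bcftools_viewCommand or ##GATKCommandLine). Not every VCF has these lines, so empty output proves nothing.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read a VCF header on stdin and keep only the lines
# that commonly mention the reference genome or the generating command.
reference_hints() {
    grep -iE '^##(reference|assembly)=|^##[^=]*command[^=]*='
}

# Example usage:
#   bcftools view --header-only in.vcf | reference_hints
```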