Issues with Chromosome Encoding and VCF Annotation in dbSNP Alpha Release
3 months ago
Fernando • 0

Hello, Biostars Community,

I am working on creating a custom database of variants using the VCF from the latest dbSNP alpha release. I have encountered a couple of issues that I'm hoping someone can help me resolve.

Firstly, the chromosome encoding uses RefSeq IDs (e.g., NC_000007.12) instead of the typical chromosome notation (e.g., chr1, chr2, etc.). I've managed to map each RefSeq code to its corresponding chromosome. As a first step for simplicity, I've eliminated the unplaced scaffolds (e.g., NT_113901.1 unplaced-scaffold) using the following command:

zcat freq.vcf.gz | grep -E '^#|^NC_' | gzip > freq_only_NC.vcf.gz

Next, I attempted to use bcftools annotate --rename-chrs to change the encoding to the standard chromosome notation:

bcftools annotate --rename-chrs refseq_to_main_chr_mod.txt -o chr_freq_only_NC.vcf.gz -Oz freq_only_NC.vcf.gz
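
For reference, --rename-chrs expects a file of "old_name new_name" pairs, one per line, separated by whitespace. The sketch below is one way such a file could be generated rather than typed by hand. The function name refseq_to_chr is hypothetical, and it assumes human RefSeq accessions where NC_000001 to NC_000022 are the autosomes, NC_000023 is X, NC_000024 is Y, and NC_012920 is the mitochondrial genome.

```shell
# Hypothetical helper: read RefSeq accessions (one per line) on stdin and
# print "accession<TAB>chr name" pairs suitable for --rename-chrs.
refseq_to_chr() {
  awk '
    /^NC_0000[0-9][0-9]\./ {
      n = substr($1, 8, 2) + 0            # last two digits of the accession
      if (n >= 1 && n <= 22) name = "chr" n
      else if (n == 23)      name = "chrX"
      else if (n == 24)      name = "chrY"
      else next                           # not a primary chromosome
      printf "%s\t%s\n", $1, name
    }
    /^NC_012920\./ { printf "%s\tchrM\n", $1 }   # mitochondrial genome
  '
}

# Possible usage with the filtered file from above (not run here):
#   zcat freq_only_NC.vcf.gz | grep -v '^#' | cut -f1 | uniq \
#     | refseq_to_chr > refseq_to_main_chr_mod.txt
```

Writing the separator with printf guarantees a real tab character between the two columns, which avoids whitespace problems introduced by editors.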

However, I received the following warning:

[W::vcf_parse] Contig 'NC_000007.12' is not defined in the header. (Quick workaround: index the file with tabix.)

Upon trying to create an index for this new VCF, I encountered another error:

tabix -p vcf freq_only_NC.vcf.gz
[E::hts_idx_push] Invalid record on sequence #2: end 1 < begin 2040518
tbx_index_build failed: freq_only_NC.vcf.gz

I am puzzled by this error since the VCFs only list one position and not a range with a beginning and end. Could someone please assist me in understanding and resolving these issues?

Any insights or advice would be greatly appreciated.

3 months ago

NC_000007.12 is not defined in the header

means that the VCF header should contain a line with the following syntax:

##contig=<ID=NC_000007.12>
If not, you should declare the correct chromosomes in the header, using bcftools reheader --fai /path/to/reference.fa.fai in.vcf.gz
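
To see which contigs are affected before rewriting the header, something like the following could help. It is only a sketch: the function name undeclared_contigs is made up for illustration, and it assumes uncompressed VCF text on stdin with contig declarations of the form ##contig=<ID=...>.

```shell
# Hypothetical helper: list contigs used in the VCF body that have no
# matching ##contig=<ID=...> line in the header.
undeclared_contigs() {
  awk -F'\t' '
    /^##contig=<ID=/ {
      id = $0
      sub(/^##contig=<ID=/, "", id)   # drop the prefix
      sub(/[,>].*$/, "", id)          # drop everything after the ID
      declared[id] = 1
      next
    }
    /^#/ { next }                     # other header lines
    !($1 in declared) && !($1 in seen) { seen[$1] = 1; print $1 }
  '
}

# Possible usage (not run here):
#   zcat freq_only_NC.vcf.gz | undeclared_contigs
```

An empty output would mean every contig in the body is declared, so the warning would have to come from somewhere else.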

Thanks! In the end, it was a problem with the tabs in the file!
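
For anyone hitting the same thing, a quick check for broken tab delimiters might look like the sketch below. The thread does not say which file had the tab problem, so this assumes it was the VCF body; the function name check_vcf_tabs is hypothetical. It flags data lines with fewer than the 8 mandatory tab-separated columns or a non-numeric POS field.

```shell
# Hypothetical helper: report VCF data lines (read from stdin) whose
# tab-separated layout looks wrong. Exits with status 1 if any are found.
check_vcf_tabs() {
  awk -F'\t' '
    /^#/ { next }                              # skip header lines
    NF < 8 || $2 !~ /^[0-9]+$/ {
      printf "line %d: %d tab-separated fields, POS=\"%s\"\n", NR, NF, $2
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Possible usage (not run here):
#   zcat freq_only_NC.vcf.gz | check_vcf_tabs | head
```

A line whose columns are separated by spaces instead of tabs is reported with a single field and an empty POS, which is the kind of record an indexer cannot place correctly.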


