Hello, Biostars Community,
I am working on creating a custom database of variants using the VCF from the latest dbSNP alpha release available at ftp.ncbi.nih.gov/snp/population_frequency/latest_release/. I have encountered a couple of issues that I'm hoping someone might help me resolve.
First, the chromosome encoding uses RefSeq IDs (e.g., NC_000007.12) instead of the typical chromosome notation (chr1, chr2, etc.). I've managed to map each RefSeq accession to its corresponding chromosome. As a first step, for simplicity, I've removed the unplaced scaffolds (e.g., NT_113901.1 unplaced-scaffold) using the following command:
zcat freq.vcf.gz | grep -E '^#|^NC_' | gzip > freq_only_NC.vcf.gz
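For context, the RefSeq-to-chromosome mapping mentioned above can be sketched as a two-column file of "old_name new_name" pairs, one per line, which is the layout bcftools annotate --rename-chrs reads. This is only a rough sketch: contigs.txt and refseq_to_chr_sketch.txt are hypothetical file names, the accession versions are illustrative, and it assumes the usual human convention that NC_000001 to NC_000022 are the autosomes, NC_000023 is X, NC_000024 is Y, and NC_012920 is the mitochondrion.

```shell
# Hypothetical list of contig names as they might appear in the CHROM column
printf 'NC_000001.11\nNC_000007.14\nNC_000023.11\nNC_000024.10\nNC_012920.1\n' > contigs.txt

# Derive a chr-style name from the numeric part of each accession
awk '{
    acc = $1
    sub(/\..*/, "", acc)          # drop the version suffix, e.g. ".14"
    n = substr(acc, 4) + 0        # "NC_000007" -> 7
    if (n == 23)         name = "chrX"
    else if (n == 24)    name = "chrY"
    else if (n == 12920) name = "chrM"
    else                 name = "chr" n
    print $1 "\t" name            # old name, whitespace, new name
}' contigs.txt > refseq_to_chr_sketch.txt

cat refseq_to_chr_sketch.txt
```

On the real data, the contig list could presumably be taken from the first column of the VCF itself (for example with zcat, grep -v '^#', cut -f1 and sort -u) so that the versions match the file exactly.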
Next, I attempted to use bcftools annotate --rename-chrs to change the encoding to the standard chromosome notation:
bcftools annotate --rename-chrs refseq_to_main_chr_mod.txt -o chr_freq_only_NC.vcf.gz -Oz freq_only_NC.vcf.gz
However, I received the following warning:
[W::vcf_parse] Contig 'NC_000007.12' is not defined in the header. (Quick workaround: index the file with tabix.)
Upon trying to create an index for this new VCF, I encountered another error:
tabix -p vcf freq_only_NC.vcf.gz
[E::hts_idx_push] Invalid record on sequence #2: end 1 < begin 2040518
tbx_index_build failed: freq_only_NC.vcf.gz
I am puzzled by this error since the VCFs only list one position and not a range with a beginning and end. Could someone please assist me in understanding and resolving these issues?
Any insights or advice would be greatly appreciated.
Thanks! In the end, it turned out to be a problem with the tabs in the file!
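Since the cause was tab-related, a check along these lines might help locate such records. It is a minimal sketch on a made-up four-line file (sample.vcf and bad_lines.txt are hypothetical names, and the records are invented), and it only flags data lines that have fewer than 8 tab-separated fields or a POS column that is not an integer; other kinds of malformed records would not be caught.

```shell
# Tiny hypothetical VCF: the second record uses spaces instead of tabs
printf '##fileformat=VCFv4.2\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> sample.vcf
printf 'NC_000001.11\t10001\trs1\tT\tA\t.\t.\t.\n' >> sample.vcf
printf 'NC_000002.12 2040518 rs2 G C . . .\n' >> sample.vcf

# Print the line number and content of every suspicious data line
awk -F'\t' '!/^#/ && (NF < 8 || $2 !~ /^[0-9]+$/) { print NR ": " $0 }' sample.vcf > bad_lines.txt

cat bad_lines.txt
```

For a compressed file, the same awk filter could be fed from zcat freq_only_NC.vcf.gz instead of reading sample.vcf.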