Question

VCF: Replacing RefSeq ID to chr in #CHROM

3

Entering edit mode

4.4 years ago

Hadi M ▴ 50

Hi everyone,

I recently downloaded the latest dbSNP VCF and when opening the file, I noticed the #CHROM column is filled with RefSeq ID instead of chr1, chr2 or so on. Here is how the VCF looks like:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
NC_000001.11    10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
NC_000001.11    10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV
NC_000001.11    10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV

I would like to convert the RefSeq ID to its corresponding chromosome number:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    10019   rs775809821     TA      T       .       .       RS=775809821;dbSNPBuildID=144;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL
chr1   10039   rs978760828     A       C       .       .       RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV
chr1    10043   rs1008829651    T       A       .       .       RS=1008829651;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV

Is a tool or script available that can convert the RefSeq ID? Thank you in advance.

VCF • 7.0k views

ADD COMMENT • link updated 23 months ago by Lhl ▴ 760 • written 4.4 years ago by Hadi M ▴ 50

score 14 · Answer 1 · 2020-02-20

Yeah, that was a real kick in the pants, as they didn't seem to post anything in the release notes, which still list them by chr1, chr2, etc.

Best. documentation. EVAR.

The easy solution is to have a renaming file for all the chromosomes of the format old_name new_name\n per bcftools and use the --rename-chrs option:

bcftools annotate --rename-chrs rename_file.txt --threads num_threads -o new.vcf.gz -Oz original.vcf.gz

Update: NCBI Assembly to the rescue. Look up your assembly (I did both GRCh37.p13 and GRCh38.p13) and click "Download the full sequence report". Or if you copy the links:

# Assembly reports for renaming chromosomes
report_dir='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405'
wget -N "${report_dir}/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt"
wget -N "${report_dir}/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt"

# Grab the useful columns
for k in *assembly_report.txt
  do
    out=$(echo $k | sed 's/.txt/.chrnames/')
    grep -e '^[^#]' $k | awk '{ print $7, $1 }' > $out
done

# Annotate
bcftools annotate \
  --rename-chrs GCF_000001405.25_GRCh37.p13_assembly_report.chrnames \
  --threads 10 -Oz \
  -o GRCh37.dbSNP153.vcf.gz \
  GCF_000001405.25.gz
bcftools annotate \
  --rename-chrs GCF_000001405.39_GRCh38.p13_assembly_report.chrnames \
  --threads 10 -Oz \
  -o GRCh38.dbSNP153.vcf.gz \
  GCF_000001405.38.gz

GenoMax · Answer 2 · 2020-02-20

1

Entering edit mode

4.2 years ago

tothepoint ▴ 800

Look for the specific position of ref sequence here : Reference sequence with chromosome position

Change each reference position with chr1, chr2 and so on :

awk '{gsub(/NC_000001.11/, "chr1"); gsub(/NC_000002.12/, "chr2") gsub(/NC_000003.13/, "chr3") .......); print;}' your.vcf > your_final.vcf

Here you go.

ADD COMMENT • link updated 4.2 years ago by GenoMax 141k • written 4.2 years ago by tothepoint ▴ 800