Question

Can multiple RSids be on the same position?

24 months ago

Julian ▴ 10

Hi All,

I convert a list of rsIDs to chromosome/position pairs so that I can extract the matching records from VCF files.

However, I'm not sure this logic is sound: when I extract by chrom/pos, do I also need to check that the bases match before I can call the record by that rsID?

For example, say rsID 123 refers to chromosome 1, position 10, A > T. If I extract chromosome 1, position 10 from my VCF file and the substitution there is different, let's say A > G, is it still rsID 123? Or would a different rsID be assigned in that case?

Kind regards, Julian

dbSNP RSid • 958 views

updated 24 months ago by Ram 43k • written 24 months ago by Julian ▴ 10

score 1 · Answer 1 · 2022-05-10

I'm interested in the accurate answer to this too, as it is vital in VCF annotation and probably has a definite answer.

From my experience, a few fields matter when comparing against dbSNP: CHROM, POS, REF and VC (Variation Class), though I am not 100% sure about this. I see several entries at the same CHROM:POS, but they differ in REF, or at least in VC. For example,

rs978760828 and rs1639542564 are both at chr1:10039 and have REF allele A, but have different ALT alleles (C,G vs AC) and different VC (SNV vs INDEL).

  NC_000001.11    10039   rs978760828 A   C,G .   .   RS=978760828;dbSNPBuildID=150;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=Siberian:0.5,0.5,.|dbGaP_PopFreq:1,0,0
  NC_000001.11    10039   rs1639542564    A   AC  .   .   RS=1639542564;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0
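To check how often this happens, one option is to group records by position and list every rsID found there. This is only a minimal sketch: the helper name `rsids_by_position` is made up for illustration, the two records are the ones quoted above with the INFO column trimmed to RS and VC, and a real dbSNP file would be read line by line from the gzipped VCF instead.

```python
from collections import defaultdict

# The two records quoted above; INFO is trimmed to RS and VC for brevity.
records = """\
NC_000001.11 10039 rs978760828 A C,G . . RS=978760828;VC=SNV
NC_000001.11 10039 rs1639542564 A AC . . RS=1639542564;VC=INDEL
"""

def rsids_by_position(text):
    """Map (CHROM, POS) to the list of (rsID, REF, ALT) records found there."""
    by_pos = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines and VCF header lines.
        if not line or line.startswith("#"):
            continue
        chrom, pos, rsid, ref, alt = line.split()[:5]
        by_pos[(chrom, int(pos))].append((rsid, ref, alt))
    return dict(by_pos)

by_pos = rsids_by_position(records)
for (chrom, pos), hits in by_pos.items():
    if len(hits) > 1:
        print(chrom, pos, "has", len(hits), "rsIDs:", hits)
```

Any position reported here cannot be resolved to a single rsID from CHROM and POS alone.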

I tried to understand the VC field, especially the INDEL vs DEL vs INS calls, but it seems to depend more on when the entries were added than on whether the change is a deletion or an insertion. For example, if two deletions occur at a location, the earlier one is annotated as INDEL and the later one as DEL (similarly for insertions).
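One way to test this impression on more records is to derive the kind of change from the REF and ALT lengths and print it next to the VC value. The sketch below does that for four of the records quoted in this answer. The function `shape_of_change` and its labels are my own naming, not anything defined by dbSNP, and it assumes a single ALT allele per record.

```python
def shape_of_change(ref, alt):
    """Classify a single REF/ALT pair by allele length (illustrative labels)."""
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"

# (rsID, REF, ALT, VC) taken from records quoted in this answer.
examples = [
    ("rs1639543798", "C", "CT", "INDEL"),
    ("rs1639543820", "CT", "C", "DEL"),
    ("rs1639545966", "C", "CT", "INS"),
    ("rs1639545986", "CCCAA", "C", "INDEL"),
]

for rsid, ref, alt, vc in examples:
    print(rsid, ref, alt, "shape:", shape_of_change(ref, alt), "VC:", vc)
```

If VC followed the shape of the change alone, every insertion would be INS and every deletion DEL, which is not what these records show.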

In the two examples below, an insertion (C>CT and A>AC) is annotated as INDEL:

NC_000001.11    10054   rs1639543798    C   CT  .   .   RS=1639543798;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0
NC_000001.11    10054   rs1639543820    CT  C   .   .   RS=1639543820;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=DEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0

NC_000001.11    10057   rs1570391741    A   C,G .   .   RS=1570391741;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9935,0.006507,.|SGDP_PRJ:0.5,.,0.5|dbGaP_PopFreq:1,0,0
NC_000001.11    10057   rs1639544026    A   AC  .   .   RS=1639544026;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0

Here a deletion (CCCAA>C) is annotated as INDEL:

NC_000001.11    10106   rs1639545966    C   CT  .   .   RS=1639545966;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INS;R5;GNO;FREQ=dbGaP_PopFreq:1,0
NC_000001.11    10106   rs1639545986    CCCAA   C   .   .   RS=1639545986;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0

The most informative example I found:

NC_000001.11    10108   rs62651026  C   T   .   .   RS=62651026;dbSNPBuildID=129;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=dbGaP_PopFreq:1,0
NC_000001.11    10108   rs1322538365    C   CT  .   .   RS=1322538365;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INS;R5;GNO;FREQ=dbGaP_PopFreq:1,0
NC_000001.11    10108   rs1377973775    CAACCCT C   .   .   RS=1377973775;dbSNPBuildID=151;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=TOMMO:0.9999,6.861e-05
NC_000001.11    10108   rs1639546068    C   CA  .   .   RS=1639546068;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=INDEL;R5;GNO;FREQ=dbGaP_PopFreq:1,0

In the above, C>CT is called an INS, C>CA is called an INDEL, and CAACCCT>C is also called an INDEL. What this tells me is that entries sharing the same CHR:POS and REF allele still differ in VC, and that CHR:POS:VC is not unique either (two different INDEL entries sit at 10108). So you need to match on either CHR:POS:REF:ALT or CHR:POS:REF:VC to assign an rsID accurately.
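Applied to the original question, matching on CHR:POS:REF:ALT could look roughly like the sketch below. It indexes the four records at 10108 by the full key and splits multi-allelic ALT fields on commas. The names `build_index` and `lookup` are hypothetical, and the sketch assumes your VCF uses the same contig names (NC_000001.11 rather than chr1) and the same allele representation as dbSNP, which you would need to check.

```python
# (CHROM, POS, rsID, REF, ALT) for the four records at position 10108.
records = [
    ("NC_000001.11", 10108, "rs62651026", "C", "T"),
    ("NC_000001.11", 10108, "rs1322538365", "C", "CT"),
    ("NC_000001.11", 10108, "rs1377973775", "CAACCCT", "C"),
    ("NC_000001.11", 10108, "rs1639546068", "C", "CA"),
]

def build_index(records):
    """Index rsIDs by (CHROM, POS, REF, ALT), one key per ALT allele."""
    index = {}
    for chrom, pos, rsid, ref, alts in records:
        # dbSNP can list several ALT alleles in one record, e.g. "C,G".
        for alt in alts.split(","):
            index[(chrom, pos, ref, alt)] = rsid
    return index

def lookup(index, chrom, pos, ref, alt):
    """Return the rsID for an exact allele match, or None if there is none."""
    return index.get((chrom, pos, ref, alt))

index = build_index(records)
print(lookup(index, "NC_000001.11", 10108, "C", "T"))   # rs62651026
print(lookup(index, "NC_000001.11", 10108, "C", "CT"))  # rs1322538365
print(lookup(index, "NC_000001.11", 10108, "C", "G"))   # None
```

With this kind of key, a variant at the right position but with a different substitution simply finds no rsID instead of inheriting the wrong one.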

All examples above were taken from the file https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz