Why is the VCF POS different in dbSNP and ClinVar?
2.3 years ago
magnolia ▴ 20

Hi,

I subset dbSNP with ClinVar by VCF position using bcftools, and noticed that some of the variants are missing from the subset file. The cause turned out to be a one-base difference in POS. The dbSNP REF and ALT values also start one base earlier than the ClinVar ones. For example, rs1555813914:

(based on GRCh37)

dbSNP

POS: 25304044

REF: ATC

ALT: AAAA

ClinVar

POS: 25304045

REF: TC

ALT: AAA

Why is there such a difference?

How can annotation tools such as Ensembl VEP annotate these positions correctly?

Thank you!

vep clinvar dbsnp vcf • 1.8k views

Hi magnolia, I recommend you take a step back and read about 3 things:

1) 0-based indexing and 1-based indexing

2) differences between genomic builds

3) correct naming of indels

https://varnomen.hgvs.org/recommendations/DNA/variant/delins/


Thank you for the answer Vincent!

1) I thought VCFs are 1-based. Both files are VCFs, so shouldn't they have the same POS? Also, if one file were 0-based, then shouldn't almost all positions be unmatched?

2) I downloaded the latest releases of both, for GRCh37. Will there be any difference between them? There is no other option. Only GRCh37 and GRCh38.

3) I couldn't figure out what the naming of indels has to do with positions.


I think the two variants are the same, but may not be "normalized" the same way (e.g. the dbSNP record in the VCF has an extra A at the start). I am not sure why that would be.


Probably one of them is not normalized. I'll have to check the sequence, I guess. The variants are the same, as the ClinVar entry is mapped to the same rsID.
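To make the "same variant, different representation" point concrete, here is a minimal Python sketch that trims the bases shared by REF and ALT in the two records from the question. The function name `trim_shared_bases` is made up for illustration. It only trims shared bases and does not left-align, since that would need the reference sequence.

```python
def trim_shared_bases(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    pos is the 1-based VCF POS. Hypothetical helper: it only trims and
    does NOT left-align, which would require the reference sequence.
    """
    # Trim a shared suffix first; POS does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim a shared prefix; each removed leading base shifts POS by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The two records for rs1555813914 quoted in the question (GRCh37).
dbsnp = trim_shared_bases(25304044, "ATC", "AAAA")
clinvar = trim_shared_bases(25304045, "TC", "AAA")

print(dbsnp)    # (25304045, 'TC', 'AAA')
print(clinvar)  # (25304045, 'TC', 'AAA')
print(dbsnp == clinvar)  # True
```

Once the leading A is trimmed from the dbSNP record, both records have the same POS, REF and ALT, which is why matching on the raw POS alone misses this variant.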


"I thought VCFs are 1-based."

However, this does not imply that any reference material you might find (i.e. databases of annotations etc.) is 1-based. A lot of resources are not.
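As an illustration of the indexing point, here is a small Python sketch converting between 1-based VCF coordinates and a 0-based, half-open interval such as the one BED files use. BED is my own choice of example and the function names are hypothetical. Note that a pure indexing mismatch would shift the position but leave REF and ALT untouched, whereas in the example from the question the alleles differ as well.

```python
def vcf_to_bed_interval(pos, ref):
    """Convert a 1-based VCF POS plus REF allele to a 0-based,
    half-open (start, end) interval covering the REF bases."""
    start = pos - 1          # 0-based start
    end = start + len(ref)   # end is exclusive
    return start, end


def bed_start_to_vcf_pos(start):
    """Convert a 0-based start coordinate back to a 1-based VCF POS."""
    return start + 1


# ClinVar record from the question: POS 25304045, REF "TC".
print(vcf_to_bed_interval(25304045, "TC"))  # (25304044, 25304046)
print(bed_start_to_vcf_pos(25304044))       # 25304045
```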

"Will there be any difference between GRCh37 and GRCh38?"

Yes, there are thousands and thousands of differences.

"There is no other option. Only GRCh37 and GRCh38."

No, this is not correct. However, these are certainly the most commonly used.

"I couldn't figure out what the naming of indels has to do with positions."

Please do not take any further action until you understand this. Please see, for example, https://pubmed.ncbi.nlm.nih.gov/25701572/


Thanks a lot for the normalization article! That cleared up a lot.

I was actually asking about differences between dbSNP and ClinVar for GRCh37, not between GRCh37 and GRCh38. But apparently there ARE differences between dbSNP and ClinVar even when both are for the same reference genome.

Now my question is: do I have to normalize BOTH the dbSNP VCF and the individual's VCF to annotate correctly (or, in this case, both dbSNP and ClinVar)?


I guess the way I'd respond is: while you don't "have to", it would be considered best practice. I have not actually looked for tools that do this recently. If you decide to and need help, let me know.


Thank you so much Vincent. For sure, it's best practice and seems very important. ClinVar has the normalized alleles. I normalized dbSNP with vt, and the records match ClinVar (though I haven't checked ALL positions yet). It seems crucial for annotation by position and allele.
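For anyone wondering why normalization needs the reference genome and not just the alleles, here is a rough Python sketch of left-alignment plus trimming for a biallelic indel. It follows my reading of the procedure in the normalization article linked earlier in the thread and is not the actual code of vt or any other tool. The reference `TOY_REF` is an invented 9-base sequence and `normalize` is a hypothetical name.

```python
TOY_REF = "TGCACACAG"  # invented toy reference; position 1 is TOY_REF[0]


def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant (1-based pos).

    Sketch only: no multiallelic support, no handling of N or IUPAC codes.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF does not match the reference sequence")

    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Finally trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two ways of writing "delete one CA unit" from the CACACA repeat.
right_shifted = normalize(TOY_REF, 6, "ACA", "A")
left_shifted = normalize(TOY_REF, 2, "GCA", "G")

print(right_shifted)  # (2, 'GCA', 'G')
print(left_shifted)   # (2, 'GCA', 'G')
```

Both spellings of the deletion end up as the same record, which is what makes matching on position and allele reliable once both files are normalized. In practice you would use an existing tool such as vt (as above) against the real reference FASTA rather than code like this.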
