Where can I look up old merged rsIDs?
7.2 years ago
Mike Dacre ▴ 130

I have a huge dataset of SNPs that I am trying to get hg19 locations for based only on the rsID. Right now I am doing that by downloading the latest version of dbSNP, turning it into an SQLite database, and running one huge, long-running query. This works, although it takes forever, but it fails for all SNPs that have been merged, e.g. rs111199278.

I need a way to get the positions of these old rsIDs in batch, possibly by converting them to the latest rsID in dbSNP and then looking them up with my slow database query.

Is there a good way to do this efficiently? I think I will end up needing to look up about 100K-500K rsIDs (I don't know yet because my big location lookup hasn't finished).

SNP • 3.3k views
ADD COMMENT

Maybe a

curl -s 2>&1 "https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs111199278" | head -n 1

that gives you "merged into rs#" might be of help?

ADD REPLY

Yeah, that would work in principle, but I would need to put a delay of at least half a second between requests to avoid making NCBI angry, which would put the run time at around half a day. Still, it is a pretty good idea if there isn't a more efficient way to do it somewhere. I was kind of hoping for some kind of lookup table or queryable database though.
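As a rough check on that estimate, here is a minimal sketch of the arithmetic, assuming strictly sequential requests and counting only the delay between them (network latency would add to this). The function name `eta_hours` is just an illustration.

```shell
# Estimated wall-clock hours for n sequential requests
# with a fixed delay (in seconds) between them.
eta_hours() {
    awk -v n="$1" -v delay="$2" 'BEGIN { printf "%.1f\n", n * delay / 3600 }'
}

eta_hours 100000 0.5   # low end of the 100K-500K range
eta_hours 500000 0.5   # high end of the range
```

With a half-second delay this prints 13.9 and 69.4, so "half a day" only holds near the low end of the range.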

ADD REPLY

This seems to me to be an important topic, and I wonder if anyone can comment on the process by which NCBI keeps this resource up to date.

I found the change stamp for this merge id resource at

https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/database/organism_data/ 

specifically

https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/database/organism_data/RsMergeArch.bcp.gz

The timestamp is

RsMergeArch.bcp.gz 2018-02-07 12:09 146M

So that table is 3 years old, but notice that it is associated (according to the directory path) with b151 (dbSNP) and GRCh38 (hg38), so perhaps it is "up to date" even though it is 3 years old.

Has the merging process ended? If not, is there a more recent resource to use?

ADD REPLY

I think you should ask NCBI support about this by putting a ticket in via their help desk. Please post their response here once you hear back from them.

ADD REPLY

Thank you for this tip. Here is the answer from NLM:

"Those files are from the legacy snp build process, which was decommissioned in 2018.

The SNP build has been using SPDI-based asserted location workflow since build 152. That system no longer produces relevant file on this. The new build system only produces json and vcf: https://ftp.ncbi.nlm.nih.gov/snp/latest_release/"

ADD REPLY
7.2 years ago

Download ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz and load its contents into an SQLite database?

$ curl -s "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/RsMergeArch.bcp.gz" | gunzip -c | grep -m1 -w 111199278
111199278   3095314 144 1   2015-04-01 13:45:00.837 2015-07-14 23:14:01.543 3095314 1   rsm

PS: I don't think sqlite3 is the database of choice for such a large dataset.
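To turn that file into a plain lookup table, something like the sketch below might do. It assumes the rows are tab-separated, with the retired rsID in column 1 and the current rsID (rsCurrent) in column 7, which matches the sample row above but is worth checking against the dbSNP schema. The name `merge_table` is made up for the example.

```shell
# Build a two-column (old rsID -> current rsID) table from RsMergeArch rows.
# Assumption: tab-separated input, retired ID in column 1, rsCurrent in column 7.
merge_table() {
    awk -F '\t' -v OFS='\t' 'NF >= 7 && $7 != "" { print "rs" $1, "rs" $7 }'
}

# Sample row from the grep output above, written with explicit tabs.
printf '111199278\t3095314\t144\t1\t2015-04-01 13:45:00.837\t2015-07-14 23:14:01.543\t3095314\t1\trsm\n' |
    merge_table
```

For the sample row this prints rs111199278 and rs3095314 separated by a tab. For the full table, the output of the curl and gunzip commands above can be piped into `merge_table` and saved to a file.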

ADD COMMENT

Perfect, thanks so much Pierre.

I agree that the sqlite db isn't the best bet at this point; I think I will migrate it over to PostgreSQL. Is it your experience that PostgreSQL/MySQL are faster or more resource-efficient for very large queries?

ADD REPLY

I only have experience with MySQL and NoSQL/BerkeleyDB, but most of the time, just as with your problem, I use a simple Linux sort/join.
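A minimal sketch of what such a sort/join could look like, assuming two hypothetical inputs: a file with one rsID per line, and a tab-separated two-column table mapping old rsIDs to current ones. The function name `resolve_merged` is made up; IDs with no merge record are passed through unchanged.

```shell
# Resolve retired rsIDs with sort + join.
# $1: file with one rsID per line (hypothetical input)
# $2: tab-separated table, old rsID in column 1, current rsID in column 2
resolve_merged() {
    tab=$(printf '\t')
    tmp=$(mktemp -d)
    # join needs both inputs sorted on the join field with the same collation.
    LC_ALL=C sort -u "$1" > "$tmp/ids"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tmp/table"
    # -a 1 keeps IDs without a merge record; -e NA fills their missing column,
    # which awk then replaces with the ID itself.
    LC_ALL=C join -t "$tab" -a 1 -e NA -o 0,2.2 "$tmp/ids" "$tmp/table" |
        awk -F '\t' -v OFS='\t' '{ print $1, ($2 == "NA" ? $1 : $2) }'
    rm -rf "$tmp"
}
```

Calling `resolve_merged ids.txt merged.tsv` (both file names are placeholders) prints each input ID next to the ID to use for the position lookup. The same sort/join pattern could then be applied against the position table, instead of running one query per ID.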

ADD REPLY

OK, I will try that then, thanks again!

ADD REPLY
