Faster way of matching rsnumbers to positions
1
0
Entering edit mode
2.9 years ago

Hi all,

I work a lot with GWAS Summary Stats and a lot of times the RS number is missing or incomplete or the CHR:BP is missing. I have created a python script that matches rs to chr:bp and chr:bp to rs (accounting for alleles and synonyms) using a whole dbSNP reference file (but parsed to only contain relevant data). The fastest way I could come up with is loading the GWAS in memory as a dictionary and then iterating through the dbSNP file (29GB uncompressed, 8.5GB gzipped, 1000million SNPs) and it takes around 40 minutes to finish.

Using the dbSNP file to look up into would be significantly faster but it takes a long time to load it in the memory and uses more than 90GB RAM to the point it crashes. I tried using it as a mysql db but it was very slow. I've tried splitting it up per chromosome and use multithreading for lookups but it only got slower.

Does anyone know any obvious tricks I've missed?

match GWAS rsnumber chr SNP • 731 views
ADD COMMENT
2
Entering edit mode
2.9 years ago

just using linux ? (not tested)

prepare your file id->pos:

bcftools query -f '%ID\t%CHROM:%POS\n' ncbi.dbsnp.vcf.gz | LC_ALL=C sort -T . -t $'\t' -k1,1  > id2pos.tsv 

sort and join your query:

  LC_ALL=C sort -T . -t $'\t' -k1,1  ids.txt | join -t $'\t' -1 1 -2 1  id2pos.tsv -

otherwise, you could put your IDs in a database like sqlite3 and then query

awk '{printf("select ID,CHROM,POS from ID2POS where ID=\"%s\";\n",$1);}' ids.txt | sqlite3 mydb.sqlite
ADD COMMENT

Login before adding your answer.

Traffic: 3524 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6