Question

Faster way of matching rsnumbers to positions

0

Entering edit mode

2.9 years ago

TreasureChest • 0

Hi all,

I work a lot with GWAS Summary Stats and a lot of times the RS number is missing or incomplete or the CHR:BP is missing. I have created a python script that matches rs to chr:bp and chr:bp to rs (accounting for alleles and synonyms) using a whole dbSNP reference file (but parsed to only contain relevant data). The fastest way I could come up with is loading the GWAS in memory as a dictionary and then iterating through the dbSNP file (29GB uncompressed, 8.5GB gzipped, 1000million SNPs) and it takes around 40 minutes to finish.

Using the dbSNP file to look up into would be significantly faster but it takes a long time to load it in the memory and uses more than 90GB RAM to the point it crashes. I tried using it as a mysql db but it was very slow. I've tried splitting it up per chromosome and use multithreading for lookups but it only got slower.

Does anyone know any obvious tricks I've missed?

match GWAS rsnumber chr SNP • 731 views

ADD COMMENT • link updated 2.9 years ago by Pierre Lindenbaum 166k • written 2.9 years ago by TreasureChest • 0

score 2 · Answer 1 · 2022-11-10

just using linux ? (not tested)

prepare your file id->pos:

bcftools query -f '%ID\t%CHROM:%POS\n' ncbi.dbsnp.vcf.gz | LC_ALL=C sort -T . -t $'\t' -k1,1  > id2pos.tsv

sort and join your query:

  LC_ALL=C sort -T . -t $'\t' -k1,1  ids.txt | join -t $'\t' -1 1 -2 1  id2pos.tsv -

otherwise, you could put your IDs in a database like sqlite3 and then query

awk '{printf("select ID,CHROM,POS from ID2POS where ID=\"%s\";\n",$1);}' ids.txt | sqlite3 mydb.sqlite