Hi all,
I work a lot with GWAS summary statistics, and often the rs number is missing or incomplete, or the CHR:BP is missing. I have written a Python script that maps rs to chr:bp and chr:bp to rs (accounting for alleles and synonyms) using a full dbSNP reference file that has been parsed down to only the relevant fields. The fastest approach I could come up with is to load the GWAS into memory as a dictionary and then iterate through the dbSNP file (29 GB uncompressed, 8.5 GB gzipped, 1,000 million SNPs); this takes around 40 minutes to finish.
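For reference, here is a minimal sketch of that streaming approach, not the actual script. It assumes a hypothetical tab-separated reference layout (rsid, chrom, pos, ref, alt) and GWAS rows held as dicts with the keys `rsid`, `chrom`, `pos`, `a1` and `a2`. The function names and the file layout are illustrative assumptions, and synonym handling and strand flips are left out.

```python
import gzip


def fill_missing(gwas_rows, dbsnp_lines):
    """Fill missing rsid or chrom/pos in gwas_rows from an iterable of reference lines.

    Assumed (hypothetical) reference layout: rsid<TAB>chrom<TAB>pos<TAB>ref<TAB>alt
    """
    by_rsid = {}  # rsid -> rows that lack a position
    by_pos = {}   # (chrom, pos) -> rows that lack an rsid
    for row in gwas_rows:
        if row["rsid"] and row["pos"] is None:
            by_rsid.setdefault(row["rsid"], []).append(row)
        elif row["rsid"] is None and row["pos"] is not None:
            by_pos.setdefault((row["chrom"], row["pos"]), []).append(row)

    # Single pass over the reference; only the GWAS is held in memory.
    for line in dbsnp_lines:
        rsid, chrom, pos, ref, alt = line.rstrip("\n").split("\t")
        pos = int(pos)
        for row in by_rsid.get(rsid, ()):
            row["chrom"], row["pos"] = chrom, pos
        for row in by_pos.get((chrom, pos), ()):
            # Only accept the rsid when the allele pair matches (order ignored).
            if {row["a1"], row["a2"]} == {ref, alt}:
                row["rsid"] = rsid
    return gwas_rows


def fill_missing_from_file(gwas_rows, path):
    """Same as fill_missing, reading a gzipped reference file in text mode."""
    with gzip.open(path, "rt") as handle:
        return fill_missing(gwas_rows, handle)
```

Because the lookup dictionaries are keyed on the GWAS rather than on the reference, memory use scales with the size of the summary statistics and the run time is dominated by decompressing and splitting the reference lines.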
Doing the lookups the other way round, with the dbSNP file held in memory, would be significantly faster, but it takes a long time to load and uses more than 90 GB of RAM, to the point that it crashes. I tried putting it in a MySQL database, but that was very slow. I have also tried splitting it up per chromosome and using multithreading for the lookups, but that only made it slower.
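As a rough illustration of why the in-memory version needs so much more RAM than the file size suggests, the sketch below compares the bytes one record occupies on disk with the size of the Python objects needed to hold it as a dictionary key and value. The record layout and the `entry_size` helper are hypothetical, and the figure is only an estimate: it ignores the dictionary's own hash table, and CPython may share some small objects.

```python
import sys


def entry_size(key, value):
    """Approximate bytes used by the key and value objects of one dict entry."""
    size = sys.getsizeof(key) + sys.getsizeof(value)
    if isinstance(value, tuple):
        # getsizeof on a tuple does not include the objects it points to.
        size += sum(sys.getsizeof(item) for item in value)
    return size


# One hypothetical reference record: rsid, chrom, pos, ref, alt
line = "rs123\t1\t12345\tA\tG"
rsid, chrom, pos, ref, alt = line.split("\t")

on_disk = len(line) + 1  # bytes in the uncompressed file, including the newline
in_memory = entry_size(rsid, (chrom, int(pos), ref, alt))

print(f"on disk: {on_disk} bytes, as Python objects: roughly {in_memory} bytes")
```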
Does anyone know any obvious tricks I've missed?