I have a computational problem.
I'm using Python to iterate over two CSV files.
CSV file 1 contains 6-7 columns; the important one is an "rs ID" column from dbSNP.
CSV file 2 has 3 columns, 2 of which matter: the rs ID column and a GENE symbol column.
Now I want to search: is an rs ID from CSV file 1 present in CSV file 2? If yes, take the gene symbol from CSV file 2 and put it into CSV file 1.
CSV file 1 is 1.3 GB, CSV file 2 is 8.8 MB.
I'm generating a dictionary in Python from CSV file 2 and using it to search in CSV file 1.
Problem: for every row (rs ID) in CSV file 1, my code iterates through the whole dictionary (the 8.8 MB file).
That takes way too much time... Do you know another approach that would make this search faster? I thought a dictionary/hashtable would be good, but it is way too slow.
Maybe I should build a suffix array from CSV file 2 instead of using a dictionary?
Or are there packages or other data structures in Python (vectorization methods)?
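For example, I wondered whether something like a pandas left merge would be the vectorized equivalent of this lookup (column names `rsid` and `gene` are placeholders, not my real headers):

```python
import pandas as pd

def annotate(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    # left merge keeps every row of file 1 and attaches the gene symbol
    # wherever the rs ID matches; unmatched rows get NaN
    return df1.merge(df2[["rsid", "gene"]], on="rsid", how="left")

# on the real files this would be something like:
# annotate(pd.read_csv("file1.csv"), pd.read_csv("file2.csv")) \
#     .to_csv("file1_annotated.csv", index=False)
```

Would that be faster than a plain dictionary for a 1.3 GB file, or is there a better-suited structure?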
I would be very grateful for your help!