Question: "big" Data csv search
1
4.9 years ago, hrbrt.sch (10) from Germany wrote:

Hey,

I have a computational problem.

I'm using Python to iterate over two CSV files.

CSV file 1: contains 6-7 columns; the important column is an "rs ID" column from dbSNP.

CSV file 2: 3 columns, 2 of them important: also the rs ID, and a gene symbol column.

My problem:

Now I want to search: is an rs ID from CSV file 1 in CSV file 2? If yes, take the gene symbol from CSV file 2 and put it into CSV file 1.

CSV file 1 = 1.3 GB, CSV file 2 = 8.8 MB.

I'm generating a dictionary in Python from CSV file 2 and using it to search in CSV file 1.

Problem: for every row (rs ID) in CSV file 1, it iterates through the whole dictionary (the 8.8 MB file).

That takes way too much time. Do you know another approach to make this search faster? I thought a dictionary/hashtable would be good, but it is way too slow.

Maybe creating a suffix array from CSV file 2 instead of using a dictionary?

Or are there packages or other data structures in Python (vectorization methods)?

I would be very grateful for your help!


csv python • 4.3k views
modified 4.9 years ago by pld (4.8k) • written 4.9 years ago by hrbrt.sch (10)
5

Don't use Python: sort both files using unix 'sort' and then use unix 'join'.

written 4.9 years ago by Pierre Lindenbaum (125k)

But I don't want to join them directly; I want to do a pattern search.

CSV file 2 has maybe 300,000 entries, CSV file 1 millions.

And I have to search for every "rs ID" from CSV file 2 in CSV file 1; if it occurs/matches, take the gene symbol from CSV file 2 into CSV file 1 at position "x".

written 4.9 years ago by hrbrt.sch (10)

If you sort at least the smaller file, then searching through it becomes much faster. If you sort both of them, then you don't even need to hold either of them in memory.

Also, if you really need it to be fast, then Python is probably not the best route; you'll end up getting vastly better performance in C/C++/etc.
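That sorted approach can be sketched in Python (the IDs and symbols below are made up): with both inputs ordered by rs ID, a single forward pass over each suffices.

```python
# Sketch of a sorted-merge lookup: both inputs are sorted by rs ID,
# so one forward pass finds every match and neither file has to be
# held in memory when streamed from disk.
file1_ids = ["rs1", "rs3", "rs5"]                    # sorted rs IDs from file 1
file2_rows = [("rs3", "GENE_A"), ("rs5", "GENE_B")]  # sorted (rs ID, symbol)

matches = {}
j = 0
for rid in file1_ids:
    # Advance through file 2 until we reach or pass the current rs ID.
    while j < len(file2_rows) and file2_rows[j][0] < rid:
        j += 1
    if j < len(file2_rows) and file2_rows[j][0] == rid:
        matches[rid] = file2_rows[j][1]
```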

written 4.9 years ago by Devon Ryan (93k)
2

That takes way to much time.... do you know an another approach to get this search faster? I thought a dictionary/hashtable would be good... but it is way to slow.

A search of a dictionary or hash table completes in O(1) or constant time. A search of a suffix tree completes in O(n) or linear time, and will generally be slower to construct than a dictionary, as well. 

A key-value pairing on the two columns in your second CSV file should give you a fast lookup time for each key match with the first CSV file.

You might post your code and we can take a look. Using a third-party library should be unnecessary for this task.
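For illustration, a minimal sketch of that key-value approach (the column names and rows below are made up):

```python
import csv
import io

# Made-up stand-ins for the two CSV files.
file2_text = "rs_id,gene_symbol\nrs123,CASP3\nrs456,TP53\n"
file1_text = "rs_id,other\nrs123,a\nrs999,b\n"

# Build the dictionary once: rs ID -> gene symbol.
symbols = {}
for row in csv.DictReader(io.StringIO(file2_text)):
    symbols[row["rs_id"]] = row["gene_symbol"]

# One O(1) lookup per row of file 1 -- no inner loop over the dict.
annotated = []
for row in csv.DictReader(io.StringIO(file1_text)):
    row["gene_symbol"] = symbols.get(row["rs_id"], "")
    annotated.append(row)
```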

modified 4.9 years ago • written 4.9 years ago by Alex Reynolds (29k)

I'm not sure if 'fast enough' is actually a problem here, if you only need to generate this dataset once, just let it cook overnight or while you're busy with something else. No sense in spending a few days writing a faster version of a script when you already have the data.

If you have to do this sort of thing frequently, I would use a database to solve this issue.
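That database route can be sketched with Python's built-in sqlite3 (the table and column names here are illustrative):

```python
import sqlite3

# In-memory database standing in for a persistent one on disk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file1 (rs_id TEXT, other TEXT)")
con.execute("CREATE TABLE file2 (rs_id TEXT, gene_symbol TEXT)")
con.executemany("INSERT INTO file1 VALUES (?, ?)",
                [("rs123", "a"), ("rs999", "b")])
con.executemany("INSERT INTO file2 VALUES (?, ?)", [("rs123", "CASP3")])

# An index on the lookup column keeps repeated joins fast.
con.execute("CREATE INDEX idx_rs ON file2 (rs_id)")

rows = con.execute(
    "SELECT f1.rs_id, f2.gene_symbol "
    "FROM file1 f1 LEFT JOIN file2 f2 ON f1.rs_id = f2.rs_id"
).fetchall()
```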

written 4.9 years ago by pld (4.8k)
1

I think @Pierre Lindenbaum is right. Another option would be to do it with "awk" in unix. If you give an example of your table, I could try to suggest something.

written 4.9 years ago by dago (2.6k)

example:

csv file 1:

RS ID
rs4846029
rs12135483

 

csv file 2:

rs id     gene symbol
rs123      CASP3

 

And now I want to search for the rs ID "rs123" in CSV file 1; if there is a match, write the gene symbol at position "x" (the match).


modified 4.9 years ago by Pierre Lindenbaum (125k) • written 4.9 years ago by hrbrt.sch (10)

sort both files on column rs/RS and use linux join.

written 4.9 years ago by Pierre Lindenbaum (125k)

But I can't use join.

CSV file 2 has e.g. 10 million rows, so there are a lot of rs IDs which occur multiple times (the other columns have unique IDs in CSV file 1, but not in CSV file 2)!

If I just join them, there will be no correct match, because an rs ID "rsxyz" may occur, for example, 10 times.

 

written 4.9 years ago by hrbrt.sch (10)

see option -v of 'join'

written 4.9 years ago by Pierre Lindenbaum (125k)

So in your CSV file 2 there are only partial IDs compared to CSV file 1?

written 4.9 years ago by dago (2.6k)
1

Shouldn't a Python dictionary (=a hash table) have O(1) lookup time? It should not have to iterate through 8.8 Mb as you write. Are you sure you implemented it correctly, or did I misunderstand something? I think your approach would work, by initially constructing a dictionary with all the rsids from file 2 as keys (with some arbitrary value like 1) and then just going through file 1 and checking for each line whether the dictionary has the rsid in question as a key.
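That membership check can be sketched like this (the IDs are made up):

```python
# Keys from file 2; a set gives the same O(1) membership test as a dict.
rsids = {"rs123", "rs456"}

# rs IDs streamed from file 1 -- one hash lookup per row.
rows_file1 = ["rs4846029", "rs123", "rs12135483"]
matches = [r for r in rows_file1 if r in rsids]
```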

written 4.9 years ago by Mikael Huss (4.7k)

I did exactly what you wrote. My dict has rs IDs as keys and the values are the gene symbols.

Here is an example:

    # row = row from CSV file 1, genes = dict from CSV file 2
    for row in reader:
        for key, value in genes.iteritems():  # scans the whole dict for every row
            if row['rs_id'] == key:
                row['geneSymbol'] = value

written 4.9 years ago by hrbrt.sch (10)
1

Try this and see if it helps:

    for row in reader:
        if genes.has_key(row['rs_id']):
            # ... do stuff

modified 4.9 years ago • written 4.9 years ago by Mikael Huss (4.7k)

I removed the "big data" tag, since a 1 gig file is too small to qualify.

written 4.9 years ago by Devon Ryan (93k)

Use pandas to read the CSV files and query them.

written 4.9 years ago by Giovanni M Dall'Olio (26k)
1
4.9 years ago, Giovanni M Dall'Olio (26k) from London, UK wrote:

With python/pandas:

import pandas
file1 = pandas.read_csv('file1')
file2 = pandas.read_csv('file2')
print(file1)
rs_list = file1['rs_id'].tolist()

# Get rows of file2 corresponding to rs ids in file1:
file2[file2['rs_id'].isin(rs_list)]

Alternatively:

file2.query('rs_id in @rs_list')

In general, avoid parsing tabular files manually and accessing them through dictionaries; it is much more versatile to use pandas dataframes in these cases :-).
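A merge can also carry the symbol straight into the first table; a sketch with made-up frames standing in for the two files (assuming columns named rs_id and gene_symbol):

```python
import pandas as pd

# Made-up frames; in practice these would come from pd.read_csv(...).
file1 = pd.DataFrame({"rs_id": ["rs4846029", "rs123"], "other": ["a", "b"]})
file2 = pd.DataFrame({"rs_id": ["rs123"], "gene_symbol": ["CASP3"]})

# Left merge: every row of file1 is kept; matching rs IDs get the symbol,
# the rest get NaN.
annotated = file1.merge(file2, on="rs_id", how="left")
```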

written 4.9 years ago by Giovanni M Dall'Olio (26k)
1
4.9 years ago, pld (4.8k) from United States wrote:

Here's my take on it. It seems what you're doing now is identifying all of the annotations, storing them in memory, and then writing to disk. If all you're doing is a simple join, you can skip this and write the symbol as you build your output file:

import csv
import sys

def main():
    example_csv_1 = sys.argv[1]
    example_csv_2 = sys.argv[2]
    csv_1_id_col  = int(sys.argv[3])
    csv_2_id_col  = int(sys.argv[4])
    csv_2_sym_col = int(sys.argv[5])
    outfile       = sys.argv[6]

    # Build the rs ID -> gene symbol lookup from the small file.
    symbols = dict()
    with open(example_csv_2, 'rb') as fi:
        for val in csv.reader(fi, delimiter=','):
            if val[csv_2_id_col] in symbols:
                print "Error: duplicate symbols found for", val[csv_2_id_col]
                sys.exit(1)
            symbols[val[csv_2_id_col]] = val[csv_2_sym_col]

    # Stream the big file row by row; it never has to fit in memory.
    with open(example_csv_1, 'rb') as fi, open(outfile, 'w') as fo:
        for d in csv.reader(fi, delimiter=','):
            try:
                sym = symbols[d[csv_1_id_col]]
            except KeyError:
                print "Error: no gene symbol found for", d[csv_1_id_col]
                sys.exit(1)
            fo.write(','.join(d + [sym]))
            fo.write('\n')

    return 0

main()

csv1:

a,stuff
b,stuff
c,stuff
d,stuff

csv2:

a,symA
b,symB
c,symC
d,symD

output:

a,stuff,symA
b,stuff,symB
c,stuff,symC
d,stuff,symD

 

Again, if all you want in the end is a CSV file annotated with gene symbols, there's no need to make it any more complex than it has to be. Just grab the symbols as you write your output file. This saves time, since you don't iterate through your large CSV file twice, and memory, since you don't store the gene symbol for each row of the big CSV file.

This script assumes that each ID is associated with a single symbol; if there are multiple symbols per ID, it can easily be updated. If you know there aren't any IDs with missing symbols (big CSV) or IDs with multiple symbols (small CSV), you can omit the checks I have in there and gain a bit of performance.
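If multiple symbols per ID do occur, one way to adapt the lookup (a sketch with made-up rows) is a dict of lists instead of erroring on duplicates:

```python
import csv
import io

# Made-up stand-in for the small CSV, including a duplicated rs ID.
file2_text = "rs123,CASP3\nrs123,CASP3B\nrs456,TP53\n"

symbols = {}
for rs_id, sym in csv.reader(io.StringIO(file2_text)):
    # Collect every symbol seen for an ID.
    symbols.setdefault(rs_id, []).append(sym)
```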

modified 4.9 years ago • written 4.9 years ago by pld (4.8k)
Powered by Biostar version 2.3.0