Question

Parsing Custom Snp Info File In Python

2

Entering edit mode

15.1 years ago

Eric Normandeau 11k

Hi,

I have a file containing information about SNPs (contig number, position, variant counts per individual) and would like to parse that file using Python. Here is an example of the file format (tab separated text file):

Contig_nb    Pos    tag_name    A    C    G    T    N    *    -
Contig_3    365    LD_Tag04    1    3    0    0    0    0    8
Contig_3    365    LD_Tag05    0    0    0    5    0    0    8
Contig_3    365    LD_Tag07    0    0    0    1    0    0    9
Contig_3    365    LD_Tag08    1    0    0    4    0    0    10
Contig_128    1432    DR_Tag09    0    2    1    0    0    0    11
Contig_128    1432    DR_Tag11    0    4    1    0    0    0    16
Contig_128    1432    DR_Tag15    0    0    3    0    0    0    9
Contig_128    1432    DR_Tag16    0    2    0    0    0    0    10
Contig_128    1432    LD_Tag01    0    4    8    0    0    0    18

The crux of the problem is to regroup lines that share a similar contig_nb AND position and then treat them as a block. I thus want a generator that permits iteration over all the groups (identical contig_np AND pos) present in the file and return them as a list of lists (each line in a group being a list with in 'group' list).

I am trying to avoid doing a custom while loop where I include the lines as long as they share the same contig_np and pos and then starting a new group when they don't

How would you do it? Could 'groupby' be used in such an instance?

Many thanks!

python • 6.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 15.1 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2010-06-10

7

Entering edit mode

15.1 years ago

Will 4.6k

As long as everything is properly sorted something like this should work:

import csv
from itertools import groupby
from operator import itemgetter

#only needed for example purposes
import StringIO

examplestr = """Contig_nb   Pos tag_name    A   C   G   T   N   *   -
Contig_3    365 LD_Tag04    1   3   0   0   0   0   8
Contig_3    365 LD_Tag05    0   0   0   5   0   0   8
Contig_3    365 LD_Tag07    0   0   0   1   0   0   9
Contig_3    365 LD_Tag08    1   0   0   4   0   0   10
Contig_128  1432    DR_Tag09    0   2   1   0   0   0   11
Contig_128  1432    DR_Tag11    0   4   1   0   0   0   16
Contig_128  1432    DR_Tag15    0   0   3   0   0   0   9
Contig_128  1432    DR_Tag16    0   2   0   0   0   0   10
Contig_128  1432    LD_Tag01    0   4   8   0   0   0   18
"""

if __name__ == '__main__':

    #replace this with your file-handle
    fhandle = StringIO.StringIO(examplestr)
    lgen = csv.DictReader(fhandle, delimiter='\t')

    for key, group in groupby(lgen, key = itemgetter('Contig_nb')):
        print key
        for row in group:
            print row['tag_name']

I prefer using the csv module for my parsing ... especially if my file has headers like yours does ... I can't tell from the example what you delimiters are but you can easily modify that in the csv.DictReader line.

In this case key will be Contig_8 and group will be an iterator for each line.

Hope that helps

Will

PS ... If its not sorted by Contig_nb then this solution will have doubles. If its not you'll either need to do a sort operation first or use the defaultdict answer that was mentioned in another post.

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 15.1 years ago by Will 4.6k

3

Entering edit mode

Looks good. you're just missing the first argument to groupby (should be your line_gen) and then your key function can be

key=operator.itemgetter('Contig_nb', 'Pos')

ADD REPLY • link updated 6.8 years ago by Ram 45k • written 15.1 years ago by brentp 24k

0

Entering edit mode

that's what you get for not using Unit-Tests ;) ... I actually prefer lambda expressions over most operator expressions. To me they're just a little easier to read .. and the speed hit isn't that bad unless your doing terabytes worth of info.

ADD REPLY • link 15.1 years ago by Will 4.6k

0

Entering edit mode

@Will, I would be glad to give you the answer if you add the missing parts so that it can be readily used. Thanks!

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

@Eric ... Edited the code to make more sense :)

ADD REPLY • link 15.1 years ago by Will 4.6k

0

Entering edit mode

@Will, Many thanks :)

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

score 2 · Answer 2 · 2010-06-10

2

Entering edit mode

15.1 years ago

Paulo Nuin ★ 3.7k

You can use a defaultdict initialized as a list in order to group your lines, as you read the file. Should be straightforward and instead of a list of list you have a dict-like object with a list of items stored.

Check the docs here

ADD COMMENT • link 15.1 years ago by Paulo Nuin ★ 3.7k

0

Entering edit mode

Hi Paulo. Nice approach, however, this will have me create one big object containing my whole file. Given that I may want to treat many thousands of SNPs and that there is a memory overhead to using native Python objects like dicts and lists, I would really prefer the generator approach. Any suggestion? Cheers

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

Ram · Answer 3 · 2010-06-11

0

Entering edit mode

15.1 years ago

Eric Normandeau 11k

Hi Guys,

I was trying to avoid implementing a custom made generator to do the job I had in mind, but I still really want to have the generator behaviour and avoid building a big object (dictionary of list) containing all the data, sometimes stored in an oddly sorted way.

One of the main reasons why I insist on a generator is that this is going to be helpful in many situations and I want to be able to go through huge files in the appropriate manner.

So, here is my working attempt:

def custom_generator(in_file, header=True):
    with open(in_file) as f:
        if header == True: # Get rid of header if there is one
            junk = f.next()
            del(junk)
        last_id = ""
        for line in f:
            line_list = line.strip().split("\t")
            _id = ":".join(line_list[0:2]) # Contig_nb + position
            if _id == last_id:
                contig_data.append(line_list)
                last_id = _id
            elif last_id == "": # Change initial value for 'last_id'
                last_id = _id
                contig_data = [line_list]
            else:
                try:
                    yield(contig_data)
                except:
                    print "Could not return contig data"
                contig_data = [line_list]
                last_id = _id
        if line.strip() != "": # If last line misses a line return character
            try:
                yield(contig_data)
            except:
                print "Could not return last contig data"

contigs = custom_generator(in_file, header)
    while 1:
        try:
            print contigs.next()
        except:
            break

If you see a better approach, or changes that could ameliorate the generator, I a very keen on knowing!

Many thanks

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

@Eric, what's wrong with @Will's approach below?

ADD REPLY • link 15.1 years ago by brentp 24k

0

Entering edit mode

@brentp. I am not sure to understand his approach completely, plus there is a restriction that the contig names have to be sorted. It is also partial, as you noted yourself. I figured the custom approach I am doing is better if I can understand it and it has no sort limitation. If I get some spare time, I'll try to get into groupby for this problem (and others), but I didn't understand the given example enough to actually use it. If the approach is stated completely and does what I asked in the question, I will be more than pleased to accept it as the answer to my question. Cheers

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

@Eric, if i read your implementation above correctly, it makes the same assumption as yours. both can only group rows that are similar, and adjacent. that's 'sorted' in this context.

ADD REPLY • link 15.1 years ago by brentp 24k

0

Entering edit mode

@brentp, You're right. That is not how I had understood his 'sorted' requirement. However, I will still need the missing parts to be able to use it, as I am not familiar with the suggestions you made about what needed to be added. Cheers

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k