Question: Group The Data In Python
1
gravatar for User000
4.6 years ago by
User000260
User000260 wrote:

I have a huge input file like this,

con1    Traes4    99.26    135    1    0   139   543     1     135   5.0   279  1506   135    
con13    Traes7    97.61    544    13    0   482   2113   200  544   0.0   101  2392  544     
con13    Traes8    100.00    467    0    0   713   2113   1     467   0.0   901  2392   467    
con15    EMT18    95.27    148    7    0   73     516    35    48    6.0   256   2560   148    
con15    EMT29    89.86    148    15    0   73    516    100   148   3.0   243   2560   48

I am trying to extract contigs that have full lengths and % identity > 30, but I also allow +-10 aa differences,

start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30

However, I want the data be grouped by line[0]. For example, con13 has Traes8 that fits my request, and Traes7 that does not, but since I want all conX be grouped, it is enough that one of con13 meets my request, my output should print both of con13 in result1 and non of con13 should ever appear in result2 file. Note: I am also interested to print single con1 since it meets my requirement. Note 2: con13 might have >20 different Traes. Here is my script, It prints results but separately without grouping by con, how can this be solved?

from itertools import groupby

with open('file.txt') as f1:
    with open('result1.txt', 'w') as f2:
        with open('result2.txt','w') as f3:
            for k, g in groupby(f1, key=lambda x:x.split()[0]): 
                hits = []
                for line in g:
                    hsp = line.split()
                    hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
                    hits.append(hsp)
                    print line.rstrip()
                    #for i in range(1,len(hits)):
                    start_to_end=(hsp[9]-hsp[8])+1
                    if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
                       f2.write(line.rstrip() + '\n') 
                    else:
                       f3.write(line.rstrip() + '\n')

This is the result I get,Result1.txt

 con1    Traes4    99.26    135    1    0   139   543     1     135   5.0   279  1506   135
 con13    Traes8    100.00    467    0    0   713   2113   1     467   0.0   901  2392   467

Result2.txt

con13    Traes7    97.61    544    13    0   482   2113   200  544   0.0   101  2392  544     
con15    EMT18    95.27    148    7    0   73     516    35    48    6.0   256   2560   148    
con15    EMT29    89.86    148    15    0   73    516    100   148   3.0   243   2560   48

But my desired outputs should be like this, Result1.txt

con1    Traes4    99.26    135   1      0   139   543    1     135   5.0   279  1506  135    
con13    Traes7    97.61    544   13      0   482   2113    200  544   0.0  1010  2392  544     
con13    Traes8    100.00   467  0      0   713   2113    1     467   0.0   901   2392  467

Result2.txt

con15    EMT18    95.27  148      7    0   73   516    35  48    6.0    256     2560  148    
con15    EMT29    89.86  148     15    0   73   516   100 148   3.0   243     2560  148
python bioinformatics • 3.7k views
ADD COMMENTlink modified 4.6 years ago by Devon Ryan86k • written 4.6 years ago by User000260
1

As currently presented, this is just a programming question. Can you edit it a bit to make it clear how this relates to bioinformatics? Also, are you committed to using python or are you also open to using R (R has a number of functions that make splitting data like this and looking at the groups simple)?

ADD REPLYlink written 4.6 years ago by Devon Ryan86k

I am trying to extract contigs that have full lengths, but I also allow +-10 aa differences, I just did not go into details, since calculations seem to be OK, and the file prints me results, but I am facing problem of grouping contigs and print them by group, as I explained above..so may be I am missing some important line in the code? I am not good in R, as I program only in Python, so I prefer using python, many thanks

ADD REPLYlink written 4.6 years ago by User000260

OK, btw what is the purpose of "Result2" and why are both of the results from con15 in there? From the python code it looks like it should contain the Traes7 and EMT18 lines, since those 2 don't meet the filtering metrics that you set.

ADD REPLYlink written 4.6 years ago by Devon Ryan86k
1

this is the point, the script I provided above does print Traes7, EMT18 and EMT29 in result2.txt, since all of them does not meet my requirement, but that is not my purpose, as I tried to explain above, if one of my contigs meet the requirement all other hits under that particular contig should be in result1.txt (even if others do not meet my requirement) so I need to group on line[0]. Later on, I extract only those single reads that meet my requirement from result1.txt and work with them separately. I need result2.txt, the rest of contigs as I perform other overlapping non overlapping operations on them.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by User000260
1

Ah, your update has clarified things!

ADD REPLYlink written 4.6 years ago by Devon Ryan86k

I am glad it is clearer now, it is not so easy to explain in few words..Anyway, my gut feeling tells me that the problem is because the program is reading group by line, so printing lines that meet my condition obviously, so may be I need to add a dictionary or set before grouping, not sure how to proceed, hope you guys can help me

ADD REPLYlink written 4.6 years ago by User000260

You can look into Python Pandas dataframe groupby functionalities:

1) documentation: pandas.DataFrame.groupby

2) grouping and aggregating in python

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Gjain5.2k
2
gravatar for Devon Ryan
4.6 years ago by
Devon Ryan86k
Freiburg, Germany
Devon Ryan86k wrote:

The following seems to be more what you're going for:

#!/usr/bin/env python
from itertools import groupby, tee
import csv

f = csv.reader(open("foo.txt", "r"), dialect="excel-tab")
of1 = csv.writer(open("results1.txt", "w"), dialect="excel-tab")
of2 = csv.writer(open("results2.txt", "w"), dialect="excel-tab")

for k,g in groupby(f, lambda x: x[0]) :
    passed=0
    g, g2 = tee(g)
    for line in g :
        if(float(line[2]) >= 30 and abs(int(line[13])-int(line[9])+int(line[8])+1) < 10) :
            passed += 1
            break

    if(passed>0) :
        for line in g2 :
            of1.writerow(line)
    else :
        for line in g2 :
            of2.writerow(line)

This will write an entire group to f1 if any of its members pass the filtering step. Otherwise, all of the members of the group will be written to f2. Note that in your original example, each of the 3 groups had at least one member passing the filter (EMT29 passes, so all of contig 15 would pass). Of importance is the usage of tee(). Once you start using an iterator (e.g., for line in g) you can't then rewind or reuse it. Ways around that would be to (1) store everything in an array yourself or (2) just use tee to make a copy of the iterator for you.

ADD COMMENTlink written 4.6 years ago by Devon Ryan86k
1

nice use of itertools! You can collapse the last bit depending on your aesthetic preference:

(of1 if passed else of2).writerows(g2)
ADD REPLYlink written 4.6 years ago by brentp22k

Hadn't thought about that, good tip!

ADD REPLYlink written 4.6 years ago by Devon Ryan86k

THANK YOU SO MUCH, it works perfectly! I just had a moment of flash in my head and modified my script like this, so it works as well ))

from itertools import groupby

with open('file.txt') as f1:
    with open('result1.txt', 'w') as f2:
        with open('result2.txt','w') as f3:
            contig = set([])
            for k, g in groupby(f1, key=lambda x:x.split()[0]): 
                hits = []
                for line in g:
                    hsp = line.rsplit()
                    hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
                    hits.append(hsp)
                    start_to_end=(hsp[9]-hsp[8])+1
                    if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
                        contig.add(hsp[0])
                if k in contig:
                    outfile = f2
                else:
                    outfile=f3
                for i in hits:
                    outfile.write('\t'.join(map(str,i)))
                    outfile.write('\n')
ADD REPLYlink written 4.6 years ago by User000260

what do you think of the script from the memory usage point of view? I have > 15 mln reads, I guess using tee is more efficient in this case? p.s. I have never used tee

ADD REPLYlink written 4.6 years ago by User000260
1

I imagine that the memory usage would be similar, since I expect tee() is making a similar sort of duplication in the background. One thing to change would be to not bother with a contig set, since you'll always just be interested in the last value anyway (the longer that gets, the slower things will get). It would be better to use a method like my passed=0 prior to for line in g and then just set passed=1 if it passes the filter (my passed +=1 is a relic of having an earlier iteration print out a status message so I could debug a silly mistake).

ADD REPLYlink written 4.6 years ago by Devon Ryan86k

I see, thank you for the detailed explanation and for the comments!

ADD REPLYlink written 4.6 years ago by User000260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1759 users visited in the last hour