Question: Group The Data In Python
1
6.5 years ago by
User000390
User000390 wrote:

I have a huge input file like this,

``````con1    Traes4    99.26    135    1    0   139   543     1     135   5.0   279  1506   135
con13    Traes7    97.61    544    13    0   482   2113   200  544   0.0   101  2392  544
con13    Traes8    100.00    467    0    0   713   2113   1     467   0.0   901  2392   467
con15    EMT18    95.27    148    7    0   73     516    35    48    6.0   256   2560   148
con15    EMT29    89.86    148    15    0   73    516    100   148   3.0   243   2560   48
``````

I am trying to extract contigs that have full lengths and % identity > 30, but I also allow +-10 aa differences,

``````start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30
``````

However, I want the data be grouped by line[0]. For example, con13 has Traes8 that fits my request, and Traes7 that does not, but since I want all conX be grouped, it is enough that one of con13 meets my request, my output should print both of con13 in result1 and non of con13 should ever appear in result2 file. Note: I am also interested to print single con1 since it meets my requirement. Note 2: con13 might have >20 different Traes. Here is my script, It prints results but separately without grouping by con, how can this be solved?

``````from itertools import groupby

with open('file.txt') as f1:
with open('result1.txt', 'w') as f2:
with open('result2.txt','w') as f3:
for k, g in groupby(f1, key=lambda x:x.split()[0]):
hits = []
for line in g:
hsp = line.split()
hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
hits.append(hsp)
print line.rstrip()
#for i in range(1,len(hits)):
start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
f2.write(line.rstrip() + '\n')
else:
f3.write(line.rstrip() + '\n')
``````

This is the result I get,Result1.txt

`````` con1    Traes4    99.26    135    1    0   139   543     1     135   5.0   279  1506   135
con13    Traes8    100.00    467    0    0   713   2113   1     467   0.0   901  2392   467
``````

Result2.txt

``````con13    Traes7    97.61    544    13    0   482   2113   200  544   0.0   101  2392  544
con15    EMT18    95.27    148    7    0   73     516    35    48    6.0   256   2560   148
con15    EMT29    89.86    148    15    0   73    516    100   148   3.0   243   2560   48
``````

But my desired outputs should be like this, Result1.txt

``````con1    Traes4    99.26    135   1      0   139   543    1     135   5.0   279  1506  135
con13    Traes7    97.61    544   13      0   482   2113    200  544   0.0  1010  2392  544
con13    Traes8    100.00   467  0      0   713   2113    1     467   0.0   901   2392  467
``````

Result2.txt

``````con15    EMT18    95.27  148      7    0   73   516    35  48    6.0    256     2560  148
con15    EMT29    89.86  148     15    0   73   516   100 148   3.0   243     2560  148
``````
python bioinformatics • 5.1k views
modified 6.5 years ago by Devon Ryan97k • written 6.5 years ago by User000390
1

As currently presented, this is just a programming question. Can you edit it a bit to make it clear how this relates to bioinformatics? Also, are you committed to using python or are you also open to using R (R has a number of functions that make splitting data like this and looking at the groups simple)?

I am trying to extract contigs that have full lengths, but I also allow +-10 aa differences, I just did not go into details, since calculations seem to be OK, and the file prints me results, but I am facing problem of grouping contigs and print them by group, as I explained above..so may be I am missing some important line in the code? I am not good in R, as I program only in Python, so I prefer using python, many thanks

OK, btw what is the purpose of "Result2" and why are both of the results from con15 in there? From the python code it looks like it should contain the Traes7 and EMT18 lines, since those 2 don't meet the filtering metrics that you set.

1

this is the point, the script I provided above does print Traes7, EMT18 and EMT29 in result2.txt, since all of them does not meet my requirement, but that is not my purpose, as I tried to explain above, if one of my contigs meet the requirement all other hits under that particular contig should be in result1.txt (even if others do not meet my requirement) so I need to group on line[0]. Later on, I extract only those single reads that meet my requirement from result1.txt and work with them separately. I need result2.txt, the rest of contigs as I perform other overlapping non overlapping operations on them.

1

Ah, your update has clarified things!

I am glad it is clearer now, it is not so easy to explain in few words..Anyway, my gut feeling tells me that the problem is because the program is reading group by line, so printing lines that meet my condition obviously, so may be I need to add a dictionary or set before grouping, not sure how to proceed, hope you guys can help me

You can look into Python Pandas dataframe groupby functionalities:

1) documentation: pandas.DataFrame.groupby

2
6.5 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

The following seems to be more what you're going for:

``````#!/usr/bin/env python
from itertools import groupby, tee
import csv

of1 = csv.writer(open("results1.txt", "w"), dialect="excel-tab")
of2 = csv.writer(open("results2.txt", "w"), dialect="excel-tab")

for k,g in groupby(f, lambda x: x[0]) :
passed=0
g, g2 = tee(g)
for line in g :
if(float(line[2]) >= 30 and abs(int(line[13])-int(line[9])+int(line[8])+1) < 10) :
passed += 1
break

if(passed>0) :
for line in g2 :
of1.writerow(line)
else :
for line in g2 :
of2.writerow(line)
``````

This will write an entire group to `f1` if any of its members pass the filtering step. Otherwise, all of the members of the group will be written to `f2`. Note that in your original example, each of the 3 groups had at least one member passing the filter (EMT29 passes, so all of contig 15 would pass). Of importance is the usage of `tee()`. Once you start using an iterator (e.g., `for line in g`) you can't then rewind or reuse it. Ways around that would be to (1) store everything in an array yourself or (2) just use `tee` to make a copy of the iterator for you.

1

nice use of itertools! You can collapse the last bit depending on your aesthetic preference:

``````(of1 if passed else of2).writerows(g2)
``````

THANK YOU SO MUCH, it works perfectly! I just had a moment of flash in my head and modified my script like this, so it works as well ))

``````from itertools import groupby

with open('file.txt') as f1:
with open('result1.txt', 'w') as f2:
with open('result2.txt','w') as f3:
contig = set([])
for k, g in groupby(f1, key=lambda x:x.split()[0]):
hits = []
for line in g:
hsp = line.rsplit()
hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
hits.append(hsp)
start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
if k in contig:
outfile = f2
else:
outfile=f3
for i in hits:
outfile.write('\t'.join(map(str,i)))
outfile.write('\n')
``````

what do you think of the script from the memory usage point of view? I have > 15 mln reads, I guess using tee is more efficient in this case? p.s. I have never used tee

1

I imagine that the memory usage would be similar, since I expect `tee()` is making a similar sort of duplication in the background. One thing to change would be to not bother with a `contig` set, since you'll always just be interested in the last value anyway (the longer that gets, the slower things will get). It would be better to use a method like my `passed=0` prior to `for line in g` and then just set `passed=1` if it passes the filter (my `passed +=1` is a relic of having an earlier iteration print out a status message so I could debug a silly mistake).