I have a huge input file like this,
con1 Traes4 99.26 135 1 0 139 543 1 135 5.0 279 1506 135
con13 Traes7 97.61 544 13 0 482 2113 200 544 0.0 101 2392 544
con13 Traes8 100.00 467 0 0 713 2113 1 467 0.0 901 2392 467
con15 EMT18 95.27 148 7 0 73 516 35 48 6.0 256 2560 148
con15 EMT29 89.86 148 15 0 73 516 100 148 3.0 243 2560 48
I am trying to extract contigs that have full lengths and % identity > 30, but I also allow +-10 aa differences,
start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30
However, I want the data be grouped by line[0]. For example, con13 has Traes8 that fits my request, and Traes7 that does not, but since I want all conX be grouped, it is enough that one of con13 meets my request, my output should print both of con13 in result1 and non of con13 should ever appear in result2 file. Note: I am also interested to print single con1 since it meets my requirement. Note 2: con13 might have >20 different Traes.
Here is my script, It prints results but separately without grouping by con, how can this be solved?
from itertools import groupby
with open('file.txt') as f1:
with open('result1.txt', 'w') as f2:
with open('result2.txt','w') as f3:
for k, g in groupby(f1, key=lambda x:x.split()[0]):
hits = []
for line in g:
hsp = line.split()
hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
hits.append(hsp)
print line.rstrip()
#for I in range(1,len(hits)):
start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
f2.write(line.rstrip() + '\n')
else:
f3.write(line.rstrip() + '\n')
This is the result I get,Result1.txt
con1 Traes4 99.26 135 1 0 139 543 1 135 5.0 279 1506 135
con13 Traes8 100.00 467 0 0 713 2113 1 467 0.0 901 2392 467
Result2.txt
con13 Traes7 97.61 544 13 0 482 2113 200 544 0.0 101 2392 544
con15 EMT18 95.27 148 7 0 73 516 35 48 6.0 256 2560 148
con15 EMT29 89.86 148 15 0 73 516 100 148 3.0 243 2560 48
But my desired outputs should be like this, Result1.txt
con1 Traes4 99.26 135 1 0 139 543 1 135 5.0 279 1506 135
con13 Traes7 97.61 544 13 0 482 2113 200 544 0.0 1010 2392 544
con13 Traes8 100.00 467 0 0 713 2113 1 467 0.0 901 2392 467
Result2.txt
con15 EMT18 95.27 148 7 0 73 516 35 48 6.0 256 2560 148
con15 EMT29 89.86 148 15 0 73 516 100 148 3.0 243 2560 148
As currently presented, this is just a programming question. Can you edit it a bit to make it clear how this relates to bioinformatics? Also, are you committed to using python or are you also open to using R (R has a number of functions that make splitting data like this and looking at the groups simple)?
I am trying to extract contigs that have full lengths, but I also allow +-10 aa differences, I just did not go into details, since calculations seem to be OK, and the file prints me results, but I am facing problem of grouping contigs and print them by group, as I explained above..so may be I am missing some important line in the code? I am not good in R, as I program only in Python, so I prefer using python, many thanks
OK, btw what is the purpose of "Result2" and why are both of the results from con15 in there? From the python code it looks like it should contain the Traes7 and EMT18 lines, since those 2 don't meet the filtering metrics that you set.
this is the point, the script I provided above does print Traes7, EMT18 and EMT29 in result2.txt, since all of them does not meet my requirement, but that is not my purpose, as I tried to explain above, if one of my contigs meet the requirement all other hits under that particular contig should be in result1.txt (even if others do not meet my requirement) so I need to group on line[0]. Later on, I extract only those single reads that meet my requirement from result1.txt and work with them separately. I need result2.txt, the rest of contigs as I perform other overlapping non overlapping operations on them.
Ah, your update has clarified things!
I am glad it is clearer now, it is not so easy to explain in few words..Anyway, my gut feeling tells me that the problem is because the program is reading group by line, so printing lines that meet my condition obviously, so may be I need to add a dictionary or set before grouping, not sure how to proceed, hope you guys can help me
You can look into Python Pandas dataframe groupby functionalities:
1) documentation: pandas.DataFrame.groupby
2) grouping and aggregating in python