I have a BLAST tabular output with millions of hits. The query is my sequence and the subject is a protein hit. I am interested in finding the subjects corresponding to the same query that do not overlap. If I know the subject start and end sites (S1, E1 for the first hit and S2, E2 for the second), it becomes possible to do:

if S1 < E2 < S2 and E1 < S2 < E2 OR S2 - E1 > 0

Basically, since there are many hits and the number of subjects varies, I may understand the algorithm but find it difficult to implement in code. For example, my input file:
query	subject	start	end
cont20	EMT34567	2	115
cont20	EMT28057	238	345
cont31	EMT45002	112	980
cont31	EMT45002	333	567
Desired output (I want the program to print only the query and subject names that do not overlap)
cont20	EMT28057
cont20	EMT34567
I have started the script using regex, but I am not sure how to continue or whether this is the right way:
import re

output = open('result.txt', 'w')
f = open('file.txt', 'r')
lines = f.readlines()
for line in lines:
    # Split each tab-delimited line into its four columns
    new_list = re.split(r'\t+', line.strip())
    query = new_list[0]
    subject = new_list[1]
    s_start = new_list[2]
    s_end = new_list[3]
So what you want is: for every query (cont...), get the non-overlapping subjects (EMT...)?
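If that reading is right, here is a minimal sketch of one way to do it. It rests on a few assumptions that are not stated in the question: a hit counts as "non-overlapping" only if its interval overlaps no other hit of the same query, coordinates are inclusive (so 1-10 and 10-20 overlap), and start/end are swapped when start > end. The function names `non_overlapping_hits` and `read_hits` are made up for this example. Sorting each query's hits by start means every hit only has to be compared with the furthest end seen so far and with the next start, which should stay manageable for millions of lines.

```python
from collections import defaultdict


def non_overlapping_hits(rows):
    """Return sorted (query, subject) pairs whose interval overlaps no other
    hit of the same query. rows yields (query, subject, start, end)."""
    by_query = defaultdict(list)
    for query, subject, start, end in rows:
        # Assumption: normalise so that lo <= hi even if start > end.
        lo, hi = sorted((int(start), int(end)))
        by_query[query].append((lo, hi, subject))

    result = []
    for query, hits in by_query.items():
        hits.sort()  # order by start, then end
        max_end_before = None  # furthest end among earlier hits
        for i, (lo, hi, subject) in enumerate(hits):
            # No earlier hit reaches this one ...
            clear_left = max_end_before is None or lo > max_end_before
            # ... and the next hit (smallest later start) begins after it ends.
            clear_right = i + 1 == len(hits) or hits[i + 1][0] > hi
            if clear_left and clear_right:
                result.append((query, subject))
            if max_end_before is None or hi > max_end_before:
                max_end_before = hi
    return sorted(result)


def read_hits(path):
    """Yield the first four whitespace-separated columns of each data line."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines and the 'query subject start end' header.
            if len(fields) < 4 or fields[0] == "query":
                continue
            yield fields[:4]


# Demo with the rows from the question (in memory instead of file.txt).
sample = [
    ("cont20", "EMT34567", "2", "115"),
    ("cont20", "EMT28057", "238", "345"),
    ("cont31", "EMT45002", "112", "980"),
    ("cont31", "EMT45002", "333", "567"),
]
for query, subject in non_overlapping_hits(sample):
    print(query, subject, sep="\t")
# prints:
# cont20	EMT28057
# cont20	EMT34567
```

To run it on the real file, pass `read_hits('file.txt')` to `non_overlapping_hits` and write the pairs to `result.txt` instead of printing them. If you would rather keep one representative from each group of overlapping hits instead of dropping them all, the comparison logic would need to change.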