Blast Result: Finding Non Overlapping Hits
2
2
Entering edit mode
10.1 years ago
User000 ▴ 690

I have a BLAST tabular output with millions of hits.Query is my sequence and subject is a protein hit. I am interested in finding the subjects corresponding to the same query that do not overlap. If I know the subject start and end sites it becomes possible to do; if S1 < E2 < S2 and E1 < S2 < E2 OR S2 - E1 > 0 Basically, since there are many hits and number of subjects vary, I may understand the algorithm, but find it difficult to implement in code. For example,my input file

query  subject  start end
cont20 EMT34567 2   115 
cont20 EMT28057 238 345
cont31 EMT45002 112 980
cont31 EMT45002 333 567

Desired output (I want the program to print only the query and subject names that do not overlap)

cont20 EMT28057 
cont20 EMT34567

I have started the script using regex, but I am not sure how to continue or if this is a right way

import re
output=open('result.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    query=new_list[0]
    subject=new_list[1]
    s_start=new_list[8]
    s_end=new_list[9]
python • 6.0k views
ADD COMMENT
0
Entering edit mode

So what you want is: for every query (cont...) get non overlapping subjects (EMT...)?

ADD REPLY
0
Entering edit mode

yes,exactly....

ADD REPLY
2
Entering edit mode
10.1 years ago

UPDATED:

Input: file.txt

query    subject    start    end    
contig1    EMT16196    481    931
contig15    EMT15298    1    148
contig18    EMT04099    1    290
contig18    EMT20601    1    290
contig18    EMT23062    1    290
contig20    EMT14935    298    524
contig20    EMT19916    415    434
contig20    EMT19915    422    441
contig20    EMT19914    298    317
contig29    EMT30092    1    20
contig30    EMT31940    61    795
contig35    EMT03428    181    785
contig37    EMT02979    364    1184
contig42    EMT19888    449    657
contig42    EMT19888    339    472
contig43    EMT19888    339    657
contig45    EMT27750    329    363
contig45    EMT17965    889    908
contig51    EMT32871    324    390
contig51    EMT32871    203    241
contig52    EMT15568    107    127
contig56    EMT28040    811    939
contig67    EMT32527    132    489
contig69    EMT12559    38    226
contig79    EMT05411    85    919
contig95    EMT26862    138    327
contig95    EMT10613    20    164
contig107    EMT33347    1    243
contig107    EMT33347    255    387
contig107    EMT14531    135    385
contig108    EMT33347    1    423
contig108    EMT14531    135    565
contig109    EMT07436    60    88
contig149    EMT17561    119    219
contig159    EMT28057    39    307
contig176    EMT23021    359    379

Python:

from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = list(hits)
    overst =  False
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        # Check whether hits overlap.
        if c[2]<=p[3]:
            if not overst: nonover.remove(p)
            nonover.remove(c)
            overst = True
        else:
            overst = False     
    return nonover

fh = open('file.txt')
oh = open('results.txt', 'w')
fh.next() # Ignore header line in BLAST output.

# Loop over BLAST hits (grp) for each query (qid).
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    # I need to convert start and end positions
    # from strings into integers.
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    # Sort hits by start position.
    hits.sort(key=lambda x: x[2])
    for hit in nonoverlapping(hits):
         oh.write('\t'.join([str(f) for f in hit])+'\n')

Results: results.txt

contig1    EMT16196    481    931
contig15    EMT15298    1    148
contig29    EMT30092    1    20
contig30    EMT31940    61    795
contig35    EMT03428    181    785
contig37    EMT02979    364    1184
contig43    EMT19888    339    657
contig45    EMT27750    329    363
contig45    EMT17965    889    908
contig51    EMT32871    203    241
contig51    EMT32871    324    390
contig52    EMT15568    107    127
contig56    EMT28040    811    939
contig67    EMT32527    132    489
contig69    EMT12559    38    226
contig79    EMT05411    85    919
contig109    EMT07436    60    88
contig149    EMT17561    119    219
contig159    EMT28057    39    307
contig176    EMT23021    359    379
ADD COMMENT
0
Entering edit mode

thank you a lot, it gives me an error, would you mind to briefly comment on the script so that I can understand it and fix the error..

ADD REPLY
0
Entering edit mode

What kind of error do you get?

ADD REPLY
0
Entering edit mode

ValueError: list.remove(x): x not in list

it works on the small example file I provided, though, doesnt work on my huge file..

ADD REPLY
1
Entering edit mode
10.1 years ago
Michael 54k

For the purpose of interval overlapping large numbers of intervals you should use an efficient algorithm based on interval trees. In R these would be in IRanges, in python these are in the bx-python package. See this blog post and this Fast Interval Intersection Methodologies, and What is the quickest algorithm for range overlap?. Just group your data by the queryid (first column, then by second column hit accesion), those hit accessions that have only a single occurrence you can output immediately.

For the rest you can build an interval tree of all ranges of the same query/hit. Use possibly a hashtable like data structure (dict?) to group by column 2 and add the tree. Then overlap the tree with itself and filter for intervals with more than 1 overlap. If the number of hits per hit accession is very small, it might be more efficient to use the naive approach (sort an compare) because of the upstart time to build the overlap tree, but the practical implication of this is possibly negligible.

ADD COMMENT

Login before adding your answer.

Traffic: 2539 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6