collapsing coordinate information from a blast result
2.9 years ago

File this under the "have you ever tried something like this?" question bucket... I'm not looking for a particular solution - though feel free :) - rather, just looking for insights from users who maybe have come across this sort of problem before and hoping to hear what tools or strategies they adopted to tackle the problem.

I've recently run a blastn search where, in some instances, query sequences span only portions of the expected reference region. Here's a toy example, spanning 60 bases of an arbitrary reference sequence. The asterisks (*) mark the regions of the reference to which each query has aligned:

           1            15             30             45             60
           |             |              |              |              |
reference: ------------------------------------------------------------
query1:    ******************************
query2:      ***********************
query3:                                             *******************
query4:                                       **********

This feels like a custom job, but the goal would be to parse the blast file in such a fashion that:

  • queries that are contained within other queries are discarded (for example, query2 would be dropped because it is entirely contained within query1)
  • queries with partial overlaps are collapsed into a single contiguous stretch (for example, query3 and query4 would collapse into a single stretch)

Visually, this would result in something like this:

           1            15             30             45             60
           |             |              |              |              |
reference: ------------------------------------------------------------
result1:   ******************************
result2:                                      *************************

I'd greatly appreciate input on strategies to programmatically evaluate whether a given query sequence is contained within another, partially overlaps another, or is entirely distinct from another. This seems like a trivial/classic programming problem, so any pointers on where to read up on that particular kind of challenge (and solution!) are much appreciated.
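For reference, the abstract version of this is usually called interval merging. Below is a minimal Python sketch, not a finished solution: the function names `relation` and `merge_intervals` are made up for illustration, the intervals are assumed to be 1-based inclusive (start, end) pairs on a single reference, and the toy coordinates are only approximated from the diagram above. It also ignores strand and multiple HSPs per query.

```python
def relation(a, b):
    """Classify two closed intervals given as (start, end), with start <= end."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start or b_end < a_start:
        return "disjoint"
    if a_start <= b_start and b_end <= a_end:
        return "a_contains_b"
    if b_start <= a_start and a_end <= b_end:
        return "b_contains_a"
    return "partial_overlap"


def merge_intervals(intervals):
    """Collapse overlapping or contained intervals into contiguous stretches."""
    merged = []
    # Sorting by start means each interval only needs comparing to the last stretch.
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or sits inside) the current stretch: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(stretch) for stretch in merged]


# Toy coordinates, approximated from the diagram (assumption).
hits = {
    "query1": (1, 30),
    "query2": (3, 25),
    "query3": (42, 60),
    "query4": (36, 45),
}
print(relation(hits["query1"], hits["query2"]))
print(merge_intervals(hits.values()))
```

Note that merging absorbs contained intervals automatically, so query2 does not need to be dropped in a separate step unless you want to keep track of which queries were discarded.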

Thank you for your response

If there is a way to collect the coordinates for each match in BED format from the blastn result:

  1. We can collapse the BED file to non-overlapping coordinates (using an appropriate tool).
  2. We can copy the same BED file twice and intersect the copies to get the non-overlapping segments (with appropriate tools).
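As a rough illustration of the first step (getting BED-like records out of the blastn result), here is a small Python sketch. It assumes the search was run with the default tabular output (`-outfmt 6`), where columns 9 and 10 are `sstart` and `send` and column 12 is the bit score; the function name `blast6_to_bed` is hypothetical. Putting the bit score in the BED score column is a loose use of that field, since strict BED expects an integer from 0 to 1000.

```python
def blast6_to_bed(lines):
    """Turn default BLAST tabular (-outfmt 6) lines into BED-like tuples.

    Assumed column order: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    bed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        # For minus-strand hits, blastn reports sstart > send.
        strand = "+" if sstart <= send else "-"
        low, high = min(sstart, send), max(sstart, send)
        # BLAST is 1-based inclusive; BED is 0-based half-open.
        bed.append((sseqid, low - 1, high, qseqid, bitscore, strand))
    return bed


example = [
    "query1\tref\t100.0\t30\t0\t0\t1\t30\t1\t30\t1e-10\t55.4",
    "query4\tref\t100.0\t10\t0\t0\t1\t10\t45\t36\t1e-3\t19.0",
]
for record in blast6_to_bed(example):
    print("\t".join(str(value) for value in record))
```

Once sorted by reference name and start, a file like this could be passed to something like `bedtools merge`, which is one tool that collapses overlapping features.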
2.9 years ago
Mensur Dlakic ★ 27k

I think this program does almost exactly what you want.

https://sourceforge.net/projects/weinberg-overcluster2/

2.9 years ago
Mensur Dlakic ★ 27k

I don't think it will be a trivial problem, because many queries will have more than one high-scoring segment pair (HSP), unlike the scenario you presented. In such cases one of the HSPs may contradict other HSPs from the same query.

I would look into how protein domains are delineated from BLAST hits. This paper may be of some help:

https://academic.oup.com/bioinformatics/article/16/5/451/192400


Thank you for the first response, Mensur. I agree, this is far from a trivial task in the context of blast specifically, but parsing distinct, partially overlapping, or contained coordinates is likely less of a challenge in the abstract.

I plan on using the bit scores to rank the preferred queries, but did not mention it initially because my main question was how others may have approached the blast problem previously, or coordinate-based sorting more generally.

Kindly note that my question concerns nucleotides, but perhaps there is some wisdom to be discovered in the link you shared investigating protein blast.

Thanks - and looking forward to any other feedback


Wouldn't it be possible to select one HSP for each query (on some arbitrary basis)? The rest of what OP is asking for might be doable thereafter.
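One possible basis would be the bit score OP mentioned. A minimal Python sketch of that idea, assuming each HSP has already been parsed into a dict with `qseqid` and `bitscore` keys (both the dict layout and the name `best_hsp_per_query` are hypothetical):

```python
def best_hsp_per_query(hsps):
    """Keep only the highest bit-score HSP for each query ID.

    On ties, the HSP seen first is kept (an arbitrary choice).
    """
    best = {}
    for hsp in hsps:
        query = hsp["qseqid"]
        if query not in best or hsp["bitscore"] > best[query]["bitscore"]:
            best[query] = hsp
    return best


# Made-up HSPs for illustration only.
hsps = [
    {"qseqid": "query1", "sstart": 1, "send": 30, "bitscore": 55.4},
    {"qseqid": "query1", "sstart": 50, "send": 58, "bitscore": 17.1},
    {"qseqid": "query2", "sstart": 3, "send": 25, "bitscore": 42.8},
]
for query, hsp in best_hsp_per_query(hsps).items():
    print(query, hsp["sstart"], hsp["send"], hsp["bitscore"])
```

This throws away the lower-scoring HSPs entirely, which may or may not be acceptable depending on what the coordinates are used for.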

Might be good if OP could provide more details about what they're trying to do.
