File this under the "have you ever tried something like this?" question bucket... I'm not looking for a particular solution - though feel free :) - just insights from users who may have come across this sort of problem before, and I'm hoping to hear what tools or strategies they adopted to tackle it.
I've recently run a blastn search where, in some instances, query sequences span only portions of the expected reference region. Here's a toy example, spanning 60 bases of an arbitrary reference sequence. The asterisks (*) indicate the regions to which each query has aligned:
           1             15             30             45             60
           |             |              |              |              |
reference: ------------------------------------------------------------
query1:    ******************************
query2:       ***********************
query3:                                    *******************
query4:                                                   **********
This feels like a custom job, but the goal would be to parse the blast file in such a fashion that:
- queries that are contained within other queries are discarded (for example, query2 would be dropped because it is entirely contained within query1)
- queries with partial overlaps are collapsed into a single contiguous stretch (for example, query3 and query4 would collapse into a single stretch)
Visually, this would result in something like this:
           1             15             30             45             60
           |             |              |              |              |
reference: ------------------------------------------------------------
result1:   ******************************
result2:                                   *************************
I'd greatly appreciate input on strategies to programmatically evaluate whether a given query sequence is contained within another, partially overlaps another, or is entirely distinct from it. This seems like a trivial/classic programming problem, so any pointers on where to read up on that particular kind of challenge (and solution!) are much appreciated.
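For what it's worth, this looks like the classic "merge intervals" problem: sort the hits by start coordinate, then sweep once, extending the current stretch whenever the next hit starts inside it. Below is a minimal sketch in Python. The function names (`classify`, `merge_intervals`) and the coordinates are made up to mimic the toy example, and it assumes 1-based inclusive coordinates that all sit on the same reference sequence.

```python
from typing import List, Tuple

# (start, end) on the reference, 1-based inclusive. This is an assumption;
# adjust if your coordinates are 0-based or half-open.
Interval = Tuple[int, int]


def classify(a: Interval, b: Interval) -> str:
    """Describe how two intervals relate: 'disjoint', 'contained' or 'partial'."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start or b_end < a_start:
        return "disjoint"
    a_holds_b = a_start <= b_start and b_end <= a_end
    b_holds_a = b_start <= a_start and a_end <= b_end
    if a_holds_b or b_holds_a:
        return "contained"
    return "partial"


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Drop contained hits and collapse overlapping hits into single stretches."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):  # sort by start, then by end
        if merged and start <= merged[-1][1]:
            # The hit starts inside the current stretch: either it is fully
            # contained (end does not grow) or it extends the stretch.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # A gap before this hit: start a new stretch.
            merged.append((start, end))
    return merged


# Hypothetical coordinates for query1..query4 from the toy example.
hits = [(1, 30), (4, 26), (33, 51), (48, 57)]
print(merge_intervals(hits))  # [(1, 30), (33, 57)]
```

Note that hits that merely touch (one ends at 30, the next starts at 31) are kept separate here; change the test to `start <= merged[-1][1] + 1` if those should be joined as well. With real BLAST output you would run this once per reference sequence.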
Thank you for your response.
If there is a way to collect the coordinates for each match in BED format from the blastn result
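As a rough sketch of how those coordinates could be collected: assuming blastn was run with `-outfmt 6` and the default twelve columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), each hit can be turned into a BED record on the subject sequence. The function name `blast6_to_bed` is hypothetical, and the score column is filled with a placeholder 0.

```python
from typing import Iterable, Iterator, Tuple

BedRecord = Tuple[str, int, int, str, int, str]


def blast6_to_bed(lines: Iterable[str]) -> Iterator[BedRecord]:
    """Convert default 12-column BLAST tabular lines to BED6-style tuples."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # BLAST reports minus-strand hits with sstart > send.
        strand = "+" if sstart <= send else "-"
        # BLAST is 1-based inclusive; BED is 0-based, end-exclusive.
        start = min(sstart, send) - 1
        end = max(sstart, send)
        yield (sseqid, start, end, qseqid, 0, strand)


# Two made-up hits, the second one on the minus strand.
example = [
    "query3\tref\t100.0\t19\t0\t0\t1\t19\t33\t51\t1e-5\t35.6",
    "query4\tref\t100.0\t10\t0\t0\t1\t10\t57\t48\t1e-3\t19.3",
]
for record in blast6_to_bed(example):
    print("\t".join(str(value) for value in record))
```

Once the records are written to a file and sorted by sequence name and start position, something like `bedtools merge` should be able to collapse contained and overlapping intervals, though I'd check the result against a small example first.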