I have a set of transposable element (TE) sequences and their genomic coordinates.
For each TE, I have concatenated its 5' and 3' flanking sequences, up to 50 nt on each side, into a mosaic of up to 100 nt in total. Let's call this the "pre-insertion" sequence.
Now, using each mosaic as a query, I need to report the coordinates and copy number of genomic regions that match at >= x% identity over at least y% of the query length (excluding the one self-match).
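To make the requirement concrete, here is a minimal sketch of the filter I have in mind. It assumes tabular output requested with a custom column list such as `-outfmt "6 qseqid sseqid pident length qstart qend sstart send sstrand qlen"`. The column order, the function names, and the toy hit lines are my own assumptions for illustration, not real results.

```python
import csv
import io

# Assumed column order; it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "sstrand", "qlen"]
INT_FIELDS = ("length", "qstart", "qend", "sstart", "send", "qlen")

def parse_hits(handle):
    """Yield one dict per HSP line of tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        yield hit

def passes_thresholds(hit, min_identity, min_query_cov):
    """True if the HSP has >= x% identity over >= y% of the query length."""
    query_cov = 100.0 * (hit["qend"] - hit["qstart"] + 1) / hit["qlen"]
    return hit["pident"] >= min_identity and query_cov >= min_query_cov

# Toy input: one full-length hit and one hit covering only the left flank.
example = (
    "mosaic1\tchr4\t98.000\t100\t1\t100\t5000\t5099\tplus\t100\n"
    "mosaic1\tchr1\t100.000\t48\t1\t48\t700\t747\tplus\t100\n"
)
kept = [h for h in parse_hits(io.StringIO(example))
        if passes_thresholds(h, min_identity=90, min_query_cov=80)]
```

With x = 90 and y = 80, only the full-length hit on chr4 is kept, because the second hit covers just 48% of the query.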
My questions are:
A. How best to run the BLAST search (outfmt 6, outfmt 11, or something else?)
B. Importantly, how best to parse the BLAST results to report the copy number and genomic coordinates for each mosaic query.
C. To reiterate, I do not want to report the gapped self-match (where the gap is the TE sequence). But I realize there could be other gapped matches, where the intervening sequence is not the original TE but some other sequence. In principle, by retaining the genomic coordinates of each query and comparing them to the coordinates of the BLAST hits, I could filter out the self-match cases, right?
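For point C, this is roughly the coordinate comparison I am imagining. It is only a sketch: it assumes 1-based inclusive coordinates, hits held as dicts with `sseqid`, `sstart`, and `send` keys, and a hypothetical helper name `is_self_match`.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def is_self_match(hit, te_chrom, te_start, te_end, flank=50):
    """True if the hit lands on the flanks the query was built from.

    Assumes 1-based inclusive coordinates for both the TE and the hit.
    """
    if hit["sseqid"] != te_chrom:
        return False
    # Minus-strand hits are reported with sstart > send, so sort them first.
    s_lo, s_hi = sorted((hit["sstart"], hit["send"]))
    left_flank = (te_start - flank, te_start - 1)
    right_flank = (te_end + 1, te_end + flank)
    return (overlaps(s_lo, s_hi, *left_flank)
            or overlaps(s_lo, s_hi, *right_flank))

# Toy case: a TE at chr2:1001-2000 and four candidate hits.
left_self = {"sseqid": "chr2", "sstart": 951, "send": 1000}
right_self_minus = {"sseqid": "chr2", "sstart": 2050, "send": 2001}
elsewhere = {"sseqid": "chr2", "sstart": 5000, "send": 5099}
other_chrom = {"sseqid": "chr3", "sstart": 951, "send": 1000}
flags = [is_self_match(h, "chr2", 1001, 2000)
         for h in (left_self, right_self_minus, elsewhere, other_chrom)]
```

One caveat I can see: any hit that overlaps the original flanks is discarded, even if it is only a partial overlap.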
Is my parsing requirement too specific, or will off-the-shelf tools such as blast_formatter or bp_search2gff.pl work for me? I've never had to parse BLASTn output before, so this is new to me. Looking forward to your advice and suggestions. Thanks all!
In the example visualization, you can see instances of pre-insertion sequences on Chr 1 and Chr 4; I want to report each of them with its start and stop coordinates.
On Chr 1, there is another match with a black box in between. BLASTn will report these matches, but they are separated by a non-matching, non-TE sequence. Therefore, I do not consider this a pre-insertion sequence and want to ignore such cases.
On Chr 2, on the minus strand, and on Chr 3, on the plus strand, there are matches to the flanking sequences separated by the original TE sequence. This could happen if, for example, the entire region was duplicated from Chr 2 to Chr 3. Here again, these do not represent pre-insertion sequences, and I want to ignore such cases.
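One way I could imagine separating the cases above is to keep only single HSPs that run across the junction between the two flanks in the query. Split matches, whether interrupted by the TE or by unrelated sequence, should show up as two separate HSPs that each stop at the junction. The sketch below assumes hits as dicts with the keys shown, a per-query left-flank length (since a flank can be shorter than 50 nt), and an arbitrary `min_overhang` of 10 nt; all names and the toy hits are hypothetical.

```python
from collections import defaultdict

def spans_junction(hit, left_len, min_overhang=10):
    """True if one HSP covers >= min_overhang query bases on each side
    of the junction between query positions left_len and left_len + 1."""
    return (hit["qstart"] <= left_len - min_overhang + 1
            and hit["qend"] >= left_len + min_overhang)

def summarize(hits, left_len_by_query, min_overhang=10):
    """Map each query to (chrom, start, end, strand) for junction-spanning hits."""
    sites = defaultdict(list)
    for hit in hits:
        left_len = left_len_by_query[hit["qseqid"]]
        if not spans_junction(hit, left_len, min_overhang):
            continue  # split match: each HSP covers only one flank
        strand = "+" if hit["sstart"] <= hit["send"] else "-"
        start, end = sorted((hit["sstart"], hit["send"]))
        sites[hit["qseqid"]].append((hit["sseqid"], start, end, strand))
    return dict(sites)

# Toy hits: one contiguous site on chr4 and one split match on chr1.
toy_hits = [
    {"qseqid": "mosaic1", "sseqid": "chr4", "qstart": 1, "qend": 100,
     "sstart": 5000, "send": 5099},
    {"qseqid": "mosaic1", "sseqid": "chr1", "qstart": 1, "qend": 50,
     "sstart": 700, "send": 749},
    {"qseqid": "mosaic1", "sseqid": "chr1", "qstart": 51, "qend": 100,
     "sstart": 900, "send": 949},
]
sites = summarize(toy_hits, {"mosaic1": 50})
copy_number = {query: len(found) for query, found in sites.items()}
```

Here the copy number for each query is just the number of junction-spanning sites that remain, after the self-match and the identity and coverage thresholds have been applied separately.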