Off topic: BLAST run parameters and parsing advice
2.3 years ago
Anand Rao ▴ 360

I have a set of transposable element (TE) sequences and their genomic coordinates.

For each TE, I have concatenated its 5' and 3' flanking sequences (up to 50 nt from each side) into a mosaic of up to 100 nt total. Let's call this the "pre-insertion" sequence.
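For concreteness, here is a minimal sketch of how such a mosaic could be built. The function name `build_mosaic` and the coordinate convention (0-based, half-open, plus strand) are assumptions for illustration only; the junction offset is returned because it is useful later when interpreting query coordinates.

```python
def build_mosaic(chrom_seq, te_start, te_end, flank=50):
    """Concatenate the 5' and 3' flanks of one TE into a mosaic.

    Assumption: te_start/te_end are 0-based, half-open [te_start, te_end)
    coordinates on the plus strand of chrom_seq.
    Returns (mosaic, junction), where junction is the length of the 5' flank,
    i.e. the offset in the mosaic at which the TE was cut out.
    """
    left_start = max(0, te_start - flank)      # clip at the chromosome start
    left = chrom_seq[left_start:te_start]      # up to `flank` nt upstream
    right = chrom_seq[te_end:te_end + flank]   # slicing clips at the chromosome end
    return left + right, len(left)


# Toy example: 60 nt upstream, a 4 nt "TE", 20 nt downstream.
chrom = "A" * 60 + "TTTT" + "C" * 20
mosaic, junction = build_mosaic(chrom, 60, 64)
```

Flanks near a chromosome end come out shorter than 50 nt, which is why the mosaic is "up to" 100 nt.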

Now, using each mosaic as a query, I need to report the coordinates and copy number of genomic regions that match at >= x% identity over at least y% of the query length (excluding the one self-match).
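The kind of filter meant here could be sketched as follows, assuming tabular BLASTn output (outfmt 6) with a custom column list that includes `qlen`. The file names in the comment, the function name `filter_hits`, and the default thresholds are placeholders, not recommendations.

```python
import csv
from collections import defaultdict

# Assumed column order, matching a run such as (file names are placeholders):
#   blastn -task blastn -query mosaics.fa -db genome \
#     -outfmt "6 qseqid sseqid pident length qstart qend sstart send sstrand qlen"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "sstrand", "qlen"]


def filter_hits(lines, min_identity=90.0, min_query_cov=80.0):
    """Keep HSPs with >= min_identity % identity over >= min_query_cov % of the query.

    Returns {query_id: [(subject_id, start, end, strand), ...]} with
    start <= end in 1-based inclusive subject coordinates.
    """
    hits = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        qstart, qend, qlen = int(rec["qstart"]), int(rec["qend"]), int(rec["qlen"])
        coverage = 100.0 * (qend - qstart + 1) / qlen  # per-HSP query coverage
        if float(rec["pident"]) >= min_identity and coverage >= min_query_cov:
            s1, s2 = int(rec["sstart"]), int(rec["send"])
            # minus-strand hits are reported with sstart > send; normalise them
            hits[rec["qseqid"]].append(
                (rec["sseqid"], min(s1, s2), max(s1, s2), rec["sstrand"]))
    return hits


example = [
    "TE1\tchr1\t98.0\t100\t1\t100\t500\t599\tplus\t100",
    "TE1\tchr2\t99.0\t50\t1\t50\t900\t851\tminus\t100",   # only half the query
    "TE1\tchr4\t85.0\t100\t1\t100\t10\t109\tplus\t100",   # identity too low
]
kept = filter_hits(example)
copy_number = {query: len(regions) for query, regions in kept.items()}
```

Because coverage is computed per HSP, a match that BLAST splits into two half-length HSPs (flanks separated by intervening sequence) would not pass a coverage threshold well above 50%, provided the two flanks are of similar length.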

My questions are:

A. How best to run the BLAST search, and which output format to use (outfmt 6, 11, or something else)?

B. Importantly, how best to parse the BLAST results to report copy number and genomic coordinates for each mosaic query?

C. To reiterate, I do not want to report the gapped self-match (where the gap is the TE sequence). But I realize there could be other gapped matches, where the intervening sequence is not the original TE but some other sequence. In principle, by retaining each query's source coordinates in some manner and comparing them to the BLAST results, I could filter out the self-match cases, right?
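A rough sketch of that coordinate comparison, assuming each hit is a `(chrom, start, end)` tuple in 1-based inclusive coordinates and the source locus (5' flank start through 3' flank end) is stored per query. The names `is_self_match` and `drop_self_matches` are hypothetical.

```python
def is_self_match(hit, source):
    """True if the hit overlaps the locus the mosaic was cut from.

    hit:    (chrom, start, end, ...) with start <= end, 1-based inclusive
    source: (chrom, start, end) spanning 5' flank start to 3' flank end
    """
    chrom, start, end = hit[:3]
    src_chrom, src_start, src_end = source
    # standard closed-interval overlap test on the same chromosome
    return chrom == src_chrom and start <= src_end and end >= src_start


def drop_self_matches(hits, source):
    """Return the hits that do not overlap the source locus."""
    return [h for h in hits if not is_self_match(h, source)]


# Toy example: flanks at chr1:1000-1049 and chr1:1551-1600 around a TE.
source = ("chr1", 1000, 1600)
hits = [
    ("chr1", 1000, 1049),  # 5' flank matching itself
    ("chr1", 1551, 1600),  # 3' flank matching itself
    ("chr1", 5000, 5099),  # elsewhere on the same chromosome
    ("chr4", 1000, 1099),  # same coordinates, different chromosome
]
others = drop_self_matches(hits, source)
```

This only removes hits at the source locus; other gapped matches would still need to be excluded separately, for example by a per-HSP query-coverage threshold.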

Is my parsing requirement too specific, or will off-the-shelf tools like blast_formatter or bp_search2gff.pl work for me? I've never had to parse BLASTn output before, so this is new to me. Looking forward to your advice and suggestions. Thanks all!

In the example visualization, chr 1 and chr 4 each contain an instance of a pre-insertion sequence. I want to report each of them with its start and stop coordinates.

On chr 1, there is another match with a black box in between. BLASTn will report these matches, but they are separated by a non-matching, non-TE sequence. I therefore do not consider this a pre-insertion sequence and want to ignore such cases.

On chr 2 (minus strand) and chr 3 (plus strand) there are matches to the flanking sequences, separated by the original TE sequence. This could happen if, for example, the entire region were duplicated from chr 2 to chr 3. Here again, these do not represent pre-insertion sequences, and I want to ignore such cases.

[Figure: Insertional-Mosaic-BLASTn-Vs-Genome]

BLAST BLASTn BLASTparser • 850 views