I have a problem using blastn or megablast. The thing is as follows:
I have several sequences, and I want to align them. I want to retrieve those sequences that produce an alignment with more than 99% identity and more than 100 bp in length.
However, imagine we have the following two sequences:
>seq_a
AGCTGACTGACCAGTGACTGCATGACTGCATGGGCCCGAGCGCGCGCGTATTATGCTGCTAGATGCTGTAATGCTCTACTATTAGAGAGAGACTGTGATGATTTGACGTACGTCGTAGCGATCGATAGCATCGATCGAGCTATGCATCGATCGATCGATCGACTAGCATGCATGCTAGTACTGACGTACATGCGTACGTCGTCATGAGTGACGACACACTGATGCAGTCATGTGTTGTGACTGACTCTTTATACTCAAGCTACACATCTCATTTTACGACGTAGCTCAAGACTCTCAGACTGGACTGACGACATC
>seq_b
AGCTGACTGACCAGTGACTGCATGACTGCATGGGCCCGCGCGCGCGCGTATTATGCTGCTAGATGCTGTATTGGTGTGTGATTAGAGAGAGTAGTGATGATTTAGACGTACGTCGTAGCGATCGATAGCATCGATCGAGCTATGCATCGATCGATCGATCGACTAGCATGCATGCTAGTACTGACGTACATGCGTACGTCGTCATGAGTGACGACACACTGATGCAGTCATGTGTTGTGACTGACTCTTTATACTCAAGCTACACATCTCATATTACGACGTAGCTCAAGATAGTA
They can be aligned, and the result shows an alignment with 96% identity covering 282 nucleotides. However, there is a core of 186 nucleotides with 99% identity, and that core is the part I want to detect.
If I use blastn or megablast with default parameters (or with the many other settings I have tried), the report only gives the identity of the whole alignment, so I never learn whether my sequences contain a high-identity region within it.
Do you know which parameters I should use? Or which programs can do what I am describing?
Do you want to find conserved cores between any pair of sequences in your dataset, or a single core that is conserved across all sequences? Do your sequences align well over their full length, or do conserved regions alternate with divergent ones?
I want to find any pair of sequences that have, in general, a good alignment (over their full length or over a relatively large region) and that contain a highly conserved core (more than 100 bp at 99% identity).
The problem is that we may find two sequences that are very similar, with an aligned region of 1000 bp at 90% identity, and yet they contain no core of more than 100 bp at 99% identity. That's the point.
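To make the criterion concrete, here is a minimal sketch of how such a core could be checked once two sequences have been aligned. It assumes the alignment is available as two gapped strings of equal length (with `-` for gaps) from any pairwise aligner; the function name `find_conserved_core` and its parameters are made up for illustration. It slides a fixed window of 100 alignment columns and reports the best window if it reaches 99% identity, counting gap columns as mismatches.

```python
def find_conserved_core(aln_a, aln_b, window=100, min_identity=0.99):
    """Return (start_column, identity) of the best window, or None.

    aln_a and aln_b are the two rows of a pairwise alignment
    (equal length, '-' for gaps). Gap columns count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")

    # 1 for an identical, non-gap column, 0 otherwise
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aln_a.upper(), aln_b.upper())
    ]
    if len(matches) < window:
        return None

    # Sliding count of matches over windows of fixed width
    count = sum(matches[:window])
    best_start, best_count = 0, count
    for start in range(1, len(matches) - window + 1):
        count += matches[start + window - 1] - matches[start - 1]
        if count > best_count:
            best_start, best_count = start, count

    identity = best_count / window
    if identity >= min_identity:
        return best_start, identity
    return None
```

With this definition, a 1000 bp alignment at a uniform 90% identity returns `None`, while an alignment with a lower overall identity but one clean 100-column stretch returns that stretch.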
I am thinking of using BLAST with a word size (the length of the seed that initiates an alignment) of 100 bp. That way, I would obtain the pairs of very similar sequences that share a totally conserved core of 100 bp. It is not exactly the same, but it's something.
Thank you for your interest! If you have any ideas, please comment.
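The word-size idea can be tried outside BLAST as a rough sketch: collect every exact word of length k from one sequence and look for the same words in the other. The function name `shared_exact_words` is made up for illustration, and the sketch compares one strand only (no reverse complement), which is an assumption. Note that k = 100 demands 100% identity over the core. A window of 100 columns at 99% identity has at most one mismatch, so it always contains an exact run of at least 50 bp; seeding with k = 50 therefore should not miss such a pair, although the candidates would still need to be verified afterwards.

```python
def shared_exact_words(seq_a, seq_b, k=100):
    """Return the set of length-k words that occur in both sequences."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()

    # All words of length k in the first sequence
    words_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}

    # Keep the words of the second sequence that also occur in the first
    words_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return words_a & words_b
```

A pair of sequences is a candidate whenever the returned set is not empty; sequences shorter than k simply yield an empty set.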