I am a newbie in bioinformatics and am currently learning about alignments, so apologies for my limited vocabulary. After BLASTing a sequence, this is my output from BLASTN:
... lots of other aligns ...
AACGTATACGGATCGACTGC
AACGTATACGGATCGAC
AACGTATACGGATCGACTGC
AACGTATACGGATCGAC
AACGTATATGGATCGACTGC
AACGTATACGGATCGACTGC
AACGTATACGGATCGACTGCA
AACGTATACGGATCGACTG
AACGTATACGGATCGACTGC
AACGTATATGGATCGACTGC
AACGTATACGGATCGCTGC
AACGTATACGGATCGACTGC
AACGTATACGGATCGCTGCAA
...
That list contains the alignment strings for the database sequences, extracted from the blastn results. As you can see, the most common pattern is 20 nucleotides long, but some sequences in the results have insertions or deletions, and some have both. I've grouped the alignments for a population analysis and, if I understood correctly, the Arlequin software requires all sequences in the alignment to have the same length, so I want to "fix" the alignments.
I can think of two possibilities:
- I missed a parameter in my NCBI blastn setup that restricts the results to alignments of a fixed length (20 in my case).
- This is a very common situation and there is software to fix it, which preserves accuracy by adding gaps up to the length of the longest sequence (I imagine there are cases where the gaps cannot be placed unambiguously, so that two different gapped sequences are possible).
Any suggestions?
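To make the second option concrete, here is a minimal Python sketch of my own (it is not how BLAST or any alignment program works internally). It tries every position for a single gap in a sequence that is one base shorter than the 20-nt pattern and counts the mismatches. The `hypothetical` sequence is made up for illustration and does not come from the results above.

```python
def place_single_gap(reference, shorter):
    """Try every position for one gap in `shorter`; return {position: mismatches}."""
    if len(reference) != len(shorter) + 1:
        raise ValueError("this sketch only handles exactly one missing base")
    scores = {}
    for pos in range(len(reference)):
        candidate = shorter[:pos] + "-" + shorter[pos:]
        # Count mismatching columns, ignoring the gap column itself.
        scores[pos] = sum(
            1 for r, c in zip(reference, candidate) if c != "-" and r != c
        )
    return scores


def best_positions(scores):
    """Return every gap position that ties for the fewest mismatches."""
    best = min(scores.values())
    return [pos for pos, score in scores.items() if score == best]


reference = "AACGTATACGGATCGACTGC"     # most common 20-nt pattern in the list
deletion = "AACGTATACGGATCGCTGC"       # 19-nt hit taken from the list
hypothetical = "AACGTATACGATCGACTGC"   # made up: one G of the "GG" removed

unique = place_single_gap(reference, deletion)
ambiguous = place_single_gap(reference, hypothetical)

print(best_positions(unique))       # best internal gap position(s)
print(unique[len(deletion)])        # mismatches when the gap is padded at the end
print(best_positions(ambiguous))    # positions that tie for the hypothetical case
```

For the 19-nt hit, a single internal position (index 15, 0-based) gives zero mismatches, while padding at the end costs four. For the hypothetical sequence, positions 9 and 10 tie, which is the ambiguous case I had in mind: either gapped sequence is equally valid.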
I have the impression that you may be doing something wrong here. Please explain the problem you are trying to solve; then we can maybe tell whether the question makes sense and whether you are using the right tools at all. Your 'blast output' looks strange, and Arlequin, as I understand it, is for population genetics, so where is the connection?
OK, that's not the raw blast output. Each sequence in that list is the alignment string for the database sequence (the HSP_HSEQ node in the XML output) from blastn, which I then extracted for grouping (according to information in other databases, linked by accession number) so that they can be entered into Arlequin. Let me know if you want more details.
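For reference, the extraction step looks roughly like the sketch below. It uses only the Python standard library and the element names of the BLAST XML format (`Hit_accession`, `Hsp_qseq`, `Hsp_hseq`). The sample XML, the accession numbers and the `POPULATION` mapping are invented for illustration; the real grouping comes from other databases. Note that `Hsp_qseq` and `Hsp_hseq` are the aligned strings of one pairwise alignment, so they carry `-` gap characters relative to each other, not relative to the other hits.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny made-up fragment in the shape of BLAST XML output (hypothetical data).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_accession>HYPO0001</Hit_accession>
          <Hit_hsps>
            <Hsp>
              <Hsp_qseq>AACGTATACGGATCGACTGC</Hsp_qseq>
              <Hsp_hseq>AACGTATACGGATCG-CTGC</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_accession>HYPO0002</Hit_accession>
          <Hit_hsps>
            <Hsp>
              <Hsp_qseq>AACGTATACGGATCGACTGC</Hsp_qseq>
              <Hsp_hseq>AACGTATATGGATCGACTGC</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

# Hypothetical accession -> population lookup standing in for the external databases.
POPULATION = {"HYPO0001": "popA", "HYPO0002": "popB"}


def extract_hsps(xml_text):
    """Return (accession, aligned query, aligned subject) for every HSP."""
    root = ET.fromstring(xml_text)
    records = []
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        for hsp in hit.iter("Hsp"):
            records.append(
                (accession, hsp.findtext("Hsp_qseq"), hsp.findtext("Hsp_hseq"))
            )
    return records


records = extract_hsps(SAMPLE_XML)

# Group the aligned subject strings by population.
groups = defaultdict(list)
for accession, _qseq, hseq in records:
    groups[POPULATION.get(accession, "unknown")].append(hseq)

for population, sequences in groups.items():
    print(population, sequences)
```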
Can you tell us exactly which analysis in Arlequin you want to run? I don't use Arlequin, so I cannot offer help, and I figure it will also be difficult for others without that information. Also, with respect to gapped alignments, would you rather avoid them completely? It also seems to me that the sequences you found are very similar; could you do a multiple sequence alignment and take the consensus sequence instead, or do you need all the exact hits?
I think Michael is right. Although I don't know Arlequin, I would guess that it needs a multiple alignment as input (with gap symbols representing indels). So I would try running a multiple alignment program such as MAFFT (http://www.ebi.ac.uk/Tools/msa/mafft/) with your sequences as input, and then use the resulting alignment as input for Arlequin.
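If you prefer to run it locally instead of through the web form, something like the following Python sketch could work. It assumes a `mafft` executable is installed and on your PATH (it is skipped otherwise), and the helper names (`to_fasta`, `parse_fasta`, `run_mafft`, `is_rectangular`) are my own, not part of MAFFT. I have not checked what input format Arlequin expects, so converting the aligned FASTA is left out.

```python
import os
import shutil
import subprocess
import tempfile


def to_fasta(sequences):
    """Write sequences as FASTA text, dropping any existing gap characters."""
    return "".join(
        f">seq{i}\n{seq.replace('-', '')}\n"
        for i, seq in enumerate(sequences, start=1)
    )


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def run_mafft(sequences):
    """Align with a local MAFFT; return None if `mafft` is not on PATH."""
    if shutil.which("mafft") is None:
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as handle:
        handle.write(to_fasta(sequences))
        path = handle.name
    try:
        # `mafft --auto input.fasta` writes the alignment to standard output.
        result = subprocess.run(
            ["mafft", "--auto", path], capture_output=True, text=True, check=True
        )
    finally:
        os.unlink(path)
    # Upper-case the result so it is comparable with the input strings.
    return {name: seq.upper() for name, seq in parse_fasta(result.stdout).items()}


def is_rectangular(aligned):
    """True if every aligned sequence has the same length."""
    return len({len(seq) for seq in aligned.values()}) == 1


if __name__ == "__main__":
    hits = [
        "AACGTATACGGATCGACTGC",
        "AACGTATACGGATCGAC",
        "AACGTATATGGATCGACTGC",
        "AACGTATACGGATCGCTGC",
        "AACGTATACGGATCGACTGCA",
    ]
    aligned = run_mafft(hits)
    if aligned is None:
        print("mafft not found; install it or use the web service")
    else:
        for name, seq in aligned.items():
            print(name, seq)
        print("equal lengths:", is_rectangular(aligned))
```

The equal-length check at the end is the property you said Arlequin needs; after a multiple alignment every row should have the same number of columns, with `-` marking the indels.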