4.3 years ago
charlieyu.bt99
I want to align two protein sequences. For example,
sequence 1: AYGEC
sequence 2: A(GG)C
Note that sequence 2 is always shorter than sequence 1, so gaps inevitably appear in sequence 2 in the final alignment.
My goal is to align them 1) without end gaps and 2) without any gap inside the (GG) segment of sequence 2. So the possible final alignments would be
AYGEC
A-(GG)C
or
AYGEC
A(GG)-C
Can anyone suggest a command-line tool that can do this? I know many tools let you set a high end-gap open penalty to prevent end gaps, but I have not found any tool that prevents gaps between specified residues.
You can set the gap open and gap extend penalties high.
Your question doesn't fully match your example though as:
This isn't the challenge. You aren't trying to prevent gaps (other than between known patterns) but instead coerce the alignment to match some a priori idea of which bits should and shouldn't match. Indeed, your example differs only in where a gap is inserted.
This doesn't really sound like a good alignment approach to me, and would probably lead to you manually editing the alignments one way or another anyway.
I personally am not aware of such a tool. Depending on the actual objective, a regex approach to find matches to known subsequences might be more appropriate.
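To illustrate the regex idea, here is a minimal sketch using Python's standard `re` module. The helper name `find_motif`, the example sequence and the pattern `G.C` are all made up for illustration; substitute whatever subsequences you actually know about.

```python
import re

def find_motif(sequence, pattern):
    """Return (start, end, matched_text) for every non-overlapping match.

    Positions are 0-based and the end index is exclusive.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, sequence)]

# Hypothetical example: "G", then any single residue, then "C".
hits = find_motif("AYGECAGGC", r"G.C")
for start, end, text in hits:
    print(start, end, text)
```

This only locates known patterns; it does not score or align the residues around them.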
Sorry, I gave a bad example. I think my problem is a constrained global alignment. If I have a sequence 1 to be aligned to a reference protein sequence 2, I already know that certain residue segments in the reference sequence must not have gaps inserted into them. The remaining residues are free, and the final alignment depends on the dynamic programming and traceback matrices. So I don't think I can do the alignment with a simple regular expression.
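For what it's worth, the constraint described here can be sketched as a small Needleman-Wunsch variant in plain Python. This is only an illustration under stated assumptions, not an existing tool: the function name `constrained_align`, the linear gap penalty and the +1/-1/-2 scores are all made up; gaps are only ever placed in sequence 2 (because it is the shorter one); and protected segments are given as 0-based, end-exclusive index pairs into sequence 2.

```python
NEG_INF = float("-inf")

def constrained_align(seq1, seq2, protected, match=1, mismatch=-1, gap=-2):
    """Globally align seq1 to the shorter seq2, placing gaps only in seq2.

    Gaps are forbidden at both ends of seq2 and inside every protected
    segment (start, end) of seq2. Returns (aligned1, aligned2, score),
    or None if no alignment satisfies the constraints.
    """
    n, m = len(seq1), len(seq2)
    # A gap "at boundary j" sits after the first j residues of seq2.
    forbidden = {0, m}  # no end gaps
    for start, end in protected:
        forbidden.update(range(start + 1, end))  # boundaries inside a segment

    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            diag = score[i - 1][j - 1] + pair
            up = score[i - 1][j] + gap if j not in forbidden else NEG_INF
            if diag == NEG_INF and up == NEG_INF:
                continue  # cell unreachable under the constraints
            if diag >= up:  # ties are resolved in favour of pairing residues
                score[i][j], move[i][j] = diag, "diag"
            else:
                score[i][j], move[i][j] = up, "up"

    if score[n][m] == NEG_INF:
        return None

    # Trace back from the bottom-right corner to the origin.
    out1, out2 = [], []
    i, j = n, m
    while i > 0:
        if move[i][j] == "diag":
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        else:  # "up": a seq1 residue paired with a gap in seq2
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), score[n][m]

# The example from the question: GG occupies positions 1..2 of "AGGC".
result = constrained_align("AYGEC", "AGGC", protected=[(1, 3)])
print(result)
```

With these particular scores the two alignments from the question tie, and the tie-break above happens to return the first one (`A-GGC`). A real substitution matrix such as BLOSUM62 and affine gap penalties would be the obvious next steps, but they are left out to keep the sketch short.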
I think you need something like a glocal alignment. It's easy enough to remove gaps from the ends of alignments after the fact, but if you particularly care about preserving small motifs, you still need a local alignment based approach to some degree.
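Removing end gaps after the fact could look something like the sketch below. The helper `trim_end_gaps` is hypothetical and assumes a pairwise alignment given as two equal-length strings with `-` as the gap character; it drops leading and trailing columns in which either sequence has a gap.

```python
def trim_end_gaps(aligned1, aligned2, gap="-"):
    """Drop leading/trailing alignment columns where either sequence has a gap."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    start, end = 0, len(aligned1)
    # Advance past leading columns that contain a gap.
    while start < end and (aligned1[start] == gap or aligned2[start] == gap):
        start += 1
    # Step back over trailing columns that contain a gap.
    while end > start and (aligned1[end - 1] == gap or aligned2[end - 1] == gap):
        end -= 1
    return aligned1[start:end], aligned2[start:end]

# Made-up example alignment with gaps at both ends of the second sequence.
trimmed = trim_end_gaps("AYGEC", "--GE-")
print(trimmed)
```

Internal gaps are left untouched, so this does nothing to protect a motif in the middle of the alignment.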
If you are aligning to a reference, there really shouldn't be gaps appearing in the reference (don't do multiple sequence alignment in this case). You are assuming the reference is already correct, so you really just want to align the query sequences, no?
Yes, I actually did a multiple sequence alignment. I wanted to check whether the loops were aligned correctly. To do this, I treated the sequences with known corresponding secondary structures as reference sequences. I can fix a misaligned sequence by manual correction, but I had too many sequences to correct by hand, so I wanted to write a script to do the corrections automatically. So yes, as you said, I just want to "correct" the query sequences.
I think you need a semi-local or local multiple pairwise alignment to your reference. If you do multiple sequence alignment, no sequence is 'privileged' as the reference sequence, so they will all be subject to the addition of gaps etc.
I think I have bypassed this problem with some other tricks, but thank you for your help.