I'm quite new to bioinformatics, but I will try to describe my problem accurately. I have protein sequences and the corresponding nucleotide sequences. I use multiple sequence alignment to align my original sequence with two other similar proteins (which differ slightly in solubility) to find conserved domains. But when I align the protein sequences and then the nucleotide sequences, I obtain different results. That seems natural, since the alphabets differ (20 letters for proteins, 4 for nucleotides). So my question is: what should I use for the alignment, protein sequences or nucleotide sequences?
As a rule of thumb:
- if the species are phylogenetically distant, it is better to use the protein sequence, which is more conserved (think of the third codon position, which can often be mutated without any consequence for the protein sequence)
- if the species are close, or if you are comparing individuals of the same species, it is better to use the DNA sequence. This is because the protein sequences will be too similar, and you will see too few differences to work with.
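To make the third-position point concrete, here is a toy sketch in Python. The two coding sequences and the partial codon table are invented for illustration (only the codons needed for the example are listed), so treat it as a demonstration of the idea rather than a general translation routine.

```python
# Partial standard genetic code: only the codons used in this toy example.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(cds: str) -> str:
    """Translate an in-frame coding sequence using the partial table above."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))

def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical sequences that differ only at third codon positions.
seq_a = "ATGCTTGCTGGA"
seq_b = "ATGCTCGCAGGT"

dna_diffs = count_mismatches(seq_a, seq_b)
protein_diffs = count_mismatches(translate(seq_a), translate(seq_b))
print(dna_diffs, protein_diffs)
```

Here the DNA sequences differ while the translated proteins are identical, which is why DNA carries more signal between close relatives and protein is the more stable target between distant ones.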
You should ALWAYS use protein sequences for a multiple sequence alignment when you have both and you are aligning a coding sequence. A DNA multiple alignment may be more useful for building evolutionary trees over shorter distances (< 100 million years), but the actual DNA alignment should be driven by the protein alignment. If the proteins are closely related, there will be few (if any) gaps, and your alignment will be very accurate and robust. But a DNA aligner does not know about codons, so it may place gaps where they break the reading frame. If the DNA and protein alignments differ, the protein alignment will almost certainly be more accurate, so use proteins.
Once you have a multiple protein sequence alignment, you can use it as a template to build the corresponding DNA sequence alignment. This ensures that every gap in the protein alignment becomes a 3-nucleotide (codon-sized) gap in the DNA alignment.
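The templating step can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated tool: the function name `thread_codons` is hypothetical, and it assumes each coding sequence is in frame, starts at the first aligned residue, and encodes exactly the residues in its protein row (optionally followed by one stop codon). It does not check that the codons actually translate to the aligned residues.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an ungapped coding sequence onto one gapped protein alignment row."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    # Accept the CDS with or without a single trailing stop codon.
    if len(cds) not in (3 * n_residues, 3 * n_residues + 3):
        raise ValueError("CDS length does not match the protein row")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)           # one protein gap -> one codon-sized gap
        else:
            out.append(cds[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    return "".join(out)

# Toy protein alignment (two rows) and matching coding sequences.
protein_rows = {"seq1": "MK-L", "seq2": "MKAL"}
cds_by_name = {"seq1": "ATGAAACTG", "seq2": "ATGAAGGCTCTG"}

dna_rows = {name: thread_codons(row, cds_by_name[name])
            for name, row in protein_rows.items()}
for name, row in dna_rows.items():
    print(name, row)
```

Because gaps are only ever inserted three nucleotides at a time, the reading frame is preserved in every row of the resulting DNA alignment.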