Muscle Multiple Sequence Alignment: How To Allow Alignment With A Sequence That Is Just Gaps
12.0 years ago
Klugman ▴ 20

I am currently using Clustal to align orthologous proteins for about 50 species, but would like to use MUSCLE instead.
Since I am examining thousands of proteins, I use the Linux binary of MUSCLE.

The Problem

MUSCLE appears to not accept "empty" input sequences. That is, Protein X is not present in, say, the bear, and this is shown as lines with dashes/gaps:

Input

>ProteinX_human 
ABC
>ProteinX_cat 
A-C
>ProteinX_bear
---

Output

The MUSCLE output alignment file does not include >ProteinX_bear.

Question

How can I ensure that MUSCLE outputs the alignment with >ProteinX_bear showing only dashes/gaps (-) along its full length? I cannot find any information about how to achieve this in the MUSCLE manual, although I am new to bioinformatics and may simply be overlooking it. It is very important for my downstream analysis that species lacking AAs are included in the alignment output.

Thank you for your help, and I hope my question is clear.

multiple sequence alignment • 6.3k views
12.0 years ago
Andreas ★ 2.5k

It's hard to imagine why you would need this feature. Anyway, have you tried replacing the gaps in the gap-only sequences with Xs (X = any amino acid)? That fake sequence would also contain no information, but MUSCLE will at least report the sequence containing only Xs in the output.

Andreas

12.0 years ago
Neilfws 49k

I have never seen anyone try to do this before. If Protein X is not present in bear, why would you even want to align it?

The residues in a multiple sequence alignment contribute information. No residues = no information. In other words, even if you did include a sequence containing only "-", it would not contribute anything meaningful to later analyses.

That said: I'd really like to know what happens were one to edit the alignment manually (probably the only way to do it), insert an "all gaps" sequence and try a subsequent analysis.

12.0 years ago
Klugman ▴ 20

Thank you Andreas and Neilfws. I appreciate your input and will try replacing the dashes with a non-AA letter before running MUSCLE.

I am (obviously ;) ) very new to bioinformatics, and wrote a downstream Perl script (clunky, and it would probably make many bioinformaticians cry, but it does its job) that requires all the species to be in the same order in each protein's alignment, hence the need to include information-less data in each alignment.

Update: Just realised I forgot to mention that I am combining UCSC multiway CDS alignments with unpublished protein sequences from our lab. Sorry about not being clear.

I just wrote a Perl script that replaces the gaps with X's prior to running MUSCLE. The subsequent MUSCLE alignments look good.
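
For readers who want to try the same gap-to-X step, here is a minimal sketch in Python rather than Perl (the original Perl script is not shown in this thread). The function names `parse_fasta`, `mask_gap_only` and `to_fasta` are illustrative assumptions, not part of MUSCLE or any library, and the input is the toy example from the question.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples, keeping input order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mask_gap_only(records, placeholder="X"):
    """Replace sequences consisting only of '-' with a run of placeholder characters."""
    masked = []
    for header, seq in records:
        if seq and set(seq) == {"-"}:
            seq = placeholder * len(seq)
        masked.append((header, seq))
    return masked


def to_fasta(records):
    """Write records back out as FASTA text, one sequence per line."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Toy input taken from the question above.
example = """>ProteinX_human
ABC
>ProteinX_cat
A-C
>ProteinX_bear
---
"""

print(to_fasta(mask_gap_only(parse_fasta(example))), end="")
```

Only sequences made up entirely of dashes are touched; partially gapped sequences such as `A-C` are left as they are. If the downstream analysis expects gaps rather than Xs, the all-X rows would need to be converted back after alignment.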

Thanks again.

ADD COMMENT

Why don't you post-process the output of MUSCLE instead and add back the missing sequence? It sounds safer and you can easily use it with any other aligner, should you want to replace MUSCLE in the future.
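
A rough sketch of that post-processing idea in Python, under a few assumptions: the aligner's output has already been parsed into `(header, sequence)` tuples, headers in the output match the input headers exactly, and the function name `restore_missing` is made up for illustration. The aligned rows in the example are invented for the toy data from the question, not real MUSCLE output.

```python
def restore_missing(input_headers, aligned, gap="-"):
    """Return aligned records in input order, adding all-gap rows for absent headers.

    input_headers: headers in the order they appeared in the aligner's input.
    aligned: list of (header, aligned_sequence) tuples parsed from the aligner's output.
    """
    lengths = {len(seq) for _, seq in aligned}
    if len(lengths) != 1:
        raise ValueError("need a non-empty alignment whose rows all have the same length")
    width = lengths.pop()
    by_header = dict(aligned)
    # Walk the input order so every alignment lists the species identically.
    return [(h, by_header.get(h, gap * width)) for h in input_headers]


# Hypothetical example: the bear sequence was dropped and the row order changed.
input_headers = ["ProteinX_human", "ProteinX_cat", "ProteinX_bear"]
aligned = [("ProteinX_cat", "A-C"), ("ProteinX_human", "ABC")]

for header, seq in restore_missing(input_headers, aligned):
    print(f">{header}\n{seq}")
```

Because the result follows the input order, this also covers the requirement that all species appear in the same order in every alignment, and it does not depend on which aligner produced the rows.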
