Automagically Remove "Badly" Aligning Sequence From Multiple-Sequence Alignment
10.8 years ago
Yannick Wurm ★ 2.3k

Let's say you do a Multiple Sequence Alignment (MSA). One incomplete or erroneous sequence can complicate things downstream. Is there any tool to automatically eliminate the bad one?

(Gblocks keeps the columns that align well, which is not the same as removing a badly aligning sequence, so it's not a solution here.)

Cheers, yannick

multiple • 6.4k views

Gblocks selects blocks (almost the same as columns); what you are asking for is selecting rows/sequences.

10.8 years ago

NorMD

NorMD can remove badly aligning sequences from MSAs. The paper is called "Towards a reliable objective function for multiple sequence alignments" (Hubmed link). The method was developed by Julie Thompson, the same author who created ClustalW.

The source code for NorMD is available here

See the section Removing badly aligned or unrelated sequences on page 13 of the PDF.

GUIDANCE

GUIDANCE (paper, web server) can also remove bad sequences from MSAs.

Quoting from the abstract:

The server points to columns and sequences that are unreliably aligned and enables their automatic removal from the MSA, in preparation for downstream analyses.


thanks aleks, I'll check those out.


Has anyone had difficulty getting the code for NorMD? It seems to have experienced some link-death.

NORMD is available for download here

10.8 years ago

It might be worth asking yourself the following question: why is the badly aligning sequence in your set to start with?

Since you are asking for an "automagic" way to eliminate such cases, I assume that you are trying to make multiple sequence alignments for many sets of sequences (otherwise you would just look at your alignment and remove the bad one). This implies that you probably also have some automatic pipeline that fishes out the initial set of sequences (likely based on pairwise sequence similarity). In this case, you should try to figure out why this pipeline includes a sequence that does not align well with the other sequences in the first place, and try to fix the problem at its source.

Alternatively, you'll have to fix it post hoc. One way to do this is a leave-one-out approach: for each sequence, make an alignment of all the other sequences. You could then either use the NorMD program to assess whether the alignment became better by excluding this sequence, or use hmmbuild and hmmsearch to score how well the left-out sequence matches an HMM built from all the others.
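The leave-one-out idea could be sketched roughly as follows. This is only an illustration, not a tested pipeline: it plans the commands for each held-out sequence but does not run them. The helper names (`parse_fasta`, `leave_one_out`, `plan_commands`), the `loo` working directory, the file naming scheme and the choice of MAFFT as the aligner are all assumptions made for this example; writing the per-sequence FASTA files and parsing the hmmsearch scores are left out.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def leave_one_out(records):
    """Yield (held_out_record, remaining_records) for every sequence."""
    for i, held_out in enumerate(records):
        yield held_out, records[:i] + records[i + 1:]


def plan_commands(records, workdir="loo"):
    """Return, per held-out sequence, the shell commands to run (nothing is executed)."""
    plans = {}
    for (name, _seq), _rest in leave_one_out(records):
        # Hypothetical file layout: <workdir>/<name>.rest.fasta holds all other
        # sequences, <workdir>/<name>.query.fasta holds the held-out one.
        base = f"{workdir}/{name}"
        plans[name] = [
            # Assumption: MAFFT as the aligner; any MSA program would do.
            f"mafft --auto {base}.rest.fasta > {base}.rest.aln.fasta",
            # Build a profile HMM from the alignment of all other sequences.
            f"hmmbuild {base}.hmm {base}.rest.aln.fasta",
            # Score the held-out sequence against that HMM.
            f"hmmsearch --tblout {base}.tbl {base}.hmm {base}.query.fasta",
        ]
    return plans
```

A held-out sequence that scores much worse against its HMM than the others do against theirs would be a candidate for removal. Sequence names containing characters that are awkward in file names would need sanitising first.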

10.8 years ago
Rvosa ▴ 580

Perhaps you can tell us more about your use case? Obviously, you can filter out sequences on any number of conditions, e.g. too many missing (N/X/?), too short, etc. Do you want to remove them after doing the initial alignment, e.g. remove sequences that diverge too much from the mean (suggesting dodgy alignment), then re-align? All this can presumably be done with any of the Bio* toolkits (I'd use BioPerl, but that's just me).
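As a rough illustration of that kind of filtering, here is a minimal sketch in plain Python rather than one of the Bio* toolkits. It flags sequences in an existing alignment that are too short, contain too many missing-data characters, or whose mean pairwise identity to the other sequences falls well below the average. The function names, the default thresholds and the z-score cutoff are assumptions chosen for the example, not recommended values, and the alignment must contain at least two sequences.

```python
from statistics import mean, pstdev


def missing_fraction(seq, missing="NX?"):
    """Fraction of non-gap characters that are missing-data symbols.

    Note: N is a real residue (asparagine) in proteins, so adjust `missing`
    for protein alignments.
    """
    residues = [c for c in seq.upper() if c != "-"]
    if not residues:
        return 1.0
    return sum(c in missing for c in residues) / len(residues)


def pairwise_identity(a, b):
    """Identity over the columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def flag_outliers(alignment, min_length=0, max_missing=0.5, z_cutoff=2.0):
    """Return names of sequences that look too short, too incomplete or too divergent.

    `alignment` maps sequence names to aligned (gapped) strings of equal length.
    """
    names = list(alignment)
    # Mean identity of each sequence to every other sequence in the alignment.
    mean_id = {
        n: mean(pairwise_identity(alignment[n], alignment[m]) for m in names if m != n)
        for n in names
    }
    mu, sigma = mean(mean_id.values()), pstdev(mean_id.values())
    flagged = []
    for n in names:
        seq = alignment[n]
        too_short = len(seq.replace("-", "")) < min_length
        too_missing = missing_fraction(seq) > max_missing
        # Flag sequences more than z_cutoff standard deviations below the mean.
        too_divergent = sigma > 0 and (mean_id[n] - mu) / sigma < -z_cutoff
        if too_short or too_missing or too_divergent:
            flagged.append(n)
    return flagged
```

After dropping the flagged sequences you would re-align the remaining ones. With only a handful of sequences a single outlier cannot reach a large z-score, so the cutoff needs to be lowered for small alignments.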
