Question

Automagically Remove "Badly" Aligning Sequence From Multiple-Sequence Alignment

7

13.3 years ago

Yannick Wurm ★ 2.5k

Let's say you do a Multiple Sequence Alignment (MSA). One incomplete or erroneous sequence can complicate things downstream. Is there any tool to automatically eliminate the bad one?

(Gblocks keeps the blocks that align well, which is not the same as removing a sequence that aligns badly, so it's not a solution here.)

Cheers, yannick

multiple • 8.1k views

Updated 5.1 years ago by dukecomeback ▴ 40 • written 13.3 years ago by Yannick Wurm ★ 2.5k

0

Gblocks selects blocks (almost the same as columns) - what you are asking for is selecting rows/sequences.

13.3 years ago by Aleksandr Levchuk 3.2k

Ram · Answer 1 · 2011-01-14

6

13.3 years ago

Aleksandr Levchuk 3.2k

NorMD

NorMD can remove badly aligning sequences from MSAs. The paper is called "Towards a reliable objective function for multiple sequence alignments" (Hubmed link). The method was developed by Julie Thompson, the same author who created ClustalW.

The source code for NorMD is available here

See the section "Removing badly aligned or unrelated sequences" on page 13 of the PDF.

GUIDANCE

GUIDANCE (paper, web server) can also remove bad sequences from MSAs.

Quoting from the abstract:

The server points to columns and sequences that are unreliably aligned and enables their automatic removal from the MSA, in preparation for downstream analyses.

Updated 4.6 years ago by Ram 43k • written 13.3 years ago by Aleksandr Levchuk 3.2k

0

Thanks Aleks, I'll check those out.

13.3 years ago by Yannick Wurm ★ 2.5k

0

Has anyone had difficulty getting the code for NorMD? It seems to have experienced some link death.

13.2 years ago by Will 4.5k

0

@Will, ftp://ftp-igbmc.u-strasbg.fr/pub/NORMD/ is working

Updated 4.6 years ago by Ram 43k • written 13.2 years ago by Aleksandr Levchuk 3.2k

0

NORMD is available for download here

4.7 years ago by catchakhiljobby ▴ 10

score 2 · Answer 2 · 2011-01-13

It might be worth asking yourself the following question: why is the badly aligning sequence in your set to start with?

Since you are asking for an "automagic" way to eliminate such cases, I assume that you are trying to make multiple sequence alignments for many sets of sequences (otherwise you would just look at your alignment and remove the bad one). This implies that you probably also have an automatic pipeline that fishes out the initial set of sequences (likely based on pairwise sequence similarity). In that case, you should try to figure out why this pipeline includes a sequence that does not align well with the other sequences in the first place, and try to fix the problem at its source.

Alternatively, you'll have to fix it post hoc. One way to do this would be, for each sequence, to make an alignment of all the other sequences. You could then either use the NorMD program to assess whether the alignment became better by excluding this sequence, or use hmmbuild and hmmsearch to score how well the left-out sequence matches an HMM built from all the others.
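To make the leave-one-out idea concrete, here is a rough Python sketch. It only prepares the splits and the HMMER command lines; it does not run anything. The file names and the `loo_N` stems are hypothetical, and it assumes you align the "others" set yourself with whatever aligner you normally use before calling hmmbuild.

```python
def leave_one_out(records):
    """Yield (left_out, others) for every record.

    records is a list of (name, sequence) tuples with unaligned sequences.
    """
    for i, rec in enumerate(records):
        yield rec, records[:i] + records[i + 1:]


def to_fasta(records):
    """Format (name, sequence) tuples as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def hmmer_commands(stem):
    """Return argv lists for scoring one left-out sequence.

    Assumption: <stem>.others.aln.fasta already holds an alignment of the
    other sequences and <stem>.query.fasta holds the left-out sequence.
    These file names are made up for this sketch.
    """
    return [
        # Build a profile HMM from the alignment of the remaining sequences.
        ["hmmbuild", f"{stem}.hmm", f"{stem}.others.aln.fasta"],
        # Score the left-out sequence against that profile (tabular output).
        ["hmmsearch", "--tblout", f"{stem}.tbl",
         f"{stem}.hmm", f"{stem}.query.fasta"],
    ]


if __name__ == "__main__":
    demo = [("a", "ACGT"), ("b", "ACGA"), ("c", "TTTT")]
    for i, (left_out, others) in enumerate(leave_one_out(demo)):
        print(f"left out: {left_out[0]}, kept: {[n for n, _ in others]}")
        for cmd in hmmer_commands(f"loo_{i}"):
            print("  " + " ".join(cmd))
```

You would write `to_fasta(others)` and `to_fasta([left_out])` to disk, align the former, and then run each command (for example with `subprocess.run(cmd, check=True)`). A left-out sequence whose bit score is much lower than the rest is a candidate for removal; where to put the cut-off is something you would have to decide for your own data.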

score 1 · Answer 3 · 2011-01-13

Perhaps you can tell us more about your use case? Obviously, you can filter out sequences on any number of conditions, e.g. too many missing characters (N/X/?), too short, etc. Do you want to remove them after doing the initial alignment, e.g. remove sequences that diverge too much from the mean (suggesting dodgy alignment), then re-align? All this can presumably be done with any of the Bio* toolkits (I'd use BioPerl, but that's just me).
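As an illustration of these filters, here is a minimal sketch in plain Python (not BioPerl). It takes an already aligned set of sequences as a dict and flags rows that are too short, have too many missing characters, or whose mean pairwise identity to the others falls well below the average. The function names, the thresholds and the use of pairwise identity as the "divergence" measure are all assumptions made for this example.

```python
from statistics import mean, pstdev

GAP = "-"
MISSING = set("NnXx?")


def missing_fraction(seq):
    """Fraction of non-gap characters that are ambiguous (N, X or ?)."""
    residues = [c for c in seq if c != GAP]
    if not residues:
        return 1.0
    return sum(c in MISSING for c in residues) / len(residues)


def pairwise_identity(a, b):
    """Identity over columns where both aligned sequences have a residue."""
    shared = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def flag_outliers(msa, max_missing=0.5, min_length=10, z_cutoff=1.5):
    """Return {name: reason} for suspicious rows of an aligned MSA.

    msa maps sequence names to aligned sequences of equal length.
    The default thresholds are arbitrary placeholders, not recommendations.
    """
    flagged = {}
    # Simple per-sequence filters first.
    for name, seq in msa.items():
        if len(seq.replace(GAP, "")) < min_length:
            flagged[name] = "too short"
        elif missing_fraction(seq) > max_missing:
            flagged[name] = "too many missing characters"
    kept = {n: s for n, s in msa.items() if n not in flagged}
    if len(kept) < 3:
        return flagged
    # Mean identity of each remaining sequence to all the others.
    scores = {
        n: mean(pairwise_identity(s, t) for m, t in kept.items() if m != n)
        for n, s in kept.items()
    }
    mu, sd = mean(scores.values()), pstdev(scores.values())
    for n, score in scores.items():
        if sd > 0 and (mu - score) / sd > z_cutoff:
            flagged[n] = "diverges from the mean"
    return flagged
```

After dropping the flagged sequences you would re-align the rest, as suggested above. On very small alignments a z-score cut-off like this is not reliable, so treat the flags as hints to inspect rather than as a verdict.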

score 0 · Answer 4 · 2019-03-21

5.1 years ago

dukecomeback ▴ 40

https://github.com/dukecomeback/bad-sequence-remover
