I'm working with soft-clipped reads in Python, mainly using Biopython. I have gathered some small clusters of interesting sequences, which I want to run motif analysis on, perhaps BLAST them, etc.
My problem is: I have between 3 and 20 sequences in each cluster, and I want to reduce each cluster to a single consensus sequence. The sequences are highly similar, but sometimes a few corrupted sequences end up in a cluster. That means I cannot simply calculate the consensus, since a single mismatched sequence might introduce gaps or otherwise affect the consensus too much.
Is there a way to automatically (without human intervention) discard badly matching sequences from a multiple alignment?
My current implementation first runs a ClustalW multiple alignment, derives the consensus, then aligns each sequence pairwise to that consensus with EMBOSS needle and discards the poorly matching ones. The consensus is then rebuilt from the remaining sequences. This seems rather clumsy and is terribly slow.
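To make the filter-and-rebuild idea concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the actual implementation: it assumes the sequences are already aligned to equal length (for example, rows taken from the ClustalW output), and it swaps the needle step for a simple per-column identity score against the draft consensus. The function names and the 0.8 threshold are made up for the example.

```python
from collections import Counter


def column_consensus(aligned):
    """Majority-vote consensus over equal-length aligned sequences."""
    consensus = []
    for column in zip(*aligned):
        # Pick the most frequent symbol (base or gap) in this column.
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


def identity(seq, consensus):
    """Fraction of alignment columns where seq agrees with the consensus."""
    matches = sum(a == b for a, b in zip(seq, consensus))
    return matches / len(consensus)


def filtered_consensus(aligned, min_identity=0.8):
    """Build a draft consensus, drop outliers, then rebuild the consensus.

    min_identity is an arbitrary illustrative cutoff, not a recommended value.
    """
    draft = column_consensus(aligned)
    kept = [s for s in aligned if identity(s, draft) >= min_identity]
    if not kept:
        # Nothing passed the cutoff: fall back to the unfiltered cluster.
        kept = list(aligned)
    return column_consensus(kept), kept


# Toy cluster: three near-identical reads plus one corrupted read.
cluster = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "TTGACCATGG"]
consensus, kept = filtered_consensus(cluster)
print(consensus, len(kept))
```

The returned consensus can still contain gap characters if most of the kept sequences have a gap in a column; those would need to be stripped before further analysis. This only avoids the pairwise realignment step and still depends on the initial multiple alignment not being badly distorted by the outliers.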
Any advice is greatly appreciated!