Filtering MSA by similiarity to a consensus sequence/motif
1
0
Entering edit mode
3.1 years ago

Dear all,

anyone knows a good way of filtering or sorting a large multiple sequence alignment (~8000 sequences) by similarity to a given consensus sequence? A solution using python/biopython would be optimal.

Any help is appreciated!

Best Jonathan

3
Entering edit mode
3.1 years ago
Joe 21k

Here's the basics of some code that will do what you need. If you already have the consensus sequence in another file, you don't need to derive the consensus as I have here. I am also using an arbitrarily low identity and high hamming distance for illustration purposes:

from Bio import AlignIO
from Bio.Align import AlignInfo
import sys

def hamming_distance(s1, s2):
"""Return the Hamming distance between equal-length sequences"""
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

threshold = 0.10

summary = AlignInfo.SummaryInfo(alignment)
consensus_seq = summary.dumb_consensus(threshold)

filtered = []
for record in alignment:
if hamming_distance(record.seq, consensus_seq) < 100:
filtered.append(record)

[print(record.id) for record in filtered]


In short:

1. Read your alignment in with Bio.AlignIO.
2. Read in your consensus sequence (e.g. with SeqIO or derive it in the code).
3. Iterate each record in the alignment, and compare its sequence distance (in this case using the Hamming Distance) to the consensus.
4. If it passes whatever similarity threshold, add it to a list for recall later.
1
Entering edit mode

Note that this is a very naive approach (as evidenced by the use of the dumb_consensus). This will not take in to consideration any amino acid similarity or any other factors other than absolute residue-for-residue identity.