Filtering MSA by similiarity to a consensus sequence/motif
1
0
Entering edit mode
9 weeks ago

Dear all,

anyone knows a good way of filtering or sorting a large multiple sequence alignment (~8000 sequences) by similarity to a given consensus sequence? A solution using python/biopython would be optimal.

Any help is appreciated!

Best Jonathan

biopython motif multiple alignment MSA python sequence search • 214 views
ADD COMMENT
2
Entering edit mode
9 weeks ago
Joe 19k

Here's the basics of some code that will do what you need. If you already have the consensus sequence in another file, you don't need to derive the consensus as I have here. I am also using an arbitrarily low identity and high hamming distance for illustration purposes:

from Bio import AlignIO
from Bio.Align import AlignInfo
import sys


def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


alignment = AlignIO.read(sys.argv[1], "fasta")
threshold = 0.10

summary = AlignInfo.SummaryInfo(alignment)
consensus_seq = summary.dumb_consensus(threshold)

filtered = []
for record in alignment:
    if hamming_distance(record.seq, consensus_seq) < 100:
        filtered.append(record)

[print(record.id) for record in filtered]

In short:

  1. Read your alignment in with Bio.AlignIO.
  2. Read in your consensus sequence (e.g. with SeqIO or derive it in the code).
  3. Iterate each record in the alignment, and compare its sequence distance (in this case using the Hamming Distance) to the consensus.
  4. If it passes whatever similarity threshold, add it to a list for recall later.
ADD COMMENT
1
Entering edit mode

Note that this is a very naive approach (as evidenced by the use of the dumb_consensus). This will not take in to consideration any amino acid similarity or any other factors other than absolute residue-for-residue identity.

ADD REPLY

Login before adding your answer.

Traffic: 1004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6