How to find positions in the alignments at which the target sequence is not identical to the consensus.
0
0
Entering edit mode
4.2 years ago
iCosmoso • 0

I have the MSA consensus sequence at 60 % threshold, but I need to know the positions in the alignments at which the target sequence is not identical to the consensus. Is there an easier way to do that?

I have tried using this for two of my sequences (s1 and s2):

[i for i,(a1,a2)  in enumerate(izip(s1,s2)) if a1!=a2]

And it gives me a lot of numbers. Maybe there is a better way to find the positions?

biopython python • 1.3k views
ADD COMMENT
0
Entering edit mode

Those numbers should correspond to the indexes you are interested in, exactly what is the problem?

If you want to generalise this for the full alignment you can do:

from Bio import AlignIO
aln = AlignIO.parse(..., ...)

for i in range(aln.get_alignment_length()):
        if len(set(aln[:, i])) > 1:
            print(i)

This will show you all of the positions which are not 100% identical in the MSA. Since you're interested in how they differ from the consensus, you will need to iterate the consensus sequence and do a similar comparison to what you did in your post.

You haven't said exactly what kind of output you're interested in though? Do you want the index where any sequence differs from the consensus? Do you want to know the specific sequence that was different? Etc. All of these things will change the code you need and the approach you can take.

ADD REPLY
0
Entering edit mode

I just want the positions or basically the index, yes. I want to print out/or report the positions (thall be the index numbers) where the target protein sequence is not identical to the consensus sequence for every alignment.

ADD REPLY
0
Entering edit mode

What do you want to do if the consensus has an ambiguous character?

Otherwise you just need to loop over the sequences in the alignment:

from Bio import AlignIO
from Bio.Align import AlignInfo # To generate a trivial 'dumb' consensus

aln = AlignIO.read(file, format)
summary = AlignIO.SummaryInfo(aln)
consensus = summary.dumb_consensus(threshold, "ambiguous_char")

for seq in aln:
    results = [i for i, (ch1, ch2) in enumerate(zip(consensus, seq.seq)) if ch1 != ch2]
    printseq.id, results)

I've tested the code in as much as it runs, but I haven't properly checked that it is indeed reporting all the correct positions. Looks legit though as a couple of columns I know are 100% identical are not in the output I get.

There's probably a smarter/more efficient way to do this with set, any and all to do list comparisons a bit easier, but since you want to know the differences on a sequence by sequence basis, this is the most trivial code.

NB Be aware there is a parsing error on the forum's markdown renderer and there should be a ( between print and seq.id.

ADD REPLY
0
Entering edit mode

Oh yea I forgot to mention, I want to ignore non-consensus positions and the gaps in either sequences.

ADD REPLY
0
Entering edit mode

That should be easy enough. I'd suggest unrolling the list comprehension in to a proper loop structure, and then you just add a few if statements to ensure the character in question is not X or -.

You could edit the list comp, but it starts to get unwieldy after a while. Off the top of my head it would look something like [ i for i, (ch1, ch2) in enumerate(zip(consensus, seq.seq)) if ch1 != ch2 and ch1 != "X" and ch2 not in ["X", "-"] ]

(Or whatever ambiguous character you choose to use.

ADD REPLY
0
Entering edit mode

But like how could I make it look nicer, cuz right now it gives me number by number, but like how could I reported like it starts from lets say 1 to 200?

ADD REPLY
0
Entering edit mode

You are moving the goalposts now, getting the indexes of all the different positions is the output you requested, so that's the code I wrote.

If you need a specific form of output, you need to be explicit about this in your question. You also haven't shown what you've tried on this front.

You can post-process this data however you like: another script, import it in to an excel sheet, or whatever you prefer, but I can't guess what "nicer" means for your needs. If you only care about the first 200 values, you can just index results[0:199], for example.

ADD REPLY

Login before adding your answer.

Traffic: 2115 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6