Entering edit mode
4.2 years ago
iCosmoso
•
0
I have the MSA consensus sequence at 60 % threshold, but I need to know the positions in the alignments at which the target sequence is not identical to the consensus. Is there an easier way to do that?
I have tried using this for two of my sequences (s1 and s2):
[i for i,(a1,a2) in enumerate(izip(s1,s2)) if a1!=a2]
And it gives me a lot of numbers. Maybe there is a better way to find the positions?
Those numbers should correspond to the indexes you are interested in, exactly what is the problem?
If you want to generalise this for the full alignment you can do:
This will show you all of the positions which are not 100% identical in the MSA. Since you're interested in how they differ from the consensus, you will need to iterate the consensus sequence and do a similar comparison to what you did in your post.
You haven't said exactly what kind of output you're interested in though? Do you want the index where any sequence differs from the consensus? Do you want to know the specific sequence that was different? Etc. All of these things will change the code you need and the approach you can take.
I just want the positions or basically the index, yes. I want to print out/or report the positions (thall be the index numbers) where the target protein sequence is not identical to the consensus sequence for every alignment.
What do you want to do if the consensus has an ambiguous character?
Otherwise you just need to loop over the sequences in the alignment:
I've tested the code in as much as it runs, but I haven't properly checked that it is indeed reporting all the correct positions. Looks legit though as a couple of columns I know are 100% identical are not in the output I get.
There's probably a smarter/more efficient way to do this with
set
,any
andall
to do list comparisons a bit easier, but since you want to know the differences on a sequence by sequence basis, this is the most trivial code.NB Be aware there is a parsing error on the forum's markdown renderer and there should be a
(
betweenprint
andseq.id
.Oh yea I forgot to mention, I want to ignore non-consensus positions and the gaps in either sequences.
That should be easy enough. I'd suggest unrolling the list comprehension in to a proper loop structure, and then you just add a few
if
statements to ensure the character in question is notX
or-
.You could edit the list comp, but it starts to get unwieldy after a while. Off the top of my head it would look something like
[ i for i, (ch1, ch2) in enumerate(zip(consensus, seq.seq)) if ch1 != ch2 and ch1 != "X" and ch2 not in ["X", "-"] ]
(Or whatever ambiguous character you choose to use.
But like how could I make it look nicer, cuz right now it gives me number by number, but like how could I reported like it starts from lets say 1 to 200?
You are moving the goalposts now, getting the indexes of all the different positions is the output you requested, so that's the code I wrote.
If you need a specific form of output, you need to be explicit about this in your question. You also haven't shown what you've tried on this front.
You can post-process this data however you like: another script, import it in to an excel sheet, or whatever you prefer, but I can't guess what "nicer" means for your needs. If you only care about the first 200 values, you can just index
results[0:199]
, for example.