Printing for each specific sequence and its residue in msa , every msa position
1
0
Entering edit mode
3.6 years ago

Hi everyone so I have a question how can you print out for each specific residue of a sequence in msa its msa position I tried something with biopython :

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO

alignment = AlignIO.read(open("ADORA1.pfam"), "PFAM")

def msa_position(aln, id, res_no):
    rec = next((r for r in aln if r.id==id), None)
    j = 0
    for i, res in enumerate(rec.seq):
        if res!='-':
            if j==res_no:
 return i
 j+=1

print (msa_position(alignment, "ENST00000618295|ADORA1/1-210", 3))

However, i can only do it for one position at a time and i want output like

id                                                              Sequence_position             msa_positon

ENST00000618295|ADORA1                     3                                        119

ENST00000618295|ADORA1                        4                                    120

and so on

Sorry for the code writing its my first time on biostars.

alignment sequence biopython R • 1.2k views
ADD COMMENT
0
Entering edit mode

I have tried to format your code for you, but the indentation was not clear, so please correct me if I have it wrong.

I'm not quite sure I understand the question - you want to know, for a given sequence within an MSA, where its sequence starts (i.e. count the gap characters up til the first base/amino)?

ADD REPLY
0
Entering edit mode

So what i want is a table which will map every sequence position of each sequence in given alignment (there are three of them in the msa file) to its msa position of course counting the gaps until the first amino so basically for that sequence in the example the third position in the sequence is actually 119 position cause of the gaps

ADD REPLY
0
Entering edit mode

Can you provide some example input data?

ADD REPLY
0
Entering edit mode

Here you go something shorter than my sequences but principle should be the same so i want each position in each sequence mapped to its msa position example in Beta the 4th position in the sequence is 5th msa position and for the other two they are the same 4th position to 4th position.

alignment = MultipleSeqAlignment([ SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"), SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"), SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),

ADD REPLY
0
Entering edit mode

That gap is internal to the second sequence, so should the desired output be 0 or 3?

ADD REPLY
0
Entering edit mode

so in sequence itself gaps you dont count so you go 1,2,3, skip gap 4 and for each of the counted positions i need the msa position so basically it will list in a column postions in sequence from 1 to length of that sequence and to each counted residued it will write a msa position as gaps arent counted the two numbers will not be the same

ADD REPLY
1
Entering edit mode
3.6 years ago

Maybe this will help.

from Bio.Align import MultipleSeqAlignment
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def map_positions(aln, id):
    rec = next((r for r in aln if r.id==id), None)
    seq_pos = 0
    for aln_pos, res in enumerate(rec.seq, start=1):
        if res!='-':
            seq_pos += 1
            yield seq_pos, aln_pos

alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"), 
    SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"), 
    SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma")])

print('id\tseq_pos\taln_pos')
for seq_pos, aln_pos in map_positions(alignment, "Beta"):
    print(f'Beta\t{seq_pos}\t{aln_pos}')

Output:

id  seq_pos aln_pos
Beta    1   1
Beta    2   2
Beta    3   3
Beta    4   5
Beta    5   6
Beta    6   7
Beta    7   8
Beta    8   9
Beta    9   10
Beta    10  11
Beta    11  12

If you want to show the output for all the sequences at once, you can use this code:

from Bio.Align import MultipleSeqAlignment
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def map_positions(seq):
    seq_pos = 0
    for aln_pos, res in enumerate(seq, start=1):
        if res!='-':
            seq_pos += 1
            yield seq_pos, aln_pos

alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"), 
    SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"), 
    SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma")])

print('id\tseq_pos\taln_pos')
for aln_seq in alignment:
    for seq_pos, aln_pos in map_positions(aln_seq):
        print(f'{aln_seq.id}\t{seq_pos}\t{aln_pos}')
ADD COMMENT
0
Entering edit mode

Yes thats it thank you so much you are a lifesaver. I have just one question how can i print for all ids at once so i have alpha beta and gamma in one output.

ADD REPLY
0
Entering edit mode

Glad I could help. I just updated my answer to show all ids at once.

ADD REPLY

Login before adding your answer.

Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6