Question

How to automat the calculation of alignment scores (BLOSUMxx matrix based) between one short peptide sequence and 10000 other peptide sequences (using biopython)

0

Entering edit mode

2.5 years ago

pierre.g.coulon • 0

Hi everyone and thanks in advance for any of your help.

I am running into an issue recently when trying to calculate the scores of a high number of short peptide alignments (10000).

I have to calculate all the alignement scores (calculation based on the BLOSUM62 matrix, but I could also want to use other BLOSUM matrix) of a set of 10000 alignments.

Those alignments are between one "reference" sequence (that does not vary) and 10000 other (almost random) sequences. Those sequences are all very short (9 to 20aa — my example bellow is with 9aa).

First, I already generated the list of 10000 (almost random) peptide sequences in an .xls (or .cvs) table. No issue for that.

Then, I found this biopython script to calculate the score of one alignement (here i put one example with 2 of my peptides): (the script come from this forum)

from Bio.SubsMat.MatrixInfo import blosum62 as blosum

blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
for A,B in zip(seq1, seq2):
    diag = ('-'==A) or ('-'==B)
    yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
    gap = diag

seq1 = 'ALNSSGTLT'
seq2 = 'AVNDSG-LT'

print(sum(score_pairwise(seq1, seq2, blosum, -3, -1)))

—> this give me the correct score of 27. All perfect.

The issue I am facing now is that although this script is perfect for me to calculate the score of the 2 sequences aligned (seq1 and seq2), I am struggling to find a neat way to apply this Python script on 10000 alignements.

As I said, seq1 does not vary so nothing to change here. Now, i wanted python to retrieve the 10000 other seq2 sequences from my xls/cvs sheet into a list so that the variable seq2 could take the various string values from that list. For that I did:

import pandas as pd
all_seq2_list=pd.read_csv('10000seq2.csv')
all_seq2_list.values.T[0].tolist()

Which return me the list ['randompeptide1', 'randompeptide2',…, 'randompeptide10000']

So far so good.

I have a vague idea that I should find a way to make a loop where the seq2 variable changes each time, assigning it the new string value (peptide sequence) corresponding to the next sequence in my all_seq2_list but I don't have enough technical knowledge in python to understand how to do that.

Bottom line is: I am struggling with the construction of the loop and how to (for each iteration) give the next value from my all_seq2_list to my seq2 variable; and then calculate the 10000 resulting scores I think it is probably not very difficult for someone who use python every day but as a newbie I am stuck.

Can anyone help me ? Thank you!

biopython blosum score alignment script • 2.7k views

ADD COMMENT • link updated 2.5 years ago by Mensur Dlakic ★ 27k • written 2.5 years ago by pierre.g.coulon • 0

0

Entering edit mode

2.5 years ago

pierre.g.coulon • 0

Thank you very much ! I feel so dumb: few minutes after posting I found this solution myself :

import pandas as pd
all_seq2_list=pd.read_csv('10000seq2.csv')
seq2=all_seq2_list.values.T[0].tolist()
for string in seq2:
   from Bio.SubsMat.MatrixInfo import blosum62 as blosum
   blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))
   def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
      for A,B in zip(seq1, seq2):
         diag = ('-'==A) or ('-'==B)
         yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
         gap = diag

   seq1 = 'ALNSSGTLT'
   seq2 = string
   print(sum(score_pairwise(seq1, seq2, blosum, -3, -1)))

I guess I am not confortable with coding so I thought I could not find a solution.

Anyway, thank you a lot for your answer, it allows me to learn and also helped me to put my scores in a .cvs file.

ADD COMMENT • link 2.5 years ago by pierre.g.coulon • 0

1

Entering edit mode

The part that import the BLOSUM matrix and defines pairwise scoring should be outside of the loop, as it needs to be defined only once at the top of the script. What you are doing here is repeating these imports and definitions for each of your sequences. While it won't hurt the final result, it is a waste of time. seq1 can also be defined outside of the loop since it never changes.

ADD REPLY • link 2.5 years ago by Mensur Dlakic ★ 27k

score 2 · Accepted Answer · 2021-10-28

There is really nothing to it - you need a simple for loop to iterate through your sequences, which is in the last two lines.

from Bio.SubsMat.MatrixInfo import blosum62 as blosum

blosum.update(((b,a),val) for (a,b),val in list(blosum.items()))

def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True):
    for A,B in zip(seq1, seq2):
        diag = ('-'==A) or ('-'==B)
        yield (gap_e if gap else gap_s) if diag else matrix[(A,B)]
        gap = diag

seq1 = 'ALNSSGTLT'

import pandas as pd
all_seq2_list=pd.read_csv('10000seq2.csv')

for seq2 in all_seq2_list.values.T[0].tolist():
    print(sum(score_pairwise(seq1, seq2, blosum, -3, -1)))

If you want to add the scores to your sequences and save them into a .csv file for later inspection, replace the last two lines from the above code with this snipet:

all_seq2_list['scores'] = 0
for z in range(len(all_seq2_list)):
    score = sum(score_pairwise(seq1, all_seq2_list['sequence'][z], blosum, -3, -1))
    print(score)
    all_seq2_list['scores'][z] = score

all_seq2_list.to_csv('10000seq2_with_scores.csv', index=False)

This assumes that your .csv file is in this general format, specifically that the header is named sequence:

sequence
LTTDAFRFCCVETVNDFTWT
NRASPFTIDTRVVFTHKKHE
QRSPPALAYELDHMTKVTNF
EGHRYMVPWMEPHRIQPWMV
TNSCVWDSCYRYRHMFWAGH
VINPYCESCWIGSHYNDESI
QVCDPILSHGLEAHTMAISE
HAETSYNSTYDWIYNKGEPV
QKLCPMPIPARVHTTYPNLT
LYNTYQSFNATCYDMRGHDP
NDLTHRHLRRGWRWMLNHSW