Question: How to extract polymorphic positions from several alignments?
0
s_herrera10 wrote:

Hi everyone,

I have several fasta protein alignments (~5000) and I would want to identify polymorphic positions plus the aminoacid residues that change between sequences. I have tried to write a code myself, but it has been very difficult (I'm new in programming), and I looked on BioPython but I hasn't found anything yet. I want something like:

Protein alignment:

``````> sp1 MQGAAYMQAAAYYMQA
> sp2 MQGAARMQGAAYYMQA
> sp3 MQGAARMQGAAYYMQM
> sp4 MQGAARMQGAAYYMQA
> sp5 MQGAARMQAAAYYMQA
^  ^      ^
``````

In the example above, the alignment has 3 polymorphic positions (marked with ^). The first one is located on the 6th position, the second one has the 9th position and the third one the 16th position. A common notation of a polymorphic site coult be as follows: R6Y, which means that a change occured in the 6th position from an R to a Y. The direction of the change (R->Y or Y->R) is based on the most frequent letter at that position. So, in this case R has the highest frequency and one can infer that the direction is R->Y.

As you can see, the 6th and 16th positions have single changes (the different letter is at frequency of 1). However, the 9th position has two sequences (sp1 and sp5) with the change. I would like two distinguish between this two types of polymorphisms. Thus, in this case I woul like an output something like this:

Output :

``````# Alignment #1
#   Single polymorphisms:
#     R6Y: sp1
#     A16M: sp3

#   Non-single polymorphisms:
#     G9A: sp1, sp5
``````

Any suggestion is very appreciated, thanks!!

snp alignment python • 1.6k views
modified 3.1 years ago by Rodrigo150 • written 3.2 years ago by s_herrera10
4
Rodrigo150 wrote:

Supposing your sequences are the same length and have no empty spaces (no '--') as in your example, you can use the following (explained) script.

``````# Start of script
from collections import Counter
from glob import glob
import numpy as np
from Bio import AlignIO
``````

Along with module AlignIO of Bio (biopython) you'll use:

numpy: a widely used python module for numerical computing.

glob and collections: two standard modules (no need to install)

``````sequences = glob('/path/to/sequences/*')
``````

Put all your sequences (fasta files) in a directory here called 'sequences' and replace `/path/to/sequences` with the real directory.

``````for sequence in sequences:
print(sequence)
align_array = np.array([list(rec) for rec in alignment], np.character)
single = {}
poly = {}
``````

For each fasta file, the script prints the name of the file and creates an alignment (using numpy array).

``````    for column in range(alignment.get_alignment_length()):
counter = Counter(align_array[:,column])
main_letter = counter.most_common(1)
``````

For each column `main_letter` is the most frequent letter, if several letter have the same frequency it will be one of them.

``````        if len(counter) == 2:
# single polymorphism
letter = counter.most_common(2)
for pos in np.where(align_array[:,column] == letter):
change = main_letter + str(column + 1) + letter
single.setdefault(change, [])
single[change].append('sp' + str(pos + 1))
``````

If there is only two different letters in that position means that there is a single polyorphism in the position.

``````        elif len(counter) > 2:
# Non-single polymorphism
for letter in counter:
if letter != main_letter:
for pos in np.where(align_array[:,column] == letter):
change = main_letter + str(column + 1) + letter
poly.setdefault(change, [])
poly[change].append('sp' + str(pos + 1))
``````

If more than two differents letters in the column, then there is a non-single polymorphism.

``````    print("   " + "Single polymorphisms:")
for change in single:
print("     " + change + ": " + str(single[change])[1:-1])
print("   " + "Non-single polymorphisms:")
for change in poly:
print("     " + change + ": " + str(poly[change])[1:-1])
# End of script
``````

Run this script as `python script.py >> output.txt` to have the output printed in a file called `output.txt`.

Hi @Rodrigo,

Thank you very much! This is incredibly useful. I just had to added `.decode('UTF-8')` when concatenating `main_letter` and `letter` as follows: `change = main_letter.decode('UTF-8') + str(column + 1) + letter.decode('UTF-8')`, because otherwise the script returns this error message: TypeError: can't concat numpy.bytes_ to str.

best