Question: How to extract polymorphic positions from several alignments?
gravatar for s_herrera
3.2 years ago by
s_herrera10 wrote:

Hi everyone,

I have several fasta protein alignments (~5000) and I would want to identify polymorphic positions plus the aminoacid residues that change between sequences. I have tried to write a code myself, but it has been very difficult (I'm new in programming), and I looked on BioPython but I hasn't found anything yet. I want something like:

Protein alignment:

           ^  ^      ^

In the example above, the alignment has 3 polymorphic positions (marked with ^). The first one is located on the 6th position, the second one has the 9th position and the third one the 16th position. A common notation of a polymorphic site coult be as follows: R6Y, which means that a change occured in the 6th position from an R to a Y. The direction of the change (R->Y or Y->R) is based on the most frequent letter at that position. So, in this case R has the highest frequency and one can infer that the direction is R->Y.

As you can see, the 6th and 16th positions have single changes (the different letter is at frequency of 1). However, the 9th position has two sequences (sp1 and sp5) with the change. I would like two distinguish between this two types of polymorphisms. Thus, in this case I woul like an output something like this:

Output :

# Alignment #1
#   Single polymorphisms:
#     R6Y: sp1
#     A16M: sp3

#   Non-single polymorphisms:
#     G9A: sp1, sp5

Any suggestion is very appreciated, thanks!!

snp alignment python • 1.6k views
ADD COMMENTlink modified 3.1 years ago by Rodrigo150 • written 3.2 years ago by s_herrera10
gravatar for Rodrigo
3.1 years ago by
Rodrigo150 wrote:

Supposing your sequences are the same length and have no empty spaces (no '--') as in your example, you can use the following (explained) script.

# Start of script
from collections import Counter
from glob import glob
import numpy as np
from Bio import AlignIO

Along with module AlignIO of Bio (biopython) you'll use:

numpy: a widely used python module for numerical computing.

glob and collections: two standard modules (no need to install)

sequences = glob('/path/to/sequences/*')

Put all your sequences (fasta files) in a directory here called 'sequences' and replace /path/to/sequences with the real directory.

for sequence in sequences:
    alignment =, 'fasta')
    align_array = np.array([list(rec) for rec in alignment], np.character)
    single = {}
    poly = {}

For each fasta file, the script prints the name of the file and creates an alignment (using numpy array).

    for column in range(alignment.get_alignment_length()):
        counter = Counter(align_array[:,column])
        main_letter = counter.most_common(1)[0][0]

For each column main_letter is the most frequent letter, if several letter have the same frequency it will be one of them.

        if len(counter) == 2:
            # single polymorphism
            letter = counter.most_common(2)[1][0]
            for pos in np.where(align_array[:,column] == letter)[0]:
                change = main_letter + str(column + 1) + letter
                single.setdefault(change, [])
                single[change].append('sp' + str(pos + 1))

If there is only two different letters in that position means that there is a single polyorphism in the position.

        elif len(counter) > 2:
            # Non-single polymorphism
            for letter in counter:
                if letter != main_letter:
                    for pos in np.where(align_array[:,column] == letter)[0]:
                        change = main_letter + str(column + 1) + letter
                        poly.setdefault(change, [])
                        poly[change].append('sp' + str(pos + 1))

If more than two differents letters in the column, then there is a non-single polymorphism.

    print("   " + "Single polymorphisms:")
    for change in single:
        print("     " + change + ": " + str(single[change])[1:-1])
    print("   " + "Non-single polymorphisms:")
    for change in poly:
        print("     " + change + ": " + str(poly[change])[1:-1])
# End of script

Run this script as python >> output.txt to have the output printed in a file called output.txt.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Rodrigo150

Hi @Rodrigo,

Thank you very much! This is incredibly useful. I just had to added .decode('UTF-8') when concatenating main_letter and letter as follows: change = main_letter.decode('UTF-8') + str(column + 1) + letter.decode('UTF-8'), because otherwise the script returns this error message: TypeError: can't concat numpy.bytes_ to str.


ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by s_herrera10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1579 users visited in the last hour