How to extract polymorphic positions from several alignments?
1
0
Entering edit mode
6.9 years ago
s_herrera ▴ 10

Hi everyone,

I have several fasta protein alignments (~5000) and I would want to identify polymorphic positions plus the aminoacid residues that change between sequences. I have tried to write a code myself, but it has been very difficult (I'm new in programming), and I looked on BioPython but I hasn't found anything yet. I want something like:

Protein alignment:

> sp1 MQGAAYMQAAAYYMQA
> sp2 MQGAARMQGAAYYMQA
> sp3 MQGAARMQGAAYYMQM
> sp4 MQGAARMQGAAYYMQA
> sp5 MQGAARMQAAAYYMQA
           ^  ^      ^

In the example above, the alignment has 3 polymorphic positions (marked with ^). The first one is located on the 6th position, the second one has the 9th position and the third one the 16th position. A common notation of a polymorphic site coult be as follows: R6Y, which means that a change occured in the 6th position from an R to a Y. The direction of the change (R->Y or Y->R) is based on the most frequent letter at that position. So, in this case R has the highest frequency and one can infer that the direction is R->Y.

As you can see, the 6th and 16th positions have single changes (the different letter is at frequency of 1). However, the 9th position has two sequences (sp1 and sp5) with the change. I would like two distinguish between this two types of polymorphisms. Thus, in this case I woul like an output something like this:

Output :

# Alignment #1
#   Single polymorphisms:
#     R6Y: sp1
#     A16M: sp3

#   Non-single polymorphisms:
#     G9A: sp1, sp5

Any suggestion is very appreciated, thanks!!

alignment python SNP • 2.8k views
ADD COMMENT
5
Entering edit mode
6.8 years ago
Rodrigo ▴ 190

Supposing your sequences are the same length and have no empty spaces (no '--') as in your example, you can use the following (explained) script.

# Start of script
from collections import Counter
from glob import glob
import numpy as np
from Bio import AlignIO

Along with module AlignIO of Bio (biopython) you'll use:

numpy: a widely used python module for numerical computing.

glob and collections: two standard modules (no need to install)

sequences = glob('/path/to/sequences/*')

Put all your sequences (fasta files) in a directory here called 'sequences' and replace /path/to/sequences with the real directory.

for sequence in sequences:
    print(sequence)
    alignment = AlignIO.read(sequence, 'fasta')
    align_array = np.array([list(rec) for rec in alignment], np.character)
    single = {}
    poly = {}

For each fasta file, the script prints the name of the file and creates an alignment (using numpy array).

    for column in range(alignment.get_alignment_length()):
        counter = Counter(align_array[:,column])
        main_letter = counter.most_common(1)[0][0]

For each column main_letter is the most frequent letter, if several letter have the same frequency it will be one of them.

        if len(counter) == 2:
            # single polymorphism
            letter = counter.most_common(2)[1][0]
            for pos in np.where(align_array[:,column] == letter)[0]:
                change = main_letter + str(column + 1) + letter
                single.setdefault(change, [])
                single[change].append('sp' + str(pos + 1))

If there is only two different letters in that position means that there is a single polyorphism in the position.

        elif len(counter) > 2:
            # Non-single polymorphism
            for letter in counter:
                if letter != main_letter:
                    for pos in np.where(align_array[:,column] == letter)[0]:
                        change = main_letter + str(column + 1) + letter
                        poly.setdefault(change, [])
                        poly[change].append('sp' + str(pos + 1))

If more than two differents letters in the column, then there is a non-single polymorphism.

    print("   " + "Single polymorphisms:")
    for change in single:
        print("     " + change + ": " + str(single[change])[1:-1])
    print("   " + "Non-single polymorphisms:")
    for change in poly:
        print("     " + change + ": " + str(poly[change])[1:-1])
# End of script

Run this script as python script.py >> output.txt to have the output printed in a file called output.txt.

ADD COMMENT
0
Entering edit mode

Hi @Rodrigo,

Thank you very much! This is incredibly useful. I just had to added .decode('UTF-8') when concatenating main_letter and letter as follows: change = main_letter.decode('UTF-8') + str(column + 1) + letter.decode('UTF-8'), because otherwise the script returns this error message: TypeError: can't concat numpy.bytes_ to str.

best

ADD REPLY

Login before adding your answer.

Traffic: 2962 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6