Extract Identical hits from pairwise global alignments
1
0
Entering edit mode
4.6 years ago
mnsp088 ▴ 100

Hi!

I am trying to perform 1:1 pairwise global alignments for a bunch of proteins in a fasta file. My end goal is to get a total number of identical hits (I.e. proteins which have a perfect match from start to end). So far I am using this code from Biopython to perform those alignments:

fasta = sys.argv[1]
with open(fasta, 'r') as f:
    seqs = []
    for line in f:
        if not line.startswith('>'):
            seqs.append(line.strip())

combos = itertools.combinations(seqs, 2)

    for k,v in combos:
        aln = pairwise2.align.globalxx(k,v)
        print(format_alignment(*aln[0]))

and it outputs several lines that look like this with a score:

KLARQYMTRARINVLIWPSCSPDLNLIENVWSVLKH----*
                          ||      |     |
--------------------------IE------K-IEQQ*
  Score=4

Is there a way to get the number of identical hits from this output? Or is there an entirely easier way to go about this problem? I am new to python so any tips are greatly appreciated. Thank you.

alignment protein biopython • 982 views
ADD COMMENT
0
Entering edit mode

If you're only interested in perfect matches, with the xx algorithm, this should be equivalent to finding out which alignments have a score == len(seq) I think. There's no penalty for gaps (IIRC), even if it does insert them.

You will just need to check that a gapped sequence cannot score identically to a full perfect match, then simply throw in a if aln[0][3] == len(k) check should do it I think.

I havent tested this at all though so I might be missing something.

In short, you need to know what the maximal score possible for a given sequence is (no gaps, perfect matches), given your scoring system, and then just check for alignments with that score.

ADD REPLY
3
Entering edit mode
4.6 years ago
Joe 21k

To illustrate my comment, here's some example input, code, and output:

Input testseqs.fa:

>1
ATACTAGCTACGATCGGGCTATATGCGCGCT
>2
ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG
>3
ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG
>4
TTCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTA

Note that 2 and 3 are identical.

With the following identicalseqs.py script invoked as python identicalseqs.py testseqs.fa:

import sys
from Bio import SeqIO
from Bio import pairwise2
import itertools

for k,v in itertools.combinations(SeqIO.parse(sys.argv[1], 'fasta'), 2):
    aln = pairwise2.align.globalxx(k.seq, v.seq)
    print(">{0} (Len = {1}) -> {2} (Len = {3}):".formatk.id, str(len(k.seq)), v.id, str(len(v.seq))))
    if aln[0][2] == len(k.seq):
        print(aln[0])
    else:
        print("Non-identical")

Yields the following output (change as required, this prints explicity which comparison is being made and the aln object tuple for the first alignment:

>1 (Len = 31) -> 2 (Len = 40):
Non-identical
>1 (Len = 31) -> 3 (Len = 40):
Non-identical
>1 (Len = 31) -> 4 (Len = 38):
Non-identical
>2 (Len = 40) -> 3 (Len = 40):
('ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG', 'ATCGCGATTGAGCGCGCTAGTGATGCCGCGCAATGCTAGG', 40.0, 0, 40)
>2 (Len = 40) -> 4 (Len = 38):
Non-identical
>3 (Len = 40) -> 4 (Len = 38):
Non-identical

If you wanted to be really robust about it, you could calculate Hamming distances on each pair of strings in aln[0][0] and aln[0][1] and they should be 0 if they're identical post-alignment.


Pro tip:

itertools is designed to do exactly what it says on the tin, work with iterators, so you can put this directly in to the itertools.combinations() call above (as I have), which means you dont have to have everything stored in memory at once - could be useful if your fasta is very large as n choose k combinations can get really large really quickly.

ADD COMMENT
0
Entering edit mode

This method worked perfectly. thank you for your thorough explanation and example code!

ADD REPLY
0
Entering edit mode

Joe Just a side question out of curiosity - can one use blastp to do this task? I know it is used for local alignments but can we modify the command line arguments to make it perform a global alignment?

ADD REPLY
1
Entering edit mode

You could, in theory, but probably shouldn't, make BLAST behave 'pseudo'-globally by reducing the penalties for gaps etc., but it will likely never perform the same as a true global aligner because of the seed strategy etc.

Your best bet would be to use a global aligner like needle - I think BLAT might also work but I forget whether it does global or not, but can be used as essentially a drop in replacement for BLAST. The approach above only works for the specific xx alignment above though, because the scoring matrix is simply 1 for a match, 0 for a mismatch, and gaps are unpenalised. If you switched to using a different scoring matrix, you'd have to know what those scores are and adjust accordingly, to whatever the theoretical maximum score for a sequence is.

ADD REPLY
0
Entering edit mode

That makes sense. thank you again!

ADD REPLY

Login before adding your answer.

Traffic: 2603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6