(Closed)Exract variable protein sequence from multiple seqece alinment?
2
0
Entering edit mode
4.1 years ago
MSRS ▴ 580

I have multiple sequence alignment file in fasta format.

enter image description here

I want to extract the result like this with respect to reference sequence A,

B I4E
B V7F
C V7F
C F10W

How can I be able to extract? Thanks in Advance.

alignment • 1.2k views
ADD COMMENT
0
0
Entering edit mode

Thank you very much. I need the information about the sequence headers name in addition to sequence position.

How can I modify that code?

ADD REPLY
3
Entering edit mode
4.1 years ago
Mensur Dlakic ★ 27k

Let's save the three sequences below in file seqs.fas:

>A
MDIIFWVSTFVLF
>B
MDIEFWFSTFVLF
>C
MDIIFWFSTWVLF

Next, save the code below as pairwise_custom.py and run:

python pairwise_custom.py seqs.fas

The output will be:

 B I4E
 B V7F
 C V7F
 C F10W

The code is a modified version from here - please upvote that post.

#!/usr/bin/env python
# Count how many gaps and mismatches using pairwise2 alignment
# python pairwise_custom.py your_fasta_file.fasta

import itertools, sys
from Bio import SeqIO, pairwise2
from Bio.SubsMat import MatrixInfo as matlist

def my_format_alignment(w, align1, align2, score, begin, end):
    z = 1
    for a, b in zip(align1[begin:end], align2[begin:end]):
        if a == b:
            pass
        elif a == "-" or b == "-":
            print(' ' + IDs[w + 1] + ' ' + a + str(z) + b)
        else:
            print(' ' + IDs[w + 1] + ' ' + a + str(z) + b)
        z += 1

FastaFile = open(sys.argv[1], 'rU')

SEQs = []
IDs = []
for rec in SeqIO.parse(FastaFile, 'fasta'):
    IDs.append rec.id)
    SEQs.append(rec.seq)

combos = itertools.combinations(SEQs, 2)

z = 0
for k, v in combos:
    if z < (len(SEQs) - 1):
        aln = pairwise2.align.globalds(k, v, matlist.blosum62, -10, -1)
        my_format_alignment(z, *aln[0])
    z += 1
ADD COMMENT
0
Entering edit mode

There is a bug of some sort in code-rendering procedure. This line:

IDs.append followed by ( followed by rec.id)

is rendered as:

IDs.appendrec.id)

No matter what I do it can't be fixed, so you will have to make that change manually after copying the code.

ADD REPLY
0
Entering edit mode

what a weird error indeed. We'll look into ASAP. Probably some from a markdown processing step.

ADD REPLY
0
Entering edit mode

I have identified the problem, it comes from the bleach html santizing package by Mozilla, it is an older version that we use. It looks like it happens when the code starts with a parenthesis and contains .id in the second part of the parenthesis). As always the hardest bugs are those that we did not create ourselves. Now I could update the library, and that would fix this behavior, but then in the meantime the TLS versions have also been updated, and I cannot connect to PyPI to get the new package until a rebuild OpenSSL. This latter would affect all software on the server, hence it is a major undertaking. I have to make sure nothing else breaks when I update OpenSSL. We are also readying the next version of Biostar - we'll see which one makes it in first.

ADD REPLY
0
Entering edit mode

Thank you very much.

ADD REPLY
2
Entering edit mode
4.1 years ago
Mensur Dlakic ★ 27k

A generic suggestion, assuming your sequences are the same length and highly similar: make a multiple alignment - with any suitable program - which will most likely be completely gapless. Then make a script in your programming language of choice that will compare A with other sequences one column at a time.

If you want a concrete suggestion, use pairwise alignment functionality in Biopython: compare A to all other sequences in pairwise fashion and print columns where the content is different between the two sequences.

ADD COMMENT

Login before adding your answer.

Traffic: 2308 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6