Question

(Closed)Exract variable protein sequence from multiple seqece alinment?

0

Entering edit mode

4.5 years ago

MSRS ▴ 590

I have multiple sequence alignment file in fasta format.

enter image description here

I want to extract the result like this with respect to reference sequence A,

B I4E
B V7F
C V7F
C F10W

How can I be able to extract? Thanks in Advance.

alignment • 1.3k views

ADD COMMENT • link updated 15 months ago by Ram 44k • written 4.5 years ago by MSRS ▴ 590

0

Entering edit mode

A: Tools to enumerate amino acid mutations per site?

ADD REPLY • link 4.5 years ago by GenoMax 146k

0

Entering edit mode

Thank you very much. I need the information about the sequence headers name in addition to sequence position.

How can I modify that code?

ADD REPLY • link 4.5 years ago by MSRS ▴ 590

Istvan Albert · Accepted Answer · 2020-04-05

3

Entering edit mode

4.5 years ago

Mensur Dlakic ★ 28k

Let's save the three sequences below in file seqs.fas:

>A
MDIIFWVSTFVLF
>B
MDIEFWFSTFVLF
>C
MDIIFWFSTWVLF

Next, save the code below as pairwise_custom.py and run:

python pairwise_custom.py seqs.fas

The output will be:

 B I4E
 B V7F
 C V7F
 C F10W

The code is a modified version from here - please upvote that post.

#!/usr/bin/env python
# Count how many gaps and mismatches using pairwise2 alignment
# python pairwise_custom.py your_fasta_file.fasta

import itertools, sys
from Bio import SeqIO, pairwise2
from Bio.SubsMat import MatrixInfo as matlist

def my_format_alignment(w, align1, align2, score, begin, end):
    z = 1
    for a, b in zip(align1[begin:end], align2[begin:end]):
        if a == b:
            pass
        elif a == "-" or b == "-":
            print(' ' + IDs[w + 1] + ' ' + a + str(z) + b)
        else:
            print(' ' + IDs[w + 1] + ' ' + a + str(z) + b)
        z += 1

FastaFile = open(sys.argv[1], 'rU')

SEQs = []
IDs = []
for rec in SeqIO.parse(FastaFile, 'fasta'):
    IDs.append rec.id)
    SEQs.append(rec.seq)

combos = itertools.combinations(SEQs, 2)

z = 0
for k, v in combos:
    if z < (len(SEQs) - 1):
        aln = pairwise2.align.globalds(k, v, matlist.blosum62, -10, -1)
        my_format_alignment(z, *aln[0])
    z += 1

ADD COMMENT • link updated 4.5 years ago by Istvan Albert 101k • written 4.5 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

There is a bug of some sort in code-rendering procedure. This line:

IDs.append followed by ( followed by rec.id)

is rendered as:

IDs.appendrec.id)

No matter what I do it can't be fixed, so you will have to make that change manually after copying the code.

ADD REPLY • link 4.5 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

what a weird error indeed. We'll look into ASAP. Probably some from a markdown processing step.

ADD REPLY • link 4.5 years ago by Istvan Albert 101k

0

Entering edit mode

I have identified the problem, it comes from the bleach html santizing package by Mozilla, it is an older version that we use. It looks like it happens when the code starts with a parenthesis and contains .id in the second part of the parenthesis). As always the hardest bugs are those that we did not create ourselves. Now I could update the library, and that would fix this behavior, but then in the meantime the TLS versions have also been updated, and I cannot connect to PyPI to get the new package until a rebuild OpenSSL. This latter would affect all software on the server, hence it is a major undertaking. I have to make sure nothing else breaks when I update OpenSSL. We are also readying the next version of Biostar - we'll see which one makes it in first.

ADD REPLY • link 4.5 years ago by Istvan Albert 101k

0

Entering edit mode

Thank you very much.

ADD REPLY • link 4.5 years ago by MSRS ▴ 590

score 2 · Accepted Answer · 2020-04-04

A generic suggestion, assuming your sequences are the same length and highly similar: make a multiple alignment - with any suitable program - which will most likely be completely gapless. Then make a script in your programming language of choice that will compare A with other sequences one column at a time.

If you want a concrete suggestion, use pairwise alignment functionality in Biopython: compare A to all other sequences in pairwise fashion and print columns where the content is different between the two sequences.