Question: Simple amino acid polymorphism count
0
gravatar for stutterdust
9 days ago by
UK
stutterdust0 wrote:

Hi, I am very new to shell scripting and am hoping someone could help me with the below problem I encountered in my bachelor's project.

I have a bunch of .fasta files, each containing three amino acid sequences, A, B and C.

I need a script that compares every amino acid position in A or B against those in C, then prints on new lines which positions in A/B 'mutate' relative to C. It should also preferably ignore any comparison where the identity of an amino acid residue is 'X'.

e.g.

Input .fasta

A: MEXRKVSDWNRI

B: MEARKVSFWNHI

C: MEARKVCDWNHI

Desired output

Sequence A:

C6S

H10R

Sequence B:

C6S

D7F

shell unix alignment • 101 views
ADD COMMENTlink modified 9 days ago by st.ph.n650 • written 9 days ago by stutterdust0
3
gravatar for st.ph.n
9 days ago by
st.ph.n650
Philadelphia, PA
st.ph.n650 wrote:

You mention needing/wanting to use bash, but if you tried something, you should show it along with any error messages.

Linearize your sequences:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < infile.fa > infile_single.fasta

Python:

#!/usr/bin/env python
import sys

def comp_seqs(a, b, i):
        mismatch = []
        mismatch.append(i)
        one = list(a)
        two = list(b)
        for n in range(len(one)):
                if one[n] != 'X' and two[n] != 'X':
                        if one[n] != '*' and two[n] != '*':
                                if one[n]  != '-' and two[n] != '-':
                                        if one[n] != two[n]:
                                                mismatch.append(two[n] + str(n+1) + one[n])
        return mismatch

with open(sys.argv[1], 'r') as f:
        myseqs = []
        for line in f:
                if line.startswith(">"):
                        myseqs.append((line.strip().split('>')[1].split(' ')[0], next(f).strip()))

avsc = comp_seqs(myseqs[0][1], myseqs[2][1], myseqs[0][0])
bvsc = comp_seqs(myseqs[1][1], myseqs[2][1], myseqs[1][0])

with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
        print >> out, avsc[0]
        for a in avsc[1:]:
                 print >> out, '\t', a
        print >> out, '\n'
        print >> out, bvsc[0]
        for b in bvsc[1:]:
                print >> out, '\t', b

Save as comp_seqs.py, run as python comp_seqs.py in_file.fasta. If you have lots of FASTA to compare, use a for loop in your terminal:

for file in *.fasta; do python comp_seqs.py $file; done
ADD COMMENTlink modified 8 days ago • written 9 days ago by st.ph.n650
1

Thanks for the python solution. I'm glad we share warm feelings for that language. But I have some feedback/comments on your code, feel free to let me know if you disagree.

I'm pretty sure the code below can be improved:

if one[n] != 'X' and two[n] != 'X':
        if one[n] != '*' and two[n] != '*':
            if one[n]  != '-' and two[n] != '-':
                if one[n] != two[n]:

What about something like:

if not one[n] in ['X', '*', '-'] and not two[n] in ['X', '*', '-'] and not one[n] == two[n]:

I would expect this is quicker, too.

And this part would benefit from a) a list comprehension and b) biopython (although additional dependency)

with open(sys.argv[1], 'r') as f:
        myseqs = []
        for line in f:
                if line.startswith(">"):
                        myseqs.append((line.strip().split('>')[1].split(' ')[0], next(f).strip()))

You could do something like:

from Bio import SeqIO
myseqs = [record.seq for record in SeqIO.parse(sys.argv[1], "fasta")]

This synthax for writing to an output file is discouraged:

with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
        print >> out, avsc[0]

Instead, you should use:

with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
        out.write(avsc[0])
ADD REPLYlink modified 6 days ago • written 6 days ago by WouterDeCoster15k
1

@WouterDeCoster, thanks for your feedback. I broke the if statements for checking the characters in the sequences down so hopefully the OP might have an easier learning curve in understanding the code. Your not and and statements are surely more pythonic. For this reason, I also didn't use biopython, so the OP can see how one can open a file, and read line by line. As far as writing to a file, I am using 2.7, is the syntax I used only discouraged in later versions, or across all of them? I understand to use out.write for versions 3+.

ADD REPLYlink written 6 days ago by st.ph.n650
1

+1 for writing educational code, but you could consider supplying both "versions" to show "easy" and "good" code.

As far as writing to a file, I am using 2.7, is the syntax I used only discouraged in later versions, or across all of them? I understand to use out.write for versions 3+.

Hmm. Possible. I can't find a reference for this.
But in my not so humble opinion, >> looks damn ugly and unpythonic.
But that's personal :-p

I would recommend writing python3 whenever possible, in the future python2 will get discontinued completely and all code containing old syntax will break. So since your code will be here on biostars forever, better write future-proof ;-)

ADD REPLYlink written 6 days ago by WouterDeCoster15k

Thank you so much for this, this seems like it would do what I need. However rather than using the A, B, C example names I should probably have clarified that each FASTA contains unique sequence names, e.g. the first two are always in the format

ZM[10-9999]M

ZM[10-9999]F

while the third is always Consensus_MF

Here is a link to a sample file.

How would I need to change the script to account for this?

ADD REPLYlink modified 8 days ago • written 8 days ago by stutterdust0

I updated it to handle different ids, based on the header in the fasta file. This solution works if there is always only 3 sequences per file, and you are always comparing 1 vs 3, 2 vs 3. Your example file also in nucleotides, and contains '-'.

ADD REPLYlink written 8 days ago by st.ph.n650

Apologies, seems I attached the wrong file. Here is the protein alignment. It still contains gaps (-) and stops (*) so I suppose these could be excluded along with X.

When running the script I got it to output TXT files but I encountered two issues:

  1. the amino acid position numbers are all off by -1

  2. the comparison seems to terminate after comparing only the first ~50 or so amino acids

e.g. for the attached protein alignment FASTA the result is

ZM1006M
R8K
R19K
M29Q
V34I
G48S


ZM1006F
R19K
M29Q
G48S
ADD REPLYlink written 8 days ago by stutterdust0
1

Updated to exclude '-', '*'. Python starts at 0, updated to show mismatch position starting from count 1. Your example file is multi-line fasta, use the above awk command to linearize each sequence.

ADD REPLYlink written 8 days ago by st.ph.n650

Thanks a bunch, works like a charm!

ADD REPLYlink written 8 days ago by stutterdust0

the amino acid position numbers are all off by -1

Welcome to bioinformatics.

ADD REPLYlink written 6 days ago by WouterDeCoster15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 994 users visited in the last hour