Simple amino acid polymorphism count
2
0
Entering edit mode
5.2 years ago

Hi, I am very new to shell scripting and am hoping someone could help me with the below problem I encountered in my bachelor's project.

I have a bunch of .fasta files, each containing three amino acid sequences, A, B and C.

I need a script that compares every amino acid position in A or B against those in C, then prints on new lines which positions in A/B 'mutate' relative to C. It should also preferably ignore any comparison where the identity of an amino acid residue is 'X'.

e.g.

Input .fasta

A: MEXRKVSDWNRI

B: MEARKVSFWNHI

C: MEARKVCDWNHI

Desired output

Sequence A:

C6S

H10R

Sequence B:

C6S

D7F

unix shell alignment • 2.3k views
4
Entering edit mode
5.2 years ago
st.ph.n ★ 2.7k

You mention needing/wanting to use bash, but if you tried something, you should show it along with any error messages.

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < infile.fa > infile_single.fasta


Python:

  #!/usr/bin/env python
import sys, itertools

def comp_seqs(a, b, i):
mismatch = []
mismatch.append(i)
one = list(a)
two = list(b)
for n in range(len(one)):
if one[n] != 'X' and two[n] != 'X':
if one[n] != '*' and two[n] != '*':
if one[n]  != '-' and two[n] != '-':
if one[n] != two[n]:
mismatch.append(two[n] + str(n+1) + one[n])
return mismatch

with open(sys.argv[1], 'r') as f:
myseqs = []
for line in f:
if line.startswith(">"):
myseqs.append((line.strip().split('>')[1].split(' ')[0], next(f).strip()))

target = myseqs[-1]
myseqs.remove(target)

with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
results = []
for i in myseqs:
vs = comp_seqs(i[1], target[1], i[0])
results.append(vs)
for x in itertools.izip_longest(*sorted(results), fillvalue=''):
print >> out, '\t'.join(str(i) for i in x)


Save as comp_seqs.py, run as python comp_seqs.py in_file.fasta. If you have lots of FASTA to compare, use a for loop in your terminal:

for file in *.fasta; do python comp_seqs.py \$file; done

2
Entering edit mode

Thanks for the python solution. I'm glad we share warm feelings for that language. But I have some feedback/comments on your code, feel free to let me know if you disagree.

I'm pretty sure the code below can be improved:

if one[n] != 'X' and two[n] != 'X':
if one[n] != '*' and two[n] != '*':
if one[n]  != '-' and two[n] != '-':
if one[n] != two[n]:


if not one[n] in ['X', '*', '-'] and not two[n] in ['X', '*', '-'] and not one[n] == two[n]:


I would expect this is quicker, too.

And this part would benefit from a) a list comprehension and b) biopython (although additional dependency)

with open(sys.argv[1], 'r') as f:
myseqs = []
for line in f:
if line.startswith(">"):
myseqs.append((line.strip().split('>')[1].split(' ')[0], next(f).strip()))


You could do something like:

from Bio import SeqIO
myseqs = [record.seq for record in SeqIO.parse(sys.argv[1], "fasta")]


This synthax for writing to an output file is discouraged:

with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
print >> out, avsc[0]


with open(sys.argv[1].split('.fasta')[0] + '_mm.txt', 'w') as out:
out.write(avsc[0])

1
Entering edit mode

@WouterDeCoster, thanks for your feedback. I broke the if statements for checking the characters in the sequences down so hopefully the OP might have an easier learning curve in understanding the code. Your not and and statements are surely more pythonic. For this reason, I also didn't use biopython, so the OP can see how one can open a file, and read line by line. As far as writing to a file, I am using 2.7, is the syntax I used only discouraged in later versions, or across all of them? I understand to use out.write for versions 3+.

1
Entering edit mode

+1 for writing educational code, but you could consider supplying both "versions" to show "easy" and "good" code.

As far as writing to a file, I am using 2.7, is the syntax I used only discouraged in later versions, or across all of them? I understand to use out.write for versions 3+.

Hmm. Possible. I can't find a reference for this.
But in my not so humble opinion, >> looks damn ugly and unpythonic.
But that's personal :-p

I would recommend writing python3 whenever possible, in the future python2 will get discontinued completely and all code containing old syntax will break. So since your code will be here on biostars forever, better write future-proof ;-)

0
Entering edit mode

Thank you so much for this, this seems like it would do what I need. However rather than using the A, B, C example names I should probably have clarified that each FASTA contains unique sequence names, e.g. the first two are always in the format

ZM[10-9999]M

ZM[10-9999]F

while the third is always Consensus_MF

Here is a link to a sample file.

How would I need to change the script to account for this?

0
Entering edit mode

I updated it to handle different ids, based on the header in the fasta file. This solution works if there is always only 3 sequences per file, and you are always comparing 1 vs 3, 2 vs 3. Your example file also in nucleotides, and contains '-'.

0
Entering edit mode

Apologies, seems I attached the wrong file. Here is the protein alignment. It still contains gaps (-) and stops (*) so I suppose these could be excluded along with X.

When running the script I got it to output TXT files but I encountered two issues:

1. the amino acid position numbers are all off by -1

2. the comparison seems to terminate after comparing only the first ~50 or so amino acids

e.g. for the attached protein alignment FASTA the result is

ZM1006M
R8K
R19K
M29Q
V34I
G48S

ZM1006F
R19K
M29Q
G48S

1
Entering edit mode

Updated to exclude '-', '*'. Python starts at 0, updated to show mismatch position starting from count 1. Your example file is multi-line fasta, use the above awk command to linearize each sequence.

0
Entering edit mode

Thanks a bunch, works like a charm!

0
Entering edit mode

the amino acid position numbers are all off by -1

Welcome to bioinformatics.

0
Entering edit mode

I was wondering if you could help me with a modification to the script, if it's not too much trouble.

Initially I was using separate input files in the below format:

>A1
Seq
>A2
Seq
>XX
Seq


However rather than making individual files I would rather use one super-file, e.g.:

>A1
Seq
>A2
Seq
>B1
Seq
>B2
Seq
[...]
>XX
Seq


Example file here. This will hold many more sequences but the last sequence (XX) will always be the comparator.

Ideally it would then successively print the results of every comparison into one output file, e.g. with the results of:

A1 vs XX
A2 vs XX
B1 vs XX
B2 vs XX
[...]


Alternatively, is it possible to print this directly into an Excel spreadsheet where the result of each comparison occupies one column?

0
Entering edit mode

so you want to compare all the items in a fasta to the last item in the fasta? Do you want each comparison, one per row, as in your result example? Or a single row, with each comparison in a column?

0
Entering edit mode

First question, yes. Second question, all the results are ultimately to go into Excel as shown below so I'm looking for whatever format will most easily facilitate that.

0
Entering edit mode

Ok. I updated the above answer for this. Although, a moderator may tell you you should've added a new question, and referenced this post. I tested on your example fasta.

0
Entering edit mode

Thank you, I'll keep that in mind. I tested the script and it seems to output first all the values for the -M headers in a series of columns, followed by those for all the -F headers. Would it be possible for the matching headers to be paired next to each other as shown in the image? e.g. M/F, M/F

0
Entering edit mode

Ok, updated. The order the script ran was dependent on the order of the sequences in the input file. I added a sort, so the headers will be F/M..F/M for each one. F/M b/c alphabetical.

0
Entering edit mode

Thanks, runs as expected. Would it also be possible to make a modification where, if it detects X, - or * in a position then it will not print that position in the corresponding site for the other cognate header?

e.g. an input of

>ZM0001M
ME*RK
>ZM0001F
MEVRK
>ZM0002M
MEFRK
>ZM0002F
MEFRK
>XX
MECRK


should print only

ZM0002M
C3F

ZM0002F
C3F


ZM0001F[C3V] being ignored because ZM0001M[3] = *

1
Entering edit mode

Since you stated this is for a bachelor's project, I think you should try to understand what each line of code is doing. Try writing pseudocode to help. At least you will learn something and know how it works, if you intend to use this analysis in your project.

For now, comment out out the 'if' statements (placing a octothorp (#) at the beginning of each line) in the comp_seqs function, so that all mismatches are outputted. You can then manually filter in Excel.

0
Entering edit mode

I removed the if statements and as expected all mismatches are printed, but I am not sure how the output can be filtered in Excel in order to accomplish the aforementioned. Or at least, it would require some inventive syntax. I had a go at a pseudo-function below:

=Display value IF (within range of column i+1 on the previous sheet, there is a cell containing a numeric string matching that of the reference cell AND matching cell does NOT contain X OR * OR -) OR (within range of column i+1 on the previous sheet, there is no cell containing a numeric string matching that of the reference cell AND reference cell does NOT contain X OR * OR -)

0
Entering edit mode

From the output of the program, you know the position of each mismatch, and each mismatch is printed in order that it was seen. Knowing the position and comparing two columns shouldn't be too difficult.