How To Count Amino Acids In Fasta Formated File?
4
3
Entering edit mode
11.6 years ago
8mazix ▴ 30

I need to count how many A, T, G and so on is in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIILXKKK


in this sequence theres:

M - 2
R - 2
G - 1
RR - 2
LL - 2
P - 1
II - 1
L - 1
X - 1
KKK - 1


for now on, I have a code that counts only single letters (output from my app: 1 sequences found {'G': 1, 'I': 2, 'M': 2, 'L': 3, 'P': 1, 'R': 4}):

def FASTA(filename):
try:
f = file(filename)
except IOError:
print "The file, %s, does not exist" % filename
return

order = []
sequences = {}
counts = {}

for line in f:
if line.startswith('>'):
name = line[1:].rstrip('\n')
name = name.replace('_', ' ')
order.append(name)
sequences[name] = ''
else:
sequences[name] += line.rstrip('\n').rstrip('*')
for aa in sequences[name]:
if aa in counts:
counts[aa] = counts[aa] + 1
else:
counts[aa] = 1

print "%d sequences found" % len(order)
print counts
return order, sequences

x, y = FASTA("file.fasta")


How to change it to be able to get the output above? I don't want to use any other libs, (I mean - I dont want to do BioPython for this, because I want to do it without BioPython;)

python biopython • 13k views
2
Entering edit mode

Just to clarify the poster's question. He actually wants counts of single amino acids AND consecutive amino acids. Not just single amino acids. It's actually a bit more tricky than just using the native python/biopython count functions.

1
Entering edit mode

It seems you are not using the real power of biopython. read fasta file first, create Seq object my_seq and assign sequence from fasta file and use my_seq.count("G"). I my opinion it is good to use Biopython.

0
Entering edit mode

Moved to "question" since this is clearly not a "tip".

4
Entering edit mode
11.6 years ago
Ketil 4.1k

Another solution is to use the standard Unix tools:

grep -v '^>' test.fasta | tr -d '\n' | sed -e 's/$$.$$/\1\n/g' | sort | uniq -c | sort -rn


:-)

0
Entering edit mode

Awesome :) thnks

0
Entering edit mode

That's genius. Even 5 years later!

0
Entering edit mode

I know this was post a while ago, but would you mind explaining this in terms of the commands in steps? I'm very new to unix so I'm not familiar with the language/ terms

4
Entering edit mode
11.6 years ago

Here's one solution using standard Python library: itertools. The code also works with multiple Fasta format.

from itertools import groupby

fh = open('file.fasta')

groups = (x[1] for x in groupby(fh, lambda l: l[0] == ">"))
for g in groups:
seq = ''.join(s.strip() for s in next(groups))
counts = {}
for s in [''.join(g) for k, g in groupby(seq)]:
if s not in counts: counts[s] = 0
counts[s]+=1
print counts


It gives you:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
{'G': 1, 'RR': 1, 'LL': 1, 'M': 2, 'L': 1, 'II': 1, 'P': 1, 'R': 2, 'KKK': 1, 'X': 1}

0
Entering edit mode

Nice. I haven't really looked into itertools since a lot of the useful functions are 2.6+. The groupby function looks extremely useful. I think I will definitely look into it now.

0
Entering edit mode
11.6 years ago

Try this. The getCount function will get you the single and consecutive amino acid counts

inFile = open('filePath','r')

def getCounts(seq):
current = seq[0]
counts = {}
for i in range(1,len(seq), 1):
if seq[i] != current[-1]:
if not counts.has_key(current):
counts[current] = 0
counts[current] += 1
current = seq[i]
else:
current += seq[i]
return counts

seq = ''
for line in inFile:
if line[0] == ">":
print header + "\t" + str(getCounts(seq))
seq = ''
else:
seq += line.strip()
print header + "\t" + str(getCounts(seq))

0
Entering edit mode

Hey Damian Kao this code count 'all' separated aac once(ie. you get all 'L' in the seq?) and then counts the doubles or triples consecutive LL for ex.??

Thanks

Traffic: 1494 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.