Question

How To Count Amino Acids In Fasta Formated File?

3

Entering edit mode

11.2 years ago

8mazix ▴ 30

I need to count how many A, T, G and so on is in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIILXKKK

in this sequence theres:

M - 2
R - 2
G - 1
RR - 2
LL - 2
P - 1
II - 1
L - 1
X - 1
KKK - 1

for now on, I have a code that counts only single letters (output from my app: 1 sequences found {'G': 1, 'I': 2, 'M': 2, 'L': 3, 'P': 1, 'R': 4}):

def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}
  counts = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')
      for aa in sequences[name]:
        if aa in counts:
            counts[aa] = counts[aa] + 1
        else:
            counts[aa] = 1  

  print "%d sequences found" % len(order)
  print counts
  return order, sequences

x, y = FASTA("file.fasta")

How to change it to be able to get the output above? I don't want to use any other libs, (I mean - I dont want to do BioPython for this, because I want to do it without BioPython;)

python biopython • 13k views

ADD COMMENT • link updated 11.2 years ago by Andrzej Zielezinski 11k • written 11.2 years ago by 8mazix ▴ 30

2

Entering edit mode

Just to clarify the poster's question. He actually wants counts of single amino acids AND consecutive amino acids. Not just single amino acids. It's actually a bit more tricky than just using the native python/biopython count functions.

ADD REPLY • link 11.2 years ago by Damian Kao 16k

1

Entering edit mode

It seems you are not using the real power of biopython. read fasta file first, create Seq object my_seq and assign sequence from fasta file and use my_seq.count("G"). I my opinion it is good to use Biopython.

ADD REPLY • link 11.2 years ago by samsara ▴ 630

0

Entering edit mode

Moved to "question" since this is clearly not a "tip".

ADD REPLY • link 11.2 years ago by Neilfws 49k

score 4 · Answer 1 · 2013-01-23

4

Entering edit mode

11.2 years ago

Ketil 4.1k

Another solution is to use the standard Unix tools:

grep -v '^>' test.fasta | tr -d '\n' | sed -e 's/\(.\)/\1\n/g' | sort | uniq -c | sort -rn

:-)

ADD COMMENT • link 11.2 years ago by Ketil 4.1k

0

Entering edit mode

Awesome :) thnks

ADD REPLY • link 11.2 years ago by Rahul Sharma ▴ 660

0

Entering edit mode

That's genius. Even 5 years later!

ADD REPLY • link 5.8 years ago by JDK_92 • 0

0

Entering edit mode

I know this was post a while ago, but would you mind explaining this in terms of the commands in steps? I'm very new to unix so I'm not familiar with the language/ terms

ADD REPLY • link 5.1 years ago by a.f.mcdonnell • 0

score 4 · Answer 2 · 2013-01-23

Here's one solution using standard Python library: itertools. The code also works with multiple Fasta format.

from itertools import groupby

fh = open('file.fasta')

groups = (x[1] for x in groupby(fh, lambda l: l[0] == ">"))
for g in groups:
    header = next(g).strip()
    seq = ''.join(s.strip() for s in next(groups))
    counts = {}
    for s in [''.join(g) for k, g in groupby(seq)]:
        if s not in counts: counts[s] = 0
        counts[s]+=1
    print header
    print counts

It gives you:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
{'G': 1, 'RR': 1, 'LL': 1, 'M': 2, 'L': 1, 'II': 1, 'P': 1, 'R': 2, 'KKK': 1, 'X': 1}

score 0 · Answer 3 · 2013-01-23

Try this. The getCount function will get you the single and consecutive amino acid counts

inFile = open('filePath','r')

def getCounts(seq):
    current = seq[0]
    counts = {}
    for i in range(1,len(seq), 1):
        if seq[i] != current[-1]:
            if not counts.has_key(current):
                counts[current] = 0
            counts[current] += 1
            current = seq[i]
        else:
            current += seq[i]
    return counts

header = ''
seq = ''
for line in inFile:
    if line[0] == ">":
        if header != '':
            print header + "\t" + str(getCounts(seq))
        header = line.strip()[1:]
        seq = ''
    else:
        seq += line.strip()
print header + "\t" + str(getCounts(seq))