Percentage of each aminoacid in fasta file
3
0
Entering edit mode
9.9 years ago
j.cardeira • 0

Hello everyone,

I'm completely new to Python, but I have to do something that is pretty basic.

I have a fasta file from which I need to count the percentage of each aminoacid (in all sequences in the file).

How can I select only the sequences (not the names or descriptions) and turn it into a string or a list?

And how can I, for instance, select only one aminoacid (lets say the 10th) from each sequence?

Thanks everyone

fasta python aminoacid • 6.7k views
ADD COMMENT
1
Entering edit mode
9.9 years ago
xbello ▴ 20

I will use Biopython.

from Bio import SeqIO
seq_list = []
for record in SeqIO.parse("sample.fasta", "fasta")
    # The sequence is in the `seq` attribute of the record
    print record.seq
    # Print only the 10th character
    print record.seq[9]
    # Add the sequences to a list
    seq_list.append(record.seq)
# Turn the list into a string
single_seq = "".join(seq_list)
ADD COMMENT
0
Entering edit mode

Adding to this reply (which is the only one actually answering the OP's question about Python code), you can calculate the percentages directly without converting the sequences to a list and then operate on these.

  1. Create a dictionary that holds all twenty amino acids (and gaps if you need them) associated to 0s:

    aa_count = { 'A': 0, 'C': 0, ... etc etc...}
    
  2. Iterate over the sequences using SeqIO (by the way, here is a link to the tutorial page, which is full of examples), just like @xbello mentioned.
  3. For each sequence, iterate over it (it's a string) and simply increment the counts:

    sequence = record.seq
    for aa in sequence:
      aa_count[aa] += 1
    

At the end, you should have the total amount of each residue. The sum of the entire dictionary values gives you the total of amino acids in the entire file. Then calculate your averages. If you want to do this per sequence, just place the dictionary inside the for loop so that it is created at each iteration (i.e. each new sequence, all counters reset).

Hope it helps a bit. Also, if you are starting, have a look at this tutorial. Might help a bit.

ADD REPLY
1
Entering edit mode

Or if you feel lazy, use the letters already defined in Biopython:

from Bio.Alphabet.IUPAC import extended_protein

d = dict.fromkeys(list(extended_protein.letters), 0)

ADD REPLY
0
Entering edit mode
9.9 years ago

This can be done without python too. Awk is particularly fast for these stuffs.

With python you can easily implement like this but again, there are faster ways to do it especially if you use the re module or use stream read (io module)

I can give you an awk one liner (I am sorry for my obsession with awk):

awk 'BEGIN{FS=""} {#to concatenate the sequences\
if($0~/</){h=$0} else{seq[h]=seq[h]$10}} END{for(i in a){x=substr(a[i],10,1);y[x]++;z++} for(j in y){print j,y[j]/z}} yourfile.fa

or in the linux terminal (if you don't want to retain the sequence headers)

sed '/>/s/^.*/>/' yourfile.fa | tr -d "\n" | tr ">" "\n"| awk -F "" 'NR>1{a[$10]++;b++} END{for(i in a){print i,a[i]/b}}'
ADD COMMENT
0
Entering edit mode
9.9 years ago
5heikki 11k

Select only the sequences (not the names or descriptions) and turn it into a string:

grep -v '^>' test.fasta | tr -d "\n"

Select only one aminoacid (lets say the 10th) from each sequence:

grep -v '^>' test.fasta | cut -c10
ADD COMMENT
0
Entering edit mode

This won't work because all the lines will be concatenated to a single line; all you will get is a single amino acid. In the second case you will pick up 10th residue from every line!!

ADD REPLY
0
Entering edit mode

Which is literally what OP asked for.

ADD REPLY
0
Entering edit mode

What OP asked is to find out the percentage likelihood of an amino acid at a certain position. If you concatenate all sequences then there is only one sequence left; you will end up reporting the 10th residue of the first sequence in the fasta file!!

In the second case you'll end up picking up 10th residue from every line, not every sequence (a sequence may be folded in a fixed window which generates multiple lines per sequence).

ADD REPLY
0
Entering edit mode

First thing OP asked:

How can I select only the sequences (not the names or descriptions) and turn it into a string or a list?

grep -v '^>' test.fasta | tr -d "\n"

We get a concatenated string from which we can

count the percentage of each aminoacid

You're correct about the second part. I assumed no line breaks in sequences.

ADD REPLY

Login before adding your answer.

Traffic: 2727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6