Percentage of each aminoacid in fasta file
3
0
Entering edit mode
10.0 years ago
j.cardeira • 0

Hello everyone,

I'm completely new to Python, but I have to do something that is pretty basic.

I have a fasta file from which I need to count the percentage of each aminoacid (in all sequences in the file).

How can I select only the sequences (not the names or descriptions) and turn it into a string or a list?

And how can I, for instance, select only one aminoacid (lets say the 10th) from each sequence?

Thanks everyone

fasta python aminoacid • 6.9k views
1
Entering edit mode
10.0 years ago
xbello ▴ 20

I will use Biopython.

from Bio import SeqIO
seq_list = []
for record in SeqIO.parse("sample.fasta", "fasta")
# The sequence is in the seq attribute of the record
print record.seq
# Print only the 10th character
print record.seq[9]
# Add the sequences to a list
seq_list.append(record.seq)
# Turn the list into a string
single_seq = "".join(seq_list)

0
Entering edit mode

Adding to this reply (which is the only one actually answering the OP's question about Python code), you can calculate the percentages directly without converting the sequences to a list and then operate on these.

1. Create a dictionary that holds all twenty amino acids (and gaps if you need them) associated to 0s:

aa_count = { 'A': 0, 'C': 0, ... etc etc...}

2. Iterate over the sequences using SeqIO (by the way, here is a link to the tutorial page, which is full of examples), just like @xbello mentioned.
3. For each sequence, iterate over it (it's a string) and simply increment the counts:

sequence = record.seq
for aa in sequence:
aa_count[aa] += 1


At the end, you should have the total amount of each residue. The sum of the entire dictionary values gives you the total of amino acids in the entire file. Then calculate your averages. If you want to do this per sequence, just place the dictionary inside the for loop so that it is created at each iteration (i.e. each new sequence, all counters reset).

Hope it helps a bit. Also, if you are starting, have a look at this tutorial. Might help a bit.

1
Entering edit mode

Or if you feel lazy, use the letters already defined in Biopython:

from Bio.Alphabet.IUPAC import extended_protein

d = dict.fromkeys(list(extended_protein.letters), 0)

0
Entering edit mode
10.0 years ago

This can be done without python too. Awk is particularly fast for these stuffs.

With python you can easily implement like this but again, there are faster ways to do it especially if you use the re module or use stream read (io module)

I can give you an awk one liner (I am sorry for my obsession with awk):

awk 'BEGIN{FS=""} {#to concatenate the sequences\
if($0~/</){h=$0} else{seq[h]=seq[h]$10}} END{for(i in a){x=substr(a[i],10,1);y[x]++;z++} for(j in y){print j,y[j]/z}} yourfile.fa  or in the linux terminal (if you don't want to retain the sequence headers) sed '/>/s/^.*/>/' yourfile.fa | tr -d "\n" | tr ">" "\n"| awk -F "" 'NR>1{a[$10]++;b++} END{for(i in a){print i,a[i]/b}}'

0
Entering edit mode
10.0 years ago
5heikki 11k

Select only the sequences (not the names or descriptions) and turn it into a string:

grep -v '^>' test.fasta | tr -d "\n"


Select only one aminoacid (lets say the 10th) from each sequence:

grep -v '^>' test.fasta | cut -c10

0
Entering edit mode

This won't work because all the lines will be concatenated to a single line; all you will get is a single amino acid. In the second case you will pick up 10th residue from every line!!

0
Entering edit mode

Which is literally what OP asked for.

0
Entering edit mode

What OP asked is to find out the percentage likelihood of an amino acid at a certain position. If you concatenate all sequences then there is only one sequence left; you will end up reporting the 10th residue of the first sequence in the fasta file!!

In the second case you'll end up picking up 10th residue from every line, not every sequence (a sequence may be folded in a fixed window which generates multiple lines per sequence).

0
Entering edit mode

First thing OP asked:

How can I select only the sequences (not the names or descriptions) and turn it into a string or a list?

grep -v '^>' test.fasta | tr -d "\n"


We get a concatenated string from which we can

count the percentage of each aminoacid

You're correct about the second part. I assumed no line breaks in sequences.