Question: Percentage of each aminoacid in fasta file
0
gravatar for j.cardeira
5.1 years ago by
j.cardeira0
Portugal
j.cardeira0 wrote:

Hello everyone,

I'm completely new to Python, but I have to do something that is pretty basic.

I have a fasta file from which I need to count the percentage of each aminoacid (in all sequences in the file).

How can I select only the sequences (not the names or descriptions) and turn it into a string or a list?

And how can I, for instance, select only one aminoacid (lets say the 10th) from each sequence?

Thanks everyone

aminoacid python fasta • 3.5k views
ADD COMMENTlink modified 5.1 years ago by xbello20 • written 5.1 years ago by j.cardeira0
1
gravatar for xbello
5.1 years ago by
xbello20
Spain
xbello20 wrote:

I will use Biopython.

    from Bio import SeqIO

    seq_list = []

    for record in SeqIO.parse("sample.fasta", "fasta")

        # The sequence is in the `seq` attribute of the record

        print record.seq

        # Print only the 10th character

        print record.seq[9]

        # Add the sequences to a list

        seq_list.append(record.seq)

    # Turn the list into a string

    single_seq = "".join(seq_list)  

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by xbello20

Adding to this reply (which is the only one actually answering the OP's question about Python code), you can calculate the percentages directly without converting the sequences to a list and then operate on these.

1. Create a dictionary that holds all twenty amino acids (and gaps if you need them) associated to 0s:

aa_count = { 'A': 0, 'C': 0, ... etc etc...}

2. Iterate over the sequences using SeqIO (by the way, here's a link to the tutorial page, which is full of examples), just like @xbello mentioned.

3. For each sequence, iterate over it (it's a string) and simply increment the counts:

sequence = record.seq

for aa in sequence:

    aa_count[aa] += 1

At the end, you should have the total amount of each residue. The sum of the entire dictionary values gives you the total of amino acids in the entire file. Then calculate your averages. If you want to do this per sequence, just place the dictionary inside the for loop so that it is created at each iteration (i.e. each new sequence, all counters reset). 

Hope it helps a bit. Also, if you are starting, have a look at this tutorial. Might help a bit.

ADD REPLYlink written 5.1 years ago by João Rodrigues2.5k
1

Or if you feel lazy, use the letters already defined in Biopython:

from Bio.Alphabet.IUPAC import extended_protein

d = dict.fromkeys(list(extended_protein.letters), 0)

ADD REPLYlink written 5.1 years ago by xbello20
0
gravatar for Bharat Iyengar
5.1 years ago by
Bombay, India
Bharat Iyengar260 wrote:

This can be done without python too. Awk is particularly fast for these stuffs.

With python you can easily implement like this but again, there are faster ways to do it especially if you use the re module or use stream read (io module)

I can give you an awk one liner (I am sorry for my obsession with awk):

awk 'BEGIN{FS=""} {#to concatenate the  sequences\

if($0~/</){h=$0} else{seq[h]=seq[h]$10}} END{for(i in a){x=substr(a[i],10,1);y[x]++;z++} for(j in y){print j,y[j]/z}} yourfile.fa

 

or in the linux terminal (if you don't want to retain the sequence headers)

sed '/>/s/^.*/>/' yourfile.fa | tr -d "\n" | tr ">" "\n"| awk -F "" 'NR>1{a[$10]++;b++} END{for(i in a){print i,a[i]/b}}'

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Bharat Iyengar260
0
gravatar for 5heikki
5.1 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

Select only the sequences (not the names or descriptions) and turn it into a string:

grep -v '^>' test.fasta | tr -d "\n"

 

Select only one aminoacid (lets say the 10th) from each sequence:

grep -v '^>' test.fasta | cut -c10

ADD COMMENTlink written 5.1 years ago by 5heikki8.4k

This won't work because all the lines will be concatenated to a single line; all you will get is a single amino acid. In the second case you will pick up 10th residue from every line!!
 

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Bharat Iyengar260

Which is literally what OP asked for.

ADD REPLYlink written 5.1 years ago by 5heikki8.4k

What OP asked is to find out the percentage likelihood of an amino acid at a certain position. If you concatenate all sequences then there is only one sequence left; you will end up reporting the 10th residue of the first sequence in the fasta file!!

In the second case you'll end up picking up 10th residue from every line, not every sequence (a sequence may be folded in a fixed window which generates multiple lines per sequence).

ADD REPLYlink written 5.1 years ago by Bharat Iyengar260

First thing OP asked:

How can I select only the sequences (not the names or descriptions) and turn it into a string or a list?

grep -v '^>' test.fasta | tr -d "\n"

We get a concatenated string from which we can

count the percentage of each aminoacid

You're correct about the second part. I assumed no line breaks in sequences.

 

ADD REPLYlink written 5.1 years ago by 5heikki8.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1459 users visited in the last hour