Question: How can I calculate the C:N ratio (or just number of carbons and nitrogens) of each amino acid sequence in a multifasta file?
17 months ago, kieft1bp (United States) wrote:

I have a multifasta file of amino acid sequences, around 1000 seqs total, like so:

  • > seq_id_1
  • MAWT........
  • > seq_id_2
  • MTRA.......
  • ....
  • > seq_id_1000
  • MIVE.......

I want to calculate the molar C:N ratio (number of total carbon atoms in each sequence divided by the number of total nitrogen atoms in each sequence) for all seq IDs and print a tsv file, like so:

  • seq_id_1 \t 1.5
  • seq_id_2 \t 0.9
  • ...
  • seq_id_1000 \t 1.1

This C:N ratio is derived from the number of carbon and nitrogen atoms in each amino acid residue (e.g., there are 5 Cs and 1 N in Methionine) and the number of each amino acid in the protein sequence. Is there a tool available that can do this, or do I have to write my own? I am fine with using a web server, a pre-written suite that runs on unix (mac, linux), or custom scripts from someone (python, perl, ruby). Thanks!

modified 17 months ago • written 17 months ago by kieft1bp

using awk:

awk '/^>/ {if(S>0) {print N==0?"NA":C/N;} C=0;N=0;S++;printf("%s\t",$0);next;} {t=$0; gsub(/[^Cc]/,"",t);C+=length(t);t=$0;gsub(/[^Nn]/,"",t);N+=length(t);} END{print N==0?"NA":C/N;}' in.fasta

modified 17 months ago • written 17 months ago by Pierre Lindenbaum

Thanks for the answer, Pierre, but the problem is a little more complicated than counting the instances of a string in each line. I've updated my question. My fasta sequences are just amino acids (with no information about carbon or nitrogen content), so what I actually need to do is reference a separate table that contains the number of carbon and nitrogen atoms per amino acid in order to calculate the C:N ratio for each sequence.

written 17 months ago by kieft1bp

There are only 20 standard amino acids, so it's fairly easy to build that table from the chemical formulas on Wikipedia, read it into a dictionary/hash, loop over your sequences, add up the Cs and Ns, and compute the ratio. That doesn't seem very complicated, or am I missing something?
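
For illustration, the dictionary approach described above could look roughly like the Python sketch below. It is only a sketch under some assumptions: the carbon and nitrogen counts come from the standard molecular formulas of the 20 amino acids (forming a peptide bond removes only water, so the per-residue C and N counts are the same as for the free amino acid), any other character (X, B, Z, * and so on) is silently skipped, and the names `ATOMS`, `read_fasta`, `cn_ratio` and `write_tsv` are made up for this example.

```python
import sys

# (carbon, nitrogen) atoms per amino acid, taken from the standard
# molecular formulas, e.g. methionine C5H11NO2S -> (5, 1).
ATOMS = {
    "A": (3, 1), "R": (6, 4), "N": (4, 2), "D": (4, 1), "C": (3, 1),
    "E": (5, 1), "Q": (5, 2), "G": (2, 1), "H": (6, 3), "I": (6, 1),
    "L": (6, 1), "K": (6, 2), "M": (5, 1), "F": (9, 1), "P": (5, 1),
    "S": (3, 1), "T": (4, 1), "W": (11, 2), "Y": (9, 1), "V": (5, 1),
}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cn_ratio(seq):
    """Return total C atoms / total N atoms, or None if no N was counted."""
    carbon = nitrogen = 0
    for aa in seq.upper():
        # Assumption: characters that are not in the table contribute nothing.
        c, n = ATOMS.get(aa, (0, 0))
        carbon += c
        nitrogen += n
    return carbon / nitrogen if nitrogen else None


def write_tsv(fasta_path, out=sys.stdout):
    """Write one 'seq_id<TAB>ratio' line per sequence in the FASTA file."""
    with open(fasta_path) as handle:
        for seq_id, seq in read_fasta(handle):
            ratio = cn_ratio(seq)
            value = "NA" if ratio is None else f"{ratio:.3f}"
            out.write(f"{seq_id}\t{value}\n")
```

Calling `write_tsv("in.fasta")` would print the table to standard output. Like the awk one-liner, it prints NA when no nitrogen is counted for a sequence.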

written 17 months ago by Carambakaracho

Yes, you're right. I was just wondering if there was a tool already that was written to solve the same task. Just trying not to reinvent the wheel.

written 17 months ago by kieft1bp

I'm not saying it doesn't exist, but if it takes you longer to search for a tool than to write it then the choice is easy :-)

written 17 months ago by WouterDeCoster