Question: How can I calculate the C:N ratio (or just number of carbons and nitrogens) of each amino acid sequence in a multifasta file?
kieft1bp (United States) wrote, 9 months ago:

I have a multifasta file of amino acid sequences, around 1000 seqs total, like so:

  • > seq_id_1
  • MAWT........
  • > seq_id_2
  • MTRA.......
  • ....
  • > seq_id_1000
  • MIVE.......

I want to calculate the molar C:N ratio (number of total carbon atoms in each sequence divided by the number of total nitrogen atoms in each sequence) for all seq IDs and print a tsv file, like so:

  • seq_id_1 \t 1.5
  • seq_id_2 \t 0.9
  • ...
  • seq_id_1000 \t 1.1

This C:N ratio is derived from the number of carbon and nitrogen atoms in each amino acid residue (e.g., there are 5 Cs and 1 N in Methionine) and the number of each amino acid in the protein sequence. Is there a tool available that can do this, or do I have to write my own? I am fine with using a web server, a pre-written suite that runs on unix (mac, linux), or custom scripts from someone (python, perl, ruby). Thanks!

modified 9 months ago • written 9 months ago by kieft1bp

Using awk:

awk '/^>/ {if(S>0) {print N==0?"NA":C/N;} C=0;N=0;S++;printf("%s\t",$0); ;next;} {t=$0; gsub(/[^Cc]/,"",t);C+=length(t);t=$0;gsub(/[^Nn]/,"",t);N+=length(t);} END{print N==0?"NA":C/N;}' in.fasta

modified 9 months ago • written 9 months ago by Pierre Lindenbaum

Thanks for the answer, Pierre, but the problem is a little more complicated than counting the instances of a string in each line. I've updated my question. My fasta sequences are just amino acids (with no information about carbon or nitrogen content), so what I actually need to do is reference a separate table that contains the number of carbon and nitrogen atoms per amino acid in order to calculate the C:N ratio for each sequence.

written 9 months ago by kieft1bp

There are 20 amino acids, so it's fairly easy to create that list from the chemical formulas on Wikipedia: read it into a dictionary/hash, loop over your sequences, add up the Cs and Ns, and compute the ratio. It doesn't seem very complicated, or am I missing something?

written 9 months ago by Carambakaracho

Yes, you're right. I was just wondering whether a tool had already been written for this task. Just trying not to reinvent the wheel.

written 9 months ago by kieft1bp

I'm not saying it doesn't exist, but if it takes you longer to search for a tool than to write it then the choice is easy :-)

written 9 months ago by WouterDeCoster
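
For reference, the dictionary approach suggested in the comments above (look up C and N counts per amino acid, loop over the sequences, sum, divide) could be sketched in Python roughly as follows. This is not code posted by anyone in the thread; the function names and the `ATOMS` table are illustrative. The counts come from the standard molecular formulas of the 20 amino acids, and since peptide bond formation only removes water, the C and N counts per residue are the same as for the free amino acid. One assumption to note: unknown or ambiguous symbols (X, B, Z, *, gaps) are silently skipped, which may or may not be what you want.

```python
import sys

# (carbon, nitrogen) atoms per amino acid, from the standard molecular formulas.
# Peptide bonds release only H2O, so these counts also hold for residues.
ATOMS = {
    "A": (3, 1), "R": (6, 4), "N": (4, 2), "D": (4, 1), "C": (3, 1),
    "E": (5, 1), "Q": (5, 2), "G": (2, 1), "H": (6, 3), "I": (6, 1),
    "L": (6, 1), "K": (6, 2), "M": (5, 1), "F": (9, 1), "P": (5, 1),
    "S": (3, 1), "T": (4, 1), "W": (11, 2), "Y": (9, 1), "V": (5, 1),
}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cn_counts(seq):
    """Return (total carbons, total nitrogens) for an amino acid sequence."""
    carbon = nitrogen = 0
    for aa in seq.upper():
        # Assumption: symbols not in the table (X, *, -, ...) contribute nothing.
        c, n = ATOMS.get(aa, (0, 0))
        carbon += c
        nitrogen += n
    return carbon, nitrogen


def main(path):
    """Print one 'seq_id<TAB>ratio' line per sequence in the FASTA file."""
    with open(path) as handle:
        for header, seq in read_fasta(handle):
            c, n = cn_counts(seq)
            ratio = f"{c / n:.3f}" if n else "NA"
            print(f"{header}\t{ratio}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a name of your choice (say `cn_ratio.py`, a hypothetical name), it would be run as `python cn_ratio.py in.fasta > cn_ratio.tsv`. If you need the raw counts rather than the ratio, `cn_counts` already returns them, so printing `c` and `n` as extra columns is a one-line change.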