I have a multifasta file of amino acid sequences, around 1000 seqs total, like so:
- > seq_id_1
- MAWT........
- > seq_id_2
- MTRA.......
- ....
- > seq_id_1000
- MIVE.......
I want to calculate the molar C:N ratio (number of total carbon atoms in each sequence divided by the number of total nitrogen atoms in each sequence) for all seq IDs and print a tsv file, like so:
- seq_id_1 \t 1.5
- seq_id_2 \t 0.9
- ...
- seq_id_1000 \t 1.1
This C:N ratio is derived from the number of carbon and nitrogen atoms in each amino acid residue (e.g., there are 5 Cs and 1 N in Methionine) and the number of each amino acid in the protein sequence. Is there a tool available that can do this, or do I have to write my own? I am fine with using a web server, a pre-written suite that runs on unix (mac, linux), or custom scripts from someone (python, perl, ruby). Thanks!
using awk:
Thanks for the answer, Pierre, but the problem is a little more complicated than counting the instances of a string in each line. I've updated my question. My fasta sequences are just amino acids (with no information about carbon or nitrogen content), so what I actually need to do is reference a separate table that contains the number of carbon and nitrogen atoms per amino acid in order to calculate the C:N ratio for each sequence.
There are 20 amino acids, so it's fairly easy to create that list from the chemical formulas on Wikipedia, read it into a dictionary/hash, loop over your sequences, add up the Cs and Ns, and compute the ratio. It doesn't seem very complicated, or am I missing something?
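The dictionary approach described above could be sketched in Python roughly as follows. This is only a minimal sketch, not an existing tool: the names `ATOMS`, `read_fasta`, `cn_ratio` and `write_tsv` are made up for illustration, the C and N counts come from the standard molecular formulas of the 20 amino acids (peptide-bond formation only removes water, so the counts per residue are the same), and skipping any character outside those 20 letters (e.g. `X` or `*`) is an assumption you may want to change.

```python
import sys

# (carbon, nitrogen) atom counts for the 20 standard amino acids,
# taken from their molecular formulas (e.g. Met = C5H11NO2S -> 5 C, 1 N).
ATOMS = {
    "A": (3, 1), "R": (6, 4), "N": (4, 2), "D": (4, 1), "C": (3, 1),
    "E": (5, 1), "Q": (5, 2), "G": (2, 1), "H": (6, 3), "I": (6, 1),
    "L": (6, 1), "K": (6, 2), "M": (5, 1), "F": (9, 1), "P": (5, 1),
    "S": (3, 1), "T": (4, 1), "W": (11, 2), "Y": (9, 1), "V": (5, 1),
}


def read_fasta(handle):
    """Yield (seq_id, sequence) pairs from an open multi-FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited word of the header as the ID.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def cn_ratio(seq):
    """Return total C atoms / total N atoms for an amino acid sequence."""
    carbon = nitrogen = 0
    for aa in seq.upper():
        # Assumption: letters outside the 20 standard residues are ignored.
        if aa in ATOMS:
            c, n = ATOMS[aa]
            carbon += c
            nitrogen += n
    return carbon / nitrogen if nitrogen else float("nan")


def write_tsv(fasta_path, out=sys.stdout):
    """Write one 'seq_id<TAB>ratio' line per sequence in the FASTA file."""
    with open(fasta_path) as handle:
        for name, seq in read_fasta(handle):
            out.write(f"{name}\t{cn_ratio(seq):.3f}\n")
```

Calling `write_tsv("proteins.fasta")` (the file name is a placeholder) would print the table to standard output, which can then be redirected to a `.tsv` file.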
Yes, you're right. I was just wondering if there was a tool already that was written to solve the same task. Just trying not to reinvent the wheel.
I'm not saying it doesn't exist, but if it takes you longer to search for a tool than to write it then the choice is easy :-)