Question

How to calculate GC% at 1st, 2nd and 3rd codon position for each sequence in a multifasta

0

Entering edit mode

4.4 years ago

forni.giobbe • 0

Hello everybody,

I am trying to calculate per-sequence GC3% of the cds I obtained from my RNA-seq assemblies .. I thought it could be a quick & trivial task, but it does not appear to be so straight-forward!

One way of doing that could be to extract every codon position from my cds multifasta files and then calculate GC% for each sequence. I tried to extract the 3rd codon positions for my cds .fasta (using the one-liner below), but it actually takes ages!

while read line; do if echo $line | grep -v ">"; then echo $line | sed "s/(..)./\1/g"; else echo $line > 3rd_postion.fasta; fi; done < cds.fasta

Any suggestion?

Thanks!

Assembly 3rd codon position GC% cds • 1.7k views

ADD COMMENT • link updated 4.4 years ago by Joe 21k • written 4.4 years ago by forni.giobbe • 0

0

Entering edit mode

Please use dedicated tools for 3/6 frame translation such as ORF finder: https://www.ncbi.nlm.nih.gov/orffinder/ or CLI tools as mentioned here: Best command line ORF finder . If you have multifasta file and for each sequence, if you want calculate GC%, you can use tools such seqkit. fx2tab function output per sequence GC% in tabular format.

Please post example data and expected output from next time. It would help addressing the issue faster @ forni.giobbe

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

A blast from the past but codonw calculates per sequence GC3 and many other stats directly from the CDS fasta file http://codonw.sourceforge.net/#Downloading%20and%20Installation .

ADD REPLY • link 4.4 years ago by microfuge ★ 1.9k

0

Entering edit mode

Not really an answer but the reason your command is slow is that it launches and terminates both grep and sed for every single line of the .fasta file. In sed you can add a regex to switch on the replacement, so:

sed '/^>/!s/$..$./\1/g' cds.fasta

Where /^>/! means match all lines beginning with > but then invert the match (with ! acting like grep -v).

This only starts a single instance of sed and will be much faster. Still not sure it will get you the right answer for GC3%, though, especially if the FASTA file is line-wrapped at a length not a multiple of 3.

ADD REPLY • link 4.4 years ago by tim.booth ▴ 40