Question: How to calculate GC% at 1st, 2nd and 3rd codon position for each sequence in a multifasta
0
gravatar for forni.giobbe
12 months ago by
forni.giobbe0 wrote:

Hello everybody,

I am trying to calculate per-sequence GC3% of the cds I obtained from my RNA-seq assemblies .. I thought it could be a quick & trivial task, but it does not appear to be so straight-forward!

One way of doing that could be to extract every codon position from my cds multifasta files and then calculate GC% for each sequence. I tried to extract the 3rd codon positions for my cds .fasta (using the one-liner below), but it actually takes ages!

while read line; do if echo $line | grep -v ">"; then echo $line | sed "s/(..)./\1/g"; else echo $line > 3rd_postion.fasta; fi; done < cds.fasta

Any suggestion?

Thanks!

ADD COMMENTlink modified 12 months ago by Joe18k • written 12 months ago by forni.giobbe0

Please use dedicated tools for 3/6 frame translation such as ORF finder: https://www.ncbi.nlm.nih.gov/orffinder/ or CLI tools as mentioned here: Best command line ORF finder . If you have multifasta file and for each sequence, if you want calculate GC%, you can use tools such seqkit. fx2tab function output per sequence GC% in tabular format.

Please post example data and expected output from next time. It would help addressing the issue faster @ forni.giobbe

ADD REPLYlink modified 12 months ago • written 12 months ago by cpad011214k

A blast from the past but codonw calculates per sequence GC3 and many other stats directly from the CDS fasta file http://codonw.sourceforge.net/#Downloading%20and%20Installation .

ADD REPLYlink written 12 months ago by microfuge1.9k

Not really an answer but the reason your command is slow is that it launches and terminates both grep and sed for every single line of the .fasta file. In sed you can add a regex to switch on the replacement, so:

sed '/^>/!s/\(..\)./\1/g' cds.fasta

Where /^>/! means match all lines beginning with > but then invert the match (with ! acting like grep -v).

This only starts a single instance of sed and will be much faster. Still not sure it will get you the right answer for GC3%, though, especially if the FASTA file is line-wrapped at a length not a multiple of 3.

ADD REPLYlink modified 12 months ago • written 12 months ago by tim.booth20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1043 users visited in the last hour