Hello everybody,
I am trying to calculate per-sequence GC3% of the cds I obtained from my RNA-seq assemblies .. I thought it could be a quick & trivial task, but it does not appear to be so straight-forward!
One way of doing that could be to extract every codon position from my cds multifasta files and then calculate GC% for each sequence. I tried to extract the 3rd codon positions for my cds .fasta (using the one-liner below), but it actually takes ages!
while read line; do if echo $line | grep -v ">"; then echo $line | sed "s/(..)./\1/g"; else echo $line > 3rd_postion.fasta; fi; done < cds.fasta
Any suggestion?
Thanks!
Please use dedicated tools for 3/6 frame translation such as ORF finder: https://www.ncbi.nlm.nih.gov/orffinder/ or CLI tools as mentioned here: Best command line ORF finder . If you have multifasta file and for each sequence, if you want calculate GC%, you can use tools such seqkit. fx2tab function output per sequence GC% in tabular format.
Please post example data and expected output from next time. It would help addressing the issue faster @ forni.giobbe
A blast from the past but codonw calculates per sequence GC3 and many other stats directly from the CDS fasta file http://codonw.sourceforge.net/#Downloading%20and%20Installation .
Not really an answer but the reason your command is slow is that it launches and terminates both
grepandsedfor every single line of the .fasta file. Insedyou can add a regex to switch on the replacement, so:sed '/^>/!s/\(..\)./\1/g' cds.fastaWhere
/^>/!means match all lines beginning with>but then invert the match (with!acting likegrep -v).This only starts a single instance of
sedand will be much faster. Still not sure it will get you the right answer for GC3%, though, especially if the FASTA file is line-wrapped at a length not a multiple of 3.