Question: How to trim nucleotide sequences to length divisible by 3
0
gravatar for al-ash
21 months ago by
al-ash110
Japan/Okinawa/OIST
al-ash110 wrote:

I want to strip predicted nucleotide protein coding sequences of the last 1 or 2 bases which represent incomplete codon.

The input is multifasta consisting of nucleotide protein-coding sequences in reading frame 1. Some of them are trimmed at various positions at the 3'end resulting in 1-2 3'end nucleotides that does not form a codon (these are ignored when I translate the nucleotide sequences to protein sequences; the extra nucleotides are then source of problems in my further workflow).

I tried to extract the longest ORF from the nucleotide sequences to overcome this, however, it does not work in all cases (in some of the input sequences there are in-frame stop codons; other drawback was that using e.g. hmmer2go I could not restrict the search for ORFs only in frame +1 and thus for some of the fastas, alternative ORFs in reverse sense, +2 or +3 reading frames were retrieved as the longest ORFs)

I'd like to ideally use tools such as awk (and its parameter % - modulo) to trim down the sequences from 3'end to the longest sequence divisible by 3 - however, I did not manage to make anything to work. Any other solutions are, however, most welcomed as well.

awk incomplete codon • 834 views
ADD COMMENTlink modified 21 months ago • written 21 months ago by al-ash110
3
gravatar for al-ash
21 months ago by
al-ash110
Japan/Okinawa/OIST
al-ash110 wrote:

I'm not sure what is the correct protocol for answering my own question.

In the end, my input were multiple fasta files each including one sequence. I used bioawk and the awk modulo function (%) to create 3 cathegories of sequences: with sequence length divisible by 3, divisible by 3 with remainder 1 and with remainder 2. Sequences divisible with remainder were trimmed from the end by one respectively two characters.

#!/bin/bash
workdir=mydir

for file in ${workdir}/nt_renamed/* 
do filebasename=$(basename "$file" )
# if modulo 3 of the sequence length is 0, print the fasta header and the sequence and output it to file 
bioawk -c fastx 'length($seq) % 3 == 0  {print ">"$name; print $seq}' $file  >  ${workdir}/nt_renamed_bioawked/${filebasename}
# if modulo 3 of the sequence length is 1 or 2, print the fasta header and the sequence, trim lines which does not start with ">" from its end by one resp. 2 characters
bioawk -c fastx 'length($seq) % 3 == 1  {print ">"$name; print $seq}' $file | sed '/^>/! s/.$//'  >> ${workdir}/nt_renamed_bioawked/${filebasename} 
bioawk -c fastx 'length($seq) % 3 == 2  {print ">"$name; print $seq}' $file | sed '/^>/! s/..$//'  >> ${workdir}/nt_renamed_bioawked/${filebasename} 
done
ADD COMMENTlink modified 21 months ago • written 21 months ago by al-ash110

This is perfect. I will toggle this answer as accepted to provide closure to this thread.

ADD REPLYlink written 21 months ago by genomax71k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1751 users visited in the last hour