Process to remove terminal Ns from fasta?
23 months ago
stacy734 ▴ 40

Can anyone recommend a tool or unix command line to remove terminal (leading or trailing) Ns from a fasta file?

fasta • 1.9k views

Thank you everyone.

This request was to help with submission of genome assemblies to GenBank. They ask that the terminal Ns (gaps) be removed from the ends of contigs. However, I have found that if you simply leave them in, GenBank removes them as part of its processing.


stacy734: While that may be the case, since you asked this question in the first place, could you please test the posted answers and accept any or all of those that work? This would benefit future users who find this thread by searching.


In fact, this no longer works, so you need one of the other solutions.

23 months ago

See if this works with seqkit to remove trailing Ns:

$ seqkit -is replace -p "n+$" -r "" test.fa


To remove leading Ns as well (as requested in the OP), try the following:

$ seqkit -is replace -p "^n+|n+$" -r "" test.fa


Try this with sed to remove leading and trailing n:

$ sed -r '/^>/! s/n+$|^n+//g' test.fa


Your sed command runs the risk of removing internal Ns that occur before/after line breaks in the sequence. To avoid that you need to linearize the sequence first. It also creates an undesirable empty line between the definition line and the sequence.

This revision avoids both problems. It linearizes each record (awk), converts it to a 2-line fasta record (tr), removes the leading and trailing Ns (sed), then re-wraps (fold) the long sequence line into lines of 80 nt. I match N rather than n because my Ns are typically capitalized; substitute [Nn] to handle both. Make sure the fold width is greater than the length of your longest definition line, otherwise fold will break those deflines at 80 characters too.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' my.fasta | tr "\t" "\n" | sed -r '/^>/! s/N+$|^N+//g' | fold -w 80 > my.Ntrimmed.fasta

23 months ago
Jianyu ▴ 530

Use seqkit subseq. Example:

seqkit subseq -r 6:-1 test.fa # remove the first 5 bases

OP does not want to remove a fixed number of bases but all the Ns, which I assume may be of variable length. seqkit can probably do that, but this is likely not the correct command.

I misunderstood the question. So you want to remove the terminal Ns, not ATCG? A quick thought:

awk '{if (/>.*/) {print} else { sub(/^N*/, ""); sub(/N*$/, ""); print}}' test.fa
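The linearize, trim, re-wrap idea discussed in this thread can also be sketched as a single awk program. This is only a minimal sketch, not a tested tool: the function name trim_terminal_ns is made up for illustration, it assumes Unix line endings, and it treats both N and n as gaps. Because it wraps only the sequence and prints the defline untouched, long deflines are not broken at the wrap width.

```shell
# Hypothetical helper (the name is illustrative only).
# Reads FASTA on stdin, joins each record into one sequence string,
# strips leading and trailing N/n, then re-wraps the sequence at the
# given width (default 80). Deflines are printed unchanged.
trim_terminal_ns() {
  awk -v width="${1:-80}" '
    function flush_record(    i) {
      if (header == "") return
      sub(/^[Nn]+/, "", seq)    # leading Ns of the whole record
      sub(/[Nn]+$/, "", seq)    # trailing Ns of the whole record
      print header
      for (i = 1; i <= length(seq); i += width)
        print substr(seq, i, width)
      header = ""; seq = ""
    }
    /^>/ { flush_record(); header = $0; seq = ""; next }
         { seq = seq $0 }
    END  { flush_record() }
  '
}

# Example call, using the file names from the pipeline in this thread:
# trim_terminal_ns 80 < my.fasta > my.Ntrimmed.fasta
```

Internal Ns that happen to sit next to a line break are kept, because trimming happens only after the whole record has been joined. A record that consists entirely of Ns comes out as a defline with no sequence, so you may want to check for that before submitting.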

23 months ago
zubenel ▴ 120

Another option, using a Perl one-liner:

perl -pe 's/^([ACGT][ACGTN]+?)N+$|(^N+)/$1/gi' test.fa


Apparently this command works only if the Ns are at the start OR at the end of a line. It does not work if there are Ns on both sides of the same line.