Process to remove terminal Ns from fasta?
3
0
4.3 years ago
stacy734 ▴ 40

Can anyone recommend a tool or unix command line to remove terminal (leading or trailing) Ns from a fasta file?

Thanks in advance for any advice.

fasta • 3.8k views
0

Thank you everyone.

This request was to help with the submission of genome assemblies to GenBank. They ask that terminal Ns (gaps) be removed from the ends of contigs. However, I have found that if you simply leave them in, GenBank will remove them as part of its own processing.

0

stacy734: While that may be the case, since you asked this question in the first place, can you please test the posted answers and accept any/all of those that work? This would benefit future users who find this thread by searching.

0

In fact, this no longer works, so you need one of the other solutions.

3
4.3 years ago

See if this works with seqkit to remove trailing Ns:

$ seqkit -is replace -p "n+$" -r "" test.fa

To remove leading Ns as well (as mentioned in the OP), try the following:

$ seqkit -is replace -p "^n+|n+$" -r "" test.fa

Try this with sed to remove leading and trailing Ns (GNU sed; the I flag makes the match case-insensitive, like -i in the seqkit commands):

$ sed -r '/^>/! s/n+$|^n+//gI' test.fa
0

Your sed command runs the risk of removing internal Ns that occur before/after line breaks in the sequence. To avoid that you need to linearize the sequence first. It also creates an undesirable empty line between the definition line and the sequence.
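
To see the problem concretely, here is a small sketch with a made-up record (contig1 and its sequence are invented for illustration) in which an internal run of Ns straddles a line break:

```shell
# Hypothetical wrapped record: the real sequence is ACGTNNNNACGT,
# so it has four internal Ns and no terminal Ns at all.
wrapped=$(printf '>contig1\nACGTNN\nNNACGT\n')

# Per-line trimming treats every line end as if it were a sequence end.
per_line=$(printf '%s\n' "$wrapped" | sed -E '/^>/!s/^N+|N+$//g')

printf '%s\n' "$per_line"
```

Both sequence lines come back as ACGT, so the four-N gap inside the contig is silently lost.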

This revision avoids all that. It linearizes each record (awk), converts it to a 2-line fasta record (tr), removes the leading and trailing Ns (sed; I use N instead of n because my Ns are typically capitalized, and you can handle both by substituting [Nn]), and then reformats (fold) the long sequence line into lines of 80 nt. Make sure the width is longer than your longest definition line; otherwise fold will break those deflines at 80 characters too.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' my.fasta |
  tr "\t" "\n" |
  sed -r '/^>/! s/N+$|^N+//g' |
  fold -w 80 > my.Ntrimmed.fasta
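
If the defline-length caveat for fold is a concern, the same idea can probably be done in a single awk program that wraps only the sequence and leaves deflines alone. This is a minimal sketch, assuming plain fasta input with Unix line endings; trim_terminal_ns and its optional width argument are names made up here, not part of any tool.

```shell
# trim_terminal_ns [width]: hypothetical helper, reads fasta on stdin and
# writes fasta on stdout with leading/trailing Ns removed from each record.
trim_terminal_ns() {
  awk -v width="${1:-80}" '
    # Emit the buffered record: trim Ns at both ends, then wrap the sequence.
    function emit(    i) {
      if (header == "") return
      sub(/^[Nn]+/, "", seq)
      sub(/[Nn]+$/, "", seq)
      print header
      for (i = 1; i <= length(seq); i += width)
        print substr(seq, i, width)
    }
    /^>/ { emit(); header = $0; seq = ""; next }  # new record starts
         { seq = seq $0 }                         # linearize the sequence
    END  { emit() }
  '
}
```

Usage would be something like: trim_terminal_ns 80 < my.fasta > my.Ntrimmed.fasta. Because the record is joined before trimming, Ns that merely sit at a line break are kept.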
0
4.3 years ago
Jianyu ▴ 580

Use seqkit subseq, for example:

seqkit subseq -r 6:-1 test.fa # remove the first 5 bases
0

OP does not want to remove a fixed number of bases but all terminal Ns, which I assume may be of variable length. seqkit can probably do that, but this is likely not the correct command.

1

I misunderstood the question. So the goal is to remove the terminal Ns, not ATCG? A quick thought (it works line by line, so it assumes each sequence is on a single line):

awk '/^>/ {print; next} {sub(/^[Nn]+/, ""); sub(/[Nn]+$/, ""); print}' test.fa
0
4.3 years ago
zubenel ▴ 120

Another option, using a Perl one-liner:

perl -pe 's/^([ACGT][ACGTN]+?)N+$|(^N+)/$1/gi' test.fa

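Apparently this command works only if the Ns are at the start OR at the end of a line. It does not work if there are Ns at both ends: once the leading Ns have been removed, ^ can no longer match, so the first alternative never gets to strip the trailing ones.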
