The GenBank assembly checker has become quite picky, not only can your contigs not have any hits to common contaminants, have to have min length of 200bp, or have no trailing or leading N's, it also has another condition:
There cannot be more than 5 N's in the 10 first and 10 last bases or more than 15 N's among the first or last 50 bp or the assembly will be rejected. The Perl/BioPerl script below drops such sequences, but because most of the pipeline can be built using seqkit, I thought it would be nice to remove the dependency on Perl and particularly BioPerl, BioPerl has become nearly impossible to install anyway. So let's take it as a code-golf and provide a one-liner in seqkit, sed, awk, etc. that does the same. This will be useful for anyone uploading assemblies to GenBank on a regular basis.
Thank you!
genbank-N-filter.pl
:
#!/usr/bin/env perl
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-fh => \*STDIN, '-format' => 'Fasta');
my $out = Bio::SeqIO->new(-format => 'Fasta');
while(my $seq = $seqio->next_seq) {
my $n;
my $sub10r = substr $seq->seq, -10; # note how subseq cannot get sequences from the end like substr
my $sub10l = $seq->subseq (1, 10);
my $sub50r = substr $seq->seq, -50;
my $sub50l = $seq->subseq (1, 50);
my $m = () = $sub10l =~ /[Nn]/gi;
my $n = () = $sub10r =~ /[Nn]/gi;
my $o = () = $sub50l =~ /[Nn]/gi;
my $p = () = $sub50r =~ /[Nn]/gi;
next if $m > 5 or $n > 5 or $o > 15 or $p > 15;
$out->write_seq($seq);
}
__END__
Testdata (self-explaining headers):
>keep_1
TGTNTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTTNTTT
>keep_2
TGTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTNTNTNNNT
>reject_1
TGTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTnNTNTNNNT
>reject_2
TGTTNNTTNTTTTTTNNNNNnnNTTNNNTTTTTTTTTGTTNNTTNTTTTCTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTNTNTnNNT
>reject_3
TnTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTNTTATTNNCTTNNNNNNNTTTTCTANNNNNA
TTTTGTTTTTTTTTTTTNTNTT
>reject_4
TGNNNNTNNTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTTNTTT
Output:
>keep_1
TGTNTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTT
TTCTTATTTTTATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTT
GTTGTTAGCTTTTATTTTCATTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTT
TTATTGTCTTTTTTTTTTTTTCTAAATTTCTTTTGTTTTTTTTTTTTTNTTT
>keep_2
TGTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTT
TTCTTATTTTTATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTT
GTTGTTAGCTTTTATTTTCATTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTT
TTATTGTCTTTTTTTTTTTTTCTAAATTTCTTTTGTTTTTTTTTTTTNTNTNNNT