Prepare a GenBank submission: Can you turn this Perl script into a one-liner?
1
1
Entering edit mode
8 months ago

The GenBank assembly checker has become quite picky, not only can your contigs not have any hits to common contaminants, have to have min length of 200bp, or have no trailing or leading N's, it also has another condition:

There cannot be more than 5 N's in the 10 first and 10 last bases or more than 15 N's among the first or last 50 bp or the assembly will be rejected. The Perl/BioPerl script below drops such sequences, but because most of the pipeline can be built using seqkit, I thought it would be nice to remove the dependency on Perl and particularly BioPerl, BioPerl has become nearly impossible to install anyway. So let's take it as a code-golf and provide a one-liner in seqkit, sed, awk, etc. that does the same. This will be useful for anyone uploading assemblies to GenBank on a regular basis.

Thank you!

genbank-N-filter.pl:

#!/usr/bin/env perl

use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-fh => \*STDIN, '-format' => 'Fasta');
my $out = Bio::SeqIO->new(-format => 'Fasta');

while(my $seq = $seqio->next_seq) {
    my $n;
    my $sub10r = substr $seq->seq, -10; # note how subseq cannot get sequences from the end like substr
    my $sub10l = $seq->subseq (1, 10);
    my $sub50r = substr $seq->seq, -50;
    my $sub50l = $seq->subseq (1, 50);
    my $m = () = $sub10l =~ /[Nn]/gi;
    my $n = () = $sub10r =~ /[Nn]/gi;
    my $o = () = $sub50l =~ /[Nn]/gi;
    my $p = () = $sub50r =~ /[Nn]/gi;

    next if $m > 5 or $n > 5 or $o > 15 or $p > 15;
    $out->write_seq($seq);
}
__END__

Testdata (self-explaining headers):

>keep_1
TGTNTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTTNTTT
>keep_2
TGTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTNTNTNNNT
>reject_1
TGTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTnNTNTNNNT
>reject_2
TGTTNNTTNTTTTTTNNNNNnnNTTNNNTTTTTTTTTGTTNNTTNTTTTCTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTNTNTnNNT
>reject_3
TnTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTNTTATTNNCTTNNNNNNNTTTTCTANNNNNA
TTTTGTTTTTTTTTTTTNTNTT
>reject_4
TGNNNNTNNTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTTTTCTTATTTT
TATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTTGTTGTTAGCTTTTATTTTCA
TTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTTTTATTGTCTTTTTTTTTTTTTCTAAATTTC
TTTTGTTTTTTTTTTTTTNTTT

Output:

>keep_1
TGTNTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTT
TTCTTATTTTTATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTT
GTTGTTAGCTTTTATTTTCATTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTT
TTATTGTCTTTTTTTTTTTTTCTAAATTTCTTTTGTTTTTTTTTTTTTNTTT
>keep_2
TGTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTTTTCTTCTTTTTTTTTTTGTTT
TTCTTATTTTTATTTCTTTATTATTTATTTCTTGTATTTATATAAGTTTTTTTTATTTTT
GTTGTTAGCTTTTATTTTCATTCAATTGTGTTATTTACGATTATCGTGTTCTCGTTCCTT
TTATTGTCTTTTTTTTTTTTTCTAAATTTCTTTTGTTTTTTTTTTTTNTNTNNNT
assembly perl code-golf • 251 views
ADD COMMENT
1
Entering edit mode
8 months ago
GenoMax 110k

With bbduk.sh with this simple example.

There cannot be more than 5 N's in the 10 first and 10 last bases

bbduk.sh -Xmx2g in=new.fa out=stdout.fa maxns=5 restrictleft=10 restrictright=10

more than 15 N's among the first or last 50 bp

bbduk.sh -Xmx2g in=new.fa out=stdout.fa maxns=15 restrictleft=50 restrictright=50

Michael Dondrup : May want to provide a more realistic example via pastebin.com or some other means.

Also can be done as a two step process to start less restrictive at beginning and make it more stringent later by piping.

bbduk.sh -Xmx2g in=new.fa out=stdout.fa maxns=15 restrictleft=50 restrictright=50 | bbduk.sh -Xmx2g in=stdin.fa out=stdout.fa maxns=5 restrictleft=10 restrictright=10
ADD COMMENT

Login before adding your answer.

Traffic: 2010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6