The awk script assumes that the sequence following each FASTA header line spans only one line, which is usually not the case. As suggested by SES above,
bioawk is definitely a neat alternative.
A Perl-based approach would look like this:
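#!/usr/bin/perl
use strict; use warnings;
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
local $/ = ">";    # read one whole FASTA record per iteration
while (<>) {
    chomp;    # strip the trailing ">" that opens the next record
    next unless /\w/;    # skip the empty chunk before the first ">"
    my @chunk = split /\n/;
    my $header = shift @chunk;    # first line of the record is the header
    my $seqlen = length join "", @chunk;    # total length of the wrapped sequence lines
    print ">$_" if($seqlen >= $minlen);
}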
You would then invoke the script like so:
$ perl removesmalls.pl 200 contigs.fasta > contigs-l200.fasta
This solution makes use of the Perl input record separator ($/, which is a "\n" newline by default) and switches it to ">", so that when iterating through the FASTA file we read whole FASTA records rather than single lines. This addresses the issue I mentioned above.
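For what it's worth, the same record-separator trick also works in plain awk, so the one-line assumption can be avoided there too. The following is only a rough sketch, not the awk script from the other answer: filter_fasta is a hypothetical helper name I made up for illustration, and it assumes a well-formed FASTA file with Unix line endings and no ">" characters inside header lines.

```shell
# Sketch: keep FASTA records whose total sequence length is >= minlen.
# filter_fasta is a hypothetical helper name, not part of any existing tool.
filter_fasta() {
    minlen="$1"
    fasta="$2"
    awk -v minlen="$minlen" '
        BEGIN { RS = ">"; FS = "\n" }    # one record per FASTA entry, one field per line
        NR > 1 {                         # record 1 is the empty text before the first ">"
            seq = ""
            for (i = 2; i <= NF; i++) seq = seq $i    # join the wrapped sequence lines
            if (length(seq) >= minlen) {
                print ">" $1             # field 1 is the header line
                for (i = 2; i <= NF; i++) if ($i != "") print $i
            }
        }
    ' "$fasta"
}
```

You would call it much like the Perl script, e.g. filter_fasta 200 contigs.fasta > contigs-l200.fasta. The sequence lines are printed with their original wrapping.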