7.5 years ago by
The awk script assumes that the sequence following each FASTA header line spans only one line, which is not normally the case. As suggested by @SES above, bioawk is definitely a neat alternative.
A Perl-based approach would look like this:
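use strict;
use warnings;
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
local $/ = ">";                 # read one whole FASTA record per iteration
while (<>) {
    chomp;                      # strip the trailing ">" separator
    next unless /\w/;           # skip the empty chunk before the first header
    my @chunk = split /\n/;
    my $header = shift @chunk;  # first line is the header, the rest is sequence
    my $seqlen = length join "", @chunk;
    print ">$_" if ($seqlen >= $minlen);
}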
You would then invoke the script like so:
perl removesmalls.pl 200 contigs.fasta > contigs-l200.fasta
This solution makes use of the Perl input record separator `$/` (which is a "\n" newline by default) and switches it to ">" so that, when iterating through the FASTA file, each read returns a whole FASTA record rather than a single line. This addresses the issue I mentioned above.