Question: How to filter a fasta file?
3.3 years ago by
oussama.badad10 wrote:

Dear All

I am trying to remove the sequences with no functional information from a functional annotation fasta file

>Oeu043107.1|g7lir3_medtr alpha beta-hydrolase superfamily protein os=medicago truncatula gn=mtr_8g086260 pe=4 sv=1

I was wondering if someone could help me with a Python script or a shell script.

Thank you


You could convert to single-line FASTA and use a simple grep to remove the records whose header contains --NA--, or use Biopython to have more control.
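A rough sketch of the linearize-then-filter idea in plain Python (standard library only), in case you would rather not chain shell tools. The names `linearize`, `drop_unannotated` and `filter_file` are made up for this example, and the marker is assumed to be `--NA--`, which as a substring also matches the longer `---NA---`; adjust it to whatever your headers actually contain.

```python
def linearize(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_unannotated(lines, marker="--NA--"):
    """Keep only records whose header does not contain the marker."""
    for header, seq in linearize(lines):
        if marker not in header:
            yield header, seq


def filter_file(in_path, out_path, marker="--NA--"):
    """Write the kept records to out_path as single-line FASTA."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in drop_unannotated(src, marker):
            dst.write(header + "\n" + seq + "\n")
```

Calling `filter_file("input.fasta", "filtered.fasta")` would then write only the annotated records, one sequence per line. The file names here are placeholders.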

3.3 years ago by
United States
novice920 wrote:

Too easy bro:


use strict; use warnings;

my $print;

while (<>) {
    # On each header line, decide whether to print the record that follows
    $print = m/---NA---/ ? 0 : 1 if m/>/;
    print if $print;
}

Usage: $ perl filter.pl sequences.fasta > filtered.fasta (filter.pl being whatever name you saved the script under), or ./ ... if you made the script executable.
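Since the question also asked for Python, here is roughly the same streaming idea as a sketch: a flag is set on each header line and every following line is kept or dropped accordingly, so wrapped sequences work and nothing is held in memory. The function name `stream_filter` is invented for this example, and unlike the Perl version it only treats a line as a header when `>` is the first character.

```python
def stream_filter(lines, marker="---NA---"):
    """Yield the lines of every record whose header lacks the marker."""
    keep = False  # nothing is kept until the first good header is seen
    for line in lines:
        if line.startswith(">"):
            # Header line: decide the fate of the whole record
            keep = marker not in line
        if keep:
            yield line
```

To use it as a filter on standard input, something like `sys.stdout.writelines(stream_filter(sys.stdin))` after `import sys` should do.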


Since I've got time on my hands and like Perl, I wrote you another script. This one "slurps" the entire file into memory, so it's best avoided if your file is huge.


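use strict; use warnings;

# Read the whole input at once by undefining the record separator
my $whole = do { local $/; <> };

# Split into records on '>', drop those marked ---NA---,
# then put the '>' back in front of each kept record
my @keep = map { s/\A([^>])/>$1/; $_ }
    grep { ! m/---NA---/ } split />/, $whole;

print @keep;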


3.3 years ago by
Philadelphia, PA wrote:

Since you mentioned you wanted python:


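#!/usr/bin/env python

import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

headers = []
seqs = []

with open(input_file, 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(">"):
            # New record: store the header and start an empty sequence
            headers.append(line)
            seqs.append('')
        elif headers:
            # Sequence line: append it to the current record
            seqs[-1] += line

# Map each header to its sequence
myseqs = dict(zip(headers, seqs))

with open(output_file, 'w') as out:
    for m in myseqs:
        # Skip records with no functional annotation
        if '---NA---' not in m:
            out.write(m + '\n' + myseqs[m] + '\n')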

Usage: python filter_fasta.py input.fasta output.fasta (filter_fasta.py being whatever name you saved the script under)




