Question

How to filter a fasta file?


9.5 years ago

oussama.badad ▴ 10

Dear All

I am trying to remove the sequences with no functional information (the ones tagged ---NA---) from a functional annotation FASTA file:

>Oeu043104.1|---NA---
MIESNFWDACWPHCLLRVLLSLLAESASQPLCPPLQRYNPKYLEDDYGVNQATEWLFYTPRDRENEIENIRNGVAVDGY
>Oeu043107.1|g7lir3_medtr alpha beta-hydrolase superfamily protein os=medicago truncatula gn=mtr_8g086260 pe=4 sv=1
MSTTQGTSPRGNINVKDEPDHLLVLVHGIMGSPSDWTYFEADLKRRLGKRFLIYASSCNTYTKTFTGIDGAGKRLAEEVMEIVRNTESLKKISFLAHSLG
GLFSRYAIAVLYMPNTSSDDSSVIAGSTNTSLKTSCYSNTGLIAGLEPSNFITLATPHLGVRGKKQVNPFSIILIDGPVLPFLLGLPFLEKIAAPLAPIF
TGRTGSQLFLTDGQPDRPPLLLRMASDCKDGKFVSALGAFRCRLLYANVSYDHMVGWRTSSIRRETELIKPPLQSLDGYKHVVSVEYCPPVSSEGPHFPE
EAAKAKQAAQNEPNNQNTVEYHETMEEEMIRGLQRLGWKKVDVSFHSAFWPFFAHNNINVKNEWLYNAGVGVVAHVADNIKQQENQQGSTYVAASL

I was wondering if someone could help me with a Python script or a shell script.

Thank you
Oussama

fasta sequence • 3.8k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.5 years ago by oussama.badad ▴ 10


What have you tried?

Take some thoughts from these posts:

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 9.5 years ago by Sukhi Singh 11k


You could convert to single-line FASTA and use a simple grep to remove the ---NA--- headers together with the sequence line that follows each of them, or use Biopython to have more control.

ADD REPLY • link 9.5 years ago by GouthamAtla 12k
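
For anyone who would rather stay in plain Python than mix grep and Biopython, here is a minimal sketch of the single-line idea from the comment above: join each wrapped sequence into one string, then drop the records whose header carries the marker. The function names `linearize_fasta` and `drop_unannotated` are made up for this illustration, and the example records are shortened stand-ins for the ones in the question, not real data.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines into one string."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record in the input.
    if header is not None:
        yield header, "".join(chunks)


def drop_unannotated(records, marker="---NA---"):
    """Keep only the records whose header does not contain the marker."""
    return [(h, s) for h, s in records if marker not in h]


# Shortened, hypothetical records in the same style as the question.
example = [
    ">Oeu043104.1|---NA---",
    "MIESNFWDAC",
    ">Oeu043107.1|alpha beta-hydrolase superfamily protein",
    "MSTTQGTSPR",
    "GLFSRYAIAV",
]

kept = drop_unannotated(linearize_fasta(example))
for header, seq in kept:
    print(header)
    print(seq)
```

Each kept record comes out as exactly two lines (header, then the full sequence), which is the layout a line-based tool like grep needs.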

Ram · Answer 1 · 2016-01-25

Since you mentioned you wanted python:

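Save the following as 'parse_fasta.py':

#!/usr/bin/env python3

import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

# Collect (header, sequence lines) records in file order.
# A sequence may be wrapped over several lines, so gather all of them.
records = []

with open(input_file, 'r') as f:
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            records.append((line, []))
        elif line and records:
            records[-1][1].append(line)

# Write back only the records whose header is not tagged ---NA---.
with open(output_file, 'w') as out:
    for header, seq_lines in records:
        if '---NA---' not in header:
            out.write('\n'.join([header] + seq_lines) + '\n')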

Usage:

python parse_fasta.py input.fasta output.fasta

Ram · Answer 2 · 2016-01-25

Too easy bro:

#!/usr/bin/perl

use strict; use warnings;

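my $print;

while (<>) {
    # On each header line, decide whether to print this record:
    # skip it if the header contains ---NA---, keep it otherwise.
    $print = m/---NA---/ ? 0 : 1 if m/^>/;
    print if $print;
}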

Usage: perl filter.pl sequences.fasta > filtered.fasta, or ./filter.pl ... if you have made the script executable.
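
Since the question asked for Python, the same line-by-line flag logic can be sketched there too. This is an illustrative port, not a tested drop-in replacement: the name `filter_fasta_lines` and the `marker` parameter are assumptions of this sketch, and the demo lines are made-up short records.

```python
def filter_fasta_lines(lines, marker="---NA---"):
    """Yield only the lines that belong to records whose header lacks the marker."""
    keep = False
    for line in lines:
        # A header line decides the fate of every line up to the next header.
        if line.startswith(">"):
            keep = marker not in line
        if keep:
            yield line


# Hypothetical input lines, with newlines left in place as when reading a file.
demo = [">a|---NA---\n", "MIES\n", ">b|some protein\n", "MSTT\n", "GLFS\n"]
result = list(filter_fasta_lines(demo))
print("".join(result), end="")
```

Because it is a generator, it never holds more than one line in memory. Passing it an open file (or sys.stdin) and handing the result to sys.stdout.writelines would make it behave like the Perl filter above.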

Since I've got time on my hands and like Perl, I wrote you another script. This one "slurps" the entire file into memory, so it is probably best avoided if your file is huge.

#!/usr/bin/perl

use strict; use warnings;

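# Read the whole input as one string by locally undefining the record separator.
my $whole = do { local $/; <> };
# Split into records at each ">", drop the records containing ---NA---,
# then put the ">" back in front of every non-empty record.
my @keep = map { s/\A([^>])/>$1/; $_ }
    grep { ! m/---NA---/; } split />/, $whole;
print @keep;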
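
For readers who find the map/grep/split chain hard to follow, here is a rough Python rendering of the same slurp-and-split idea. It is only a sketch: `filter_fasta_text` is a made-up name, and unlike the Perl version it discards anything that appears before the first ">" in the input.

```python
def filter_fasta_text(text, marker="---NA---"):
    """Return the FASTA text with every record containing the marker removed."""
    # Splitting on ">" leaves whatever precedes the first header at index 0;
    # this sketch drops that piece.
    records = text.split(">")[1:]
    # Re-attach the ">" that split() consumed and keep the unmarked records.
    return "".join(">" + record for record in records if marker not in record)


# Hypothetical in-memory FASTA text.
text = ">a|---NA---\nMIES\n>b|x\nMSTT\nGLFS\n"
filtered = filter_fasta_text(text)
print(filtered, end="")
```

As with the Perl script, the whole file has to fit in memory, and a ">" inside a header line would be treated as a record boundary.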