Question: How to filter a fasta file?
oussama.badad wrote (3.3 years ago):

Dear All

I am trying to remove the sequences that have no functional information from a functional annotation FASTA file, for example:

>Oeu043104.1|---NA---
MIESNFWDACWPHCLLRVLLSLLAESASQPLCPPLQRYNPKYLEDDYGVNQATEWLFYTPRDRENEIENIRNGVAVDGY
>Oeu043107.1|g7lir3_medtr alpha beta-hydrolase superfamily protein os=medicago truncatula gn=mtr_8g086260 pe=4 sv=1
MSTTQGTSPRGNINVKDEPDHLLVLVHGIMGSPSDWTYFEADLKRRLGKRFLIYASSCNTYTKTFTGIDGAGKRLAEEVMEIVRNTESLKKISFLAHSLG
GLFSRYAIAVLYMPNTSSDDSSVIAGSTNTSLKTSCYSNTGLIAGLEPSNFITLATPHLGVRGKKQVNPFSIILIDGPVLPFLLGLPFLEKIAAPLAPIF
TGRTGSQLFLTDGQPDRPPLLLRMASDCKDGKFVSALGAFRCRLLYANVSYDHMVGWRTSSIRRETELIKPPLQSLDGYKHVVSVEYCPPVSSEGPHFPE
EAAKAKQAAQNEPNNQNTVEYHETMEEEMIRGLQRLGWKKVDVSFHSAFWPFFAHNNINVKNEWLYNAGVGVVAHVADNIKQQENQQGSTYVAASL

I was wondering if someone could help me with a Python script or a shell script.

Thank you

Oussama


You could convert the file to single-line FASTA (each sequence on one line) and use a simple grep to remove every ---NA--- header together with the sequence line that follows it, or use Biopython to have more control.

(comment by geek_y, 3.3 years ago)
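
The single-line idea from the comment above can be sketched in plain Python, without grep or Biopython. This is only an illustrative sketch: the function names `read_fasta` and `drop_unannotated` are made up here, and the input is a tiny invented example, not real data. Each record is first collapsed to a header plus one sequence line, and records whose header contains the marker are then skipped.

```python
import io


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_unannotated(lines, marker="---NA---"):
    """Return single-line FASTA text without records whose header has the marker."""
    return "".join(
        f"{header}\n{seq}\n"
        for header, seq in read_fasta(lines)
        if marker not in header
    )


# Invented input for illustration only.
example = io.StringIO(
    ">seq1|---NA---\nMIES\n"
    ">seq2|some annotation\nMSTT\nGLFS\n"
)
print(drop_unannotated(example), end="")
```

With a real file, an open file handle can be passed in place of the `io.StringIO` object. Note that the output is written one sequence per line, so the original line wrapping is not preserved.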
novice (United States) wrote, 3.3 years ago:

Too easy bro:

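#!/usr/bin/perl
use strict; use warnings;

my $print;    # true while we are inside a record we want to keep

while (<>) {
    # On each header line, decide whether to keep the record that follows.
    $print = m/---NA---/ ? 0 : 1 if m/^>/;
    print if $print;
}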

 

Usage: perl filter.pl sequences.fasta > filtered.fasta (or ./filter.pl ... if you made the script executable).

 

Since I've got time on my hands and like Perl, I wrote you another script. This one "slurps" the entire file into memory, so it's best avoided if your file is huge.

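#!/usr/bin/perl
use strict; use warnings;

# Read the whole file into one string.
my $whole = do { local $/; <> };

# Split into records at each '>' that starts a line, drop the empty leading
# chunk and any record mentioning ---NA---, then restore the '>' prefix.
my @keep = map { ">$_" }
    grep { length($_) && ! m/---NA---/ } split /^>/m, $whole;

print @keep;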

 

st.ph.n (Philadelphia, PA) wrote, 3.3 years ago:

Since you mentioned you wanted Python:

Save the following as 'parse_fasta.py':

#!/usr/bin/env python
# Copy a FASTA file, skipping every record whose header contains '---NA---'.

import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file) as f, open(output_file, 'w') as out:
    keep = False  # whether the current record should be written
    for line in f:
        if line.startswith('>'):
            # A header starts a new record; decide once per record.
            keep = '---NA---' not in line
        if keep:
            # Sequence lines follow the decision made for their header,
            # so sequences wrapped over several lines stay intact.
            out.write(line)

Usage: python parse_fasta.py input.fasta output.fasta
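
Whichever script is used, it may be worth checking the result by counting headers before and after filtering. The sketch below is one possible way to do that; `count_headers` is a hypothetical helper introduced here, and the lines are invented stand-ins for the input and output files.

```python
def count_headers(lines, marker="---NA---"):
    """Return (total_records, records_with_marker) for an iterable of FASTA lines."""
    total = flagged = 0
    for line in lines:
        if line.startswith(">"):
            total += 1
            if marker in line:
                flagged += 1
    return total, flagged


# Invented stand-ins for input.fasta and output.fasta.
before = [">a|---NA---\n", "AAA\n", ">b|kinase\n", "CCC\n"]
after = [">b|kinase\n", "CCC\n"]

total_before, flagged_before = count_headers(before)
total_after, flagged_after = count_headers(after)

# The filtered file should contain no flagged records, and it should have
# lost exactly as many records as were flagged in the input.
print(total_before, flagged_before, total_after, flagged_after)
```

For real data, open file handles can be passed instead of lists, for example `count_headers(open("input.fasta"))`.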

 

        

 
