Question: Extract Fasta Sequences Using Wild Card Search
4
gravatar for empyrean999
6.3 years ago by
empyrean999160
Canada
empyrean999160 wrote:

I wanted to extract fasta sequences using the wild card search

for example

grep "oxidase" input.fasta

The above statement provides all headers which match with oxidase and i had to extract the sequences as well.

For now how i was approaching is that i am writing above statements in to file and using some awk statements to remove unwanted characteres but with this approach, it becoming tiresome to execute all these commands as format keep changing. And using the file, i use below statement to extract sequences

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file

Does anyone have direct solution for this??

fasta perl unix awk • 3.9k views
ADD COMMENTlink modified 6.3 years ago by Martin A Hansen3.0k • written 6.3 years ago by empyrean999160

What do you mean by "direct solution"? (and what formats keep changing?)

ADD REPLYlink written 6.3 years ago by fo3c430

direct solution means that any script which takes wild card as input and provides output with sequences..

">testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein "

That was one of the header i have but the script works with fasta id only which is "testcomp2344c0_seq1 " . so i have to perform preprocessing to remove rest of it.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by empyrean999160
2
gravatar for awitney
6.3 years ago by
awitney30
awitney30 wrote:

I would use BioPerl for this. Something like (untested):

use Bio::SeqIO;

my $infile  = $ARGV[0];
my $string  = $ARGV[1];

my $seqin  = Bio::SeqIO->new( -format => 'fasta', -file => $infile );
my $seqout = Bio::SeqIO->new( -format => 'fasta', -fh =>\*STDOUT );

while ( my $seq = $seqin->next_seq() ) {
    my $id     = $seq->id;
    my $description     = $seq->description;

    if ( $id =~ m/$string/i ||  $description =~ m/$string/i ) {
        $seqout->write_seq($seq);
    }
}
ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by awitney30
3

Excellent use of Bio::SeqIO!

If I may make a few observations...

  • Always use strict; use warnings;.
  • This limits the OP to searching a single file at a time.
  • You can match directly on $seq->id and $seq->description.
  • It's best to quotemeta on $string.

Given the above, consider the following:

use strict;
use warnings;
use Bio::SeqIO;

my $string = pop;
my $seqout = Bio::SeqIO->new( -format => 'Fasta', -fh => \*STDOUT );

for my $inFile (@ARGV) {
    my $seqin = Bio::SeqIO->new( -format => 'Fasta', -file => $inFile );

    while ( my $seq = $seqin->next_seq() ) {

        if (   $seq->id =~ m/\Q$string\E/i
            or $seq->description =~ m/\Q$string\E/i )
        {
            $seqout->write_seq($seq);
        }
    }
}

Since the search term is always the last item sent to the script, you can pop it off @ARGV (implied if no argument is passed to pop). File names are left in @ARGV, so the above iterates through each. Depending upon the file system, one could perl script.pl * oxidase on a dir of fasta files--or just write out the separate files.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Kenosis1.2k

Yes I assumed he would add the use strict/warnings and the shebang line. I was just trying to give the OP an idea of how to do it, to be adjusted to his exact wishes. Thanks for the observations though.

ADD REPLYlink written 6.3 years ago by awitney30
2
gravatar for Kenosis
6.3 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Here's another option:

use strict;
use warnings;

my $term = join '|', map "\Q$_\E", split ' ', pop;
my $found;

while (<>) {
    if (/^>/) {
        $found = /$term/i ? 1 : 0;
        next;
    }
    print if $found;
}

Usage: perl script.pl file(s) 'searchTerm [searchTerm]' [>outFile]

Sample fasta:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK

Output searching on 'oxidase':

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM

Entering space-delimited search terms acts as an OR search. Remove next; if you want to print the header.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Kenosis1.2k
1

Cool script, works as described. BTW, add another:

print if $found;

right above 'next' to obtain entire FASTA records, including ids.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Israel Barrantes740
2
gravatar for Martin A Hansen
6.3 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Biopieces (www.biopieces.org) are really good at this kind of exercise:

Simple pattern matching of the sequence name:

read_fasta -i in.fna | grab -p <my pattern> -k SEQ_NAME | write_fasta -o out.fna -x

Using Regex:

read_fasta -i in.fna | grab -r <my pattern> -k SEQ_NAME | write_fasta -o out.fna -x

grab does a bunch of clever things - check it out.

ADD COMMENTlink written 6.3 years ago by Martin A Hansen3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1085 users visited in the last hour