Question: Perl How To Isolate Fasta Sequences With A Specific Keyword
0
8.1 years ago by Raghul (200), Italy
Raghul wrote:

Hi all, I have a text file with about 8,000 sequences, split roughly equally between two keywords: "FULL-LENGTH" and "NON-FULL-LENGTH". I want to extract only the sequences with the keyword "FULL-LENGTH"; I don't want the sequences marked "NON-FULL-LENGTH". Can anybody suggest a Perl program for this problem?

>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC

>isotig07106 NON-FULL-LENGTH (BLAST)
TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA
aGAATTTGAaCCAAAAATTCATAGAtCTGTtATTAAGTCATGTGCTAAaT

Thank you, Raghul

Hi to neilfws and all others. Thanks for the response. I wrote the code below; please correct it. Can you suggest tutorial links or textbooks for Bioperl?

#!/usr/bin/perl -w
use strict;

use Bio::SeqIO;

# Read FASTA records from the input file
my $seqin = Bio::SeqIO->new(-file => "euplotes.txt", -format => "fasta");

# Write the selected records to a new FASTA file
my $seqout = Bio::SeqIO->new(-file => ">outfile.txt", -format => "fasta");

while (my $seq = $seqin->next_seq) {
  # Keep only records whose description starts with FULL-LENGTH
  if ($seq->desc =~ /^FULL-LENGTH\s+/) {
    $seqout->write_seq($seq);
  }
}

perl fasta sequence retrieval • 2.1k views
modified 8.1 years ago by Echo (70) • written 8.1 years ago by Raghul (200)


Your code looks fine. As for Bioperl, everything you need is right there on the website - http://www.bioperl.org/wiki/Main_Page. Link to tutorials - http://www.bioperl.org/wiki/Tutorials.

written 8.0 years ago by Neilfws (48k)
2
8.1 years ago by Pierre Lindenbaum (119k), France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

My answer uses awk, not Perl. If a line starting with '>' contains "NON-FULL-LENGTH", then don't print that header or the sequence lines that follow it.

cat biostar7114.fasta | awk '/^>/   {
    ok=(index($0,"NON-FULL-LENGTH")==0);
    if(ok) print $0;
    next;
    }
    {
    if(ok) print $0;
    }'
>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
ATAaTTAAAGCATGAATATCGTGAGTTCCTTCGTATGTGTTTACAGTTTC
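
For readers who want the same flag-based idea in Perl, here is a minimal sketch. It is not part of the awk answer above: the helper name filter_fasta_lines is hypothetical, and the sample records are shortened to the first sequence line of each entry in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the lines of every record whose header
# does not contain the excluded keyword.
sub filter_fasta_lines {
    my ($exclude, @lines) = @_;
    my $keep = 0;    # true while inside a wanted record
    my @out;
    for my $line (@lines) {
        if ($line =~ /^>/) {
            # New header: decide once for the whole record
            $keep = (index($line, $exclude) == -1);
        }
        push @out, $line if $keep;
    }
    return @out;
}

# Sample records taken from the question (one sequence line each)
my @sample = (
    ">isotig07104 FULL-LENGTH (BLAST)\n",
    "GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT\n",
    ">isotig07106 NON-FULL-LENGTH (BLAST)\n",
    "TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA\n",
);

print filter_fasta_lines("NON-FULL-LENGTH", @sample);
```

To run it on a real file, the sample list could be replaced with lines read from a filehandle. Like the awk version, it decides per header line and carries that decision over the sequence lines that follow.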
written 8.1 years ago by Pierre Lindenbaum (119k)
1
8.1 years ago by Neilfws (48k), Sydney, Australia
Neilfws wrote:

You can use my answer to your previous question as a starting point.

In this case, the Bioperl method to use is $seq->desc. In your sample sequence fasta header:

>isotig07104 FULL-LENGTH (BLAST)

the description is everything after the first space. Since you are looking for descriptions that begin with "FULL-LENGTH", something like:

if ($seq->desc =~ /^FULL-LENGTH\s+/) {
  # write sequence to file as per my previous answer
}

should work.
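
To see why the anchored pattern separates the two kinds of header, here is a small core-Perl sketch that does not need Bioperl. The helper split_header is hypothetical and only approximates the ID/description split described above; it is not how Bio::SeqIO is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header into the ID (text up to the
# first whitespace) and the description (everything after it).
sub split_header {
    my ($header) = @_;
    $header =~ s/^>//;
    my ($id, $desc) = split /\s+/, $header, 2;
    return ($id, defined $desc ? $desc : "");
}

for my $header (">isotig07104 FULL-LENGTH (BLAST)",
                ">isotig07106 NON-FULL-LENGTH (BLAST)") {
    my ($id, $desc) = split_header($header);
    # ^ anchors the match at the start of the description, so
    # "NON-FULL-LENGTH" fails even though it contains the keyword
    my $verdict = $desc =~ /^FULL-LENGTH\s+/ ? "keep" : "skip";
    print "$id\t$verdict\n";
}
```

Without the ^ anchor, both descriptions would match, because "NON-FULL-LENGTH" contains "FULL-LENGTH" as a substring.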

I suggest investing some effort in learning at least a few of the Bioperl methods (and regular expressions); it will make solving these "variations on a theme" problems very easy indeed.

modified 8.1 years ago • written 8.1 years ago by Neilfws (48k)

I did as you suggested and am giving the code below; please correct it.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $seqin  = Bio::SeqIO->new(-file => "inputfile.txt", -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">full_length.txt", -format => "fasta");

while (my $seq = $seqin->next_seq) {
  # Write only records whose description starts with FULL-LENGTH
  if ($seq->desc =~ /^FULL-LENGTH\s+/) {
    $seqout->write_seq($seq);
  }
}

written 8.1 years ago by Raghul (200)

0
8.1 years ago by Echo (70), United States
Echo wrote:

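Obviously, Bioperl is highly recommended, but the script below also works.

    # Paragraph mode: read one blank-line-separated record at a time.
    # You may want to look up $/ in perldoc perlvar.
    # This relies on blank lines between records, as in the sample.
    $/ = "";
    while (<>) {
        # Skip any record that mentions NON-FULL-LENGTH
        unless (/NON-FULL-LENGTH/) {
            print;
        }
    }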
written 8.1 years ago by Echo (70)

AFAIK, it won't work. Test your script with a set of sequences in FASTA format.
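
One likely reason: paragraph mode splits the input on blank lines, and FASTA files usually have none between records, so the whole file would be read as a single "paragraph" that contains NON-FULL-LENGTH and nothing would be printed. A minimal sketch of a record-based alternative follows, assuming records are delimited by a newline followed by '>'. The helper name wanted_records and the shortened sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records of a FASTA string whose
# header line does not contain the excluded keyword.
sub wanted_records {
    my ($fasta, $exclude) = @_;
    open my $fh, '<', \$fasta or die "Cannot read in-memory data: $!";
    local $/ = "\n>";    # a record ends where the next header begins
    my @out;
    while (my $rec = <$fh>) {
        chomp $rec;           # strip the trailing "\n>" separator
        $rec =~ s/^>//;       # the first record still has its leading ">"
        my ($header) = split /\n/, $rec, 2;
        next if index($header, $exclude) != -1;
        $rec =~ s/\n*\z/\n/;  # end every record with exactly one newline
        push @out, ">$rec";
    }
    return @out;
}

# No blank lines between records, unlike the sample in the question
my $fasta = <<'END';
>isotig07104 FULL-LENGTH (BLAST)
GGTGAGTACTAAATTATaCGAAAGATTGAaGTCCAGTTATAGCTCTGCCT
>isotig07106 NON-FULL-LENGTH (BLAST)
TTAGCATATTCTAtCTTTTTtAGAcTAAGGAAaGATGGAAgTGtAaTtAA
END

print wanted_records($fasta, "NON-FULL-LENGTH");
```

The keyword is checked against the header line only, so a match cannot be triggered by anything in the sequence lines.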

written 8.1 years ago by Pierre Lindenbaum (119k)