Question

Fasta Format Sequence Parsing With Bioperl

14.0 years ago

Raghul ▴ 200

Hi, I have a FASTA-format sequence file, but the header lines are not in the usual NCBI description style (please see below).

>contig07303  length=480  numreads=24  gene=isogroup00008  status=isotig
CACTATCCTcTTCATTTAGGTTTTTGATAtCTTGAATTACTTTCTtCATTTTTCTAGTAG...................

I used a program based on BioPerl's Bio::SeqIO to extract sequences with length > 500 from "normal" FASTA sequences (from NCBI), and it worked. When I modified the same program to use a different "keyword", to retrieve sequences with numreads > 500, it did not work. Is it possible to do this with the Bio::SeqIO module for sequences that have the FASTA description shown above, or do I have to write different code without Bio::SeqIO?

Thank you very much for replying. raghul

Previous code, given by neilfws:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $seqin  = Bio::SeqIO->new(-file => "myfile.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">myfile_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}

fasta parsing bioperl • 11k views

ADD COMMENT • link updated 14.0 years ago by Neilfws 49k • written 14.0 years ago by Raghul ▴ 200

Assuming that the lack of ">" in your example FASTA file is just a typo, can you clarify this somewhat? Your pasted code doesn't match what you say you are trying to do, namely modifying the "keyword" to retrieve sequences with numreads > 500. What exactly are you trying to do, and what is the code? The question just isn't very clear.

ADD REPLY • link 14.0 years ago by DG 7.3k

Yes, I misread this question too. It's not clear what the header looks like, because the sequence may not be formatted properly as shown here (that is, does it start with ">"?).

ADD REPLY • link 14.0 years ago by Neilfws 49k

Yes, the sequence starts with ">". By "keyword" I mean the description in the FASTA header line (sorry for the confusion). The above program detects sequences no longer than 100 bases (<= 100). When I changed the operator to > to identify sequences longer than 100 nt, I got the wrong answer: the output also included sequences shorter than 100 nt. Thank you for replying. raghul

ADD REPLY • link 14.0 years ago by Raghul ▴ 200

score 1 · Answer 1 · 2011-07-05

14.0 years ago

Neilfws 49k

EDIT: This is a new answer which assumes that your sequence header line begins with ">"

If I understand your question correctly, you are trying to extract the part of the header line which reads numreads=NN and determine whether NN > 500.

The answer is yes, this can be done using Bio::SeqIO. The portion of the header line directly following the ">" is the sequence ID. Anything after the first space is the sequence description. Bio::SeqIO has a method to retrieve the description, like so:

my $desc = $seq->desc;
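To make the ID/description split concrete, here is a minimal pure-Perl sketch that mimics it on the header from the question. The helper name split_fasta_header is hypothetical and not part of BioPerl; it only illustrates the rule described above (ID up to the first whitespace, description after it) and is not the library's actual parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: split a FASTA header line into
# (ID, description) at the first run of whitespace after the ">".
sub split_fasta_header {
    my ($line) = @_;
    my ($id, $desc) = $line =~ /^>\s*(\S+)\s*(.*?)\s*$/
        or return;    # empty list: this is not a header line
    return ($id, $desc);
}

my ($id, $desc) = split_fasta_header(
    '>contig07303  length=480  numreads=24  gene=isogroup00008  status=isotig');
print "id:   $id\n";      # the part Bio::SeqIO would report as the sequence ID
print "desc: $desc\n";    # the part $seq->desc would be expected to return
```

With this header, everything from "length=480" onwards lands in the description, which is where numreads has to be looked up.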

So you need to get $desc, then use a regular expression for numreads:

if($seq->desc =~/\s+numreads=(\d+)\s+/) {
  my $numreads = $1;
  $seqout->write_seq($seq) if $numreads > 500;
}
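One caveat with the pattern above: it requires whitespace on both sides of numreads=NN, so it would presumably miss a header where numreads is the first or last field of the description. In the example header numreads sits in the middle, so the pattern matches. A slightly more tolerant sketch is to split the description into key=value pairs first. The helper names desc_attributes and numreads_above are hypothetical, and the code assumes the fields are whitespace-separated key=value tokens as in the example header.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "length=480  numreads=24 ..." into a hash.
# Tokens without "=" are ignored.
sub desc_attributes {
    my ($desc) = @_;
    $desc = '' unless defined $desc;
    my %attr;
    for my $field (split ' ', $desc) {
        my ($key, $value) = split /=/, $field, 2;
        $attr{$key} = $value if defined $value;
    }
    return %attr;
}

# Hypothetical helper: 1 if the description has a numeric numreads > $min,
# otherwise 0 (including when numreads is missing).
sub numreads_above {
    my ($desc, $min) = @_;
    my %attr = desc_attributes($desc);
    return 0 unless defined $attr{numreads} && $attr{numreads} =~ /^\d+$/;
    return $attr{numreads} > $min ? 1 : 0;
}

# Inside the Bio::SeqIO loop, the test would then read:
#   $seqout->write_seq($seq) if numreads_above($seq->desc, 500);
print numreads_above('length=480  numreads=24  gene=isogroup00008', 500)
    ? "keep\n" : "skip\n";
print numreads_above('length=900  numreads=612', 500) ? "keep\n" : "skip\n";
```

The same hash also gives access to the other fields (length, gene, status), so filtering on a different key only means changing the hash key.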

A thorough read of the Bio::SeqIO documentation will help you answer many of these "variations on the same theme" problems.

ADD COMMENT • link 14.0 years ago by Neilfws 49k

The input is a fasta file with thousands of sequences. The numreads values for sequences range from 1 to 1000. I want to extract sequences with "numreads" values greater than 500.

I am reading the Bio::SeqIO documentation. Is there a textbook for BioPerl, like James Tisdall's book for Perl? Thank you very much for the reply. raghul

ADD REPLY • link 14.0 years ago by Raghul ▴ 200

There's a very old Bioperl book and newer Bioinformatics/Perl books with sections on Bioperl - go and search Amazon for "Bioperl". However, the Bioperl website documentation is better and more up to date.

ADD REPLY • link 14.0 years ago by Neilfws 49k

I am getting the following error. Can you say what it means?

C:\Perl>numreads.pl
Replacement list is longer than search list at C:/Perl/site/lib/Bio/Range.pm line 251.

Thank you for your patience. The altered code is below:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $seqin  = Bio::SeqIO->new(-file => "num.txt",         -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">readoutput.txt", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->desc =~ /\s+numreads=(\d+)\s+/) {
    my $numreads = $1;
    $seqout->write_seq($seq) if $numreads > 500;
  }
}

ADD REPLY • link 14.0 years ago by Raghul ▴ 200

Well, my Google search for that message (always a good first step) suggests that the warning can be ignored and is fixed in the latest bioperl-live from github. Did the script generate the expected output file? Also, pasting code in comments doesn't work too well; try something like pastebin - http://pastebin.com/.

ADD REPLY • link 14.0 years ago by Neilfws 49k

Yes, I got the output, and it appears correct. All the output sequences are as asked, i.e. > 500 nucleotides in length. But I cannot understand this. Thank you. raghul

ADD REPLY • link 14.0 years ago by Raghul ▴ 200

score 1 · Answer 2 · 2011-07-05

It's easier to use sed's in-place editing mode (-i).

Furthermore, the '1s' address restricts the substitution to the first line, so '>' is added at the beginning of the first line only (suitable if the file contains a single sequence):

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

`sed -i '1s/^/>/' myfile.fa`;
my $seqin  = Bio::SeqIO->new(-file => "myfile.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">myfile_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}
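Since sed is not normally available on a stock Windows install (the asker's prompt is C:\Perl), the same first-line fix can be sketched in plain Perl. The helper name ensure_fasta_marker is hypothetical, and the sketch reads the whole file into memory, so it assumes a file of modest size. It only matters if the ">" really is missing, which the thread later established was not the case.

```perl
use strict;
use warnings;

# Hypothetical helper: add '>' to the first line of $path if it is missing.
# Returns 1 if the file was changed, 0 if it was left alone.
sub ensure_fasta_marker {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot read $path: $!";
    my @lines = <$in>;
    close $in;
    return 0 if !@lines || $lines[0] =~ /^>/;    # empty file or already fine
    $lines[0] = '>' . $lines[0];
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} @lines;
    close $out or die "Cannot close $path: $!";
    return 1;
}

# Usage, replacing the sed call in the script above:
#   ensure_fasta_marker("myfile.fa");
```

Unlike the unconditional sed substitution, this leaves the file untouched when the first line already starts with ">", so running it twice does not produce ">>".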