Question: How To Parse Fasta Files In Perl
7
gravatar for nikulina
9.0 years ago by
nikulina280
Cambridge
nikulina280 wrote:

Dear colleagues! I have a file with lots of sequences in FASTA format. I want to write a perl script to analyze each sequence (to count the length of certain fragment). So, how can I manage to treat each sequence as a variable? Should I use an array to read my file?

So, here is my script. It might be not very nice, but it works. I would like to modify it in order to work with FASTA data.

$string_filename = 'file.txt';  

open(FILE, $string_filename);  

@array = FILE;     

close FILE;  

foreach $string(@array) {
    $R = length $string;  
    if ( $string =~ /ggc/ ) {   
        $M = $';   
        $W = length $M;  
        if ( $string =~ /atg/ ) {   
            $K = $`;   
            $Z = length $K;  
            $x = $W + $Z - $R;     
            print " \n\ the distance is the following: \n\n ";  
            print $x;  
        } else {  
            print "\n\ I couldn\'t find the start codone.\n\n";  
        }  
    } else {  
        print "\n\ I couldn\'t find the binding site.\n\n"; }  
}  
exit;

I will be grateful for your help :)

perl fasta • 28k views
ADD COMMENTlink modified 9 months ago by RamRS22k • written 9.0 years ago by nikulina280

Could you also show us an example sequence for which this code works? If the code is supposed to do what I think it is supposed to do, I think there may be quite a few problems with it.

ADD REPLYlink written 9.0 years ago by Neilfws48k

Are you really sure none of to 10000 topics about how to parse file XXX did match your needs?

ADD REPLYlink written 7.5 years ago by Fabian Bull1.3k
21
gravatar for Neilfws
9.0 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

First, there is no need to reinvent the wheel. As Stefano wrote, Bioperl will parse fasta sequences for you and do a whole lot more besides. Once installed, it is as simple as:

use Bio::SeqIO;
my $seqio = Bio::SeqIO->new(-file => "file.fa", '-format' => 'Fasta');
while(my $seq = $seqio->next_seq) {
    my $string = $seq->seq;
    # do stuff with $string
}

Second, there are some issues with your code. It should be "@array = <file>" - although as Stefano points out, you should not read the whole file into an array.

So far as I can tell, you are trying to find sub-sequences which begin "atg" and end with "ggc". Some other issues with your code:

  1. It seems to assume that there is only one each of "atg" and "ggc", because you use if() to match the regular expressions, not while().
  2. It returns negative values for length of the sub-sequence. Is this what you want? It is unclear whether you are looking for "atg" which lie upstream of "ggc" or whether they can be at any position in the sequence.
  3. It looks as though you are looking for start codons. There may be alternatives to atg: gtg or ttg.
  4. Your regular expressions are case-sensitive and would miss, for example, ATG.

Assuming that you are trying to find the region atg -> ggc, you could try something like:

while(my $string =~/atg(.*)ggc/gi) {
    # do something with match
    # e.g. match start = $-[0]+1, match end = $+[0]
}

That example uses the special Perl variables @- and @+ to get match positions, but Bioperl will also provide you with plenty of methods for analysing sub-sequences.

ADD COMMENTlink modified 9 months ago by RamRS22k • written 9.0 years ago by Neilfws48k

Thank you for your attention to my question. In fact I would like to find the distance between binding sites for RNAP II and start of transcription in certain human genes. So, I used some sample motifs ('ggc' and 'atg') in my perl script, just to make the task easier and to test how it works in this simple variant. The binding site is situated before the start codone, that's why i didn't take into consideration those variants, where the first motif is situated after the second. Once more, thank you for your help.

ADD REPLYlink written 9.0 years ago by nikulina280
7
gravatar for Stefano Berri
9.0 years ago by
Stefano Berri4.1k
Cambridge, UK
Stefano Berri4.1k wrote:

If you are planning to read and manipulate a lot of files with fasta sequences, do it properly. Use Bioperl. It make life easier (see an example here ). It takes some time to set it up and learn the "philosophy" behind, but then you can do much more: read from NCBI/EMBL, read/write to different formats... all with the same interface. Already debugged for you.

Also, if you use big files, don't do this:

open(FILE, $string_filename);  
@array = FILE;

It will load the whole file in memory. Nowadays fasta files might be Huuuuuge.

ADD COMMENTlink modified 9 months ago by RamRS22k • written 9.0 years ago by Stefano Berri4.1k

Thank you! indeed i recognise that my variant is not very convinient and consumes lots of memory. I'll try to examine BIOperl and use it for futher tasks.

ADD REPLYlink written 9.0 years ago by nikulina280
4
gravatar for Hanif Khalak
9.0 years ago by
Hanif Khalak1.2k
Doha, QA
Hanif Khalak1.2k wrote:

Along the lines of answers to this question, you can read/process one FASTA sequence at a time. I'd modify your code like this:

$string_filename = 'file.txt';  
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");  

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
chomp $seq;
    $seq =~ s/^>*.+\n//;  # remove FASTA header
    $seq =~ s/\n//g;  # remove endlines

    $R = length $seq;
    if ( $seq =~ /ggc/ ) {   
        $M = $';
        $W = length $M;
        if ( $seq =~ /atg/ ) {   
            $K = $`;   
            $Z = length $K;  
            $x = $W + $Z - $R;     
            print "\n\ the distance is the following: $x\n\n";
        } else {  
            print "\n\ I couldn't find the start codon.\n\n";
        }
    } else {  
        print "\n\ I couldn't find the binding site.\n\n"; }  
    }

}  # end while

close FILE;  

exit;
ADD COMMENTlink modified 9 months ago by RamRS22k • written 9.0 years ago by Hanif Khalak1.2k

Thank you! it works!

ADD REPLYlink written 9.0 years ago by nikulina280
0
gravatar for Tarah
7.5 years ago by
Tarah0
Tarah0 wrote:

Can I please add a question to this? What if you want to remove that string/sequence that you are looking for? I have a control phage in my illumina data that I want to remove, but am having a hard time finding out how to do this. Thanks so much!

ADD COMMENTlink written 7.5 years ago by Tarah0
1

Tarah, please delete this and post as a separate question.

ADD REPLYlink written 7.5 years ago by Casey Bergman18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1687 users visited in the last hour