Question

Perl Print Nextline From Within While Loop?

1

Entering edit mode

13.6 years ago

User 3739 ▴ 10

I am trying to query a large FASTA file in which I start with a list of genes, search for the gene in the FASTA, and then I want to print out the header line (>gene) and the protein sequence from the subsequent line.

Due to the large size of the FASTA file, I am using a while loop. I can print the line that contains the string I am searching for, but I cannot figure out how to print the next line that contains the protein sequence.

Can someone please tell me the easiest way to do that?

Thanks.

perl fasta • 13k views

ADD COMMENT • link updated 10.1 years ago by Prakki Rama ★ 2.7k • written 13.6 years ago by User 3739 ▴ 10

1

Entering edit mode

See the answers to this question. Do you realise that the next line after the header may not contain the entire sequence?

ADD REPLY • link updated 4.7 years ago by Ram 43k • written 13.6 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

Hi Keith, I think your comment is appropriate as an answer as well. You might want to consider adding it.

ADD REPLY • link 13.6 years ago by Istvan Albert 100k

0

Entering edit mode

Duly added, with a pointer to another relevant answer too.

ADD REPLY • link 13.6 years ago by biobot 0.0.77.a.1099 6.2k

Ram · Answer 1 · 2010-09-13

6

Entering edit mode

13.6 years ago

biobot 0.0.77.a.1099 6.2k

See the answers to this this question on Fasta parsing in Perl and the answers to this question on extracting sequences from very large Fasta files.

Do you realise that the next line after the header may not contain the entire sequence? If you are unsure of the details involved in parsing a file format, it may be better to use a library such as Bioperl.

ADD COMMENT • link updated 4.7 years ago by Ram 43k • written 13.6 years ago by biobot 0.0.77.a.1099 6.2k

Ram · Answer 2 · 2010-09-15

0

Entering edit mode

13.6 years ago

Keeney • 0

This may be an inefficient way of doing it because I'm still fairly new to it, but one idea is to loop through it once, building an array by splitting up the gene inputs by newlines, then loop through your array looking for sequence matches. If it finds a match, just print the current position in the array +1.

So, something along the lines of this (the syntax is wrong):

@array = split(/\n/, ListOfGenes.fasta);

$i = 0;

# Now loop through the big FASTA file.
while ( <IN> ) {
  if ( $_ == $array[$i] ) {
    print OUT $array[$i+1]
  }
$i++;
}

It unnecessarily tries to match with everything in the array (which could be avoided by advancing $i by 2 or 3 or whatever's appropriate for your list, rather than by 1). Like I said, I'm still pretty new to it, so I don't know if that works, but it's quick to try!

That's my two cents, sorry if it's completely useless information!

ADD COMMENT • link updated 4.7 years ago by Ram 43k • written 13.6 years ago by Keeney • 0

1

Entering edit mode

You realise that there are flaws in your proposal - good, that's the first step to improvement. Splitting on \n doesn't break the Fasta file on record boundaries. Assuming <IN> is reading in gene names, you're stepping both the array index and gene name at the same time when you want a nested loop (and what you really want is a hash lookup). Your $_ test should be eq for strings, not ==. The test will always fail because Fasta headers contain a leading >. The next element in the array will not always contain the entire sequence. The Fasta file is large and probably won't fit in memory.

ADD REPLY • link updated 4.7 years ago by Ram 43k • written 13.6 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

Hmm, so I may have misunderstood the input files; in this proposal, I was assuming the [?] was the large FASTA file with no newline characters, and ListOfGemes.fasta was an alternating list of genes and their protein sequences, separated by newlines (I assumed this because in the question he asks for the protein sequence from the "next line"). Maybe that's not what was meant...? I think you make a great point that a hash might be better, though. I'm still not too experienced with them so I didn't know how to propose that! And you're right, it should have been 'eq' rather than '=='.

ADD REPLY • link 13.6 years ago by Keeney • 0

0

Entering edit mode

Maybe a pattern match in between > and whatever else (methionine, for example) would be better than splitting by newline, then?

Like (again, syntax is probably wrong):

if ( ListOfGenes.fasta =~ m/>M/g ) { ...

ADD REPLY • link updated 4.7 years ago by Ram 43k • written 13.6 years ago by Keeney • 0

Ram · Answer 3 · 2014-04-04

0

Entering edit mode

10.1 years ago

Prakki Rama ★ 2.7k

You can try this:

open FH,"filename.txt";

while(<FH>)
 {
    $line1=$_;
    $line2=&lt;FH&gt;;
    print "Line1: $line1";
    print "Line2: $line2";
}

close(FH);

ADD COMMENT • link updated 4.7 years ago by Ram 43k • written 10.1 years ago by Prakki Rama ★ 2.7k