Question: Perl Print Nextline From Within While Loop?
1
gravatar for User 3739
9.2 years ago by
User 373910
User 373910 wrote:

I am trying to query a large FASTA file in which I start with a list of genes, search for the gene in the FASTA, and then I want to print out the header line (>gene) and the protein sequence from the subsequent line.

Due to the large size of the FASTA file, I am using a while loop. I can print the line that contains the string I am searching for, but I cannot figure out how to print the next line that contains the protein sequence.

Can someone please tell me the easiest way to do that?

Thanks.

perl fasta • 10.0k views
ADD COMMENTlink modified 5.7 years ago by Prakki Rama2.3k • written 9.2 years ago by User 373910
1

See the answers to this question. Do you realise that the next line after the header may not contain the entire sequence?

ADD REPLYlink modified 3 months ago by RamRS25k • written 9.2 years ago by iw9oel_ad6.0k

Hi Keith, I think your comment is appropriate as an answer as well. You might want to consider adding it.

ADD REPLYlink written 9.2 years ago by Istvan Albert ♦♦ 81k

Duly added, with a pointer to another relevant answer too.

ADD REPLYlink written 9.2 years ago by iw9oel_ad6.0k
6
gravatar for iw9oel_ad
9.2 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

See the answers to this this question on Fasta parsing in Perl and the answers to this question on extracting sequences from very large Fasta files.

Do you realise that the next line after the header may not contain the entire sequence? If you are unsure of the details involved in parsing a file format, it may be better to use a library such as Bioperl.

ADD COMMENTlink modified 3 months ago by RamRS25k • written 9.2 years ago by iw9oel_ad6.0k
0
gravatar for Keeney
9.2 years ago by
Keeney0
Denver, CO
Keeney0 wrote:

This may be an inefficient way of doing it because I'm still fairly new to it, but one idea is to loop through it once, building an array by splitting up the gene inputs by newlines, then loop through your array looking for sequence matches. If it finds a match, just print the current position in the array +1.

So, something along the lines of this (the syntax is wrong):

@array = split(/\n/, ListOfGenes.fasta);

$i = 0;

# Now loop through the big FASTA file.
while ( <IN> ) {
  if ( $_ == $array[$i] ) {
    print OUT $array[$i+1]
  }
$i++;
}

It unnecessarily tries to match with everything in the array (which could be avoided by advancing $i by 2 or 3 or whatever's appropriate for your list, rather than by 1). Like I said, I'm still pretty new to it, so I don't know if that works, but it's quick to try!

That's my two cents, sorry if it's completely useless information!

ADD COMMENTlink modified 3 months ago by RamRS25k • written 9.2 years ago by Keeney0
1

You realise that there are flaws in your proposal - good, that's the first step to improvement. Splitting on \n doesn't break the Fasta file on record boundaries. Assuming <IN> is reading in gene names, you're stepping both the array index and gene name at the same time when you want a nested loop (and what you really want is a hash lookup). Your $_ test should be eq for strings, not ==. The test will always fail because Fasta headers contain a leading >. The next element in the array will not always contain the entire sequence. The Fasta file is large and probably won't fit in memory.

ADD REPLYlink modified 3 months ago by RamRS25k • written 9.2 years ago by iw9oel_ad6.0k

Hmm, so I may have misunderstood the input files; in this proposal, I was assuming the [?] was the large FASTA file with no newline characters, and ListOfGemes.fasta was an alternating list of genes and their protein sequences, separated by newlines (I assumed this because in the question he asks for the protein sequence from the "next line"). Maybe that's not what was meant...? I think you make a great point that a hash might be better, though. I'm still not too experienced with them so I didn't know how to propose that! And you're right, it should have been 'eq' rather than '=='.

ADD REPLYlink written 9.2 years ago by Keeney0

Maybe a pattern match in between > and whatever else (methionine, for example) would be better than splitting by newline, then?

Like (again, syntax is probably wrong):

if ( ListOfGenes.fasta =~ m/>M/g ) { ...
ADD REPLYlink modified 3 months ago by RamRS25k • written 9.2 years ago by Keeney0
0
gravatar for Prakki Rama
5.7 years ago by
Prakki Rama2.3k
Singapore
Prakki Rama2.3k wrote:

You can try this:

open FH,"filename.txt";

while(<FH>)
 {
    $line1=$_;
    $line2=&lt;FH&gt;;
    print "Line1: $line1";
    print "Line2: $line2";
}

close(FH);
ADD COMMENTlink modified 3 months ago by RamRS25k • written 5.7 years ago by Prakki Rama2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 770 users visited in the last hour