Question: Is it possible using a regex in perl to include both accession number and the matched sequence when printing?
gravatar for agabali
6.4 years ago by
agabali0 wrote:

Hello BioStars community,

New to perl and programming in general, so I thought I might try out my luck asking a question here.

I am trying to match a fairly conserved protein sequence to a proteome using a regex.  I am able to output the matching lines, as well as their positions, but I cannot find a way to output the accession numbers along with lines that match my conserved protein. 

Here's part of my code:

my $proteins;
open( file, "Athaliana_167_protein.fa" ) or die "can't open file!";
while (<file>){
        if (/W[S]TRRKIAI/) {print}

Would using lookahead/lookbehinds possibly work to print out the match line and accession number? 



regex match protein perl • 2.6k views
ADD COMMENTlink modified 6.4 years ago by promise0 • written 6.4 years ago by agabali0

Your code does most likely not work for finding the sequence you are looking for, most fasta files contain linebreaks in the sequence where you will miss your pattern in case it is wrapped, you need to put the whole sequence into one string first.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Michael Dondrup47k
gravatar for Michael Dondrup
6.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

This statement is important for parsing fasta efficiently in perl (unless you want to use BioPerl):

local($/) = "\n>";

It allows you to read a complete fasta record instead of each line.

  local($/) = "\n>"; # read each fasta record, always use local!
  while (my $fastarec = <FASTA>) {
    chomp $fastarec;
    my ($defline, @seq) = split "\n", $fastarec;  #seq id is the first line
    $defline =~ s/^\>//; # remove left over >, just in case
    my $seq = join "", @seq; # put together the sequence again
    $seq =~ s/\s//g; # remove potential left-over spaces, empty lines etc.
    if $seq =~ /W[H|A]TEVER/ {
     print (join "\n", ">$defline", @seq), "\n"; # output sequence in original formatting
ADD COMMENTlink modified 8 months ago by RamRS30k • written 6.4 years ago by Michael Dondrup47k

+1 Nice tip!

ADD REPLYlink written 6.4 years ago by Alex Reynolds30k

Thank you for the tips! This worked very well.

ADD REPLYlink written 6.4 years ago by agabali0

That's actually how BioPerl does it internally :)

ADD REPLYlink written 6.4 years ago by Chris Fields2.1k
  1. Your script doesn't compile.
  2. Always: use strict; use warnings;
  3. Use lexical file handles.
  4. No need to s/^\>// since > is chomped as the unique FASTA record separator.
  5. No need to s/\s//g when splitting on ' '.
  6. No need to join the sequence lines for the pattern matching (set a local copy of $" to an empty string and do a match on "@{ [ split ' ', $fastarec ] }").

Here's an alternative solution:

use strict;
use warnings;

open my $FASTA, '<', 'Athaliana_167_protein.fa' or die "Can't open file: $!";
local ( $/, $" ) = ( '>', '' );

while (<$FASTA>) {
    chomp if s/(.+)\n// and my $defline = $1 or next;<br />
    print ">$defline\n$_" if "@{ [ split ' ' ] }" =~ /W(H|A)TEVER/;
ADD REPLYlink modified 8 months ago by RamRS30k • written 6.4 years ago by Kenosis1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1629 users visited in the last hour