Question: how to compare 2 fasta sequences with perl
0
gravatar for Lila M
3.5 years ago by
Lila M 470
UK
Lila M 470 wrote:

Hello 

I've started using perl a couples of month ago.

I need to parse a fasta file with 2 sequences. I can read them and export in another txt without the header. Now, I need to compare both: their length, their amino acids.. but I dont know how to do that, this is my code: 

my $fasta_file = "shortAligForDist.txt";                               

my $outfile = "Output.txt";            

open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";

my $fh;
open($fh, $fasta_file) or die "can't open $fasta_file: $!\n";

my %sequence_data;
while (read_fasta_sequence($fh, \%sequence_data)) {
  #print ">$sequence_data{header}\n$sequence_data{seq}\n\n";
  print "\n$sequence_data{seq}\n\n";
    print OUTFILE "\n$sequence_data{seq}\n\n";
}

sub read_fasta_sequence {
   my ($fh, $seq_info) = @_;

   $seq_info->{seq} = undef; # clear out previous sequence

   # put the header into place
   $seq_info->{header} = $seq_info->{next_header} if $seq_info->{next_header};

   my $file_not_empty = 0; 
   while (<$fh>) {
      $file_not_empty = 1;
      next if /^\s*$/;  # skip blank lines
      chomp;    

      if (/^>/) { # fasta header line
         my $h = $_;    
         $h =~ s/^>//;  
         if ($seq_info->{header}) {
            $seq_info->{next_header} = $h;
            return $seq_info;   
         }              
         else { # first time through only
            $seq_info->{header} = $h;
         }              
      }         
      else {    
         s/\s+//;  # remove any white space
         $seq_info->{seq} .= $_;
      }         
   }    

   if ($file_not_empty) {
      return $seq_info;
   }    
   else {
      # clean everything up
      $seq_info->{header} = $seq_info->{seq} = $seq_info->{next_header} = undef;

      return;   
   }    
}

 

sequence perl fasta • 1.8k views
ADD COMMENTlink modified 3.5 years ago by SES8.2k • written 3.5 years ago by Lila M 470
2

Your question is underspecified.  Unless you fully describe what you want to accomplish, people can only offer vague hints.  Please try harder to describe the intended input and output of the program. "Compare the amino acids of two sequences" does not really mean anything.

Also, this smells like a homework assignment.  The point of homework is to figure it out for yourself, not to ask other people for the answer.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Brian Bushnell16k

A "homework" flag should exist to discourage answers to this kind of questions

ADD REPLYlink written 3.5 years ago by Israel Barrantes740

Hi is not a homework assignment. I am a PhD student and I'm trying to do that in order to perfor a program that avoid me to compare sequences in order to make my analysis more efficiency. 

ADD REPLYlink written 3.5 years ago by Lila M 470

Use BioPerl, unless you have a valid reason to avoid it. If you stick to plain text processing, you'll stumble a lot.

ADD REPLYlink written 3.5 years ago by RamRS21k
0
gravatar for RamRS
3.5 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

I'd recommend you use BioPerl. Also, comparing length is easy, to compare the amino acids, you'd need to align the sequences. You can of course compare the amino acid composition if you're just doing this for practice.

For the former, looking into the length() function. For amino acid composition, you'll have to use either associative arrays hashes with counters or the substitution operation (s/<AA>/<blank>/g)with the length() function (which is a neat trick). These don't need BioPerl and can help you flex your barebone string manipulation skills.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by RamRS21k

Not to be picky, but it would be better to the term hash rather than associative array (with Perl). At a high level they are a similar concept but not the same and using that term with Perl is a bit dated (I think this term was used a long, long time ago). It will certainly be easier to find relevant help for term the term hash, that was my only concern.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by SES8.2k

Hmmm. I was gonna say associative array (hash), but I guess I forgot to add that. I think I caught on to the term when I was first introduced to Perl. Hash it is!

ADD REPLYlink written 3.5 years ago by RamRS21k
0
gravatar for venu
3.5 years ago by
venu6.1k
Germany
venu6.1k wrote:

You can look at this script protein_stats.pl and write your own script / modify this script. It gives the percentage of each Amino acid and some other classes of Amino acids.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by venu6.1k
0
gravatar for SES
3.5 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

Comparing amino acid sequences may actually take a lot of work depending on if you want something beyond just length or composition. I recommend you check out the Bio::Tools::SeqStats module. It is very handy for these tasks. This will allow you to compare any number of different sequence features (amino acid composition, hydropathicity, molecular weight, length, etc.) with just a few lines of code.

ADD COMMENTlink written 3.5 years ago by SES8.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1111 users visited in the last hour