Question: How to match a FASTA header for extraction using Perl?
21 months ago by
Mimmi Ahlmén wrote:

Hi!

I have a FASTA file containing sequences, and I want to replace the old FASTA headers with new ones. The first step is to match the header names. It is the name itself I want to match, i.e. the part after the '>'. How do I do this? All sequences have headers somewhat like this:

>Halobacterium_salinarum

This is the part of the code where I find the headers:

     while (my $line = <$IN>) {
         if ($line =~ /^>/) {
             my $x =           # Here I want to match with "Halobacterium_salinarum"
                               # and all the other different species names

I have tried for hours to find the right match characters. Is it "any word character", \w? I also want to save the old species name in a hash, so I should capture it like this: (\w+), and finish with \s because that's where the name ends, right?


Try the script from the following article.

https://www.perlmonks.org/?node_id=975419

Reply written 21 months ago by Arup Ghosh

So, people still use Perl for Bioinformatics!

Reply written 20 months ago by Santosh Anand

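Using BioPerl will probably make your life easier:

use strict;
use warnings;
use Bio::SeqIO;

# The FASTA file name is taken from the command line
my $file = shift @ARGV or die "Usage: $0 <fasta_file>\n";

my $fasta  = Bio::SeqIO->new(-file => $file , -format => 'Fasta');
while ( my $seq = $fasta->next_seq() ) {
  # id() returns the first word of the header, without the leading '>'
  my $header = $seq->id;
  print "My species name = $header\n";
}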
Reply written 20 months ago by Juke34
20 months ago by
Juke34 (Sweden) wrote:
while (my $line = <$IN>) {
  # Match against $line; capture everything after the leading '>'
  if ($line =~ m/^>(.+)/) {
     print "My species name = $1\n";
  }
}
20 months ago by
JC (Mexico) wrote:

In Perl, \w matches any alphanumeric character or the underscore, so (\w+) matches a whole word and stops at the first non-word character (such as a space or a newline). If you want to save this in a hash:

#!/usr/bin/perl

use strict;
use warnings;

my %species = ();
while (<>) {
    # Capture the name that follows '>' and count how often it occurs
    if ( m/^>(\w+)/ ) {
        $species{$1}++;
    }
}

print "Species\tCount\n";
while (my ($sp, $cnt) = each %species) {
    print "$sp\t$cnt\n";
}
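
Since the original goal was to replace the old headers, here is a minimal sketch of how the same (\w+) capture could drive the replacement. It assumes the old-to-new names are already in a hash. The hash %new_name, the sub rename_header and the name 'H_salinarum_new' are made up for illustration and do not come from the posts above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical mapping from old header names to new ones (made-up value)
my %new_name = (
    Halobacterium_salinarum => 'H_salinarum_new',
);

# Return the line with its header name replaced if the name is in the map.
# Sequence lines and headers with unknown names are returned unchanged.
sub rename_header {
    my ($line, $map) = @_;
    if ( $line =~ m/^>(\w+)/ ) {
        my $old = $1;    # copy the capture before running another regex
        $line =~ s/^>\w+/>$map->{$old}/ if exists $map->{$old};
    }
    return $line;
}

# Small in-memory example instead of a real FASTA file
my @fasta = (
    ">Halobacterium_salinarum\n", "ATGCGT\n",
    ">Unknown_species\n",         "GGCCTA\n",
);
print rename_header($_, \%new_name) for @fasta;
```

For a real file you would call rename_header on each line read from the input filehandle and print the result to the output file. Anything after the name on the header line (for example a description) is left as it is.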