Question: Comparing fasta headers to set of 3 letter codes that is found in the header
gravatar for pasanfernando87
6.3 years ago by
United States
pasanfernando870 wrote:

hi guys! I am a newbie to perl and I need help with finding list of fasta seuences which has specefic 3 letter code in the header. The question is explained below

Scenario: I have a fasta file which contains list of sequences as follows (test.fa)

>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p


>cfa-miR-761 MIMAT0009936 Canis familiaris miR-761


>cfa-miR-764 MIMAT0009937 Canis familiaris miR-764


>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c


I have set of organism codes in a separate file (codes.txt)




what I want to do is search through test.fa for the codes and only print out the sequences which has that particular code in the header The example above contains only few sequences (not the entire file) So far I managed to create a hash and store the headers into keys and sequences into values. And I read through codes.txt and stored the codes in an array

The problem is when I use 'if exists' function to find whether each code in array exists in the hash here is my code,


use warnings;
use Bio::Perl;
use Bio::Seq;
use Bio::SeqIO;
my $filename = 'report.fa';
# Reading the first file and store it into a hash
#the sequence header is stored in the hash key and sequence is stored in the value

my $FastaFile1 = Bio::SeqIO->new(-file => "test.fa", -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from \n";

my %fastaH1 =();
while( my $seqFile1 = $FastaFile1->next_seq() ) {
    unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc}) {
        my $k = $seqFile1->display_id."\t".$seqFile1->desc;
        $k =~ s/^\s+|\s+$//g;
        $fastaH1{$k} = $seqFile1->seq; #key of the hash is fasta header (all line) and value is sequence.

# printing the fasta headers
print "stored fasta headers:\n";
foreach my $key (keys %fastaH1){
        print  "$key\n";

# reading the codes.txt file and creating the array
open my $file, '<', "codes.txt";
chomp(my @lines = <$file>);
close $file;

print "stored organism codes\n";
foreach (@lines) {
     print " $_\n";
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
# if each code is found on the hash print match is found
 foreach my $line (@lines) {
    chomp $line;
   if(  exists $fastaH1{$line} ){
        print $fh"match found\n";
        print $fh">$line\n$fastaH1{$line}\n";

close $fh;

The code should be printing found match with its sequences in a separate file. But it gives an empty file as the output

For example since 'lus' is in the code set, it should output

>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c


in the output. can you help me?

perl • 3.3k views
ADD COMMENTlink modified 6.3 years ago by JC12k • written 6.3 years ago by pasanfernando870

grep -A2 -E 'lus-|cel-|cfa-|any_other_code' test.fa > new_test.fa should work provided that fasta sequences are contained within a single line which are in your case as you are working with mature miRNA sequences. But I would still recommend you to keep working on the perl code to enhance your programming skills. Thanks. 

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Ashutosh Pandey12k

I think your approach is better than OP's. Perl would need to pass the file multiple times for each query pattern, and that can never be as efficient as grep. Also, OP's subject file is single-line fasta. Couldn't get any UNIX friendlier!

ADD REPLYlink written 6.3 years ago by _r_am32k

Just a comment. Storing in memory every sequence using its ID as key and sequence as value will not scale properly if you have millions of sequences that are relatively long. I suggest, read your IDs first, store them in a hash and parse your fasta file and if it exists in the hash, print the ID+sequence.

ADD REPLYlink written 6.3 years ago by Gabriel R.2.8k

I wonder why OP stores them in the first place. Isn't this a simple print-if-found scenario? Why use a hash at all?

ADD REPLYlink written 6.3 years ago by _r_am32k

Still, JC's solution "stores" them in a string and he needs to search linearly in the IDs. A hash provides a logarithmic solution at a reasonable memory case. 

ADD REPLYlink written 6.3 years ago by Gabriel R.2.8k

But we were discussing a case where the sequences number in the millions and/or the sequences are quite long, no? In which case JC's solution is better.

ADD REPLYlink written 6.3 years ago by _r_am32k

The size of the input was unspecified. On average, using a more efficient algorithm from the get-go saves more time than modifying inefficient code later :-)

ADD REPLYlink written 6.3 years ago by Gabriel R.2.8k

I agree. I have the solution you suggest implemented on my github repo - it is my go to solution. It is not really efficient in this case though, with pattern matching involved. We'd need a hash for the IDs and one for the patterns.

I think Python is better with its if X in array syntax in this case.

ADD REPLYlink written 6.3 years ago by _r_am32k

Jumping on the discussion, my solution is intended is your "list" of elements is short, storing a long list as a string will slow a lot the RegEx search. Of course you can store the elements in a hash and do iteration per element testing for a matching pattern, but I'm not sure if that will be faster than Python's "if X in array" even when is equivalent.

ADD REPLYlink written 6.3 years ago by JC12k

Just to improve on Ashutosh comment, assuming your fasta sequences are singleline fasta format, you can try:

LC_ALL=C fgrep -A 1 -f codes.txt test.fa >new_test.fa

should work more faster than normal grep.


ADD REPLYlink written 6.3 years ago by Prakki Rama2.4k
gravatar for JC
6.3 years ago by
JC12k wrote:

Here is my solution without BioPerl and using RegEx-fu:


use strict;
use warnings;

$ARGV[2] or die "use LIST FASTA OUTPUT\n";
my $list   = shift @ARGV;
my $fasta  = shift @ARGV;
my $out    = shift @ARGV;
my $search = '';

open (my $lh, "<", $list) or die;
while (<$lh>) {
    $search .= "$_|";
close $lh;
$search =~ s/\|$//; # delete the last pipe

open (my $oh, ">", $out) or die;
open (my $fh, "<", $fasta) or die;
$/ = "\n>"; # Fasta sequence slurp-mode
while (<$fh>) {
    print $oh ">$_" if (m/^($search)/);
close $fh;
close $oh;
ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by JC12k

This is a good solution, but took me a while to understand the inner working. It's always the inevitable tradeoff between brilliant implementations and readability, no?

ADD REPLYlink written 6.3 years ago by _r_am32k

I don't remember the reference exactly, but you can see a Perl code encrypted and decrypted and see no differences. ;)

ADD REPLYlink written 6.3 years ago by JC12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1734 users visited in the last hour