How To Randomly Select 2 Sequences From Fasta Format In Perl?
Entering edit mode
12.7 years ago
Smandape ▴ 120

Hello Experts,

I am handling sequences in FASTA format. I used bioperl to parse the sequences. Now, I have to select any two sequences randomly at a time and use it to find the unknown motif. My initial aim is to choose any sequences randomly. I read that function rand() is used to do randomization. But my concern is how can I do it using function rand()? As rand() uses array for randomization and I cannot put the sequences in an array as I think it will not be efficient. Also, does rand() chooses the sequences only once or it may choose the sequences more than once randomly? This code is suppose to run on any number of sequences with a fasta file as input.

My aim is to follow the greedy algorithm approach from the book An Introduction to Bioinformatics Algorithm by Jones and Pevzner. Here they have used only first two sequences to find (s1,s2) and then rest of the sequences compared with these two vectors to find the score. In my homework problem I have to choose the pair of sequences randomly. The code is as follows:

1 bestMotif ← (1, 1, . . . , 1)
2 s ← (1, 1, . . . , 1)
3 for s1 ← 1 to n − l + 1
4  for s2 ← 1 to n − l + 1
5   if Score(s, 2,DNA) > Score(bestMotif , 2,DNA)
6     BestMotif1 ← s1
7     BestMotif2 ← s2
8 s1 ← BestMotif1
9 s2 ← BestMotif2
10 for i ← 3 to t
11   for si ← 1 to n − l + 1
12     if Score(s, i,DNA) > Score(bestMotif, i,DNA)
13        bestMotifi ← si
14   si ← bestMotifi
15 return bestMotif

where l is the length of the motif I have to find(this is taken from user), n is the number of nucleotide in a single sequence and t are the number of sequences. Thank you for looking at this question. Any help is greatly appreciated. Let me know if my question is not put in a clear way and I will try to rewrite it in other words.

homework algorithm random sequence fasta • 4.7k views
Entering edit mode
12.7 years ago

not perl, but using a simple command line:

awk '/^>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' < seq.fasta  |\ #linearizze
egrep -v '^$' |\ #no blank line
shuf |\ #shuffle
head -n 2 |\ #two line
tr "\t" "\n" #restore fasta
Entering edit mode

@Pierre: Thank you for your answer. But I have to use this in my PERL program as I have to do these randomization at least 20 times. Is it not possible in PERL?

Entering edit mode
12.7 years ago

The typical way you would go about it is like this:

  1. You parse the FASTA file into some data structure. This could be a hash table in which you have the name of the sequence as key and the sequence as value. Or if you do not care about a name, it could just be an array in which each element is a sequence.
  2. You find the number of elements (N) in you data structure (i.e. the number of keys in your hash or the length of your array).
  3. You use rand(N) to generate a random floating point in the interval [0;N) and typecast it to an integer. Call that number i.
  4. You fetch sequence number i from your data structure.
  5. You repeat steps 3 and 4 to get a second sequence (making sure that you do not pick the same one twice).

Nothing Perl specific about this solution; this is how you would generally go about it. Unless you resort to nifty command-line hacks like what Pierre suggested, of course. ;-)

Entering edit mode
12.7 years ago
Chris Penkett ▴ 490

A couple of hints for your homework ;-)

Put the sequence names into an array and then use BioPerl indexing to retrieve the actual sequence from a sequence name:

This perl one-liner uses the rand() function to get a random sequence name from a fasta file. If you want two, you'll need to be careful you don't get the same sequence twice.

perl -lne 'push @seq, $_ if /^>/; END {$len = @seq; $rseq = int(rand($len)); $sname = $seq[$rseq]; $sname =~ s/^>//; print $sname}' seq.fasta



Login before adding your answer.

Traffic: 2721 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6