Question: How To Randomly Select 2 Sequences From Fasta Format In Perl?
0
gravatar for Smandape
6.1 years ago by
Smandape60
United States
Smandape60 wrote:

Hello Experts,

I am handling sequences in FASTA format. I used bioperl to parse the sequences. Now, I have to select any two sequences randomly at a time and use it to find the unknown motif. My initial aim is to choose any sequences randomly. I read that function rand() is used to do randomization. But my concern is how can I do it using function rand()? As rand() uses array for randomization and I cannot put the sequences in an array as I think it will not be efficient. Also, does rand() chooses the sequences only once or it may choose the sequences more than once randomly? This code is suppose to run on any number of sequences with a fasta file as input.

My aim is to follow the greedy algorithm approach from the book An Introduction to Bioinformatics Algorithm by Jones and Pevzner. Here they have used only first two sequences to find (s1,s2) and then rest of the sequences compared with these two vectors to find the score. In my homework problem I have to choose the pair of sequences randomly. The code is as follows:

GREEDYMOTIFSEARCH(DNA, t, n, l)
1 bestMotif ← (1, 1, . . . , 1)
2 s ← (1, 1, . . . , 1)
3 for s1 ← 1 to n − l + 1
4  for s2 ← 1 to n − l + 1
5   if Score(s, 2,DNA) > Score(bestMotif , 2,DNA)
6     BestMotif1 ← s1
7     BestMotif2 ← s2
8 s1 ← BestMotif1
9 s2 ← BestMotif2
10 for i ← 3 to t
11   for si ← 1 to n − l + 1
12     if Score(s, i,DNA) > Score(bestMotif, i,DNA)
13        bestMotifi ← si
14   si ← bestMotifi
15 return bestMotif

where l is the length of the motif I have to find(this is taken from user), n is the number of nucleotide in a single sequence and t are the number of sequences. Thank you for looking at this question. Any help is greatly appreciated. Let me know if my question is not put in a clear way and I will try to rewrite it in other words.

ADD COMMENTlink modified 6.1 years ago by Chris Penkett470 • written 6.1 years ago by Smandape60
3
gravatar for Pierre Lindenbaum
6.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

not perl, but using a simple command line:

awk '/^>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' < seq.fasta  |\ #linearizze
egrep -v '^$' |\ #no blank line
shuf |\ #shuffle
head -n 2 |\ #two line
tr "\t" "\n" #restore fasta
ADD COMMENTlink written 6.1 years ago by Pierre Lindenbaum101k

@Pierre: Thank you for your answer. But I have to use this in my PERL program as I have to do these randomization at least 20 times. Is it not possible in PERL?

ADD REPLYlink written 6.1 years ago by Smandape60
3
gravatar for Lars Juhl Jensen
6.1 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

The typical way you would go about it is like this:

  1. You parse the FASTA file into some data structure. This could be a hash table in which you have the name of the sequence as key and the sequence as value. Or if you do not care about a name, it could just be an array in which each element is a sequence.
  2. You find the number of elements (N) in you data structure (i.e. the number of keys in your hash or the length of your array).
  3. You use rand(N) to generate a random floating point in the interval [0;N) and typecast it to an integer. Call that number i.
  4. You fetch sequence number i from your data structure.
  5. You repeat steps 3 and 4 to get a second sequence (making sure that you do not pick the same one twice).

Nothing Perl specific about this solution; this is how you would generally go about it. Unless you resort to nifty command-line hacks like what Pierre suggested, of course. ;-)

ADD COMMENTlink written 6.1 years ago by Lars Juhl Jensen11k
1
gravatar for Chris Penkett
6.1 years ago by
Chris Penkett470
Cambridge, UK
Chris Penkett470 wrote:

A couple of hints for your homework ;-)

Put the sequence names into an array and then use BioPerl indexing to retrieve the actual sequence from a sequence name:

http://www.bioperl.org/wiki/Module:Bio::Index::Fasta

This perl one-liner uses the rand() function to get a random sequence name from a fasta file. If you want two, you'll need to be careful you don't get the same sequence twice.

perl -lne 'push @seq, $_ if /^>/; END {$len = @seq; $rseq = int(rand($len)); $sname = $seq[$rseq]; $sname =~ s/^>//; print $sname}' seq.fasta

Chris

ADD COMMENTlink written 6.1 years ago by Chris Penkett470
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 613 users visited in the last hour