Question: How To Find Any Of Many Motifs In Certain Sequences Using Perl?
1
gravatar for nikulina
9.1 years ago by
nikulina280
Cambridge
nikulina280 wrote:

Dear colleagues, I am applying for your help once more. Here is my perl script for counting the length between the binding sites and the start points of exons. So, there are some genes saved in the file called sequence.txt and some amount of binding sites in motif.txt. The thing I want to do is to count the length of the fragment for every gene if it has any of the binding sites from motif.txt. This current script does not work in the way I would like it. How should it be changed? Shall I add another while loop for $motif?

$string_filename = 'sequence.txt';  

open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");

$motif_filename = 'motif.txt';   

open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n"); 

local $/ = "\n>";  

   while (my $seq = <FILE>) {
chomp $seq;
$seq =~ s/^>*.+\n//;  
$seq =~ s/\n//g;  

$R = length $seq;  
  $motif = <MOTIF>; 
chomp $motif;
$motif =~ s/^>*.+\n//; 
$motif =~ s/\n//g; 
  if ( $seq =~ /$motif/ ) {     ## insert actual binding site 
    $M = $';
    $W = length $M;

    if ( $seq =~ /[A-Z]/) {    ##  exon start
        $K = $`;   
        $Z = length $K;  
        $x = $W + $Z - $R;     
        print "\n\ the distance is the following: $x\n\n";

    } else {  
        print "\n\ I couldn't find the start codon.\n\n";
    }
} else {  
    print "\n\ I couldn't find the binding site.\n\n";   
}

}

close MOTIF;   
close FILE;   

exit;
motif sequence perl • 5.2k views
ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.1 years ago by nikulina280
3

please, show us what your inputs look like.

ADD REPLYlink written 9.1 years ago by Pierre Lindenbaum121k
1

There is a FASTQ-FASTA converter here - and Bioperl's Bio::SeqIO::fastq will handle some FASTQ formats.

ADD REPLYlink modified 10 months ago by RamRS22k • written 9.1 years ago by Neilfws48k

the binding sites look like this

ADD REPLYlink modified 10 months ago by RamRS22k • written 9.1 years ago by nikulina280

and the genes

ADD REPLYlink modified 10 months ago by RamRS22k • written 9.1 years ago by nikulina280

binding site=A FASTQ file ??!!!!!

ADD REPLYlink written 9.1 years ago by Pierre Lindenbaum121k

I recognise that it is a problem but i haven't found the way to convert it to fasts yet.

ADD REPLYlink written 9.1 years ago by nikulina280

And further discussion of FASTQ-FASTA conversion at StackOverflow.

ADD REPLYlink modified 10 months ago by RamRS22k • written 9.1 years ago by Neilfws48k
8
gravatar for Neilfws
9.1 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

I think you need to take a step back and think about the logic of your problem, before writing the code. Break your problem down into steps, like this:

  1. Read motifs into a data structure over which you can loop (e.g. an array)
  2. Read in your sequence(s)
  3. For each sequence, identify segments between binding site and exon start
  4. For each segment...
  5. For each motif, check for match to segment
  6. If at least one match, store/print segment length (and perhaps other useful information)

For tools to convert FASTQ-FASTA, see comments under your question.

As we said in your previous query, Bioperl makes this kind of task very trivial. There is a learning curve, but I would urge you to investigate - it will pay off in the long term. Using more "standard" code also makes it a lot easier for other people to read and debug.

ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.1 years ago by Neilfws48k

Thank you, I'll try to do something with it.

ADD REPLYlink written 9.1 years ago by nikulina280
3
gravatar for Jeremy Leipzig
9.1 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

you should load your motifs into an array and then iterate over that motif array for each sequence

ADD COMMENTlink written 9.1 years ago by Jeremy Leipzig18k
1
gravatar for nikulina
9.1 years ago by
nikulina280
Cambridge
nikulina280 wrote:

I eventually recognised how to modify my script to do the job I wished. Now it looks like the following:)

$string_filename = 'seq.txt';  
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");   
$motif_filename = 'motif.txt';   
open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n");   
local $/ = "\n>";

while (@motif = <MOTIF>) {              
  while (my $seq = <FILE>) {
    chomp $seq;
    $seq =~ s/^>*.+\n//;  
    $seq =~ s/\n//g;  
    $R = length $seq;    
    foreach $site (@motif) {
      chomp $site;
      $site =~ s/^>*.+\n//; 
      $site =~ s/\n//g;  
      if ( $seq =~ /$site/ ) {     
        $M = $';
        $W = length $M;
        if ( $seq =~ /[AGTC]/) {   
          $K = $`;   
          $Z = length $K;  
          $x = $W + $Z - $R;     
          print "\nthe distance is the following: $x\n\n";
                               } 
                             } 
                            } 
                           }  
                          } 
close FILE;    
exit;

Today I tried to do the same using BioPerl module and recognised it to be much more easier indeed. Thanks a million for your help and support:)

ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.1 years ago by nikulina280
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 947 users visited in the last hour