Question: How To Find Any Of Many Motifs In Certain Sequences Using Perl?
1
gravatar for nikulina
10.1 years ago by
nikulina280
Cambridge
nikulina280 wrote:

Dear colleagues, I am applying for your help once more. Here is my perl script for counting the length between the binding sites and the start points of exons. So, there are some genes saved in the file called sequence.txt and some amount of binding sites in motif.txt. The thing I want to do is to count the length of the fragment for every gene if it has any of the binding sites from motif.txt. This current script does not work in the way I would like it. How should it be changed? Shall I add another while loop for $motif?

$string_filename = 'sequence.txt';  

open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");

$motif_filename = 'motif.txt';   

open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n"); 

local $/ = "\n>";  

   while (my $seq = <FILE>) {
chomp $seq;
$seq =~ s/^>*.+\n//;  
$seq =~ s/\n//g;  

$R = length $seq;  
  $motif = <MOTIF>; 
chomp $motif;
$motif =~ s/^>*.+\n//; 
$motif =~ s/\n//g; 
  if ( $seq =~ /$motif/ ) {     ## insert actual binding site 
    $M = $';
    $W = length $M;

    if ( $seq =~ /[A-Z]/) {    ##  exon start
        $K = $`;   
        $Z = length $K;  
        $x = $W + $Z - $R;     
        print "\n\ the distance is the following: $x\n\n";

    } else {  
        print "\n\ I couldn't find the start codon.\n\n";
    }
} else {  
    print "\n\ I couldn't find the binding site.\n\n";   
}

}

close MOTIF;   
close FILE;   

exit;
motif sequence perl • 5.8k views
ADD COMMENTlink modified 22 months ago by RamRS27k • written 10.1 years ago by nikulina280
3

please, show us what your inputs look like.

ADD REPLYlink written 10.1 years ago by Pierre Lindenbaum129k
1

There is a FASTQ-FASTA converter here - and Bioperl's Bio::SeqIO::fastq will handle some FASTQ formats.

ADD REPLYlink modified 22 months ago by RamRS27k • written 10.1 years ago by Neilfws48k

the binding sites look like this

ADD REPLYlink modified 22 months ago by RamRS27k • written 10.1 years ago by nikulina280

and the genes

ADD REPLYlink modified 9 months ago by RamRS27k • written 10.1 years ago by nikulina280

binding site=A FASTQ file ??!!!!!

ADD REPLYlink written 10.1 years ago by Pierre Lindenbaum129k

I recognise that it is a problem but i haven't found the way to convert it to fasts yet.

ADD REPLYlink written 10.1 years ago by nikulina280

And further discussion of FASTQ-FASTA conversion at StackOverflow.

ADD REPLYlink modified 22 months ago by RamRS27k • written 10.1 years ago by Neilfws48k
8
gravatar for Neilfws
10.1 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

I think you need to take a step back and think about the logic of your problem, before writing the code. Break your problem down into steps, like this:

  1. Read motifs into a data structure over which you can loop (e.g. an array)
  2. Read in your sequence(s)
  3. For each sequence, identify segments between binding site and exon start
  4. For each segment...
  5. For each motif, check for match to segment
  6. If at least one match, store/print segment length (and perhaps other useful information)

For tools to convert FASTQ-FASTA, see comments under your question.

As we said in your previous query, Bioperl makes this kind of task very trivial. There is a learning curve, but I would urge you to investigate - it will pay off in the long term. Using more "standard" code also makes it a lot easier for other people to read and debug.

ADD COMMENTlink modified 22 months ago by RamRS27k • written 10.1 years ago by Neilfws48k

Thank you, I'll try to do something with it.

ADD REPLYlink written 10.1 years ago by nikulina280
3
gravatar for Jeremy Leipzig
10.1 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

you should load your motifs into an array and then iterate over that motif array for each sequence

ADD COMMENTlink written 10.1 years ago by Jeremy Leipzig19k
1
gravatar for nikulina
10.1 years ago by
nikulina280
Cambridge
nikulina280 wrote:

I eventually recognised how to modify my script to do the job I wished. Now it looks like the following:)

$string_filename = 'seq.txt';  
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");   
$motif_filename = 'motif.txt';   
open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n");   
local $/ = "\n>";

while (@motif = <MOTIF>) {              
  while (my $seq = <FILE>) {
    chomp $seq;
    $seq =~ s/^>*.+\n//;  
    $seq =~ s/\n//g;  
    $R = length $seq;    
    foreach $site (@motif) {
      chomp $site;
      $site =~ s/^>*.+\n//; 
      $site =~ s/\n//g;  
      if ( $seq =~ /$site/ ) {     
        $M = $';
        $W = length $M;
        if ( $seq =~ /[AGTC]/) {   
          $K = $`;   
          $Z = length $K;  
          $x = $W + $Z - $R;     
          print "\nthe distance is the following: $x\n\n";
                               } 
                             } 
                            } 
                           }  
                          } 
close FILE;    
exit;

Today I tried to do the same using BioPerl module and recognised it to be much more easier indeed. Thanks a million for your help and support:)

ADD COMMENTlink modified 22 months ago by RamRS27k • written 10.1 years ago by nikulina280
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1152 users visited in the last hour