Question: pattern matching tools
jaqy20 wrote:

Hello, I have 1000 motifs (octamers) and I want to determine the total number of occurrences of each motif in the whole genome of Arabidopsis thaliana. Can you help me, please? Do you know of a tool to do this, or what else can I do? I have tried the RSAT tools, but the computation takes a lot of time (I submitted my job 2 days ago but have not received the results). Thank you very much.

sequence genome • 958 views
modified 2.9 years ago • written 2.9 years ago by jaqy20

Thank you very much to all of you; your responses are very helpful.

written 2.9 years ago by jaqy20

estebanpw30 wrote:

I am no biologist, but I assume octamer motifs are just 8 specific bases (e.g. ATTCGTGT). You can download the whole genome (it's about 135 Mb) and then run a simple program that counts the occurrences of your motifs (this is like counting k-mers for k=8).

You could do it like this in C:

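#include <stdlib.h>
#include <stdio.h>

int main(void){

    const char myOctamer[] = "ATTCGTGT"; // Your octamer here (8 bases)

    // Sequence as plain text; FASTA header lines (">...") are not skipped
    FILE * genomeFasta = fopen("file.txt", "r");
    if(genomeFasta == NULL){
        perror("file.txt");
        return EXIT_FAILURE;
    }

    int c; // int rather than char, so EOF can be told apart from data
    int totalFound = 0, current = 0;
    while((c = fgetc(genomeFasta)) != EOF){

        if(c != '\n'){ // ignore line breaks inside the sequence
            if(c == myOctamer[current]){
                current++;
                if(current == 8){
                    current = 0;
                    totalFound++;
                }
            }else{
                current = 0; // restart; the mismatching base is not re-checked
            }
        }
    }
    fclose(genomeFasta);
    fprintf(stdout, "Found %d occurrences.\n", totalFound);
    return 0;
}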


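This works for only one octamer, but generalizing it to n octamers would not be difficult. Note that it will not work for overlapping sequences: after a partial match fails, the search restarts from scratch, so an occurrence that begins inside a failed partial match (or inside a previous match) is missed. And in case you actually use it, I recommend compiling with -D_FILE_OFFSET_BITS=64 to be able to handle large files (over 2 GB).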

Hope this helps and that I am not too far from the main point, Esteban

modified 2.9 years ago • written 2.9 years ago by estebanpw30

Woah, that's a really cool technique I've never seen before. You don't store more than a byte at a time or do any copying when comparing to the barcode. Awesome to see more C programmers on the forum :)

Unfortunately, the lack of being able to support overlapping sequences could be a big big issue depending on the barcode. For example, this code wouldn't find the barcode "AAC" in the genome "....AAAC....", no matter what came before or after in the genome. This makes it unsuitable for this sort of application -- however, I learnt something new, so still awesome :)

written 2.9 years ago by John12k
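
To make the overlap issue concrete, here is a minimal sketch (not estebanpw's code) that tests the motif at every start position, so a failed partial match cannot hide a later occurrence. The function name count_overlapping is hypothetical, and the sequence is assumed to be already in memory as one string with line breaks removed.

```c
#include <stddef.h>
#include <string.h>

/* Count occurrences of motif in seq, including overlapping ones.
 * Hypothetical helper for illustration; assumes seq is a NUL-terminated
 * string with no line breaks. Forward strand only. */
size_t count_overlapping(const char *seq, const char *motif)
{
    size_t n = strlen(seq);
    size_t m = strlen(motif);
    size_t count = 0;

    if (m == 0 || m > n)
        return 0;

    /* Try every start position and advance by one base after each test,
     * whether or not it matched, so overlapping hits are all counted. */
    for (size_t i = 0; i + m <= n; i++) {
        if (memcmp(seq + i, motif, m) == 0)
            count++;
    }
    return count;
}
```

With this, count_overlapping("GGAAACGG", "AAC") returns 1 and count_overlapping("AAAA", "AA") returns 3. Each position costs up to m comparisons, which is fine for an octamer but means one full pass over the genome per motif.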

Thank you for your feedback! Yes, it's more of an illustrative example and would probably not be used in real applications (at least, not as it is). I thought it could be helpful for the original poster, who could develop it further or ask for more help if needed.

Likewise, awesome to see more C programmers!

written 2.9 years ago by estebanpw30

harold.smith.tarheel4.3k wrote:

A k-mer counter like khmer or Jellyfish can be used to obtain all octamer frequencies, which can then be filtered for the subset of interest.

written 2.9 years ago by harold.smith.tarheel4.3k
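
The "count every octamer, then filter" idea can be sketched in C as follows. This is not how khmer or Jellyfish are implemented; it is a minimal illustration that assumes the forward strand only, an in-memory sequence with line breaks removed, and that any character other than A, C, G or T (for example N) breaks the current window. The names base_code, count_octamers and octamer_index are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define K 8
#define NUM_OCTAMERS (1u << (2 * K))   /* 4^8 = 65536 possible octamers */

/* Map a base to a 2-bit code, or -1 for anything else (e.g. N). */
static int base_code(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/* Fill counts[NUM_OCTAMERS] with the number of occurrences (overlaps
 * included) of every octamer in seq. */
void count_octamers(const char *seq, uint32_t *counts)
{
    uint32_t window = 0;   /* 2-bit codes of the most recent bases */
    int valid = 0;         /* consecutive valid bases seen, capped at K */

    memset(counts, 0, NUM_OCTAMERS * sizeof counts[0]);
    for (size_t i = 0; seq[i] != '\0'; i++) {
        int code = base_code(seq[i]);
        if (code < 0) {    /* non-ACGT character: restart the window */
            valid = 0;
            continue;
        }
        window = ((window << 2) | (uint32_t)code) & (NUM_OCTAMERS - 1);
        if (valid < K)
            valid++;
        if (valid == K)
            counts[window]++;
    }
}

/* Index of an octamer in the counts table, or -1 if motif is not
 * exactly 8 valid bases. */
long octamer_index(const char *motif)
{
    long idx = 0;
    if (strlen(motif) != K)
        return -1;
    for (int i = 0; i < K; i++) {
        int code = base_code(motif[i]);
        if (code < 0)
            return -1;
        idx = (idx << 2) | code;
    }
    return idx;
}
```

After one pass with count_octamers, the count for each of the 1000 motifs is a single lookup, counts[octamer_index(motif)]. The table holds 65536 32-bit counters (256 KiB), so the whole approach needs one scan of the genome regardless of how many motifs are queried.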
Asaf5.5k wrote:

compseq from EMBOSS can give you the number of times every possible octamer appears in a sequence. It can be run via Galaxy. Make sure to set all frames in the parameters.

written 2.9 years ago by Asaf5.5k

Thanks, it's very simple and fast.

written 2.9 years ago by jaqy20