Normalizing the count of a given motif by sequence length
7.6 years ago
kevinm ▴ 40

Hi everyone! I am a newbie at data analysis and...

I am working on a data set of sequences (FASTA format) in which I found a motif by ab initio alignment. I now have a way to count the occurrences of the motif in each sequence of my FASTA file. I would like to know how to normalize the motif count per sequence by the length of the sequence because, correct me if I'm wrong, there is a higher chance of finding a motif in a longer sequence.

For example, I am using a 4 nt motif (the binding motif of an RNA-binding protein), and I can easily see that longer sequences contain more occurrences of the motif than shorter ones. Can someone help me with this?

For reference, this is how I get the number of motif occurrences per sequence:

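library(Biostrings)
library(seqinr)

# Read each FASTA record as a single string
fasta <- read.fasta("X.fasta", as.string = TRUE)
pattern <- "tcaa" # for example
# Dictionary holding the single motif, exact matches only
dict <- PDict(pattern, max.mismatch = 0)
seq <- DNAStringSet(unlist(fasta))
# Count motif occurrences in every sequence
result <- vcountPDict(dict, seq)
result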

It returns a matrix with one column per sequence, where each entry is the number of motif occurrences in the corresponding sequence.
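A simple length normalization could look like the sketch below. It is only an illustration: `normalize_motif_counts` is a hypothetical helper, and the toy vectors stand in for the real values, which could presumably be taken from the objects above as `as.integer(result[1, ])` for the counts and `width(seq)` for the lengths. It reports occurrences per kb and occurrences per possible start position (length minus motif length plus one).

```r
# Hypothetical helper: normalize per-sequence motif counts by sequence length.
# counts  : motif occurrences per sequence (assumed: as.integer(result[1, ]))
# lengths : sequence lengths in nt (assumed: width(seq))
# k       : motif length in nt
normalize_motif_counts <- function(counts, lengths, k) {
  stopifnot(length(counts) == length(lengths), k >= 1)
  # Number of positions where a k-mer can start; 0 if the sequence is shorter than k
  positions <- pmax(lengths - k + 1, 0)
  data.frame(
    count = counts,
    length = lengths,
    # Occurrences per 1000 nt
    per_kb = counts / lengths * 1000,
    # Fraction of possible start positions occupied by the motif
    per_position = ifelse(positions > 0, counts / positions, NA_real_)
  )
}

# Toy data standing in for real counts and lengths
counts <- c(3, 12, 0)
lengths <- c(500, 2000, 3)
norm <- normalize_motif_counts(counts, lengths, k = 4)
print(norm)
```

In the toy data the first two sequences have very different raw counts but the same density per kb, which is the effect the normalization is meant to expose. Sequences shorter than the motif get `NA` for the per-position value.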

Thanks

7.6 years ago

You might want to look at zero-order Markov models, for example here:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009841
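As a rough illustration of the idea (not necessarily the exact method of the linked paper), a zero-order Markov model treats bases as independent, so the expected number of motif occurrences in a sequence is the number of possible start positions times the product of the frequencies of the motif's bases. The sketch below uses base R only; `count_motif` and `expected_motif_count` are hypothetical helpers, and it assumes base frequencies are estimated from each sequence itself and that the motif contains only a, c, g and t.

```r
# Hypothetical helper: observed motif occurrences, overlapping matches included.
count_motif <- function(sequence, motif) {
  k <- nchar(motif)
  n_pos <- nchar(sequence) - k + 1
  if (n_pos < 1) return(0L)
  starts <- seq_len(n_pos)
  # Extract every k-mer of the sequence and compare it with the motif
  kmers <- substring(tolower(sequence), starts, starts + k - 1)
  sum(kmers == tolower(motif))
}

# Hypothetical helper: expected occurrences under a zero-order Markov model.
expected_motif_count <- function(sequence, motif) {
  bases <- strsplit(tolower(sequence), "")[[1]]
  motif_bases <- strsplit(tolower(motif), "")[[1]]
  # Base frequencies estimated from the sequence itself (assumption)
  freq <- table(factor(bases, levels = c("a", "c", "g", "t"))) / length(bases)
  # Probability that the motif starts at any given position
  p_motif <- prod(as.numeric(freq[motif_bases]))
  n_pos <- max(length(bases) - length(motif_bases) + 1, 0)
  n_pos * p_motif
}

# Toy sequence, for illustration only
sequence <- "tcaatcaaggtcaa"
observed <- count_motif(sequence, "tcaa")
expected <- expected_motif_count(sequence, "tcaa")
# Ratio above 1 suggests the motif is more frequent than composition alone predicts
enrichment <- observed / expected
```

The observed/expected ratio accounts for both sequence length and base composition, so it can be compared across sequences of different sizes.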
