How To Generate Motif Counts That Are Computed Over Multiple Sequences
3
2
Entering edit mode
10.9 years ago

Hello all, i have a big fasta file (6000 seq) , in which i wanna calculate the frequency of number of repeated sequence for each sequence for ex- ATTAG is the motif and i wanna calculate the number of occurence of ATTAG in the seq and this is for every seq, i am trying it with a software and some codes but they are not giving result for each seq (there output is the total count of motif).plz help in solving this problem

sequence • 5.4k views
ADD COMMENT
0
Entering edit mode

Farhat's perl script A: 7n motif search over the genome should solve your problem.

ADD REPLY
2
Entering edit mode
10.9 years ago

You can use R/Bioconductor to solve this. For example, with the first 10 sequences of one of my fasta files:

library(seqinr)

library(Biostrings)

fasta_file <- read.fasta("mm10_refGene.fa", as.string = T) # read fasta file; every sequence will be one string

pattern <- "attag" # the pattern to look for

dict <- PDict(pattern, max.mismatch = 0) # make a dictionary from the patterns that you want to look for

seq <- DNAStringSet( unlist(fasta_file)[1:10] ) # make a DNAStringSet from the DNA sequences (only the first ten for this examples)

result <- vcountPDict(dict, seq) # count pattern in each of the sequences

result

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    1    1    0    2    3    1    0    1     4

The result is a matrix with one column per sequence. If you have several patterns, you provide them as a vector. In the result, every pattern will have one row. You can add the annotation of your fasta sequences as column names:

colnames(result) <- attr(fasta_file, "names")[1:10]

And save the matrix as a .csv:

write.csv2(result, "result.csv")

There is just one problem: If you have any characters in your sequences that are not part of the DNA alphabet, you will get an error.

DNA_ALPHABET

[1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
ADD COMMENT
1
Entering edit mode
7.4 years ago

Hi Kevin,

it would be good to have a p-value for the number of motifs that you found in each sequence. I addressed this problem in two different ways: You can take each sequence that you analyzed, scramble the nucleotides (e.g. with sample() in R), and then count the motif again. You repeat this many times and each time, you store the number of motifs that you found. Let's say you do it 1000 times, and 3 times you found as many motifs or even more than in the original, unscrambled sequence, your p-value would be 3/1000 = 0.003. This approach would also take the nucleotide composition of each individual sequence into account. If your motif is GC rich, you have a higher probability to find it in a GC rich sequence than in an AT rich sequence, for example.

For a more analytical approach, you could calculate for each sequence, how many 4-mers it has (the length of the sequence minus 3), and multiply this number by the probability to see your motif given a certain frequency of nucleotides. Let's say your motif is ATGA, and your collection of sequences has 20% As, 20% Ts, 30% Gs, and 30% Cs (I am making these numbers up now), the probability to see ATGA would be 0.2x0.2x0.3x0.2 = 0.024. In a sequence that is 103 nt long, you would expect to see 0.024x100 = 2.4 times ATGA just by chance. Now you can compare, for each sequence, the expected number of motifs to the observed number of motifs. You can go even further and calculate the probability to find your motif at least as many times as you actually did, given the number of 4-mers in your sequence and the binomial distribution, with pbinom(). What might be not entirely correct about this second approach is that the 4-mers are not independent from each other. E.g., finding AATG makes it quite likely that the next 4-mer in the sequence will be ATGA. I once compared both approaches for a more complicated, structural RNA motif and the results were identical.

ADD COMMENT
0
Entering edit mode
7.6 years ago
kevinm ▴ 40

Hi everyone ! I know this topic was open 3 years ago, and the answer just help me.

I just want to know if someone know how to normalized the motif count per sequence into the length of the sequence, because, correct me if i'm wrong, there is more chance of finding a motif on a longer sequence.

For the example, i am using a 4 nt motif (the binding motif of a RNA binding protein, cause i am working on RNAseq results), and i can easily see that a longer sequence have more motif than shorter one... Can someone help me for this case...

Thanks

ADD COMMENT

Login before adding your answer.

Traffic: 2562 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6