How compute the average distance between two distinct motifs insidse a list of DNA sequence?
2
2
Entering edit mode
7.5 years ago
kevinm ▴ 40

I've got a new issue...

I would like to compute the average distance between two distinct motifs in the same sequences list than before. Have you some clue on how manage this ???

I just come to do it for one motif like that :

source("motifOccurrence.R")

motif <- c("T", "C", "A", "A")

 motidist <- sapply(df, FUN=function(df, motif) {

      computeDistance(coordMotif(df, motif))

}, motif = motif)

This R code give me the average distance between a define motif inside all the sequences of my list. And I would like to do the same but with two motif... Can someone help me ?

To extend informations about what i wanted to do :

i worked with a fasta file in input :

'> 1

GACTCTACTATAAACGGGAGATAGCAATCTAACGCAGTGCTTCAACTCCTCCTCCATCTGAACACCCTTCAACCTTTGATACTCAGACGTTTTAGGTCGG

'> 2

ACCACCCCTTTGTCCAGAAATAGGACTCTTGGGCCTGTTGCCTGAATAAAGTCCAACCACCACAACCACTACACTACCATATGTAAGCTTCACTGATGGT

'> 3

CACCACAAGTGCGCGCCACGACGTGCATAGCCTCTAGATCGGCAACTCAGGCGAGAAGTGTTTTATTTCGGTGTGGCCGGTCCTGGGCATTTTACGGAAA

'> 4

GTTAGTGTACAAGTCCGAATAGAGTCACGAAAGACCCACACAACCACGTAATGACCTCGCTGTAATGAGATCAGTTGGCTCATGAAGGAAGAACGTAATG

'> 5

TGAGCGTTCGCCAATAACCATCCCTCTCGTTCCTTGTAACTGTACTATGATAGCGGGCGCCCCCCTAATTAAATAGCGGACGCCCTGACCTATTGTATGA

'> 6

TGATATATCTACTCGATAAGGATATAGAGGTCTAATTGTTGAGAAGTGTACCACCTTAGAGCACGAGTTTAGGATACTTAGTAGGTTCTTGCGAAGGATA

etcetc

Then the motidist object look like this :

                     1                               2

                     152                           94 

                     3                               4 

                      36                          138 

                     5                               6

                      92                          113

And the distances given by the function stand for one motif, and now, i would like to do the same but for the disatnce between two motifs like this :

atcgacatagacgactgatcgtcag MOTIF1 acggtagacagt MOTIF2 agcagatgacta # And this for all sequences in the file !

Thanks by advance

R sequence motif average distance • 3.4k views
ADD COMMENT
2
Entering edit mode

Can't you just loop over the sequence, record the index of the motif if found and subtract that when you find the next one? Additional question: what should be done if you see:

1) atcgacatagacgactgatcgtcag MOTIF1 acggtagacagt MOTIF2 agcagatgacta MOTIF2 acgtcgtagctgatgctcggct (twice motif 2 after each other)

2) atcgacatagacgactgatcgtcag MOTIF1 acggtagacagt MOTIF2 agcagatgactaacgtgtgtgtg MOTIF1 acgtcgtagctgatgctcggct (motif2 sandwhiched by motif1 with different lengths)

3) atcgacatagacgactgatcgtcag MOTIF1 acggtagacagtagcagatgactaacgtcgtagctgatgctcggct (just a motif1 without motif2)

ADD REPLY
0
Entering edit mode

What would you mean by loop over the sequence ?

And for the additional questions :

1) I would like to know only the distance between the closer motifs motif 1 and first motif 2.

2) In this case the two informations are interesting and in a first time the average distance will be a sufficient info.

3) If there is no motif 2, return me 0.

ADD REPLY
1
Entering edit mode

You should have a look at the re python module, regular expression, and just get the index of the position(s) in which a motif is found. This should be pretty easy.

ADD REPLY
1
Entering edit mode

Can you detailled your answer with a running example

ADD REPLY
1
Entering edit mode

I've edited my post with more details.

ADD REPLY
1
Entering edit mode

Do you need it in R or any other language would work?

ADD REPLY
0
Entering edit mode

R would be the easier way for me, but i can handle with python or perl if you have an idea in those languages !

ADD REPLY
1
Entering edit mode

What does source("motifOccurrence.R") does? What does df looks like? A minimum working example would help.

ADD REPLY
0
Entering edit mode

For source("motifOccurrence.R") ==> https://www.r-bloggers.com/calculate-the-average-distance-between-a-given-dna-motif-within-dna-sequences-in-r/

And df is just DNAStringSet instance looking like this :

width seq                                               names               

[1]  3144 ATGGTCGAAAATAGTACGCAGAA...TGTTTGGCAATGACTTTGCTTAA NC_001133.9:48563...

[2]  1242 TTATAAGTACTTCGTAAAATTCG...GATTCAGGAGAATATAATCCCAT NC_001133.9:80709...

[3]   921 TTATTGTGCCACTTCTTGCTCAG...TGCCAAACAATTTCGTCGGACAT NC_001133.9:10022...

[4]  1404 ATGTTGAAGTCAGCCGTTTATTC...ACGACACTTTATTAAAACAGTAA NC_001133.9:22545...

[5]  1920 TTAGATATATCGTATCTTTTCAT...AATCCAGAATCAGATACCGGCAT NC_001134.8:13834...

[6]  2070 TTAATCAATTTTGTCGTACAATT...GATTGTGAACCTTGGGCTTGCAT NC_001134.8:32187...
ADD REPLY
4
Entering edit mode
7.5 years ago
ddiez ★ 2.0k

Assuming I understood your problem this is my attempt to a solution. I do not use the functions in the script you pointed out, but the Biostrings Bioconductor package. Also, I use the Bioconductor package BSgenome.Scerevisiae.UCSC.sacCer3 for the yeast genome as example sequence data.

# load required packages.
library(ggplot2) # for qplot().
library(Biostrings)
library(BSgenome.Scerevisiae.UCSC.sacCer3)

# get sequence data (DNAStringSet)
seq <- getSeq(Scerevisiae)

# create motif dictionary.
dict0 <- DNAStringSet(
  x = c("TCAA",
        "GGAT",
        "ATAT",
        "GGCC")
)
pdict0 <- PDict(dict0)

# search with the dictionary every sequence in seq.
res <- lapply(seq, function(s) matchPDict(pdict0, s))
res[[1]]

MIndex object of length 4
[[1]]
IRanges object with 1533 ranges and 0 metadata columns:
             start       end     width
         <integer> <integer> <integer>
     [1]       114       117         4
     [2]       161       164         4
     [3]       456       459         4
     [4]       719       722         4
     [5]       776       779         4
     ...       ...       ...       ...
  [1529]    228963    228966         4
  [1530]    229173    229176         4
  [1531]    229260    229263         4
  [1532]    229817    229820         4
  [1533]    229825    229828         4

...
<3 more elements>

# calculate combinations of motifs to compare
# assuming you want to compute different motifs to each other.
motifcomp <- t(combn(seq_len(length(pdict0)), 2))
colnames(motifcomp) <- c("motif_i", "motid_j")
motifcomp

     motif_i motid_j
[1,]       1       2
[2,]       1       3
[3,]       1       4
[4,]       2       3
[5,]       2       4
[6,]       3       4

# iterate over the comparison matrix. for each pair of motifs and
# compute the distance to the nearest one over all sequences.
# this gives a list with each elements being the distances.
foo <- lapply(seq_len(nrow(motifcomp)), function(i) {
  m <- motifcomp[i,]
  d <- sapply(res, function(r) {
    mcols(distanceToNearest(r[[m[1]]], r[[m[2]]]))$distance
  })
  unlist(d)
})

# we can check these distances are not normally distributed.
qplot(foo[[1]])

# now compute mean (and maybe other summaries, here standard deviation).
hoo <- lapply(foo, function(x) {
  data.frame(mean = mean(x), sd = sd(x))
})
hoo <- do.call(rbind, hoo)

data.frame(motifcomp, hoo)

  motif_i motid_j      mean        sd
1       1       2 150.15481 154.67573
2       1       3  71.63235  85.10428
3       1       4 358.76577 390.83777
4       2       3  69.26831  87.50602
5       2       4 348.63816 389.88737
6       3       4 378.64977 409.81555
ADD COMMENT
0
Entering edit mode
7.5 years ago
Macherki M E ▴ 120

I think that I already give answer here. Moreover, the distance between two motifs A and B is usually formulated as geometric distribution Where:

P(A,B,distance =k)=p(A)P(B)*(1-P(B))^k

You can simulate k (exemple 1:10000), and the Expected value will be the sum of pi*ni

ADD COMMENT

Login before adding your answer.

Traffic: 2585 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6