Question

Statistics Of Enrichment Of Indels

0

12.0 years ago

PoGibas 5.1k

Usually I use Fisher's exact test or the Wilcoxon rank-sum test, but this time the data are different and I don't know what I should use.

Data: four DNA sequences, each with a different number of mutations.

I want to show that sequenceA is enriched for mutations compared to sequenceB (their lengths are equal).

I also have sequenceC (an extended sequenceA) and sequenceD (an extended sequenceB).

How should one do this?

My data look roughly like this (positions of indels):

254, 1000, 1036, 5448, 7315 -> sum = 6 mut;

63, 75, 967, 3691 -> sum = 4 mut;

Really looking forward to your answers. Thanks in advance.

It's for a quick glance at a p-value, so I don't need anything fancy. I hope someone can help.

statistics • 1.9k views

updated 12.0 years ago by Michael 54k • written 12.0 years ago by PoGibas 5.1k

1

Why doesn't Fisher's exact test work for you? I think it is a reasonable approach.

12.0 years ago by Michael 54k

score 3 · Answer 1 · 2012-04-23

I think that if the number of indels is small compared to the gene length, Fisher's exact test should be an acceptable approximation. Count the number of positions where an indel occurred versus the number of positions without a mutation, which yields a contingency table like this example (assuming a gene length of 10000 for both sequences):

cont.table <- matrix(c(6, 4, 10000 - 6, 10000 - 4), ncol = 2, byrow = TRUE)
cont.table
     [,1] [,2]
[1,]    6    4
[2,] 9994 9996

Then apply fisher.test to test the alternative hypothesis that column 1 is enriched with respect to column 2:

fisher.test(cont.table, alternative="greater")

    Fisher's Exact Test for Count Data

data:  cont.table 
p-value = 0.3769
alternative hypothesis: true odds ratio is greater than 1 
95 percent confidence interval:
 0.4357716       Inf 
sample estimates:
odds ratio 
  1.500262
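
If you want to repeat this for other pairs (for example sequenceC vs. sequenceD), the steps above can be wrapped in a small function. This is only a sketch: the name indel_enrichment and its arguments are made up for illustration, and the second length argument is an assumption to allow sequences of unequal length. The last lines cross-check the one-sided p-value directly against the hypergeometric tail, which is what the test computes when the null odds ratio is 1.

```r
# Hypothetical helper (not from the original answer): one-sided Fisher's exact
# test for indel enrichment in sequence A relative to sequence B.
# n_a, n_b     : number of positions with an indel in each sequence
# len_a, len_b : sequence lengths (len_b defaults to len_a)
indel_enrichment <- function(n_a, n_b, len_a, len_b = len_a) {
  tab <- matrix(c(n_a, n_b, len_a - n_a, len_b - n_b),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("indel", "no_indel"), c("seqA", "seqB")))
  fisher.test(tab, alternative = "greater")
}

# Same numbers as the example above: 6 vs. 4 indels, length 10000 each
res <- indel_enrichment(6, 4, 10000)
res$p.value

# Cross-check: P(X >= 6) when 10 indels are drawn from 10000 + 10000 positions
p_tail <- phyper(6 - 1, 10000, 10000, 6 + 4, lower.tail = FALSE)
p_tail
```

Both values should agree with the p-value of 0.3769 reported above, so with these counts there is no evidence of enrichment.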