Question: depth of coverage - ANOVA - R
0
gravatar for soleimani_homa
7 months ago by
soleimani_homa0 wrote:

Hello, I'm analyzing output of depth of coverage between five breeds. I want to understand where the depth of coverage of genome has changed between five breeds. I have 5 files containing positions in the rows and samples of each breeds in the columns. then I joint them . Now I want to perform an ANOVA in each row to identify positions with different depth across all my samples. How can I do an ANOVA test in R? How should I introduce groups to R?

example of my data:

positon A1  A2  A3  A4  A5  B1  B2  B3  B4  B5  C1  C2  C3  C4  C5
500000  11.6    30.3    28.5    11.3    28.6    26.6    27.5    23.9    19.8    8.4 27.6    19.4    20.4    30.2    31.3
501000  10.8    4.7 5.5 6.9 6.3 7.0 4.2 7.5 5.0 5.3 4.5 6.8 7.4 5.3 5.8
502000  5.3 2.5 4.1 3.1 2.4 5.2 3.5 4.2 5.1 6.1 4.1 4.3 6.2 2.8 3.9
503000  5.0 5.2 3.6 5.1 3.3 3.9 4.3 5.1 4.4 4.0 4.9 2.8 3.3 5.1 4.9
504000  13.0    9.4 10.5    19.2    10.2    9.0 11.1    9.0 8.3 20.2    9.2 18.5    10.9    7.3 9.2
505000  6.5 9.8 10.4    46.9    15.0    6.7 13.6    13.3    9.8 43.6    12.3    43.9    11.7    10.6    14.5
506000  9.4 12.5    14.3    13.4    14.0    11.1    14.1    15.3    11.5    14.5    14.5    15.4    15.1    15.3    15.1
508100  1.2 0.0 0.4 0.0 0.1 4.1 0.8 2.1 1.3 1.4 4.5 2.2 4.7 2.6 5.3

Thanks in advance

anova depth of coverage R • 244 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by soleimani_homa0

Since ANOVA is run in each row for each position of genome between breeds, and the number of positions are around 900,000, I can't plot them, so I want to know is it necessary checking the normality? How can I do that? If Shapiro Test is over-sensitive what procedure is recommended? Thank you for your time

ADD REPLYlink written 7 months ago by soleimani_homa0

I moved your message to a comment (you had posted it as an answer).

Ah - it is in these situations with large variable numbers whereby the Shapiro test will virtually always say 'not normally distributed', but don't quote me on this. I am not a Professor of Statistics. There is a good discussion here: https://stats.stackexchange.com/questions/12053/what-should-i-check-for-normality-raw-data-or-residuals

Also, given your large number of variables, you will want to 'parallelise' my code (below). You can do this by replacing %do% with %dopar% in the foreach loop. You will also require doParallel package. Please see here for information on how to choose number of threads / CPU cores (system dependent): R functions for parallel processing

ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe52k
3
gravatar for Kevin Blighe
7 months ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

Hey, you will have to manipulate your data such that it is in this format:

position <- c('500','501','502','503','504','505')
coverage <- data.frame(
  position = position,
  A1=c(30,20,29,22,30,30),
  A2=c(35,33,22,43,32,29),
  B1=c(21,32,1,33,43,44),
  B2=c(25,25,33,31,32,33),
  C1=c(50,45,39,40,50,55),
  C2=c(60,59,23,56,55,44))

rownames(coverage) = coverage$position
coverage <- coverage[,-1]

group <- c('A','A','B','B','C','C')

coverage <- data.frame(group, t(coverage))

coverage
   group X500 X501 X502 X503 X504 X505
A1     A   30   20   29   22   30   30
A2     A   35   33   22   43   32   29
B1     B   21   32    1   33   43   44
B2     B   25   25   33   31   32   33
C1     C   50   45   39   40   50   55
C2     C   60   59   23   56   55   44

Now, we can set up a loop that will test each position in an ANOVA. You should study the functionality of the foreach() function. The actual ANOVA call is made with aov()

require(foreach)
res <- foreach(i = 2:ncol(coverage), .inorder = TRUE) %do% {
  pos <- colnames(coverage)[i]
  f <- as.formula(paste0(pos, ' ~ group'))
  summary(aov(f, data = coverage))[[1]]["Pr(>F)"][1,1]
}

data.frame(
  position = position,
  pvalue = do.call(rbind, res))
  position     pvalue
1      500 0.01516230
2      501 0.09260065
3      502 0.67507024
4      503 0.36883595
5      504 0.04883831
6      505 0.11202618

The result is a 2-column data-frame with position and p-value from the ANOVA.

I expect you to adapt this code to your own situation.

Kevin

ADD COMMENTlink modified 7 months ago • written 7 months ago by Kevin Blighe52k
1

To do a non-parametric ANOVA (Kruskal-Wallis test), which may be advisable with this data and your low sample numbers ('low' based on the data that you pasted in your original question), you just need to change the line in the foreach loop to:

kruskal.test(f, data = coverage)$p.value
ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe52k

Hi Kevin Blighe, thanks for the reply. I have 115 samples totally, each breed has 20 or 25 samples. Before doing ANOVA should I do a normality test (Shapiro test) or dip test or a non-parametric ANOVA (Kruskal-Wallis test)? Which one is better? At first I did the dip test, but most of the positions were removed.

ADD REPLYlink written 7 months ago by soleimani_homa0

With those numbers, parametric ANOVA should be fine - nobody can criticise you too much if you use Kruskal-Wallis (non-parametric), though. Generally, I always check the distribution of my data before I perform any tests, e.g., histograms, boxplots, min, max, mean, median, sdev, etc. I do not have much comment on the Shapiro Test - sorry. If my memory serves me correctly, it is over-sensitive in certain situations and will always result in failure of normality.

ADD REPLYlink written 7 months ago by Kevin Blighe52k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1772 users visited in the last hour