depth of coverage - ANOVA - R
1
0
Entering edit mode
3.1 years ago

Hello, I'm analyzing output of depth of coverage between five breeds. I want to understand where the depth of coverage of genome has changed between five breeds. I have 5 files containing positions in the rows and samples of each breeds in the columns. then I joint them . Now I want to perform an ANOVA in each row to identify positions with different depth across all my samples. How can I do an ANOVA test in R? How should I introduce groups to R?

example of my data:

positon A1  A2  A3  A4  A5  B1  B2  B3  B4  B5  C1  C2  C3  C4  C5
500000  11.6    30.3    28.5    11.3    28.6    26.6    27.5    23.9    19.8    8.4 27.6    19.4    20.4    30.2    31.3
501000  10.8    4.7 5.5 6.9 6.3 7.0 4.2 7.5 5.0 5.3 4.5 6.8 7.4 5.3 5.8
502000  5.3 2.5 4.1 3.1 2.4 5.2 3.5 4.2 5.1 6.1 4.1 4.3 6.2 2.8 3.9
503000  5.0 5.2 3.6 5.1 3.3 3.9 4.3 5.1 4.4 4.0 4.9 2.8 3.3 5.1 4.9
504000  13.0    9.4 10.5    19.2    10.2    9.0 11.1    9.0 8.3 20.2    9.2 18.5    10.9    7.3 9.2
505000  6.5 9.8 10.4    46.9    15.0    6.7 13.6    13.3    9.8 43.6    12.3    43.9    11.7    10.6    14.5
506000  9.4 12.5    14.3    13.4    14.0    11.1    14.1    15.3    11.5    14.5    14.5    15.4    15.1    15.3    15.1
508100  1.2 0.0 0.4 0.0 0.1 4.1 0.8 2.1 1.3 1.4 4.5 2.2 4.7 2.6 5.3


R ANOVA depth of coverage • 920 views
0
Entering edit mode

Since ANOVA is run in each row for each position of genome between breeds, and the number of positions are around 900,000, I can't plot them, so I want to know is it necessary checking the normality? How can I do that? If Shapiro Test is over-sensitive what procedure is recommended? Thank you for your time

0
Entering edit mode

Ah - it is in these situations with large variable numbers whereby the Shapiro test will virtually always say 'not normally distributed', but don't quote me on this. I am not a Professor of Statistics. There is a good discussion here: https://stats.stackexchange.com/questions/12053/what-should-i-check-for-normality-raw-data-or-residuals

Also, given your large number of variables, you will want to 'parallelise' my code (below). You can do this by replacing %do% with %dopar% in the foreach loop. You will also require doParallel package. Please see here for information on how to choose number of threads / CPU cores (system dependent): R functions for parallel processing

4
Entering edit mode
3.1 years ago

Hey, you will have to manipulate your data such that it is in this format:

position <- c('500','501','502','503','504','505')
coverage <- data.frame(
position = position,
A1=c(30,20,29,22,30,30),
A2=c(35,33,22,43,32,29),
B1=c(21,32,1,33,43,44),
B2=c(25,25,33,31,32,33),
C1=c(50,45,39,40,50,55),
C2=c(60,59,23,56,55,44))

rownames(coverage) = coverage$position coverage <- coverage[,-1] group <- c('A','A','B','B','C','C') coverage <- data.frame(group, t(coverage)) coverage group X500 X501 X502 X503 X504 X505 A1 A 30 20 29 22 30 30 A2 A 35 33 22 43 32 29 B1 B 21 32 1 33 43 44 B2 B 25 25 33 31 32 33 C1 C 50 45 39 40 50 55 C2 C 60 59 23 56 55 44  Now, we can set up a loop that will test each position in an ANOVA. You should study the functionality of the foreach() function. The actual ANOVA call is made with aov() require(foreach) res <- foreach(i = 2:ncol(coverage), .inorder = TRUE) %do% { pos <- colnames(coverage)[i] f <- as.formula(paste0(pos, ' ~ group')) summary(aov(f, data = coverage))[[1]]["Pr(>F)"][1,1] } data.frame( position = position, pvalue = do.call(rbind, res)) position pvalue 1 500 0.01516230 2 501 0.09260065 3 502 0.67507024 4 503 0.36883595 5 504 0.04883831 6 505 0.11202618  The result is a 2-column data-frame with position and p-value from the ANOVA. I expect you to adapt this code to your own situation. Kevin ADD COMMENT 1 Entering edit mode To do a non-parametric ANOVA (Kruskal-Wallis test), which may be advisable with this data and your low sample numbers ('low' based on the data that you pasted in your original question), you just need to change the line in the foreach loop to: kruskal.test(f, data = coverage)$p.value

0
Entering edit mode

Hi Kevin Blighe, thanks for the reply. I have 115 samples totally, each breed has 20 or 25 samples. Before doing ANOVA should I do a normality test (Shapiro test) or dip test or a non-parametric ANOVA (Kruskal-Wallis test)? Which one is better? At first I did the dip test, but most of the positions were removed.

0
Entering edit mode

With those numbers, parametric ANOVA should be fine - nobody can criticise you too much if you use Kruskal-Wallis (non-parametric), though. Generally, I always check the distribution of my data before I perform any tests, e.g., histograms, boxplots, min, max, mean, median, sdev, etc. I do not have much comment on the Shapiro Test - sorry. If my memory serves me correctly, it is over-sensitive in certain situations and will always result in failure of normality.