Question

Formatting RNA-Seq Data for Kruskal Wallis Test on R

0

Entering edit mode

8 months ago

eco2249 • 0

I am trying to perform Kruskal-Wallis test on RNA-Seq data, which has 20 genes of interest and looks like following:

 Gene_id     Con_1          Con_2        Con_3             Mut_1           Mut_2            Mut_3
Gene_001  -0.173575646  0.519571535 -5.87735812 2.023648932 1.94668789  1.56102541
Gene_002 -0.185999458   -0.118197772    0.129667462 -0.071623581    0.249618688 -0.003465339
Gene_003  3.486046831   -5.693834334    -1.088664148    0.009948141 3.682020477 -0.395516967

Only the first 3 rows are shown to demonstrate what data looks like. The first column is the gene id and columns 2-4 (WT) are three independent biological replicates of control and columns (5-7) are three independent biological replicates of mutations (Mut) as treatment. The data has been log transformed.

Here are my questions:

Q #1. A couple of posts on this forum suggested to transpose and/or melt data on R. How exactly should I do it?

Q #2. Given control and treatments have three replicates each, how the Kruskal-Wallis test should be performed? Should I average three replicates for control and average for mutation? In other words, what would be the best way to perform the Kruskal-Wallis test with RNA-seq data where it has replicates?

Any help/suggestion would be appreciated. Thank you.

R Kruskal-Wallis RNA-seq • 1.4k views

ADD COMMENT • link updated 8 months ago by swbarnes2 14k • written 8 months ago by eco2249 • 0

1

Entering edit mode

You can do something like this. I m using your data above and Gene_004-Gene_009 are duplicates of what you have in above.

mydata=read.table("mydata.txt", h=T, sep="\t", row.names = "Gene_id")
head(mydata)

       Gene_001     Gene_002     Gene_003   Gene_004     Gene_005     Gene_006   Gene_007     Gene_008     Gene_009
Con_1 -0.1735756 -0.185999458  3.486046831 -0.1735756 -0.185999458  3.486046831 -0.1735756 -0.185999458  3.486046831
Con_2  0.5195715 -0.118197772 -5.693834334  0.5195715 -0.118197772 -5.693834334  0.5195715 -0.118197772 -5.693834334
Con_3 -5.8773581  0.129667462 -1.088664148 -5.8773581  0.129667462 -1.088664148 -5.8773581  0.129667462 -1.088664148
Mut_1  2.0236489 -0.071623581  0.009948141  2.0236489 -0.071623581  0.009948141  2.0236489 -0.071623581  0.009948141
Mut_2  1.9466879  0.249618688  3.682020477  1.9466879  0.249618688  3.682020477  1.9466879  0.249618688  3.682020477
Mut_3  1.5610254 -0.003465339 -0.395516967  1.5610254 -0.003465339 -0.395516967  1.5610254 -0.003465339 -0.395516967

#Transposing your data
tmydata=t(mydata)
head(tmydata)

#Creating dataframe here
tmydata=data.frame(tmydata)
groups <-c("Con", "Con", "Con", "Mut", "Mut", "Mut")
new_data <- cbind(groups, tmydata)
rownames(new_data) <- NULL
head(new_data)

     groups   Gene_001     Gene_002     Gene_003   Gene_004     Gene_005     Gene_006   Gene_007     Gene_008     Gene_009
1    Con -0.1735756 -0.185999458  3.486046831 -0.1735756 -0.185999458  3.486046831 -0.1735756 -0.185999458  3.486046831
2    Con  0.5195715 -0.118197772 -5.693834334  0.5195715 -0.118197772 -5.693834334  0.5195715 -0.118197772 -5.693834334
3    Con -5.8773581  0.129667462 -1.088664148 -5.8773581  0.129667462 -1.088664148 -5.8773581  0.129667462 -1.088664148
4    Mut  2.0236489 -0.071623581  0.009948141  2.0236489 -0.071623581  0.009948141  2.0236489 -0.071623581  0.009948141
5    Mut  1.9466879  0.249618688  3.682020477  1.9466879  0.249618688  3.682020477  1.9466879  0.249618688  3.682020477
6    Mut  1.5610254 -0.003465339 -0.395516967  1.5610254 -0.003465339 -0.395516967  1.5610254 -0.003465339 -0.395516967

#run for a single Gene_001
kruskal.test(Gene_001 ~ groups, data = new_data)
kruskal.test(Gene_001 ~ groups, data = new_data)$p.value
[1] 0.04953461

#run for all genes
genes <- colnames(new_data)[-1]
genes
[1] "Gene_001" "Gene_002" "Gene_003" "Gene_004" "Gene_005" "Gene_006" "Gene_007" "Gene_008" "Gene_009"

res <- sapply(genes, function(x) {
  f <- as.formula(paste0(x, ' ~ groups'))
  model <- kruskal.test(f, data = new_data)
  p <- model$p.value
  p
})
names(res) <- genes
res
Gene_001   Gene_002   Gene_003   Gene_004   Gene_005   Gene_006   Gene_007   Gene_008   Gene_009 
0.04953461 0.27523352 0.27523352 0.04953461 0.27523352 0.27523352 0.04953461 0.27523352 0.27523352

ADD REPLY • link updated 8 months ago by Ram 43k • written 8 months ago by bk11 ★ 2.4k

0

Entering edit mode

bk11 , thank you very very much! It worked! I sincerely appreciate example codes with detailed comments. If it is ok, may I ask what does res do for calculating the Kruskal-Wallis test? It is basically for looping calculation of the test for all genes, which are defined in ~ groups?

ADD REPLY • link 8 months ago by eco2249 • 0

1

Entering edit mode

Yes, you are correct, that is what it is meant to do.

ADD REPLY • link updated 8 months ago by Ram 43k • written 8 months ago by bk11 ★ 2.4k

0

Entering edit mode

Thank you very much once again for your help!

ADD REPLY • link 8 months ago by eco2249 • 0

0

Entering edit mode

I find it curious that the p values for different genes turned out to be one of two values 0.04953461 and 0.27523352 why is that?

ADD REPLY • link 8 months ago by Istvan Albert 100k

0

Entering edit mode

For demo purpose, I created duplicates of Gene001-Gene003 for Gene004-Gene009.

ADD REPLY • link 8 months ago by bk11 ★ 2.4k

0

Entering edit mode

I understand that, but that does not fully explain why Gene_002 has the same p value as Gene_003, that is what stood out to me when looking at the data. The data for the two columns seems so radically different.

I would recommended generating these examples with some sort of random data. An example that shows an unexpected pattern can be very distracting when troubleshooting or answering a question - people may be diverted to suspect some sort of methodological error along the way

ADD REPLY • link 8 months ago by Istvan Albert 100k

0

Entering edit mode

The lowest, second lowest, and second highest values are controls in both genes 002 and 003. That it's so easy to get the same output suggest to me that this is a not helpful test for n = 6.

ADD REPLY • link 8 months ago by swbarnes2 14k

0

Entering edit mode

What is the purpose of the test being conducted here? Are you trying to get one p-value per gene? In which case, why are you doing a Kruskal-Wallice test, rather than a Mann-Whitney U test if you insist on doing a non-parametric test (although see ATPoint's post below, you don't really have the power for non-parametric when testing per-gene)?

Or is the idea to do some sort of gene-set test and get one overall p-value?

ADD REPLY • link 8 months ago by i.sudbery 19k

Ram · Answer 1 · 2023-08-24

I don't see the point of non-parametric tests with only n=3 per group. You will never get significances after multiple-testing correction. Use tests that are tailored for RNA-seq and low sample size. See the user guides of packages such as limma or DESeq2, and then see which of their strategies fits your purpose best. With 3 vs 3 so two groups it will come down to a standard groupwise comparison. Use expert software, what you aim to do is underpowered.

See for example. Even with extreme difference (0 vs 1000) is is barely significant at 3 vs 3 and this is not even multiple-testing corrected given that you do this test for many genes.

d <- data.frame(group=rep(c("A","B"), each=3), value=c(0,0,0,1000,1000,1000))
d
#>   group value
#> 1     A     0
#> 2     A     0
#> 3     A     0
#> 4     B  1000
#> 5     B  1000
#> 6     B  1000
kruskal.test(d$value, d$group)
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  d$value and d$group
#> Kruskal-Wallis chi-squared = 5, df = 1, p-value = 0.02535
Created on 2023-08-24 with reprex v2.0.2