Hello everyone,

I am currently using limma to look for differentially expressed genes (DEGs) in a small dataset (78 samples) of log2, quantile-normalized gene expression data. There are three groups: Controls (CNT), Traumatized (TE), and Disorder (DIS).

I have run into a problem that I am fairly certain has been addressed here and/or on Bioconductor: nothing appears to be differentially expressed after adjusting p-values using the false discovery rate (FDR). All of the adjusted p-values come out as the same, universally high value, while my raw p-values are variable and some are < .05 (see below). This, I understand, is a product of the second (monotonicity) step in the p-value correction and is NOT an artifact or glitch.
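As a toy illustration of why this happens (not my dataset; just base R's p.adjust on a handful of made-up p-values):

```
p <- c(0.01, 0.02, 0.03, 0.5, 0.8)
p.adjust(p, method = "fdr")
# adjusted: 0.050 0.050 0.050 0.625 0.800
# The step-up procedure's cumulative-minimum step makes
# the first three adjusted values identical.
```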

However, I was wondering if I had performed my analysis correctly, as this is my first time using limma and I am also rather new to R. It's one thing to find no statistically significant evidence for DEGs because it isn't there in the data; it's another thing altogether to find no DEGs because you BOTCHED the design matrix or something, and then have that pointed out to you later.

So if you have the time, and are so inclined, would you please look over the following commands and let me know if you spot an error?

I began with a table of quantile-normalized log2 gene expression values:

```
# Expression matrix: genes in rows, samples in columns
array.dat <- read.table("proj2_data_miRNA_GRPD.txt", header = TRUE, row.names = 1)
```

Manual setup of the labels to be used in making the design matrix:

```
labels <- factor(
  c(rep('CNT', 19), rep('DIS', 23), rep('TE', 36)),
  levels = c('CNT', 'DIS', 'TE')
)
```

Generate the design matrix. The model is specified by (~0 + labels); ~0 means no intercept:

```
design <- model.matrix(~ 0 + labels)
colnames(design) <- levels(labels)
design
# Design matrix:
   CNT DIS TE
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 0 0
10 1 0 0
11 1 0 0
12 1 0 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 1 0 0
19 1 0 0
20 0 1 0
21 0 1 0
22 0 1 0
23 0 1 0
24 0 1 0
25 0 1 0
26 0 1 0
27 0 1 0
28 0 1 0
29 0 1 0
30 0 1 0
31 0 1 0
32 0 1 0
33 0 1 0
34 0 1 0
35 0 1 0
36 0 1 0
37 0 1 0
38 0 1 0
39 0 1 0
40 0 1 0
41 0 1 0
42 0 1 0
43 0 0 1
44 0 0 1
45 0 0 1
46 0 0 1
47 0 0 1
48 0 0 1
49 0 0 1
50 0 0 1
51 0 0 1
52 0 0 1
53 0 0 1
54 0 0 1
55 0 0 1
56 0 0 1
57 0 0 1
58 0 0 1
59 0 0 1
60 0 0 1
61 0 0 1
62 0 0 1
63 0 0 1
64 0 0 1
65 0 0 1
66 0 0 1
67 0 0 1
68 0 0 1
69 0 0 1
70 0 0 1
71 0 0 1
72 0 0 1
73 0 0 1
74 0 0 1
75 0 0 1
76 0 0 1
77 0 0 1
78 0 0 1
```

Create the contrast matrix. In the last element of this matrix I am trying to compare the control samples to the combined DIS and TE groups (as if they had all fallen under the DIS label, for instance). I believe this is the correct way to do that:

```
contrast_miRNA.mat <- makeContrasts(
  DISvCNT = DIS - CNT,             # compare Disorder samples to Control
  TEvCNT = CNT - TE,               # compare trauma samples to Control
  DISvTE = DIS - TE,               # compare Disorder to trauma samples
  CNTvDISnTE = CNT - (DIS + TE)/2, # compare Disorder & trauma samples to Control
  levels = design)
contrast_miRNA.mat
Levels DISvCNT TEvCNT DISvTE CNTvDISnTE
CNT -1 1 0 1.0
DIS 1 0 1 -0.5
TE 0 -1 -1 -0.5
```

Fit the linear model to the data using the design matrix:

```
fit_miRNA <- lmFit(array.dat, design)
```

Apply the contrast matrix to the fitted model:

```
fit_miRNA.cont <- contrasts.fit(fit_miRNA, contrast_miRNA.mat)
```

Apply the eBayes adjustment:

```
fit_miRNA.eb <- eBayes(fit_miRNA.cont)
```

Use the topTable command to search for differentially expressed genes, adjusting p-values using the false discovery rate (q-values). The p-value cutoff would actually be .05; I have set it to 1.0 here so that everything is saved and I am able to display the final analysis products, rather than an empty object:

```
deg_miRNA.DISvCNT <- topTable(fit_miRNA.eb, coef = "DISvCNT", number = nrow(array.dat), adjust.method = 'fdr', p.value = 1.0)
deg_miRNA.TEvCNT <- topTable(fit_miRNA.eb, coef = "TEvCNT", number = nrow(array.dat), adjust.method = 'fdr', p.value = 1.0)
deg_miRNA.DISvTE <- topTable(fit_miRNA.eb, coef = "DISvTE", number = nrow(array.dat), adjust.method = 'fdr', p.value = 1.0)
deg_miRNA.CNTvDISnTE <- topTable(fit_miRNA.eb, coef = "CNTvDISnTE", number = nrow(array.dat), adjust.method = 'fdr', p.value = 1.0)
```

Looking at the output of these analyses (e.g. deg_miRNA.CNTvDISnTE), nothing passes the cutoff. With a p-value cutoff of .05 I simply get an empty data frame of 0 rows and 0 columns.

Again, I am very new to this, so if there is anything you see here that is off I'd be very grateful if you could point it out. If I have done everything correctly, it would also be wonderful to have that confirmed!

I have also looked at including other data on the samples as covariates in the analysis (batch, sex, RIN, age, etc.), but while I have done this for the variables that can be treated as factors (batch, for instance), I am unsure how to do this for continuous numeric data such as age. Also, this post seemed awfully long as is. If you have any recommendations as to how I can modify my process to include numeric covariates like age, I'd also be pretty thrilled to hear them!
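For what it's worth, a continuous covariate like age can simply be entered as a numeric column in the model formula alongside the group factor; no dummy coding is needed. A minimal sketch, where the `age` and `batch` vectors are hypothetical placeholders standing in for your own sample annotation (in the same order as the columns of array.dat):

```
# Placeholder covariates, one entry per sample, in column order of array.dat
set.seed(1)
age   <- round(runif(78, 20, 60))                    # hypothetical numeric ages
batch <- factor(rep(c('B1', 'B2'), length.out = 78)) # hypothetical batch labels

# Group means model with batch and age as adjustment terms
design <- model.matrix(~ 0 + labels + batch + age)
colnames(design)[1:3] <- levels(labels)

# Contrasts are still defined only on the group columns;
# batch and age are adjusted for automatically by lmFit.
fit <- lmFit(array.dat, design)
```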

Once again thank you for any input and for reading.

Is this microarray data?

Yes it is. miRNA chip v3.0 in this case.

Could you check whether the sample names in your ExpressionSet are in the order you expected: that all the controls are first, all the DIS next, then all the TE.

You could also use method = "nestedF" or "hierarchical" in decideTests, since you are testing multiple contrasts.
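A sketch of what that might look like with the fit object from the original post (assuming fit_miRNA.eb exists as above):

```
# Classify each probe as up / down / not significant across all contrasts at once
results <- decideTests(fit_miRNA.eb,
                       method = "nestedF",
                       adjust.method = "fdr",
                       p.value = 0.05)
summary(results)  # counts of up/down/non-significant probes per contrast
```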

Indeed, treble check that

`labels`

lines up with the columns in your input data. You should also state how you produced the data contained in the proj2_data_miRNA_GRPD.txt file. In addition, what is the lowest p-value that you get? And what if you just do a simple boxplot of your gene of interest, stratified by your groups - do you see any visual difference in the distributions?

The data was actually produced by other researchers, and normalized by them; I began my participation with this table. I have indeed checked those labels: the numbers are correct, and the order of columns in my data file is the same.

The lowest p-value (unadjusted) for the CNT vs. Disease+trauma contrast was 0.0006342028.

That boxplot idea is interesting. I am really looking not at one specific gene but for any and all DE between the control and disease/trauma groups. The next step would be to conduct a WGCNA analysis on such DEGs...

That's a good thought. I did just that and confirmed that all of my samples are in the correct order within the expression set. Here is a small sample of that table:

Nothing looks peculiar here? My work starts with this table; the normalization and work with the original .cel files was done by other researchers.

In regards to nestedF and hierarchical: given that my data begins with this table, could I still use those methods? I glanced at the user manual and found the following:

> The "hierarchical" method offers power advantages when used with adjust.method="holm" to control the family-wise error rate. However its properties are not yet well understood with adjust="BH". method="nestedF" has a more specialised aim to give greater weight to probes which are significant for two or more contrasts. Most multiple testing methods tend to underestimate the number of such probes. There is some practical experience to suggest that method="nestedF" gives less conservative results when finding probes which respond to several different contrasts at once. However this method should still be viewed as experimental. It provides formal false discovery rate control at the probe level only, not at the contrast level.

That bit about "probe level only, not the contrast level" gave me pause.

I was also looking into trying the arrayWeights function. I found a discussion of the function here:

https://support.bioconductor.org/p/67984/

which makes me think it may increase my sensitivity for detecting DEGs. That then begs the question, though: can I run arrayWeights on this data? Or do I need the raw .cel files? I'm only just looking into it, but from the MIT.edu website's description of the function, its object is

Would my dataset fit this description I wonder?
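For the record, arrayWeights operates on a normalized expression matrix, not on raw .cel files, so in principle it could be applied here. A minimal sketch using the objects from the original post (assuming array.dat and design as above):

```
library(limma)

# Estimate a relative reliability weight for each of the 78 arrays
aw <- arrayWeights(array.dat, design = design)
barplot(aw, xlab = "Array", ylab = "Weight")  # weights well below 1 flag noisy arrays

# Refit the model, down-weighting the less reliable arrays
fit_w <- lmFit(array.dat, design, weights = aw)
```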

What if you generate a boxplot or histogram of the expression data? For microarrays, the medians of all samples in the boxplot should line up fairly well, and the histogram should look roughly normal.
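A quick sketch of those diagnostics on the matrix from the original post (assuming array.dat as above):

```
expr <- as.matrix(array.dat)

# One box per sample: the medians should sit at roughly the same height
boxplot(expr, las = 2, cex.axis = 0.5,
        main = "Per-sample expression", ylab = "log2 intensity")

# Pooled histogram: log2 microarray intensities should look roughly bell-shaped
hist(expr, breaks = 100,
     main = "All log2 intensities", xlab = "log2 intensity")
```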

Here is that boxplot:

https://imgur.com/a/cbDG2Ql

It looked like the values lined up pretty well to me?

The medians line up... but the number of outliers is excessive. That will affect the statistical inferences. I would confirm exactly how the samples were processed.

I should add: a linear fit, as used by limma, is sensitive to outliers. The statistics will be skewed when outliers are present. The line of probes with constant expression at the top end of your data is also unusual. Did your colleagues remove control probes at all? How did they normalise?

I'm going to have to get back to you on that. I'm meeting with the PI tomorrow, but the outliers did look a bit... thick to me too. I just don't have much experience here and wasn't sure if that was normal. It's good to know it's not; I'd love to turn up something in my data hahaha. I will find out the particulars of the normalization tomorrow. I'm also going to get the raw data and do the normalization myself, as this is something I need to learn anyway. In the meantime, is there any sort of transformation I can do on the data to reduce the effect of these outliers? Or is it really a matter of working with the raw data?

Well, it's just that I've analysed dozens of microarrays of different types over the years and I've never seen this excessive amount of outliers. The methods for processing these are quite standard, though, so it's actually difficult for the analyst to do the normalisation incorrectly... RMA normalisation is typically used.

It could very well be a wet-lab issue, but as the analyst, good luck convincing the wet lab folk that they did something wrong. You could ask indirect questions like:

Other diagnostic plots like MA plots (as suggested by russhh), volcano plots, histograms and boxplots, can help.

Did you check the histogram? I don't know, but are you sure that the data has been log-transformed? Some of the publicly available datasets are released unlogged, while the statistical inferences ought to be performed on the logged data.
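A common quick check for this (a heuristic, not part of the thread's original code): log2 microarray data typically spans roughly 0-16, while unlogged intensities run into the tens of thousands. Something like:

```
expr <- as.matrix(array.dat)
rng <- range(expr, na.rm = TRUE)
print(rng)

# Heuristic: if the maximum is far above ~20, the data is probably unlogged
if (rng[2] > 50) {
  expr <- log2(expr + 1)  # small offset guards against log2(0)
}
```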

I actually need to run the histogram; I'm looking into the commands to do that atm (still learning as I go).

I will definitely ask about the sample prep and show the PI this boxplot. I don't suppose it could be something I've done wrong in my commands for boxplot generation?

As an aside, whatever problems there may be in this table, I have a second dataset of mRNA that produces a pretty clean-looking boxplot diagram (only a few outliers per sample), and I have yet to detect a DEG that passes the statistical significance threshold there, so I will be looking into the other suggestions by russhh and yourself (MA plots, volcano plots, etc.) for that data at least.

Here is my histogram

https://imgur.com/a/AQJi69q

generated by the following commands:

Ah! Lightbulb!! There is something not quite right there... you may just need to confirm with the PI how this data was processed.

Excellent! Thank you for all your feedback on this. I can't tell you how helpful you and russhh have been. I truly appreciate it.

Wow. There's a lot of data points in this (afraid I've never used that platform). Do you have multiple assays for each miRNA, and if so, shouldn't they be combined in some way to get a summary score for each miRNA before running through limma?

What's the dimension of your ExpressionSet?

I was operating under the assumption that the data within this table was one assay per miRNA. Looking at its dimensions, it has 2578 rows (a number not too far off the number of reported mature miRNAs for humans) and 78 columns corresponding to the 78 samples. I didn't think there was anything left to combine in this, and I'm pretty sure that would have been mentioned, but I will have to ask tomorrow.

As an aside, I have another dataset for mRNA whose boxplots do NOT show this excessive number of outliers (there are just a handful per sample), and I have yet to find DE that passes statistical significance there. So even though something may be off in this table, my other questions about increasing sensitivity still stand for that dataset, and I will be looking into things like the volcano plots, MA plots, etc. for that table at least.

Oh, that's strange, the boxplot looked more dense than I'd have expected for 2.5k features (I retract my "there's a lot of data points", and in particular my "Wow")

There are various other things you could have a look at:

- First, do an MA plot for each of your contrasts to see if there's any bias relative to expression intensity;
- look at the topTable across all contrasts to identify whether there are any miRNAs with evidence of variable expression between any of the groups;
- if you can account for array batches in your design matrix, you should definitely do it (although I doubt it will have a huge effect here);
- if there are any miRNAs with a known association with your disease state in the literature, pull them out and see if a boxplot over the log-intensities looks unimodal within each group.
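A sketch of the first two of those suggestions, using the fit object from earlier in the thread (assuming fit_miRNA.eb as above):

```
library(limma)

# Mean-difference (MA) plot for each contrast in turn
for (cf in colnames(fit_miRNA.eb$coefficients)) {
  plotMD(fit_miRNA.eb, coef = cf, main = cf)
  abline(h = 0, col = "red")
}

# Omitting coef gives the moderated F-test across all contrasts at once:
# evidence of variable expression between ANY of the groups
top_any <- topTable(fit_miRNA.eb, number = Inf, adjust.method = "fdr")
head(top_any)
```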

Thank you russhh, I'm looking into this now. I take it you are suggesting the second application of the function plotMD:

> limma generalized the concept of an MA-plot in two ways. First, the idea was extended to apply to sets of single-channel expression values. In this case, the plot is used to compare each sample to the average of all other samples. A virtual array is constructed by averaging the log-expression values for all the samples other than the sample of interest, and then a mean-difference plot is made between the single array and the virtual array. Second, the idea was extended to apply to the fitted model objects. In this case, the plot compares the log-fold-changes for a chosen contrast versus the average log-expression values of each gene across all samples. In effect, this plots a coefficient of the linear model versus an overall mean intercept parameter. - (Ritchie et al. 2015)

Do you have an example of the commands to create the MA plot from contrasts? I looked into it and have found many examples of the first application, looking at one sample vs. all others, but I take it from your suggestion I should look at one group vs. all samples? If so, I have struggled to find an example of the function plotMD() being used in this context.

I'm afraid I don't remember. Don't you just do

`plotMA(fit_miRNA.eb, coef = "DIS_vs_CNT")`

? This should give an MA plot for one group (the Disorder group) relative to another group (the Control group), because you have defined the contrast-coefficient to represent the (estimated) difference between those two groups.

[Aside] As it stands, your second contrast will confuse a few people: if the contrast is called "TE_vs_CNT", they will naturally assume the contrast was

`TE - CNT`

rather than

`CNT - TE`

That command worked. Thank you! I will whip up the MA plots for my contrasts and have a look. I'll correct that TE - CNT issue as well. Good eye.