Too many differentially expressed genes in voom
0
0
Entering edit mode
3.9 years ago
rzm0015 • 0

I am attempting to identify DE genes between different tissue types in approx. 17,000 RNA-seq samples using the voom function in Limma. I am starting with raw counts and I have the total number of reads per each library. The majority of these samples came from the GTEx database. Here is my process:

  1. Generate an MDS plot. As you can see there is a bit of variation, dim 1 is the x axis and dim 2 is the y. Samples in red are the ones I picked for further analysis as they contained the minimal amount of variation while including critical samples I needed in the analysis:

    dim 1-2 mds plot: https://ibb.co/0Jvv79R

    samples I selected: https://ibb.co/xqFzK2b

  2. Take the samples in red, and run them through voom:

    library(edgeR)
    counts <- readRDS("counts_matrix.txt, header=TRUE")
    library_sizes <- read.table("voom_sample_info.txt")$V2
    sample_names <- read.table("voom_sample_info.txt")$V1
    sample_groups <- interaction(as.list(sample_names))
    mm <- model.matrix(~0 + sample_groups)
    d0 <- DGEList(counts)
    d0 <- calcNormFactors(d0)
    cutoff <- 12
    drop <- which(apply(cpm(f0), 1, max) < cutoff)
    d <- f0[-drop,]
    y <- voom(d, mm, lib.size = library_sizes, plot=T, save.plot = T)
    

    This generates the following plot: https://ibb.co/PgDtZyg

  3. Clearly, I need to remove some of the low-count genes on the left-hand side of the plot. I used iplots (http://iplots.org/) to generate a scatter plot in R and manually remove the low-count genes. This leaves me with ~20,000 genes. I reran the above code and generated the following plot:

    https://ibb.co/Qv9vPY7

  4. That looks much more like what I need, so I continue on with identifying DE genes. Using the comparison of two groups "Artery_Coronary" and "Artery_Tibial" as an example:

    fit <- lmFit(y, mm)
    contr <- makeContrasts(groupArtery_Coronary - groupIArtery_Tibial, levels = colnames(coef(fit)))
    tmp <- contrasts.fit(fit, contr)
    tmp <- eBayes(tmp)
    top.table <- topTable(tmp, sort.by = "P", n = Inf)
    #print number of DE genes with p <0.05
    length(which(top.table$adj.P.Val < 0.05))
    17041
    

Here's where I run into trouble. Out of 20,000 genes I do not believe that 17,041 are differentially expressed between two different types of artery. I repeat this with other similar tissues such as vagina vs uterus (16,026), brain amgydala vsbrain cortex (13864), subcutaneous fat vs visceral fat (16919), and I get similar results. I've checked around this site, and as far as I can tell I am doing everything correctly. I even found a question that had the same problem, however it looks like my pipeline is already the same as what was suggested in the answers.

Does anybody have any idea what the issue could be? I have seen publications that have used the GTEx dataset in the past, so I know it can be done...

I'm pretty new at all this, so any advice or criticism or anything is appreciated. Apologies I could not get the images to embed properly

Thanks

RNA-Seq R limma voom • 2.2k views
ADD COMMENT
2
Entering edit mode

I can't see the figures (maybe it's just on my end). A few thoughts:

  1. No covariates? Batch effect etc.
  2. It's a known problem in statistics, when you have a big n you will have a significant p-value even for very low effects, maybe consider taking the effect into account as well
  3. Manually removing low-count genes? Why manually?
ADD REPLY
0
Entering edit mode

Urgh, the image embedding I cannot get to work. I added them I as hyper links. I removed the samples manually because even when I put in a ridiculously high cut-off value (300):

cutoff <- 12
drop <- which(apply(cpm(f0), 1, max) < cutoff)
d <- f0[-drop,]

I still got the cluster of decreased variance genes at the lower end of the counts spectrum. Here is my voom plot with a cutoff of 300: https://ibb.co/6435QDf

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode

Why do you not believe that that many genes are differentially expressed? You have a very large number of samples so you can almost certainly detect very small changes very accurately. The categorisation of genes into significant and non-signifcant is an artificial human one, not based in reality. In reality expression, if we could measure expression with perfect precision, there would always some difference between two tissues for every gene, even if its only 0.0000001 TPM difference. The precision with which we can measure levels is connected to the number of samples, and you now have sufficient samples that you can detect that those differences arn't zero for most genes.

So now you have to think about the biology. What am I going to do with these genes when I find them? Is there a minimum amount of change I would find biologically (not statistically) relevant? If you can set an lfc threshold you consider biologically relevant, then you can test against having a change at least that big using treat and topTreat in limma.

Secondly it is bad practice to preselect samples for low variance. It screws with the distributional assumptions of the tests. If you want to preselect samples, if should be on some basis that is orthogonal to signficance (which variance certainly isn't).

ADD REPLY
0
Entering edit mode

Thanks for the reply, lot's of useful info in it.

ADD REPLY
0
Entering edit mode

To expand on the issue with preselection of samples, it may be of use to explore limma's functionality of defining weights for samples with the voomWithQualityWeights (see this answer and the links there). This can be used to tackle heterogeneity but without losing all of the information contained in the samples because you're not removing them.

ADD REPLY
0
Entering edit mode

In addition to the other comments, I believe limma assumes a certain "proportion" of genes expected to be differentially expressed in the experiment (and this influences the multiple testing adjustment). In the eBayes function I believe this is the proportion argument (1% by default). This is discussed at the end of page 61 in the user guide. Though personally I have never encountered changes of the magnitude described here and have never fiddled with that parameter, it may be worth investigating.

ADD REPLY

Login before adding your answer.

Traffic: 2687 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6