What problem does the MA plot diagnose? And how do you solve it?
1
0
Entering edit mode
5.4 years ago
hbw ▴ 80

I have an RNASeq experiment, and I am using DESeq2. After I get the results, I plot the MA plot. This is the output of plotMA:

And this is my attempt at the MA plot:

res$significant = (res$padj < .05)
res$significant = as.factor(res$significant)
res$significant[is.na(res$significant)] = F
ggplot(as.data.table(res), aes(x=log2(baseMean), y=log2FoldChange, color=significant)) +
geom_point() +
geom_hline(color = "blue3", yintercept = 0) +
stat_smooth(se = FALSE, method = "loess", color = "red3") +
scale_color_manual(values=c("Black","Red"))


1. There is a slight bias at the end, so genes with a high A, tend to have a high M, and we are detecting more up-regulation than down. Is this a problem? What might be causing this, and more importantly, is there something we can do to fix it?
2. Even if the slight effect is too little to be a problem, what causes problems like this? Imbalanced sampling depth at the two conditions? Why doesn't normalization (sample size factors) fix this?

Also, is there a reason why DESeq2::plotMA doesn't plot the best fit line?

Disclaimer: Cross posted to Biconductor questions.

RNA-Seq rna-seq • 3.7k views
1
Entering edit mode
5.4 years ago

There is a slight bias at the end, so genes with a high A, tend to have a high M, and we are detecting more up-regulation than down. Is this a problem?

I don't think that you should expect the number of up- and down regulated genes to be about the same. For instance, I had once a dataset where a mutation down-regulated a significant subset of genes, while only a few genes were up-regulated. We found a molecular mechanism explaining down-regulation of the genes and we think that the few upregulations are likely indirect effects ... so the real question is : "does your results make sense biologically ?"

However, there could be a problem when a lot of the genes are up/down-regulated because DESeq2 assumes that most of the genes are unaffected. EDIT : see Michael Love comment below for a more correct and detailed explanation.

3
Entering edit mode

The median ratio or TMM assumption for normalization is not that the majority of genes have LFC=0, but that the median or trimmed mean of ratios captures the technical shifts in counts. So it's more about the reliability of the center of the distribution of LFCs to capture technical shifts rather than requiring so many genes strictly to belong to the null. If the entire distribution is shifted, e.g. global increase in expression, then obviously computational normalization cannot be relied upon. But you could have, e.g. 40% upregulated and 20% downregulated and still have the median or trimmed mean roughly capturing the center of the distribution (the technical shift in counts due to sequencing depth).

0
Entering edit mode

Thanks for this precision ! (I need to read your DESeq2 paper again)