Could someone help me understand why my MA plot from DESeq2 has one line of points close to zero and a second line at about 1.2?
I checked the genes in my count table that correspond to the points on these lines, and they have very low read counts in both samples. So I would really like to know what is going on mathematically. Could someone explain exactly how DESeq2 does the normalization and obtains the log fold change? Thank you all for your help.
Hi Kevin,
Thank you for the reply. I think I kinda figured it out. Basically, it is described here: https://support.bioconductor.org/p/62927/
For every gene (row), DESeq2 adds 1 to avoid 0 counts. Then it normalizes all samples based on the total count of each sample, which gives a size factor for each sample (sizeFactors()). So, like you said, if two samples contain 0 counts for many genes, I will get the same log2(1*sizeFactor) (y-axis) for all of these genes, while the base mean differs depending on all the samples in the row. This gives me a horizontal line.
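To make the arithmetic concrete, here is a minimal sketch of the simplified picture described above: add a pseudocount of 1, divide by a per-sample size factor, and take the log2 ratio. The size factor values and the helper `log2_fold_change` are hypothetical and only for illustration; in DESeq2 itself the size factors come from `estimateSizeFactors()` (median-of-ratios by default), and the reported log fold change comes from the fitted model rather than from this direct ratio.

```python
import math

def log2_fold_change(count_a, count_b, size_factor_a, size_factor_b, pseudocount=1.0):
    """Log2 ratio of pseudocount-adjusted, size-factor-normalized counts (sample B over A)."""
    norm_a = (count_a + pseudocount) / size_factor_a
    norm_b = (count_b + pseudocount) / size_factor_b
    return math.log2(norm_b / norm_a)

# Hypothetical size factors for two samples (assumed values, not from any real dataset).
size_factor_a, size_factor_b = 1.15, 0.87

# Low-count genes can only take a handful of (count_a, count_b) combinations,
# so their log fold changes fall on a few discrete values, i.e. horizontal lines.
low_count_pairs = [(0, 0), (1, 1), (0, 1), (1, 0), (0, 2)]
lfc_by_pair = {
    pair: round(log2_fold_change(pair[0], pair[1], size_factor_a, size_factor_b), 3)
    for pair in low_count_pairs
}

for pair, lfc in lfc_by_pair.items():
    print(pair, lfc)
```

Every gene with (0, 0) lands on the same y value, log2(size_factor_a / size_factor_b), and every gene with (0, 1) lands exactly one log2 unit above it. That is how a few tight horizontal bands of points can appear for genes with almost no reads.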
Yes, that's true. It is also true that, for DESeq2's MA plot, the x-axis is the log of the mean expression + 1. What you could do prior to normalisation is remove all genes that have a mean raw count (across all samples) <10 (a bit on the stringent side), or those that are 0 in a large proportion of samples (e.g. 0 in >50%). There's no real standard cut-off.
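A rough sketch of that kind of pre-filtering, written in plain Python rather than R just to show the logic. The gene names, counts, and the helpers `passes_mean_filter` and `passes_zero_filter` are made up for illustration, and the thresholds are the example values from this post, not recommended defaults.

```python
from statistics import mean

def passes_mean_filter(counts, min_mean=10):
    """Keep a gene if its mean raw count across all samples is at least min_mean."""
    return mean(counts) >= min_mean

def passes_zero_filter(counts, max_zero_fraction=0.5):
    """Keep a gene unless it is 0 in more than max_zero_fraction of the samples."""
    zero_fraction = sum(1 for c in counts if c == 0) / len(counts)
    return zero_fraction <= max_zero_fraction

# Hypothetical raw count table: gene -> counts across four samples.
raw_counts = {
    "geneA": [0, 0, 1, 0],       # mean 0.25, fails the mean filter
    "geneB": [25, 40, 31, 18],   # passes both filters
    "geneC": [0, 0, 0, 120],     # mean 30, but 0 in 75% of samples
    "geneD": [12, 9, 15, 11],    # mean 11.75, passes both filters
}

# Both filters are applied here; either one could also be used on its own.
filtered = {
    gene: counts
    for gene, counts in raw_counts.items()
    if passes_mean_filter(counts) and passes_zero_filter(counts)
}

print(sorted(filtered))
```

Applying both filters is the stricter choice. geneC shows why the two criteria differ: a single high sample lifts its mean above the cut-off even though it is 0 in most samples.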