Hello,
I have a question regarding edgeR normalization.
I have an RNA-seq dataset covering three tissues, with both males and females sequenced for each tissue.
After normalization (TMM for read depth and composition, plus FPKM for gene length) I got a similar median expression for both sexes in two tissues, but not in the third one (see below).
Because the third tissue has a lot of genes that are uniquely expressed in one sex, I tried to perform the normalization using only the genes that are expressed in both sexes in all tissues as control genes (a sort of 'housekeeping genes'; see the vertical line above for the threshold between expressed and non-expressed genes).
However, even this way, the difference persists (see below).
Can someone help me understand what is going on?
Thank you, Vincent
New Nature paper: Males and females are physiologically not the same. Joke aside, which tissue is number 3? By the way, an MA-plot would be a better diagnostic plot, and it should not use FPKM, since FPKM is not compatible with edgeR (or any serious) testing framework.
How do the raw counts look? How many genes are 'detected' in each sample, perhaps using a threshold of something like 10 counts? Since tissue 3 has more sex-specific expression patterns, perhaps the overall expression is too different to be normalized the way you expect. The assumption is that the majority of genes do not change expression. Can you confirm that assumption is true for this tissue?
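As a rough illustration of the detection check suggested above, here is a minimal Python sketch that counts, per sample, how many genes reach a count threshold. The sample names and counts are invented, the helper `detected_genes` is hypothetical, and treating "10 counts" as "at least 10" is an assumption; on real data the same tally would be run on the raw count matrix.

```python
def detected_genes(counts_by_sample, min_count=10):
    """Return, for each sample, how many genes have at least `min_count` reads."""
    return {
        sample: sum(1 for count in counts if count >= min_count)
        for sample, counts in counts_by_sample.items()
    }


# Invented toy counts: one list per sample, one position per gene.
toy_counts = {
    "gonad_female": [0, 12, 250, 9, 40, 0],
    "gonad_male": [35, 15, 3, 10, 0, 800],
}

detected = detected_genes(toy_counts)
print(detected)  # {'gonad_female': 3, 'gonad_male': 4}
```

A large gap between the sexes in the number of detected genes would already hint that the two libraries differ in composition, not only in depth.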
Thanks for the replies.
Here are MA-plots for the three tissues (TMM normalization, no FPKM, only autosomal genes included).
The third tissue is gonads, so yes, I do expect (large) differences. Yet, given the literature, I must admit that I was surprised to find this difference in median log2FC between sexes for autosomal genes.
There are between 8,000 and 10,000 genes with more than 10 counts in each tissue.
"The assumption is that the majority of genes do not change expression. Can you confirm that assumption is true for this tissue?" From the MA-plot I would say maybe not.
Gonads are arguably the most sex-specific tissue that exists, so of course this looks different.
This sounds normal. You can prefilter using filterByExpr(). As a crude ballpark estimate, this function usually retains about 15k genes in most cases I've seen. To me, everything looks perfectly fine: the normalization looks decent, and the plots at the bottom indicate large changes, which is expected given the biology.
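To make the composition effect discussed in this thread concrete, here is a toy Python sketch. It is not edgeR's implementation: real TMM also trims on average expression and weights genes by precision, whereas this simplified version only takes an unweighted trimmed mean of the log ratios. All counts are invented, and the helpers `m_values` and `trimmed_mean` are hypothetical names.

```python
import math
from statistics import median


def m_values(counts_a, counts_b):
    """Per-gene log2 ratio (B over A) of library-size-scaled counts.

    Genes with a zero count in either sample are skipped.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    return [
        math.log2((b / total_b) / (a / total_a))
        for a, b in zip(counts_a, counts_b)
        if a > 0 and b > 0
    ]


def trimmed_mean(values, trim=0.3):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Invented toy data: 800 shared genes with identical true expression in both
# sexes, plus 200 male-biased genes that take up half of the male library.
female = [100] * 800 + [2] * 200
male = [50] * 800 + [200] * 200

m = m_values(female, male)
offset = trimmed_mean(m)  # simplified stand-in for a TMM-style scaling factor
corrected = [value - offset for value in m]

print(round(median(m), 2))       # library-size scaling alone: shared genes look about 2-fold lower in males
print(abs(median(corrected)) < 1e-9)  # after the trimmed-mean offset, shared genes are centred again
```

In this toy case only 20% of genes are sex-biased, so the trimmed core consists of unchanged genes and the offset is sensible. If the sex-biased genes were the majority, the core itself would consist of changing genes and the offset would be misleading, which is what the question about the "majority of genes do not change" assumption is probing.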