What kind of distribution do you expect to see for RNA-Seq expression levels?
3
2
9.2 years ago

Maybe this question is too broad/vague but I want to ask it agnostically:

If I plot a histogram of RPKM-normalized expression for one gene across a cohort (let's say a TCGA cohort), what shape would you expect to see? A normal-looking bell curve? Skewed to the left or right? Bimodal?

RNA-Seq • 6.0k views
0

I was trying to figure this out myself and just saw this old thread. In case it is still relevant to anybody: different tools assume different distributions, either normal (limma, which is applied to log-transformed values) or negative binomial on the counts (EBSeq and DESeq2). You can find a little bit of explanation here: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbx122/4524048

5
9.2 years ago

I honestly wouldn't expect to see any particular distribution. Many genes will have a normal distribution, due to not being differentially expressed in any of the samples. Others will have a skew, typically due to a floor effect. Others will have a bimodal or multimodal distribution, due to being up/down-regulated in some cancer types.

0

Hi Devon,

Could you elaborate on why you'd expect to see a normal distribution for many genes, given that RNA-seq count data is generally over-dispersed? I am currently analyzing a large RNA-seq dataset with hundreds of individuals, and have not seen an example of a gene with normally distributed RPKMs. Also -- isn't the skew you're describing due to the mean-variance relationship, i.e. greater variance at greater expression values?

Thanks, Allie

0

Could you elaborate in more detail on the cases you mentioned?

I have been under the impression that a gene that is not differentially expressed would follow a negative binomial distribution, due to biological and technical variation. In a pool of cancer samples, over- or under-expressed genes would follow a negative binomial distribution as well, perhaps with even larger variance. In a mixed pool of cancer and normal samples, a differentially expressed gene may follow a bimodal distribution. Do I understand this correctly?
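For anyone who wants to check the over-dispersion point numerically, here is a minimal sketch using only the Python standard library. It assumes the usual gamma-Poisson construction of the negative binomial, where the variance is mean + dispersion * mean^2. The function names, the mean of 50, and the dispersion of 0.2 are made up for illustration and do not come from any dataset in this thread.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method.

    Adequate for the modest rates used here; not meant for very large lam.
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture: the Poisson rate is itself gamma distributed.

    shape = 1 / dispersion and scale = mean * dispersion give a rate with
    expectation `mean`, so the counts have variance mean + dispersion * mean**2.
    """
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(rate, rng)


def simulate_gene(n_samples, mean, dispersion, seed=0):
    """Simulate counts for one (hypothetical) gene across n_samples samples."""
    rng = random.Random(seed)
    return [sample_negative_binomial(mean, dispersion, rng) for _ in range(n_samples)]


counts = simulate_gene(5000, mean=50.0, dispersion=0.2)
theoretical_variance = 50.0 + 0.2 * 50.0 ** 2  # mean + dispersion * mean^2
print("sample mean:", statistics.mean(counts))
print("sample variance:", statistics.variance(counts))
print("theoretical variance:", theoretical_variance)
```

The sample variance should come out far above the mean, which is what over-dispersed means relative to a Poisson. Concatenating two such simulations with clearly different means (for example, tumor and normal) gives a mixture, which is one way a bimodal histogram can arise.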

1
9.2 years ago
Bert Overduin ★ 3.7k

I would expect to see a normal distribution, although bimodal distributions have been observed: Bessarabova et al. Bimodal gene expression patterns in breast cancer. BMC Genomics 2010, 11(Suppl 1):S8.

0
9.2 years ago

Overall, I would expect to see a mostly normal sample distribution if you worked with log2 (RPKM + 0.1) values, except for a peak at the rounding cutoff (which you could fix by removing the genes that almost never varied from that rounding cutoff across the samples, if you wanted).
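A rough sketch of the transform and the filtering step described above, in plain Python. The gene names, the `max_floor_fraction` threshold of 0.9, and the choice to treat an RPKM of exactly 0 as "at the cutoff" are all assumptions for illustration; adjust them to your own data.

```python
import math


def log2_rpkm(rpkm_values, pseudocount=0.1):
    """Return log2(RPKM + pseudocount) for each value."""
    return [math.log2(v + pseudocount) for v in rpkm_values]


def drop_floor_genes(expression, pseudocount=0.1, floor_rpkm=0.0, max_floor_fraction=0.9):
    """Log-transform genes, dropping those that sit at the floor in most samples.

    `expression` maps a gene name to its list of RPKM values across samples.
    A gene is dropped when more than `max_floor_fraction` of its values are
    at or below `floor_rpkm` (an assumed, adjustable definition of the cutoff).
    """
    kept = {}
    for gene, values in expression.items():
        at_floor = sum(1 for v in values if v <= floor_rpkm)
        if at_floor / len(values) <= max_floor_fraction:
            kept[gene] = log2_rpkm(values, pseudocount)
    return kept


# Hypothetical toy data: GENE_A is never detected, GENE_B is expressed in most samples.
toy_expression = {
    "GENE_A": [0.0] * 10,
    "GENE_B": [0.0, 1.5, 3.0, 2.2, 4.1, 0.8, 5.0, 2.9, 3.3, 1.1],
}
filtered = drop_floor_genes(toy_expression)
print(sorted(filtered))
print(filtered["GENE_B"][:3])
```

Every undetected value maps to the same number, log2(pseudocount), which is where the extra peak in the histogram comes from; genes like GENE_A contribute nothing but that peak.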

For a gene-centric distribution, I agree with the other comments: it will vary between genes, and I wouldn't be surprised if it varied depending upon the context of the experiment (for example, depending upon the heterogeneity of the samples).

Maybe it is a bit of a tangent, but I've played around a bit with modeling bimodal gene expression, and I've described my experiences here:

http://cdwscience.blogspot.com/2011/05/modeling-bimodal-gene-expression.html

That blog post was influenced by the work I did for this project:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0077769

0

log2 (RPKM + 0.1) or log2 (RPKM + 1.0)?

0

I think that is a matter of personal preference. I think 1.0 is a bit conservative, as it may be throwing out ~1/2 of your genes. For example, see Figure 1 in this paper.
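One way to get a feel for the two pseudocounts is to look at how much of a real difference between low-expressed values survives the transform. This is only an illustrative sketch: the function name and the example RPKM values of 0.1 and 1.0 are made up.

```python
import math


def log2_difference(low_rpkm, high_rpkm, pseudocount):
    """Difference between two RPKM values after log2(RPKM + pseudocount)."""
    return math.log2(high_rpkm + pseudocount) - math.log2(low_rpkm + pseudocount)


# A 10-fold difference between two low RPKM values (about 3.32 in true log2 units).
true_log2_fold = math.log2(1.0 / 0.1)
for pseudocount in (0.1, 1.0):
    observed = log2_difference(0.1, 1.0, pseudocount)
    print(f"pseudocount={pseudocount}: observed {observed:.2f}, true {true_log2_fold:.2f}")
```

With the larger pseudocount the same 10-fold change shrinks much more (roughly 0.9 log2 units versus roughly 2.5 with 0.1), so genes in the low range are compressed toward zero and carry little signal.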

