Question

When exactly is a gene "expressed"?

5.1 years ago

exin ▴ 60

I'm interested in comparing the genes that are expressed in several cell types in order to infer the functions of these cell types.

I use DESeq to find genes that are differentially expressed (DE). However, a gene doesn't have to be DE to be involved in the function of a cell. So ideally I'd like to come up with a reasonable method to find genes that are "expressed".

Some previous members of my lab came up with the Quartile Expression method: a gene is considered expressed if its transcript count (normalised but not transformed) is in the upper quartile of all genes across all samples. A major issue: some genes simply have much higher transcript counts, so they're always called expressed (across all samples), and this skews the quartile values.
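To make the issue concrete, here is a minimal sketch of the upper-quartile rule as described above, in plain Python. The gene names, the counts, and the helper `upper_quartile_calls` are all made up for illustration; this is my reading of the method, not the lab's actual code.

```python
from statistics import quantiles

def upper_quartile_calls(counts):
    """Flag each (gene, sample) value that exceeds the pooled upper quartile.

    counts: dict mapping gene name -> list of normalised counts, one per sample.
    Returns (q3, calls), where calls maps gene name -> list of booleans.
    """
    # Pool every value from every gene and every sample into one list.
    pooled = [v for values in counts.values() for v in values]
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    q3 = quantiles(pooled, n=4, method="inclusive")[2]
    calls = {gene: [v > q3 for v in values] for gene, values in counts.items()}
    return q3, calls

# Invented normalised counts for four hypothetical genes in three samples.
toy_counts = {
    "geneA": [5000, 5200, 4800],  # very high in every sample
    "geneB": [40, 900, 35],       # clearly induced in sample 2
    "geneC": [0, 1, 0],
    "geneD": [12, 15, 10],
}

q3, calls = upper_quartile_calls(toy_counts)
for gene, flags in calls.items():
    print(gene, flags)
```

In this toy data geneA alone fills the top quartile, so geneB is never called expressed, even in the sample where its count jumps to 900.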

Can anyone point me to a reference where other methods have been attempted? Any thoughts?

gene expression RNA-Seq • 1.4k views

ADD COMMENT • link updated 5.1 years ago by Charles Warden 8.2k • written 5.1 years ago by exin ▴ 60

Is this a question about actual bioinformatics methods, or is this a philosophical question about what it means to “express” a gene?

ADD REPLY • link 5.1 years ago by Joe 21k

Probably stats/bioinformatics, especially the part about dealing with genes that have exceptionally large transcript counts across all samples.

ADD REPLY • link 5.1 years ago by exin ▴ 60

score 4 · Answer 1 · 2019-03-21

If you define an FPKM value of 0.1 as a rough value for a gene expressed above background, that would probably indicate 60-70% of genes are expressed in your sample. I think this seems reasonable (as a rough guideline).

If you want to use a higher FPKM / expression value (such as FPKM > 1) to choose genes that are easier to validate (or otherwise reduce your gene list), that would be a valid point. However, that doesn't mean genes with lower expression (or even expression of FPKM = 0.95) aren't expressed. So, I think that is a slightly different question.

That said, I think the upper quartile is probably too stringent a cutoff for calling a gene expressed (I think it would correspond to an FPKM value greater than 1).
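As a rough illustration of how the choice of cutoff changes the answer, the sketch below uses the standard FPKM definition (reads per kilobase of transcript per million mapped reads) and counts the fraction of genes above a threshold. The function names and the example values are hypothetical, and the resulting fractions are not meant to reproduce the 60-70% figure above.

```python
def fpkm(read_count, length_bp, total_mapped_reads):
    """FPKM = reads * 1e9 / (feature length in bp * total mapped reads)."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)

def fraction_expressed(fpkm_values, cutoff):
    """Fraction of genes whose FPKM is strictly above the cutoff."""
    return sum(v > cutoff for v in fpkm_values) / len(fpkm_values)

# One worked value: 10 reads on a 2 kb gene in a 20 million read library.
example_fpkm = fpkm(10, 2000, 20_000_000)

# Invented FPKM values for five hypothetical genes.
toy_fpkms = [0.0, 0.25, 0.95, 3.0, 50.0]

frac_loose = fraction_expressed(toy_fpkms, 0.1)   # background-level cutoff
frac_strict = fraction_expressed(toy_fpkms, 1.0)  # easier-to-validate cutoff
print(example_fpkm, frac_loose, frac_strict)
```

The gene at FPKM 0.95 passes the loose cutoff but not the strict one, which is the distinction between "expressed" and "easy to validate" made above.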

score 3 · Answer 2 · 2019-03-21

A gene is expressed if its DNA is transcribed into RNA. So, by definition, if you detect RNA for a gene, that gene is expressed. What expression level should be considered relevant for the biological question at hand may not be easy to define, which is why people go for DE genes: that is at least statistically tractable, if not always biologically relevant.

score 3 · Answer 3 · 2019-03-21

There is no right answer for this question. The combination of low sensitivity (sequencing is not exhaustive) and transcriptional noise means one cannot answer the question. Any solution to this problem will be an arbitrary cutoff.

Practical considerations:

1. If a genomic feature has too few reads, our statistical methods cannot, even in theory, find any significant change, so such features are not worth testing.
2. A major reason why we typically filter out lowly expressed features is that testing too many features makes the FDR correction very harsh, so we potentially miss relevant targets.
3. Personally, I think quantile-based approaches are kind of strange and prefer absolute cutoffs instead. I think they are easier to interpret and also fit better with point (1).
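The FDR argument can be seen in a small sketch. The code below is a plain-Python Benjamini-Hochberg adjustment applied to invented p-values; it assumes, purely for illustration, that low-count features produce uninformative p-values of 0.5. It is not what DESeq2 or edgeR run internally.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values: three well-covered genes with real signal ...
informative = [0.001, 0.004, 0.012]
# ... plus 97 hypothetical low-count features that cannot reach significance.
low_count = [0.5] * 97

adj_filtered = bh_adjust(informative)
adj_unfiltered = bh_adjust(informative + low_count)[:3]
print(adj_filtered)
print(adj_unfiltered)
```

With the low-count features removed, all three genes pass a 5% FDR; with them included, none does, although the raw p-values are identical.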

Tool solutions:

You can pre-filter using some arbitrary cutoffs/functions (e.g. edgeR's filterByExpr() function).
After testing all your features, you can weight the p-values by expression with tools such as IHW (note that a version of this approach, independent filtering, is actually built into DESeq2).
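For readers who want to see what an absolute-cutoff pre-filter looks like, here is a minimal sketch: keep a gene if its counts per million (CPM) reach a threshold in at least a minimum number of samples. The function `filter_by_cpm`, its defaults, and the data are hypothetical; this follows the general idea of expression pre-filtering and is not a reimplementation of filterByExpr().

```python
def filter_by_cpm(counts, lib_sizes, min_cpm=1.0, min_samples=2):
    """Return the genes with CPM >= min_cpm in at least min_samples samples.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    lib_sizes: list of total mapped reads per sample, in the same order.
    """
    kept = []
    for gene, values in counts.items():
        cpms = [c * 1e6 / n for c, n in zip(values, lib_sizes)]
        if sum(v >= min_cpm for v in cpms) >= min_samples:
            kept.append(gene)
    return kept

# Invented library sizes and raw counts for three hypothetical genes.
toy_lib_sizes = [20_000_000, 25_000_000, 18_000_000, 22_000_000]
toy_raw = {
    "geneA": [400, 500, 360, 440],  # about 20 CPM everywhere
    "geneB": [2, 0, 1, 3],          # well below 1 CPM everywhere
    "geneC": [40, 5, 36, 4],        # 2 CPM in two samples only
}

kept_genes = filter_by_cpm(toy_raw, toy_lib_sizes)
print(kept_genes)
```

Requiring the cutoff in only some of the samples keeps genes like geneC that are on in one condition and off in another, which is exactly what a later DE test is meant to find.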

Lastly, since you clearly want to compare two conditions, you are probably well off doing the DE analysis.