Question

Differential expression of all genes vs only annotated genes

0

5.1 years ago

transcripto • 0

Dear all,

I have a data set from a non-model organism in which most genes have no annotation (roughly 29,000 genes are annotated out of 121,133 total). I would like to carry out differential expression analysis, and I am only interested in the differential expression of genes with annotations.

At what point is it appropriate to filter out un-annotated genes? Can I do it prior to carrying out differential expression? Or should I include them in the DE analysis and filter them out just prior to the adjustment of raw p-values?

Thanks!

deseq2 rnaseq RNA-Seq • 1.5k views

ADD COMMENT • link updated 5.1 years ago by h.mon 35k • written 5.1 years ago by transcripto • 0

score 1 · Answer 1 · 2019-04-04

1

5.1 years ago

h.mon 35k

It sounds like you have a de novo transcriptome assembly from short Illumina reads. Such assemblies typically end up with many thousands more transcripts than the number of genes in an equivalent well-annotated genome.

If that is the case, I consider ExN50 a good metric for filtering genes. It removes only lowly expressed genes; you should then proceed to differential expression analysis without filtering out non-annotated genes. If you want to focus on annotated genes, do that after the DE analysis and the multiple-testing p-value correction, but you may miss significant changes in well-supported but non-annotated genes.
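To make the order of operations concrete, here is a minimal base R sketch on simulated p-values. Everything in it (the gene names, the `is_annotated` flag, the mix of null and non-null p-values) is made up for illustration and is not from the poster's data. It only shows that Benjamini-Hochberg values adjusted across all tested genes and then subset generally differ from values adjusted within the annotated subset alone; it does not say which order is preferable.

```r
set.seed(42)

# Simulated raw p-values for 1,000 hypothetical tested genes:
# 900 drawn from the null (uniform) and 100 skewed towards small values.
p_raw <- c(runif(900), rbeta(100, 0.5, 20))
names(p_raw) <- paste0("gene", seq_along(p_raw))

# Hypothetical annotation flag: roughly a quarter of the genes are annotated.
is_annotated <- rbinom(length(p_raw), 1, 0.25) == 1

# Order A: adjust across every tested gene, then keep the annotated ones.
padj_all <- p.adjust(p_raw, method = "BH")
padj_then_subset <- padj_all[is_annotated]

# Order B: keep the annotated genes first, then adjust within that subset.
padj_subset_first <- p.adjust(p_raw[is_annotated], method = "BH")

# Side-by-side comparison for the annotated genes, smallest raw p-values first.
comparison <- data.frame(
  raw                = p_raw[is_annotated],
  adjust_then_subset = padj_then_subset,
  subset_then_adjust = padj_subset_first
)
head(comparison[order(comparison$raw), ])
```

Order A corresponds to the advice above: the correction accounts for every gene that was actually tested, and the annotation filter is applied only to the finished results table.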

0

You're right, it is a de novo transcriptome from Illumina reads. In my DE pipeline, I normally filter out many of the lowly expressed transcripts using filterByExpr() from the edgeR package. I thought that filtering for annotated genes before the DE analysis, or before adjusting p-values, might give more power to detect differences in the annotated genes, which are the ones I can make functional inferences from. I'm not particularly interested in discovering novel genes at present. If the annotation filtering would unduly bias my results, I'll follow your advice. Am I correct in interpreting your advice to be:

1. Filter lowly expressed genes.
2. Run DE analysis like normal, including adjustment of p-values.
3. Filter to retain annotated genes only (e.g., annotated_results <- results[annotations,]).

Thanks!

ADD REPLY • link 5.1 years ago by transcripto • 0

1

I think annotation filtering would have unpredictable effects on your analysis, so it is best to avoid it.

You are correct in your interpretation. That is my second piece of advice; the first is not to filter out non-annotated genes at all.
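For reference, the three-step workflow confirmed here might look roughly like the following in edgeR. This is a sketch under several assumptions, not the poster's actual pipeline: the counts are simulated, the two-group design is invented, `annotated_ids` is a hypothetical vector of annotated transcript IDs, and the quasi-likelihood route (glmQLFit/glmQLFTest) is just one common edgeR option. It subsets by ID with `%in%` rather than by position, so the annotation list does not have to be in the same order as the results table.

```r
library(edgeR)

set.seed(1)

# Simulated counts: 1,000 hypothetical transcripts x 6 samples (2 groups of 3).
n_genes <- 1000
counts <- matrix(
  rnbinom(n_genes * 6, mu = 50, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("tx", seq_len(n_genes)), paste0("s", 1:6))
)
group <- factor(rep(c("ctrl", "trt"), each = 3))

# Hypothetical annotation: IDs of the transcripts that have an annotation.
annotated_ids <- sample(rownames(counts), 250)

y <- DGEList(counts = counts, group = group)

# Step 1: filter lowly expressed transcripts (annotation plays no role here).
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# Step 2: run the DE analysis as normal, including p-value adjustment.
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)
results <- topTags(qlf, n = Inf, adjust.method = "BH")$table

# Step 3: only now restrict the finished table to annotated transcripts.
# The FDR column is left as computed over all tested transcripts.
annotated_results <- results[rownames(results) %in% annotated_ids, ]
```

With real data, `counts`, `group`, and `annotated_ids` would come from the quantification output, the sample sheet, and the annotation table respectively.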