Dear all,
I have a data set from a non-model organism that has no annotation for many genes (roughly 29,000 genes are annotated out of 121,133 total). I would like to carry out differential expression analysis, and I am only interested in the differential expression of genes with annotations.
At what point is it appropriate to filter out un-annotated genes? Can I do it prior to carrying out differential expression? Or should I include them in the DE analysis and filter them out just prior to the adjustment of raw p-values?
Thanks!
You're right, it is a de novo transcriptome from Illumina reads. In my DE pipeline, I normally filter out many of the lowly expressed transcripts using filterByExpr() from the edgeR package. I thought that filtering for annotated genes before the DE analysis (or before adjusting p-values) might give more power to detect differences among the annotated genes, which are the ones I can make functional inferences from. I'm not particularly interested in discovering novel genes at present. If the annotation filtering would unduly bias my results, I'll follow your advice. Am I correct in interpreting your advice to be the following?
annotated_results <- results[annotations,]
Thanks!
I think annotation filtering would have unpredictable effects on your analysis, so it is best to avoid it.
You are correct in your interpretation, that is my (second) advice - the first being don't filter non-annotated genes.
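To make the order of operations concrete, here is a minimal sketch in base R. It assumes a results table with one raw p-value per tested gene; the table, the gene IDs, and the annotated set are all simulated stand-ins (hypothetical, not your data), and the column names PValue and FDR are only illustrative. The point is that the p-values are adjusted across every gene that was tested, and the subset to annotated genes happens afterwards.

```r
# Hypothetical stand-in for a DE results table; everything here is simulated.
set.seed(1)
n_genes <- 1000
results <- data.frame(
  PValue = runif(n_genes),
  row.names = paste0("gene", seq_len(n_genes))
)

# Hypothetical vector of annotated gene IDs.
annotations <- sample(rownames(results), 250)

# Step 1: adjust the raw p-values across ALL tested genes (Benjamini-Hochberg).
results$FDR <- p.adjust(results$PValue, method = "BH")

# Step 2: only now subset to the annotated genes for reporting.
annotated_results <- results[annotations, ]
```

With this ordering, the FDR values in annotated_results still refer to the full set of tested genes, so the subset is only a convenience for downstream functional interpretation and does not change any adjusted p-value.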
This seems sensible, thanks for the suggestion.