Hi, I am planning to run differential expression (DE) analysis and then enrichment analysis for an RNA-seq experiment from Arabidopsis. The RNA-seq count table has TAIR tags/IDs as rows. I was thinking that some of these tags are non-coding genes (e.g. microRNAs), so when converting TAIR IDs to Entrez IDs only around 80% are convertible (20% have no associated Entrez ID). Since the purpose of the study is to detect DE genes and perform enrichment analysis (and most non-coding genes do not have annotation available for enrichment analysis), is it wise to keep only the convertible tags and remove the rest (20% of features) from the beginning, i.e. before running the DE analysis?
Exactly, I was thinking of FDR values in both DE and enrichment analysis. For example, 20% of the background reference would be unmapped for enrichment analysis. On the other hand, the sample size is very small (3 per group), so I wanted to have more rows for sharing information across genes while running the DE analysis. All in all, I am not very confident in the final results because of the small sample size, so I want to handle every step properly.
It would make no difference for the enrichment analysis because enrichment tools will filter the background automatically before they start.
Yes, I meant that when correcting for multiple testing in the DE analysis, all the tags are used, and that affects the FDR values and the final list (number) of DE genes. But when performing e.g. Fisher's exact test in enrichment analysis, only 80% of the tags are used as the reference distribution (around 20% are unmapped and removed). I was just concerned about whether these two analysis steps must have the same total number of features. But yes, maybe it won't make much of a difference. Thanks!
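To make the multiple-testing point concrete, here is a minimal sketch of Benjamini-Hochberg adjustment in plain Python. The p-values are made up for illustration (8 "mappable" genes plus 2 "unmappable" tags), and `bh_adjust` is a hypothetical helper, not the function any particular DE package uses. It only shows that the same raw p-value gets a different adjusted value depending on how many features are tested alongside it.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values (illustrative only, not from a real experiment).
mapped = [0.001, 0.004, 0.012, 0.03, 0.2, 0.4, 0.6, 0.9]
unmapped = [0.5, 0.8]

# Adjust with all tags present, then keep only the mapped genes.
adj_all = bh_adjust(mapped + unmapped)[:len(mapped)]
# Adjust after dropping the unmapped tags up front.
adj_mapped_only = bh_adjust(mapped)

for p, a, b in zip(mapped, adj_all, adj_mapped_only):
    print(f"p={p:<6} adj(all tags)={a:.4f}  adj(mapped only)={b:.4f}")
```

In this toy example the adjusted values are smaller when the unmapped tags are removed first, but that depends on the p-values of the removed tags, so it is not a general rule.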
Sorry if I'm being dense, but I'm confused. In the Fisher's test used for enrichment analysis, tags/tag counts/tag distributions aren't used, only the number of genes that are DE and the number of genes that are not DE; there is no reference distribution.
Sorry if I haven't been clear. By reference distribution I meant the total number of detected genes in the experiment (# DE + # non-DE genes).
Okay.
So what I'm trying to say is, the background distribution will be the same whether you leave them in or not, because even if you leave them in, the enrichment tool will automatically subtract them from the number of detected genes in the experiment (as well as from the number of DE genes).
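Here is a rough sketch of that in Python, assuming the enrichment tool restricts both the DE list and the background to annotated IDs before building the 2x2 table. The gene names, the gene set and the helper functions are all hypothetical, and the DE calls are held fixed between the two runs (in practice, pre-filtering could also change which genes pass the FDR cutoff).

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the gene set.

    This is the one-sided Fisher's exact test for over-representation.
    """
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def enrichment_pvalue(de_genes, detected_genes, gene_set, annotated):
    # Mimic the assumed tool behaviour: drop IDs without annotation from
    # both the background and the DE list before counting.
    universe = set(detected_genes) & set(annotated)
    de = set(de_genes) & universe
    members = set(gene_set) & universe
    return hypergeom_upper_tail(len(de & members), len(members), len(de), len(universe))


# Hypothetical IDs: 80 annotated genes and 20 tags without annotation.
annotated = [f"g{i}" for i in range(80)]
unannotated = [f"t{i}" for i in range(20)]
gene_set = annotated[:10]  # a made-up pathway

de_with_tags = annotated[:6] + annotated[40:44] + unannotated[:3]
de_without_tags = [g for g in de_with_tags if g in set(annotated)]

# Unmapped tags left in versus removed before enrichment.
p_left_in = enrichment_pvalue(de_with_tags, annotated + unannotated, gene_set, annotated)
p_removed = enrichment_pvalue(de_without_tags, annotated, gene_set, annotated)

print(p_left_in == p_removed)
```

Both calls end up with the same four counts, so the p-value is identical either way.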