Question: Should non-protein-coding RNA (e.g. lncRNA) be removed in RNA-Seq differential expression analysis?
22 months ago
hellocita20 wrote:

Hi, I am doing an RNA-Seq differential expression analysis, and I wonder whether non-protein-coding genes, such as lncRNAs or pseudogenes, should be removed before the analysis. The purpose is to find the expression differences between the control and observation groups, and to relate those differences to known biological pathways/functions.

More information: The data I used was RNA-seq data (polyA-enriched RNA, sequenced on an Illumina HiSeq). I mapped the reads to the evidence-based annotation of the human genome (GRCh38), version 24 (Ensembl 83), downloaded from GENCODE. Finally, I got a list of differentially expressed genes (DEGs). Then I tried to convert these DEGs from Ensembl IDs to HGNC symbols and search for their biological functions. However, I found that some of the genes, such as ENSG00000270000 and ENSG00000257155, are lncRNAs and do not have an HGNC symbol, so they are not protein-coding genes.

I wonder if I have done something wrong :(

Thanks for your answer

22 months ago
michael.ante wrote:

Hi Hellocita,

Both genes you mentioned have an A-rich region at the 3' end of the cDNA (e.g. ENSG00000257155 / ENST00000548096). Therefore, the polyA fishing / enrichment can result in reads from these transcripts.

I guess you did nothing wrong.

Regarding keeping these genes in your analysis: you can run the DE analysis both ways (with and without them) and see how strongly these genes influence the variance/overdispersion. These genes seem to be detected due to off-target effects, which may follow a different statistical process than polyadenylated genes.
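
As a rough sketch of running it both ways: the Python below splits a count table by biotype using a GENCODE-style GTF, so that the DE tool of your choice can be run once on all genes and once on the protein-coding subset only. It assumes the GTF stores the biotype in a `gene_type` attribute (GENCODE's convention; Ensembl's own GTF files use `gene_biotype`) and that the counts sit in a plain dict keyed by gene ID. The function names, the toy gene IDs and the counts are made up for illustration, and the exact biotype labels depend on the annotation release.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_biotypes(gtf_lines):
    """Map gene_id (version suffix stripped) -> biotype for 'gene' records."""
    biotypes = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level records carry what we need
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id", "").split(".")[0]
        # Assumption: GENCODE uses gene_type, Ensembl uses gene_biotype.
        biotype = attrs.get("gene_type") or attrs.get("gene_biotype")
        if gene_id and biotype:
            biotypes[gene_id] = biotype
    return biotypes


def split_counts(counts, biotypes, keep=("protein_coding",)):
    """Split a {gene_id: [counts...]} table into kept and set-aside genes."""
    kept, other = {}, {}
    for gene_id, row in counts.items():
        biotype = biotypes.get(gene_id.split(".")[0])
        target = kept if biotype in keep else other
        target[gene_id] = row
    return kept, other


# Toy input: IDs, biotypes and counts are invented for illustration only.
toy_gtf = [
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\t'
    'gene_id "GENE_A.1"; gene_type "protein_coding"; gene_name "A";',
    'chr1\tHAVANA\tgene\t2000\t2500\t.\t-\t.\t'
    'gene_id "GENE_B.2"; gene_type "lincRNA"; gene_name "B";',
]
toy_counts = {"GENE_A.1": [10, 12, 30, 28], "GENE_B.2": [0, 1, 5, 7]}

coding, noncoding = split_counts(toy_counts, gene_biotypes(toy_gtf))
```

Here `coding` would go into the filtered run and `toy_counts` into the full run; comparing the dispersion estimates and the DEG lists from the two runs shows how much the set-aside genes matter.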



Hi Michael, I still do not fully understand how off-target effects relate to the non-coding RNAs I got, since off-target effects are mostly associated with siRNA. Do you mean that these genes are called DEGs because of an off-target effect during the experiment, and that one should therefore use different statistics to double-check them?

Hi Hellocita,

I mean that the genes which are not protein-coding (especially the two you mentioned in your question) are off-targets of the polyA enrichment. Your oligo-dT primer has a certain length and might bind to intrinsic A-rich regions of certain transcripts. The enrichment of these non-target genes might follow a different statistical process than the enrichment of the polyadenylated genes.

Therefore, I'd double check the results of the DE-analysis.
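
One simple way to check a transcript for such A-rich stretches is a sliding-window count of adenines over its cDNA sequence. The snippet below is only a minimal sketch: the function name is hypothetical, the window length (20) and threshold (15 A's) are arbitrary assumptions rather than values tied to any particular oligo-dT protocol, and the example sequence is invented.

```python
def a_rich_windows(seq, window=20, min_a=15):
    """Return (start, a_count) for every window holding at least min_a A's."""
    seq = seq.upper()
    hits = []
    if len(seq) < window:
        return hits
    # Count A's in the first window, then update incrementally while sliding.
    a_count = seq[:window].count("A")
    for start in range(len(seq) - window + 1):
        if start > 0:
            a_count += (seq[start + window - 1] == "A") - (seq[start - 1] == "A")
        if a_count >= min_a:
            hits.append((start, a_count))
    return hits


# Invented toy sequence: 18 A's flanked by GC-rich ends.
toy_seq = "GCGC" + "A" * 18 + "GCGC"
hits = a_rich_windows(toy_seq)
```

Transcripts with hits near the 3' end or inside the body would be the ones to look at more closely when double-checking the DE results.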



I see, thank you Michael!

