Question

Genes vs transcripts in RNA-seq analysis

3.9 years ago

rstepien095 • 0

Hello,

I am working on my thesis, in which I compare RNA-seq results between pipelines. I am a little confused about where I work with transcripts and where with genes. Here is what I do: I have cDNA reads, which are pseudo-aligned to the genome using cDNA+ncRNA with the kallisto software. At this point everything seems OK and I have an example output file: https://ibb.co/985hdf2

In the target_id column there are transcripts (ENSTXXXXXX). However, when I pass the raw counts to DESeq2, I obtain an output file whose GeneID column contains transcript names (ENSTXXXXXX). Moreover, according to the DESeq2 documentation, the MA plot represents each gene with a dot. https://ibb.co/ky1VxjQ

So does DESeq2 turn transcripts into the corresponding genes? If not, why do I have transcript names in the GeneID column of the output file? https://ibb.co/2k0Vhf0

Lastly, when I want to obtain differentially expressed genes and DESeq2 returns only the statistically significant genes (transcripts?), how do I know which log2FoldChange values indicate upregulated and downregulated genes? Is there a way to determine a threshold? I also considered interpreting all genes with p-value < 0.01 as differentially expressed, but then my analysis shows 50k upregulated and 50k downregulated genes, which does not seem realistic, because most publications report roughly 100 to 3k DEGs. Thanks in advance.

RNA-Seq R deseq deseq2 kallisto • 1.6k views

updated 3.9 years ago by caggtaagtat ★ 1.9k • written 3.9 years ago by rstepien095 • 0

score 3 · Answer 1 · 2020-06-04

You state you used kallisto to pseudoalign to the genome. That's incorrect; you pseudoaligned to the transcriptome (your cDNA+ncRNA reference is a set of transcript sequences).

When you passed the kallisto counts to DESeq2, you are getting the results of differential expression analysis at the transcript-level, hence why you see transcript IDs (ENSTXXXXXX) -- those are not genes and you are definitely not doing gene-level analysis. If you want to do gene-level analysis with DESeq2, you should use tximport which is designed to summarize kallisto's transcript-level estimates for gene-level analysis.

Another suggestion is to use sleuth (instead of DESeq2) to perform gene-level analysis. This is what I typically use when working with kallisto-generated count estimates.

Log2FoldChange > 0 means upregulation; Log2FoldChange < 0 means downregulation. That's the beauty of the log-transform; if your fold change is 1 (i.e. no change) and you take the log of that, you get 0. Less than 0: downregulation; Greater than 0: upregulation.
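To make the sign convention concrete, here is a small base R sketch. The counts are made-up numbers for a single gene, and the simple ratio is only an illustration: DESeq2 estimates log2FoldChange from its model rather than from a plain ratio of means. The helper `direction` is a hypothetical name, not a DESeq2 function.

```r
# Hypothetical mean normalized counts for one gene (illustrative numbers only).
control      <- 100
treated_up   <- 400   # four times the control level
treated_down <- 25    # one quarter of the control level

# log2 of the ratio treated / control
lfc_up   <- log2(treated_up / control)    # positive: higher in treated
lfc_down <- log2(treated_down / control)  # negative: lower in treated
lfc_none <- log2(control / control)       # fold change of 1 gives 0

# Hypothetical helper: read the direction off the sign.
direction <- function(lfc) {
  if (lfc > 0) "up" else if (lfc < 0) "down" else "unchanged"
}

print(c(up = lfc_up, down = lfc_down, none = lfc_none))
```

Note the symmetry: a 4-fold increase and a 4-fold decrease have the same magnitude and opposite sign.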

As for what threshold you want to use for the magnitude of the log2FoldChange, that's entirely up to you. What do you consider biologically relevant?

Using p-value < 0.01 is a bad idea. You need to use the padj column (i.e. the adjusted p-values; look up the multiple comparisons problem).
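A quick base R sketch of why the raw p-value is misleading. It assumes a toy "null" experiment in which nothing is truly different, represented by 1000 evenly spaced p-values (made-up data, not DESeq2 output). The adjustment uses `p.adjust` with the Benjamini-Hochberg method, which is also the default adjustment behind DESeq2's padj column.

```r
# Toy null experiment: 1000 features, none truly different.
# Evenly spaced p-values stand in for the roughly uniform p-values
# expected under the null (illustrative data only).
n <- 1000
p <- (seq_len(n) - 0.5) / n

# Benjamini-Hochberg adjustment (base R, stats package).
padj <- p.adjust(p, method = "BH")

n_raw <- sum(p < 0.01)      # features "significant" on raw p-values
n_adj <- sum(padj < 0.01)   # features significant after adjustment

cat("raw p < 0.01: ", n_raw, "\n")
cat("padj  < 0.01: ", n_adj, "\n")
```

Here about 1% of the features pass the raw cutoff purely by chance, while none survive the adjustment. With tens of thousands of transcripts, that 1% alone is hundreds of false calls.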

Here's an example of what you might want to consider doing: Select all genes with adjusted p-value < 0.05, and among those genes, further select the genes with a log2FoldChange > 1 ("upregulated") and the genes with a log2FoldChange < -1 ("downregulated").
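That filtering could look roughly like this in base R. The table below is a made-up stand-in that only borrows the column names log2FoldChange and padj from DESeq2 output; with real results you would first convert them with `as.data.frame()`. The cutoffs 0.05 and 1 are the example values from above, not recommendations.

```r
# Hypothetical results table using DESeq2-style column names.
res <- data.frame(
  gene           = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange = c(2.3, -1.8, 0.4, 3.1, -2.5),
  padj           = c(0.001, 0.02, 0.003, 0.20, NA)
)

# padj can be NA in DESeq2 output, so exclude those rows explicitly.
sig <- res[!is.na(res$padj) & res$padj < 0.05, ]

# Split the significant set by direction and magnitude.
up   <- sig[sig$log2FoldChange >  1, ]
down <- sig[sig$log2FoldChange < -1, ]

print(up)
print(down)
```

In this toy table geneC is significant but too small a change to pass the magnitude cutoff, and geneD is a large change that is not significant, so neither ends up in `up` or `down`.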

score 1 · Answer 2 · 2020-06-04

Hi,

After loading your data into R for the DGE analysis, you have to summarize the transcript data to the gene level. I would look into the tximport R package. Since your estimates come from kallisto, the call would be:

txi <- tximport(files, type="kallisto", tx2gene=tx2gene)

All you need are the paths to your kallisto abundance files (files) and a data frame that maps every transcript to its gene (tx2gene). Then you get your count matrix, but with gene names.
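To show what "summarize to the gene level" means, here is a conceptual base R sketch that simply sums transcript counts per gene. This is not what tximport does internally (it also accounts for transcript lengths, among other things), and all IDs and counts below are placeholders. For tximport itself, tx2gene needs the transcript ID in the first column and the gene ID in the second; the column names used here are just a convention.

```r
# Hypothetical transcript-level counts: rows are transcripts, columns are samples.
tx_counts <- matrix(
  c(10, 12,
    30, 28,
     5,  7),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("ENST0001", "ENST0002", "ENST0003"),
                  c("sample1", "sample2"))
)

# Hypothetical transcript-to-gene map: transcript ID first, gene ID second.
tx2gene <- data.frame(
  TXNAME = c("ENST0001", "ENST0002", "ENST0003"),
  GENEID = c("ENSG0001", "ENSG0001", "ENSG0002")
)

# Look up the gene for each row of the count matrix.
gene_of_tx <- as.character(tx2gene$GENEID[match(rownames(tx_counts), tx2gene$TXNAME)])

# Sum the rows that belong to the same gene.
gene_counts <- rowsum(tx_counts, group = gene_of_tx)

print(gene_counts)
```

The two transcripts of ENSG0001 collapse into a single row, which is why the GeneID column holds gene IDs only after this step.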

The way you used DESeq2, it treats every transcript like an independent gene, which most certainly interferes with the statistical assumptions of the tool.

There is a paper about which log2FoldChange thresholds to use for which DGE tool and number of biological replicates. Some people use a log2FoldChange threshold of 1 (and -1) and an adjusted p-value of 0.01. A log2FoldChange of 1 means a 100% increase in expression, i.e. a doubling. Alternatively, you could use 0.585, which represents an increase of at least 50%. But sometimes you want to catch every difference, however slight, so you could use a threshold of 0. That depends on the experiment.
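If it helps when picking a cutoff, the conversion between a log2FoldChange and a percent change is short enough to write out. The two function names below are made up for this sketch.

```r
# Hypothetical helpers converting between log2 fold change and percent change.
lfc_to_percent <- function(lfc) (2^lfc - 1) * 100
percent_to_lfc <- function(percent) log2(1 + percent / 100)

pct_at_lfc1     <- lfc_to_percent(1)    # a doubling is a 100% increase
lfc_at_50pct    <- percent_to_lfc(50)   # roughly 0.585
pct_at_lfc_neg1 <- lfc_to_percent(-1)   # a halving is a 50% decrease

print(c(pct_at_lfc1, lfc_at_50pct, pct_at_lfc_neg1))
```

Note that the scale is symmetric in log space but not in percent: a log2FoldChange of -1 is a 50% decrease, not a 100% decrease.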