Question

RSEM gene-level estimates to DESeq2

0

Entering edit mode

3.1 years ago

mehdim • 0

Hello, I used an RSEM, gene-level count estimates matrix and just rounded the values using the round() function in R to feed them to DESeq2 but the results are very odd (VERY low number of deferentially expressed genes). Is this really a solid way to go about deferential analysis or should I resort to starting from raw data again.

I read multiple threads that suggested that the rounding method, although not optimal, should work fine.

NOTE: The study I got the data from provided the data as RSEM gene-level count matrix and FPKM normalized matrix. I had to use rsem since I know DESeq2 only takes non-normalized counts. The study used an independent t-test on the FPKM file to do the analysis, but I read somewhere that it is highly discouraged.

RNA-seq DESeq2 RSEM DGEA • 3.0k views

ADD COMMENT • link updated 3.1 years ago by ATpoint 82k • written 3.1 years ago by mehdim • 0

1

Entering edit mode

3.1 years ago

rodolfo.peacewalker ▴ 390

In this sense, as rpolicastro comments:

You must avoid to use FPKM values to perform differential expression analysis. Use the raw counts estimated by RSEM since DESeq2 or edgeR are aware of library size, composition and over-dispersion of genes to normalize the data. You will find an excelent explanaiton about the normalization method of DESeq2 here

Please, share your code to create the dds object, the design formula and for obtaining your results with padj and lfc thresholds.
Also, in your volcano plot it looks like that you are using a different lfc threshold (vertical lines) to obtain your DEGs than the employed for obtain the res object. Perhaps, this could be the reason why you are recovering a small number of DEGs. Take a look at this comment

Remember that in differential expression analysis, these approaches assume that there are a low number of DEG between comparisons.

Best regards!

ADD COMMENT • link 3.1 years ago by rodolfo.peacewalker ▴ 390

score 4 · Accepted Answer · 2021-03-31

4

Entering edit mode

3.1 years ago

rpolicastro 13k

The study used an independent t-test on the FPKM file to do the analysis, but I read somewhere that it is highly discouraged.

This is wrong on many levels, which is why methods such as DESeq2 and edgeR were developed. I won't go into detail, but it's imperative to control for library sizes differences, compositional effects, and over-dispersion of counts. I would consider a t-test on FPKM counts a random number generator for all intents and purposes.

It might actually be the case there are very little differentially expressed genes, but the original study found some due to using a random number generator analysis. I would first make sure your code is correct, and if so a good QC step would be to perform PCA analysis using PCAtools to look for any separation of samples by condition in the first PCs. Use the rlog counts from DESeq2 for this. Another step would be to generate a histogram of p-values from the analysis. If there was no effect the histogram would look fairly uniform. If you see a bimodal histogram, or counts piling up near 1, post the picture here for further QC.

ADD COMMENT • link 3.1 years ago by rpolicastro 13k

0

Entering edit mode

0Hello! Thank you very much for your timely reply. I check the code again to make sure I did not miss any important step or detail and I got the following MA, Dispersion and Volcano plots in my analysis so far: Disp Volcano

I will do a PCA and get the p-value plots as well and post them here. I posted the above plots to make sure I did not miss anything so far.

I should also add that the analysis is between three grades of glioma 2, 3 and 4 as conditions and that the plots are the plots are for the 4vs2 results from DESeq2.

ADD REPLY • link 3.1 years ago by mehdim • 0

1

Entering edit mode

Looks all normal to me. Be happy that you have a limited number of genes to focus on rather than thousands of DEGs and the entire transcriptome going wild with no idea which genes to start with. How many DEGs is that, like a hundred, that is ok I'd say, nothing to worry about. If you expected many more DEGs (or altered gene functions in general) based on a strong phenotype you might want to investigate mechanisms beyond transcription such as posttranscriptional processes that might drive the phenotype. If this is published data you can try recount which is a Bioconductor package that provides raw counts for almost all RNA-seq studies in the SRA (at least from mouse/human afaik) processed by an independent method (not RSEM, I think it was called Rail-RNA and uses a custom bowtie2 version that is splice-aware). Anyway, this gives raw counts that you could use to check that the result is not a processing issue.

ADD REPLY • link 3.1 years ago by ATpoint 82k