DEseq differential expression and RPKM
4
2
Entering edit mode
8.4 years ago

Hi all,

I have a question regarding differential expression using DEseq. I have 2 plant tissues (leaf and stem) and 2 species (A and B) with 3 biological replicate each. After performing differential expression using DEseq, can i compare the top 10 differential expressed genes and see how their RPKM values are compared? For example for gene1 can i calculate ratio of RPKM between A/B for leaf and stem respectively?

I wanted to do this so that i have two independent analyses (one using RPKM and the other using DEseq). Am i thinking correct or i need to compare the differential expressed genes with normalized counts or no need to compare at all. Any help is appreciated.

Thanks
Upendra

DEseq RPKM RNA-Seq • 11k views
1
Entering edit mode

I'm afraid you might need to be more specific to get fruitful answers from the community. Assuming this question is actually a DESeq question, you might wanna state clearly what you're looking to achieve and why you have chosen this path. Then delve deeper and frame the question so it is actually about DESeq. Right now, while you have beautifully explained what you have, I'm not sure what you are looking to analyze.

0
Entering edit mode

Why are you talking about RPKM and DESeq in the same sentence. Please tell me that you're not putting rounded RPKM values into DESeq...

4
Entering edit mode

It sounds like Upendra wants to do two independent statistical analyses, one by DESeq and another by ratio of RPKM. I don't really see a question, other than 'is this a good idea?'

Consider the result as a venn diagram, some genes found DE by either method. What good does that do? What will you conclude of a gene found DE by one method and not another? This is the path towards madness.

0
Entering edit mode

I wanted to be a bit civil about making that last statement, hence my stretched out but pointless comment :)

0
Entering edit mode

Ah, I pretty much stopped reading after seeing RPKM and DESeq mentioned together. I've simply seen way too many people try to use DESeq/edgeR/etc. with inappropriate data.

Anyway, it would seem to make more sense to perform the statistical tests and then filter by fold-change to get a list of likely biologically relevant changes.

0
Entering edit mode

What will you conclude of a gene found DE by one method and not another? This is the path towards madness.

It's good to point this out and to be diplomatic. However, any given method for calling DE makes assumptions, and trying different methods can test the limits of these assumptions. One can perhaps have more confidence for genes in the intersection. Genes resulting from only one or the other may be near the edge, and if it's your favorite gene and perhaps providing a clue, it may be telling you something about your experiment. If you have no clues about any of the genes, and are simply doing a massive screen of all genes - then you're right, these are simply distracting, and considering them is pointless.

0
Entering edit mode

Thank you all for the replies with your comments/suggestions/criticisms etc., @Devon I am not putting RPKM values into DESeq and I am using counts only for performing DESeq and so I guess I am ok with it. @karl you are right.. my intention is to perform two different analyses and then compare them together but maybe I am thinking it is not a good idea since RPKM has already proved to be not good. So you all think I just should stick with DESeq and don't bother about RPKM then?

0
Entering edit mode

Youre free to do either analysis, but think about your goals. What will you do with the genes found to be DE by one tool and not the other? Maybe ignore them, and keep the ones where both methods agree? Maybe a smaller gene list is what youre after, but most of us will warn you that a simple ratio of RPKM will mislead you with artifacts. I've seen some people report >4000 fold change due to small RPKM in a ratio; and it's nonsense. DESeq will more accurately model the read distribution than a ratio. Maybe instead, do just DESeq or edgeR, then when you have a good gene list, go ahead and look at only their RPKM ratios if those are easier for you to report.

0
Entering edit mode

Thanks @karl. Sounds good. But I have one final question. After I get the differentially expressed genes with significant FDR, do I need to worry about reporting their RPKM ratios since DESeq itself is proved to be quite reliable for modeling RNASeq data and differential expression?

0
Entering edit mode

I wouldn't waste the time on it. Your target journal might ask for qPCR verification anyway. I'm more interested in the RPKM's ratio to whichever "housekeeping gene" your lab likes to use.

5
Entering edit mode
8.4 years ago

If the intention is to check the validity or robustness of your results with respect to different normalization methods, you could do the following:

• Calculate Spearman's rank correlation between fold changes reported by DEseq and RPKM fold change
• Take some top 100, 1000 etc. lists and check if they significantly overlapping doing a hypergeometric test for sets ranked by different unit (DEseq fold change, RPKM fold change)
• Get the library size normalized expression values per library and compare with RPKM per library. cluster the correlation matrix. Do samples cluster 'correctly' based on replicates?
• Do an MDS plot of the library size normalized abundances and RPKM, do the replicates group 'correctly'?

There have been many concerns raised against the use of transcript-length normalization FPKM/RPKM as a unit to report gene abundance, some found on BioStar as well. As a consequence, I would not use FPKM as a unit to report expression levels, but abundances and library size normalized fold changes (for partial reports where only a subset of genes is reported), readers could length normalize abundances afterwards if they really wanted to, while it is difficult to transform FPKM back to abundance afterwards.

For reports containing all genes, reporting the raw counts is probably best, then the user can choose the methods for computing abundance, fold change and test freely.

0
Entering edit mode
8.4 years ago

I think standard statistical analysis of rounded, log2 transformed RPKM values and DESeq should both provide fairly robust results. If you really want, you could take the overlap, but I would typically just use one or the other.

If you wanted to get an idea about that the overlap looks like for lists of genes with FDR < 0.05 and |fold-change| > 1.5, then you can see a couple comparisons here:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

0
Entering edit mode
8.4 years ago

Thanks again for all for you comments. I started this thread after I saw the below table in one of the papers and I started to wonder about it for my own analyses.

Gene model     annotation         tissue1 RPKM     tissue2 RPKM     tissue1/tissue2 RPKM
Gene1          some annotation    13.72            0.00             13716
Gene2          some annotation    13.40            0.00             13396
Gene3          some annotation    480.38           0.18             2654


What do you think about this? From what I gathered so far from all of your comments above, I guess it is not necessary to include the columns that shows the RPKM values and only include columns1 and 2 which are basically top differentially expressed genes and their annotations using DESeq. Right?

Thanks
Upendra

0
Entering edit mode
8.4 years ago

I would round the RPKM values to avoid unreasonably high fold-change values from low-coverage genes (and you will also never divide by zero this way). The rounding factor I typically use 0.1. If you use a log2ratio to represent the difference, I would use something like:

log2(RPKM_tissue1 + 0.1) - log2(RPKM_tissue2 + 0.1)


You can see the impact of the various rounding values here, but the bottom line is that I would use something between 0.01 and 1 (depending upon how conservative you want to be):

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

0
Entering edit mode

Thanks @Charles. I think in the paper that i was referring to, they have converted all "0" RPKMs to "0.001" to avoid losing ratio data. According to you it sounds that they should have used 0.1. Right? Also do you think it is advisable to do this log2(RPKM_tissue1 + 0.1) - log2(RPKM_tissue2 + 0.1) compared to tissue1/tissue2 RPKM (after rounding of with 0.1) or it is just optional.....

1
Entering edit mode

Different people will have different opinions. As I think you understand, some reads are infinitely more than no reads, so I would definitely recommend not focusing on the low coverage genes (although the definition of "low" will vary between different people).

The bottom line is that I think Gene3 in the example above is more likely to show a stronger difference using something like qPCR validation. If you rounded the RPKM values to 0.1, then the last column would say 137 for 134 for Gene1 and Gene2, respectively. Now, Gene 3 ranks higher than Gene1 or Gene2.

I would recommend visualizing the alignment for a few genes if you want to get an idea about what RPKM values really look like in terms of coverage in your own dataset.

I don't think it matters whether you report the change on a linear or log scale, as long as it is labeled accordingly.

0
Entering edit mode

Thanks Charles for your help. Much appreciated....