Selecting a range of log fold change values for representing differentially expressed genes in an RNA-Seq experiment
4
1
9.8 years ago
menorca89 ▴ 20

I am carrying out an RNA-Seq study with four time points for each of two knockout strains, with NO replicates. Since I cannot rely on the p-value/q-value in this case, I thought of looking at the log fold change values obtained from Cuffdiff. I want to represent the most differentially expressed genes across the four time points of a strain as a heatmap, but I am unsure how to select them. Should I sort the values from the pairwise comparisons and select the top 10 and bottom 10? Or should I choose a range of values, such as -2 to +2? I am confused because the sample values corresponding to positive or negative INF are extremely high, and I thought those genes should definitely be represented.

If anyone has any suggestions about representing DE Genes across time points or between the strains (T1 Knockout1 vs T1 Knockout2, etc), I would be grateful for the same.

Many thanks in advance!

Cummerbund Heatmap log-fold RNA-Seq Cuffdiff • 7.7k views
1
9.8 years ago

For looking purely at the difference due to the knockout, you do have replicates (the four time points of each strain can serve that role), so you can just use a likelihood ratio test and get some usable p-values. For the comparisons at individual time points, I would recommend not bothering. Your study isn't designed to look for differences at individual time points, so you'd largely just be wasting your time.

0

I am going to chime in to doubly agree with Devon. Fold changes are not a good substitute for a p-value calculated from replicates: some genes bounce around a lot (3-4x) across environmental conditions for no apparent reason, while others have low variance, so that even a 1.5x fold change may be biologically important. You don't want to spend time on data if you have no way of knowing what you are looking at.

1
9.8 years ago

I would check the absolute expression values as well, because low-expression genes can have deceptively high fold-change values (since something is infinitely more than nothing).

I tend to use log2(RPKM + 0.1) expression values, but I think you could add a rounding factor anywhere between 0.01 and 1. If you want to get an idea of the effect of different rounding values, you can see this paper.
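To make the effect of the rounding factor concrete, here is a minimal sketch in Python. The function names and the two example genes are made up for illustration; only the log2(RPKM + 0.1) transform and the 0.01 to 1 range come from the answer above.

```python
import math

def log2_expr(value, pseudocount=0.1):
    """Return log2(value + pseudocount); the offset keeps zero values finite."""
    return math.log2(value + pseudocount)

def log2_fold_change(value_1, value_2, pseudocount=0.1):
    """Log2 fold change of value_2 over value_1 after adding the offset."""
    return log2_expr(value_2, pseudocount) - log2_expr(value_1, pseudocount)

# Hypothetical (value_1, value_2) pairs, chosen only for illustration.
pairs = {
    "low_expression": (0.0, 0.5),       # absent vs. barely detected
    "high_expression": (100.0, 400.0),  # a genuine 4-fold change
}

# Compare how strongly each offset damps the two fold changes.
for offset in (0.01, 0.1, 1.0):
    for name, (v1, v2) in pairs.items():
        lfc = log2_fold_change(v1, v2, offset)
        print(f"offset={offset:<5} {name:<16} log2FC={lfc:6.2f}")
```

The low-expression pair swings widely as the offset changes, while the highly expressed pair stays close to a log2 fold change of 2. That is the reason to look at absolute expression alongside the fold change.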

1
9.8 years ago
menorca89 ▴ 20

Thank you for your reply, Charles. I do realize that having no replicates is a problem and that the experiment should have been designed differently. But, with the present data, I am now thinking of first removing those entries for which Cuffdiff could not find much information and reported 'NO TEST' or 'FAIL'. Then, for genes with an FPKM between 0 and 0.05 in one sample, I shall keep them only if the corresponding value in the other sample is above 10 (an arbitrary cutoff), and discard them otherwise. Finally, I would perhaps filter on the log fold change values and keep those beyond +/- 1.5 or so.

I hope this will be better than simply going by the p-values in this case.
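The three filtering steps described above could be sketched roughly as follows. This is only an illustration under assumptions: the gene names, the status values, and rows g4 to g6 are hypothetical, the status strings `NOTEST` and `FAIL` are assumed to be how the status column spells them, and the thresholds are the arbitrary ones from the post.

```python
# Rows g1-g3 reuse expression values quoted later in this thread; their
# status is assumed to be "OK". Rows g4-g6 are entirely hypothetical.
INF = float("inf")
rows = [
    {"gene": "g1", "status": "OK", "value_1": 0.091093, "value_2": 262.705, "log2fc": 11.4938},
    {"gene": "g2", "status": "OK", "value_1": 0.0, "value_2": 4.69e6, "log2fc": INF},
    {"gene": "g3", "status": "OK", "value_1": 215.63, "value_2": 0.172036, "log2fc": -10.2916},
    {"gene": "g4", "status": "NOTEST", "value_1": 0.0, "value_2": 0.3, "log2fc": INF},
    {"gene": "g5", "status": "OK", "value_1": 0.02, "value_2": 1.2, "log2fc": 5.91},
    {"gene": "g6", "status": "OK", "value_1": 50.0, "value_2": 60.0, "log2fc": 0.263},
]

LOW_FPKM = 0.05         # "FPKM between 0 and 0.05"
MIN_OTHER_FPKM = 10.0   # arbitrary cutoff for the other sample
MIN_ABS_LOG2FC = 1.5    # "+/- 1.5 or so"

def keep_gene(row):
    """Apply the three filters in the order described in the post."""
    # Step 1: drop entries Cuffdiff could not test.
    if row["status"] in ("NOTEST", "FAIL"):
        return False
    # Step 2: a near-zero sample is kept only if the other sample is high.
    low = min(row["value_1"], row["value_2"])
    high = max(row["value_1"], row["value_2"])
    if low <= LOW_FPKM and high <= MIN_OTHER_FPKM:
        return False
    # Step 3: require a large absolute log fold change (inf passes).
    return abs(row["log2fc"]) > MIN_ABS_LOG2FC

kept = [row["gene"] for row in rows if keep_gene(row)]
print(kept)
```

With these example rows, g4 is dropped for its status, g5 because its higher value is still low, and g6 because its fold change is small.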

0
9.8 years ago
menorca89 ▴ 20

Hello,

Thank you for your replies.

Running Cuffdiff on the double knockout and triple knockout strain (at the same time point), I got a table with the fold change, p-value, q-value, etc. What I observe is that genes with a log fold change as high as 10 or 11 are not deemed significant, and mostly the genes with a value of 0 in one sample are categorized as 'significant'. The top 2 genes are actually knocked out in the first sample, though they still show slight expression, and I would have expected them to show up as DE genes. So I am still confused about the range of log fold change to consider. The values in this case range from -11 to 11. Some of the entries can be seen below. Thank you in advance.

value_1     value_2     log2(fold_change)     test_stat     p_value     q_value      significant
0.091093    262.705     11.4938               0.124772      0.01725     0.369422     no
4.21505     5174        10.2615               8.37264       0.00405     0.219178     no
2.13768     139.416     6.02721               0.756259      0.1381      0.683679     no
0           4.69E+06    inf                   #NAME?        5.00E-05    0.017711     yes
0           1.91E+06    inf                   #NAME?        5.00E-05    0.017711     yes


1657.79     0.602447    -11.4261              -20.6374      0.0652      0.543262     no
215.63      0.172036    -10.2916              -10.4816      0.14555     0.691207     no
2

I think the problem is that you don't have any replicates. This is a situation where I think Cuffdiff is especially likely to give weird results. To be fair, I don't think there actually is a good way to analyze gene expression without replicates.

A Fisher's exact test on read counts will probably correlate with fold change (except for low-expression genes, which should be ignored anyway). Or, you can just rank based upon fold change without a p-value.
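For reference, a Fisher's exact test on the read counts of a single gene could be sketched like this. It is a plain-Python illustration, not what Cuffdiff does; the read counts and library sizes are invented, and the 2x2 layout (gene reads vs. all other reads, per sample) is an assumption about how one would set up the table.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    log_denom = log_choose(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.exp(log_choose(row1, x) + log_choose(row2, col1 - x) - log_denom)

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, p_value)

def gene_table(count_a, total_a, count_b, total_b):
    """Gene reads vs. all other reads in samples A and B (assumed layout)."""
    return count_a, total_a - count_a, count_b, total_b - count_b

# Hypothetical gene: 5 of 10M reads in sample A, 50 of 12M reads in sample B.
p_changed = fisher_exact_two_sided(*gene_table(5, 10_000_000, 50, 12_000_000))
# Hypothetical gene whose counts simply track the library sizes.
p_flat = fisher_exact_two_sided(*gene_table(25, 10_000_000, 30, 12_000_000))
print(p_changed, p_flat)
```

Keep in mind that this test only accounts for counting noise within the two libraries. It knows nothing about biological variability between replicate experiments, so a small p-value here is not evidence of a reproducible difference.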

However, no matter what, I think the problem is that fold-change values would likely change if you were to repeat the experiment, and the highest fold-change values may actually show the greatest difference between replicate experiments (for example, they may come from a gene with highly variable expression). So, I think this is something you will just have to keep in mind, making sure to do qPCR validation with replicates for genes of interest.
