As part of an RNA seq course I planned to let my students calculate the p-value (calculated by cuffdiff 2) for differential expression by hand using emperically derived distributions.
However, upon doing so I got problems with the formula E[ log[Y] ] / Var[ log[Y] ] which is considered to be the Test statistic. I would appreciate if someone could consider the issue.
If I understood correctly, the T statistic should be derivable by first creating a log( fold_change) distribution by dividing the FPKM distibutions of condition 1 by that of condition 2 and taking the log of the distribution. According to the formula the mean divided by the variance of this distribution should give the T statistic. However, when I do this in an empirical fashion I only obtain the T statistic by diving the mean by the SD. (Which actually makes more sense to me). Is this correct or am I missing an important fact here. Down I provide a summary of the reconstruction (partially R code) I used for this purpose.
Thanks in advance and greetings, Edouard
the corresponding line from cuffdiff. ( 2.97091) is the test statistic.
#AT1G78870.2-4065-0 AT1G78870 - Chr1:29650453-29652593 q1 q2 OK 39.5705 30.852 -0.35906 2.97091 0.0029691
##corresponding R code (I recalculated the BNB parameters from the cuffdiff output files).
condition1 = rnbinom( 100000, p = rbeta( 100000, shape1 = 11516.9 , shape2 = 14527 ), size = 569 ) condition2 = rnbinom( 100000, p = rbeta( 100000, shape1 = 7586.58 , shape2 = 7108.11 ), size = 605 )
#convert the count distributions to FPKM distributions condition1 = condition1 * ( 39.5705 / mean( condition1 ) ) condition2 = condition2 * ( 30.852 / mean( condition2 ) )
#calculate log fold change distribution fold_change = log( condition1 / condition2 )
#incorrect T value mean( fold_change ) / var( fold_change ) 35.45583
#correct T value mean( fold_change ) / sd( fold_change ) 2.971857