Hello all. I recently got a new project with three controls (c1, c2, c3) and three test samples (s1, s2, s3). The pipeline I followed was TopHat, Cufflinks, Cuffdiff, with reads aligned to hg19. For differential gene expression I used Cuffdiff, which gave me a gene.diff file containing both a pvalue and a qval.
My question is: how should I filter for significantly upregulated or downregulated genes? Should I use qval (0.05) or pvalue (0.05)? If it's pvalue, I need help understanding why we are not considering qval. I have also heard that, to claim statistical significance, we need a minimum of three replicates. Why is that so?
Filter on qval, not the raw pvalue. The "minimum of 3 replicates" is a good general rule of thumb, since you need about that many to have a decent shot at estimating the variance. I personally recommend at least 6 replicates, which happens to fit nicely on a single lane of a HiSeq if you have a standard two-group comparison setup.
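
In case it helps, here is a minimal sketch of that filter in Python. It assumes the tab-separated columns `gene`, `status`, `log2(fold_change)`, `p_value` and `q_value` that Cuffdiff's gene-level diff output normally has, so check the header of your own file first. The rows are made up for illustration, and `split_significant` is just a name I picked.

```python
import csv
import io

# Invented example rows standing in for a Cuffdiff gene-level diff file.
ROWS = [
    ["gene", "status", "log2(fold_change)", "p_value", "q_value"],
    ["GENE_A", "OK", "2.1", "0.0001", "0.004"],
    ["GENE_B", "OK", "-1.7", "0.0005", "0.01"],
    ["GENE_C", "OK", "0.9", "0.03", "0.21"],   # passes p < 0.05 but not q < 0.05
    ["GENE_D", "NOTEST", "0", "1", "1"],       # not tested, so skipped
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def split_significant(handle, alpha=0.05):
    """Return (up, down) gene lists for rows with status OK and q_value < alpha."""
    up, down = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue
        if float(row["q_value"]) >= alpha:
            continue
        log2_fc = float(row["log2(fold_change)"])  # float() also accepts inf / -inf
        if log2_fc > 0:
            up.append(row["gene"])
        elif log2_fc < 0:
            down.append(row["gene"])
    return up, down


up_genes, down_genes = split_significant(io.StringIO(SAMPLE))
```

For a real file you would pass `open(path)` (with `path` pointing at your diff file) instead of the `io.StringIO` wrapper. Which direction counts as "up" depends on the order in which the two conditions were given to Cuffdiff, so check that before labelling the lists.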
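Imagine you're doing a single statistical test and you only accept the result when p < 0.001. A gene with no real difference then has just a 0.1% chance of passing, so you can be about 99.9% sure that a hit is not a fluke. That is what the p value gives you.
But if you do 1,000 tests at that threshold, you should expect about one of them to tell you something untrue (100% - 99.9% = 0.1% of 1,000). If you do 30,000 tests, about one per gene, you're going to have a lot of false positives by the end: roughly 30 at p < 0.001, and roughly 1,500 at p < 0.05 if no gene is truly different.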
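
The arithmetic behind that can be sketched in a few lines of Python. This assumes the worst case where no gene is truly different and the tests are independent, which real expression data will not match exactly; the function names are mine.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing p < alpha when every null is true."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests, alpha):
    """Probability that at least one test passes by chance (independent tests)."""
    return 1.0 - (1.0 - alpha) ** n_tests


one_in_thousand = expected_false_positives(1_000, 0.001)
genome_strict = expected_false_positives(30_000, 0.001)
genome_usual = expected_false_positives(30_000, 0.05)
any_hit = chance_of_any_false_positive(1_000, 0.001)
```

So the raw p-value threshold that looks safe for one test lets through more and more chance hits as the number of tests grows.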
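The q-value is an adjusted p-value that takes into account how many tests you're doing. Filtering at q < 0.05 means you expect about 5% of the genes you call significant to be false positives. That proportion is called the False Discovery Rate (FDR), and there are multiple ways of controlling it, with Benjamini-Hochberg being a common one.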
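
To make the adjustment less of a black box, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way of turning p-values into FDR-adjusted values. I am not claiming this is exactly what your tool runs internally; the function name and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and every higher rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # invented p-values
qvals = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as its raw p-value, and the penalty grows with the number of tests, which is why far fewer genes survive a q < 0.05 filter than a p < 0.05 one.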
Long story short: Use the q-value, it reduces the number of false positives.