Hi guys, suppose you are in the following situation:
SampleA1  SampleA2  SampleA3  Ctrl1  Ctrl2  SampleB1  SampleB2  SampleB3
234       1         32        5      2      0         21        12344
2434      134       0         2      0      0         0         0
1         0         0         1      1      1234      456       345
...
Specifically, rows are genes and columns are samples. The data are counts from an RNA-seq experiment.
Suppose you want to run a differential gene expression analysis comparing the Ctrl* and Sample* conditions. First you filter the raw count matrix on (cpm>1) > n (where n is a number of samples you choose), using edgeR for example. After this filter you have the data matrix I showed above. Then you apply glmQLFTest (after setting up the design etc.) and you get the logFC values.

Now my point is: suppose your boss doesn't want you to apply a more stringent (cpm>1) > n filter. How can you avoid high logFC values for genes that are poorly expressed, as in line 3 for SampleA* vs Ctrl*? Their logFC will be "comparable" in magnitude to the logFC of genes that are highly expressed versus 0 (line 1, for example).

Moreover, suppose that gene is highly expressed in SampleB*, so you cannot remove it without losing that information when you compare SampleB* vs Ctrl*. The logFC for SampleA* vs Ctrl* will be as high as the logFC for SampleB* vs Ctrl*, even though the two refer to genes with very different expression levels. How should I deal with this situation? I thought of treating the comparisons independently, i.e. considering different sets of genes when comparing SampleA* vs Ctrl* and SampleB* vs Ctrl*, but I'm not sure that is correct.
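To make the "independent comparisons" idea concrete, here is a rough sketch in plain Python (not edgeR code). It applies the cpm filter once on all samples and once per comparison, using only the three rows shown above. The library sizes are not given in this post, so the value of one million reads per sample is purely an assumption for illustration, as are the helper names `cpm`, `keep_genes` and `columns_for`, the threshold n = 2, and reading the rule as "cpm > 1 in at least n samples".

```python
# Counts copied from the three rows shown above (rows = genes, columns = samples).
SAMPLES = ["SampleA1", "SampleA2", "SampleA3", "Ctrl1", "Ctrl2",
           "SampleB1", "SampleB2", "SampleB3"]
COUNTS = [
    [234, 1, 32, 5, 2, 0, 21, 12344],
    [2434, 134, 0, 2, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 1234, 456, 345],
]

# ASSUMPTION: every library has exactly 1e6 reads. The real library sizes are
# not part of the post; this value is made up so that cpm can be computed.
LIB_SIZES = {name: 1_000_000 for name in SAMPLES}


def cpm(count, lib_size):
    """Counts per million for a single count and library size."""
    return count / lib_size * 1e6


def keep_genes(counts, columns, min_cpm=1.0, min_samples=2):
    """Return one boolean per gene: True if cpm > min_cpm in at least
    min_samples of the given columns (assumed reading of the filter rule)."""
    idx = [SAMPLES.index(name) for name in columns]
    keep = []
    for row in counts:
        n_pass = sum(cpm(row[i], LIB_SIZES[SAMPLES[i]]) > min_cpm for i in idx)
        keep.append(n_pass >= min_samples)
    return keep


def columns_for(*prefixes):
    """Select sample names starting with any of the given prefixes."""
    return [name for name in SAMPLES if name.startswith(prefixes)]


# One filter on all samples versus one filter per comparison.
keep_all = keep_genes(COUNTS, SAMPLES)
keep_a_vs_ctrl = keep_genes(COUNTS, columns_for("SampleA", "Ctrl"))
keep_b_vs_ctrl = keep_genes(COUNTS, columns_for("SampleB", "Ctrl"))

print("all samples:     ", keep_all)
print("SampleA vs Ctrl: ", keep_a_vs_ctrl)
print("SampleB vs Ctrl: ", keep_b_vs_ctrl)
```

Under these assumptions the gene in line 3 passes the filter on all samples and the SampleB* vs Ctrl* filter, but not the SampleA* vs Ctrl* one, which is exactly the case I am unsure about.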
Can anyone help me please?