I need to compute gene co-expression for a compendium of GEO microarrays. I downloaded a number of GDS datasets corresponding to two GPL platforms and merged them into a single gene expression table. After log2 transforming them I obtained a lot of negative values. The negative values come from small expression values (between 0 and 1), which the log2 transformation turns into large negative outliers. These outliers distort every co-expression measure I have tried. The GDS datasets are supposed to be both background corrected and normalized, but I also performed quantile normalization to re-align the probe distributions across datasets. I still have too many negative values though.
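To make the quantile normalization step concrete, here is a minimal NumPy sketch. It assumes a genes x samples matrix with no missing values; the function name `quantile_normalize` and the toy matrix are illustrative only, not the exact code run on the GDS data. Ties are broken arbitrarily rather than averaged.

```python
import numpy as np

def quantile_normalize(x):
    """Force every column (sample) of x to share the same distribution.

    x: 2-D array, rows = probes/genes, columns = samples (assumed layout).
    """
    # Rank of each value within its own column (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    mean_sorted = np.sort(x, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return mean_sorted[ranks]

# Toy, made-up intensities: 4 probes x 3 samples.
x = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
qn = quantile_normalize(x)
```

After this step every column contains the same set of values, so any value below 1 in the reference distribution still becomes negative under log2.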
How would you recommend I proceed?
- Download the raw .CEL files and perform a uniform background correction/normalization across all of them? I have seen people say that this improves the overall quality, but I am not convinced. First, these operations are mostly performed to eliminate consistent noise due to specific experimental conditions. Second, negative values are already present in the GDS datasets after all the statistical processing, so what guarantees that I will not end up in the same situation, especially since I will use many different experiments?
- Add 1.0 to all expression values before log2 transforming them. This is my favored solution.
- Skip the log2 transformation altogether (why is it used anyway?). However, this would make the outliers even stronger.
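To show what the second option does to the problematic values, here is a small sketch comparing a plain log2 with log2 after adding 1.0. The intensities are made up for illustration and the variable names are hypothetical.

```python
import numpy as np

# Made-up, non-negative expression values, including two between 0 and 1.
expr = np.array([0.05, 0.5, 1.0, 8.0, 1024.0])

plain = np.log2(expr)          # values below 1 become negative
shifted = np.log2(expr + 1.0)  # pseudocount of 1.0: nothing drops below 0

n_negative_plain = int((plain < 0).sum())      # two values in this example
n_negative_shifted = int((shifted < 0).sum())  # none

print(np.round(plain, 2))
print(np.round(shifted, 2))
```

With the pseudocount, values in the 0 to 1 range map to the 0 to 1 range in log space instead of to large negative numbers, while large values are almost unchanged. This only holds if the input has no negative intensities to begin with.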