Hello, I'm not sure whether anybody has addressed this issue here before. I'm using a custom CDF to analyze gene expression differences between two phenotypes in 3 individual tumor datasets (hgu133a, raw .CEL files). Preprocessing is done with rma().

I have an interesting observation about the fold changes between these two phenotypes. Take ESR1 expression as an example. At an FDR of 0.05, if I use the custom CDF file, this gene (probeset 2099_at, ENTREZG) is differentially expressed between the two phenotypes (using the limma package in Bioconductor). If I use the original Affymetrix CDF file, where 9 probesets are mapped to ESR1, two of the probesets (205225_at and 211235_s_at) are differentially expressed between the two phenotypes.

The part I am not sure how to interpret is the log2 fold changes between the phenotypes. From the literature I learned that ESR1 shows large fold changes between these phenotypes. But in the log2 fold changes listed below, the probeset from the custom CDF shows a much lower fold change than probeset 205225_at from the Affy CDF. I understand that fold changes in microarray experiments are not as accurate as those from qPCR, but the difference in scale confuses me. Which fold change should I trust? I'd like to hear your take on this issue. Thanks a lot!
NetAffy
- affyId dataset1 dataset2 dataset3
- 205225_at 4.040 3.580 4.130
- 211233_x_at 0.835 0.437 0.691
- 211234_x_at 0.656 0.378 0.517
- 211235_s_at 1.034 0.580 1.013
- 211627_x_at -0.008 0.069 0.012
- 215551_at 0.108 0.086 0.040
- 215552_s_at 0.802 0.582 0.989
- 217190_x_at 0.001 0.044 0.003
- 217163_at 0.293 0.209 0.301
Custom CDF
- probesetId dataset1 dataset2 dataset3
- 2099_at 1.015 0.646 1.123
Thanks Daniel! The log2 fold change of 2099_at from the custom CDF is actually similar to that of 211235_s_at from NetAffy, as you can see. I'll take a closer look at the probes as you suggested. Thanks again! -Lei
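As a rough check of that comparison, here is a minimal Python sketch that takes the log2 fold changes posted above, converts them to linear fold changes, and ranks the NetAffy probesets by how close their three-dataset profile is to 2099_at. The Euclidean distance is only an illustrative choice of similarity measure, not something used in the thread, and the helper names are made up for this example.

```python
import math

# log2 fold changes copied from the tables above (dataset1, dataset2, dataset3)
affy = {
    "205225_at":   (4.040, 3.580, 4.130),
    "211233_x_at": (0.835, 0.437, 0.691),
    "211234_x_at": (0.656, 0.378, 0.517),
    "211235_s_at": (1.034, 0.580, 1.013),
    "211627_x_at": (-0.008, 0.069, 0.012),
    "215551_at":   (0.108, 0.086, 0.040),
    "215552_s_at": (0.802, 0.582, 0.989),
    "217190_x_at": (0.001, 0.044, 0.003),
    "217163_at":   (0.293, 0.209, 0.301),
}
custom = {"2099_at": (1.015, 0.646, 1.123)}


def linear_fold(log2_fc):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2_fc


def distance(a, b):
    """Euclidean distance between two log2 fold change profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


target = custom["2099_at"]

# Rank the Affymetrix probesets from most to least similar to 2099_at
ranked = sorted(affy, key=lambda pid: distance(affy[pid], target))
closest = ranked[0]

for pid in ranked:
    folds = ", ".join(f"{linear_fold(v):6.2f}" for v in affy[pid])
    print(f"{pid:12s} dist={distance(affy[pid], target):.3f}  linear fold: {folds}")
```

With the posted numbers, 211235_s_at comes out as the closest profile to 2099_at, and a log2 fold change of 4.04 corresponds to roughly 16-fold on the linear scale, against roughly 2-fold for 1.015.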
Following Daniel's suggestion, I extracted the probe sequences for 205225_at from NetAffy and the probe sequences for 2099_at from the custom CDF file, then ran BLAT on them against hg19. They all map 100% to the genomic sequence of ESR1. The 11 probes of 205225_at map to the very 3' end, while the 46 probes of 2099_at (which include those 11 probes) are spread across the genomic region of the gene. The latter may better represent the gene as a whole.
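One possible reason a 46-probe set can report a smaller fold change than the 11-probe set it contains is that a robust summary follows the majority of the probes. The sketch below illustrates this under loud assumptions: the per-probe log2 fold changes are hypothetical values invented for the illustration (they were not measured in this thread), and a plain median stands in for RMA's median-polish summarization, which actually works on log2 intensities across probes and arrays.

```python
from statistics import median

# HYPOTHETICAL per-probe log2 fold changes, invented for illustration only.
# Assumption: the 11 probes at the 3' end respond strongly, while the other
# 35 probes spread along the gene respond weakly.
three_prime_probes = [4.0] * 11
upstream_probes = [0.6] * 35

# The custom CDF probeset contains the 11 3' probes plus the 35 others.
custom_probes = three_prime_probes + upstream_probes


def summarize(probe_log2_fc):
    """Simplified stand-in for a robust probeset summary: the plain median."""
    return median(probe_log2_fc)


affy_like = summarize(three_prime_probes)   # 11-probe, 3'-only probeset
custom_like = summarize(custom_probes)      # 46-probe, gene-wide probeset

print(f"11 probes at 3' end : log2 FC = {affy_like:.2f}")
print(f"46 probes gene-wide : log2 FC = {custom_like:.2f}")
```

In this toy setting the gene-wide summary sits at the level of the 35 weakly responding probes, because the 11 strongly responding probes are a minority. Whether that is what happens for ESR1 would have to be checked against the real per-probe intensities.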
Of course, having the probes spread over the length of the gene, rather than concentrated at the 3' end, might leave you at the mercy of a) splice variants and b) RNA degradation, both of which might skew the results. A quick glance suggests ESR1 has all kinds of alternatively spliced isoforms and a number of different promoters, so are you sure which isoforms you're picking up across the length of the gene with those probesets?