I have 10 gene expression data sets (i.e. gene expression data:.CEL files and phenotype data: age of sample; either as a category "old" or "young", or the numerical age of the sample) that I downloaded from GEO. The 10 data sets are different organisms (i.e. mouse, rat and human) and are all Affymetrix single-channel (although different platforms, e.g. GPL85, GPL96). The aim is to do a differential expression analysis of the genes with age per data set (and then I will do a vote-counting method to look at generally which genes are differentially expressed overall between the data sets). I have been looking at using the limma package for this (see here).
My question: I think my data is slightly different, and I'm not sure how to input my data into the algorithm, and I can't seem to find an example that has my particular type of data.
First, for each data set, I used RMA on each data set separately:
library(affy) data <-ReadAffy() eset <-rma(data) write.exprs(eset,file='Dataset1.output')
Then, because I wanted to do a gene expression, and not probe expression analysis (and yes, I know the advantages of doing probe-based, I definitely want gene-based), I added in a "Gene" column that I obtained from the "_full.soft" files from GEO and aggregated the data per gene:
options(max.print=1000000000) data <-read.table('Dataset1.WithGeneAddedIn',header=T) sink('Dataset1.Gene.Output') write.table(aggregate(data[, -c(1,2)],by = list(Gene = data$Gene),FUN = mean,na.rm = TRUE),sep=' ',row.names=FALSE,quote=FALSE) sink()
So now, I have a standard text file with gene expression data (file called GeneExpression.txt) with the expression levels per gene, instead of per probe, e.g:
Gene Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Gene1 8.06 7.84 7.62 7.73 7.85 7.94 7.82 7.84 Gene2 9.00 9.06 9.02 9.04 9.05 9.05 9.09 9.16 Gene3 4.8 4.81 4.98 4.78 4.79 5.02 4.91 4.99 Gene4 7.00 7.18 7.14 7.0 7.07 7.12 7.16 7.07 Gene5 6.46 6.40 6.34 6.49 6.46 6.46 6.45 6.35
And I have a phenotype file like this (File called PhenotypeFile.txt):
SampleID Age Sample1 14 Sample2 14 Sample3 14 Sample4 14 Sample5 2 Sample6 2 Sample7 2 Sample8 2
So I want to know, for each gene, is there a significant difference in gene expression between the set of young samples (i.e. those individuals that are 2 months old compared to 14 months old in this example, in other data sets, I want to know is there a difference between a range of ages).
The problem: From the workflow in bioconductor, it seems to read in .CEL files and then will give me back differentially expressed probes? Because I want gene-level, and not probe-level analysis, I can't read it in the .CEL files (As I have combined the probes in the .CEL file to genes). So I don't understand how to read in the gene expression data to limma in a "plain text" format the way I have here, where I have already processed it with RMA and converted to gene data myself.Similarly, I want to read in the phenotype for each sample as a plain text file to limma. Then tell limma "This is the gene expression data and samples, this is the phenotype data and samples, this is the categories of samples that I want to compare, is there are genes differentially expressed?"
Would someone know of sample code they could show me, where I could read in my own gene expression data from a text file and not a .CEL file, and phenotypic data from a text file, and then perform a differential expression analysis (obviously don't worry about the parameters of the differential expression analysis e.g. what model to use, if I could just get the skeleton of the code I should use to perform a differential expression analysis using this type of data, I'll be able to understand and see what direction I need to read more in etc myself).
I have found loads of general limma tutorials, but just not specifically how I would do this, or if this is possible.