Reading in plain text gene expression and phenotypic data files to limma
1
1
Entering edit mode
5.6 years ago
StephanieK ▴ 110

Hi all.

I have 10 gene expression data sets (i.e. gene expression data:.CEL files and phenotype data: age of sample; either as a category "old" or "young", or the numerical age of the sample) that I downloaded from GEO. The 10 data sets are different organisms (i.e. mouse, rat and human) and are all Affymetrix single-channel (although different platforms, e.g. GPL85, GPL96). The aim is to do a differential expression analysis of the genes with age per data set (and then I will do a vote-counting method to look at generally which genes are differentially expressed overall between the data sets). I have been looking at using the limma package for this (see here).

My question: I think my data is slightly different, and I'm not sure how to input my data into the algorithm, and I can't seem to find an example that has my particular type of data.

First, for each data set, I used RMA on each data set separately:

library(affy)
eset <-rma(data)
write.exprs(eset,file='Dataset1.output')


Then, because I wanted to do a gene expression, and not probe expression analysis (and yes, I know the advantages of doing probe-based, I definitely want gene-based), I added in a "Gene" column that I obtained from the "_full.soft" files from GEO and aggregated the data per gene:

options(max.print=1000000000)
sink('Dataset1.Gene.Output')
write.table(aggregate(data[, -c(1,2)],by = list(Gene = data\$Gene),FUN = mean,na.rm = TRUE),sep='    ',row.names=FALSE,quote=FALSE)
sink()


So now, I have a standard text file with gene expression data (file called GeneExpression.txt) with the expression levels per gene, instead of per probe, e.g:

Gene    Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
Gene1   8.06    7.84    7.62    7.73    7.85    7.94    7.82    7.84
Gene2   9.00    9.06    9.02    9.04    9.05    9.05    9.09    9.16
Gene3   4.8     4.81    4.98    4.78    4.79    5.02    4.91    4.99
Gene4   7.00    7.18    7.14     7.0    7.07    7.12    7.16    7.07
Gene5   6.46    6.40    6.34    6.49    6.46    6.46    6.45    6.35


And I have a phenotype file like this (File called PhenotypeFile.txt):

SampleID Age
Sample1 14
Sample2 14
Sample3 14
Sample4 14
Sample5 2
Sample6 2
Sample7 2
Sample8 2


So I want to know, for each gene, is there a significant difference in gene expression between the set of young samples (i.e. those individuals that are 2 months old compared to 14 months old in this example, in other data sets, I want to know is there a difference between a range of ages).

The problem: From the workflow in bioconductor, it seems to read in .CEL files and then will give me back differentially expressed probes? Because I want gene-level, and not probe-level analysis, I can't read it in the .CEL files (As I have combined the probes in the .CEL file to genes). So I don't understand how to read in the gene expression data to limma in a "plain text" format the way I have here, where I have already processed it with RMA and converted to gene data myself.Similarly, I want to read in the phenotype for each sample as a plain text file to limma. Then tell limma "This is the gene expression data and samples, this is the phenotype data and samples, this is the categories of samples that I want to compare, is there are genes differentially expressed?"

Would someone know of sample code they could show me, where I could read in my own gene expression data from a text file and not a .CEL file, and phenotypic data from a text file, and then perform a differential expression analysis (obviously don't worry about the parameters of the differential expression analysis e.g. what model to use, if I could just get the skeleton of the code I should use to perform a differential expression analysis using this type of data, I'll be able to understand and see what direction I need to read more in etc myself).

I have found loads of general limma tutorials, but just not specifically how I would do this, or if this is possible.

limma r affy gene expression differential • 3.1k views
0
Entering edit mode
5.6 years ago
ddiez ★ 1.9k

If you want to obtain gene level expression levels I would highly recommend you to use the custom CDF files found at the brainarray site. Basically each probe is mapped to the latest genome assembly and probes mapping to the same genomic unit are assigned to the same probe set. For example, if you use the CDFs that map to Entrez Gene ids, you will get gene expression estimates at the gene level. You can also map to Ensembl, at transcript or exon level as well. All the options for the latest release can be seen here. This method has (at least) two advantages:

1. Probes in a probe set map to the same gene (or whatever criteria used for defining probe sets), unlike in the original mapping.
2. You get probe sets that map directly to your unit of interest (in your case, gene level).

1. Unlike with official mappings, probe sets have different number of probes, so expression level estimates have different precisions.
0
Entering edit mode

Thank you for the reply, it is appreciated. I will definitely look into this, but first I just want to try and get an example working, that then I can tinker the details. I attempted to combine the information here and here here to come up with two solutions, but I am running into two problems.

Method 1: I have a phenotype file like this (file called "PhenotypicData"):

id,Age
GSM1330619,14
GSM1330620,14
GSM1330621,14
GSM1330622,14
GSM1330623,2
GSM1330624,2
GSM1330625,2
GSM1330626,2


and a set of .CEL.gz files in the current working directory

I was attempting to run a script:

library(Biobase)
library(affy)
library(limma)
pdata <-AnnotatedDataFrame(tmp)
files <-list.files(pattern="*.gz")  #I am doing this because for the justRMA command, I was getting an error saying I wasn't reading in the list of files properly, so I'm trying to make a list of files.
data_list <-lapply(files,read.table) #So make a list of the ".gz" files
#eset <- justRMA(data_list,phenoData=pdata) #And read this list into RMA
#design <-model.matrix(~Age,pdata(eset))
#fit <-lmFit(eset,Age)
#efit <-eBayes(fit)
#topTable(efit,coef=2)


When I run this script, I get: There were 48 warnings (use warnings() to see them)

When I add in an extra line to the script:

library(Biobase)
library(affy)
library(limma)
pdata <-AnnotatedDataFrame(tmp)
files <-list.files(pattern="*.gz")
eset <- justRMA(data_list,phenoData=pdata)
#design <-model.matrix(~Age,pdata(eset))
#fit <-lmFit(eset,Age)
#efit <-eBayes(fit)
#topTable(efit,coef=2)


I get the output: Error: the following are not valid files: /path/to/directory/data_list

When I edit the script:

library(Biobase)
library(affy)
library(limma)
eset <- justRMA("cel_files",phenoData=phenoData) #cel_files is a sub directory containing only .gz files
#design <-model.matrix(~Age,pdata(eset))
#fit <-lmFit(eset,Age)
#efit <-eBayes(fit)
#topTable(efit,coef=2)


...and various combinations of the justRMA (e.g. changing "cel_files" to "cel_files/*gz" line to try to tell the script where the .gz files are), I just keep getting:

Error: the following are not valid files: /path_to_files/cel_files/*.gz Execution halted

Method 2: I then tried an approach as in the second link described above.

Example of my gene-based expression data ("GeneExpressionData"):

symbol GSM1330623   GSM1330624  GSM1330625  GSM1330626  GSM1330619  GSM1330620  GSM1330621  GSM1330622
0610005C13Rik   4.15    4.12    4.29    4.44    4.1 4.08    4.49    4.2
0610006L08Rik   3.86    3.8 4.27    3.85    3.96    4   4.1 4.27
0610007P14Rik   9.18    9.14    9.57    9.48    9.37    9.5 9.09    9.63
0610009B22Rik   9.15    9.26    9.41    9.21    9.48    9.47    9.18    9.48
0610009L18Rik   7.76    7.49    7.15    7.16    6.88    7.22    6.92    7.11
0610009O20Rik   6.84    6.87    7.005   6.85    6.83    6.86    6.84    6.865


Example of my feature data ("FeatureData"):

id,symbol
1415670_at,Copg1
1415671_at,Atp6v0d1
1415672_at,Golga7
1415673_at,Psph
1415674_a_at,Trappc4
1415675_at,Dpm2


My phenotype data ("PhenotypeData"):

id,Age
GSM1330619,14
GSM1330620,14
GSM1330621,14
GSM1330622,14
GSM1330623,2
GSM1330624,2
GSM1330625,2
GSM1330626,2


I ran this code:

library(Biobase)
library(affy)

pdata <-AnnotatedDataFrame(tmp)

m <-as.matrix(tmp)

fdata <-AnnotatedDataFrame(tmp)

eset <-new("ExpressionSet",exprs=m, phenoData=pdata,featureData=fdata)


The error is:

    Error in validObject(.Object) :
Error in featureNames<-(*tmp*, value = c("1415670_at", "1415671_at",  :
'value' length (45101) must equal feature number in AssayData (26731)
Calls: new ... .nextMethod -> .local -> featureNames<- -> featureNames<-


I understand that the error is because in my GeneExpression data has data PER GENE, whereas my feature data are data PER PROBE. I just don't know how to fix this error. I also tried to run this analysis without providing feature data (i.e. probe to gene), but I got an error.

Would you know what's going on here to make the above code run? Many thanks again for your time.