Extract Expression Data For Candidate Gene List In Microarray
2
7
Entering edit mode
13.2 years ago
Kerem ▴ 80

Hi all,

I'm struggling my way through beginner's Bioconductor (and R in general) and have a specific problem I hope somebody can help with. I can load CEL files, do RMA processing on them and check QC etc, eventually I end up with a data frame with these data.

I have a few lists of about 200 candidate genes each that are of specific interest to us and I would like to run a script to print the expression data for just those genes to a tab-delim txt file.

Can anybody point me in the right direction to do this. I do have the Affymetrix Probe Set IDs for these, so can use either that or gene symbol.

Apologies if this is a very basic question, I'm very new to this.

Thanks in advance, Kerem.

microarray bioconductor r • 12k views
ADD COMMENT
0
Entering edit mode

In addition to the answers you might find this document useful: http://cran.r-project.org/doc/manuals/R-data.pdf

ADD REPLY
7
Entering edit mode
13.2 years ago
Christof Winter ★ 1.0k

After RMA processing, you should end up with an ExpressionSet object. Let's assume it's called eset, and your probe set IDs are in a character vector called yourProbeSetIDs. Then:

exprs(eset)    # gives a dataframe with expression levels, probe set IDs are the row names
exprs(eset)[yourProbeSetIDs, ]  # gives you of subset dataframe with your probe sets of interest

To write a tab-delimited file:

write.table(yourDataFrame, file="filename.txt", sep="\t", quote=F, row.names=T)

Christof

ADD COMMENT
1
Entering edit mode

It does work. Have a look at indexing with character vectors: http://cran.r-project.org/doc/manuals/R-intro.html#Index-vectors

ADD REPLY
1
Entering edit mode
library(oligo)
geneCELs <- list.celfiles(path, full.names=TRUE)
affyGeneFS <- read.celfiles(geneCELs)
genePS <- rma(affyGeneFS, target="probeset")
featureData(genePS) <- getNetAffx(genePS, "probeset")

so I can do exprs(genePS) and get the probesets, however I want the original probes, before they are "summarised" to the probe-set level.

So something like:

exprs(affyGeneFS)
ADD REPLY
0
Entering edit mode

AFAIK, exprs(eset)[yourProbeSetIDs, ] does not work. It either needs to be probeIndices or a boolean array of length(exprs) between the square brackets.

ADD REPLY
0
Entering edit mode

Dear both, Firstly, massive thanks for your help, really appreciated. Eventually got exprs(eset)[yourProbeSetIDs, ] to work. My fault - needed to make the char vector correctly and include the trailing ", " or else it fails. Very useful for me.

However, I've tested with 2 probeIDs and the output file has a couple of problems. First, the header row is missing column title for probeID. Maybe hints I've imported the data wrongly? Second, I want to include gene symbols for the rows. I've got and installed the appropriate .db for my mouse gene st chips. Any pointers?

Thanks again, K.

ADD REPLY
0
Entering edit mode

Didnt know that point 4 existed, thanks :) @Kerem: you may either reset the headers with colnames()= or take the subset from you dataframe instead of the expression data which would (I assume) preserve the headers. For the 2nd, see the annaffy documentation.

ADD REPLY
0
Entering edit mode

For mouse, also have a look at the org.Mm.eg.db package.

ADD REPLY
0
Entering edit mode

Anyone know how to get the probe set ids for the expression matrix row names BEFORE summarisation (so like 14029281_1, 14029281_2 etc.). When I load the cel files into R, the original unnormalised expression matrix has row names from 1:number of probes.

ADD REPLY
0
Entering edit mode

Can you post your code how you load and process the CEL files?

ADD REPLY
0
Entering edit mode

Sorry, that's rather messy, shall I repost this as a new question?

ADD REPLY
0
Entering edit mode
ADD REPLY
5
Entering edit mode
13.2 years ago

Here are some pointers:

  • I assume you know how to get the expression values from your dataframe by using exprs(dataframe)
  • From the expression values, you can get a vector of gene identifiers for the whole array by using rownames() or colnames() depending on the orientation of your dataframe

Then you need to check if each ID matches one of your list of interest and finally extract the appropriate dataset. When you are new to programming in general you may want to do this with loops, otherwise there are shortcuts. You can try to following commands and make them fit for your situation:

all = c(1:5)
wanted = c(1,4,5)
x = all %in% wanted
all[x]

Hope that helps :)

ADD COMMENT
1
Entering edit mode

Thanks Michael, enjoying the programming and this will come in useful. After I read your reply I found https://wiki.brandeis.edu/twiki/bin/view/Bio/BioConductor which has a subsection about 2/3 of the way down the page titled "Filtering Genes" which seems to use similar principles to do this. Also, I've asked a different question in reply to the answer below about including gene symbols in output. Wonder if you have any pointers on this for me? Thanks, Kerem.

ADD REPLY

Login before adding your answer.

Traffic: 2606 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6