Question: Problems Filtering Probes From Series Matrix File With Bioconductor
gravatar for moranr
7.8 years ago by
moranr270 wrote:


I have a filtering function: First Filter out probes that are showing little expression above noise - log2(100) is arbitrary number we decide equivilant to 6.7/7 expression in 5% of a the sample set

filter1<- pOverA(0.05, log2(100))

Filter by interquartile range- gets rid of non variable/uninteresting probes- variable over 1.5, x=data

filter2<- function(x)(IQR(x)>1.5)

combine into a single function

ff<- filterfun(filter1,filter2)

I call it like so :

gse14814_probes <- genefilter(gse14814.gcrma,ff)

where gse14814.gcrma is an expression set created using ReadAffy() with CEL files and normalised with gcrma().

The function works for the above expression set, however when I use an expression set that was downloaded as a GEO series matrix file using :

gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)

The same function does not work, as nothing is filterd. No warnings or anything come up, just nothing is filtered.

Any ideas on what my problem is ? Thanks in advance


ADD COMMENTlink modified 7.8 years ago by Sean Davis26k • written 7.8 years ago by moranr270
gravatar for Sean Davis
7.8 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

getGEO() returns a list of ExpressionSets. So, this should probably do the trick:

gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)[[1]]
gse4573_probes <- genefilter(gse4573,ff)
# no filtering occurs because the constants used in the original
# filter function do not make sense for linear-scale data 
[1] 22282

Log-transform the data and things work as expected. Note that the code has not changed, but the data must be massaged to work with the code as it was written.

gse4573_probes <- genefilter(gse4573,ff)
# Only difference is that the data were log-transformed first.
[1] 2507

So, 2507 probesets survive the cutoff. To get the associated filtered ExpressionSet:

gsefiltered = gse4573[gse4573_probes,]
ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Sean Davis26k

The above doesnt work unfortunately. I had considered applying a new function to loop through each gene in the matrix and test the conditions, however not entirely sure how to do this, Im quite new to this.

however something like this ? j = 0 for(i in numberofgenes){ overthreshold <- expessionmatrix[,currentgene] > log2(100) if IQR(expessionmatrix[,currentgene]) > 1.5 { if(sum(overthreshold) > 5) { retaingenes[i ] <- currentgene j = j +1 } } }

I dont no how to get 1: number of genes from the matrix, 2: current gene form the matrix,

ADD REPLYlink written 7.8 years ago by moranr270

I'm guessing you are applying your ff to the untransformed gse4573 ExpressionSet. Note that the expression values are NOT log-transformed, so the constants in your filter function will not filter anything. I edited the code above to include the log transformation step and you'll note that things work just fine.

ADD REPLYlink written 7.8 years ago by Sean Davis26k

Thank you so much, thank you for explaining why my function was not working, it really helps me learn . Thank you !

ADD REPLYlink written 7.8 years ago by moranr270

You are quite welcome!

ADD REPLYlink written 7.8 years ago by Sean Davis26k

Sorry Sean just abnother Q if you have time. I applied the filter function to 5 data sets, 3 are plus1 arrays- 22,000 probes, 2 are plus 2 arrays- 55,000 probes, On one of the datasets for the plus 2 arrays, 16,000 probes were returned. where as the other plus 2 only returned 4,000.The plus 1 arrays returned around 2000 probes. So should the plus two arrays not return around 4,000, as the are just over double the plus 1 arrays? I think 16,000 prbes returned under both of my conditions is a lot, in comparison to the other datasets? what do you think ?

ADD REPLYlink written 7.8 years ago by moranr270

There is no "right" answer to the question of how much to filter, at least that I know of. As to your question, using the same filter functions on two different datasets does not necessarily lead to proportional filtering due to differences between labs, normalization conditions, and experimental setup. You'll need to determine for yourself what the best filtering strategy is for each dataset.

ADD REPLYlink written 7.8 years ago by Sean Davis26k
gravatar for Ying W
7.8 years ago by
Ying W4.0k
South San Francisco, CA
Ying W4.0k wrote:

I haven't used getGEO() before but I'm guessing gse4573 and gse14814.gcrma contain different things. genefilter() wants a matrix or eSet as the first argument.

Could you post some info such as the genefilter() command you run with gse4573 and also what gse4573 contains? (ie. class(gse4573), names(gse4573), head(gse4573)) as well as sessionInfo() output

ADD COMMENTlink modified 7.8 years ago by Istvan Albert ♦♦ 85k • written 7.8 years ago by Ying W4.0k

I use the exact same genefilter() commmand as above only I put in the gse4573 rather than gse14814.gcrma. I am working from my own laptop today which uses the R64 terminal/environment, where i normally use terminal. I got a warning back this time: Error in apply(expr, 1, flist) : dim(X) must have a positive length, any ide what this means ?

ADD REPLYlink written 7.8 years ago by moranr270
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1465 users visited in the last hour