Question: Problems Filtering Probes From Series Matrix File With Bioconductor
0
gravatar for moranr
6.2 years ago by
moranr250
Ireland
moranr250 wrote:

Hi,

I have a filtering function: First Filter out probes that are showing little expression above noise - log2(100) is arbitrary number we decide equivilant to 6.7/7 expression in 5% of a the sample set

filter1<- pOverA(0.05, log2(100))

Filter by interquartile range- gets rid of non variable/uninteresting probes- variable over 1.5, x=data

filter2<- function(x)(IQR(x)>1.5)

combine into a single function

ff<- filterfun(filter1,filter2)

I call it like so :

gse14814_probes <- genefilter(gse14814.gcrma,ff)

where gse14814.gcrma is an expression set created using ReadAffy() with CEL files and normalised with gcrma().

The function works for the above expression set, however when I use an expression set that was downloaded as a GEO series matrix file using :

gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)

The same function does not work, as nothing is filterd. No warnings or anything come up, just nothing is filtered.

Any ideas on what my problem is ? Thanks in advance

R

ADD COMMENTlink modified 6.2 years ago by Sean Davis25k • written 6.2 years ago by moranr250
2
gravatar for Sean Davis
6.2 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

getGEO() returns a list of ExpressionSets. So, this should probably do the trick:

gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)[[1]]
gse4573_probes <- genefilter(gse4573,ff)
# no filtering occurs because the constants used in the original
# filter function do not make sense for linear-scale data 
sum(gse4573_probes)
[1] 22282

Log-transform the data and things work as expected. Note that the code has not changed, but the data must be massaged to work with the code as it was written.

exprs(gse4573)=log2(exprs(gse4573))
gse4573_probes <- genefilter(gse4573,ff)
# Only difference is that the data were log-transformed first.
sum(gse4573_probes)
[1] 2507

So, 2507 probesets survive the cutoff. To get the associated filtered ExpressionSet:

gsefiltered = gse4573[gse4573_probes,]
ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Sean Davis25k

The above doesnt work unfortunately. I had considered applying a new function to loop through each gene in the matrix and test the conditions, however not entirely sure how to do this, Im quite new to this.

however something like this ? j = 0 for(i in numberofgenes){ overthreshold <- expessionmatrix[,currentgene] > log2(100) if IQR(expessionmatrix[,currentgene]) > 1.5 { if(sum(overthreshold) > 5) { retaingenes[i ] <- currentgene j = j +1 } } }

I dont no how to get 1: number of genes from the matrix, 2: current gene form the matrix,

ADD REPLYlink written 6.2 years ago by moranr250
1

I'm guessing you are applying your ff to the untransformed gse4573 ExpressionSet. Note that the expression values are NOT log-transformed, so the constants in your filter function will not filter anything. I edited the code above to include the log transformation step and you'll note that things work just fine.

ADD REPLYlink written 6.2 years ago by Sean Davis25k

Thank you so much, thank you for explaining why my function was not working, it really helps me learn . Thank you !

ADD REPLYlink written 6.2 years ago by moranr250

You are quite welcome!

ADD REPLYlink written 6.2 years ago by Sean Davis25k

Sorry Sean just abnother Q if you have time. I applied the filter function to 5 data sets, 3 are plus1 arrays- 22,000 probes, 2 are plus 2 arrays- 55,000 probes, On one of the datasets for the plus 2 arrays, 16,000 probes were returned. where as the other plus 2 only returned 4,000.The plus 1 arrays returned around 2000 probes. So should the plus two arrays not return around 4,000, as the are just over double the plus 1 arrays? I think 16,000 prbes returned under both of my conditions is a lot, in comparison to the other datasets? what do you think ?

ADD REPLYlink written 6.2 years ago by moranr250
1

There is no "right" answer to the question of how much to filter, at least that I know of. As to your question, using the same filter functions on two different datasets does not necessarily lead to proportional filtering due to differences between labs, normalization conditions, and experimental setup. You'll need to determine for yourself what the best filtering strategy is for each dataset.

ADD REPLYlink written 6.2 years ago by Sean Davis25k
1
gravatar for Ying W
6.2 years ago by
Ying W3.9k
South San Francisco, CA
Ying W3.9k wrote:

I haven't used getGEO() before but I'm guessing gse4573 and gse14814.gcrma contain different things. genefilter() wants a matrix or eSet as the first argument.

Could you post some info such as the genefilter() command you run with gse4573 and also what gse4573 contains? (ie. class(gse4573), names(gse4573), head(gse4573)) as well as sessionInfo() output

ADD COMMENTlink modified 6.2 years ago by Istvan Albert ♦♦ 80k • written 6.2 years ago by Ying W3.9k

I use the exact same genefilter() commmand as above only I put in the gse4573 rather than gse14814.gcrma. I am working from my own laptop today which uses the R64 terminal/environment, where i normally use terminal. I got a warning back this time: Error in apply(expr, 1, flist) : dim(X) must have a positive length, any ide what this means ?

ADD REPLYlink written 6.2 years ago by moranr250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1522 users visited in the last hour