Question: Problems Filtering Probes From Series Matrix File With Bioconductor
0
7.8 years ago by
moranr270
Ireland
moranr270 wrote:

Hi,

I have a filtering function: First Filter out probes that are showing little expression above noise - log2(100) is arbitrary number we decide equivilant to 6.7/7 expression in 5% of a the sample set

`filter1<- pOverA(0.05, log2(100))`

Filter by interquartile range- gets rid of non variable/uninteresting probes- variable over 1.5, x=data

`filter2<- function(x)(IQR(x)>1.5)`

combine into a single function

`ff<- filterfun(filter1,filter2)`

I call it like so :

`gse14814_probes <- genefilter(gse14814.gcrma,ff)`

where gse14814.gcrma is an expression set created using `ReadAffy()` with CEL files and normalised with `gcrma()`.

The function works for the above expression set, however when I use an expression set that was downloaded as a GEO series matrix file using :

`gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)`

The same function does not work, as nothing is filterd. No warnings or anything come up, just nothing is filtered.

Any ideas on what my problem is ? Thanks in advance

R

modified 7.8 years ago by Sean Davis26k • written 7.8 years ago by moranr270
2
7.8 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

getGEO() returns a list of ExpressionSets. So, this should probably do the trick:

``````gse4573<-getGEO(filename="GSE4573_series_matrix.txt", GSEMatrix=TRUE)[[1]]
gse4573_probes <- genefilter(gse4573,ff)
# no filtering occurs because the constants used in the original
# filter function do not make sense for linear-scale data
sum(gse4573_probes)
[1] 22282
``````

Log-transform the data and things work as expected. Note that the code has not changed, but the data must be massaged to work with the code as it was written.

``````exprs(gse4573)=log2(exprs(gse4573))
gse4573_probes <- genefilter(gse4573,ff)
# Only difference is that the data were log-transformed first.
sum(gse4573_probes)
[1] 2507
``````

So, 2507 probesets survive the cutoff. To get the associated filtered ExpressionSet:

``````gsefiltered = gse4573[gse4573_probes,]
``````
ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Sean Davis26k

The above doesnt work unfortunately. I had considered applying a new function to loop through each gene in the matrix and test the conditions, however not entirely sure how to do this, Im quite new to this.

however something like this ? j = 0 for(i in numberofgenes){ overthreshold <- expessionmatrix[,currentgene] > log2(100) if IQR(expessionmatrix[,currentgene]) > 1.5 { if(sum(overthreshold) > 5) { retaingenes[i ] <- currentgene j = j +1 } } }

I dont no how to get 1: number of genes from the matrix, 2: current gene form the matrix,

1

I'm guessing you are applying your ff to the untransformed gse4573 ExpressionSet. Note that the expression values are NOT log-transformed, so the constants in your filter function will not filter anything. I edited the code above to include the log transformation step and you'll note that things work just fine.

Thank you so much, thank you for explaining why my function was not working, it really helps me learn . Thank you !

You are quite welcome!

Sorry Sean just abnother Q if you have time. I applied the filter function to 5 data sets, 3 are plus1 arrays- 22,000 probes, 2 are plus 2 arrays- 55,000 probes, On one of the datasets for the plus 2 arrays, 16,000 probes were returned. where as the other plus 2 only returned 4,000.The plus 1 arrays returned around 2000 probes. So should the plus two arrays not return around 4,000, as the are just over double the plus 1 arrays? I think 16,000 prbes returned under both of my conditions is a lot, in comparison to the other datasets? what do you think ?

1

There is no "right" answer to the question of how much to filter, at least that I know of. As to your question, using the same filter functions on two different datasets does not necessarily lead to proportional filtering due to differences between labs, normalization conditions, and experimental setup. You'll need to determine for yourself what the best filtering strategy is for each dataset.

1
7.8 years ago by
Ying W4.0k
South San Francisco, CA
Ying W4.0k wrote:

I haven't used `getGEO()` before but I'm guessing `gse4573` and `gse14814.gcrma` contain different things. `genefilter()` wants a matrix or eSet as the first argument.

Could you post some info such as the `genefilter()` command you run with gse4573 and also what gse4573 contains? (ie. `class(gse4573)`, `names(gse4573)`, `head(gse4573))` as well as `sessionInfo()` output

ADD COMMENTlink modified 7.8 years ago by Istvan Albert ♦♦ 85k • written 7.8 years ago by Ying W4.0k

I use the exact same genefilter() commmand as above only I put in the gse4573 rather than gse14814.gcrma. I am working from my own laptop today which uses the R64 terminal/environment, where i normally use terminal. I got a warning back this time: Error in apply(expr, 1, flist) : dim(X) must have a positive length, any ide what this means ?