Question: What is a good strategy for setting a threshold value when filtering a gene expression matrix?
Jurat Shahidin80 wrote:

I have an Affymetrix gene expression matrix that I want to filter by gene. I computed the correlation between the gene expression matrix and the target pheno data, then tried several thresholds to keep the highly correlated genes, but did not find one that worked well.

Is there an efficient way to select a threshold for gene filtering? Any ideas would be appreciated.

reproducible data:

I produced a reproducible example for the gene expression data and pheno data below:

```
persons_df <- data.frame(person1 = sample(1:20, 10, replace = FALSE),
                         person2 = as.factor(sample(10)),
                         person3 = sample(1:25, 10, replace = FALSE),
                         person4 = sample(1:30, 10, replace = FALSE),
                         person5 = as.factor(sample(10)),
                         person6 = as.factor(sample(10)))
row.names(persons_df) <- letters[1:10]
```

In `persons_df`, the features (a.k.a. genes) are in the rows and the persons are in the columns.

And here is the pheno metadata:

```
age_df <- data.frame(personID = colnames(persons_df),
                     age = sample(1:50, 6, replace = FALSE))
```

my objective:

I want to keep the features (a.k.a. genes, in the rows) that show a high correlation with `age` from `age_df`.

my solution for filtering:

```
corr_df = do.call(rbind,
                  apply(persons_df, 1, function(x){
                    temp = cor.test(age_df$age, as.numeric(x))
                    data.frame(t = temp$statistic, p = temp$p.value,
                               cor_coef = temp$estimate)
                  }))

indx <- which(abs(corr_df$p)>0.15 &(upper.tri(corr_df$cor_coef)), arr.ind = TRUE)
indx <- unique(c(indx[,1], indx[,2]))
corr_genes <- eset_HTA20[indx,]
```

But when I subset the original gene expression matrix, I am left with empty output. Is there a problem with my indexing and subsetting of the gene expression matrix? Can anyone point out my mistake, if there is one?

question:

What is the best strategy to keep highly correlated genes? How can I pick a decent threshold for filtering, such as the p-value alone, or both the t-value and the p-value? Can anyone guide me on how to set up a reasonable threshold for gene filtering? Thanks a lot.

filtering gene-expression • 750 views
modified 19 months ago • written 19 months ago by Jurat Shahidin80
GenoMax95k wrote:

I am not answering your question but leaving a general comment. Without looking at your actual data no one here may be able to reasonably answer your question. You appear to be proficient in R so this is a matter of a small error or perhaps there are no genes that fit the criteria you are filtering on (since you said you get an empty result).

Every analysis is subject to the peculiarities of the particular dataset, the ultimate aim of the experiment, and the ability/capacity to test the hypotheses that are being generated. It is easy to change a value, re-run a filter and add/subtract genes from a list. Having 5 more genes in the list may double/triple the amount of work an experimentalist may need to do to test them.

If you are doing this analysis for someone else, please talk with them about the results. If you are doing this for your own data, then start thinking about the downstream work that will be needed.

Thanks for your reply. I am experimenting with a public gene expression matrix dataset from here. Could you point me to any concrete strategies or approaches I could try for the gene filtering task? Do you think my way of indexing and subsetting the expression matrix is problematic? What is the proper way of setting a threshold (pick the p-value, the t-value, both, or the correlation coefficient)? Is there any feasible approach to do this? Thanks.


I think that you may want to look at this line... it does not seem to do what you are expecting it to do(?)

```
indx <- which(abs(corr_df$p)>0.15 &(upper.tri(corr_df$cor_coef)), arr.ind = TRUE)
```

Evaluate each part separately and you'll see. Also, with this, `abs(corr_df$p)`, you are taking absolute p-values > 0.15 - are you sure that you want to do that?
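As a rough illustration of what "evaluate each part separately" might look like, here is a minimal sketch. The small `corr_df` below is a hypothetical stand-in with made-up values, not the poster's real results; only the column names `p` and `cor_coef` are taken from the question.

```r
# Hypothetical stand-in for corr_df (made-up numbers, 5 genes).
corr_df <- data.frame(
  p        = c(0.04, 0.71, 0.33, 0.03, 0.85),
  cor_coef = c(0.82, -0.20, 0.48, -0.85, 0.10),
  row.names = letters[1:5]
)

# Part 1: TRUE for the genes whose p-value is ABOVE 0.15,
# i.e. the non-significant ones (abs() changes nothing here).
part1 <- abs(corr_df$p) > 0.15

# Part 2: upper.tri() treats the vector as a 5 x 1 matrix.
# A one-column matrix has no entries above the diagonal, so all FALSE.
part2 <- upper.tri(corr_df$cor_coef)

# Combining them: anything & FALSE is FALSE, so nothing is selected.
indx <- which(part1 & part2, arr.ind = TRUE)
indx <- unique(c(indx[, 1], indx[, 2]))
length(indx)  # 0, hence the empty subset
```

Printing `part1` and `part2` on their own should make it clear why the final subset comes back empty, whatever the data look like.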

Dear Kevin:

Thanks for your help. What I want to do is keep the genes that have a high correlation with `age`. Do you have any corrections to my correlation analysis, indexing, and subsetting of the original gene expression matrix? Your guidance would be much appreciated.


Well, just take a look at that line to which I pointed you... it is not doing what you want to do.

Just on the first part of it (`abs(corr_df$p)>0.15`), there is no need to obtain the absolute value of a p-value because one can never have a negative p-value. Also, by filtering on p-value > 0.15, you are keeping the correlations that are not statistically significant.

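The second part of it, `upper.tri()`, is a function usually applied to a data matrix of, e.g., correlation values, and not to a summary table of stats values, which is how you are using it (?). Applied to a single column, it returns all `FALSE`, so the combined condition is never true and `indx` ends up empty.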

To get the statistically significant correlates, you just need to do:

```
which(corr_df$p <= 0.05)
```

Otherwise good coding, as pointed out by Genomax.
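To tie the suggestion above back to the subsetting step, here is a minimal end-to-end sketch under some assumptions: the toy matrix `expr_mat`, the `age` vector, and the planted gene `a` are all hypothetical and only there so the example runs on its own. It is one possible way to do it, not the only one.

```r
set.seed(1)  # assumption: fixed seed so the toy data is reproducible

# Toy expression matrix: 10 genes (rows) x 6 persons (columns), all numeric.
expr_mat <- matrix(rnorm(60), nrow = 10,
                   dimnames = list(letters[1:10], paste0("person", 1:6)))
age <- c(23, 31, 38, 45, 52, 60)  # hypothetical ages, one per person

# Plant one gene that tracks age, so at least one gene passes the filter.
expr_mat["a", ] <- age / 10 + rnorm(6, sd = 0.1)

# One cor.test() per gene; the result is a genes x 3 data frame.
corr_df <- as.data.frame(t(apply(expr_mat, 1, function(x) {
  res <- cor.test(age, x)
  c(t = unname(res$statistic),
    p = res$p.value,
    cor_coef = unname(res$estimate))
})))

# Keep the statistically significant correlates and subset by gene name.
keep <- rownames(corr_df)[corr_df$p <= 0.05]
corr_genes <- expr_mat[keep, , drop = FALSE]
```

Subsetting by row name rather than by position is a deliberate choice here: it keeps working even if the stats table and the expression matrix are not in the same row order. `drop = FALSE` keeps the result a matrix when only one gene passes.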

Dear Kevin:

Thanks for your point. Could you give me your opinion on my correlation analysis above as a way of gene filtering? What do you think is a decent way to filter genes here (I used Affymetrix gene expression data from here)? What else can I try for gene filtering? Any ideas? Thank you again for your help.

There really are no standards.... no standards in anything in bioinformatics.


I do not know what your end goal is, so I cannot comment much further. However, we have addressed your question:

What is the best strategy to keep highly correlated genes? How can I pick a decent threshold for filtering, such as the p-value alone, or both the t-value and the p-value? Can anyone guide me on how to set up a reasonable threshold for gene filtering? Thanks a lot.

Just use p-value <= 0.05 to start off. These represent the statistically significant correlates.
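Since changing the cutoff adds or removes genes from the list, it may help to see how the list size moves before settling on a value. The sketch below uses a hypothetical `corr_df` with made-up numbers; the extra `abs(cor_coef) >= 0.85` criterion is only an illustrative assumption, not a recommended value.

```r
# Hypothetical per-gene results (made-up numbers, for illustration only).
corr_df <- data.frame(
  p        = c(0.004, 0.03, 0.04, 0.12, 0.20, 0.45, 0.60, 0.81),
  cor_coef = c(0.95, -0.86, 0.83, 0.70, -0.61, 0.38, -0.27, 0.12),
  row.names = paste0("gene", 1:8)
)

# How many genes survive at each candidate p-value cutoff?
cutoffs <- c(0.01, 0.05, 0.10, 0.15)
n_kept <- sapply(cutoffs, function(cut) sum(corr_df$p <= cut))
data.frame(p_cutoff = cutoffs, genes_kept = n_kept)  # 1, 3, 3, 4 genes

# Optional second criterion: significant AND strongly correlated,
# in either direction (the 0.85 value is an arbitrary example).
strong <- rownames(corr_df)[corr_df$p <= 0.05 &
                            abs(corr_df$cor_coef) >= 0.85]
strong  # "gene1" "gene2"
```

Which row of that table is acceptable depends on the end goal and on how much downstream work each extra gene implies.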