Question: How to correctly filer out gene expression matrix by measuring Pearson correlation coefficients?
0
gravatar for Jurat Shahidin
4 months ago by
Chicago, IL, USA
Jurat Shahidin80 wrote:

Hi

I found @Kevin Blighe answer for finding Pearson correlation coefficients between a list of genes, and want to learn this solution on my problem. Basically, I have Affymetrix gene level expression matrix (genes in the rows and sample ID on the columns), and I have annotation data of microarray experiment observation where sample ID in the rows and description identification on the columns.

example gene expression data

here is gene level expression matix:

> exprs_mat[1:4, 1:3]
        Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at           6.062215        6.125023        5.875502
10_at          3.796484        3.805305        3.450245
100_at         5.849338        6.191562        6.550525
1000_at        3.567779        3.452524        3.316134

and here is annotation data which contain experiment observation:

> head(ano)
                       SampleID   GA Batch     Set Train Platform
Tarca_001_P1A01 Tarca_001_P1A01 11.0     1 PRB_HTA     1    HTA20
Tarca_013_P1B01 Tarca_013_P1B01 15.3     1 PRB_HTA     1    HTA20
Tarca_025_P1C01 Tarca_025_P1C01 21.7     1 PRB_HTA     1    HTA20
Tarca_037_P1D01 Tarca_037_P1D01 26.7     1 PRB_HTA     1    HTA20
Tarca_049_P1E01 Tarca_049_P1E01 31.3     1 PRB_HTA     1    HTA20
Tarca_061_P1F01 Tarca_061_P1F01 32.1     1 PRB_HTA     1    HTA20

I intend to see how the genes in each sample are correlated with GA value of corresponding samples in the annotation data. How can I get my expected correlation matrix and filter out the genes by correlation value? any idea to make this happen correctly?

my attemp by using limma:

Here is what I tried by using limma:

library(limma)
fit <- limma::lmFit(exprs_mat, design = model.matrix( ~ t(ano$GA))
fit <- eBayes(fit)
topTable(fit, coef=2)

but I got a dimension error and I need to have a proper design matrix for this. can anyone point me out any idea to do this?

desired output:

How can I get a sub (reduced dimension) gene-expression matrix of only these highly correlated genes? How to make the above implementation more efficient in R? any way to make this happen? Any thoughts?

desired output format:

I intend to filter out genes in the gene expression matrix expr_mat by using correlation value (only keep highly correlated genes), expected sub expression matrix should have same data structure as expr_mat. any idea to make this happen? any thought?

ADD COMMENTlink modified 3 months ago • written 4 months ago by Jurat Shahidin80

above script didn't return the correct output

Please describe the output you get and how it's different from the expected output.

Also, your code has final_df <- as.data.frame(), which will cause an error no matter what else is going on because as.data.frame() needs at least one parameter - the dataset to convert to a data.frame.

ADD REPLYlink written 4 months ago by RamRS24k

Thanks for your reply. I intend to filter out genes in the gene expression matrix expr_mat by using correlation value (only keep highly correlated genes), and expected sub expression matrix should have same data structure as expr_mat. Do you have a possible idea to make this happen? Thank you

I also found this biostars post as useful but my correlation matrix is not correct. Any thought? Thanks

ADD REPLYlink modified 4 months ago • written 4 months ago by Jurat Shahidin80

Take a look at the heatmap(cor(X)) and it should be square, symmetrical. Picking genes is tough because they are all strong on the diagonal. I bet every gene will be correlated to some part of the dataset.

ADD REPLYlink written 4 months ago by karl.stamm3.5k
2
gravatar for Asaf
4 months ago by
Asaf6.3k
Israel
Asaf6.3k wrote:

Your go to should be limma. You can fit a linear model taking GA as a covariate. Kevin answered on a technical note, he never mentioned it is a good idea to do that.

ADD COMMENTlink written 4 months ago by Asaf6.3k

That's not what I am trying to do at that moment. Do you have better idea for what I asked in my post? Thanks

ADD REPLYlink written 4 months ago by Jurat Shahidin80
2

I intend to see how the genes in each sample are correlated with GA value of corresponding samples in the annotation data.

If this is what you were trying to do then linear regression is equivalent to correlation. see here: https://lindeloev.github.io/tests-as-linear/

ADD REPLYlink modified 4 months ago • written 4 months ago by Asaf6.3k

interesting point. What I am going to do is to get a reduced dimension of expression set for the training ML model. if I use limma linear model, how can I produce a sub expression matrix (set of filtered genes)? Could you elaborate your point with possible coding demonstration instead?

ADD REPLYlink written 4 months ago by Jurat Shahidin80
1
gravatar for karl.stamm
4 months ago by
karl.stamm3.5k
United States
karl.stamm3.5k wrote:

In your comment: "What I am going to do is to get a reduced dimension of expression set for the training ML model. if I use limma linear model, how can I produce a sub expression matrix (set of filtered genes)?"

I gather you're trying to find the genes that are highly correlated to each other. Just cor() on your expression matrix will calculate Pearson gene-gene similarity. It's not super informative in RNA-Seq, because you probably have orders of magnitude more dimensions than samples. Do a PCA and look at the loadings for the first dimension. The first principle component is going to be the linear combination of your genes that best represents the whole dataset.

R function prcomp() can calculate what you need. Take the highest N genes from the first few principle components. They will reconstitute most of your dataset. If the measures by each gene span orders of magnitude, Try to normalize your data first perhaps by log10() or a variance stabilizing transform like DESeq2::varianceStabilizingTransform. PCA is invariant with linear transforms but sensitive to dimension scale, so try to get your units of expression measure similarly scaled first.

Take a look at dimensionality reduction, there's lots of ways to reduce your gene set. I like the R package edgeR and DESeq2 to choose genes of interest to a phenotype. When you said "I intend to see how the genes in each sample are correlated with GA value of corresponding samples" that sounds like a more routine DEG or differential expression analysis. See transcriptome differential expression in bioconductor for this. https://www.bioconductor.org/packages/release/BiocViews.html#___GeneExpressionWorkflow

ADD COMMENTlink written 4 months ago by karl.stamm3.5k

@karl.stamm, @Asaf , @RamRS:

As you suggested, I tried limma::lmFit where I set ano$GA as a covariate, here is my new attempt:

library(limma)
fit <- limma::lmFit(exprs_mat, design = model.matrix( ~  ano$GA)
fit <- eBayes(fit)
topTable(fit, coef=2)

then I am expecting to get a correlation matrix from the above code, and would like to do following in order to get filtered sub expression matrix:

idx <- which( (abs(cor) > 0.8) & (upper.tri(cor)), arr.ind=TRUE)
idx <- unique(c(idx[, 1],idx[, 2])
correlated.genes <- matrix[idx, ]

but I still didn't get the right answer. I am confident about using limma approach but I couldn't figure out what went wrong above code again. Can anyone point me out how to make this work? Is there any efficient way to make this happen? any thought? thanks

ADD REPLYlink modified 3 months ago • written 4 months ago by Jurat Shahidin80

Your topTable() should have the genes that correlate nicely to $GA. It's the matrix-sub-selection code that's broken. just cut down exprs_mat with the shortlist of genes, and you're done.

ADD REPLYlink written 3 months ago by karl.stamm3.5k

Your topTable() should have the genes that correlate nicely to $GA

I am trying to get these nicely correlated genes with $GA, but as I said, I got a dimension error. Could you elaborate your point if possible coding demonstration is possible?

cut down exprs_mat with the shortlist of genes

I don't understand this phrase, how can I do that exactly? Would it be possible to extend your comment with bit more details and sample code instead? Thanks in advance

ADD REPLYlink modified 3 months ago • written 3 months ago by Jurat Shahidin80
1

I'm not going to provide the code. That's your job. It's basic data manipulation, that is a pre-requisite to work in this field. You can search other questions on this site or stackoverflow for specifics on how to subset a matrix in R. Our answers here are strategic, I have provided you many new key words to learn that will help your overall project or research goal, but will not tell you the keystrokes required to open or save a text file.

ADD REPLYlink written 3 months ago by karl.stamm3.5k

Agreed with you. I'll figure it out by myself. Thanks for your help

ADD REPLYlink written 3 months ago by Jurat Shahidin80

Using 0+ in the model.matrix formula is unnecessary and incorrect. Why are you forcing the regression to go through zero? Just remove it and the remaining limma code will run fine.

ADD REPLYlink modified 3 months ago • written 3 months ago by Gordon Smyth1.0k

thanks for your reply. I got this error after I remove 0+:

row dimension of design doesn't match column dimension of data object

why is that? how can I cut down exprs_mat with the shortlist of genes?

ADD REPLYlink written 3 months ago by Jurat Shahidin80
1

The error is telling you that your annotation matrix doesn't match with the expression matrix. You have to make sure that the rows of your annotation matrix corresponds to the columns of the expression matrix, which at present they don't. How can you expect to correlate GA with expression if you don't have GA and expression for the same set of samples?

ADD REPLYlink modified 3 months ago • written 3 months ago by Gordon Smyth1.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2289 users visited in the last hour