Question

How to correctly filer out gene expression matrix by measuring Pearson correlation coefficients?

0

Entering edit mode

4.9 years ago

Jurat Shahidin ▴ 100

Hi

I found @Kevin Blighe answer for finding Pearson correlation coefficients between a list of genes, and want to learn this solution on my problem. Basically, I have Affymetrix gene level expression matrix (genes in the rows and sample ID on the columns), and I have annotation data of microarray experiment observation where sample ID in the rows and description identification on the columns.

example gene expression data

here is gene level expression matix:

> exprs_mat[1:4, 1:3]
        Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at           6.062215        6.125023        5.875502
10_at          3.796484        3.805305        3.450245
100_at         5.849338        6.191562        6.550525
1000_at        3.567779        3.452524        3.316134

and here is annotation data which contain experiment observation:

> head(ano)
                       SampleID   GA Batch     Set Train Platform
Tarca_001_P1A01 Tarca_001_P1A01 11.0     1 PRB_HTA     1    HTA20
Tarca_013_P1B01 Tarca_013_P1B01 15.3     1 PRB_HTA     1    HTA20
Tarca_025_P1C01 Tarca_025_P1C01 21.7     1 PRB_HTA     1    HTA20
Tarca_037_P1D01 Tarca_037_P1D01 26.7     1 PRB_HTA     1    HTA20
Tarca_049_P1E01 Tarca_049_P1E01 31.3     1 PRB_HTA     1    HTA20
Tarca_061_P1F01 Tarca_061_P1F01 32.1     1 PRB_HTA     1    HTA20

I intend to see how the genes in each sample are correlated with GA value of corresponding samples in the annotation data. How can I get my expected correlation matrix and filter out the genes by correlation value? any idea to make this happen correctly?

my attemp by using limma:

Here is what I tried by using limma:

library(limma)
fit <- limma::lmFit(exprs_mat, design = model.matrix( ~ t(ano$GA))
fit <- eBayes(fit)
topTable(fit, coef=2)

but I got a dimension error and I need to have a proper design matrix for this. can anyone point me out any idea to do this?

desired output:

How can I get a sub (reduced dimension) gene-expression matrix of only these highly correlated genes? How to make the above implementation more efficient in R? any way to make this happen? Any thoughts?

desired output format:

I intend to filter out genes in the gene expression matrix expr_mat by using correlation value (only keep highly correlated genes), expected sub expression matrix should have same data structure as expr_mat. any idea to make this happen? any thought?

gene-expression limma microarray • 2.5k views

ADD COMMENT • link 4.8 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

above script didn't return the correct output

Please describe the output you get and how it's different from the expected output.

Also, your code has final_df <- as.data.frame(), which will cause an error no matter what else is going on because as.data.frame() needs at least one parameter - the dataset to convert to a data.frame.

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

Thanks for your reply. I intend to filter out genes in the gene expression matrix expr_mat by using correlation value (only keep highly correlated genes), and expected sub expression matrix should have same data structure as expr_mat. Do you have a possible idea to make this happen? Thank you

I also found this biostars post as useful but my correlation matrix is not correct. Any thought? Thanks

ADD REPLY • link 4.9 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

Take a look at the heatmap(cor(X)) and it should be square, symmetrical. Picking genes is tough because they are all strong on the diagonal. I bet every gene will be correlated to some part of the dataset.

ADD REPLY • link 4.8 years ago by karl.stamm 4.1k

score 2 · Answer 1 · 2019-06-20

2

Entering edit mode

4.9 years ago

Asaf 10k

Your go to should be limma. You can fit a linear model taking GA as a covariate. Kevin answered on a technical note, he never mentioned it is a good idea to do that.

ADD COMMENT • link 4.9 years ago by Asaf 10k

0

Entering edit mode

That's not what I am trying to do at that moment. Do you have better idea for what I asked in my post? Thanks

ADD REPLY • link 4.9 years ago by Jurat Shahidin ▴ 100

2

Entering edit mode

I intend to see how the genes in each sample are correlated with GA value of corresponding samples in the annotation data.

If this is what you were trying to do then linear regression is equivalent to correlation. see here: https://lindeloev.github.io/tests-as-linear/

ADD REPLY • link 4.9 years ago by Asaf 10k

0

Entering edit mode

interesting point. What I am going to do is to get a reduced dimension of expression set for the training ML model. if I use limma linear model, how can I produce a sub expression matrix (set of filtered genes)? Could you elaborate your point with possible coding demonstration instead?

ADD REPLY • link 4.9 years ago by Jurat Shahidin ▴ 100

score 1 · Answer 2 · 2019-06-21

1

Entering edit mode

4.8 years ago

karl.stamm 4.1k

In your comment: "What I am going to do is to get a reduced dimension of expression set for the training ML model. if I use limma linear model, how can I produce a sub expression matrix (set of filtered genes)?"

I gather you're trying to find the genes that are highly correlated to each other. Just cor() on your expression matrix will calculate Pearson gene-gene similarity. It's not super informative in RNA-Seq, because you probably have orders of magnitude more dimensions than samples. Do a PCA and look at the loadings for the first dimension. The first principle component is going to be the linear combination of your genes that best represents the whole dataset.

R function prcomp() can calculate what you need. Take the highest N genes from the first few principle components. They will reconstitute most of your dataset. If the measures by each gene span orders of magnitude, Try to normalize your data first perhaps by log10() or a variance stabilizing transform like DESeq2::varianceStabilizingTransform. PCA is invariant with linear transforms but sensitive to dimension scale, so try to get your units of expression measure similarly scaled first.

Take a look at dimensionality reduction, there's lots of ways to reduce your gene set. I like the R package edgeR and DESeq2 to choose genes of interest to a phenotype. When you said "I intend to see how the genes in each sample are correlated with GA value of corresponding samples" that sounds like a more routine DEG or differential expression analysis. See transcriptome differential expression in bioconductor for this. https://www.bioconductor.org/packages/release/BiocViews.html#___GeneExpressionWorkflow

ADD COMMENT • link 4.8 years ago by karl.stamm 4.1k

0

Entering edit mode

@karl.stamm, @Asaf , @RamRS:

As you suggested, I tried limma::lmFit where I set ano$GA as a covariate, here is my new attempt:

library(limma)
fit <- limma::lmFit(exprs_mat, design = model.matrix( ~  ano$GA)
fit <- eBayes(fit)
topTable(fit, coef=2)

then I am expecting to get a correlation matrix from the above code, and would like to do following in order to get filtered sub expression matrix:

idx <- which( (abs(cor) > 0.8) & (upper.tri(cor)), arr.ind=TRUE)
idx <- unique(c(idx[, 1],idx[, 2])
correlated.genes <- matrix[idx, ]

but I still didn't get the right answer. I am confident about using limma approach but I couldn't figure out what went wrong above code again. Can anyone point me out how to make this work? Is there any efficient way to make this happen? any thought? thanks

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

Your topTable() should have the genes that correlate nicely to $GA. It's the matrix-sub-selection code that's broken. just cut down exprs_mat with the shortlist of genes, and you're done.

ADD REPLY • link 4.8 years ago by karl.stamm 4.1k

0

Entering edit mode

Your topTable() should have the genes that correlate nicely to $GA

I am trying to get these nicely correlated genes with $GA, but as I said, I got a dimension error. Could you elaborate your point if possible coding demonstration is possible?

cut down exprs_mat with the shortlist of genes

I don't understand this phrase, how can I do that exactly? Would it be possible to extend your comment with bit more details and sample code instead? Thanks in advance

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100

1

Entering edit mode

I'm not going to provide the code. That's your job. It's basic data manipulation, that is a pre-requisite to work in this field. You can search other questions on this site or stackoverflow for specifics on how to subset a matrix in R. Our answers here are strategic, I have provided you many new key words to learn that will help your overall project or research goal, but will not tell you the keystrokes required to open or save a text file.

ADD REPLY • link 4.8 years ago by karl.stamm 4.1k

0

Entering edit mode

Agreed with you. I'll figure it out by myself. Thanks for your help

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

Using 0+ in the model.matrix formula is unnecessary and incorrect. Why are you forcing the regression to go through zero? Just remove it and the remaining limma code will run fine.

ADD REPLY • link 4.8 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

thanks for your reply. I got this error after I remove 0+:

row dimension of design doesn't match column dimension of data object

why is that? how can I cut down exprs_mat with the shortlist of genes?

ADD REPLY • link 4.8 years ago by Jurat Shahidin ▴ 100

1

Entering edit mode

The error is telling you that your annotation matrix doesn't match with the expression matrix. You have to make sure that the rows of your annotation matrix corresponds to the columns of the expression matrix, which at present they don't. How can you expect to correlate GA with expression if you don't have GA and expression for the same set of samples?

ADD REPLY • link 4.8 years ago by Gordon Smyth ★ 7.0k