Question

How to replicate method from Brennand et al: clustering of microarray and RNA-seq data?

4

Entering edit mode

9.8 years ago

dvanic ▴ 250

I am trying to replicate the analysis conducted in Brennand et al to cluster microarray expression data (or, in my case, RNA-seq data) with data available in the Allen Brain Atlas. I'm afraid their methods description is not clear enough for me to understand what exactly they did and hence how to repeat it.

To quote the paper (which is open-source, btw):

Cross-platform comparisons of our hiPSC microarray and the AllenBrain Atlas microarrays were done by (1) ranking absolute gene expression for each microarray using Partek, (2) assigning a rank difference value for each gene using a MATLAB script, and (3) calculating Spearman Rank Correlation Coefficients for each microarray comparison in Microsoft Excel. Wilcoxon's rank-sum test was assessed if a category of interest (spatial, temporal or combined) had significantly higher Spearman correlations than the background of all pairwise correlations. No hard cutoffs of 'best matches' were used.

I have tried the following approach (using R):

Merge data frame with microarray data from ABA with my RNA-seq normalized count table. This gets rid of genes that are not interrogated with the microarray.
Rank data by absolute gene expression using the rank function (min for ties)
Use the cor(rankedJune, method="spearman") to calculate spearman rank correlation coefficients
Identify the Wilcoxon rank-sum test for each of the comparisons - log(pairwise.wilcox.test(correlationSpear2RelevantMelt$value, correlationSpear2RelevantMelt$variable, p.adjust = "bon")$p.value)

The outcome looks like complete bollocks. What am I doing wrong with my approach, and how do I replicate exactly what the authors have done on my data? (I've also tried doing a conventional spearman between my data and the ABA array, and the datasets (predictably) show that they are quite distinct.

Thanks in advance!

clustering R microarray RNA-Seq • 3.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by dvanic ▴ 250

Ram · Answer 1 · 2014-07-08

1

Entering edit mode

9.8 years ago

Michael 54k

Re-producing (or using) a specific method can be difficult, especially when it involves proprietary software, I would also recommend that you get the original data from that publication first and try to really re-produce their results. At the moment you are using a similar method to produce new results. ~~However, there is a specific error in your pipeline:~~

~~Rank data by absolute gene expression using the rank function (min for ties)~~
~~Use the cor(rankedJune, method="spearman") to calculate spearman rank correlation coefficients~~

Spearman's rank correlation is calculated from the original data values not their ranks. If you want to calculate Spearman's correlation from already computed ranks, ~~you have to~~ you can use Pearson's correlation coefficient on ranks (you can see that by comparing the equation for both), (but of course the ranks of the ranks are still the identical ranks). Also, if you calculate the ranks yourself, do not treat the ties other than the default using rank.

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Michael 54k

0

Entering edit mode

I'm more trying to understand how they did it:

(1) ranking absolute gene expression for each microarray using Partek
(2) assigning a rank difference value for each gene using a MATLAB script
(3) calculating Spearman Rank Correlation Coefficients for each microarray comparison in Microsoft Excel

Don't they already have a "rank" followed by a z-score like "rank difference" after step 2 on which they are doing Spearmans???

ADD REPLY • link updated 4.8 years ago by Ram 43k • written 9.8 years ago by dvanic ▴ 250

0

Entering edit mode

Possibly they are describing how they assign their rank difference in the supplementary, did they provide some source code? I think that step differs from what you were describing. If they look at rank difference, that's something new, it is not the same as doing ranks, but I haven't read the paper yet. Btw, I just checked, ofc using spearman on ranks gives the same result as using pearson on the same ranks, it just doesn't make a difference. So ignore my "you have to use Pearson", but you still can.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Michael 54k

0

Entering edit mode

Nope, no source code or additional details!!! That's why it's so frustrating!!!

The problem is that their methods description is very confusing to me, AND they make a claim based on this approach that is very much critical for interpreting the results, especially the biology of their system. I've got a similar system which I'm also trying to "map" to the ABA data, so it makes sense to use their approach, cite them and see what the outcome of the mapping is - but I just can't get it to work!

ADD REPLY • link 9.8 years ago by dvanic ▴ 250

0

Entering edit mode

I think there is not too much magic in their steps, but certainly it shows why using proprietary software does no good in science. You should really get the original data, they should have submitted them to some repository? If not write to the author.

I didn't find further information on the methods they used but:

Gene ranking by Partek: How could that ranking be different from using Biconductor/affy and using rma normalized absolute intensity values, if there is a difference than it's possibly partec's fault. (If you really need to, you could get a trial license for 14 days (should be feasible to do a comparison)).
rank difference: sounds to me like: rank.diff <- rank(x) - rank(y), or abs (rank(x) - rank(y)) with x and y sample intensity vectors, that they need a MATLAB script for that, well...
Spearman's rank correlation in Excel, well that's cor(rank.diff.matrix, method='spearman') I guess that the rank difference is meant as something like a "non-parametric log-ratio". Maybe they should have used the rankprod package in Bioconductor.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Michael 54k