3.7 years ago
Natasha ▴ 40
I am looking for a procedure to identify tissues that are functionally related. For instance, expression levels of genes are available for different tissues such as muscle or adipose tissue.
Is there a way to identify functionally related tissues using these gene expression values?
Any suggestions or links to studies that have performed this kind of analysis?
What do you mean by functionally related? A common definition of tissue is a group of cells that are organized to carry out a particular function. In this view, one would expect each tissue to have a characteristic gene expression signature. This view is the basis for a number of tissue atlas projects. You can in principle identify the same tissue across different samples by clustering the expression profiles as done for example in this paper.
Thanks a lot for the response. I am trying to cluster the tissues like the data presented in figure 1 of this paper. The clustering in that study was based on 6600 proteins; I would like to do the clustering based on a set of 100 proteins. I am new to this field and I am not sure what the right way to proceed would be.
In fig 1 of the paper you linked to, the data consists of profiles of protein expression levels derived from antibody stainings that were manually annotated into 4 levels (no expression, weak, medium and high, coded as 1, 2, 3, 4). A distance matrix between samples was derived by computing 1 - correlation for all pairwise combinations of profiles. Clustering was then performed by hierarchical clustering with average linkage. In R, assuming the data is in a data frame df where the rows are the samples and the columns are the gene/protein levels, this would be something like:
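For anyone following along in Python rather than R, the steps described above (a 1 - correlation distance followed by average-linkage clustering) can be sketched with the standard library only. This is an illustration, not the code from the paper: the function names `pearson`, `distance_matrix` and `average_linkage` and the toy staining levels are made up, and Pearson correlation is assumed here for brevity.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def distance_matrix(profiles):
    """Pairwise distance 1 - correlation between the named profiles."""
    names = list(profiles)
    return {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}

def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they happen."""
    dist = distance_matrix(profiles)
    clusters = [(name,) for name in profiles]  # start with one cluster per sample
    merges = []

    def linkage(pair):
        # Average linkage: mean distance over all cross-cluster sample pairs.
        c1, c2 = pair
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
        c1, c2 = min(pairs, key=linkage)  # closest pair of clusters
        merges.append((c1, c2, linkage((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical staining levels (1-4) for five proteins in four tissues.
profiles = {
    "skeletal muscle": [4, 4, 1, 1, 2],
    "heart muscle":    [4, 3, 1, 1, 2],
    "adipose tissue":  [1, 1, 4, 3, 2],
    "breast":          [1, 2, 4, 4, 2],
}
merges = average_linkage(profiles)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

The list of merges is the same information a dendrogram displays: which groups are joined and at what distance. A profile that is constant across all proteins has no defined correlation, so it would have to be removed before running this.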
Many thanks for the explanation. I am looking on the Protein Atlas webpage for the data that was used in their study. Would it be correct to use data set 1 (Normal data)?
The HPA normal tissue data set has the same form as what's reported in the paper but I don't know if it is the same data because the study is already 10 years old.
Yes, it's just a clustering algorithm. You can use any unsupervised clustering method to do that.
I have read the data in the following format in Python,
I would replace Not detected, Low, Medium and High with 1, 2, 3 and 4. Could you please suggest how to proceed after this?
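The recoding step mentioned here can be sketched in plain Python as follows. The dictionary `LEVEL_CODES`, the helper `encode_level`, the record keys and the gene names are all illustrative assumptions; only the four labels and their 1 to 4 coding come from the thread.

```python
# Assumed mapping from the annotation labels to ordinal codes.
LEVEL_CODES = {"Not detected": 1, "Low": 2, "Medium": 3, "High": 4}

def encode_level(label):
    """Return the ordinal code for a label, or None if the label is unknown."""
    # capitalize() normalises case, e.g. "medium" -> "Medium".
    return LEVEL_CODES.get(label.strip().capitalize())

# Hypothetical records in long format: one row per gene and tissue.
records = [
    {"Gene": "GENE_A", "Tissue": "adipose tissue", "Level": "Not detected"},
    {"Gene": "GENE_A", "Tissue": "skeletal muscle", "Level": "High"},
    {"Gene": "GENE_B", "Tissue": "adipose tissue", "Level": "medium"},
]
for row in records:
    row["Code"] = encode_level(row["Level"])

print([row["Code"] for row in records])  # [1, 4, 3]
```

Returning None for unexpected labels, rather than raising an error, makes it easy to count how many rows fall outside the four expected levels before deciding what to do with them.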
Python? What's python? I'd do it with sed:
Thanks a lot! However, I am not sure how to proceed from here. In the previous posts, you mentioned
Could you please explain a bit more on this? If my understanding is right, Spearman's correlation has to be calculated using the data in the column Level. For every gene, there are multiple levels reported and each Level is from a different source tissue. Could you please explain how (and between which quantities) the correlation has to be calculated?
You would need to reshape the data from its current long format to wide format such that df is a table of tissues x gene expression levels. The expression levels for the different genes are the features on which the correlation is based. Again in R, the whole thing could be something like this:
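The long-to-wide reshaping described above can also be sketched in Python, which is what the question uses. The helper `long_to_wide` and the toy records are hypothetical. Taking the maximum when a tissue has several records for the same gene is an assumption made for this sketch, not something stated in the thread.

```python
def long_to_wide(records):
    """Pivot (tissue, gene, level) records into a tissue -> {gene: level} table."""
    genes = sorted({gene for _, gene, _ in records})
    table = {}
    for tissue, gene, level in records:
        # Every tissue gets one cell per gene; cells without a record stay None.
        row = table.setdefault(tissue, dict.fromkeys(genes))
        # Assumption: keep the highest level if a gene is reported more than once.
        row[gene] = level if row[gene] is None else max(row[gene], level)
    return genes, table

# Hypothetical long-format data: (tissue, gene, coded level).
records = [
    ("adipose tissue", "GENE_A", 1),
    ("adipose tissue", "GENE_B", 3),
    ("skeletal muscle", "GENE_A", 4),
    ("skeletal muscle", "GENE_A", 2),
    ("skeletal muscle", "GENE_C", 2),
]
genes, table = long_to_wide(records)
for tissue, row in table.items():
    print(tissue, [row[g] for g in genes])
```

Each row of the resulting table is one tissue profile, and the genes are the features on which the correlation between tissues is computed.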
I could understand the implementation up to df[,1] <- NULL, which generates the above data.
I couldn't understand how the distance metric used for hierarchical clustering is computed, and why it is required to do 1 - cor(t(df)). (If my understanding is right, this is done to convert the correlation measure into a distance measure that can be used for clustering.) Could you please explain the last two steps a bit more?
I'd also like to understand how Spearman's correlation between rows is computed. Is a matrix generated with the correlation between each pair of genes?
A correlation is a measure of similarity, i.e. the more similar two items are, the higher the value. A distance is related to a similarity in that the more similar two objects are, the smaller the distance between them. There are various ways of converting between similarity and distance; the simplest, when the measure is bounded, is D = max(S) - S, which is what's used in the penultimate line (correlations have a maximum of 1). The conversion is needed because some algorithms and/or their implementations expect a similarity matrix while others (like hclust) expect a distance matrix. The R function cor computes correlations between the columns of the data frame passed to it, so to get correlations between tissues (the rows of df) we need to transpose the data with t(). The last line performs the hierarchical clustering using the specified method. It produces a tree which you can visualize with plot(tree). Spearman's correlation is Pearson's correlation computed after the values of each variable have been replaced by their ranks.
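To make the last point concrete, here is a small Python sketch of Spearman's correlation as Pearson's correlation on ranks, followed by the D = 1 - S conversion. The function names are made up for illustration. Tied values receive the mean of the ranks they span, which is the usual convention for Spearman's correlation.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_distance(x, y):
    """Distance D = max(S) - S with max(S) = 1, so D lies between 0 and 2."""
    return 1.0 - spearman(x, y)

print(average_ranks([1, 1, 2, 4, 3]))  # [1.5, 1.5, 3.0, 5.0, 4.0]
print(spearman_distance([1, 2, 3, 4], [4, 3, 2, 1]))
```

Two identical profiles get a distance of 0 and two perfectly opposed profiles get a distance of 2, so the most similar tissues are the first to be merged by the clustering.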
The transpose of the data is,
From what I understand, the columns in the above data are converted to ranks, and the data frame with ranks will be
Could you please let me know if this is the right way to assign ranks when the same value occurs several times in a column?
If the genes are in rows in the transpose, then they were in columns to start with, so there is no need to transpose. You also don't need to compute the ranks yourself if you use method="spearman". Also, for cases with many ties, it may be preferable to use Kendall's correlation (i.e. method="kendall").
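To show why Kendall's correlation copes well with ties, here is a minimal Python sketch of tau-b, a common tie-adjusted variant. It is not claimed to match the R implementation exactly, and the function name `kendall_tau_b` is made up for this example.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: pairwise agreement in ordering, adjusted for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors below
        if dx == 0:
            ties_x += 1      # tied in x only
        elif dy == 0:
            ties_y += 1      # tied in y only
        elif dx * dy > 0:
            concordant += 1  # both variables order the pair the same way
        else:
            discordant += 1  # the variables order the pair in opposite ways
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x) * (untied + ties_y))

# With only four possible levels, ties are the norm rather than the exception.
print(kendall_tau_b([1, 1, 2, 4], [1, 2, 2, 4]))
```

With 4 concordant pairs, no discordant pairs and one tie in each variable, this gives 4 / sqrt(5 * 5) = 0.8. The same 1 - correlation conversion can then be applied to obtain a distance.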
I obtain the following error. NaN entries are present in the original data frame, and I am not sure how to resolve this error.
Either there are NA values in the original data or they are produced during reformatting with dcast. You'll have to examine the cause to determine what to do with them.
Thanks a lot for the response. NA values are produced while reformatting the original data with dcast. If the presence of a gene has not been reported in a tissue the corresponding entry is assigned NaN. I am not sure how these entries must be defined.
You have to decide how you want to handle this. One option is to consider that genes not reported are not expressed (i.e. give them a level of 1); another is to get rid of them (e.g. using option na.rm = TRUE in dcast).
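Both options can be sketched in Python on a small tissue-by-gene table in which None marks a missing value. The helper names `fill_missing` and `drop_incomplete_genes` and the toy table are hypothetical.

```python
def fill_missing(table, level=1):
    """Treat unreported genes as not expressed by giving them `level`."""
    return {tissue: {g: (level if v is None else v) for g, v in row.items()}
            for tissue, row in table.items()}

def drop_incomplete_genes(table):
    """Keep only the genes that have a value in every tissue."""
    first_row = next(iter(table.values()))
    complete = [g for g in first_row
                if all(row[g] is not None for row in table.values())]
    return {tissue: {g: row[g] for g in complete} for tissue, row in table.items()}

# Hypothetical wide table with missing entries.
table = {
    "adipose tissue":  {"GENE_A": 1, "GENE_B": 3, "GENE_C": None},
    "skeletal muscle": {"GENE_A": 4, "GENE_B": None, "GENE_C": 2},
}
filled = fill_missing(table)
complete_only = drop_incomplete_genes(table)
print(filled)
print(complete_only)
```

Filling keeps every gene but assumes that an absent record means no expression. Dropping makes no such assumption but can discard many genes when the coverage is patchy.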