Question: Find tissues that are functionally related
9 months ago by
Natasha40
Natasha40 wrote:

I am looking for a procedure to identify tissues that are functionally related. For instance, there are expression levels of genes present in different tissues like muscle or adipose tissue.

Is there a way to identify tissues that are functionally related using the gene expression values?

Any suggestions or links to studies that have performed this kind of analysis?

ADD COMMENTlink modified 9 months ago by Biostar ♦♦ 20 • written 9 months ago by Natasha40

What do you mean by functionally related? A common definition of tissue is a group of cells that are organized to carry out a particular function. In this view, one would expect each tissue to have a characteristic gene expression signature. This view is the basis for a number of tissue atlas projects. You can in principle identify the same tissue across different samples by clustering the expression profiles as done for example in this paper.

ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k

Thanks a lot for the response. I am trying to cluster the tissues like the data presented in figure 1 of this paper. The clustering in that study was based on 6600 proteins; I would like to do the clustering based on a set of 100 proteins. I am new to this field and I am not sure what would be the right way to proceed.

ADD REPLYlink modified 9 months ago • written 9 months ago by Natasha40

In fig. 1 of the paper you linked to, the data consist of profiles of protein expression levels derived from antibody stainings that were manually annotated into 4 levels (no expression, weak, medium and high, coded as 1, 2, 3, 4). A distance matrix between samples was derived by computing 1 - correlation for all pairwise combinations of profiles. Clustering was performed by applying hierarchical clustering with average linkage. In R, assuming the data is in a data frame df where the rows are the samples and the columns the gene/protein levels, this would be something like:

D <- as.dist(1 - cor(t(df), method = "spearman")) # 1 - Spearman's correlation between rows, as a dist object for hclust
tree <- hclust(D, method = "average") # Average-linkage hierarchical clustering
ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k

Many thanks for the explanation. I am looking for the data used in their study on the protein atlas webpage. Would it be correct to use data set 1, "Normal data"?

ADD REPLYlink written 9 months ago by Natasha40

The HPA normal tissue data set has the same form as what's reported in the paper but I don't know if it is the same data because the study is already 10 years old.

ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k

Yes, it's just a clustering algorithm. You can use any unsupervised clustering method to do that.

ADD REPLYlink written 9 months ago by shoujun.gu300

I have read the data into Python in the following format:

              Gene Gene name         Tissue            Cell type  \
0  ENSG00000000003    TSPAN6  adrenal gland      glandular cells   
1  ENSG00000000003    TSPAN6       appendix      glandular cells   
2  ENSG00000000003    TSPAN6       appendix      lymphoid tissue   
3  ENSG00000000003    TSPAN6    bone marrow  hematopoietic cells   
4  ENSG00000000003    TSPAN6         breast           adipocytes   

          Level Reliability  
0  Not detected    Approved  
1        Medium    Approved  
2  Not detected    Approved  
3  Not detected    Approved  
4  Not detected    Approved

I would replace Not detected, Low, Medium and High with 1, 2, 3, 4. Could you please suggest how to proceed after this?

ADD REPLYlink written 9 months ago by Natasha40

Python? What's python? I'd do it with sed:

sed -i 's/Not detected/1/g; s/Low/2/g; s/Medium/3/g; s/High/4/g;' normal_tissue.tsv
ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k
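For readers working in Python, as in the question, the same recoding can be sketched with pandas instead of sed. This is only an illustration: the LevelCode column name and the encode_levels helper are made up here, and the toy rows copy the first rows printed above. Mapping only the Level column avoids touching other fields that happen to contain the same words.

```python
import pandas as pd

# Ordinal coding used in the thread: 1 = not detected ... 4 = high.
LEVEL_CODES = {"Not detected": 1, "Low": 2, "Medium": 3, "High": 4}


def encode_levels(tbl: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of tbl with a numeric 'LevelCode' column (hypothetical name)."""
    out = tbl.copy()
    # Labels missing from LEVEL_CODES become NaN, which makes them easy to spot.
    out["LevelCode"] = out["Level"].map(LEVEL_CODES)
    return out


# Toy rows in the same layout as the table printed in the question.
toy = pd.DataFrame(
    {
        "Gene": ["ENSG00000000003"] * 3,
        "Tissue": ["adrenal gland", "appendix", "appendix"],
        "Level": ["Not detected", "Medium", "Not detected"],
    }
)
encoded = encode_levels(toy)
print(encoded)
```

With the real file, the table would presumably be read with pd.read_csv("normal_tissue.tsv", sep="\t") before calling the helper.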

Thanks a lot! However, I am not sure how to proceed from here. In the previous posts, you mentioned

D <- 1 - cor(t(df), method = "spearman") # Spearman's correlation between rows
tree <- hclust(D, method = "average")

Could you please explain a bit more on this? If my understanding is right, using the data present in the column gene name and Level spearman's correlation has to be calculated. For every gene, there are multiple levels reported and each Level is from a different source tissue. Could you please explain how (and between which quantities) the correlation has to be calculated?

ADD REPLYlink modified 9 months ago • written 9 months ago by Natasha40

You would need to reshape the data from its current long format to wide format such that df is a table of tissues x gene expression levels. The expression levels for the different genes are the features on which the correlation is based. Again in R, the whole thing could be something like this:

library(reshape2) # Needed for dcast() function
# Read the data
tbl <- read.delim("normal_tissue.tsv", header = TRUE, sep = '\t')
# Convert from long to wide format
# Average the values over the different cell types of a tissue
df <- dcast(tbl, Tissue ~ Gene, value.var = "Level", fun.aggregate = mean)
# Turn first column into row names
rownames(df) <- df[,1] # Assign row names
df[,1] <- NULL # Remove first column
# Compute correlation-based distance
D <- as.dist(1 - cor(t(df), method = "spearman")) # Make sure it's a dist object for use with hclust
# Cluster
tree <- hclust(D, method = "average")
ADD REPLYlink modified 9 months ago • written 9 months ago by Jean-Karim Heriche22k
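A rough Python counterpart of the R steps above, assuming pandas and SciPy are available. The tissue names, gene labels (g1 to g4) and levels are invented toy values, not HPA data; one tissue has two entries for g1 to show the averaging over cell types.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy numeric levels per tissue and gene (hypothetical values).
# "skeletal muscle" has two g1 entries, standing in for two cell types.
levels = {
    "heart muscle": {"g1": [1], "g2": [4], "g3": [1], "g4": [3]},
    "skeletal muscle": {"g1": [1, 3], "g2": [3], "g3": [1], "g4": [2]},
    "kidney": {"g1": [4], "g2": [1], "g3": [3], "g4": [1]},
    "liver": {"g1": [4], "g2": [2], "g3": [4], "g4": [1]},
}
rows = [
    (tissue, gene, value)
    for tissue, genes in levels.items()
    for gene, values in genes.items()
    for value in values
]
tbl = pd.DataFrame(rows, columns=["Tissue", "Gene", "Level"])

# Long -> wide: one row per tissue, one column per gene, mean over duplicates.
df = tbl.pivot_table(index="Tissue", columns="Gene", values="Level", aggfunc="mean")

# DataFrame.corr() correlates columns, so transpose to compare tissues.
corr = df.T.corr(method="spearman")
dist = 1.0 - corr

# SciPy's linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist.to_numpy(), checks=False)
tree = linkage(condensed, method="average")

# Cut the tree into two groups just to inspect the result.
labels = dict(zip(df.index, fcluster(tree, t=2, criterion="maxclust")))
print(labels)
```

scipy.cluster.hierarchy.dendrogram(tree, labels=list(df.index)) would draw the tree, much like plot(tree) in R.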
                 ENSG00000000003 ENSG00000000419 ...
adrenal gland            1.000000        4.000000 ...
appendix                 2.000000        3.500000 ...
bone marrow              1.000000        3.000000 ...

I could understand the implementation till df[,1] <- NULL which generates the above data.

I couldn't understand how the distance metric used for hierarchical clustering is computed, and why it is required to do 1 - cor(t(df)). If my understanding is right, this is done to convert the correlation measure into a distance measure that is then used for clustering. Could you please explain the last two steps a bit?

I'd also like to understand how Spearman's correlation between rows is computed. Is a matrix generated with the correlation between each pair of genes?

ADD REPLYlink modified 9 months ago • written 9 months ago by Natasha40

A correlation is a measure of similarity, i.e. the more similar two items are, the higher the value. A distance is related to a similarity in that the more similar two objects are, the smaller the distance between them. There are various ways of converting between similarity and distance; the simplest, when the measure is bounded, is D = max(S) - S, which is what is used in the penultimate line (correlations have a maximum of 1, so D ranges from 0 to 2). The conversion is needed because some algorithms and/or their implementations expect a similarity matrix while others (like hclust) expect a distance matrix. R's cor() function computes correlations between the columns of the data frame passed to it, so to get correlations between tissues (the rows of df) we need to transpose the data with t(). The last line performs the hierarchical clustering using the specified linkage method. It produces a tree which you can visualize with plot(tree). Spearman's correlation is Pearson's correlation computed after the values of each variable have been replaced by their ranks.

ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k
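To make the last two points concrete, here is a small standard-library Python sketch of Spearman's correlation as Pearson's correlation on ranks, followed by the D = 1 - S conversion. The function names and the two profiles are made up for illustration; in practice cor() in R or a library routine does this for you.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_distance(x, y):
    """Distance D = 1 - S, ranging from 0 (same ordering) to 2 (reversed)."""
    return 1.0 - spearman(x, y)


# Two hypothetical tissue profiles over four genes.
tissue_a = [1, 4, 1, 3]
tissue_b = [2, 3, 1, 2]
print(spearman(tissue_a, tissue_b), correlation_distance(tissue_a, tissue_b))
```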

The transpose of the data is:

                adrenal gland appendix bone marrow   breast bronchus
ENSG00000000003             1      2.0           1 2.000000        4
ENSG00000000419             4      3.5           3 3.000000        3
ENSG00000000457             1      1.5           2 2.666667        1
ENSG00000000460             3      1.5           2 3.000000        3

From what I understand, the columns in the above data are converted to ranks and the dataframe with ranks will be:

                adrenal gland appendix bone marrow breast bronchus
ENSG00000000003             1        2           1      1        3
ENSG00000000419             3        3           3      3        2
ENSG00000000457             1        1           2      2        1
ENSG00000000460             2        1           2      3        2

Could you please let me know if this is the right way to assign ranks when the same value occurs more than once in a column?

ADD REPLYlink modified 9 months ago • written 9 months ago by Natasha40

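cor() correlates the columns of whatever it is given. If the genes are in rows in the transpose, they were in columns to start with, so cor(df) without t() would give gene-to-gene correlations, while cor(t(df)) gives the tissue-to-tissue correlations used for the clustering. You also don't need to compute the ranks yourself if you use method = "spearman". For cases with many ties, it may be preferable to use Kendall's correlation (i.e. method = "kendall").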

ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k
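On the question of ties: the usual convention, which R's rank() uses by default, is to give tied values the average of the ranks they span, so the first four adrenal gland values 1, 4, 1, 3 would be ranked 1.5, 4, 1.5, 3 rather than 1, 3, 1, 2. A short check in Python, assuming SciPy is installed and using only the first four rows of the two columns shown above:

```python
from scipy.stats import kendalltau, rankdata, spearmanr

# First four genes of two tissue columns from the transposed table above.
adrenal = [1, 4, 1, 3]
appendix = [2.0, 3.5, 1.5, 1.5]

# rankdata uses method="average" by default: ties share their mean rank.
adrenal_ranks = rankdata(adrenal)
appendix_ranks = rankdata(appendix)
print(adrenal_ranks, appendix_ranks)

# Spearman's rho ranks the data internally, so no manual ranking is needed.
rho, _ = spearmanr(adrenal, appendix)

# Kendall's tau (SciPy defaults to the tau-b variant, which accounts for ties).
tau, _ = kendalltau(adrenal, appendix)
print(rho, tau)
```

Four values are far too few for a meaningful correlation; the snippet only illustrates how ties are ranked.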

I obtain the following error. NaN entries are present in the reshaped data frame, and I am not sure how to resolve this:

Error in hclust(D, method = "average") : 
  NA/NaN/Inf in foreign function call (arg 11)
Execution halted
ADD REPLYlink written 9 months ago by Natasha40

Either there are NA values in the original data or they are produced during reformatting with dcast. You'll have to examine the cause to determine what to do with them.

ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k

Thanks a lot for the response. NA values are produced while reformatting the original data with dcast. If the presence of a gene has not been reported in a tissue the corresponding entry is assigned NaN. I am not sure how these entries must be defined.

                 ENSG00000000003 ENSG00000000419
adrenal gland            1.000000        4.000000
appendix                 2.000000        3.500000
bone marrow              1.000000        3.000000
breast                   2.000000        3.000000
bronchus                 4.000000        3.000000
caudate                  1.000000        2.500000
cerebellum               1.000000        2.333333
cerebral cortex          1.500000        2.500000
cervix, uterine          4.000000        3.000000
colon                    1.666667        3.000000
duodenum                 2.000000             NaN
endometrium 1            2.500000             NaN
endometrium 2            2.500000             NaN
epididymis               3.000000             NaN
esophagus                4.000000             NaN
fallopian tube           4.000000             NaN
gallbladder              3.000000             NaN
heart muscle             1.000000             NaN
hippocampus              1.000000             NaN
kidney                   2.000000             NaN
liver                    2.000000             NaN
lung                     1.500000             NaN
ADD REPLYlink written 9 months ago by Natasha40

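You have to decide how you want to handle this. One way would be to consider that genes not reported are not expressed, i.e. give them a level of 1 (e.g. with fill = 1 in dcast). Another is to get rid of them, e.g. by dropping every gene that has a missing value in some tissue with df <- df[, colSums(is.na(df)) == 0]. Note that na.rm = TRUE in dcast is passed on to mean(), so it only drops NA values within a tissue and does not fill tissue/gene combinations that are absent from the table.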

ADD REPLYlink written 9 months ago by Jean-Karim Heriche22k
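The two options can be sketched in pandas as follows. The small table reuses three rows from the output above; which option is appropriate depends on whether a missing entry really means "not expressed", which is an assumption you have to justify for your data.

```python
import numpy as np
import pandas as pd

# Three tissues from the wide table above; duodenum lacks the second gene.
df = pd.DataFrame(
    {
        "ENSG00000000003": [1.0, 2.0, 2.0],
        "ENSG00000000419": [4.0, 3.5, np.nan],
    },
    index=["adrenal gland", "appendix", "duodenum"],
)

# How many tissues are missing each gene?
missing_per_gene = df.isna().sum()
print(missing_per_gene)

# Option 1: treat unreported genes as not detected (level 1).
filled = df.fillna(1)

# Option 2: keep only genes reported in every tissue.
complete = df.dropna(axis=1)
print(filled)
print(complete)
```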