6.1 years ago
svlachavas
Dear Community,
I have a data table in R (read from a txt file) and would like to apply a distance metric to compute the dissimilarity between categorical variables, which in this case are drugs. My ultimate goal is to use the gene symbols associated with each drug (the combined common.up and common.down genes below) to find the most "dissimilar" pairs of drugs, i.e. the pairs with the smallest percentage of overlap.
A snapshot of my data table for the first two rows is the following:
head(drugs,2)
experiment_id Score cell_line chemical hours
19999 LJP005 0.2688172 HT29 PD-184352 24H
19980 LJP005 0.2365591 HT29 trametinib 24H
common.up
19999 c(NOP56, PAICS, COL1A1, COL3A1, DKC1, TPX2, BOP1, IARS, MCM2, MCM4, MCM7, MYC, NME1, PPAT, NAT10, NHP2, AURKA, CAD, EEF1E1, CDK1)
19980 c(EMG1, NOP56, PAICS, DKC1, TPX2, BOP1, HPRT1, IARS, MCM2, MCM4, MCM7, MYC, NME1, PPAT, NHP2, RAN, AURKA, CAD, EEF1E1, CDK1)
common.down dosage..uM.
19999 c(TXNIP, FGFR2, ANK3, HMGCL, CAT) 10.00
19980 c(FGFR2, HMGCL) 1.11
Any ideas or suggestions about which metrics or approaches would be robust for this purpose?
Probably Jaccard similarity.
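As a minimal sketch of what that would look like for the two rows shown in the question: the gene symbols below are copied from the snapshot, while pooling common.up and common.down into one set per drug is an assumption based on the "total overlap" wording, and `jaccard_distance` is just an illustrative helper name.

```r
# Gene sets from the two rows in the question; up and down genes are pooled
# (an assumption, this ignores the direction of regulation).
pd184352 <- c("NOP56", "PAICS", "COL1A1", "COL3A1", "DKC1", "TPX2", "BOP1",
              "IARS", "MCM2", "MCM4", "MCM7", "MYC", "NME1", "PPAT", "NAT10",
              "NHP2", "AURKA", "CAD", "EEF1E1", "CDK1",
              "TXNIP", "FGFR2", "ANK3", "HMGCL", "CAT")
trametinib <- c("EMG1", "NOP56", "PAICS", "DKC1", "TPX2", "BOP1", "HPRT1",
                "IARS", "MCM2", "MCM4", "MCM7", "MYC", "NME1", "PPAT", "NHP2",
                "RAN", "AURKA", "CAD", "EEF1E1", "CDK1",
                "FGFR2", "HMGCL")

# Jaccard distance = 1 - |A and B| / |A or B|; NA if both sets are empty.
jaccard_distance <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_union <- length(union(a, b))
  if (n_union == 0) return(NA_real_)
  1 - length(intersect(a, b)) / n_union
}

d <- jaccard_distance(pd184352, trametinib)
d  # 19 shared genes out of 28 in the union, so 1 - 19/28
```

A distance near 0 means the two drugs hit almost the same genes; a distance of 1 means no shared genes at all.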
Dear h.mon,
thank you for your answer. However, I have already used the Jaccard coefficient to rank these resulting experiments (the Score column above), in a way similar to the over-representation analysis described in a previous post (https://www.biostars.org/p/299820/#300014).
So I would like to use a different method or approach. For example, would cosine similarity fit my goal, in your opinion?
I don't know a proper answer for your question, for two reasons:
1) I am not an expert on similarity measures,
2) you do not say what you think is important and should be captured by the similarity measure - that is, you do not define similarity for your problem.
While you have no power to improve my knowledge on similarity indexes, you certainly do have power to think about the similarity you want to capture. Do non-common genes matter? Or only shared genes matter? Or even instead of looking at a subset of the genes (applying a cut-off and discarding "non-significant" genes), why not measure the similarity of changes in expression of the whole set of genes?
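To illustrate the last option (comparing whole expression profiles instead of thresholded gene lists), here is a rough sketch. The log2 fold changes are made-up toy values, `profile_distance` is a hypothetical helper, and Spearman correlation is only one possible choice of measure.

```r
# Hypothetical log2 fold changes for the same genes under two drugs
# (toy values, not taken from L1000 or from the question).
genes <- c("MYC", "CDK1", "MCM2", "FGFR2", "TXNIP", "CAT")
lfc_drug1 <- c(-1.8, -1.2, -0.9, 0.7, 1.5, 0.4)
lfc_drug2 <- c(-1.5, -1.4, -0.2, 0.9, 0.3, 0.6)
names(lfc_drug1) <- names(lfc_drug2) <- genes

# Distance = 1 - Spearman correlation over the genes measured in both profiles.
# 0 means identical ranking of changes, 2 means perfectly opposite ranking.
profile_distance <- function(x, y) {
  shared <- intersect(names(x), names(y))
  1 - cor(x[shared], y[shared], method = "spearman")
}

d_profile <- profile_distance(lfc_drug1, lfc_drug2)
d_profile
```

Unlike a set-based measure, this keeps the sign and relative size of each change, so two drugs that move the same genes in opposite directions come out as far apart.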
As you described the problem, I think Jaccard is the most appropriate. Why are you unsatisfied with it?
As a side-note, if you already used Jaccard similarity and are interested in alternatives / improvements, state that on your question and avoid wasting time - yours, and ours (the people potentially writing answers).
Dear h.mon,
thank you for your answer, and please excuse me if I was not clear or did not provide enough information about my approach. Two quick comments on this matter:
1) The Jaccard similarity mentioned initially is used to rank the gene sets from a drug-gene database (L1000) against my input DE genes, like an over-representation analysis.
2) My next goal, based on these ranked experiments/drugs, is to identify the most dissimilar pairs of drugs/experiments, i.e. the pairs that share the fewest of the genes identified from my initial signature. That is why I also asked for alternative metrics.
Thus, in your opinion, would the Jaccard coefficient (or another similarity measure) be enough in this context to find the most dissimilar pairs of the experiments above, based on their total annotated genes (both up and down)?
Given your question:
then the answer by @h.mon is relevant. As a general rule, you should choose a measure that captures the properties of similarity that matter for your items. If only the percentage of overlap is relevant (i.e. you want to ignore the sizes of the sets), then use it as the similarity measure. There are plenty of other similarity measures for sets (also known as binary similarity measures). Check the R package proxy or this survey of binary similarity measures. If this is not what you want, then please clarify what your goal is. It looks like another case of the XY problem.
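As a rough base-R sketch of how a few such set measures could be compared, and how the most dissimilar pair could be picked out: the gene sets and drug names below are hypothetical toy data, and the helper names (`shared`, `measures`, `pairwise_distances`) are made up for illustration. The cosine variant is the cosine similarity of binary membership vectors, which reduces to |A and B| / sqrt(|A| * |B|).

```r
# Hypothetical toy gene sets, one pooled (up + down) set per drug.
# Each set is assumed to be non-empty and free of duplicates.
gene_sets <- list(
  drugA = c("MYC", "CDK1", "MCM2", "FGFR2"),
  drugB = c("MYC", "CDK1", "MCM2", "AURKA"),
  drugC = c("TXNIP", "CAT", "FGFR2"),
  drugD = c("MYC", "CDK1", "CAT")
)

shared <- function(a, b) length(intersect(a, b))

# Each entry returns a distance (1 - similarity) between two sets.
measures <- list(
  jaccard = function(a, b) 1 - shared(a, b) / length(union(a, b)),
  # Overlap coefficient: normalises by the smaller set, so set size is ignored.
  overlap = function(a, b) 1 - shared(a, b) / min(length(a), length(b)),
  # Cosine on binary membership vectors.
  cosine  = function(a, b) 1 - shared(a, b) / sqrt(length(a) * length(b))
)

# All unordered drug pairs with their distance, most dissimilar first.
pairwise_distances <- function(sets, dist_fun) {
  pairs <- t(combn(names(sets), 2))
  out <- data.frame(
    drug1 = pairs[, 1],
    drug2 = pairs[, 2],
    distance = apply(pairs, 1, function(p) dist_fun(sets[[p[1]]], sets[[p[2]]])),
    stringsAsFactors = FALSE
  )
  out[order(-out$distance), ]
}

ranked <- lapply(measures, function(f) pairwise_distances(gene_sets, f))
head(ranked$jaccard, 3)  # drugB / drugC come first: they share no genes
```

The measures agree whenever two sets are disjoint and differ mainly in how they treat sets of unequal size, so comparing the rankings side by side is a quick way to see whether the choice matters for your data.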
Dear Jean-Karim,
thank you also for your answer and suggestions. I will look into the R package proxy and inspect the various measures.