Question

metric informing about the similarity of gene lists in OrderedList

11 months ago

kkolmus • 0

Hi,

I am wondering whether the OrderedList R package returns a metric that, in addition to the p-value, describes the overall similarity of the compared lists. It does not matter to me whether it comes from the object returned by compareLists() or from the one created by getOverlap().

Also, any hint on interpretation would be very useful.

Many thanks,

Krzysztof

orderedlist • 698 views

ADD COMMENT • link updated 11 months ago by LauferVA 4.2k • written 11 months ago by kkolmus • 0

Yes, there are many such tools - so many that it would be very helpful to know more about your goals. Once you have the similarity score, what will you use it for? This will help me know what to recommend.

ADD REPLY • link 11 months ago by LauferVA 4.2k

Thanks for your reply!

I have plenty of gene lists to compare in a pairwise manner, and I would like to know whether a given pair is similar or not. I would like to retrieve a metric, or preferably two, from the objects generated during analysis with the OrderedList package that will tell me (without the need to look at the exploratory plot) whether two lists are similar.

I was thinking of running my comparisons as shown below. I would like to know if there is a similarity between the top hits. Which metrics from the x and y objects are suitable for this?

x <- compareLists(ID.List1, ID.List2, mapping = NULL, 
             two.sided = FALSE, no.reverse = TRUE)
y <- getOverlap(x)

Eventually, the similarity score will allow me to exclude certain datasets from downstream analysis.

ADD REPLY • link 11 months ago by kkolmus • 0

score 0 · Answer 1 · 2023-05-26

Hello Krzysztof!

Bear with me a moment, I'm going to provide a bit of background that I think you know in case others read this in the future:

Background: Scientifically and mathematically, terms like "similarity", "distance", and "heterogeneity" have many definitions. Why? Consider a scientist just like yourself: you're interested in a real-world phenomenon, so you study and study, then try multiple metrics (as you proposed), and sometimes decide that none of them works well for your specific goal. What then? Well, we invent (ahem! excuse me, I meant "devise a new mathematical expression for") descriptions that we think/believe/hope capture the reality of the phenomenon we are trying to understand.

As a result, if one fast-forwards from 1900 to 2023, it's not surprising that by now there are many, many definitions for each concept. The trick is to understand the cases in which each definition "performs well", "performs poorly", etc.

Having got this far, I wonder if you will agree that another way to frame your question could be, "Which similarity metrics perform best in such and such a scenario?" Reflecting on this, we can see that it puts a great deal of emphasis on the scenario.

As a quick example, consider Whittaker, RH, Ecological Monographs, 1960. In this manuscript, Whittaker describes different kinds of diversity - alpha (local diversity), beta (differentiation among or across sites), and gamma (total diversity). As a result, if one wanted to describe the richness (total number of different kinds of things at a site) beta diversity would not be a good metric to use - alpha diversity would be better. Contrariwise, a researcher comparing ecosystems she is worried could collapse might prefer beta diversity.

I think it's likely that you did not get a quick answer to this question because of the mind-boggling number and variety of similarity metrics. The thing to understand is that these have proliferated due to both the types of data involved and the specific goals of the researchers. I'll try to illustrate:

The Problem: In the case you describe, we are comparing lists in a pairwise fashion (although we could also compare them all at once if we wanted). That is a good start. But here the question arises: "With respect to what, exactly, do you want to compare them? What kinds of similarity do you want to accentuate?"

One could write a simple algorithm that counts the total number of genes shared by the two lists, and then use this count as the metric. What kind of limitations would that have, though? Well, one big problem is that pathway annotations have been curated a bit haphazardly by different groups using different technologies and different cutoff thresholds over time. This results in a huge amount of variance in the size of pathways: some include 1500 genes; others only 11.

Comparing two lists of 1500 genes each will likely yield more than 11 genes common to both, thus exceeding the maximum score possible for the 11-gene pathway, even if that pathway ultimately ends up being more important to your work.

OK, so what do we do then? Well, we could ask, "Of the genes in the shorter list, what percentage is found in the longer list?" The problem becomes still more complicated if you consider the question, "Do I want to weigh all similarities and all differences equally?" In pathway analysis, by assumption, people do. But among scientists comparing the similarity of amino acid sequences, that assumption is not always made; one might not count a Glycine --> Alanine change the same as a Glycine --> Tryptophan change.
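To make the two ideas above concrete, here is a minimal R sketch using made-up gene symbols. The vectors `short_list` and `long_list` and the helper `overlap_coefficient()` are hypothetical names chosen for illustration; none of this comes from OrderedList. It contrasts the raw shared count with the fraction of the shorter list that appears in the longer one (often called the overlap coefficient).

```r
# Hypothetical gene lists, invented purely for illustration.
short_list <- c("TP53", "BRCA1", "MYC")
long_list  <- c("TP53", "BRCA1", "EGFR", "KRAS", "PTEN", "AKT1")

# Raw count of shared genes: its maximum depends on the list sizes.
shared_count <- length(intersect(short_list, long_list))

# Fraction of the shorter list that is found in the longer list.
# Note: returns NaN if either list is empty (division by zero).
overlap_coefficient <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / min(length(a), length(b))
}

print(shared_count)
print(overlap_coefficient(short_list, long_list))
```

Because the second value is bounded between 0 and 1 regardless of list size, it is easier to compare across pairs of lists with very different lengths, although it still treats every gene as equally important.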

From here, I'm sure you can see that other minute details complicate the picture even further; for instance, it's implicit in some of the above that some similarity metrics are designed for continuous data, while others were devised with discrete values (or membership in a list) in mind.

TL;DR (could you please just recommend something?): I'd do a few things. First, you are looking for metrics of similarity between two sets. This alone will get you to a short list of metrics.

Second, I recommend finding a review on similarity and distance metrics for pathway analysis, because if you notice something in your data after you are well under way, you may be able to think back to the review and select a different metric with that in mind.

In terms of set similarity metrics, consider:

The Jaccard index, which is in essence (intersection / union) of two sets (reminiscent of the "percent" idea, above), as well as its entourage (Jaccard similarity coefficient, Jaccard distance, weighted Jaccard index, etc.)
The simple matching coefficient, which is similar but handles absence differently: elements missing from both sets also count as agreement
The Sørensen-Dice coefficient (equivalent to the F1 score)
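As a rough illustration of how these three metrics differ, here is a small base-R sketch. The function names (`jaccard`, `dice`, `simple_matching`), the gene lists, and the `universe` vector are all assumptions made up for this example. Note that the simple matching coefficient needs an explicit background set of genes, because shared absences count as matches.

```r
# Jaccard index: |A and B| / |A or B|
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Sorensen-Dice coefficient: 2 * |A and B| / (|A| + |B|)
dice <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Simple matching coefficient: share of background genes on which the two
# lists agree (present in both OR absent from both).
simple_matching <- function(a, b, universe) {
  in_a <- universe %in% a
  in_b <- universe %in% b
  mean(in_a == in_b)
}

# Hypothetical inputs, invented for illustration only.
list1    <- c("TP53", "BRCA1", "MYC", "EGFR")
list2    <- c("TP53", "BRCA1", "KRAS")
universe <- c("TP53", "BRCA1", "MYC", "EGFR", "KRAS", "PTEN", "AKT1", "CDK2")

print(jaccard(list1, list2))                    # 2 shared / 5 in union
print(dice(list1, list2))                       # 2 * 2 / (4 + 3)
print(simple_matching(list1, list2, universe))  # 5 agreements / 8 genes
```

With a realistic background of thousands of genes, the simple matching coefficient is dominated by genes absent from both lists, so it will sit close to 1 for almost any pair; that is one reason Jaccard or Dice is usually the more informative choice for gene lists.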

One final caveat for the road: I really recommend taking seriously the idea that not all pathways are equally well curated. Where will you get the lists from? GSEA? MSigDB? KEGG? GO? All of the above? Will you generate them empirically?

That question will likely have a larger impact on your final results than the choice of metric, as long as you are comparing a few metrics that are all well suited to the task.

A few code snippets from old musings

string distance:

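library(stringdist)  # provides stringdist() and qgrams()

# 'res' is the pairwise similarity matrix built in the Jaccard-like snippet further down
dataframeres <- as.data.frame(res)
dataframeres[dataframeres >= 0.7]  # all similarity values of at least 0.7
# per column, count the entries with similarity of at least 0.7 (presumably the original intent)
sapply(dataframeres, function(x) sum(x >= 0.7, na.rm = TRUE))

qgrams('abcde', 'abdcde', q = 1)  # table of q-gram counts for each string
stringdist('abcde', 'abdcde', method = 'jaccard', q = 1)  # https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
stringdist('abcdegga', 'abdcde', method = 'dl')  # Damerau-Levenshtein distance
?stringdist
printf <- function(...) cat(sprintf(...))  # small C-style printf helper
printf("hello %d\n", 56)
sml07 <- c()  # accumulator for the high-similarity pairs found below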


# For every pair with set similarity >= 0.7 (x > y visits each pair only once),
# compare the names themselves after dropping their first three characters
for (x in rownames(res)) {
  for (y in colnames(res)) {
    if (res[x, y] >= 0.7 && x > y) {
      nameX <- toupper(substr(x, 4, nchar(x)))
      nameY <- toupper(substr(y, 4, nchar(y)))
      distance  <- stringdist(nameX, nameY, method = "jaccard", q = 1)
      distance1 <- stringdist(nameX, nameY, method = "jaccard", q = 2)
      qCount <- qgrams(nameX, nameY, q = 2)
      # keep the pair, its set similarity, and the bigram Jaccard similarity of the names
      sml07 <- rbind(sml07, c(x, y, res[x, y], 1 - distance1))
      printf("x:%s,y:%s,dist:%s,dist1:%s\n", x, y, distance, distance1)
      print(qCount)
    }
  }
}

something like a Jaccard metric:

# memberSource is assumed to be a data frame with one row per
# (gene set NAME, member GENE_SYM) pair
uniqueNAME <- unique(memberSource$NAME)
smlmtx <- matrix(nrow = length(uniqueNAME), ncol = length(uniqueNAME),
                 dimnames = list(uniqueNAME, uniqueNAME))
for (i in uniqueNAME) {
  S1mbr <- memberSource$GENE_SYM[memberSource$NAME %in% i]
  for (j in uniqueNAME) {
    if (i != j) {
      S2mbr <- memberSource$GENE_SYM[memberSource$NAME %in% j]
      unionSize <- length(union(S1mbr, S2mbr))
      intersectSize <- length(intersect(S1mbr, S2mbr))
      smlmtx[i, j] <- intersectSize / unionSize  # Jaccard index of the two gene sets
    }
  }
}
smlmtx
smlmtx[is.na(smlmtx)] <- 0  # the diagonal was never filled; set it to 0
res <- smlmtx

Please do not use these; there are good, dedicated packages for this. I made these for fun, but I hope they help illustrate the idea that you can build these expressions yourself.

Hope this helps - let me know what you think.

VAL