My name is Ravi and I am a doctoral student studying the biological processes in human ageing. Recently we decided to add a bioinformatic analysis of these processes. I am trying to understand the effect that gene set size has on the GO semantic similarity score computed with the R package 'GOSemSim'.
I have a fixed data set containing about 2000 genes, labelled TraitA.
I compute the semantic similarity between TraitA and several other traits, labelled Trait_Random. Each Trait_Random set contains anywhere from 10 to 2000 genes.
How does this difference in gene set size affect the score that I get?
Also, is there a statistical method that I could use to correct for such a bias in the score, if one exists?
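To make the question concrete, here is a rough sketch of the kind of correction I have in mind: for each Trait_Random size, draw many random gene sets of that same size from a background list, score each against TraitA, and compare the observed score to that size-matched null distribution. It is written in Python purely for illustration and is not my actual pipeline. Everything in it is an assumption: `set_similarity` is a hypothetical placeholder (a simple Jaccard overlap of GO terms) standing in for whatever GOSemSim call produces the real score (possibly `clusterSim`, if that is the right function for comparing two gene lists), and the genes and annotations are toy data.

```python
import random
import statistics

def set_similarity(set_a, set_b, annotations):
    """Hypothetical placeholder score: Jaccard overlap of the GO terms
    annotated to each gene set. A real analysis would call GOSemSim here."""
    terms_a = set().union(*(annotations[g] for g in set_a))
    terms_b = set().union(*(annotations[g] for g in set_b))
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0

def size_matched_null(trait_a, n_genes, background, annotations, n_perm=200, seed=0):
    """Score n_perm random gene sets of size n_genes against trait_a."""
    rng = random.Random(seed)
    pool = sorted(background)  # fixed order so the seed gives repeatable draws
    return [
        set_similarity(trait_a, rng.sample(pool, n_genes), annotations)
        for _ in range(n_perm)
    ]

def empirical_p_and_z(observed, null_scores):
    """Empirical p-value (with add-one correction) and z-score of the
    observed score relative to the size-matched null distribution."""
    n_ge = sum(s >= observed for s in null_scores)
    p_value = (n_ge + 1) / (len(null_scores) + 1)
    sd = statistics.pstdev(null_scores)
    z = (observed - statistics.mean(null_scores)) / sd if sd > 0 else float("nan")
    return p_value, z

# Toy data (assumed): 300 genes, each annotated with 3 of 50 made-up GO terms.
_rng = random.Random(42)
go_terms = [f"GO:{i:07d}" for i in range(50)]
background = [f"gene{i}" for i in range(300)]
annotations = {g: set(_rng.sample(go_terms, 3)) for g in background}
trait_a = background[:100]  # stands in for the fixed TraitA set

# The null mean itself changes with set size, which is the bias in question.
for size in (10, 50, 200):
    null = size_matched_null(trait_a, size, background, annotations)
    print(size, round(statistics.mean(null), 3))
```

With a null like this, the raw score for each Trait_Random could be replaced by its empirical p-value or z-score at the matching set size, so that traits of different sizes become comparable. I do not know whether this is the accepted approach for GO semantic similarity, which is part of what I am asking.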
Any thoughts or inputs on this would be very helpful. Thank you so much for your time.