Testing Significance Of GO Term Frequency In Biomedical Literature
11.9 years ago
user1409015 ▴ 20

If I have a set of Gene Ontology terms, each with an associated frequency (the number of times the term appears in a fixed corpus of papers), is the following method of significance testing valid?

  1. Calculate the median absolute deviation (MAD) of the GO term frequencies in the given corpus.
  2. Use median + MAD as a threshold: GO terms above it are deemed significantly associated with the given corpus, and GO terms below it are deemed non-significant.
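To make the two steps concrete, here is a minimal Python sketch of the proposed threshold. The GO identifiers and counts are made-up illustrative values, and `mad_threshold` is a hypothetical helper name, not part of any library. Note that the threshold is derived from the corpus itself, so it only ranks terms within that corpus.

```python
from statistics import median

def mad_threshold(freqs):
    """Return (median, MAD, median + MAD) for a dict of GO term -> count."""
    values = list(freqs.values())
    med = median(values)
    # MAD: median of the absolute deviations from the median
    mad = median(abs(v - med) for v in values)
    return med, mad, med + mad

# Hypothetical counts: number of times each GO term appears in the corpus
go_term_counts = {
    "GO:0006914": 120,
    "GO:0005515": 45,
    "GO:0006915": 30,
    "GO:0005737": 12,
    "GO:0016020": 8,
}

med, mad, threshold = mad_threshold(go_term_counts)

# Terms whose count exceeds median + MAD
above = sorted(t for t, c in go_term_counts.items() if c > threshold)

print(f"median={med}, MAD={mad}, threshold={threshold}")
print("above threshold:", above)
```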

Improvements, alternatives, rebuttals?

go • 2.2k views

You have only a single corpus, or several?


Just one, collected using specific MEDLINE search terms. The problem is that I don't know what search terms I would use for a null corpus (or corpora).

11.9 years ago
seidel 11k

"...which the GO terms are deemed significantly associated with the given corpus" With only one corpus, you're trying to figure out how to hear the sound of only one hand clapping. Significance has to be defined relative to some reference. It seems to me, from your comment to Sean, that you are trying to establish a relationship between the MEDLINE search terms (used to return your corpus) and some set of GO terms found in that corpus. Perhaps more information about the search terms, or about what kind of association you're trying to make, would be helpful (i.e., what is the point of your particular corpus relative to a randomly chosen corpus of the same size that also contains some GO terms?).


A search term such as "autophagy", excluding reviews. So, basically, what would I use as a control corpus for that? How do I randomly select n MEDLINE articles? And from there, how can I compare the GO term composition of my corpus of interest with the corpus/corpora of randomly selected papers?


I think the answer to "what is the control corpus" depends on the purpose for wanting to link autophagy to a pile of GO terms through MEDLINE. You might explain more about what you are trying to accomplish through a query search term -> GO term linkage that needs significance attached. How many query terms do you have? (Consider that each one, given the process you described, generates a frequency vector for all GO terms.) What do they have in common? (This may be relevant in terms of what constitutes a control.)

Depending on what you're trying to do, perhaps the problem can be rephrased: given a list of query terms, define a corpus for each one; this will generate x test sets. As a background reference, consider generating random corpora from a universe defined by the set of journals contained across your test corpora. From this journal set, find a way to select articles randomly (e.g. query by journal title, year, or perhaps some other property, then pick articles randomly from the returned list). This would establish a way to generate a background frequency for the GO terms found in your test set. You can then compare the frequencies of terms between your background universe and your test set. One could think of many variations on this theme; it gets back to what you're really trying to accomplish.
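As a rough illustration of that background comparison, here is a small Python sketch. Everything in it is an assumption made for the example: the article IDs, the GO annotations per article, the corpus sizes, and the choice of a one-sided Fisher's exact test (computed as a hypergeometric tail) as the way to "compare frequencies". Other tests would be equally reasonable depending on the goal.

```python
import random
from collections import Counter
from math import comb

def term_doc_counts(corpus):
    """Count, per GO term, how many articles in the corpus mention it."""
    counts = Counter()
    for terms in corpus.values():
        counts.update(set(terms))
    return counts

def fisher_upper_tail(a, n1, b, n2):
    """One-sided Fisher's exact test for the 2x2 table
    [[a, n1 - a], [b, n2 - b]]: probability of seeing >= a mentions
    in the test corpus if the term is spread evenly over both corpora."""
    N, K = n1 + n2, a + b
    total = sum(comb(K, i) * comb(N - K, n1 - i)
                for i in range(a, min(K, n1) + 1))
    return total / comb(N, n1)

# Hypothetical universe: 200 articles from the journals seen in the test
# corpus. Every article mentions GO:0005515; one in ten mentions GO:0006914.
universe = {
    f"bg{i}": {"GO:0005515"} | ({"GO:0006914"} if i % 10 == 0 else set())
    for i in range(200)
}

# Hypothetical test corpus: 20 articles, 15 of which mention GO:0006914.
test_corpus = {
    f"t{i}": {"GO:0005515"} | ({"GO:0006914"} if i < 15 else set())
    for i in range(20)
}

# Draw a random background corpus of the same size as the test corpus.
rng = random.Random(0)  # fixed seed only so the example is repeatable
sampled_ids = rng.sample(sorted(universe), len(test_corpus))
background = {pmid: universe[pmid] for pmid in sampled_ids}

test_counts = term_doc_counts(test_corpus)
bg_counts = term_doc_counts(background)

# p-value per GO term found in the test corpus
p_values = {
    term: fisher_upper_tail(test_counts[term], len(test_corpus),
                            bg_counts[term], len(background))
    for term in test_counts
}

for term, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(term, test_counts[term], bg_counts[term], f"p={p:.3g}")
```

In practice one would repeat the random draw many times (or use the whole universe as the background) and correct the p-values for multiple testing, since every GO term in the corpus is being tested.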
