Quantify Popularity Of Genes
10.0 years ago

Dear all,

I have a list of about 1,000 human genes. I'd like to quantify their 'popularity' (i.e. to what extent these genes have been studied). I think the number of publications related to each gene could serve as a proxy for its popularity.

How would you approach this programmatically?

Thanks in advance,



While I agree with Pierre's answer as a quick way to see which genes might be important, gene2pubmed seems to be purely co-citation based. You can estimate the importance or "hubness" of genes by counting their interactions, but "interactions" derived from gene2pubmed, while free, are not very reliable. This paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2641004/#!po=56.2500), for example, claims: "Only 28% of the co-occurred pairs in PubMed abstracts appeared in any of the commonly used human PPI databases (HPRD, BioGRID and BIND). On the other hand, of the known PPIs in HPRD, 69% showed co-occurrences in the literature, and 65% shared GO terms."

There are other papers that have looked at publication networks and shown them to be useful in some cases, but with caveats. If you read through some of the PubMed publications included in such co-citation lists, you'll find that the co-mentions are not always about the two genes or proteins interacting. There is also the issue of synonyms (some genes have several slightly different names, though if you only look at HUGO gene symbols this shouldn't be an issue?), among other problems.

So you might want to look at a meta-database with manually curated, non-citation-based protein-protein (and so gene-gene) interaction lists, such as BioGRID: http://thebiogrid.org/

As part of a course some time back, some others and I looked at protein-protein interactions from such a meta-database (one that included the then-current BioGRID interactions plus some other smaller databases, specifically this one: http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do). We found UBC (Ubiquitin C) to have by far the most interactions of any gene (followed by SUMO2 and then p53).
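
Counting interactions per gene can be sketched in the same shell style used elsewhere in this thread. This is only a rough illustration: it assumes a hypothetical two-column, tab-separated edge list of gene symbols with no header row, which is not the native download format of BioGRID or any other database, and the function name interaction_degree is made up for the example.

```shell
# interaction_degree: reads a tab-separated edge list (geneA<TAB>geneB) on stdin
# and prints "gene<TAB>number of distinct partners", highest degree first.
# Assumed input layout: two columns, no header row (hypothetical format).
interaction_degree() {
  awk -F '\t' '
    # skip incomplete rows and self-interactions
    $1 == "" || $2 == "" || $1 == $2 { next }
    {
      a = $1; b = $2
      # count each unordered pair once, even if listed twice or reversed
      if (!((a, b) in seen)) {
        seen[a, b] = seen[b, a] = 1
        deg[a]++; deg[b]++
      }
    }
    END { for (g in deg) printf "%s\t%d\n", g, deg[g] }
  ' | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Usage (edges.tsv is a hypothetical file name):
#   interaction_degree < edges.tsv | head
```

Counting distinct partners rather than rows keeps a pair that is reported by several experiments from inflating a gene's degree.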

10.0 years ago

The NCBI provides a mapping ncbi-gene <-> pubmed: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz

The 10 most 'popular' genes would be:

$ curl -s "ftp://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz" |\
gunzip -c | grep -v "#" | cut -f 2 |\
sort | uniq -c | sort -n | tail

   2820 3569
   2838 7422
   3007 22059
   3020 1956
   3037 7316
   3232 348
   3437 31271
   3545 14910
   4135 7124
   6113 7157


$ curl  -s "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=7157&retmode=xml" | xmllint --xpath '//Gene-ref_desc/text()' - && echo
tumor protein p53

7157: TP53 tumor protein p53 [ Homo sapiens (human) ] http://www.ncbi.nlm.nih.gov/gene/7157
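
The listing above counts genes from every species in gene2pubmed. For a list of human genes, one possible variant is sketched below. It assumes the file's three tab-separated columns (tax_id, GeneID, PubMed_ID) and a hypothetical file my_genes.txt holding one Entrez Gene ID per line; the function name count_pubs is also just illustrative.

```shell
# count_pubs GENE_LIST_FILE: reads gene2pubmed-format rows on stdin
# (tab-separated: tax_id, GeneID, PubMed_ID) and prints "GeneID<TAB>count"
# for the human genes (tax_id 9606) listed in GENE_LIST_FILE, most cited first.
# Genes from the list with no human entry are reported with a count of 0.
# Assumes GENE_LIST_FILE is non-empty (one Entrez Gene ID per line).
count_pubs() {
  awk -F '\t' '
    # first input (the gene list): remember the IDs of interest
    NR == FNR { if ($1 != "") wanted[$1] = 1; next }
    # second input (stdin): skip the "#tax_id ..." header line
    /^#/ { next }
    # keep only human rows for genes on the list
    $1 == "9606" && ($2 in wanted) { n[$2]++ }
    END { for (g in wanted) printf "%s\t%d\n", g, n[g] }
  ' "$1" - | sort -t "$(printf '\t')" -k2,2nr -k1,1n
}

# Usage (my_genes.txt is a hypothetical file name):
#   curl -s "ftp://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz" | gunzip -c | count_pubs my_genes.txt
```

If the list contains gene symbols rather than Entrez Gene IDs, they would first need to be mapped to IDs, which this sketch does not attempt.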


I would have given the same answer. The Gene2Pubmed count for a gene is a good way to evaluate its importance. Just FYI, Gene2Pubmed will not give you all the articles available for a gene. I think these articles are manually curated by people at NCBI. But for this question, gene2pubmed should work. GeneRIF can also be an alternative, as Gene2Pubmed counts are normally contaminated with a bunch of articles that are only superficially associated with a gene, for example references to large omics studies. Another important thing is to take into account all the orthologs of a gene. Many human genes may rank low on their own, but if you include articles related to their orthologs, their ranking can jump higher.
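
One rough way to reduce the contamination from large omics studies, without switching data sources, is to ignore any PubMed ID that is linked to a very large number of genes. The sketch below assumes an already decompressed local copy of gene2pubmed; the function name count_focused_pubs and the default cutoff of 100 genes per article are arbitrary choices for illustration, not values recommended by NCBI or anyone in this thread.

```shell
# count_focused_pubs FILE [MAX]: FILE is a decompressed gene2pubmed file
# (tab-separated: tax_id, GeneID, PubMed_ID). Prints "GeneID<TAB>count" for
# human genes, ignoring articles linked to more than MAX human genes.
# MAX defaults to 100, an arbitrary illustrative cutoff.
count_focused_pubs() {
  awk -F '\t' -v max="${2:-100}" '
    # skip the header line and non-human rows in both passes
    /^#/ || $1 != "9606" { next }
    # pass 1: how many human genes does each article mention?
    NR == FNR { per_pmid[$3]++; next }
    # pass 2: count only articles at or below the cutoff
    per_pmid[$3] <= max { n[$2]++ }
    END { for (g in n) printf "%s\t%d\n", g, n[g] }
  ' "$1" "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1n
}

# Usage (gene2pubmed is the hypothetical name of the decompressed local file):
#   count_focused_pubs gene2pubmed 100 | head
```

The file is read twice because the per-article gene counts must be complete before any article can be accepted or rejected. Genes whose articles are all filtered out do not appear in the output.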

