Question: Index MEDLINE/PubMed with Apache Solr
4.2 years ago by
paulparsons130 wrote:

I am interested in doing some natural language processing on the corpus of MEDLINE/PubMed journal citations.

I have recently become aware of Apache Solr, which is a Lucene-based search server. I have no experience with Solr, but from what I can tell it seems to be a good way to go about tackling my task.

I'm wondering if anyone has experience with something like this and may have some insights to share. For example, how should the Solr schema be devised? What has to be done with the XML files so that they are in a format that Solr can index?

Aside from two posts here and here, which are old and don't offer any detailed advice, there is not much to be found on this topic. There is another post here about indexing the Gene Ontology with Solr, but I'm not quite sure how to translate the advice into what is necessary for MEDLINE/PubMed.
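On the schema question above, one possible starting point is sketched below. It is only an illustration: the field names (`pmid`, `title`, `abstract`, `journal`, `mesh_terms`, `pub_year`) are hypothetical, and the type names (`string`, `text_general`, `pint`) assume the default configset of a recent Solr release. The script only builds the JSON body that Solr's Schema API expects for an `add-field` command; sending it to a running server is left out.

```python
import json

# Hypothetical field list for a citation index. The names are illustrative,
# and the types assume Solr's default configset (string, text_general, pint).
FIELDS = [
    {"name": "pmid", "type": "string", "indexed": True, "stored": True},
    {"name": "title", "type": "text_general", "indexed": True, "stored": True},
    {"name": "abstract", "type": "text_general", "indexed": True, "stored": True},
    {"name": "journal", "type": "string", "indexed": True, "stored": True},
    {"name": "mesh_terms", "type": "string", "indexed": True, "stored": True,
     "multiValued": True},
    {"name": "pub_year", "type": "pint", "indexed": True, "stored": True},
]


def schema_payload(fields):
    """Build a Schema API request body that adds every field in `fields`."""
    return {"add-field": list(fields)}


payload = schema_payload(FIELDS)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be POSTed to the `schema` endpoint of a collection. Tokenized types such as `text_general` suit free text (title, abstract), while `string` keeps identifiers and MeSH descriptors as exact values, which is usually what faceting and filtering need.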

nlp medline lucene solr pubmed • 3.3k views
modified 13 months ago by Maria_Levchenko40 • written 4.2 years ago by paulparsons130
4.2 years ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

I think you are overlooking the real complexity of the problem.

You seem to think that the missing ingredient is the search engine integration (Solr in this case), whereas indexing the documents is a solved problem. In fact it is the other way around. Solr is a neat, high-performance search engine, but it is just that. It has no understanding of the content of the papers.

The real challenge is what to index from each paper: how to identify and tokenize that paper so that whatever the search engine ends up indexing makes sense and is helpful later on.
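To make the "what to index" decision concrete, here is a minimal sketch of pulling a few fields out of a citation record. It assumes the standard PubMed XML layout (`PubmedArticle` / `MedlineCitation` / `Article`); the sample record is made up for illustration, and the output field names are hypothetical. Which fields to keep, and how to tokenize them afterwards, is exactly the part that needs real thought.

```python
import json
import xml.etree.ElementTree as ET

# Made-up record for illustration only; real files hold many PubmedArticle elements.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def text_of(elem):
    """Return all text under an element (inline markup included), or ''."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def citation_to_doc(article):
    """Map one PubmedArticle element to a flat dict (hypothetical field names)."""
    cit = article.find("MedlineCitation")
    return {
        "pmid": text_of(cit.find("PMID")),
        "title": text_of(cit.find("Article/ArticleTitle")),
        # Structured abstracts have several AbstractText parts; join them.
        "abstract": " ".join(
            text_of(a) for a in cit.findall("Article/Abstract/AbstractText")
        ),
        "journal": text_of(cit.find("Article/Journal/Title")),
        "mesh_terms": [
            text_of(d)
            for d in cit.findall("MeshHeadingList/MeshHeading/DescriptorName")
        ],
    }


def parse_citations(xml_text):
    """Parse a PubmedArticleSet string into a list of flat dicts."""
    root = ET.fromstring(xml_text)
    return [citation_to_doc(a) for a in root.iter("PubmedArticle")]


docs = parse_citations(SAMPLE)
print(json.dumps(docs, indent=2))
```

A JSON array of flat documents like this is the shape Solr's JSON update handler accepts, so the same dicts could be sent to a collection's `update` endpoint. For the full baseline files, a streaming parser (`ET.iterparse`) would likely be needed instead of loading everything into memory.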

The links that you found actually touch upon this in great detail, for example this one: How Do People Go About Pubmed Text Mining

Finally, many groups have put an enormous amount of work into this research area. If you do a literature search, you will find hundreds of papers that aim to index MEDLINE in various ways.


Thanks for the comment. I wasn't necessarily making any assumptions about the difficulty of indexing. I do appreciate (at least to some degree) the challenge of indexing the journal citations.

I have previously parsed the XML citations and loaded them into a relational database. However, I've recently come across a number of references to Lucene/Solr, and it appears that more people are using it for this type of task. I'm thinking that this may be a better approach than the one I was originally using. I have no background in this area, so I'm just trying to solicit some guidance or anecdotes from others who have done similar things (experience with MEDLINE/PubMed is a bonus).

I will search the literature to see if I can find some useful information to get me started.


Reply written 4.2 years ago by paulparsons130
4.1 years ago by
San Francisco
bw.140 wrote:

This paper talks about indexing PubMed using Lucene/Solr:

Also, this tutorial discusses loading PubMed XML into a SQL database:
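For a rough idea of what the relational route can look like, here is a small sketch using Python's built-in `sqlite3` module. It is not taken from the tutorial: the `citation` table, its columns, and the sample row are all hypothetical, and a real load would feed in rows parsed from the PubMed XML files.

```python
import sqlite3


def load_citations(conn, rows):
    """Create a hypothetical `citation` table if needed and upsert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT, journal TEXT)"
    )
    # Named placeholders let each row be a plain dict.
    conn.executemany(
        "INSERT OR REPLACE INTO citation (pmid, title, abstract, journal) "
        "VALUES (:pmid, :title, :abstract, :journal)",
        rows,
    )
    conn.commit()


# In-memory database and a made-up row, for illustration only.
conn = sqlite3.connect(":memory:")
load_citations(conn, [
    {"pmid": 1, "title": "An example title",
     "abstract": "Example abstract text.", "journal": "Example Journal"},
])
count = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
print(count)
```

Using the PMID as the primary key with `INSERT OR REPLACE` means that reloading a revised citation overwrites the old row instead of duplicating it.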


Edit 6/17/2014:

P.S. It seems like Google Scholar and NCBI provide the most popular PubMed searches. I wonder why a better search doesn't exist anywhere.



4.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum109k wrote:

Thank you for citing my blog post "Indexing the content of Gene Ontology with apache SOLR" :-)

I also wrote a post about indexing NCBI Gene with Lucene. I don't have much more experience with Lucene, but PubMed could be indexed in the very same way...



13 months ago by
Maria_Levchenko40 wrote:

Europe PMC indexes PubMed and PMC articles using Apache Solr. You can always contact them to ask about the technical details.
