Question: Index MEDLINE/PubMed with Apache Solr
6.5 years ago by
paulparsons130 wrote:

I am interested in doing some natural language processing on the corpus of MEDLINE/PubMed journal citations.

I have recently become aware of Apache Solr, which is a Lucene-based search server. I have no experience with Solr, but from what I can tell it seems to be a good way to go about tackling my task.

I'm wondering if anyone has experience with something like this and may have some insights to share. For example, how should the Solr schema be devised? What has to be done with the XML files so that they are in a format that Solr can index?

Aside from two posts here and here, which are old and don't offer any detailed advice, there is not much to be found on this topic. There is another post here about indexing the Gene Ontology with Solr, but I'm not quite sure how to translate the advice into what is necessary for MEDLINE/PubMed.

nlp medline lucene solr pubmed • 4.1k views
Modified 3.4 years ago by Maria_Levchenko60 • written 6.5 years ago by paulparsons130
6.5 years ago by
Istvan Albert ♦♦ 85k (University Park, USA) wrote:

I think you are overlooking the real complexity of the problem.

You seem to think that the missing ingredient is the search engine integration (Solr in this case), whereas indexing the documents is a solved problem. In fact it is the other way around: Solr is a neat, high-performance search engine, but it is just that. It has no understanding of the content of the papers.

The real challenge is what to index from each paper: how to identify and tokenize that paper so that whatever the search engine ends up indexing makes sense and is helpful later on.
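To make the "what to index" question concrete, here is a minimal sketch of flattening one MEDLINE/PubMed citation into a dictionary that a search engine could ingest. The element names (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`, `DescriptorName`) follow the PubMed XML format; the output field names (`pmid`, `title`, `journal`, `abstract`, `mesh_terms`) and the sample record are assumptions made up for illustration, not a recommended schema.

```python
import xml.etree.ElementTree as ET

# Tiny made-up record in the shape of a PubMed XML export (illustrative only).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def text_of(elem):
    """Return all text under an element (including inline markup), or ''."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def citation_to_doc(article):
    """Flatten one <PubmedArticle> into a dict; field names are hypothetical."""
    cit = article.find("MedlineCitation")
    return {
        "pmid": text_of(cit.find("PMID")),
        "title": text_of(cit.find("Article/ArticleTitle")),
        "journal": text_of(cit.find("Article/Journal/Title")),
        # Structured abstracts have several AbstractText parts; join them.
        "abstract": " ".join(
            text_of(a) for a in cit.findall("Article/Abstract/AbstractText")
        ),
        "mesh_terms": [
            text_of(d)
            for d in cit.findall("MeshHeadingList/MeshHeading/DescriptorName")
        ],
    }


def parse_citations(xml_text):
    """Parse an XML string and return one flat dict per citation."""
    root = ET.fromstring(xml_text)
    return [citation_to_doc(a) for a in root.iter("PubmedArticle")]


docs = parse_citations(SAMPLE)
```

Which fields to keep, and whether to index the abstract as one field or one per labelled section, is exactly the design decision that the search engine cannot make for you.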

The links that you found actually touch upon this in great detail, for example this one: A: How Do People Go About Pubmed Text Mining

Finally, many groups have put an enormous amount of work into this research area; if you do a literature search you will find hundreds of papers that aim to index MEDLINE in various ways.

Modified 10 months ago by RamRS30k • written 6.5 years ago by Istvan Albert ♦♦ 85k

Thanks for the comment. I wasn't necessarily making any assumptions about the difficulty of indexing. I do appreciate (at least to some degree) the challenge of indexing the journal citations.

I have previously parsed the XML citations and loaded them into a relational database. However, I've recently come across a number of references to Lucene/Solr, and it appears that more people are using it for this type of task. I'm thinking that maybe this is a better approach than the one I originally took. I have no background in this area, so I'm just trying to solicit some guidance or anecdotes from others who have done similar things (experience with MEDLINE/PubMed is a bonus).

I will search the literature to see if I can find some useful information to get me started.


Reply modified 10 months ago by RamRS30k • written 6.5 years ago by paulparsons130
6.4 years ago by
San Francisco
bw.150 wrote:

This paper talks about indexing PubMed using Lucene/Solr.

Also, this tutorial discusses loading PubMed XML into a SQL database.

Edit 6/17/2014:

P.S. It seems that Google Scholar and NCBI provide the most popular PubMed searches. I wonder why a better search doesn't exist anywhere.

Modified 10 months ago by RamRS30k • written 6.4 years ago by bw.150
6.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Thank you for citing my blog post "Indexing the content of Gene Ontology with apache SOLR" :-)

I also wrote a post about indexing NCBI Gene with Lucene. I don't have much more experience with Lucene, but PubMed could be indexed the very same way...
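For readers going the Solr route rather than raw Lucene, the following is a rough sketch of how already-parsed citations might be sent to Solr's JSON update handler. The core name `pubmed`, the `localhost:8983` address, and the document fields are assumptions for illustration; the sketch only builds the HTTP request and does not send it, so it runs without a Solr server.

```python
import json
import urllib.request


def build_update_request(docs, solr_url="http://localhost:8983/solr/pubmed"):
    """Build a POST request for Solr's /update handler.

    `solr_url` points at a hypothetical core named 'pubmed'; adjust it to
    match your own installation.
    """
    body = json.dumps(docs).encode("utf-8")  # a JSON array of documents
    return urllib.request.Request(
        url=solr_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; the field names must match whatever schema you define.
example_docs = [
    {"pmid": "12345", "title": "An example title.", "mesh_terms": ["Humans"]}
]
request = build_update_request(example_docs)

# With a running Solr instance, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

For a corpus the size of MEDLINE you would likely send documents in batches and commit once at the end rather than on every request.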

Modified 9 months ago by RamRS30k • written 6.4 years ago by Pierre Lindenbaum131k
3.4 years ago by
Maria_Levchenko60 wrote:

Europe PMC indexes PubMed and PMC articles using Apache Solr. You can always contact them to ask about the technical details.
