Index MEDLINE/PubMed with Apache Solr
7.1 years ago
paulparsons ▴ 140

I am interested in doing some natural language processing on the corpus of MEDLINE/PubMed journal citations.

I have recently become aware of Apache Solr, which is a Lucene-based search server. I have no experience with Solr, but from what I can tell it seems to be a good way to go about tackling my task.

I'm wondering if anyone has experience with something like this and may have some insights to share. For example, how should the Solr schema be devised? What has to be done with the XML files so that they are in a format that Solr can index?

Aside from two posts here and here, which are old and don't offer any detailed advice, there is not much to be found on this topic. There is another post here about indexing the Gene Ontology with Solr, but I'm not quite sure how to translate the advice into what is necessary for MEDLINE/PubMed.

solr lucene medline pubmed NLP • 4.3k views
7.1 years ago

I think you are overlooking the real complexity of the problem.

You seem to think that the missing ingredient is the search engine integration (Solr in this case), whereas indexing the documents is a solved problem. In fact it is the other way around: Solr is a neat, high-performance search engine, but it is just that. It has no understanding of the content of the papers.

The real challenge is what to index from each paper: how to identify and tokenize that paper so that whatever the search engine ends up indexing makes sense and is helpful later on.
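To make the "what to index" question concrete, here is a minimal sketch of flattening one citation into a document with a handful of fields. The element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle, AbstractText, DescriptorName) come from the PubMed XML format; the sample record is made up, and the output field names (id, title, journal, abstract, mesh) are only assumptions for illustration, not a recommended schema.

```python
import xml.etree.ElementTree as ET

# Made-up record that follows the PubMed XML layout; not a real citation.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def text_of(elem):
    """Return all text inside an element (including nested markup), or ''."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def citation_to_doc(article):
    """Flatten one PubmedArticle element into a dict of hypothetical fields."""
    cit = article.find("MedlineCitation")
    return {
        "id": text_of(cit.find("PMID")),
        "title": text_of(cit.find("Article/ArticleTitle")),
        "journal": text_of(cit.find("Article/Journal/Title")),
        # Structured abstracts have several AbstractText parts; join them.
        "abstract": " ".join(
            text_of(a) for a in cit.iterfind("Article/Abstract/AbstractText")
        ),
        "mesh": [
            text_of(d)
            for d in cit.iterfind("MeshHeadingList/MeshHeading/DescriptorName")
        ],
    }


def parse_citations(xml_text):
    """Parse a PubmedArticleSet string into a list of flat documents."""
    root = ET.fromstring(xml_text)
    return [citation_to_doc(a) for a in root.iter("PubmedArticle")]


docs = parse_citations(SAMPLE)
```

Which fields to keep, and whether the abstract should be one text field or split by section label, is exactly the kind of decision that matters more than the choice of search engine.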

The links that you found actually touch upon this in great detail, for example this one: A: How Do People Go About Pubmed Text Mining

Finally, many groups have put an enormous amount of work into this research area. If you do a literature search, you will find hundreds of papers that aim to index MEDLINE in various ways.


Thanks for the comment. I wasn't necessarily making any assumptions about the difficulty of indexing. I do appreciate (at least to some degree) the challenge of indexing the journal citations.

I have previously parsed the XML citations and loaded them into a relational database. However, I've recently come across a number of references to Lucene/Solr, and it appears that more people are using it for this type of task. I'm thinking that maybe this is a better approach than the way I was originally doing it. I have no background in this area, so I'm just trying to solicit some guidance or anecdotes from others who have done similar things (experience with MEDLINE/PubMed specifically would be a bonus).

I will search the literature to see if I can find some useful information to get me started.

Thanks.

7.0 years ago
bw. ▴ 200

This paper talks about indexing PubMed using Lucene/Solr.

Also, this tutorial discusses loading PubMed XML into a SQL database.

Edit 6/17/2014:

P.S. It seems like Google Scholar and NCBI provide the most popular PubMed searches. I wonder why a better search doesn't exist anywhere.

7.0 years ago

Thank you for citing my blog post "Indexing the content of Gene Ontology with apache SOLR" :-)

I also wrote a post about indexing NCBI Gene with Lucene: http://plindenbaum.blogspot.fr/2009/07/indexing-and-searching-ncbi-genes-with.html. I don't have much more experience with Lucene, but PubMed could be indexed the very same way...
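As a rough illustration of what "indexed the same way" could look like with Solr rather than plain Lucene, the sketch below builds the HTTP request that would send a batch of already-flattened citations to Solr's JSON update handler. The core name `pubmed`, the local URL, the sample document, and the field names are all assumptions for the example, and the matching fields would have to exist in the Solr schema. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_update_request(docs, solr_url="http://localhost:8983/solr", core="pubmed"):
    """Build a POST request for Solr's /update handler with a JSON array of docs.

    solr_url and core are placeholders; adjust them to the actual installation.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=f"{solr_url}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up document with hypothetical field names.
docs = [
    {
        "id": "12345",
        "title": "An example title.",
        "abstract": "First part. Second part.",
        "mesh": ["Humans"],
    }
]

request = build_update_request(docs)
# With a running Solr instance, urllib.request.urlopen(request) would submit the batch.
```

For the full MEDLINE baseline one would presumably send documents in batches and commit less often than once per request, but that depends on the setup.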

4.1 years ago

Europe PMC indexes PubMed and PMC articles using Apache Solr (see http://blog.europepmc.org/2016/08/search-improvements.html). You can always contact them to ask about the technical details.
