I am interested in doing some natural language processing on the corpus of MEDLINE/PubMed journal citations.
I recently became aware of Apache Solr, a Lucene-based search server. I have no experience with Solr, but from what I can tell it seems like a good fit for this task.
I'm wondering if anyone has experience with something like this and has insights to share. For example, how should the Solr schema be designed? How do the MEDLINE/PubMed XML files need to be transformed so that Solr can index them?
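To make the second question concrete, here is a rough Python sketch of the kind of conversion I have in mind: pull a few elements out of each `MedlineCitation` and emit Solr's XML update message (`<add><doc><field name="...">`). The field names `id`, `title`, and `abstract` are my own assumptions and would have to match whatever schema is eventually defined. The sample record is made up, and posting the result to Solr is not shown.

```python
import xml.etree.ElementTree as ET

# Made-up record that follows the PubMed XML layout (not real data).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText>First part.</AbstractText>
          <AbstractText>Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def element_text(element):
    """Return all text inside an element, including nested markup."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def citations_to_solr_add(pubmed_xml):
    """Convert PubMed XML into a Solr <add> update message.

    The Solr field names used here (id, title, abstract) are hypothetical.
    """
    root = ET.fromstring(pubmed_xml)
    add = ET.Element("add")
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID", default="").strip()
        title = element_text(citation.find("Article/ArticleTitle"))
        # Structured abstracts have several AbstractText elements; join them.
        abstract = " ".join(
            element_text(part)
            for part in citation.iterfind("Article/Abstract/AbstractText")
        )
        doc = ET.SubElement(add, "doc")
        for name, value in (("id", pmid), ("title", title), ("abstract", abstract)):
            if value:  # skip fields that are missing from the citation
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = value
    return ET.tostring(add, encoding="unicode")


solr_xml = citations_to_solr_add(SAMPLE)
print(solr_xml)
```

Whether a flat mapping like this is enough, or whether authors, MeSH headings, and journal metadata should become separate (possibly multi-valued) fields, is exactly the kind of schema advice I'm looking for.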
Aside from two posts here and here, which are old and don't offer any detailed advice, there is not much to be found on this topic. There is another post here about indexing the Gene Ontology with Solr, but I'm not sure how to adapt that advice to MEDLINE/PubMed.